Tuesday, 27 December 2016

Data science prototyping with Spark/Scala

The common practice today is to use R or Python for data science prototyping; once the model is validated, the algorithm and models are migrated to the Big Data Hadoop environment and ported to Spark.

But with Scala available in interactive mode, and with notebooks like Zeppelin and Databricks, it is possible to evaluate models directly in Spark/Scala. The main difficulty is that graph visualisations are not as easy.
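Zeppelin does give a partial answer here: the ZeppelinContext object can push a DataFrame into the notebook's built-in table/chart display. A minimal sketch, assuming the standard Spark interpreter is bound and a DataFrame named df already exists:

// render df with Zeppelin's display system (table view plus basic bar/line/scatter/pie charts)
z.show(df)

It is no substitute for ggplot2 or matplotlib, but it covers quick exploratory charts without leaving the notebook.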

Just to experiment, I used the Scala code from
https://github.com/mblanc/spark-ml/blob/master/src/main/scala/fr/xebia/sparkml/Titanic.scala
in Zeppelin, with Spark running in local mode.
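One setup note: the notebook below reads the CSV through the com.databricks.spark.csv data source. Depending on the Spark version bound to the interpreter, that may require the external spark-csv package on the classpath (newer Spark releases map that name to the built-in CSV reader). If it is needed, one option is Zeppelin's dynamic dependency loading, run in a %dep paragraph before the first Spark paragraph; the artifact coordinates below are an assumption and should match your Scala/Spark build:

%dep
// fetch the spark-csv package from Maven Central before the Spark interpreter starts
z.load("com.databricks:spark-csv_2.11:1.5.0")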

It was really fun to use Spark interactively in a notebook. At the very least, it is useful for learners like me to work with Scala/Spark in a notebook.


Here is the Zeppelin notebook (TitanicKaggle), cells followed by their output:
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}

val csv = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .load(...)   // path to the Titanic train.csv

csv: org.apache.spark.sql.DataFrame = [PassengerId: string, Survived: string ... 10 more fields]

csv.printSchema()
root
 |-- PassengerId: string (nullable = true)
 |-- Survived: string (nullable = true)
 |-- Pclass: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- SibSp: string (nullable = true)
 |-- Parch: string (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: string (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)


val df = csv.select(
  $"Survived".as("label").cast(DoubleType),
  $"Age".as("age").cast(IntegerType),
  $"Fare".as("fare").cast(DoubleType),
  $"Pclass".as("class").cast(DoubleType),
  $"Sex".as("sex"),
  $"Name".as("name")
)

df: org.apache.spark.sql.DataFrame = [label: double, age: int ... 4 more fields]

df.printSchema()
root
 |-- label: double (nullable = true)
 |-- age: integer (nullable = true)
 |-- fare: double (nullable = true)
 |-- class: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- name: string (nullable = true)


df.show()

+-----+----+-------+-----+------+--------------------+
|label| age|   fare|class|   sex|                name|
+-----+----+-------+-----+------+--------------------+
|  0.0|  22|   7.25|  3.0|  male|Braund, Mr. Owen ...|
|  1.0|  38|71.2833|  1.0|female|Cumings, Mrs. Joh...|
|  1.0|  26|  7.925|  3.0|female|Heikkinen, Miss. ...|
|  1.0|  35|   53.1|  1.0|female|Futrelle, Mrs. Ja...|
|  0.0|  35|   8.05|  3.0|  male|Allen, Mr. Willia...|
|  0.0|null| 8.4583|  3.0|  male|    Moran, Mr. James|
|  0.0|  54|51.8625|  1.0|  male|McCarthy, Mr. Tim...|
|  0.0|   2| 21.075|  3.0|  male|Palsson, Master. ...|
|  1.0|  27|11.1333|  3.0|female|Johnson, Mrs. Osc...|
|  1.0|  14|30.0708|  2.0|female|Nasser, Mrs. Nich...|
|  1.0|   4|   16.7|  3.0|female|Sandstrom, Miss. ...|
|  1.0|  58|  26.55|  1.0|female|Bonnell, Miss. El...|
|  0.0|  20|   8.05|  3.0|  male|Saundercock, Mr. ...|
|  0.0|  39| 31.275|  3.0|  male|Andersson, Mr. An...|
|  0.0|  14| 7.8542|  3.0|female|Vestrom, Miss. Hu...|
|  1.0|  55|   16.0|  2.0|female|Hewlett, Mrs. (Ma...|
|  0.0|   2| 29.125|  3.0|  male|Rice, Master. Eugene|
|  1.0|null|   13.0|  2.0|  male|Williams, Mr. Cha...|
|  0.0|  31|   18.0|  3.0|female|Vander Planke, Mr...|
|  1.0|null|  7.225|  3.0|female|Masselmani, Mrs. ...|
+-----+----+-------+-----+------+--------------------+
only showing top 20 rows

df.describe(df.columns: _*).show()

+-------+-------------------+------------------+-----------------+------------------+------+--------------------+
|summary|              label|               age|             fare|             class|   sex|                name|
+-------+-------------------+------------------+-----------------+------------------+------+--------------------+
|  count|                891|               714|              891|               891|   891|                 891|
|   mean| 0.3838383838383838|29.712885154061624| 32.2042079685746| 2.308641975308642|  null|                null|
| stddev|0.48659245426485753|14.529273128376575|49.69342859718089|0.8360712409770491|  null|                null|
|    min|                0.0|                 0|              0.0|               1.0|female|"Andersson, Mr. A...|
|    max|                1.0|                80|         512.3292|               3.0|  male|van Melkebeke, Mr...|
+-------+-------------------+------------------+-----------------+------------------+------+--------------------+

val select = df.na.fill(Map("age" -> 30, "fare" -> 32.2))
select: org.apache.spark.sql.DataFrame = [label: double, age: int ... 4 more fields]

val Array(trainSet, validationSet) = select.randomSplit(Array(0.75, 0.25))
trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, age: int ... 4 more fields]
validationSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, age: int ... 4 more fields]

// The stages of our pipeline
val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val classEncoder = new OneHotEncoder().setInputCol("class").setOutputCol("classVec")
val tokenizer = new Tokenizer().setInputCol("name").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("hash")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("hash", "age", "fare", "sexIndex", "classVec"))
  .setOutputCol("features")
val logisticRegression = new LogisticRegression()
val pipeline = new Pipeline()
  .setStages(Array(sexIndexer, classEncoder, tokenizer, hashingTF, vectorAssembler, logisticRegression))

sexIndexer: org.apache.spark.ml.feature.StringIndexer = strIdx_9a6568bd71ad
classEncoder: org.apache.spark.ml.feature.OneHotEncoder = oneHot_d8409501e075
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_024529536be6
hashingTF: org.apache.spark.ml.feature.HashingTF = hashingTF_dbfc7eaf30f2
vectorAssembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_0c2becd4ab86
logisticRegression: org.apache.spark.ml.classification.LogisticRegression = logreg_a92cd69c19ea
pipeline: org.apache.spark.ml.Pipeline = pipeline_d00b1d3771e8

val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)

crossValidator: org.apache.spark.ml.tuning.CrossValidator = cv_ee05a0b10cc4

// set the params
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(2, 5, 1000))
  .addGrid(logisticRegression.regParam, Array(1, 0.1, 0.01))
  .addGrid(logisticRegression.maxIter, Array(10, 50, 100))
  .build()
crossValidator.setEstimatorParamMaps(paramGrid)

paramGrid: Array[org.apache.spark.ml.param.ParamMap] = Array({
  logreg_a92cd69c19ea-maxIter: 10,
  hashingTF_dbfc7eaf30f2-numFeatures: 2,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 50,
  hashingTF_dbfc7eaf30f2-numFeatures: 2,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 100,
  hashingTF_dbfc7eaf30f2-numFeatures: 2,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 10,
  hashingTF_dbfc7eaf30f2-numFeatures: 5,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 50,
  hashingTF_dbfc7eaf30f2-numFeatures: 5,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 100,
  hashingTF_dbfc7eaf30f2-numFeatures: 5,
  logreg_a92cd69c19ea-regParam: 1.0
}, {
  logreg_a92cd69c19ea-maxIter: 10,
  hashingTF_db...

res18: crossValidator.type = cv_ee05a0b10cc4

crossValidator.setNumFolds(3)
res19: crossValidator.type = cv_ee05a0b10cc4
val cvModel = crossValidator.fit(trainSet)
cvModel: org.apache.spark.ml.tuning.CrossValidatorModel = cv_ee05a0b10cc4
for (stage <- cvModel.bestModel.asInstanceOf[PipelineModel].stages) println(stage.explainParams())

handleInvalid: how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later (default: error)
inputCol: input column name (current: sex)
outputCol: output column name (default: strIdx_9a6568bd71ad__output, current: sexIndex)
dropLast: whether to drop the last category (default: true)
inputCol: input column name (current: class)
outputCol: output column name (default: oneHot_d8409501e075__output, current: classVec)
inputCol: input column name (current: name)
outputCol: output column name (default: tok_024529536be6__output, current: words)
binary: If true, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts (default: false)
inputCol: input column name (current: words)
numFeatures: number of features (> 0) (default: 262144, current: 2)
outputCol: output column name (default: hashingTF_dbfc7eaf30f2__output, current: hash)
inputCols: input column names (current: [Ljava.lang.String;@54455089)
outputCol: output column name (default: vecAssembler_0c2becd4ab86__output, current: features)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 50)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.01)
standardization: whether to standardize the training features before fitting the model (default: true)
threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold (undefined)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)
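Rather than scanning that dump for the "current:" values, the winning hyper-parameters can be read back from the best model programmatically. A small sketch under the stage order used above; the extra import of LogisticRegressionModel is an addition, not part of the original notebook:

import org.apache.spark.ml.classification.LogisticRegressionModel

val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
// HashingTF is a plain Transformer, so the fitted pipeline keeps the stage itself (index 3 in the stage array)
val bestHashingTF = bestPipeline.stages(3).asInstanceOf[HashingTF]
// the last stage of the fitted pipeline is the trained logistic regression model
val bestLR = bestPipeline.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"numFeatures=${bestHashingTF.getNumFeatures} regParam=${bestLR.getRegParam} maxIter=${bestLR.getMaxIter}")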

val validationPredictions = cvModel.transform(validationSet)
validationPredictions: org.apache.spark.sql.DataFrame = [label: double, age: int ... 12 more fields]
val binaryClassificationEvaluator: BinaryClassificationEvaluator = new BinaryClassificationEvaluator()
binaryClassificationEvaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_24074166d153
println(s"${binaryClassificationEvaluator.getMetricName} ${binaryClassificationEvaluator.evaluate(validationPredictions)}")
areaUnderROC 0.839148351648352
val total = validationPredictions.count()
total: Long = 218
val goodPredictionCount = validationPredictions
  .filter(validationPredictions("label") === validationPredictions("prediction"))
  .count()
goodPredictionCount: Long = 173
println(s"correct prediction percentage : ${goodPredictionCount / total.toDouble}")
correct prediction percentage : 0.7935779816513762

Sunday, 21 August 2016

Effectiveness of weight lifting exercise

mysoln.R