Chapter 2 Data Processing and Cleaning
The original dataset WageData has 526 records, but some are incomplete. This is already visible in Table 2.1, which shows the first 10 records:
Wage | Educ | Exper | Tenure | NonWhite | Sex | Married | NumDep | Smsa | Region |
---|---|---|---|---|---|---|---|---|---|
3.10 | 11 | 2 | 0 | NA | Female | No | 2 | Yes | West |
3.24 | 12 | 22 | 2 | No | Female | Yes | 3 | Yes | West |
3.00 | 11 | 2 | 0 | No | Male | No | 2 | No | West |
6.00 | 8 | 44 | 28 | No | Male | Yes | 0 | Yes | West |
5.30 | 12 | 7 | 2 | No | Male | Yes | 1 | No | West |
8.75 | 16 | 9 | 8 | No | Male | Yes | 0 | Yes | West |
11.25 | 18 | 15 | 7 | No | Male | No | 0 | Yes | West |
5.00 | 12 | 5 | NA | No | Female | No | 0 | Yes | West |
3.60 | NA | 26 | 4 | No | Female | No | 2 | Yes | West |
18.18 | 17 | 22 | 21 | No | NA | Yes | 0 | Yes | West |
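As a quick illustration of how such gaps can be spotted, the sketch below counts missing values per column and the number of complete records. It is only a minimal example: the data frame `first10` is a hypothetical name, and it contains just the ten records from Table 2.1, restricted to the columns later used in the model, not the full dataset.

```r
# First ten records of WageData, transcribed from Table 2.1
# (only the columns used later in the model; `first10` is an illustrative name)
first10 <- data.frame(
  Wage     = c(3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18),
  Educ     = c(11, 12, 11, 8, 12, 16, 18, 12, NA, 17),
  Exper    = c(2, 22, 2, 44, 7, 9, 15, 5, 26, 22),
  Tenure   = c(0, 2, 0, 28, 2, 8, 7, NA, 4, 21),
  NonWhite = c(NA, "No", "No", "No", "No", "No", "No", "No", "No", "No"),
  Sex      = c("Female", "Female", "Male", "Male", "Male",
               "Male", "Male", "Female", "Female", NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(first10))
print(na_per_column)

# Number of records without any missing value
n_complete <- sum(complete.cases(first10))
print(n_complete)
```

In these ten records, Educ, Tenure, NonWhite, and Sex each have one missing value, so six of the ten records are complete.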
Using the code below we can clean and pre-process the data:
library(tidyverse)
library(tidymodels)
library(skimr)
library(readxl)
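# Read the raw data from the Excel workbook
WageDataOrg <- read_excel("WageDataModifRTalk.xlsx",
                          sheet = "DataClean")
NOrg=nrow(WageDataOrg)
# Build and prepare the cleaning recipe.
# Note: newer versions of recipes rename the imputation steps to
# step_impute_mean(), step_impute_median(), and step_impute_knn().
RecipeCl=recipe(Wage~Educ+Exper+Tenure+NonWhite+Sex, data=WageDataOrg)%>%
  step_dummy(all_nominal()) %>%        # NonWhite -> NonWhite_Yes, Sex -> Sex_Male
  step_meanimpute(Educ) %>%            # fill missing Educ with the mean
  step_medianimpute(NonWhite_Yes) %>%  # fill missing dummy with the median
  step_knnimpute(Tenure) %>%           # fill missing Tenure from nearest neighbors
  step_naomit(Sex_Male) %>%            # drop records with missing Sex
  prep()
# Extract the processed data
WageData=juice(RecipeCl)
DataInfoNew=skim(WageData)
N=nrow(WageData)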
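To make the individual steps more concrete, the following base R sketch mimics dummy encoding, mean imputation, median imputation, and the removal of records with a missing Sex on the ten records of Table 2.1. It is an illustration under simplifying assumptions, not the recipes implementation: all variable names are hypothetical, the nearest-neighbor imputation of Tenure is left out, and the statistics are computed from ten records only, so the imputed Educ value differs from the one the recipe derives from the full dataset.

```r
# Columns with missing values in the first ten records of Table 2.1
educ     <- c(11, 12, 11, 8, 12, 16, 18, 12, NA, 17)
nonwhite <- c(NA, "No", "No", "No", "No", "No", "No", "No", "No", "No")
sex      <- c("Female", "Female", "Male", "Male", "Male",
              "Male", "Male", "Female", "Female", NA)

# Dummy encoding: 1 for "Yes" / "Male", 0 otherwise; NA stays NA
nonwhite_yes <- as.numeric(nonwhite == "Yes")
sex_male     <- as.numeric(sex == "Male")

# Mean imputation: replace missing Educ with the mean of the observed values
educ_imputed <- ifelse(is.na(educ), mean(educ, na.rm = TRUE), educ)

# Median imputation: replace the missing dummy with the median of the observed values
nonwhite_imputed <- ifelse(is.na(nonwhite_yes),
                           median(nonwhite_yes, na.rm = TRUE),
                           nonwhite_yes)

# Drop records whose Sex dummy is still missing
keep <- !is.na(sex_male)
cleaned <- data.frame(
  Educ         = educ_imputed[keep],
  NonWhite_Yes = nonwhite_imputed[keep],
  Sex_Male     = sex_male[keep]
)
print(cleaned)
```

Here the missing Educ value is replaced by 13, the mean of the nine observed values, the missing NonWhite dummy becomes 0, and the record with a missing Sex is removed, leaving nine complete records.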
The new dataset WageData has 523 records, all of them complete: the three records with a missing Sex value were dropped, and missing values of Educ, NonWhite, and Tenure were imputed. Table 2.2 shows the first 10 records:
Educ | Exper | Tenure | Wage | NonWhite_Yes | Sex_Male |
---|---|---|---|---|---|
11.00000 | 2 | 0.0 | 3.10 | 0 | 0 |
12.00000 | 22 | 2.0 | 3.24 | 0 | 0 |
11.00000 | 2 | 0.0 | 3.00 | 0 | 1 |
8.00000 | 44 | 28.0 | 6.00 | 0 | 1 |
12.00000 | 7 | 2.0 | 5.30 | 0 | 1 |
16.00000 | 9 | 8.0 | 8.75 | 0 | 1 |
18.00000 | 15 | 7.0 | 11.25 | 0 | 1 |
12.00000 | 5 | 2.4 | 5.00 | 0 | 0 |
12.56489 | 26 | 4.0 | 3.60 | 0 | 0 |
16.00000 | 8 | 2.0 | 6.25 | 0 | 0 |