Chapter 2 Data Processing and Cleaning

The original dataset WageData has 526 records, but some of them are incomplete. This is already visible in Table 2.1, which shows the first 10 records:

Table 2.1: The Original Data
Wage Educ Exper Tenure NonWhite Sex Married NumDep Smsa Region
3.10 11 2 0 NA Female No 2 Yes West
3.24 12 22 2 No Female Yes 3 Yes West
3.00 11 2 0 No Male No 2 No West
6.00 8 44 28 No Male Yes 0 Yes West
5.30 12 7 2 No Male Yes 1 No West
8.75 16 9 8 No Male Yes 0 Yes West
11.25 18 15 7 No Male No 0 Yes West
5.00 12 5 NA No Female No 0 Yes West
3.60 NA 26 4 No Female No 2 Yes West
18.18 17 22 21 No NA Yes 0 Yes West
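
The incomplete records in Table 2.1 can also be counted programmatically. The following is a minimal base R sketch, not part of the original workflow: it types in the ten rows shown above by hand, restricted to the six variables used later in the recipe, and the names First10, MissingPerColumn, and IncompleteRows are hypothetical.

```r
# First ten records of Table 2.1, typed in by hand (NA marks a missing value).
# First10 is a hypothetical stand-in for the full dataset.
First10 <- data.frame(
  Wage     = c(3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18),
  Educ     = c(11, 12, 11, 8, 12, 16, 18, 12, NA, 17),
  Exper    = c(2, 22, 2, 44, 7, 9, 15, 5, 26, 22),
  Tenure   = c(0, 2, 0, 28, 2, 8, 7, NA, 4, 21),
  NonWhite = c(NA, "No", "No", "No", "No", "No", "No", "No", "No", "No"),
  Sex      = c("Female", "Female", "Male", "Male", "Male",
               "Male", "Male", "Female", "Female", NA)
)

# Number of missing values in each column
MissingPerColumn <- colSums(is.na(First10))

# Row numbers of the records that have at least one missing value
IncompleteRows <- which(!complete.cases(First10))

print(MissingPerColumn)
print(IncompleteRows)
```

In these ten rows, Educ, Tenure, NonWhite, and Sex each have one missing value, in four different records. The same two calls could be applied to the full dataset once it has been loaded.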

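The code below cleans and pre-processes the data. It converts the categorical variables to dummy variables, imputes missing Educ values with the mean, missing NonWhite_Yes values with the median, and missing Tenure values with k-nearest neighbors, and finally drops the records with a missing Sex_Male value: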

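library(tidyverse)
library(tidymodels)
library(skimr)
library(readxl)

# Load the raw data from the "DataClean" sheet of the workbook
WageDataOrg <- read_excel("WageDataModifRTalk.xlsx", 
                                 sheet = "DataClean")

# Number of records before cleaning
NOrg=nrow(WageDataOrg)

# Note: step_impute_mean(), step_impute_median() and step_impute_knn() are the
# current recipes names for the older step_meanimpute(), step_medianimpute()
# and step_knnimpute().
RecipeCl=recipe(Wage~Educ+Exper+Tenure+NonWhite+Sex, data=WageDataOrg)%>% 
  step_dummy(all_nominal()) %>%        # categorical variables -> 0/1 dummies
  step_impute_mean(Educ) %>%           # missing Educ -> mean of Educ
  step_impute_median(NonWhite_Yes) %>% # missing NonWhite_Yes -> median
  step_impute_knn(Tenure) %>%          # missing Tenure -> k-nearest neighbors
  step_naomit(Sex_Male) %>%            # drop records with missing Sex_Male
  prep()

# Extract the processed training data (bake(new_data=NULL) supersedes juice())
WageData=bake(RecipeCl, new_data=NULL)

# Summary statistics for the cleaned data
DataInfoNew=skim(WageData)

# Number of records after cleaning
N=nrow(WageData)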
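
A quick way to confirm the effect of the cleaning is to compare row counts before and after and to count any remaining missing values. The following is a minimal base R sketch: SummariseCleaning is a hypothetical helper, not part of tidymodels, and ToyBefore and ToyAfter are made-up stand-ins so that the block runs without the Excel file.

```r
# Hypothetical helper: compares a data frame before and after cleaning.
SummariseCleaning <- function(before, after) {
  c(
    RowsBefore  = nrow(before),
    RowsAfter   = nrow(after),
    RowsDropped = nrow(before) - nrow(after),
    RemainingNA = sum(is.na(after))
  )
}

# Toy stand-in data (assumption, not taken from WageData)
ToyBefore <- data.frame(x = c(1, NA, 3, 4), g = c("a", "b", NA, "a"))

# Drop the record with a missing g, then mean-impute the missing x
ToyAfter <- ToyBefore[!is.na(ToyBefore$g), ]
ToyAfter$x[is.na(ToyAfter$x)] <- mean(ToyAfter$x, na.rm = TRUE)

Check <- SummariseCleaning(ToyBefore, ToyAfter)
print(Check)
```

With the objects from this chapter, the call would be SummariseCleaning(WageDataOrg, WageData).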

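The cleaned dataset WageData has 523 records, and all of them are complete. Table 2.2 shows the first 10 records. The missing Tenure value in row 8 and the missing Educ value in row 9 have been imputed (2.4 and 12.56489, respectively), and the original tenth record, which had no Sex value, has been dropped: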

Table 2.2: The Cleaned and Processed Data
Educ Exper Tenure Wage NonWhite_Yes Sex_Male
11.00000 2 0.0 3.10 0 0
12.00000 22 2.0 3.24 0 0
11.00000 2 0.0 3.00 0 1
8.00000 44 28.0 6.00 0 1
12.00000 7 2.0 5.30 0 1
16.00000 9 8.0 8.75 0 1
18.00000 15 7.0 11.25 0 1
12.00000 5 2.4 5.00 0 0
12.56489 26 4.0 3.60 0 0
16.00000 8 2.0 6.25 0 0