Lab_D_3_RM
Asmi Ariv
2022-10-14
Data Cleaning and Preprocessing 2
In this lab, we will learn advanced techniques for handling missing values, as well as top coding and bottom coding.
Random Imputation (try the following in R):
The following function replaces missing values of “var” with randomly selected records from “var”
Imp_Rand = function(var) {
  id = is.na(var) # Indices of missing values
  nm = sum(id) # Number of missing values
  var.new = var[!id] # Copying the variable without missing values
  var.imp = var # Replicating the original variable
  # Replacing missing values with a random sample of the observed data
  var.imp[id] = sample(var.new, nm, replace=TRUE)
  return(var.imp)
}
Let's try an example
x = c(10, 15, 16, 20, 30, 41, 62, "AAA", 12, "ZZZ", 99) #Some sample
x = as.numeric(x); x #Converting to numeric, else R would treat x as a character variable
## Warning: NAs introduced by coercion
## [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp = Imp_Rand(x); x.Imp #Missing values replaced by random sampling of x
## [1] 10 15 16 20 30 41 62 10 12 41 99
Impute Function (try the following in R):
The following function replaces missing values of x with the values of another variable impute_var
impute = function(x, impute_var){
  imputed = ifelse(is.na(x), impute_var, x)
  return(imputed)
}
Let's try an example
set.seed(1); y = sample(10:100, length(x), replace=TRUE); y #Some random impute_var, just for example
## [1] 77 48 10 43 96 52 23 91 68 60 94
x
## [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp2 = impute(x, y); x.Imp2
## [1] 10 15 16 20 30 41 62 91 12 60 99
Top coding or capping and bottom coding (try the following in R):
- An extreme value can affect the representation of a general population
- Top coding helps us understand the variable better by getting rid of extreme values
- E.g., a super rich person in a middle-income group can distort the average income
- So, we use top-coded data (values above upper bound are censored)
- We replace all the values higher than the upper bound by the upper bound itself
top_coded <- function(x, upper_bound){
  tc = ifelse(x > upper_bound, upper_bound, x)
  return(tc)
}
Let's try using the example of y and upper_bound = 55
y
## [1] 77 48 10 43 96 52 23 91 68 60 94
y_topcoded = top_coded(y, 55)
y_topcoded
## [1] 55 48 10 43 55 52 23 55 55 55 55
- Bottom coding follows similar logic as top coding; the only difference is that it sets a lower bound
- An extremely low value can distort the average of a general population
- So, we replace all values lower than lower bound by lower bound itself
- Especially if there are unexpected negative values
- E.g., days or hours worked cannot be negative, so negative values are set to zero
Exercise: Write two functions: one for bottom coding and another for zero coding. The bottom coding function should replace all the values lower than the lower bound by the lower bound itself. The zero coding function should set all the negative values to zero. Also, try writing only one function that achieves both objectives. (One possible solution sketch is given after the test data below.)
Hint: You can use the function ifelse() in R
Use y for bottom coding with lower bound = 20
Use hrs_worked_neg to test your zero coding
set.seed(1); hrs_worked = sample(10:60, 10, replace=T)
hrs_worked_neg = c(hrs_worked, -2, -10, -4)
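For reference, here is one possible solution sketch for the exercise; the function names bottom_coded, zero_coded, and coded are our own suggestions, not part of the lab.
# One possible solution sketch (function names are our own choices)
bottom_coded <- function(x, lower_bound){
  bc = ifelse(x < lower_bound, lower_bound, x) # values below the lower bound are replaced by the bound
  return(bc)
}
zero_coded <- function(x){
  zc = ifelse(x < 0, 0, x) # negative values are set to zero
  return(zc)
}
# A single function achieving both objectives: zero coding is just bottom coding with lower_bound = 0
coded <- function(x, lower_bound = 0){
  ifelse(x < lower_bound, lower_bound, x)
}
bottom_coded(y, 20) # y with lower bound = 20
zero_coded(hrs_worked_neg) # negative hours set to zero
coded(hrs_worked_neg) # same result using the combined function with its default lower_bound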
Using linear regression to predict missing values (try the following in R):
- We build a predictive model using the numerical variable with missing values as the response variable
- Predict the response variable using the model
- Replace all the missing values by predicted values
- For this, we use the impute() function we built earlier
Let’s look at one example:
set.seed(1); age = rnorm(1000, mean=45, sd=10)
salary = 2 + 5*age + rnorm(1000, mean=50, sd=20)
age_missing = age #Creating a copy of original variable
set.seed(1); Ind_miss = sample(1:100, 20, replace=F) #Random indices for missing values
age_missing[Ind_miss] = NA #Setting the random indices to NA
plot(age, salary)
max(age_missing, na.rm=T); min(age_missing, na.rm=T); mean(age_missing, na.rm=T)
## [1] 83.10277
## [1] 14.91951
## [1] 44.85673
max(salary, na.rm=T); min(salary, na.rm=T); mean(salary, na.rm=T)
## [1] 494.3234
## [1] 100.7741
## [1] 276.0924
Our experts say we need to top-code both age and salary to get a better result
Upper_bound_age = 55
Upper_bound_sal = 350
age_missing_tc = top_coded(age_missing, Upper_bound_age)
max(age_missing_tc, na.rm=T)
## [1] 55
salary_tc = top_coded(salary, Upper_bound_sal)
max(salary_tc, na.rm=T)
## [1] 350
lm_imp_age = lm(age_missing_tc ~ salary_tc) #Building linear regression model
pred = predict(lm_imp_age, data.frame(salary_tc)) #Getting the predicted values of age_missing
Let's impute the missing values by the predicted values using the impute function
age.Imp1 = impute(age_missing_tc, pred); #age.Imp1
age_missing_tc[is.na(age_missing_tc)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.Imp1[is.na(age_missing_tc)]
## [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
## [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##       51       54       59       68       73       74       79       82
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
##       83       85       87       97
## 55.32351 49.42955 56.32743 39.75878
To get the optimum values for missing data, we use iterative regression imputation
age.imp2 = Imp_Rand(age_missing_tc) #Initialize with random imputation
for(i in 1:10){
  lm_imp_age1 = lm(age.imp2 ~ salary_tc) #Building linear regression model
  pred1 = predict(lm_imp_age1, data.frame(salary_tc)) #Getting the predicted values of age_missing
  age.imp2 = impute(age_missing_tc, pred1)
}
age_missing_tc[is.na(age_missing_tc)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.imp2[is.na(age_missing_tc)]
## [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
## [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred1[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##       51       54       59       68       73       74       79       82
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
##       83       85       87       97
## 55.32351 49.42955 56.32743 39.75878
Our model performs extremely well in the first iteration itself, so there is little further improvement in later iterations, perhaps because our data is quite simple.
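One optional way to check this for ourselves is to track how much the imputed values change from one iteration to the next. The sketch below is our own addition (object names such as age.chk, lm_chk, pred_chk, and new_imp are not part of the lab); in this simple example the largest change should drop to essentially zero after the first pass.
# Optional sketch (our own addition): track the largest change per iteration
age.chk = Imp_Rand(age_missing_tc) # initialize with random imputation, as above
for(i in 1:10){
  lm_chk = lm(age.chk ~ salary_tc) # refit the regression on the currently imputed data
  pred_chk = predict(lm_chk, data.frame(salary_tc))
  new_imp = impute(age_missing_tc, pred_chk) # re-impute only the originally missing entries
  cat("Iteration", i, "- max change:", round(max(abs(new_imp - age.chk)), 5), "\n")
  age.chk = new_imp
}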