Lab_D_3_RM
Asmi Ariv
2022-10-14
Data Cleaning and Preprocessing 2
In this lab, we will learn advanced techniques for handling missing values, as well as top coding and bottom coding.
Random Imputation (try the following in R):
The following function replaces missing values of “var” with randomly selected records from “var”
Imp_Rand = function(var) {
  id = is.na(var) # Indices of missing values
  nm = sum(id) # Number of missing values
  var.new = var[!id] # Copying the variable without missing values
  var.imp = var # Replicating the original variable
  # Replacing missing values with a random sample of the observed data
  var.imp[id] = sample(var.new, nm, replace=TRUE)
  return(var.imp)
}
Let's try an example
x = c(10, 15, 16, 20, 30, 41, 62, "AAA", 12, "ZZZ", 99) #Some sample
x = as.numeric(x); x #Converting to numeric, else R would treat x as a character variable
## Warning: NAs introduced by coercion
## [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp = Imp_Rand(x); x.Imp #Missing values replaced by random sampling of x
## [1] 10 15 16 20 30 41 62 10 12 41 99
Impute Function (try the following in R):
The following function replaces missing values of x with the values of another variable impute_var
impute = function(x, impute_var){
  imputed = ifelse(is.na(x), impute_var, x)
  return(imputed)
}
Let's try an example
set.seed(1); y = sample(10:100, length(x), replace=TRUE); y #Some random impute_var, just for example
## [1] 77 48 10 43 96 52 23 91 68 60 94
x
## [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp2 = impute(x, y); x.Imp2
## [1] 10 15 16 20 30 41 62 91 12 60 99
Top coding or capping and bottom coding (try the following in R):
- An extreme value can affect the representation of a general population
- Top coding helps us understand the variable better by getting rid of extreme values
- E.g., a super rich person in a middle-income group can distort the average income
- So, we use top-coded data (values above upper bound are censored)
- We replace all the values higher than the upper bound by the upper bound itself
top_coded <- function(x, upper_bound){
  tc = ifelse(x > upper_bound, upper_bound, x)
  return(tc)
}
Let's try using the example of y and upper_bound = 55
y
## [1] 77 48 10 43 96 52 23 91 68 60 94
y_topcoded = top_coded(y, 55)
y_topcoded
## [1] 55 48 10 43 55 52 23 55 55 55 55
- Bottom coding follows similar logic as top coding; the only difference is that it sets a lower bound
- An extremely low value can distort the average of a general population
- So, we replace all values lower than lower bound by lower bound itself
- Especially if there are unexpected negative values
- E.g., days or hours worked cannot be negative, so negative values are set to zero
Exercise: Write two functions: one for bottom coding and another for zero coding. The bottom coding function should replace all the values lower than the lower bound by the lower bound itself. The zero coding function should set all the negative values to zero. Also, try writing only one function that achieves both objectives. (One possible solution sketch is given after the test data below.)
Hint: You can use the function ifelse() in R
Use y for bottom coding with lower bound = 20
Use hrs_worked_neg to test your zero coding
set.seed(1); hrs_worked = sample(10:60, 10, replace=T)
hrs_worked_neg = c(hrs_worked, -2, -10, -4)
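For reference, here is one possible solution sketch for the exercise; the function names bottom_coded, zero_coded, and coded are our own suggestions, not part of the lab.
# One possible solution sketch (function names are our own choices)
bottom_coded <- function(x, lower_bound){
  bc = ifelse(x < lower_bound, lower_bound, x) # values below the lower bound are replaced by the bound
  return(bc)
}
zero_coded <- function(x){
  zc = ifelse(x < 0, 0, x) # negative values are set to zero
  return(zc)
}
# A single function achieving both objectives: zero coding is just bottom coding with lower_bound = 0
coded <- function(x, lower_bound = 0){
  ifelse(x < lower_bound, lower_bound, x)
}
bottom_coded(y, 20) # y with lower bound = 20
zero_coded(hrs_worked_neg) # negative hours set to zero
coded(hrs_worked_neg) # same result using the combined function with its default lower_bound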
Using linear regression to predict missing values (try the following in R):
- We build a predictive model using the numerical variable with missing values as the response variable
- Predict the response variable using the model
- Replace all the missing values by predicted values
- For this, we use the impute() function we built earlier
Let’s look at one example:
set.seed(1); age = rnorm(1000, mean=45, sd=10)
salary = 2 + 5*age + rnorm(1000, mean=50, sd=20)
age_missing = age #Creating a copy of original variable
set.seed(1); Ind_miss = sample(1:100, 20, replace=F) #Random indices for missing values
age_missing[Ind_miss] = NA #Setting the random indices to NA
plot(age, salary)
max(age_missing, na.rm=T); min(age_missing, na.rm=T); mean(age_missing, na.rm=T)
## [1] 83.10277
## [1] 14.91951
## [1] 44.85673
max(salary, na.rm=T); min(salary, na.rm=T); mean(salary, na.rm=T)
## [1] 494.3234
## [1] 100.7741
## [1] 276.0924
Our experts say we need to top-code both age and salary to get a better result
Upper_bound_age = 55
Upper_bound_sal = 350
age_missing_tc = top_coded(age_missing, Upper_bound_age)
max(age_missing_tc, na.rm=T)
## [1] 55
salary_tc = top_coded(salary, Upper_bound_sal)
max(salary_tc, na.rm=T)
## [1] 350
lm_imp_age = lm(age_missing_tc ~ salary_tc) #Building linear regression model
pred = predict(lm_imp_age, data.frame(salary_tc)) #Getting the predicted values of age_missing
Let's impute the missing values by the predicted values using the impute function
age.Imp1 = impute(age_missing_tc, pred); #age.Imp1
age_missing_tc[is.na(age_missing_tc)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.Imp1[is.na(age_missing_tc)]
## [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
## [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##       51       54       59       68       73       74       79       82
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
##       83       85       87       97
## 55.32351 49.42955 56.32743 39.75878
To get the optimum values for missing data, we use iterative regression imputation
age.imp2 = Imp_Rand(age_missing_tc) #Initialize with random imputation
for(i in 1:10){
  lm_imp_age1 = lm(age.imp2 ~ salary_tc) #Building linear regression model
  pred1 = predict(lm_imp_age1, data.frame(salary_tc)) #Getting the predicted values of age_missing
  age.imp2 = impute(age_missing_tc, pred1)
}
age_missing_tc[is.na(age_missing_tc)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.imp2[is.na(age_missing_tc)]
## [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
## [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred1[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##       51       54       59       68       73       74       79       82
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
##       83       85       87       97
## 55.32351 49.42955 56.32743 39.75878
Our model performs extremely well in the first iteration itself, so there is little further improvement in later iterations, perhaps because our data is quite simple.
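One optional way to check this for ourselves is to track how much the imputed values change from one iteration to the next. The sketch below is our own addition (object names such as age.chk, lm_chk, pred_chk, and new_imp are not part of the lab); in this simple example the largest change should drop to essentially zero after the first pass.
# Optional sketch (our own addition): track the largest change per iteration
age.chk = Imp_Rand(age_missing_tc) # initialize with random imputation, as above
for(i in 1:10){
  lm_chk = lm(age.chk ~ salary_tc) # refit the regression on the currently imputed data
  pred_chk = predict(lm_chk, data.frame(salary_tc))
  new_imp = impute(age_missing_tc, pred_chk) # re-impute only the originally missing entries
  cat("Iteration", i, "- max change:", round(max(abs(new_imp - age.chk)), 5), "\n")
  age.chk = new_imp
}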