
Data Cleaning and Preprocessing 2

In this lab, we will learn advanced techniques for handling missing values, top coding, bottom coding, etc.

Random Imputation (try the following in R):

The following function replaces missing values of “var” with values randomly sampled from the non-missing records of “var”:

Imp_Rand = function(var) {
    id = is.na(var)          #Indices of missing values
    nm = sum(id)             #Number of missing values
    var.new = var[!id]       #Copying the variable without missing values
    var.imp = var            #Replicating the original variable
    #Replacing missing values with a random sample of the observed data
    var.imp[id] = sample(var.new, nm, replace=TRUE)
    return(var.imp)
}

Let’s try an example

x =  c(10, 15, 16, 20, 30, 41, 62, "AAA", 12, "ZZZ", 99)    #Some sample
x = as.numeric(x) ; x   #Converting to numeric, else R would treat x as a character variable
## Warning: NAs introduced by coercion
##  [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp = Imp_Rand(x); x.Imp      #Missing values replaced by random sampling of x
##  [1] 10 15 16 20 30 41 62 10 12 41 99
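One practical aside: because Imp_Rand() draws with sample(), two calls generally give different imputations. Setting a seed first makes the result reproducible. The sketch below repeats the function definition so it runs on its own (the seed value 42 is arbitrary):

```r
#Imp_Rand() as defined above, repeated so this snippet runs standalone
Imp_Rand = function(var) {
    id = is.na(var)          #Indices of missing values
    nm = sum(id)             #Number of missing values
    var.new = var[!id]       #Variable without missing values
    var.imp = var            #Replicating the original variable
    var.imp[id] = sample(var.new, nm, replace=TRUE)
    return(var.imp)
}

x = as.numeric(c(10, 15, 16, 20, 30, 41, 62, "AAA", 12, "ZZZ", 99))

set.seed(42); x.Imp.a = Imp_Rand(x)
set.seed(42); x.Imp.b = Imp_Rand(x)
identical(x.Imp.a, x.Imp.b)    #TRUE: same seed, same imputations
```

This matters when you need to rerun an analysis and get identical imputed values.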

Impute Function (try the following in R):

The following function replaces missing values of x with the corresponding values of another variable, impute_var:

impute = function(x, impute_var){
    imputed = ifelse(is.na(x), impute_var, x)
    return(imputed)
}

Let’s try an example

set.seed(1); y = sample(10:100, length(x), replace=TRUE); y #Some random impute_var, just for example
##  [1] 77 48 10 43 96 52 23 91 68 60 94
x 
##  [1] 10 15 16 20 30 41 62 NA 12 NA 99
x.Imp2 = impute(x, y); x.Imp2
##  [1] 10 15 16 20 30 41 62 91 12 60 99

Top coding or capping and bottom coding (try the following in R):

  • An extreme value can distort the representation of a general population
  • Top coding helps us understand the variable better by removing extreme values
  • E.g., a super-rich person in a middle-income group can distort the average income
  • So, we use top-coded data (values above an upper bound are censored)
  • We replace all values higher than the upper bound with the upper bound itself
top_coded <- function (x, upper_bound){
    tc = ifelse(x>upper_bound, upper_bound, x)
    return(tc)
}

Let’s try using the example of y and upper_bound = 55

y
##  [1] 77 48 10 43 96 52 23 91 68 60 94
y_topcoded = top_coded(y, 55)
y_topcoded
##  [1] 55 48 10 43 55 52 23 55 55 55 55
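As a side note, the same capping can be written with base R's pmin(), which takes the element-wise minimum of each value and the bound in a single vectorized call (NA stays NA in both versions). A small sketch using the same y and bound:

```r
#Alternative to top_coded(): pmin() caps every element at the upper bound
y = c(77, 48, 10, 43, 96, 52, 23, 91, 68, 60, 94)
pmin(y, 55)
##  [1] 55 48 10 43 55 52 23 55 55 55 55
```

The ifelse() version used in top_coded() is easier to read for beginners; pmin() is the more idiomatic one-liner.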
  • Bottom coding follows the same logic as top coding; the only difference is that it sets a lower bound
  • An extremely low value can distort the average of a general population
  • So, we replace all values lower than the lower bound with the lower bound itself
  • This is especially useful when there are unexpected negative values
  • E.g., days or hours worked cannot be negative, so negative values are set to zero

Exercise: Write two functions: one for bottom coding and another for zero coding. The bottom coding function should replace all values lower than the lower bound with the lower bound itself. The zero coding function should set all negative values to zero. Also, try writing a single function that achieves both objectives.

Hint: You can use the function ifelse() in R

Use y for bottom coding with lower bound = 20

Use hrs_worked_neg to test your zero coding

set.seed(1); hrs_worked = sample(10:60, 10, replace=T)

hrs_worked_neg = c(hrs_worked, c(-2, -10, -4))
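If you get stuck on the combined version, here is one possible shape (a sketch, not the only valid solution): since zero coding is just bottom coding with a lower bound of 0, a single function with a default argument covers both. The name bound_code is our own choice, not from the lab.

```r
#Sketch of one possible answer: lower_bound defaults to 0, so calling
#bound_code(x) with no bound performs zero coding
bound_code <- function(x, lower_bound = 0){
    ifelse(x < lower_bound, lower_bound, x)
}

#Re-creating the test data from above so this runs standalone
set.seed(1); hrs_worked = sample(10:60, 10, replace=T)
hrs_worked_neg = c(hrs_worked, c(-2, -10, -4))

bound_code(hrs_worked_neg)        #Zero coding: negatives become 0
```

Try it yourself first before comparing with this sketch, and test bound_code(y, 20) for the bottom coding case.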

Using linear regression to predict missing values (try the following in R):

  • We build a predictive model using the numerical variable with missing values as the response variable
  • We predict the response variable using the model
  • We replace all the missing values with the predicted values
  • For this, we use the impute() function we built earlier

Let’s look at one example:

set.seed(1); age = rnorm(1000, mean=45, sd=10)

salary = 2 + 5*age + rnorm(1000, mean=50, sd=20)

age_missing = age                           #Creating a copy of the original variable

set.seed(1); Ind_miss = sample(1:100, 20, replace=F)    #Random indices for missing values
age_missing[Ind_miss] = NA                  #Setting random indices to NA

plot(age, salary)

max(age_missing, na.rm=T); min(age_missing, na.rm=T); mean(age_missing, na.rm=T)
## [1] 83.10277
## [1] 14.91951
## [1] 44.85673
max(salary, na.rm=T); min(salary, na.rm=T); mean(salary, na.rm=T)
## [1] 494.3234
## [1] 100.7741
## [1] 276.0924

Our experts say we need to top-code both age and salary to get a better result:

Upper_bound_age = 55
Upper_bound_sal = 350

age_missing_tc = top_coded(age_missing, Upper_bound_age )
max(age_missing_tc, na.rm=T);
## [1] 55
salary_tc = top_coded(salary, Upper_bound_sal) 
max(salary_tc, na.rm=T);
## [1] 350
lm_imp_age = lm(age_missing_tc ~ salary_tc)        #Building a linear regression model

pred = predict(lm_imp_age, data.frame(salary_tc))  #Getting the predicted values of age_missing

Let’s impute the missing values with the predicted values using the impute() function:

age.Imp1 = impute(age_missing_tc, pred); #age.Imp1

age_missing_tc[is.na(age_missing_tc)]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.Imp1[is.na(age_missing_tc)]
##  [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##  [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43 
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351 
##       51       54       59       68       73       74       79       82 
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898 
##       83       85       87       97 
## 55.32351 49.42955 56.32743 39.75878

To refine the imputed values further, we can use iterative regression imputation:

age.imp2 = Imp_Rand(age_missing_tc)             #Initialize with random imputation

for(i in 1:10){
    lm_imp_age1 = lm(age.imp2 ~ salary_tc)      #Building a linear regression model
    pred1 = predict(lm_imp_age1, data.frame(salary_tc))  #Predicted values of age_missing
    age.imp2 = impute(age_missing_tc, pred1)    #Re-imputing with the new predictions
}

age_missing_tc[is.na(age_missing_tc)]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
age.imp2[is.na(age_missing_tc)]
##  [1] 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351
##  [9] 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898
## [17] 55.32351 49.42955 56.32743 39.75878
pred1[is.na(age_missing_tc)]
##        1        7       14       21       34       37       39       43 
## 43.18234 51.10766 22.06430 50.91937 43.39659 40.64742 56.32743 47.03351 
##       51       54       59       68       73       74       79       82 
## 47.91204 43.21128 54.67013 50.46524 49.09073 37.87065 46.95970 43.02898 
##       83       85       87       97 
## 55.32351 49.42955 56.32743 39.75878
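Instead of a fixed number of iterations, you can also stop when the imputed values stabilize. Below is a self-contained sketch of that idea on small synthetic data; the variable names, data sizes, and the 1e-6 tolerance are illustrative assumptions, not part of the lab:

```r
#Sketch: iterative regression imputation with a convergence check
impute <- function(x, impute_var) ifelse(is.na(x), impute_var, x)

set.seed(1)
age    <- rnorm(200, mean = 45, sd = 10)
salary <- 2 + 5*age + rnorm(200, mean = 50, sd = 20)
age_na <- age
age_na[sample(1:200, 20)] <- NA                         #Knock out 20 values

age_cur <- impute(age_na, mean(age_na, na.rm = TRUE))   #Mean-initialize
for(i in 1:50){
    fit     <- lm(age_cur ~ salary)                     #Refit on current imputations
    pred    <- predict(fit, data.frame(salary))
    age_new <- impute(age_na, pred)                     #Replace only missing entries
    if(max(abs(age_new - age_cur)) < 1e-6) break        #Stop when values stabilize
    age_cur <- age_new
}
```

A tighter tolerance costs more iterations; for simple data like this, the loop typically stops long before the iteration cap.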

Our model seems to perform extremely well in the first iteration itself, so there isn’t much further improvement in later iterations, perhaps because our data is quite simple.

