Lab_NLP_1_RM
Asmi Ariv
2022-10-12
Text Analytics in R
In this lab we will go through various steps of text analytics that involve natural language processing (NLP).
Kindly go through the video lectures to understand what we are doing here.
For all the text preprocessing we will use the “tm” package. We will transform the text into a bag of words (the splitting into words is known as tokenization) with tf or tf-idf weights in a matrix, which can then be used for classification (if we have a response variable, such as opinions) or clustering (if we do not have a response variable in the data).
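As a small illustration of tf-idf before we touch real data (a toy, hand-made corpus; later the tm package computes these weights for us via weightTf/weightTfIdf):

```r
# Toy example: three tiny "documents" (made-up word lists)
docs <- list(c("the", "cat", "sat"),
             c("the", "dog", "sat"),
             c("the", "cats", "and", "dogs"))
vocab <- sort(unique(unlist(docs)))

# Term frequency (tf): counts of each vocabulary term per document
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Inverse document frequency (idf): log(total docs / docs containing the term)
df  <- colSums(tf > 0)
idf <- log(length(docs) / df)

# tf-idf: terms frequent everywhere get weight 0, distinctive terms score high
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)  # the column for "the" is all zeros: it appears in every document
```

Terms that occur in every document (like “the” here) carry no discriminating information, which is exactly why tf-idf down-weights them.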
Load required packages
library(tm)
library(readtext)
library(cluster)
library(e1071)
library(nnet)
library(rtweet)

Tweet Data for clustering
In order to collect data from Twitter, we need to create a Twitter app using its developer site: https://developer.twitter.com/en/portal/petition/use-case
Read the instructions on the following page about creating a developer account and Twitter API keys/access tokens:
https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html
Once we have created a Twitter app and generated the API key, API secret key, access token and access token secret, we can use the create_token() function as shown below to authenticate our Twitter API account in the R session.
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)

Download tweets
Once we have authenticated our Twitter API account, R is connected to Twitter and we can download the data we want.
Let’s download some data on Queen Elizabeth II.
We will use the search_tweets() function to do that. Type ?search_tweets in R to learn more about the function.
tweets_data <- search_tweets('Queen Elizabeth II OR Queen Elizabeth', include_rts = FALSE, lang = 'en', n = 18000)

Let’s save the data for our future use
save(tweets_data, file="tweets_data.Rdata")

Load tweets
Let’s load the tweets we just saved in our working directory.
load("tweets_data.Rdata")
dim(tweets_data)
## [1] 17977    43
So, our search has returned 17977 records (or tweets) with 43 attributes (or variables).
Let’s look at the variables in the data
names(tweets_data)
## [1] "created_at" "id"
## [3] "id_str" "full_text"
## [5] "truncated" "display_text_range"
## [7] "entities" "metadata"
## [9] "source" "in_reply_to_status_id"
## [11] "in_reply_to_status_id_str" "in_reply_to_user_id"
## [13] "in_reply_to_user_id_str" "in_reply_to_screen_name"
## [15] "geo" "coordinates"
## [17] "place" "contributors"
## [19] "is_quote_status" "retweet_count"
## [21] "favorite_count" "favorited"
## [23] "retweeted" "possibly_sensitive"
## [25] "lang" "retweeted_status"
## [27] "quoted_status_id" "quoted_status_id_str"
## [29] "quoted_status" "withheld_scope"
## [31] "withheld_in_countries" "text"
## [33] "favorited_by" "scopes"
## [35] "display_text_width" "quoted_status_permalink"
## [37] "quote_count" "timestamp_ms"
## [39] "reply_count" "filter_level"
## [41] "query" "withheld_copyright"
## [43] "possibly_sensitive_appealable"
Our main focus is “text”, which is the variable for all the tweets.
Let’s look at some of the tweets
tweets_data$text[10:20]## [1] "RT @ForcesNews: .@HMSQNLZ vs. @Warship_78 <U+2693>\n\nHow do these two giants of the sea compare? We have had a look at their weapons, aircraft, siz…"
## [2] "We will be closed on Monday for the state funeral of Queen Elizabeth II, so we can give our staff time to pay their respects.<U+0001F917> #SceneConcept \nOriginal: architecttiles https://t.co/gbXptIGgx2"
## [3] "RT @RoyalFamily: Thank you to the people of Aberdeenshire <U+0001F3F4><U+000E0067><U+000E0062><U+000E0073><U+000E0063><U+000E0074><U+000E007F>\n\nThe King and The Queen Consort have met the team, including fire and amb…"
## [4] "@redfishstream's account has been withheld in India in response to a legal demand. Learn more."
## [5] "The life of #QueenElizabethII is celebrated in a new #comicbook \n#Sharjah24\nhttps://t.co/pnRiluLGsc"
## [6] "I wanna wish the badarse queen herself the most happiest of birthdays to @AboutElizabethM !!!<U+0001F973> I Hope you have a fantastic time on your special day, Elizabeth. Never stop being such a wonderful person in life. Can't wait to meet you and Jason Liebrecht next month at SupaNova! <U+0001F601> https://t.co/feAEthPL1O"
## [7] "RT @QueenLilibet_: Queen Elizabeth II stands with Archbishop of Canterbury Rowan Williams at a Diamond Jubilee multi-faith reception at Lam…"
## [8] "Just realized that the Queen Elizabeth 2 and Marilyn Monroe were both born in the same year - 1926. \n\nOnly a little over a month apart. \n\nComes as a bit of a shock. And makes you realize how incredibly young Marilyn was when she died."
## [9] "@DramaWarship A strange thought for me, but my great uncle was on the Queen Elizabeth at that time, as a Royal Marine gunner."
## [10] "RT @nomoremonarchs: Widow's mite? Elizabeth Bowes-Lyon Windsor (aka \"the queen mother\") was paid by parliament for 50 years, from the day h…"
## [11] "Princess Anne Hosts First Investiture Ceremony Since Queen Elizabeth's Death at Buckingham Palace https://t.co/VmeMOzucDd"

Preprocessing the tweets to construct Document Term Matrix
Let’s use tm’s VectorSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
myCorpus <- VCorpus(VectorSource(tweets_data$text)) #For creating corpus source
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 289
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1] 17977 16181
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.99)
dim(myTdm)
## [1] 17977   159
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 17977   159
We are left with 159 terms (down from 16181) in the final dataset after removing sparse terms.
Removing empty rows
val = apply(temp,1,sum) #Getting the values of each record
idx = val > 0 #Indices of non-empty rows
temp = temp[idx,] #Dataset without empty records
rowcount = nrow(temp)
rowcount
## [1] 17744
There were only 233 (17977 - 17744) rows with no entries at all, given that we removed some of the sparse terms.
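Why the empty rows must go: the cosine distance we compute next divides by each document’s norm, and an all-zero row makes that denominator zero. A quick base-R illustration (made-up vectors and a hypothetical helper cosine_dist() mirroring the formula used below):

```r
# Hypothetical helper mirroring the cosine distance formula used below
cosine_dist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

a <- c(1, 2, 0)   # a document with some surviving terms
b <- c(0, 0, 0)   # an "empty" document: all its terms were removed as sparse

cosine_dist(a, a)  # 0: a document has zero distance to itself
cosine_dist(a, b)  # NaN: a zero norm makes the distance undefined
```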
Generating distance matrix
One way to use the data for clustering is to generate a distance matrix. Let’s do it for a subset of the data by selecting only 1000 records; otherwise the system may slow down a bit.
distance = matrix(nrow=1000, ncol=1000)
for(i in 1:1000){
  for(j in 1:1000){
    # cosine dissimilarity between documents i and j
    distance[i,j] = 1 - sum(temp[i,]*temp[j,]) / (sqrt(sum(temp[i,]*temp[i,])) * sqrt(sum(temp[j,]*temp[j,])))
  }
}
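The nested loops above are easy to read but slow in R. The same cosine dissimilarity matrix can be obtained with one round of matrix algebra; a sketch on a toy matrix (substitute temp[1:1000, ] for x to reproduce the loop’s result):

```r
# Toy document-term counts: 3 documents x 4 terms (made-up values)
x <- matrix(c(1, 0, 2, 0,
              0, 1, 1, 0,
              1, 1, 0, 2), nrow = 3, byrow = TRUE)

# Row norms, then all pairwise cosine similarities in one matrix product
norms    <- sqrt(rowSums(x^2))
distance <- 1 - (x %*% t(x)) / outer(norms, norms)

round(distance, 4)  # symmetric, with zeros on the diagonal
```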
dissmatrix = as.dist(distance)

Let’s save it in our working directory for future use
save(dissmatrix, file="dissmatrix.twt.Rdata")

Let’s load the distance matrix that we just saved
load("dissmatrix.twt.Rdata")
dim(dissmatrix)
## [1] 1000 1000
So we have a 1000-by-1000 distance matrix.
Let’s look at some of the values
as.matrix(dissmatrix)[1:3,1:10]
##           1         2         3         4         5         6         7
## 1 0.0000000 0.5703311 0.6127017 0.4522774 0.4836022 0.7418011 0.5527864
## 2 0.5703311 0.0000000 0.5839749 0.4116516 0.5839749 0.7226499 0.5196155
## 3 0.6127017 0.5839749 0.0000000 0.2928932 0.5000000 0.7500000 0.4226497
## 8 9 10
## 1 0.4522774 0.3554966 1
## 2 0.4116516 0.4615385 1
## 3 0.2928932 0.4452998 1

Clustering
Let’s run the clustering algorithm using pam() from the cluster package to divide the data into two clusters.
pam.result <- pam(dissmatrix,k=2,diss=TRUE)
pam.result$clustering
## [1] 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [38] 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1 1 1 2
## [75] 1 1 2 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [112] 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
## [149] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1
## [186] 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 2
## [260] 2 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
## [334] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 2 1 1 1
## [371] 1 1 1 2 1 1 2 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1
## [408] 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 1 1 2 1 1 2 2 1
## [445] 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
## [482] 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 1
## [519] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2
## [556] 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 2
## [593] 1 1 1 1 1 2 1 1 2 2 1 1 2 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1
## [630] 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [667] 1 1 2 2 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 2 1 1 2
## [704] 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 2 2 1 2 1 1 2 1
## [741] 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 2 2 1 1 1 1 1 1
## [778] 1 1 2 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
## [815] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 2 1 2 2 1
## [852] 1 1 2 2 2 1 1 1 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
## [889] 2 2 2 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 1 1 1 2 2 1 2 1 2
## [926] 2 2 1 2 1 1 2 2 1 1 2 2 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 2
## [963] 2 1 2 2 1 2 2 1 1 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [1000] 2

Top words in each cluster
Divide the data into cluster 1 and cluster 2
data = temp[1:1000,]
clus1 = data[pam.result$clustering==1,]
clus2 = data[pam.result$clustering==2,]
dim(clus1)
## [1] 756 159
dim(clus2)
## [1] 244 159
So, we have 756 records in cluster 1 and 244 records in cluster 2.
Let’s get the indices of terms in decreasing order of their total term frequency values in each cluster
clus1_terms_idx = order(colSums(clus1), decreasing=T)
clus2_terms_idx = order(colSums(clus2), decreasing=T)

Get the top 20 words in each cluster
cat("\n top 20 terms used in cluster 1 \n")
##
##  top 20 terms used in cluster 1
colnames(temp)[clus1_terms_idx[1:20]]
## [1] "queen"     "elizabeth" "thank"     "king"      "amp"       "majesti"
## [7] "pictur" "honour" "show" "carri" "princ" "royal"
## [13] "anujdhar" "princess" "year" "will" "fo…" "death"
## [19] "charl"     "attend"
cat("\n \n top 20 terms used in cluster 2 \n")
##
##
##  top 20 terms used in cluster 2
colnames(temp)[clus2_terms_idx[1:20]]
## [1] "new"     "royal"   "york"    "british"
## [5] "navi" "hms" "queen" "elizabeth"
## [9] "around" "jet" "harbor" "suit"
## [13] "fli" "httpstcogufzqpzi" "use" "valaafshar"
## [17] "royalfamili" "volunt"      "duchess"     "visit"

Exercise: Try downloading tweets on some of your favorite topics and perform all the tasks as above.
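For the exercise, one way to judge a clustering of your own tweets is the average silhouette width, available from the already-loaded cluster package. A toy sketch with two clearly separated point clouds (made-up data, not tweets):

```r
library(cluster)

set.seed(1)
# Two well-separated groups of 10 points each (made-up 2-D data)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
d <- dist(pts)

pr <- pam(d, k = 2, diss = TRUE)
# Average silhouette width: near 1 means tight, well-separated clusters
mean(silhouette(pr)[, "sil_width"])
```

Trying several values of k and keeping the one with the highest average silhouette width is a common heuristic for choosing the number of clusters.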
Movie Review Data for classification
For this section, we will use a data set from the website: https://www.cs.cornell.edu/people/pabo/movie-review-data/
The source page is a distribution site for movie-review data for use in sentiment-analysis experiments.
The dataset was introduced in the papers by Bo Pang and Lillian Lee
We will use the version: polarity dataset v2.0
The data set has 1000 negative reviews and 1000 positive reviews.
However, the data is available in a .tar.gz format on the website. This zipped file contains a folder “txt_sentoken”, which has two subfolders, namely, “neg” and “pos”, each containing 1000 text files of negative and positive reviews respectively.
We will download and convert these text files into a csv file with reviews in one column and opinions (positive or negative) in another.
We will use the “readtext” package, whose readtext() function reads all the text files from the subfolders of a folder and stores the text in a data frame. The data frame has two columns, “doc_id” and “text”; “doc_id” is required by the tm package when we create a data frame source for text processing.
We will then add a new column of opinions (negative and positive) to the data frame and save the data as a .csv file in our working directory for our future use.
The code follows; please go through it.
sr <- "https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz" #Web Link for data set
download.file(sr,destfile="review_polarity.tar.gz") #Download it in your working directory
untar("review_polarity.tar.gz") #Extract the folders in your wd, txt_sentoken with neg and pos subfolders
data = readtext("txt_sentoken/*") #It will read all the text files in folders neg and pos in a data frame
#Since text files are first read from neg folder and then pos folder, we can add another column with opinions
opinions = c(rep("negative", 1000), rep("positive", 1000))
data_op = cbind(data, opinions)
write.csv(data_op, file="movie_rev.csv", row.names=F)

Preprocessing the movie reviews to construct Document Term Matrix
Loading data
Let’s load the movie data from the .csv file we just created
opinions = read.csv("movie_rev.csv",stringsAsFactors=FALSE) #reading .csv file we created in wd
nrow(opinions)
## [1] 2000
names(opinions)
## [1] "doc_id"   "text"     "opinions"
As we can see, there are 2000 records and three columns: doc_id is unique for each review, text contains the reviews by the audience, and opinions is either positive or negative for each record.
Randomize the data
The records in our data are stacked by the two opinion classes in the order we stored them, i.e. first all negatives, followed by all positives.
In data science, unless we are dealing with a sequence or time series dataset, we must randomize our data for unbiased selection of records while model building.
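The randomization below boils down to drawing a random permutation of row indices and reindexing the data with it (a toy sketch):

```r
set.seed(1)
rand <- sample(5, replace = FALSE)  # a random permutation of 1..5
rand

letters[1:5][rand]  # the same shuffle applied to a toy "dataset"
```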
m = nrow(opinions) #number of records in the dataset
set.seed(1); rand = sample(m, replace=F) #randomizing row numbers, e.g. instead of 1,2,3 it could be 2,1,3
rand_op = opinions[rand, ] #Reading data in a random order
dim(rand_op)
## [1] 2000    3

Preprocessing data
Let’s use tm’s DataframeSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
ds <- DataframeSource(as.data.frame(rand_op[,1:2])) #For creating data frame source
myCorpus<- VCorpus(ds) #Converting data into a corpus
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2316
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1]  2000 30570
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.85)
dim(myTdm)
## [1] 2000  312
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 2000  312
We are left with 312 terms (down from 30570) in the final dataset after removing sparse terms.
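What removeSparseTerms(myTdm, 0.85) actually does: a term’s sparsity is the fraction of documents it does not occur in, and terms sparser than the threshold are dropped. A base-R sketch of the same rule on a toy count matrix (made-up counts and term names):

```r
# Toy counts: 10 documents x 3 terms (made-up)
m <- matrix(0, nrow = 10, ncol = 3,
            dimnames = list(NULL, c("common", "medium", "rare")))
m[1:9, "common"] <- 1   # missing from 1 of 10 docs -> sparsity 0.10
m[1:3, "medium"] <- 1   # missing from 7 of 10 docs -> sparsity 0.70
m[1,   "rare"]   <- 1   # missing from 9 of 10 docs -> sparsity 0.90

sparsity  <- colMeans(m == 0)
m_reduced <- m[, sparsity < 0.85, drop = FALSE]  # same cut-off idea as 0.85 above
colnames(m_reduced)  # "rare" is dropped, "common" and "medium" survive
```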
Training a classifier
Let’s train a classification model on the dataset using SVM
set.seed(1); trainset = sample(1:nrow(temp), trunc(0.7*nrow(temp)))
svm.cl = svm(as.factor(rand_op[trainset, 3])~., data = temp[trainset, ], kernel= "radial", scale=FALSE)

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(svm.cl,temp[trainset, ])
train_conf=table(trainpredicted,rand_op[trainset, 3])
train_conf
##
## trainpredicted negative positive
##       negative      664       43
##       positive       33      660
train_error = mean(trainpredicted != rand_op[trainset, 3])*100
train_error
## [1] 5.428571
train_accuracy = mean(trainpredicted == rand_op[trainset, 3])*100
train_accuracy
## [1] 94.57143
testpredicted = predict(svm.cl,temp[-trainset, ])
test_conf=table(testpredicted,rand_op[-trainset, 3])
test_conf
##
## testpredicted negative positive
##      negative      234       75
##      positive       69      222
test_error = mean(testpredicted != rand_op[-trainset, 3])*100
test_error
## [1] 24
test_accuracy = mean(testpredicted == rand_op[-trainset, 3])*100
test_accuracy
## [1] 76
We can say that the model has done well on the train set but not so well on the test data. Still, this gives us an idea of how to use text data: preprocess it and then use it to train a model.
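The accuracies reported above can also be read straight off the confusion matrices: correct predictions sit on the diagonal. Using the test-set counts from the table above:

```r
# Test-set confusion matrix from above (rows: predicted, columns: actual)
conf <- matrix(c(234, 69, 75, 222), nrow = 2,
               dimnames = list(predicted = c("negative", "positive"),
                               actual    = c("negative", "positive")))

accuracy <- sum(diag(conf)) * 100 / sum(conf)
accuracy  # 76, matching test_accuracy above
```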
Exercise: Try different preprocessing options: use weightTfIdf (tf-idf) instead of weightTf (term frequency), and change the sparsity value from 0.85 to another number such as 0.9. Then train the model using a different algorithm, such as logistic regression or a neural net, and see if you can improve its performance.
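As a starting point for the exercise: the weighting change is just weighting = weightTfIdf in the DocumentTermMatrix control list, and for logistic regression the already-loaded nnet package provides multinom(). A sketch on made-up data (temp_toy and labels are stand-ins for the real temp matrix and the opinions column of rand_op):

```r
library(nnet)

set.seed(1)
# Made-up stand-in for a small document-term matrix with class labels
temp_toy <- matrix(rpois(40, lambda = 1), nrow = 20, ncol = 2,
                   dimnames = list(NULL, c("good", "bad")))
labels <- factor(rep(c("positive", "negative"), each = 10))

# Multinomial logistic regression (with 2 classes, ordinary logistic regression)
lr.cl <- multinom(labels ~ ., data = as.data.frame(temp_toy), trace = FALSE)

predicted <- predict(lr.cl, as.data.frame(temp_toy))
mean(predicted == labels) * 100  # training accuracy on the toy data
```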
Deceptive Opinion Spam Data
This corpus consists of truthful and deceptive hotel reviews of 20 Chicago hotels. The data is described in two papers, according to the sentiment of the review: positive-sentiment reviews in [1] and negative-sentiment reviews in [2].
The source of the dataset is: https://myleott.com/op-spam.html
This corpus contains:
- 400 truthful positive reviews from TripAdvisor (described in [1])
- 400 deceptive positive reviews from Mechanical Turk (described in [1])
- 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp (described in [2])
- 400 deceptive negative reviews from Mechanical Turk (described in [2])
Download the zip file from: https://myleott.com/op_spam_v1.4.zip
When you download and unzip the file, make sure to have only one main folder. Sometimes, unzipping creates two main folders (one inside the other)
Main folder: op_spam_v1.4 (It has two subfolders: negative_polarity and positive_polarity)
negative_polarity has two subfolders, deceptive_from_MTurk and truthful_from_Web (each of which has 5 subfolders containing text files)
positive_polarity has two subfolders, deceptive_from_MTurk and truthful_from_TripAdvisor (each of which has 5 subfolders, each with 80 text files)
Let’s create a csv file containing all the reviews with their respective categories, viz. decep_pos, true_pos, decep_neg and true_neg
# Reading all deceptive positive
decep_pos = readtext("op_spam_v1.4/positive_polarity/deceptive_from_MTurk/*")
# Adding category
opinions = c(rep("decep_pos", 400))
decep_pos = cbind(decep_pos, opinions)
# Reading all truthful positive
true_pos = readtext("op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/*")
# Adding category
opinions = c(rep("true_pos", 400))
true_pos = cbind(true_pos, opinions)
# Reading all deceptive negative
decep_neg = readtext("op_spam_v1.4/negative_polarity/deceptive_from_MTurk/*")
# Adding category
opinions = c(rep("decep_neg", 400))
decep_neg = cbind(decep_neg, opinions)
# Reading all truthful negative
true_neg = readtext("op_spam_v1.4/negative_polarity/truthful_from_Web/*")
# Adding category
opinions = c(rep("true_neg", 400))
true_neg = cbind(true_neg, opinions)
decep_data_op = rbind(decep_pos, true_pos, decep_neg, true_neg)
write.csv(decep_data_op, file="decep_op.csv", row.names=F)

Preprocessing Deceptive Opinion Spam Data to construct Document Term Matrix
Loading data
Let’s load the Deceptive Opinion Spam Data from the .csv file we just created
opinions_decep = read.csv("decep_op.csv",stringsAsFactors=FALSE) #reading .csv file we created in wd
nrow(opinions_decep)
## [1] 1600
names(opinions_decep)
## [1] "doc_id"   "text"     "opinions"
As we can see, there are 1600 records and three columns: doc_id is unique for each review, text contains the reviews by customers, and opinions takes one of the four classes decep_pos, true_pos, decep_neg and true_neg for each record.
Randomize the data
The records in our data are stacked by the four opinion classes in the order we stored them, i.e. first all decep_pos’s, then all true_pos’s, then all decep_neg’s and finally all true_neg’s.
In data science, unless we are dealing with a sequence or time series dataset, we must randomize our data for unbiased selection of records while model building.
m = nrow(opinions_decep) #number of records in the dataset
set.seed(1); rand = sample(m, replace=F) #randomizing row numbers, e.g. instead of 1,2,3 it could be 2,1,3
rand_op_decep = opinions_decep[rand, ] #Reading data in a random order
dim(rand_op_decep)
## [1] 1600    3

Preprocessing data
Let’s use tm’s DataframeSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
library(tm)
ds <- DataframeSource(as.data.frame(rand_op_decep[,1:2])) #For creating data frame source
myCorpus<- VCorpus(ds) #Converting data into a corpus
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 896
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1] 1600 7005
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.95)
dim(myTdm)
## [1] 1600  264
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 1600  264
We are left with 264 terms (down from 7005) in the final dataset after removing sparse terms.
Training a classifier with SVM
Let’s train a classification model on the dataset using SVM
set.seed(1); trainset = sample(1:nrow(temp), trunc(0.7*nrow(temp)))
svm.cl = svm(as.factor(rand_op_decep[trainset, 3])~., data = temp[trainset, ], kernel= "radial", scale=FALSE)

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(svm.cl,temp[trainset, ])
train_conf=table(trainpredicted,rand_op_decep[trainset, 3])
train_conf
##
## trainpredicted decep_neg decep_pos true_neg true_pos
##      decep_neg       226         4       12        0
##      decep_pos        12       224        7       26
##       true_neg        33         4      267       16
##       true_pos         5        37       19      228
train_error = mean(trainpredicted != rand_op_decep[trainset, 3])*100
train_error
## [1] 15.625
train_accuracy = mean(trainpredicted == rand_op_decep[trainset, 3])*100
train_accuracy
## [1] 84.375
testpredicted = predict(svm.cl,temp[-trainset, ])
test_conf=table(testpredicted,rand_op_decep[-trainset, 3])
test_conf
##
## testpredicted decep_neg decep_pos true_neg true_pos
##     decep_neg        80         6       12        3
##     decep_pos        15        95        3       14
##      true_neg        23         5       62       12
##      true_pos         6        25       18      101
test_error = mean(testpredicted != rand_op_decep[-trainset, 3])*100
test_error
## [1] 29.58333
test_accuracy = mean(testpredicted == rand_op_decep[-trainset, 3])*100
test_accuracy
## [1] 70.41667

Training a classifier with neural net
Let’s train a classification model on the dataset using a neural net.
nn.model = nnet(as.factor(rand_op_decep[trainset, 3])~., data = temp[trainset, ], size=3, decay = 0.01, maxit=400)
## # weights:  811
## initial value 1681.932615
## iter 10 value 1008.710953
## iter 20 value 705.898543
## iter 30 value 507.502092
## iter 40 value 414.196590
## iter 50 value 365.470515
## iter 60 value 348.523933
## iter 70 value 311.344583
## iter 80 value 278.895431
## iter 90 value 250.163439
## iter 100 value 231.671415
## iter 110 value 224.941924
## iter 120 value 214.881750
## iter 130 value 210.911801
## iter 140 value 197.700727
## iter 150 value 181.618320
## iter 160 value 150.275406
## iter 170 value 141.031255
## iter 180 value 132.874072
## iter 190 value 121.809840
## iter 200 value 111.417397
## iter 210 value 108.770413
## iter 220 value 105.942940
## iter 230 value 99.456531
## iter 240 value 97.630396
## iter 250 value 92.080532
## iter 260 value 91.302170
## iter 270 value 86.427684
## iter 280 value 84.589490
## iter 290 value 83.284061
## iter 300 value 82.659005
## iter 310 value 80.932303
## iter 320 value 78.453799
## iter 330 value 77.409931
## iter 340 value 73.464043
## iter 350 value 72.248530
## iter 360 value 68.589227
## iter 370 value 67.704409
## iter 380 value 67.354195
## iter 390 value 67.104181
## iter 400 value 63.791961
## final value 63.791961
## stopped after 400 iterations

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(nn.model,temp[trainset, ], type="class")
train_conf=table(trainpredicted,rand_op_decep[trainset, 3])
train_conf
##
## trainpredicted decep_neg decep_pos true_neg true_pos
##      decep_neg       275         1        7        0
##      decep_pos         1       268        0        1
##       true_neg         0         0      298        1
##       true_pos         0         0        0      268
train_error = mean(trainpredicted != rand_op_decep[trainset, 3])*100
train_error
## [1] 0.9821429
train_accuracy = mean(trainpredicted == rand_op_decep[trainset, 3])*100
train_accuracy
## [1] 99.01786
testpredicted = predict(nn.model,temp[-trainset, ], type="class")
test_conf=table(testpredicted,rand_op_decep[-trainset, 3])
test_conf
##
## testpredicted decep_neg decep_pos true_neg true_pos
##     decep_neg        91         6       23        3
##     decep_pos        13        96        3       26
##      true_neg        14         4       52       17
##      true_pos         6        25       17       84
test_error = mean(testpredicted != rand_op_decep[-trainset, 3])*100
test_error
## [1] 32.70833
test_accuracy = mean(testpredicted == rand_op_decep[-trainset, 3])*100
test_accuracy
## [1] 67.29167
As we can see, although the SVM (84% train accuracy) did not fit the training data as well as the neural net (99%), the test accuracy of the SVM is 70.4%, better than the neural net’s 67.3%. The neural net has overfit the training data: with a single hidden layer of only 3 units and little regularization, it memorized the training set at the expense of generalization. Tuning the size and decay parameters, or using a larger, better-regularized network, might improve the test results.
For unstructured data, neural nets and SVMs generally perform well compared to many other algorithms, and neural nets are often the preferred choice.
Exercise: Try different preprocessing options: use weightTfIdf (tf-idf) instead of weightTf (term frequency), and change the sparsity value from 0.95 to another number such as 0.9 or 0.99. Then train the model using a different algorithm, such as logistic regression, and see if you can improve its performance.