Lab_NLP_1_RM
Asmi Ariv
2022-10-12
Text Analytics in R
In this lab we will go through various steps of text analytics that involve natural language processing (NLP).
Kindly go through the video lectures to understand what we are doing here.
For all the text preprocessing we will use the “tm” package. We will transform the text into a bag of words (the splitting into words is known as tokenization) with tf or tf-idf weights in a matrix, which can then be used for classification (if we have a response variable, such as opinions) or clustering (if we do not have a response variable in the data).
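As a small illustration of tf-idf before we touch real data (a toy, hand-made corpus; later the tm package computes these weights for us via weightTf/weightTfIdf):

```r
# Toy example: three tiny "documents" (made-up word lists)
docs <- list(c("the", "cat", "sat"),
             c("the", "dog", "sat"),
             c("the", "cats", "and", "dogs"))
vocab <- sort(unique(unlist(docs)))

# Term frequency (tf): counts of each vocabulary term per document
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Inverse document frequency (idf): log(total docs / docs containing the term)
df  <- colSums(tf > 0)
idf <- log(length(docs) / df)

# tf-idf: terms frequent everywhere get weight 0, distinctive terms score high
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)  # the column for "the" is all zeros: it appears in every document
```

Terms that occur in every document (like “the” here) carry no discriminating information, which is exactly why tf-idf down-weights them.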
Load required packages
library(tm)
library(readtext)
library(cluster)
library(e1071)
library(nnet)
library(rtweet)

Tweet Data for clustering
In order to collect data from Twitter, we need to create a Twitter app using its developer site: https://developer.twitter.com/en/portal/petition/use-case
Read the instructions on the following page about creating a developer account and Twitter API keys/access tokens:
https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html
Once we have created a Twitter app and generated the API key, API secret key, access token and access token secret, we can use the create_token() function as shown below to authenticate our Twitter API account in the R session.
create_token(app = app_name,
             consumer_key = consumer_key,
             consumer_secret = consumer_secret,
             access_token = access_token,
             access_secret = access_secret)

Download tweets
Once we have authenticated our Twitter API account, R is connected to Twitter and we can download the data we want.
Let’s download some data on Queen Elizabeth II.
We will use the search_tweets() function to do that. Type ?search_tweets in R to learn more about the function.
tweets_data <- search_tweets('Queen Elizabeth II OR Queen Elizabeth', include_rts = FALSE, lang = 'en', n = 18000)

Let’s save the data for our future use
save(tweets_data, file="tweets_data.Rdata")

Load tweets
Let’s load the tweets we just saved in our working directory.
load("tweets_data.Rdata")
dim(tweets_data)
## [1] 17977    43
So, our search has returned 17977 records (or tweets) with 43 attributes (or variables).
Let’s look at the variables in the data
names(tweets_data)
## [1] "created_at" "id"
## [3] "id_str" "full_text"
## [5] "truncated" "display_text_range"
## [7] "entities" "metadata"
## [9] "source" "in_reply_to_status_id"
## [11] "in_reply_to_status_id_str" "in_reply_to_user_id"
## [13] "in_reply_to_user_id_str" "in_reply_to_screen_name"
## [15] "geo" "coordinates"
## [17] "place" "contributors"
## [19] "is_quote_status" "retweet_count"
## [21] "favorite_count" "favorited"
## [23] "retweeted" "possibly_sensitive"
## [25] "lang" "retweeted_status"
## [27] "quoted_status_id" "quoted_status_id_str"
## [29] "quoted_status" "withheld_scope"
## [31] "withheld_in_countries" "text"
## [33] "favorited_by" "scopes"
## [35] "display_text_width" "quoted_status_permalink"
## [37] "quote_count" "timestamp_ms"
## [39] "reply_count" "filter_level"
## [41] "query" "withheld_copyright"
## [43] "possibly_sensitive_appealable"
Our main focus is “text”, which is the variable for all the tweets.
Let’s look at some of the tweets
tweets_data$text[10:20]## [1] "RT @ForcesNews: .@HMSQNLZ vs. @Warship_78 <U+2693>\n\nHow do these two giants of the sea compare? We have had a look at their weapons, aircraft, siz…"
## [2] "We will be closed on Monday for the state funeral of Queen Elizabeth II, so we can give our staff time to pay their respects.<U+0001F917> #SceneConcept \nOriginal: architecttiles https://t.co/gbXptIGgx2"
## [3] "RT @RoyalFamily: Thank you to the people of Aberdeenshire <U+0001F3F4><U+000E0067><U+000E0062><U+000E0073><U+000E0063><U+000E0074><U+000E007F>\n\nThe King and The Queen Consort have met the team, including fire and amb…"
## [4] "@redfishstream's account has been withheld in India in response to a legal demand. Learn more."
## [5] "The life of #QueenElizabethII is celebrated in a new #comicbook \n#Sharjah24\nhttps://t.co/pnRiluLGsc"
## [6] "I wanna wish the badarse queen herself the most happiest of birthdays to @AboutElizabethM !!!<U+0001F973> I Hope you have a fantastic time on your special day, Elizabeth. Never stop being such a wonderful person in life. Can't wait to meet you and Jason Liebrecht next month at SupaNova! <U+0001F601> https://t.co/feAEthPL1O"
## [7] "RT @QueenLilibet_: Queen Elizabeth II stands with Archbishop of Canterbury Rowan Williams at a Diamond Jubilee multi-faith reception at Lam…"
## [8] "Just realized that the Queen Elizabeth 2 and Marilyn Monroe were both born in the same year - 1926. \n\nOnly a little over a month apart. \n\nComes as a bit of a shock. And makes you realize how incredibly young Marilyn was when she died."
## [9] "@DramaWarship A strange thought for me, but my great uncle was on the Queen Elizabeth at that time, as a Royal Marine gunner."
## [10] "RT @nomoremonarchs: Widow's mite? Elizabeth Bowes-Lyon Windsor (aka \"the queen mother\") was paid by parliament for 50 years, from the day h…"
## [11] "Princess Anne Hosts First Investiture Ceremony Since Queen Elizabeth's Death at Buckingham Palace https://t.co/VmeMOzucDd"

Preprocessing the tweets to construct Document Term Matrix
Let’s use tm’s VectorSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
myCorpus <- VCorpus(VectorSource(tweets_data$text)) #For creating corpus source
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 289
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1] 17977 16181
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.99)
dim(myTdm)
## [1] 17977   159
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 17977   159
We are left with 159 terms (down from 16181) in the final dataset after removing sparse terms.
Removing empty rows
val = apply(temp,1,sum) #Getting the values of each record
idx = val > 0 #Indices of non-empty rows
temp = temp[idx,] #Dataset without empty records
rowcount = nrow(temp)
rowcount
## [1] 17744
There were only 233 (17977 - 17744) rows with no entries at all, given that we removed some of the sparse terms.
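Why the empty rows must go: the cosine distance we compute next divides by each document’s norm, and an all-zero row makes that denominator zero. A quick base-R illustration (made-up vectors and a hypothetical helper cosine_dist() mirroring the formula used below):

```r
# Hypothetical helper mirroring the cosine distance formula used below
cosine_dist <- function(x, y) {
  1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

a <- c(1, 2, 0)   # a document with some surviving terms
b <- c(0, 0, 0)   # an "empty" document: all its terms were removed as sparse

cosine_dist(a, a)  # 0: a document has zero distance to itself
cosine_dist(a, b)  # NaN: a zero norm makes the distance undefined
```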
Generating distance matrix
One way to use the data for clustering is to generate a distance matrix. Let’s do it for a subset of the data by selecting only 1000 records; otherwise the system may slow down a bit.
distance = matrix(nrow=1000, ncol=1000)
for(i in 1:1000){
  for(j in 1:1000){
    # cosine dissimilarity between documents i and j
    distance[i,j] = 1 - sum(temp[i,]*temp[j,]) / (sqrt(sum(temp[i,]*temp[i,])) * sqrt(sum(temp[j,]*temp[j,])))
  }
}
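The nested loops above are easy to read but slow in R. The same cosine dissimilarity matrix can be obtained with one round of matrix algebra; a sketch on a toy matrix (substitute temp[1:1000, ] for x to reproduce the loop’s result):

```r
# Toy document-term counts: 3 documents x 4 terms (made-up values)
x <- matrix(c(1, 0, 2, 0,
              0, 1, 1, 0,
              1, 1, 0, 2), nrow = 3, byrow = TRUE)

# Row norms, then all pairwise cosine similarities in one matrix product
norms    <- sqrt(rowSums(x^2))
distance <- 1 - (x %*% t(x)) / outer(norms, norms)

round(distance, 4)  # symmetric, with zeros on the diagonal
```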
dissmatrix = as.dist(distance)

Let’s save it in our working directory for future use
save(dissmatrix, file="dissmatrix.twt.Rdata")

Let’s load the distance matrix that we just saved
load("dissmatrix.twt.Rdata")
dim(dissmatrix)
## [1] 1000 1000
So we have a 1000-by-1000 distance matrix.
Let’s look at some of the values
as.matrix(dissmatrix)[1:3,1:10]
##           1         2         3         4         5         6         7
## 1 0.0000000 0.5703311 0.6127017 0.4522774 0.4836022 0.7418011 0.5527864
## 2 0.5703311 0.0000000 0.5839749 0.4116516 0.5839749 0.7226499 0.5196155
## 3 0.6127017 0.5839749 0.0000000 0.2928932 0.5000000 0.7500000 0.4226497
## 8 9 10
## 1 0.4522774 0.3554966 1
## 2 0.4116516 0.4615385 1
## 3 0.2928932 0.4452998 1

Clustering
Let’s run the clustering algorithm using pam() from the cluster package to divide the data into two clusters.
pam.result <- pam(dissmatrix,k=2,diss=TRUE)
pam.result$clustering
## [1] 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [38] 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1 1 1 2
## [75] 1 1 2 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [112] 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
## [149] 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1
## [186] 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 2
## [260] 2 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1
## [334] 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 2 1 1 1
## [371] 1 1 1 2 1 1 2 1 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1
## [408] 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 1 1 2 1 1 2 2 1
## [445] 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
## [482] 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 1
## [519] 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2
## [556] 1 2 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 1 2
## [593] 1 1 1 1 1 2 1 1 2 2 1 1 2 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1
## [630] 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [667] 1 1 2 2 1 1 2 1 2 1 1 2 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 2 2 1 1 2
## [704] 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 2 2 1 2 1 1 2 1
## [741] 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 2 2 2 1 2 2 2 1 1 1 1 1 1
## [778] 1 1 2 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1
## [815] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 2 1 2 2 1
## [852] 1 1 2 2 2 1 1 1 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1
## [889] 2 2 2 1 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 1 1 1 2 2 1 2 1 2
## [926] 2 2 1 2 1 1 2 2 1 1 2 2 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 2
## [963] 2 1 2 2 1 2 2 1 1 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [1000] 2

Top words in each cluster
Divide the data into cluster 1 and cluster 2
data = temp[1:1000,]
clus1 = data[pam.result$clustering==1,]
clus2 = data[pam.result$clustering==2,]
dim(clus1)
## [1] 756 159
dim(clus2)
## [1] 244 159
So, we have 756 records in cluster 1 and 244 records in cluster 2.
Let’s get the indices of terms in decreasing order of their total term frequency values in each cluster
clus1_terms_idx = order(colSums(clus1), decreasing=T)
clus2_terms_idx = order(colSums(clus2), decreasing=T)

Get the top 20 words in each cluster
cat("\n top 20 terms used in cluster 1 \n")
##
##  top 20 terms used in cluster 1
colnames(temp)[clus1_terms_idx[1:20]]
## [1] "queen"     "elizabeth" "thank"     "king"      "amp"       "majesti"
## [7] "pictur" "honour" "show" "carri" "princ" "royal"
## [13] "anujdhar" "princess" "year" "will" "fo…" "death"
## [19] "charl"     "attend"
cat("\n \n top 20 terms used in cluster 2 \n")
##
##
##  top 20 terms used in cluster 2
colnames(temp)[clus2_terms_idx[1:20]]
## [1] "new"     "royal"   "york"    "british"
## [5] "navi" "hms" "queen" "elizabeth"
## [9] "around" "jet" "harbor" "suit"
## [13] "fli" "httpstcogufzqpzi" "use" "valaafshar"
## [17] "royalfamili" "volunt"      "duchess"     "visit"

Exercise: Try downloading tweets on some of your favorite topics and perform all the tasks as above.
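For the exercise, one way to judge a clustering of your own tweets is the average silhouette width, available from the already-loaded cluster package. A toy sketch with two clearly separated point clouds (made-up data, not tweets):

```r
library(cluster)

set.seed(1)
# Two well-separated groups of 10 points each (made-up 2-D data)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
d <- dist(pts)

pr <- pam(d, k = 2, diss = TRUE)
# Average silhouette width: near 1 means tight, well-separated clusters
mean(silhouette(pr)[, "sil_width"])
```

Trying several values of k and keeping the one with the highest average silhouette width is a common heuristic for choosing the number of clusters.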
Movie Review Data for classification
For this section, we will use a data set from the website: https://www.cs.cornell.edu/people/pabo/movie-review-data/
The source page is a distribution site for movie-review data for use in sentiment-analysis experiments.
The dataset was introduced in the papers by Bo Pang and Lillian Lee
We will use the version: polarity dataset v2.0
The data set has 1000 negative reviews and 1000 positive reviews.
However, the data is available in a .tar.gz format on the website. This zipped file contains a folder “txt_sentoken”, which has two subfolders, namely, “neg” and “pos”, each containing 1000 text files of negative and positive reviews respectively.
We will download and convert these text files into a csv file with reviews in one column and opinions (positive or negative) in another.
We will use the “readtext” package, whose readtext() function reads all the text files from the subfolders of a folder and stores the text in a data frame. The data frame has two columns, “doc_id” and “text”; “doc_id” is required by the tm package when we create a data frame source for text processing.
We will then add a new column of opinions (negative and positive) to the data frame and save the data as a .csv file in our working directory for our future use.
The code follows; please go through it.
sr <- "https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz" #Web Link for data set
download.file(sr,destfile="review_polarity.tar.gz") #Download it in your working directory
untar("review_polarity.tar.gz") #Extract the folders in your wd, txt_sentoken with neg and pos subfolders
data = readtext("txt_sentoken/*") #It will read all the text files in folders neg and pos in a data frame
#Since text files are first read from neg folder and then pos folder, we can add another column with opinions
opinions = c(rep("negative", 1000), rep("positive", 1000))
data_op = cbind(data, opinions)
write.csv(data_op, file="movie_rev.csv", row.names=F)

Preprocessing the movie reviews to construct Document Term Matrix
Loading data
Let’s load the movie data from the .csv file we just created
opinions = read.csv("movie_rev.csv",stringsAsFactors=FALSE) #reading .csv file we created in wd
nrow(opinions)
## [1] 2000
names(opinions)
## [1] "doc_id"   "text"     "opinions"
As we can see, there are 2000 records and three columns: doc_id is unique for each review, text contains the reviews by the audience, and opinions is either positive or negative for each record.
Randomize the data
The records in our data are stacked by the two opinion classes in the order we stored them, i.e. first all negatives, followed by all positives.
In data science, unless we are dealing with a sequence or time series dataset, we must randomize our data for unbiased selection of records while model building.
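The randomization below boils down to drawing a random permutation of row indices and reindexing the data with it (a toy sketch):

```r
set.seed(1)
rand <- sample(5, replace = FALSE)  # a random permutation of 1..5
rand

letters[1:5][rand]  # the same shuffle applied to a toy "dataset"
```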
m = nrow(opinions) #number of records in the dataset
set.seed(1); rand = sample(m, replace=F) #randomizing row numbers, e.g. instead of 1,2,3 it could be 2,1,3
rand_op = opinions[rand, ] #Reading data in a random order
dim(rand_op)
## [1] 2000    3

Preprocessing data
Let’s use tm’s DataframeSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
ds <- DataframeSource(as.data.frame(rand_op[,1:2])) #For creating data frame source
myCorpus<- VCorpus(ds) #Converting data into a corpus
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 2316
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1]  2000 30570
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.85)
dim(myTdm)
## [1] 2000  312
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 2000  312
We are left with 312 terms (down from 30570) in the final dataset after removing sparse terms.
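What removeSparseTerms(myTdm, 0.85) actually does: a term’s sparsity is the fraction of documents it does not occur in, and terms sparser than the threshold are dropped. A base-R sketch of the same rule on a toy count matrix (made-up counts and term names):

```r
# Toy counts: 10 documents x 3 terms (made-up)
m <- matrix(0, nrow = 10, ncol = 3,
            dimnames = list(NULL, c("common", "medium", "rare")))
m[1:9, "common"] <- 1   # missing from 1 of 10 docs -> sparsity 0.10
m[1:3, "medium"] <- 1   # missing from 7 of 10 docs -> sparsity 0.70
m[1,   "rare"]   <- 1   # missing from 9 of 10 docs -> sparsity 0.90

sparsity  <- colMeans(m == 0)
m_reduced <- m[, sparsity < 0.85, drop = FALSE]  # same cut-off idea as 0.85 above
colnames(m_reduced)  # "rare" is dropped, "common" and "medium" survive
```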
Training a classifier
Let’s train a classification model on the dataset using SVM
set.seed(1); trainset = sample(1:nrow(temp), trunc(0.7*nrow(temp)))
svm.cl = svm(as.factor(rand_op[trainset, 3])~., data = temp[trainset, ], kernel= "radial", scale=FALSE)

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(svm.cl,temp[trainset, ])
train_conf=table(trainpredicted,rand_op[trainset, 3])
train_conf
##
## trainpredicted negative positive
##       negative      664       43
##       positive       33      660
train_error = mean(trainpredicted != rand_op[trainset, 3])*100
train_error
## [1] 5.428571
train_accuracy = mean(trainpredicted == rand_op[trainset, 3])*100
train_accuracy
## [1] 94.57143
testpredicted = predict(svm.cl,temp[-trainset, ])
test_conf=table(testpredicted,rand_op[-trainset, 3])
test_conf
##
## testpredicted negative positive
##      negative      234       75
##      positive       69      222
test_error = mean(testpredicted != rand_op[-trainset, 3])*100
test_error
## [1] 24
test_accuracy = mean(testpredicted == rand_op[-trainset, 3])*100
test_accuracy
## [1] 76
We can say that the model has done well on the train set but not so well on the test data. Still, this gives us an idea of how to use text data: preprocess it and then use it to train a model.
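The accuracies reported above can also be read straight off the confusion matrices: correct predictions sit on the diagonal. Using the test-set counts from the table above:

```r
# Test-set confusion matrix from above (rows: predicted, columns: actual)
conf <- matrix(c(234, 69, 75, 222), nrow = 2,
               dimnames = list(predicted = c("negative", "positive"),
                               actual    = c("negative", "positive")))

accuracy <- sum(diag(conf)) * 100 / sum(conf)
accuracy  # 76, matching test_accuracy above
```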
Exercise: Try different preprocessing options: use weightTfIdf (tf-idf) instead of weightTf (term frequency), and change the sparsity value from 0.85 to another number such as 0.9. Then train the model using a different algorithm, such as logistic regression or a neural net, and see if you can improve its performance.
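As a starting point for the exercise: the weighting change is just weighting = weightTfIdf in the DocumentTermMatrix control list, and for logistic regression the already-loaded nnet package provides multinom(). A sketch on made-up data (temp_toy and labels are stand-ins for the real temp matrix and the opinions column of rand_op):

```r
library(nnet)

set.seed(1)
# Made-up stand-in for a small document-term matrix with class labels
temp_toy <- matrix(rpois(40, lambda = 1), nrow = 20, ncol = 2,
                   dimnames = list(NULL, c("good", "bad")))
labels <- factor(rep(c("positive", "negative"), each = 10))

# Multinomial logistic regression (with 2 classes, ordinary logistic regression)
lr.cl <- multinom(labels ~ ., data = as.data.frame(temp_toy), trace = FALSE)

predicted <- predict(lr.cl, as.data.frame(temp_toy))
mean(predicted == labels) * 100  # training accuracy on the toy data
```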
Deceptive Opinion Spam Data
This corpus consists of truthful and deceptive hotel reviews of 20 Chicago hotels. The data is described in two papers, according to the sentiment of the review: positive-sentiment reviews in [1] and negative-sentiment reviews in [2].
The source of the dataset is: https://myleott.com/op-spam.html
This corpus contains:
- 400 truthful positive reviews from TripAdvisor (described in [1])
- 400 deceptive positive reviews from Mechanical Turk (described in [1])
- 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp (described in [2])
- 400 deceptive negative reviews from Mechanical Turk (described in [2])
Download the zip file from: https://myleott.com/op_spam_v1.4.zip
When you download and unzip the file, make sure to have only one main folder. Sometimes, unzipping creates two main folders (one inside the other)
Main folder: op_spam_v1.4 (It has two subfolders: negative_polarity and positive_polarity)
negative_polarity has two subfolders, deceptive_from_MTurk and truthful_from_Web (each of which has 5 subfolders containing text files)
positive_polarity has two subfolders, deceptive_from_MTurk and truthful_from_TripAdvisor (each of which has 5 subfolders, each with 80 text files)
Let’s create a csv file containing all the reviews with their respective categories, viz. decep_pos, true_pos, decep_neg and true_neg
# Reading all deceptive positive
decep_pos = readtext("op_spam_v1.4/positive_polarity/deceptive_from_MTurk/*")
# Adding category
opinions = c(rep("decep_pos", 400))
decep_pos = cbind(decep_pos, opinions)
# Reading all truthful positive
true_pos = readtext("op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/*")
# Adding category
opinions = c(rep("true_pos", 400))
true_pos = cbind(true_pos, opinions)
# Reading all deceptive negative
decep_neg = readtext("op_spam_v1.4/negative_polarity/deceptive_from_MTurk/*")
# Adding category
opinions = c(rep("decep_neg", 400))
decep_neg = cbind(decep_neg, opinions)
# Reading all truthful negative
true_neg = readtext("op_spam_v1.4/negative_polarity/truthful_from_Web/*")
# Adding category
opinions = c(rep("true_neg", 400))
true_neg = cbind(true_neg, opinions)
decep_data_op = rbind(decep_pos, true_pos, decep_neg, true_neg)
write.csv(decep_data_op, file="decep_op.csv", row.names=F)

Preprocessing Deceptive Opinion Spam Data to construct Document Term Matrix
Loading data
Let’s load the Deceptive Opinion Spam Data from the .csv file we just created
opinions_decep = read.csv("decep_op.csv",stringsAsFactors=FALSE) #reading .csv file we created in wd
nrow(opinions_decep)
## [1] 1600
names(opinions_decep)
## [1] "doc_id"   "text"     "opinions"
As we can see, there are 1600 records and three columns: doc_id is unique for each review, text contains the reviews by customers, and opinions takes one of the four classes decep_pos, true_pos, decep_neg and true_neg for each record.
Randomize the data
The records in our data are stacked by the four opinion classes in the order we stored them, i.e. first all decep_pos’s, then all true_pos’s, then all decep_neg’s and finally all true_neg’s.
In data science, unless we are dealing with a sequence or time series dataset, we must randomize our data for unbiased selection of records while model building.
m = nrow(opinions_decep) #number of records in the dataset
set.seed(1); rand = sample(m, replace=F) #randomizing row numbers, e.g. instead of 1,2,3 it could be 2,1,3
rand_op_decep = opinions_decep[rand, ] #Reading data in a random order
dim(rand_op_decep)
## [1] 1600    3

Preprocessing data
Let’s use tm’s DataframeSource() and VCorpus() to convert the dataset into a corpus, and tm_map() to process the text.
library(tm)
ds <- DataframeSource(as.data.frame(rand_op_decep[,1:2])) #For creating data frame source
myCorpus<- VCorpus(ds) #Converting data into a corpus
inspect(myCorpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 896
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
#Stemming
myCorpus <- tm_map(myCorpus,stemDocument)
# remove stop words
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
# remove extra whitespace
myCorpus <- tm_map(myCorpus, stripWhitespace)
# convert to plain text document
myCorpus<- tm_map(myCorpus, PlainTextDocument)
# Convert into a Document-Term Matrix using tf weighting; we can also try tf-idf with weightTfIdf
myTdm<- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf, stopwords = TRUE, minWordLength=2))
dim(myTdm)
## [1] 1600 7005
# Removing sparse terms
myTdm <- removeSparseTerms(myTdm, 0.95)
dim(myTdm)
## [1] 1600  264
# Converting to matrix
temp = as.matrix(myTdm)
dim(temp)
## [1] 1600  264
We are left with 264 terms (down from 7005) in the final dataset after removing sparse terms.
Training a classifier with SVM
Let’s train a classification model on the dataset using SVM
set.seed(1); trainset = sample(1:nrow(temp), trunc(0.7*nrow(temp)))
svm.cl = svm(as.factor(rand_op_decep[trainset, 3])~., data = temp[trainset, ], kernel= "radial", scale=FALSE)

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(svm.cl,temp[trainset, ])
train_conf=table(trainpredicted,rand_op_decep[trainset, 3])
train_conf
##
## trainpredicted decep_neg decep_pos true_neg true_pos
##      decep_neg       226         4       12        0
##      decep_pos        12       224        7       26
##       true_neg        33         4      267       16
##       true_pos         5        37       19      228
train_error = mean(trainpredicted != rand_op_decep[trainset, 3])*100
train_error
## [1] 15.625
train_accuracy = mean(trainpredicted == rand_op_decep[trainset, 3])*100
train_accuracy
## [1] 84.375
testpredicted = predict(svm.cl,temp[-trainset, ])
test_conf=table(testpredicted,rand_op_decep[-trainset, 3])
test_conf
##
## testpredicted decep_neg decep_pos true_neg true_pos
##     decep_neg        80         6       12        3
##     decep_pos        15        95        3       14
##      true_neg        23         5       62       12
##      true_pos         6        25       18      101
test_error = mean(testpredicted != rand_op_decep[-trainset, 3])*100
test_error
## [1] 29.58333
test_accuracy = mean(testpredicted == rand_op_decep[-trainset, 3])*100
test_accuracy
## [1] 70.41667

Training a classifier with neural net
Let’s train a classification model on the dataset using a neural net.
nn.model = nnet(as.factor(rand_op_decep[trainset, 3])~., data = temp[trainset, ], size=3, decay = 0.01, maxit=400)
## # weights:  811
## initial value 1681.932615
## iter 10 value 1008.710953
## iter 20 value 705.898543
## iter 30 value 507.502092
## iter 40 value 414.196590
## iter 50 value 365.470515
## iter 60 value 348.523933
## iter 70 value 311.344583
## iter 80 value 278.895431
## iter 90 value 250.163439
## iter 100 value 231.671415
## iter 110 value 224.941924
## iter 120 value 214.881750
## iter 130 value 210.911801
## iter 140 value 197.700727
## iter 150 value 181.618320
## iter 160 value 150.275406
## iter 170 value 141.031255
## iter 180 value 132.874072
## iter 190 value 121.809840
## iter 200 value 111.417397
## iter 210 value 108.770413
## iter 220 value 105.942940
## iter 230 value 99.456531
## iter 240 value 97.630396
## iter 250 value 92.080532
## iter 260 value 91.302170
## iter 270 value 86.427684
## iter 280 value 84.589490
## iter 290 value 83.284061
## iter 300 value 82.659005
## iter 310 value 80.932303
## iter 320 value 78.453799
## iter 330 value 77.409931
## iter 340 value 73.464043
## iter 350 value 72.248530
## iter 360 value 68.589227
## iter 370 value 67.704409
## iter 380 value 67.354195
## iter 390 value 67.104181
## iter 400 value 63.791961
## final value 63.791961
## stopped after 400 iterations

Accuracy of the model
Let’s check the accuracy of the model on both train and test data sets.
trainpredicted = predict(nn.model,temp[trainset, ], type="class")
train_conf=table(trainpredicted,rand_op_decep[trainset, 3])
train_conf
##
## trainpredicted decep_neg decep_pos true_neg true_pos
##      decep_neg       275         1        7        0
##      decep_pos         1       268        0        1
##       true_neg         0         0      298        1
##       true_pos         0         0        0      268
train_error = mean(trainpredicted != rand_op_decep[trainset, 3])*100
train_error
## [1] 0.9821429
train_accuracy = mean(trainpredicted == rand_op_decep[trainset, 3])*100
train_accuracy
## [1] 99.01786
testpredicted = predict(nn.model,temp[-trainset, ], type="class")
test_conf=table(testpredicted,rand_op_decep[-trainset, 3])
test_conf
##
## testpredicted decep_neg decep_pos true_neg true_pos
##     decep_neg        91         6       23        3
##     decep_pos        13        96        3       26
##      true_neg        14         4       52       17
##      true_pos         6        25       17       84
test_error = mean(testpredicted != rand_op_decep[-trainset, 3])*100
test_error
## [1] 32.70833
test_accuracy = mean(testpredicted == rand_op_decep[-trainset, 3])*100
test_accuracy
## [1] 67.29167
As we can see, although the SVM (84% train accuracy) did not fit the training data as well as the neural net (99%), the test accuracy of the SVM is 70.4%, better than the neural net’s 67.3%. The neural net has overfit the training data: with a single hidden layer of only 3 units and little regularization, it memorized the training set at the expense of generalization. Tuning the size and decay parameters, or using a larger, better-regularized network, might improve the test results.
For unstructured data, neural nets and SVMs generally perform well compared to many other algorithms, and neural nets are often the preferred choice.
Exercise: Try different preprocessing options: use weightTfIdf (tf-idf) instead of weightTf (term frequency), and change the sparsity value from 0.95 to another number such as 0.9 or 0.99. Then train the model using a different algorithm, such as logistic regression, and see if you can improve its performance.