@jdude48 wrote:
I am learning R so that I can write a machine learning script that classifies a dataset of movie reviews by sentiment: 1 for positive, 0 for negative. I believe my script is missing two pieces. First, I need the proper syntax for the test data partition for XGBoost. Second, I want to create a confusion matrix to evaluate performance. Could someone please tell me what I am missing in my code? Thanks.
library(text2vec)
library(xgboost)
library(pdp)
setwd("C:/rscripts/movies")
imdb <- read.csv("movies.csv", stringsAsFactors = FALSE)
# Create the document term matrix (bag of words) using the movie_review data
# frame provided in the text2vec package (sentiment analysis problem)
# data("movie_review")
# Tokenize the movie reviews and create a vocabulary of tokens including
# document counts
vocab <- create_vocabulary(itoken(imdb$text,
                                  preprocessor = tolower,
                                  tokenizer = word_tokenizer))
# Build a document-term matrix using the tokenized review text. This returns
# a dgCMatrix object
dtm_train <- create_dtm(itoken(imdb$text,
                               preprocessor = tolower,
                               tokenizer = word_tokenizer),
                        vocab_vectorizer(vocab))
# Turn the DTM into an XGB matrix using the sentiment labels that are to be
# learned
train_matrix <- xgb.DMatrix(dtm_train, label = imdb$class)
# xgboost model building
xgb_params = list(
  objective = "binary:logistic",
  eta = 0.01,
  max.depth = 5,
  eval_metric = "auc")
xgb_fit <- xgboost(data = train_matrix, params = xgb_params, nrounds = 10)
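set.seed(1)
# Pass the same parameters to cross-validation; the labels are already stored
# in train_matrix, so a separate label argument is not needed
cv <- xgb.cv(params = xgb_params, data = train_matrix, nfold = 5,
             nrounds = 60)
library(caret)
library(Matrix)
# Create our prediction probabilities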
pred <- predict(xgb_fit, dtm_train)
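# Set our cutoff threshold
pred.resp <- ifelse(pred >= 0.86, 1, 0)
# Create the confusion matrix (both arguments must be factors with the same levels)
confusionMatrix(factor(pred.resp, levels = c(0, 1)),
                factor(imdb$class, levels = c(0, 1)),
                positive = "1")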
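For the missing test partition, one possible shape is sketched below. It is only a sketch under assumptions: the small `reviews` data frame is made-up stand-in data (not the `movies.csv` file), the 80/20 split and the 0.5 cutoff are arbitrary choices, the parameter values are illustrative, and `xgb.train()` is swapped in for `xgboost()`. The point it illustrates is that the vocabulary is built from the training rows only and then reused to vectorize the test rows, so both matrices share the same columns, and that predictions are scored on rows the model never saw.

```r
library(text2vec)
library(xgboost)

# Hypothetical stand-in data; replace with the real imdb data frame.
set.seed(1)
pos_words <- c("great", "wonderful", "loved", "excellent", "fun")
neg_words <- c("awful", "boring", "hated", "terrible", "dull")
make_review <- function(words) paste(sample(words, 4, replace = TRUE), collapse = " ")
reviews <- data.frame(
  text  = c(replicate(50, make_review(pos_words)),
            replicate(50, make_review(neg_words))),
  class = rep(c(1, 0), each = 50),
  stringsAsFactors = FALSE
)

# Hold out 20% of the rows for testing (the ratio is an arbitrary choice).
train_idx <- sample(nrow(reviews), size = floor(0.8 * nrow(reviews)))
train <- reviews[train_idx, ]
test  <- reviews[-train_idx, ]

# Build the vocabulary from the training text only.
it_train <- itoken(train$text, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
it_test  <- itoken(test$text, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it_train))

# Vectorize both partitions with the same vectorizer so the columns match.
dtm_train <- create_dtm(it_train, vectorizer)
dtm_test  <- create_dtm(it_test, vectorizer)

dtrain <- xgb.DMatrix(dtm_train, label = train$class)
dtest  <- xgb.DMatrix(dtm_test, label = test$class)

# Illustrative parameter values, not tuned.
fit <- xgb.train(
  params = list(objective = "binary:logistic", eta = 0.3, max_depth = 2),
  data = dtrain,
  nrounds = 10
)

# Predict on the held-out rows and apply a cutoff (0.5 is an assumption).
pred_prob  <- predict(fit, dtest)
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

# Confusion matrix as a plain cross-tabulation of predicted vs. actual labels.
conf_mat <- table(
  predicted = factor(pred_class, levels = c(0, 1)),
  actual    = factor(test$class, levels = c(0, 1))
)
print(conf_mat)
```

The same two factors could also be passed to caret's `confusionMatrix()` with `positive = "1"` to get sensitivity, specificity, and related statistics alongside the counts.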
Thanks for any help I can get.