Channel: Data Science, Analytics and Big Data discussions - Latest topics

XGBoost script for classifying text


@jdude48 wrote:

I am learning R so that I can write a machine learning script that classifies a dataset of movie reviews by sentiment, with 1 for positive and 0 for negative. I believe my script is missing two pieces. First, I need the proper syntax for partitioning off test data for XGBoost. Second, I want to create a confusion matrix to evaluate performance. Could someone please tell me what I am missing in my code? Thanks.

library(text2vec)
library(xgboost)
library(pdp)
setwd("C:/rscripts/movies")

imdb = read.csv("movies.csv", stringsAsFactors = FALSE)

# Create the document term matrix (bag of words) using the movie_review data
# frame provided in the text2vec package (sentiment analysis problem)
#data("movie_review")

# Tokenize the movie reviews and create a vocabulary of tokens including
# document counts
vocab <- create_vocabulary(itoken(imdb$text,
                                  preprocessor = tolower,
                                  tokenizer = word_tokenizer))

# Build a document-term matrix using the tokenized review text. This returns
# a dgCMatrix object
dtm_train <- create_dtm(itoken(imdb$text,
                               preprocessor = tolower,
                               tokenizer = word_tokenizer),
                        vocab_vectorizer(vocab))

# Turn the DTM into an XGB matrix using the sentiment labels that are to be
# learned
train_matrix <- xgb.DMatrix(dtm_train, label = imdb$class)

# xgboost model building

xgb_params = list(
  objective = "binary:logistic",
  eta = 0.01,
  max.depth = 5,
  eval_metric = "auc")

xgb_fit <- xgboost(data = train_matrix, params = xgb_params, nrounds = 10)

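set.seed(1)
# Pass the same params so cross-validation uses the binary:logistic objective;
# the labels are already stored in train_matrix
cv <- xgb.cv(data = train_matrix, params = xgb_params, nfold = 5,
             nrounds = 60)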

library(caret)
library(Matrix)

# Create our prediction probabilities
pred <- predict(xgb_fit, dtm_train)

# Set our cutoff threshold
pred.resp <- ifelse(pred >= 0.86, 1, 0)

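# Create the confusion matrix (caret needs both arguments as factors with
# the same levels)
confusionMatrix(factor(pred.resp, levels = c(0, 1)),
                factor(imdb$class, levels = c(0, 1)),
                positive = "1")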
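
The two missing pieces asked about above (a test partition and a confusion matrix) can be sketched in base R alone. This is only a minimal illustration under stated assumptions: `labels` and `scores` are made-up stand-ins for `imdb$class` and the model's predicted probabilities, the 80/20 ratio and the 0.5 threshold are arbitrary choices, and the confusion matrix is built with `table()` rather than caret.

```r
# Hypothetical stand-in data: 0/1 sentiment labels and model scores.
# In the real script these would be imdb$class and the predict() output.
set.seed(1)
labels <- rep(c(0, 1), times = c(60, 40))
scores <- runif(length(labels))

# Stratified 80/20 holdout: sample 80% of the row indices within each class
# so that both partitions keep roughly the original class balance.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(labels), train_idx)

# Threshold the held-out scores (0.5 is an assumed cutoff) and cross-tabulate
# predictions against the true labels. Fixing the factor levels keeps the
# table 2 x 2 even if one class is never predicted.
threshold <- 0.5
pred_class <- ifelse(scores[test_idx] >= threshold, 1, 0)
conf_mat <- table(
  predicted = factor(pred_class, levels = c(0, 1)),
  actual    = factor(labels[test_idx], levels = c(0, 1))
)

# Accuracy is the share of held-out rows on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
```

In the posted script, the analogous step would presumably be to build the vocabulary and DTM from `imdb$text[train_idx]` only, create the test DTM from `imdb$text[test_idx]` with the same vectorizer, and call `predict()` on that test matrix. As posted, `pred` is computed on `dtm_train`, so the confusion matrix measures fit to the training data rather than performance on unseen reviews.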

Thanks for any help I can get.


