Channel: Data Science, Analytics and Big Data discussions - Latest topics
Viewing all articles
Browse latest Browse all 4448

Stopword Removal using NLTK in Python


After I used NLTK to clean my dataset, the words in the text came out truncated.
For instance, “countries” became “countr”, “another” became “anoth”, and “deliver” became “deliv”.
I’d like to remove stopwords such as “of, on, in, an, a, and…”, but keep the meaningful words whole.
Is there any way to remove the stopwords (of, on, in, an, a, and…) without truncating the remaining words?

Here’s the code I used:

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Split the text into tokens
    tokens = word_tokenize(text)
    # Drop tokens that appear in the English stopword list
    filtered = [x for x in tokens if x not in stop_words]
    # Stem each remaining token with the Porter stemmer
    stems = [ps.stem(x) for x in filtered]
    if len(stems) == 0:
        stems = ["none"]
    return stems

df["text"] = df["text"].map(preprocess)
df
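
For comparison, here is a minimal sketch of stopword removal with the stemming step left out. It is the stemmer, not the stopword filter, that shortens words, since stemming reduces each word to a stem that is often not a whole word. To keep the sketch runnable without downloading NLTK data, it makes two assumptions: a small hand-written stopword set stands in for `set(stopwords.words('english'))`, and a simple regex stands in for `word_tokenize`. The name `remove_stopwords` is hypothetical.

```python
import re

# Stand-in stopword list (assumption): with the NLTK data installed this
# could instead be set(stopwords.words('english')).
STOP_WORDS = {"of", "on", "in", "an", "a", "and", "the", "to"}

def remove_stopwords(text):
    """Drop stopwords but leave every other word unchanged (no stemming)."""
    # Simple regex tokenizer, a stand-in for nltk.word_tokenize.
    tokens = re.findall(r"[A-Za-z']+", text)
    # Compare in lowercase so that "The" and "the" are both removed.
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    # Same fallback as the code above when nothing is left.
    return kept if kept else ["none"]

print(remove_stopwords("Another of the countries in Europe"))
# ['Another', 'countries', 'Europe']
```

If the words also need to be normalized, a lemmatizer such as NLTK's `WordNetLemmatizer` may be a better fit than a stemmer, because it returns dictionary words rather than stems.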

1 post - 1 participant
