After I used NLTK to clean my dataset, the words in the text came out truncated.
For instance, “countries” became “countr”, “another” became “anoth”, and “deliver” became “deliv”.
I’d like to remove stopwords such as “of, on, in, an, a, and…” but keep the meaningful words complete.
Is there a way to keep the words intact while still removing stopwords (of, on, in, an, a, and…)?
Here’s the code I used:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = word_tokenize(text)
    # Drop stopwords such as "of", "on", "in"
    filtered = [x for x in tokens if x not in stop_words]
    # Stemming strips suffixes, e.g. "another" -> "anoth"
    stem = [ps.stem(x) for x in filtered]
    if len(stem) == 0:
        stem = ["none"]
    return stem

# df is an existing pandas DataFrame with a "text" column
df["text"] = df["text"].map(preprocess)
df
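The truncation most likely comes from the PorterStemmer step rather than from the stopword filter, since stemming cuts suffixes off each word. A minimal sketch of a stopword-only version is below. It is not NLTK itself: the small STOPWORDS set and the regex tokenizer are assumptions that stand in for stopwords.words('english') and word_tokenize so the example runs without downloading any NLTK data, and preprocess_no_stem is a hypothetical name.

```python
import re

# Assumption: a tiny stand-in for set(stopwords.words('english')),
# used so this sketch runs without downloading NLTK data.
STOPWORDS = {"of", "on", "in", "an", "a", "and", "the", "to"}

def preprocess_no_stem(text):
    """Remove stopwords but leave the remaining words unchanged."""
    # Assumption: a simple regex tokenizer standing in for word_tokenize.
    tokens = re.findall(r"[A-Za-z']+", text)
    # Compare in lowercase so "The" and "the" are both removed.
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    # Same fallback as the original code when nothing is left.
    return kept if kept else ["none"]

print(preprocess_no_stem("Deliver the goods to another of the countries"))
# → ['Deliver', 'goods', 'another', 'countries']
```

With NLTK installed, the same idea should work by keeping word_tokenize and the stopword set from the original code and simply dropping the ps.stem call. If some normalization is still wanted, WordNetLemmatizer from nltk.stem may be worth trying, since it maps words to dictionary forms instead of cutting suffixes (it needs the WordNet data to be downloaded first).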