The Python function below cleans up textual data. It takes a single string of text as its parameter. The function first normalizes the text to ASCII by encoding and decoding it, and converts it to lowercase. After that, it uses a basic regex to strip punctuation and splits the result into words. Finally, every word that is not designated as a stop word is lemmatized using NLTK.
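import re
import unicodedata

import nltk  # requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)

def clean(text):
    """
    A simple function to clean up the data. All the words that
    are not designated as stop words are lemmatized after
    encoding and basic regex parsing are performed.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    # ADDITIONAL_STOPWORDS is a list of extra stop words, assumed to be defined elsewhere
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    # Decompose accented characters, drop anything non-ASCII, and lowercase
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # Remove every character that is neither a word character nor whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]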
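To see what the normalization and regex steps do on their own, here is a minimal sketch that uses only the standard library. The helper name `normalize_and_tokenize` and the sample sentence are made up for illustration and are not part of the original function. It leaves out the NLTK stop word filtering and lemmatization entirely.

```python
import re
import unicodedata


def normalize_and_tokenize(text):
    """Hypothetical helper: only the normalization and regex steps, no NLTK."""
    # NFKD splits an accented character into a base letter plus a combining
    # mark; encoding to ASCII with 'ignore' then drops the combining mark.
    ascii_text = (unicodedata.normalize('NFKD', text)
                  .encode('ascii', 'ignore')
                  .decode('utf-8', 'ignore')
                  .lower())
    # Remove every character that is neither a word character nor whitespace,
    # then split on whitespace to get a list of tokens.
    return re.sub(r'[^\w\s]', '', ascii_text).split()


tokens = normalize_and_tokenize("Café déjà-vu: it's 100% naïve!")
print(tokens)  # ['cafe', 'dejavu', 'its', '100', 'naive']
```

Note that punctuation is deleted rather than replaced with a space, so hyphenated words and contractions are joined ("déjà-vu" becomes "dejavu", "it's" becomes "its"). Digits count as word characters, so "100" survives.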