Select Page

The Python function below cleans up textual data. For parameters, it takes a dataframe and a column name. The function encodes and decodes the text. After that, it performs some basic regex parsing. Finally, all the words that are designated as stop words are then lemmatized using NLTK.

def clean(text):
  """
  A simple function to clean up the data. All the words that
  are not designated as a stop word is then lemmatized after
  encoding and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^ws]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]