Basic cleaning

The Python function below cleans up textual data. For parameters, it takes a dataframe and a column name. The function encodes and decodes the text. After that, it performs some basic regex parsing. Finally, all the words that are designated as stop words are then lemmatized using NLTK.

def clean(text):
  """
  A simple function to clean up the data. All the words that
  are not designated as a stop word is then lemmatized after
  encoding and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

Published by

Ednalyn C. De Dios

I’ve always been enamored with code and I love data science because of its inherent power to solve real problems. Having grown up in the Philippines, served in the United States Navy, and worked in the nonprofit sector, I am driven to make the world a better place. I have started and participated in numerous campaigns that aim to reduce domestic violence and child abuse in the community.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.