
A beginner’s guide to PyCaret’s natural language processing module

I’ve said it before, and I’ll say it again, “the quicker we get from data to insights, the better off we will be.”

PyCaret helps us with this. With just a few lines of Python, it gets us to insights a lot quicker than ever before.

So, what is PyCaret?

PyCaret is an open-source, low-code machine learning library in Python that lets you go from preparing your data to deploying your model within seconds, in your choice of notebook environment.

In other words, PyCaret makes data science a whole lot easier.

And as usual, instead of telling you all about it, I’d rather show you instead.

Housekeeping

First, let’s install PyCaret. Type the following into the terminal:

pip install pycaret

But wait, a problem!

“pip subprocess to install build dependencies did not run successfully”

Screenshot by the Author

Didn’t work? You may be running a newer version of pip that’s incompatible with the build. Let’s downgrade it a little with:

python.exe -m pip install pip==21.3.1

Second, we’ll download some language packs. For the English language model, type the following one line at a time into the terminal:

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Third, let’s install our best friend, pandas:

pip install pandas

Now, we’re ready to rock ‘n roll.

Rock ‘n Roll

Let’s start a Jupyter notebook, and let’s do some coding!

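Here’s a minimal sketch of that first cell, assuming the standard IPython and pandas settings for the two optional blocks:

# PyCaret's NLP module and our best friend, pandas
from pycaret.nlp import *
import pandas as pd

# Optional: echo every expression's result without wrapping it in print()
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Optional: show dataframes in full, without truncating rows or columns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)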

Above, we’re simply importing PyCaret’s NLP module and, of course, pandas. The other blocks of code are completely optional but helpful. The first block of code saves us from having to type the print() statement all the time, while the second block lets us see dataframes in all their glory — that is, without truncating columns and rows.

Next, we’ll get our data and load it into a dataframe.

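Something like the one-liner below, with the caveat that the file name is a placeholder; point pd.read_csv() at wherever your dataset lives:

# Load the tweets into a dataframe ("tweets.csv" is a hypothetical path)
df = pd.read_csv("tweets.csv")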

For the rest of this experiment, we’ll only work with a sample of a thousand tweets to make our code run much faster.

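A sketch of that sampling step (the random_state value is an arbitrary choice, kept only so the sample is reproducible):

# Keep a random sample of 1,000 tweets so the experiment runs faster
df = df.sample(1000, random_state=42).reset_index(drop=True)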

Let’s take a peek at our data:

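For example:

# Show the first few rows of the dataframe
df.head()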

Screenshot by the Author

Now let’s get to the meat of our little project. Don’t forget to change the target as it applies to your data. In this case, the text I want to analyze lives in the “tweet” column of the dataframe, so that’s what I pass to the target parameter of setup().

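Here’s a sketch of those three lines using PyCaret’s NLP API; the stop-word list and the choice of LDA are my assumptions, so adjust both to fit your own data:

# Preprocess the text and customize the stop words
# (this stop-word list is a placeholder)
nlp_setup = setup(data=df, target="tweet", custom_stopwords=["rt", "amp"])

# Create the topic model ("lda" is an assumption; PyCaret supports others)
lda = create_model("lda")

# Assign the model's topics back to our dataframe
df = assign_model(lda)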

In just three lines of code, we preprocessed the text data, customized the stop words, created the model, and assigned the model’s results back to our dataframe.

Now it’s time for some visualization!

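With the model assigned, plot_model() does the heavy lifting. The specific plots below are my assumptions; PyCaret’s NLP module offers several others:

# Overall word frequency in the corpus
plot_model(lda, plot="frequency")

# Interactive topic visualization (this is the plot that relies on pyLDAvis)
plot_model(lda, plot="topic_model")

# Word cloud of the corpus
plot_model(lda, plot="wordcloud")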

Alas, not so fast. If you encounter the error below, don’t panic.

A Hiccup

ModuleNotFoundError: No module named ‘pyLDAvis.gensim’

Screenshot by the Author

Let’s read the error message for clues on how to correct the problem.

In this case, the script couldn’t find the pyLDAvis.gensim module because the module was renamed: what used to be “gensim” is now “gensim_models”. Fixing this only takes changing two lines of code in the PyCaret package.

Yes, we’ll change the code in the library itself because it’s outdated.

Screenshot by the Author

The error above tells us to look at line 2512 of the “nlp.py” file. Let’s go to that directory and open the file to make some edits.

In the “nlp.py” file, on lines 2512 and 2519, we find two occurrences of pyLDAvis.gensim that need to be changed to “pyLDAvis.gensim_models”. Let’s change them and save the file.
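The edit is a simple rename. Sketching it from memory (the exact surrounding code may differ between PyCaret versions, and the second occurrence is typically the prepare() call):

# Before, around lines 2512 and 2519 of nlp.py:
#   import pyLDAvis.gensim
#   vis = pyLDAvis.gensim.prepare(topic_model, corpus, dictionary)
# After:
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(topic_model, corpus, dictionary)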

Screenshot by the Author

Before the changes take effect, we need to restart the kernel in our notebook.

Simply run all the cells, and we shouldn’t get any more errors.

Here’s some of the output:

Screenshots by the Author

Conclusion

Using PyCaret’s NLP module, we’ve seen how quickly we can go from getting the data to gaining insights, all in just a few lines of code.