A beginner’s guide to PyCaret’s natural language processing module
I’ve said it before, and I’ll say it again, “the quicker we get from data to insights, the better off we will be.”
PyCaret helps us with this. With a few lines of Python, it helps get us to insights a lot quicker than ever before.
So, what is PyCaret?
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.
In other words, PyCaret makes data science a whole lot easier.
And as usual, instead of telling you all about it, I’d rather show you instead.
First, let’s install PyCaret. Type the following into the terminal:
pip install pycaret
But wait, a problem!
“pip subprocess to install build dependencies did not run successfully”
Didn’t work? You may be running a newer version of pip that can’t build PyCaret’s dependencies. Let’s downgrade it a little with:
python -m pip install pip==21.3.1
Second, we’ll download some language packs. For the English language model, type the following one line at a time into the terminal:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
Third, let’s install our best friend, pandas:
pip install pandas
Now, we’re ready to rock ‘n roll.
Rock ‘n Roll
Let’s start a Jupyter notebook, and let’s do some coding!
First, we import PyCaret’s NLP module and, of course, pandas. The other two blocks of code are completely optional but helpful: the first saves us from having to type the print() statement all the time, while the second lets us see dataframes in all their glory, without truncated columns and rows.
Next, we’ll get our data and load it into a dataframe.
For the rest of this experiment, we’ll only work with a sample of a thousand tweets to make our code run much faster.
Let’s take a peek at our data:
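Here’s a sketch of the loading, sampling, and peeking steps. I’m using a made-up dataframe as a stand-in; in practice, you’d read your own tweets with pd.read_csv():

```python
import pandas as pd

# Stand-in for the real dataset -- in practice, something like
# df = pd.read_csv("your_tweets.csv")
df = pd.DataFrame({"tweet": [
    "I love PyCaret",
    "low-code machine learning is great",
    "topic modeling tweets is fun",
] * 400})

# Keep a random sample of 1,000 tweets so the experiment runs faster
df = df.sample(1000, random_state=42).reset_index(drop=True)

# Peek at the first few rows
df.head()
```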
Now let’s get to the meat of our little project. Don’t forget to change the target as it applies to your data. In this case, the text I want to analyze is in the “tweet” column, so that’s what I put in the target parameter of setup().
In just three lines of code, we preprocessed the text data, customized the stop words, created the model, and assigned the model’s topics back to our dataframe.
Now it’s time for some visualization!
Alas, not so fast. If you encounter the error below, don’t panic.
ModuleNotFoundError: No module named ‘pyLDAvis.gensim’
Let’s read the error message for clues on how to correct the problem.
In this case, the script did not find the pyLDAvis.gensim module because it was renamed: what used to be “gensim” is now “gensim_models”. Fixing this only takes changing two lines of code in the pycaret package.
Yes, we’ll change the code on the library itself because it’s outdated.
The error above points us to line 2512 of the “nlp.py” file, so let’s go to the package directory and open that file to make some edits.
On lines 2512 and 2519 of “nlp.py”, we find two occurrences of pyLDAvis.gensim that need to be changed to “pyLDAvis.gensim_models”. Let’s change them and save the file.
Before the changes take effect, we need to restart the kernel in our notebook.
Simply run all the cells, and we shouldn’t get any more errors.
Here is some of the output:
Using PyCaret’s NLP module, we’ve seen how quickly we can go from raw data to insights in just a few lines of code.