Introduction
Last week’s article covered pandas functions for handling messy text data. Pandas is an excellent library that I highly recommend learning, but the Python ecosystem is vast, and many specialized tools can save you time and reduce the amount of code you need to write.
If you’re looking to streamline your text preprocessing pipeline or discover new ways to analyze your data, you’ve come to the right place. In this article, I want to introduce you to several Python libraries that are particularly useful for cleaning and analyzing text data. These libraries may not have the same buzz as pandas, but they are extremely helpful when working with text.
Here’s what we’ll explore:
TextBlob
spaCy
clean-text
contractions
ftfy
Let’s get started!
1. TextBlob – Quick and Easy NLP
If you want something simple to get started with NLP, TextBlob is your friend. What I love about this library is that it is beginner-friendly and handles common NLP tasks, such as part-of-speech tagging, noun phrase extraction, and sentiment analysis, without the steep learning curve of larger frameworks. (Older releases also offered translation, but that feature has been removed in recent versions.) It’s built on top of NLTK, which pip installs automatically as a dependency. You can install TextBlob by running: pip install -U textblob. Then download the NLTK corpora it relies on by running: python -m textblob.download_corpora.
Here’s an example of sentiment analysis and noun phrase extraction from a customer review:
Here, the polarity score of -0.03 (on a scale from -1 to 1) is close to neutral because the strong positive words (fantastic, stunning) and the strong negative word (terrible) roughly cancel each other out. Because the review is made up almost entirely of opinion words, it is highly subjective, which is why the subjectivity score is 0.97 (on a scale from 0 to 1). The noun phrases provide a quick overview of the aspects discussed in the review (fantastic camera and battery life).
2. spaCy – Industrial-Strength NLP
While TextBlob is perfect for quick feedback and is beginner-friendly, it may fall short in large-scale analysis. If you need to process text at scale, spaCy is your friend. spaCy is written in Python and Cython and optimized for speed, making it significantly faster than TextBlob. It also provides advanced features such as dependency parsing, named entity recognition (NER), and custom pipelines. To use spaCy, install the library by running: pip install -U spacy. Then download the small English model by running: python -m spacy download en_core_web_sm.
Let’s see how it performs on noun phrase extraction. We will use the same review from the previous section:
Notice the difference? With TextBlob, we only got 2 noun phrases. Here, spaCy extracts all key product-related phrases (camera, phone, images, battery life). Adjectives give quick insights into sentiment drivers (positive: fantastic, stunning; negative: terrible). This makes spaCy perfect for aspect-based sentiment analysis, where you want to connect features (camera, battery) with opinions.
3. Clean-text – Text Cleaning Made Easy
You’ve probably noticed that text data is often messy. It can include emojis, HTML tags, or special characters that need to be cleaned up before analysis. While libraries like pandas offer tools for text cleaning, exploring alternatives, especially ones that are simpler to use, can save time and effort.
One such option is the clean-text library. Install the library by running: pip install clean-text. As its name suggests, it’s designed specifically for cleaning up messy text data. Let’s take a look at it in action.
Look at that! The text is standardized to lowercase, punctuation and emojis are removed, and URLs are stripped. This makes the text machine-friendly, which is ideal before building NLP models on social media data, reviews, or scraped text.
4. Contractions – Expanding Text for Cleaner NLP
When working with text data, contractions like “can’t”, “won’t”, or “it’s” can cause problems for tokenization, word counts, or vectorization. Tokenizers handle them inconsistently: some keep a contraction as a single token, while others split it into pieces. Expanding them into “cannot”, “will not”, and “it is” gives every word a consistent representation, which makes word counts, embeddings, and sentiment analysis more reliable.
Install the library by running: pip install contractions.
Let’s look at an example:
The output text has been fixed. Instead of handling contractions manually, the library has automatically expanded them. The contractions library is a lightweight but powerful way to normalize text before analysis or model training. It pairs well with clean-text, spaCy, or TextBlob, ensuring your text data is both clean and consistent.
5. ftfy – Fix Text Encoding Issues
ftfy (short for “fixes text for you”) is a lifesaver when dealing with text from the wild web. It fixes common Unicode decoding errors, often referred to as “mojibake”: those annoying sequences like “Ã©” that show up where “é” should be. The value of this library is that it detects and repairs these errors with high accuracy. Instead of trying to guess the encoding yourself, you can often just run ftfy.fix_text() and the problem disappears.
You can install the library by running: pip install ftfy.
Here is the library in action:
The output is readable text, free of the encoding nightmares.
Wrap-Up
These libraries are a powerful complement to your standard toolkit for handling text data. They can significantly boost both your efficiency and the quality of your text analysis. Next time you start an NLP project, consider reaching for these tools. They really make a difference.
It’s also worth noting that, depending on your needs, you can combine all five libraries into a single workflow to preprocess even the messiest text from start to finish. Tell me that’s not cool! Thanks for reading, and happy text cleaning!