As data scientists, we are known for crunching numbers, but what happens when we run into text data? In this tutorial, I will walk through the steps to turn text data into a format that a machine can understand, share some of the most popular text analytics techniques, and showcase several natural language processing (NLP) libraries in Python, including NLTK, TextBlob, spaCy and gensim.
Natural language processing (NLP) is an exciting branch of artificial intelligence (AI) that allows machines to break down and understand human language. As a data scientist, I often use NLP techniques to interpret the text data in my analyses. In this tutorial, I will walk through text pre-processing techniques, machine learning techniques and Python libraries for NLP.
Text pre-processing techniques include tokenization, text normalization and data cleaning. Once the text is in a standard format, various machine learning techniques can be applied to better understand the data. These include popular modeling techniques for classifying emails as spam or not spam, or for scoring the sentiment of a tweet. Newer, more complex techniques can also be used, such as topic modeling, word embeddings or text generation with deep learning.
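To give a rough sense of what the pre-processing steps named above look like in practice, here is a minimal sketch that uses only the Python standard library rather than any of the NLP libraries covered in the tutorial. The function names `preprocess` and `bag_of_words`, the tiny stop word list and the two sample sentences are all made up for illustration; libraries such as NLTK and spaCy provide far more complete tokenizers and stop word lists.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list (an assumption for illustration only).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "you"}


def preprocess(text):
    """Normalize, tokenize and clean a raw string."""
    text = text.lower()  # normalization: fold everything to lowercase
    # Tokenization: runs of letters, optionally with one inner apostrophe
    # (so "don't" stays a single token). Digits and punctuation are dropped.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    # Cleaning: remove very common words that carry little meaning.
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Turn each document into a word-count mapping a model could consume."""
    return [Counter(preprocess(d)) for d in docs]


# Hypothetical sample documents.
docs = ["WIN a FREE prize now!!!", "Are you free to meet in the morning?"]
counts = bag_of_words(docs)

print(preprocess(docs[0]))
print(counts[1])
```

Word counts like these are one simple "standard format" that a classifier, such as a spam filter, could take as input.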
We will walk through an example in a Jupyter Notebook that covers all of the steps of a text analysis project, using several NLP libraries in Python, including NLTK, TextBlob, spaCy and gensim, along with standard data science libraries such as pandas and scikit-learn.
Alice Zhao is currently a Senior Data Scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she worked at Cars.com, where she started as the company's first data scientist, supporting multiple functions from Marketing to Technology. During that time, she also co-founded a data science education startup, Best Fit Analytics Workshop, teaching weekend courses to professionals at 1871 in Chicago. Prior to becoming a data scientist, she worked at Redfin as an analyst and at Accenture as a consultant. She holds an M.S. in Analytics and a B.S. in Electrical Engineering, both from Northwestern University. She blogs about analytics and pop culture on A Dash of Data. Her blog post, "How Text Messages Change From Dating to Marriage," made it onto the front page of Reddit, gaining over half a million views in the first week. She is passionate about teaching and mentoring, and loves using data to tell fun and compelling stories.