Description
Tweets, status updates, emails and reviews are all examples of unstructured data that can offer direct insight into not only the author’s interests but also their predisposition to certain behaviors and actions. These insights can be unearthed with computational techniques especially suited to restructuring human language, extracting characteristics from it and modeling it. This class will teach machine learning best practices for extracting actionable information from text.
Prerequisites
- Basic understanding of statistics
- Knowledge of R programming
- R 3.4.x
- RStudio 1.0.x
Outline
1. Motivation – We’ll kick the course off with a brief discussion of applications of text analysis for predicting sentiment, text categorization and feature extraction. This discussion will also cover the basic concepts and definitions used in text mining.
2. Preprocessing – In this section, we’ll discuss string manipulation with grep, along with the use of the stringi and stringr packages in R. Topics will include stemming, the usage of stop words, spelling adjustments, special character exceptions and other preprocessing techniques.
3. Data Representations – Before modeling can begin, a textual representation must be chosen and implemented. In this unit we’ll introduce n-grams, term-document matrices, and term frequency–inverse document frequency (TF-IDF) weighting, all standard formats used as model inputs.
4. Modeling – This section will focus on language processing techniques that can be used for both exploratory insight and text classification.
5. Case Study – The class will walk through a guided case study that highlights the use of text mining techniques for the extraction of topics from a large employee satisfaction survey.
6. Special Topics – Here we’ll end with a discussion of recent developments in text mining such as the word2vec representation, deep learning models, and stylometry.