Time Series, NLP, and Missing Values

FIT5145: Workshop 4

Core Concept 2: Natural Language Processing (NLP)

How would you navigate through a large corpus of text?
How do you preprocess textual data? (Real data is messy)
What are the common preprocessing steps in NLP?
1. Drop Stopwords and Punctuation (e.g., "and", "the", ",")
2. Lowercase (e.g., Technology == technology)
3. Tokenise (e.g., "I love R" -> ["I", "love", "R"])
4. Stem/Lemmatise (e.g., running -> run)
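
The four steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline: the STOPWORDS set is a tiny made-up list, and naive_stem is a hypothetical toy suffix stripper standing in for a real stemmer or lemmatiser (it will mangle many words). The sketch lowercases and tokenises before dropping stopwords, an ordering assumed here so that "The" and "the" match the same stopword entry.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "i"}


def naive_stem(token):
    """Toy suffix stripper for illustration only, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def preprocess(text):
    # Lowercase so that "Technology" and "technology" are treated alike.
    text = text.lower()
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Tokenise on whitespace.
    tokens = text.split()
    # Drop stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Reduce each remaining token to a crude stem.
    return [naive_stem(t) for t in tokens]


print(preprocess("I love R"))
print(preprocess("Running and the Technology, running!"))
```

In practice a library such as NLTK or spaCy would normally supply the stopword list, tokeniser, and stemmer or lemmatiser; the hand-rolled versions here only make each step visible.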