FIT5145: Workshop 3

FIT5145 Week 5 (Workshop 4)

Objectives

  • How to manipulate time series data
  • How to preprocess textual data
  • How to deal with missing values and impute them

Core Concept 1: Working with Time Series

  • How do you interpret datetime represented as strings?
"2025-04-02"    "02/04/2025"    "04-02-25"
# yyyymmdd      # ddmmyyyy      # mmddyy
ymd()           dmy()           mdy()
  • How do you parse irregular datetime formats?
  • How do you format a datetime to another format?
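strptime("<string>", format = "<format>")    # parse, e.g. format = "%d/%m/%Y"
format(<datetime>, format = "<format>")      # convert back to a string in another format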
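The parsing and formatting questions above can be sketched in base R as follows. This is a minimal illustration, not the workshop's solution: the example strings for the first three dates come from the slide, while the timestamp string, the variable names, and the choice of `tz = "UTC"` are assumptions made for the example. The `lubridate` calls only run if that package happens to be installed.

```r
# Three strings that all encode 2 April 2025, each in a different layout.
iso_str <- "2025-04-02"   # year-month-day
dmy_str <- "02/04/2025"   # day/month/year
mdy_str <- "04-02-25"     # month-day-two-digit-year

# Base R: state the layout explicitly with a format string.
d1 <- as.Date(iso_str, format = "%Y-%m-%d")
d2 <- as.Date(dmy_str, format = "%d/%m/%Y")
d3 <- as.Date(mdy_str, format = "%m-%d-%y")

# All three should resolve to the same calendar date.
all_same <- (d1 == d2) && (d2 == d3)

# An irregular layout with a time component, parsed with strptime().
# tz = "UTC" is an assumption so the result does not depend on the local time zone.
stamp <- strptime("02.04.2025 14:30", format = "%d.%m.%Y %H:%M", tz = "UTC")

# Formatting: turn a parsed date or datetime back into a string in another layout.
as_dmy   <- format(d1, "%d/%m/%Y")
as_stamp <- format(stamp, "%Y-%m-%d %H:%M")

# Once parsed, dates support arithmetic, which plain strings do not.
one_week_later <- d1 + 7

# The lubridate helpers named on the slide infer separators from the field order.
if (requireNamespace("lubridate", quietly = TRUE)) {
  l1 <- lubridate::ymd(iso_str)
  l2 <- lubridate::dmy(dmy_str)
  l3 <- lubridate::mdy(mdy_str)
}
```

The practical difference is that base R needs the exact format string, whereas `ymd()`, `dmy()` and `mdy()` only need the order of the fields.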

Core Concept 2: Natural Language Processing (NLP)

  • How would you navigate through a large corpus of text?
  • How do you preprocess textual data? (Real data is messy)
  • What are the common preprocessing steps in NLP?
    1. Drop stopwords and punctuation (e.g., "and", "the", ",")
    2. Lowercase (e.g., Technology == technology)
    3. Tokenise (e.g., "I love R" -> ["I", "love", "R"])
    4. Stem/Lemmatise (e.g., running -> run)
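The four preprocessing steps can be sketched in base R as below. Everything here is illustrative: the sentence, the tiny stopword list, and the `toy_stem()` helper are hypothetical and exist only for this example. Lowercasing is done first so that stopword matching is case-insensitive, and the stemmer handles just the "-ing" case from the slide; a real stemmer such as `SnowballC::wordStem()` covers far more.

```r
text <- "I love R, and the students are running the code!"

# Lowercase so that "Technology" and "technology" count as the same word.
lowered <- tolower(text)

# Drop punctuation.
no_punct <- gsub("[[:punct:]]", "", lowered)

# Tokenise: split on runs of whitespace.
tokens <- strsplit(no_punct, "\\s+")[[1]]

# Drop stopwords (hypothetical, deliberately tiny list).
stopwords <- c("i", "and", "the", "are")
kept <- tokens[!tokens %in% stopwords]

# Toy stemmer (hypothetical helper): strips a trailing "ing" and then
# collapses the doubled letter it may leave behind, e.g. "runn" -> "run".
toy_stem <- function(words) {
  has_ing  <- grepl("ing$", words)
  stripped <- sub("ing$", "", words)
  stripped[has_ing] <- sub("(.)\\1$", "\\1", stripped[has_ing], perl = TRUE)
  stripped
}

stems <- toy_stem(kept)
```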

Core Concept 3: Missing Values

  • What are missing values?
  • What's the consequence if we ignore missing values?
  • How do we deal with missing values?
    • Remove/drop?
    • Impute (fill them in)?
  • How do we impute missing values?
    • Mean imputation (<5% missing)
    • Regression imputation (<10% missing)
    • Multiple/ML imputation (<20% missing)
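Before choosing between dropping and imputing, it helps to see what ignoring missing values does and how much is actually missing. The sketch below uses a hypothetical toy data frame (`houses`, with made-up prices in thousands); it is not the Melbourne dataset used in the workshop.

```r
# Hypothetical toy data: two of the eight prices are missing (NA).
houses <- data.frame(
  rooms = c(2, 3, 3, 4, 2, 5, 3, 4),
  price = c(650, 820, NA, 1100, 600, NA, 790, 1050)
)

# Ignoring missing values: a single NA makes the whole summary NA.
mean_ignored <- mean(houses$price)

# Skipping them explicitly gives the mean of the observed values only.
mean_skipped <- mean(houses$price, na.rm = TRUE)

# How much is missing? Count and percentage per column.
n_missing   <- colSums(is.na(houses))
pct_missing <- colMeans(is.na(houses)) * 100

# Remove/drop: keep only the rows with no missing value in any column.
complete_rows <- houses[complete.cases(houses), ]
```

The percentage in `pct_missing` is the figure to compare against rule-of-thumb thresholds when deciding on an imputation method.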

Today's Agenda

  • 5.3 Temporal (Time-series) and textual data [ ~60 mins ]
    • Time series data ( ~30 mins )
    • NLP preprocessing ( ~30 mins )
  • 5.2 Melbourne house prices [ ~45 mins ]
    • Data exploration ( ~25 mins )
    • Data imputation ( ~20 mins )
  • 5.1 Big Data [If time permits]
    • Applied class 5.1 PDF
    • Might be good to do at home in your own time

What is Mean imputation? Mean imputation is a simple method that replaces each missing value in a variable with the mean of that variable's observed values.
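A minimal sketch of mean imputation in base R, using a hypothetical vector of made-up prices. It also shows one known side effect: the mean is preserved, but the variance shrinks because every imputed value sits exactly at the centre.

```r
# Hypothetical prices (in thousands) with two missing values.
price <- c(650, 820, NA, 1100, 600, NA, 790, 1050)

# Mean of the observed values only.
observed_mean <- mean(price, na.rm = TRUE)

# Replace each NA with that mean; observed values are left untouched.
price_imputed <- ifelse(is.na(price), observed_mean, price)

# Side effect: same mean, smaller spread than the observed data.
var_observed <- var(price, na.rm = TRUE)
var_imputed  <- var(price_imputed)
```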

What is Regression imputation? Regression imputation fits a regression model on the complete rows, using the other variables in the dataset as predictors, and then fills in each missing value with the model's prediction.
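A minimal sketch of regression imputation, assuming a hypothetical toy data frame in which `price` has gaps and `rooms` is fully observed. The choice of a simple linear model with a single predictor is an assumption made for the example.

```r
# Hypothetical toy data: price is missing for two houses, rooms is complete.
houses <- data.frame(
  rooms = c(2, 3, 3, 4, 2, 5, 3, 4),
  price = c(650, 820, NA, 1100, 600, NA, 790, 1050)
)

# Fit the model on the complete rows; lm() drops rows with NA by default.
fit <- lm(price ~ rooms, data = houses)

# Predict price for the rows where it is missing and fill those in.
missing <- is.na(houses$price)
houses$price_imputed <- houses$price
houses$price_imputed[missing] <- predict(fit, newdata = houses[missing, , drop = FALSE])
```

Unlike mean imputation, the filled-in values differ per row because they depend on each house's `rooms` value.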

What is Multiple imputation? Multiple imputation creates several completed copies of the dataset, each with different plausible imputed values, analyses each copy, and then combines the results so that the uncertainty about the missing data is reflected in the final estimate. ML imputation uses machine learning algorithms (for example, k-nearest neighbours or random forests) to predict the missing values; it can be used on its own or as the prediction step within multiple imputation.
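The idea of "several completed datasets, then combine" can be sketched in base R as below. This is a deliberately simplified illustration on hypothetical toy data: it adds random residual noise to regression predictions but does not redraw the regression coefficients, so it understates the uncertainty compared with a proper implementation (in practice a dedicated package such as `mice` is commonly used). The seed, the number of imputations `m`, and the quantity being estimated (the mean price) are all assumptions for the example.

```r
set.seed(5145)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data: price is missing for two houses.
houses <- data.frame(
  rooms = c(2, 3, 3, 4, 2, 5, 3, 4),
  price = c(650, 820, NA, 1100, 600, NA, 790, 1050)
)

missing   <- is.na(houses$price)
fit       <- lm(price ~ rooms, data = houses)   # fitted on the complete rows
predicted <- predict(fit, newdata = houses[missing, , drop = FALSE])
resid_sd  <- sigma(fit)                         # residual standard deviation

m <- 5                      # number of completed datasets
estimates <- numeric(m)     # one estimate of the mean price per dataset

for (i in seq_len(m)) {
  completed <- houses$price
  # Each copy gets a different plausible value: prediction plus random noise.
  completed[missing] <- predicted + rnorm(sum(missing), mean = 0, sd = resid_sd)
  estimates[i] <- mean(completed)
}

# Combine: the pooled estimate is the average across the completed datasets,
# and the spread between them reflects uncertainty due to the missing data.
pooled_mean <- mean(estimates)
between_var <- var(estimates)
```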