All about regressions

Instructor Notes

Activity 8.1: This is the R programming exercise. Nothing too complex here.

Activity 8.2: This is a discussion, largely about polynomial models. You may have to allow students some time to do this in class. The reading is taken from the output of an old Python exercise, but students are not expected to do any programming. The activity is about the resulting plots and what they say about the regression modelling.

Activity 8.3: This is about correlation and causation. The idea of the activity is to get students to think about relationships between data, what those relationships can be used for, and what they may indicate. The major difference between the US trends and the Australian trends is that in the US, "sunburn" tends to peak a bit before "ice cream", whereas in Australia they both peak in the same week. For the US, it can certainly be said that an interest in "ice cream" doesn't cause an interest in "sunburn", but students should also question whether "sunburn" causes "ice cream". "umbrella" is interesting because it more or less peaks at the same time as both (i.e., not in the winter!). There is a huge peak for "umbrella" in February 2019, but I suspect this is due to the TV show "The Umbrella Academy" first being released then. Some other terms you could use and discuss are "heat" and "shade".

Step 4 of Activity 7.1 looks at Australian population figures and how best to describe their rise or fall. While this is heading towards linear regression modelling, it doesn't actually mention it until the end. The idea of the discussion is to get students to realise that there are multiple ways to look at the data, and that the statistical analysis we have covered so far (mode, median, mean, variance) isn't enough. If we want to model the data, we need more. Next week, we will get into actually doing the modelling!

As to the plots, I don't have the corresponding R code because I actually made them using Excel (I know, I cheated).
You are welcome to make them in R yourself and share the code. The data is provided to the students on Moodle. One thing to note, though, is that the y-axis for the mean/median vs year plot is slightly different from that of the other plots. This gives the false impression that the mean and median are more different than they actually are. They are also plotted on the box plot as a horizontal line (median) and an X (mean).

The original data is from the ABS: 3218.0 Regional Population Growth, Australia, released at 11.30am (Canberra time) 25 March 2020, "Population Estimates by Electoral Division, 2009 to 2019". It is a table of estimated resident population (ERP) numbers (estimated because they don't guarantee they are complete) for each division for the years 2009-2019. It also has an estimated national population for each year.

To calculate the growth in each electoral division, I simply divided the ERP for Year N by the ERP for Year N-1. This gives growth rates for the years 2010-2019. To calculate the national growth rate for each year, I divided the national estimated population for Year N by the national estimated population for Year N-1. This is what is in the first plot. The median and mean mentioned in the tutorial instructions are calculated from these national figures, which is why they are similar.

If the mean and median are instead calculated from the electoral division growth rates (as shown in the other plots), then they differ more: the median is much lower because of all the small electoral divisions that had a low growth rate. These divisions have little effect on the national population numbers, which are what we are trying to model. This gap between the mean and median of the electoral division rates is also obvious in the final plot.
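
The growth-rate calculation described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Excel workflow actually used. The division names and ERP numbers are made up for the example (they are not the ABS figures), and the national population is assumed here to be the sum of the divisions. It shows why the national growth rate, the mean of the division rates and the median of the division rates can all differ when small divisions grow slowly.

```python
from statistics import mean, median

# Hypothetical ERP figures for three consecutive years.
# These numbers are invented for illustration; they are NOT the ABS data.
erp = {
    "Division A": [50_000, 50_250, 50_500],     # small, slow growth
    "Division B": [60_000, 60_300, 60_600],     # small, slow growth
    "Division C": [400_000, 408_000, 416_000],  # large, faster growth
}


def growth_rates(series):
    """Year-on-year growth: value for Year N divided by value for Year N-1."""
    return [curr / prev for prev, curr in zip(series, series[1:])]


# Growth rate of each division, per year.
division_growth = {name: growth_rates(values) for name, values in erp.items()}

# Assumption: national population per year is the sum over the divisions.
national = [sum(year_values) for year_values in zip(*erp.values())]
national_growth = growth_rates(national)

for i, nat in enumerate(national_growth):
    rates = [rates_for_division[i] for rates_for_division in division_growth.values()]
    print(
        f"year {i + 1}: national={nat:.4f} "
        f"mean of divisions={mean(rates):.4f} "
        f"median of divisions={median(rates):.4f}"
    )
```

The national rate is effectively weighted by population, so the large division dominates it, while the mean and median of the division rates treat every division equally.
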
The question of the accuracy of the model will mainly be covered in week 9, but I recommend you suggest that students think about the formula for variance and how it may be adapted to compare the mean value of each year to the corresponding value from the model.

Lab activity 8.4 is another R programming exercise that looks at how decision trees can be used to refine the modelling by segmenting the data. Nothing too complex here, but there are two different trees that we are looking at, so be sure to discuss the pros and cons of each method.
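
One way to read the hint about adapting the variance formula is sketched below in Python. This is an assumption about where the hint leads, not the week 9 material itself. Variance averages the squared deviations from the overall mean. Replacing the overall mean with the model's value for each year gives the mean squared error. The function names and the numbers are hypothetical.

```python
def variance(values):
    """Population variance: average squared deviation from the overall mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def mean_squared_error(observed, predicted):
    """Same shape as variance, but each deviation is measured from the
    model's value for that year instead of from the overall mean."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical yearly values and hypothetical model output (not real data).
observed = [1.018, 1.016, 1.017, 1.015]
predicted = [1.018, 1.017, 1.016, 1.015]

print("variance of observed:", variance(observed))
print("error of the model:  ", mean_squared_error(observed, predicted))

# A "model" that always predicts the overall mean has an error equal to the variance.
overall_mean = sum(observed) / len(observed)
print("error of mean-only model:", mean_squared_error(observed, [overall_mean] * len(observed)))
```

A model is only useful if its error is smaller than that of the mean-only model, which gives students a natural baseline for comparison.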

Correlation measures the strength and direction of a linear relationship between two variables. We use the correlation coefficient (R), which quantifies the direction (+/-), the strength (weak, moderate, strong) and the consistency (how tightly clustered the points are around the line).
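
As a rough illustration of the definition above, the sketch below computes the Pearson correlation coefficient by hand in Python. The helper name `pearson_r` and the sample data are made up for the example. In practice a library routine would normally be used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root of summed squared deviations (standard deviations up to the same factor).
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # points exactly on a rising line
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # points exactly on a falling line
print(pearson_r(xs, [2, 1, 4, 3, 5]))   # rising trend with some scatter

# A perfect but non-linear relationship (y = x squared) around zero.
curve_x = [-2, -1, 0, 1, 2]
print(pearson_r(curve_x, [x ** 2 for x in curve_x]))
```

The last case is worth discussing with students: the two variables are completely dependent, yet the coefficient only captures the linear part of the relationship.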

insert img/simpsons_paradox.png

insert https://www.tylervigen.com/spurious/correlation/image/6863_google-searches-for-minecraft_correlates-with_the-number-of-surgens-in-florida.svg

FIT5145 Workshop Week 8

Objectives

Core Concept: Considerations in Correlation vs Causation

What correlation means

Correlation can be misleading

Correlation can be spurious

Causation is another beast

Correlation

Causation

So what?

Today's Agenda

Coding Tasks

Self-guided