Pedestrian activity

The City of Melbourne has developed an automated pedestrian counting system to better understand pedestrian activity. Data is captured from counting sensors across various locations in Melbourne's CBD.

We've stored a subset of this data in a comma-separated values (.csv) file called melb_walk_wide.csv on GitHub (please open this link in a new tab). If you have trouble accessing it, download it to your working directory from Moodle. Clicking on the Raw button on GitHub lets you view melb_walk_wide.csv in your web browser:

Reading a .csv file from GitHub

To read (or import) melb_walk_wide.csv into your R session:

  • Load the tidyverse, which contains the read_csv() function (from the readr package) for reading .csv files into R. (A good rule of thumb is to always load the tidyverse before you begin any data analysis.)
  • Copy the GitHub URL of melb_walk_wide.csv and paste it inside the read_csv() function. If you have already downloaded the file into your working directory, just use the file path instead, e.g., ./melb_walk_wide.csv
  • Store the data in an object named ped_wide.

Fill out the missing parts of the code chunk (???) and then run:

# Load tidyverse
library(tidyverse)

# Read melb_walk_wide.csv from GitHub URL and store in object named ped_wide
ped_wide <- read_csv("https://raw.githubusercontent.com/quangvanbui/FIT5145-data/master/melb_walk_wide.csv")
# Alternatively, read it from your working directory
# ped_wide <- ???("./melb_walk_wide.csv")

# Print ped_wide
ped_wide
A spec_tbl_df: 744 x 46 (excerpt; most sensor columns are omitted for readability)
  Date_Time           Date        Time `Alfred Place` `Birrarung Marr` `Bourke St-Russell St (West)`
  <dttm>              <date>     <dbl>          <dbl>            <dbl>                         <dbl>
1 2018-12-31 13:00:00 2019-01-01     0            207             2733                          1745
2 2018-12-31 14:00:00 2019-01-01     1             99             1086                          1722
3 2018-12-31 15:00:00 2019-01-01     2             60              571                          1113
# ... with 741 more rows and 40 more sensor columns, ending in `Webb Bridge`

Note that when we load the tidyverse, R returns messages and warnings listing the tidyverse packages (and their versions) that have been attached to our R session, any function conflicts, etc. R also returns a message after reading melb_walk_wide.csv with read_csv() to report how it has specified each column type of the data.

read_csv() or read.csv()?

While base R provides the read.csv() function to read .csv files into R, the read_csv() function (from the readr package, which is part of the tidyverse) reads .csv files approximately 10 times faster than read.csv(). This means a .csv file that would take read.csv() 60 minutes to read into R would take read_csv() only about 6 minutes.
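Both functions are called in the same way, so you can time them yourself with system.time(). This is a sketch, assuming you have downloaded the file to your working directory; note that melb_walk_wide.csv is small (744 rows), so the speed difference will only be noticeable on much larger files:

```r
# Compare base R's read.csv() with readr's read_csv() on the same file.
# On a file this small the gap is tiny; the ~10x speed-up shows on large files.
library(tidyverse)

system.time(base_df  <- read.csv("./melb_walk_wide.csv"))   # base R
system.time(readr_df <- read_csv("./melb_walk_wide.csv"))   # readr (tidyverse)
```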

Look at ped_wide

To print out or look at what is inside of ped_wide, type ped_wide in a code chunk and run it.

# Print the first few rows of ped_wide
ped_wide %>% head(n=3)
A tibble: 3 x 46 (excerpt; most sensor columns are omitted for readability)
  Date_Time           Date        Time `Alfred Place` `Birrarung Marr`
  <dttm>              <date>     <dbl>          <dbl>            <dbl>
1 2018-12-31 13:00:00 2019-01-01     0            207             2733
2 2018-12-31 14:00:00 2019-01-01     1             99             1086
3 2018-12-31 15:00:00 2019-01-01     2             60              571
# ... with 41 more sensor columns

There are circumstances when printing out all of ped_wide is unnecessary. For example, in a report communicating our analysis, we should never include a table of the entire data set (if you were a project partner reading a report from an analyst, how would you feel about wading through a 744-by-46 table?). Instead, you can print the head of a data set using the head() function, which returns the first 6 rows of the data - a small extract of the data.

Fill out the missing parts of the code chunk (???) and then run:

# head(ped_wide)
# or use %>% to pipe the output to head() function
ped_wide %>% head()

Notice that there are 744 rows and 46 columns. The columns in ped_wide and their definition are provided below:

  • Date_Time - date and time stamp of the recorded pedestrian foot traffic count, in the UTC timezone
  • Date - date of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
  • Time - hour of the day (24-hour time) of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
  • Alfred Place - number of pedestrians counted over a one-hour period by a sensor located in Alfred Place
  • Birrarung Marr - number of pedestrians counted over a one-hour period by a sensor located in Birrarung Marr
  • ⋮
  • Webb Bridge - number of pedestrians counted over a one-hour period by a sensor located at Webb Bridge

Note that the dates and hours in variables Date and Time differ from Date_Time because of timezone differences.
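One way to see the timezone offset for yourself is to convert Date_Time into Melbourne local time with lubridate's with_tz() function. This is a sketch, assuming ped_wide has been read in as above (lubridate is installed with the tidyverse, but you may need to load it explicitly):

```r
# Convert the UTC Date_Time stamps into Melbourne local time
library(lubridate)

ped_wide %>%
  mutate(local_time = with_tz(Date_Time, tzone = "Australia/Melbourne")) %>%
  select(Date_Time, local_time, Date, Time) %>%
  head()
```

The local_time column should line up with the Date and Time columns, confirming that the 11-hour gap is purely a timezone difference.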

This type of data is called a time series, or temporal data, because it contains information recorded over time. In this example, we have hourly pedestrian counts for a number of locations in Melbourne from January 1 to 31, 2019. Confirm the time period of our data by following the steps below:

  • Take ped_wide and pipe in the arrange() function.
  • Arrange the data by the column, Date.
  • Pipe in the summarise() function and return the first and last date with the first() and last() function.

Fill out the missing parts of the code chunk (???) and then run:

# First and last date in the data
ped_wide %>%
  arrange(Date) %>%
  summarise(
    first_date = first(Date),
    last_date = last(Date)
  )
A tibble: 1 x 2
  first_date last_date
  <date>     <date>
1 2019-01-01 2019-01-31

Convert to long form

![Artwork by @allison_horst](images/tidyr_spread_gather.png)

It is helpful to think of a data set as either wide or long. The pedestrian count data, ped_wide, is presented in a wide form, which is to say that the attributes of the data are presented horizontally. Converting ped_wide into a long form presents the same attributes vertically, i.e., no information is lost by reshaping the data.

So why should we reshape the data into a long form? A data set that is represented in a long form is considered a tidy data set and allows us to use all the tools from the tidyverse. The tools created in the tidyverse are designed for us to work in a principled and consistent way but they require that the data be represented the tidy way (long form). We will see later how the dplyr functions to wrangle the data and ggplot2 package to produce graphics (both part of the tidyverse) work seamlessly when the data is in a tidy long form.

Of course, there are instances when a wide form representation of the data is necessary (some models need to be trained with data in a wide form).

Follow the steps below to convert ped_wide into a tidy long-form data set:

  • Take ped_wide and pipe in the gather() function.
  • Inside gather(), specify the key as Sensor and the value as Count, and gather all columns in ped_wide except Date_Time, Date and Time.
  • Store this tidy long form data of pedestrian count in an object named ped.

Fill out the missing parts of the code chunk (???) and then run:

# Convert the data into a long form
ped <- ped_wide %>%
  gather(
    key = Sensor,
    value = Count, -Date_Time, -Date, -Time
  ) %>%
  select(Sensor, everything(), Count)

# Print ped
ped %>% head()
A tibble: 6 x 5
  Sensor       Date_Time           Date        Time Count
  <chr>        <dttm>              <date>     <dbl> <dbl>
1 Alfred Place 2018-12-31 13:00:00 2019-01-01     0   207
2 Alfred Place 2018-12-31 14:00:00 2019-01-01     1    99
3 Alfred Place 2018-12-31 15:00:00 2019-01-01     2    60
4 Alfred Place 2018-12-31 16:00:00 2019-01-01     3    21
5 Alfred Place 2018-12-31 17:00:00 2019-01-01     4    15
6 Alfred Place 2018-12-31 18:00:00 2019-01-01     5    31

While ped_wide contains 744 rows and 46 columns of data and ped contains 31,992 rows and 5 columns, no information is lost by reshaping the data. In ped_wide, the pedestrian count from each sensor was presented in its own column, but in ped, one column contains the sensor name/location and another its pedestrian count. This means that each row in ped captures the number of pedestrians counted over a one-hour window at a given location.
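Note that in current versions of tidyr, gather() has been superseded by pivot_longer(), which produces the same long-form result with arguably clearer argument names. A sketch of the equivalent call:

```r
# Equivalent long-form conversion using pivot_longer() (supersedes gather())
ped_alt <- ped_wide %>%
  pivot_longer(
    cols = -c(Date_Time, Date, Time),  # reshape every column except these three
    names_to = "Sensor",
    values_to = "Count"
  )
```

Either function works for this unit; gather() is used throughout this material.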

Note about the arguments in a function

It is not essential to type out the name of a function's argument(s) when specifying what that argument should be. For example, the gather() function used above specified the Sensor variable in the key argument and the Count variable in the value argument:

  • key = Sensor
  • value = Count
ped_wide %>%
  gather(key = Sensor, value = Count, -Date_Time, -Date, -Time)

We can achieve the same result without explicitly providing the argument names:

ped_wide %>%
  gather(Sensor, Count, -Date_Time, -Date, -Time)

This is because the arguments are ordered, i.e., the key comes first, then the value, so by passing Sensor first and Count second, gather() knows that Sensor should be used as the key argument and Count as the value argument.

State Library

We will explore pedestrian activity around the State Library on the 1st of January, 2019. To do this, we will need to filter ped for the State Library sensor on the 1st of January, 2019.

  • Take ped and pipe in the filter() function.
  • Filter Date to "2019-01-01" and Sensor to "State Library".
  • Store this filtered data in an object named state_lib_jan_one.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for State Library data on Jan 1, 2019
state_lib_jan_one <- ped %>%
  filter(
    Date == "2019-01-01",
    Sensor == "State Library"
  )

# Print state_lib_jan_one
state_lib_jan_one %>% head()
A tibble: 6 x 5
  Sensor        Date_Time           Date        Time Count
  <chr>         <dttm>              <date>     <dbl> <dbl>
1 State Library 2018-12-31 13:00:00 2019-01-01     0  1548
2 State Library 2018-12-31 14:00:00 2019-01-01     1  1494
3 State Library 2018-12-31 15:00:00 2019-01-01     2   878
4 State Library 2018-12-31 16:00:00 2019-01-01     3   309
5 State Library 2018-12-31 17:00:00 2019-01-01     4   133
6 State Library 2018-12-31 18:00:00 2019-01-01     5   110

This tells R to take ped, filter it for data captured by the State Library sensor on the 1st of January, 2019, and store the filtered data in an object named state_lib_jan_one. If you have done this successfully, you'll see state_lib_jan_one in your RStudio Environment tab, and state_lib_jan_one should look like the output above. Now answer the following:

  • How many rows and columns are in state_lib_jan_one?
  • Explain why there are this many rows. (This may seem obvious, but if you develop a checking mechanism like this, you'll be able to spot data quality or coding issues much sooner, which can save you a lot of time.)
  • In which hour is pedestrian count highest? Explain whether or not this makes sense.
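One way to answer these checks in code (a sketch, assuming state_lib_jan_one was created as above):

```r
# Dimensions: expect 24 rows (one per hour of Jan 1) and 5 columns
dim(state_lib_jan_one)

# Hour of the day with the highest pedestrian count
state_lib_jan_one %>%
  filter(Count == max(Count, na.rm = TRUE)) %>%
  select(Time, Count)
```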

Line plot

A better way to understand the pedestrian count around the State Library sensor in each hour of the day (of Jan 1st, 2019) is to produce a visualisation. Line plots are typically used to visualise time-series data sets, with the x-axis representing the time or date (or both) and the y-axis representing some time-series process. To produce a line plot of the pedestrian count around the State Library for each hour of the day:

  • Take state_lib_jan_one and pipe in the ggplot() function.
  • Specify the aesthetics layer, i.e., what should be placed on the x- and y-axes. This goes inside aes(), which goes inside ggplot().
  • Add the geometric (or geom) layer to tell R that the visual element we need for our plot is the line.

Fill out the missing parts of the code chunk (???) and then run:

# Line plot of State Library pedestrian count
state_lib_jan_one %>%
  ggplot(aes(y = Count, x = Time)) +
  geom_line()

Describe the pedestrian count from 0:00 to 23:00 on January 1st, 2019, i.e., when is the peak, trough, steepest decline, etc. Would you expect this pattern to appear the following day?

Bar plot

You can copy and run the following code chunk to produce the equivalent plot using bars, i.e., a bar plot of the State Library pedestrian count for each hour of the day.

# Bar plot of count
state_lib_jan_one %>%
  ggplot(aes(y = Count, x = Time)) +
  geom_bar(stat = "identity")
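ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"), so the same bar plot can be written slightly more concisely:

```r
# Equivalent bar plot using geom_col()
state_lib_jan_one %>%
  ggplot(aes(x = Time, y = Count)) +
  geom_col()
```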

Side-by-side box plot

Suppose we wanted to visualise the distribution of pedestrian count from the State Library sensor for each hour of the day (over the month of January, 2019). That is, we want to know what the central tendency, variability and shape of pedestrian count around the State Library looks like at 0:00, 1:00, 2:00, ..., 23:00. We will begin by filtering the data for only pedestrian counts from the State Library:

  • Take ped and pipe in the filter() function.
  • Filter Sensor to "State Library".
  • Store this filtered data in an object named state_lib.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for State Library
state_lib <- ped %>% 
  filter(Sensor == "State Library")

# Print state_lib
state_lib %>% head()
A tibble: 6 x 5
  Sensor        Date_Time           Date        Time Count
  <chr>         <dttm>              <date>     <dbl> <dbl>
1 State Library 2018-12-31 13:00:00 2019-01-01     0  1548
2 State Library 2018-12-31 14:00:00 2019-01-01     1  1494
3 State Library 2018-12-31 15:00:00 2019-01-01     2   878
4 State Library 2018-12-31 16:00:00 2019-01-01     3   309
5 State Library 2018-12-31 17:00:00 2019-01-01     4   133
6 State Library 2018-12-31 18:00:00 2019-01-01     5   110

Using state_lib, we can plot a side-by-side box plot of the pedestrian count around the State Library with the following steps:

  • Take state_lib and pipe in the ggplot() function
  • Add the aesthetic layer, which should have Time, Count and Time specified in the x, y and group argument inside of aes(). Note that aes() goes inside of ggplot().
  • Add the geom layer to tell R that the visual element we need for our plot is the boxplot.

Fill out the missing parts of the code chunk (???) and then run:

# Side-by-side box plot of pedestrian count for each hour of the day
state_lib %>%
  ggplot(
    aes(y = Count, x = Time, group = Time)
  ) + geom_boxplot()

Note that the group aesthetic will group the data (state_lib) by each hour of the day (Time), then create a box plot for each of these groups. Without the group aesthetic, ggplot will produce a single box plot of pedestrian count and use the Time variable as the width of the boxplot (and R will return a warning, asking you if you might have forgotten the group aesthetic).

# Box plot without Time specified as the group aesthetic 
state_lib %>%
  ggplot(aes(x = Time, y = Count)) +
  geom_boxplot()

The reason ggplot does not recognise that Time needs to be grouped (and we had to explicitly tell it to group the data by Time) is that Time is a numeric column. ggplot treats numeric columns as continuous, which is why it generates a single box plot when the group aesthetic is not specified.
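An alternative to the group aesthetic is to convert Time into a categorical variable, e.g. with factor(), so that ggplot draws one box per hour automatically. A sketch:

```r
# Converting Time to a factor makes ggplot treat each hour as a separate group
state_lib %>%
  ggplot(aes(x = factor(Time), y = Count)) +
  geom_boxplot()
```

Note that this makes the x-axis discrete, so its labels will look slightly different from the numeric version.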

Multiple locations

Suppose we are interested in the pedestrian count around Melbourne Central and the State Library (both are located near each other).

Filter for multiple sensors

Filter ped so that the pedestrian counts from only the Melbourne Central or State Library sensors are kept. This can be done with the following steps:

  • Take ped and pipe in the filter() function.
  • Use the %in% operator to filter Sensor so that only "Melbourne Central" or "State Library" are kept.
  • Store this filtered data in an object named mc_sl.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for the Melbourne Central and State Library sensors
mc_sl <- ped %>% 
  filter(Sensor %in% c("Melbourne Central", "State Library"))

# Print mc_sl
mc_sl %>% head()
A tibble: 6 x 5
  Sensor            Date_Time           Date        Time Count
  <chr>             <dttm>              <date>     <dbl> <dbl>
1 Melbourne Central 2018-12-31 13:00:00 2019-01-01     0    NA
2 Melbourne Central 2018-12-31 14:00:00 2019-01-01     1    NA
3 Melbourne Central 2018-12-31 15:00:00 2019-01-01     2    NA
4 Melbourne Central 2018-12-31 16:00:00 2019-01-01     3    NA
5 Melbourne Central 2018-12-31 17:00:00 2019-01-01     4    NA
6 Melbourne Central 2018-12-31 18:00:00 2019-01-01     5    NA

  • How many rows and columns are in the data mc_sl?
  • Explain why there are this many rows in mc_sl.
  • How would you filter for all sensors except Melbourne Central and State Library? (Hint: There are 31,992 rows in ped and 1,488 rows in mc_sl, so a data set filtered for all sensors except Melbourne Central and State Library should have 30,504 rows.)
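For the last question, one way to perform the negated filter (and confirm the row counts in the hint) is to wrap the %in% condition in the ! operator. A sketch:

```r
# Keep every sensor except Melbourne Central and State Library
not_mc_sl <- ped %>%
  filter(!(Sensor %in% c("Melbourne Central", "State Library")))

# Check: 31,992 - 1,488 = 30,504 rows expected
nrow(not_mc_sl)
```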

Facetted side-by-side box plots

We've seen how a side-by-side box plot provides a visualisation of the distribution of the data. To divide a plot into the different categories/measurements of a column in the data, we simply add the facet_wrap() layer onto our ggplot() call. Follow the steps below to produce side-by-side box plots separated by the sensors in mc_sl, i.e., Melbourne Central and State Library:

  • Take mc_sl and pipe in the ggplot() function
  • Add the aesthetic layer, which should have Time, Count and Time specified in the x, y and group argument inside of aes(). Note that aes() goes inside of ggplot().
  • Add the geom layer to tell R that the visual element we need for our plot is the boxplot.
  • Add the facet_wrap() layer to split the plot by Sensor.

Fill out the missing parts of the code chunk (???) and then run:

# Side-by-side box plot of pedestrian count for each hour of the day facetted by Sensor
mc_sl %>%
  ggplot(aes(x = Time, y = Count, group = Time)) +
  geom_boxplot() +
  facet_wrap(~ Sensor)

Immediately, we notice that it is difficult to compare the side-by-side box plots of the pedestrian count at Melbourne Central and the State Library because of the outliers in the Melbourne Central data. The sensor seems to have picked up moments in the 22nd and 23rd hours of the day where the number of pedestrians far exceeded the maximum at any other hour of the day. Filtering out these outliers will improve the interpretability of the side-by-side box plots.

It may be easier to compare the pedestrian counts from both locations if the subplots were positioned top-to-bottom instead of left-to-right. You can make this change by setting the number of columns in your facetted plot to 1, i.e., ncol = 1.

Fill out the missing parts of the code chunk (???) and then run:

# Remove outliers and produce facetted plot with 1 column
mc_sl %>%
  filter(Count < 5000) %>%
  ggplot(aes(x = Time, y = Count, group = Time)) +
  geom_boxplot() +
  facet_wrap(~ Sensor, ncol = 1)

Group exercises

Returning to ped, complete the following exercises, which will require knowledge of the following concepts:

  • Pipe operator %>%
  • dplyr wrangling functions, e.g., filter(), group_by(), summarise(), arrange(), etc.
  • Functions to use inside of summarise(), e.g., n_distinct(), sum(), etc.
  • ggplot2 to produce a bar chart.

1. Using summarise()

Use a wrangling verb to count the number of sensors in ped. Do all the sensors have the same number of measurements?

ped %>%
  summarise(num_sensors = n_distinct(Sensor))
A tibble: 1 x 1
  num_sensors
        <int>
1          43
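To answer the second part of the question, count() tallies the rows per sensor. Since ped has 31,992 rows and 744 × 43 = 31,992, every sensor should have 744 measurements (though, as the next exercise shows, some of those measurements are NA). A sketch:

```r
# Number of rows (hourly measurements) recorded for each sensor
ped %>% count(Sensor)
```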

2. Grouping the data

For each sensor, compute the total count for January. Which sensor had the largest count? Which sensor had the smallest count?

ped %>%
  group_by(Sensor) %>%
  summarise(sum = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(sum))
A tibble: 43 x 2
  Sensor                                    sum
  <chr>                                   <dbl>
Southbank 1395117
Town Hall (West) 1035715
Flinders Street Station Underpass 1015331
Spencer St-Collins St (North) 910109
Bourke Street Mall (North) 895483
The Arts Centre 884885
Princes Bridge 799066
Bourke Street Mall (South) 704858
St Kilda Rd-Alexandra Gardens 620895
Flinders St-Swanston St (West) 535146
State Library 494944
Collins St (North) 488458
Southern Cross Station 485848
Melbourne Central 473789
Melbourne Convention Exhibition Centre 451455
Bourke St-Russell St (West) 449123
Chinatown-Swanston St (North) 402761
QV Market-Elizabeth St (West) 383737
Sandridge Bridge 360679
Lonsdale St (South) 343747
Collins Place (South) 315492
Spencer St-Collins St (South) 257814
Chinatown-Lt Bourke St (South) 257471
Birrarung Marr 235438
Collins Place (North) 222458
New Quay 216206
Queen St (West) 202057
Lygon St (West) 194706
Alfred Place 181529
Lonsdale St-Spring St (West) 171063
Webb Bridge 150208
Grattan St-Swanston St (West) 121150
Victoria Point 117649
Flinders St-Spring St (West) 114549
Lygon St (East) 108837
QV Market-Peel St 95240
Flinders St-Spark La 94461
Monash Rd-Swanston St (West) 66420
Waterfront City 61481
Tin Alley-Swanston St (West) 38773
City Square 0
Flagstaff Station 0
Flinders St-Elizabeth St (East) 0

3. Sum of missing values with sum(is.na())

For each sensor, compute the total number of missing counts. Which sensor had the most missing counts? Why might this be?

ped %>%
 group_by(Sensor) %>%
 summarise(tot_missing = sum(is.na(Count))) %>%
 ungroup() %>%
 arrange(desc(tot_missing))
A tibble: 43 x 2
  Sensor                            tot_missing
  <chr>                                   <int>
City Square 744
Flagstaff Station 744
Flinders St-Elizabeth St (East) 744
Birrarung Marr 416
Melbourne Central 127
Monash Rd-Swanston St (West) 50
Grattan St-Swanston St (West) 38
Tin Alley-Swanston St (West) 25
St Kilda Rd-Alexandra Gardens 24
Waterfront City 21
Victoria Point 12
Bourke Street Mall (North) 8
Flinders St-Spark La 5
Alfred Place 4
Webb Bridge 3
Collins Place (North) 2
Flinders St-Spring St (West) 2
Chinatown-Swanston St (North) 1
Lygon St (East) 1
Lygon St (West) 1
New Quay 1
QV Market-Peel St 1
Southern Cross Station 1
Bourke St-Russell St (West) 0
Bourke Street Mall (South) 0
Chinatown-Lt Bourke St (South) 0
Collins Place (South) 0
Collins St (North) 0
Flinders St-Swanston St (West) 0
Flinders Street Station Underpass 0
Lonsdale St (South) 0
Lonsdale St-Spring St (West) 0
Melbourne Convention Exhibition Centre 0
Princes Bridge 0
QV Market-Elizabeth St (West) 0
Queen St (West) 0
Sandridge Bridge 0
Southbank 0
Spencer St-Collins St (North) 0
Spencer St-Collins St (South) 0
State Library 0
The Arts Centre 0
Town Hall (West) 0

4. Filtering multiple sensors and reshaping the data

Filter ped to contain the counts from the Melbourne Central and State Library sensors only, then use a tidying function to create two columns that contain their counts.

ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  spread(Sensor, Count)
A tibble: 744 x 5 (excerpt)
  Date_Time           Date        Time `Melbourne Central` `State Library`
  <dttm>              <date>     <dbl>               <dbl>           <dbl>
1 2018-12-31 13:00:00 2019-01-01     0                  NA            1548
2 2018-12-31 14:00:00 2019-01-01     1                  NA            1494
3 2018-12-31 15:00:00 2019-01-01     2                  NA             878
4 2018-12-31 16:00:00 2019-01-01     3                  NA             309
5 2018-12-31 17:00:00 2019-01-01     4                  NA             133
6 2018-12-31 18:00:00 2019-01-01     5                  NA             110
# ... with 738 more rows

5. Producing a 100 per cent chart

Create the following 100 per cent chart to compare the foot traffic at Melbourne Central and the State Library during different hours of the day. We can change the dimensions of our plot by changing the code chunk option.

  • By default, an R plot's height and width are set to 5 and 7 inches, respectively.
  • Set the height and width to 8 and 12 inches by adding fig.height=8 and fig.width=12 inside the code chunk option, i.e., from {r} to {r fig.height=8, fig.width=12}.

Note that R will return a warning to inform you that missing values in the data have been removed.

ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  ggplot(aes(x = Time, y = Count, fill = Sensor)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~ Date, ncol = 7) +
  labs(
    title = "Comparing foot traffic at Melbourne Central and the State Library during different hours of the day",
    subtitle = "Greater proportion of foot traffic at the State Library than Melbourne Central during the afternoon"
  )

Explain why the first 8 days of January appear this way.

All of the material is licensed under the Creative Commons BY-SA 4.0 license.