Pedestrian activity¶
The City of Melbourne has developed an automated pedestrian counting system to better understand pedestrian activity. Data is captured from counting sensors across various locations in Melbourne's CBD.
We've stored a subset of this data in a comma-separated values (.csv) file called melb_walk_wide.csv on GitHub (please open the link in a new tab). If you have trouble accessing it, download it to your working directory from Moodle. Clicking the Raw button on GitHub lets you view melb_walk_wide.csv in your web browser.
Reading a .csv file from GitHub¶
To read (or import) melb_walk_wide.csv into R:

- Load the `tidyverse`, which contains the `read_csv()` function (from the `readr` package) for reading .csv files into R. (A good rule of thumb is to always load the `tidyverse` before you begin any data analysis.)
- Copy the GitHub URL of melb_walk_wide.csv and paste it inside the `read_csv()` function. If you have already downloaded the file and it is in your working directory, just use the file path, e.g., `./melb_walk_wide.csv`.
- Store the data in an object named `ped_wide`.

Fill out the missing parts of the code chunk (`???`) and then run:
# Load tidyverse
library(tidyverse)
# Read melb_walk.csv from GitHub URL and store in object named ped_wide
ped_wide <- read_csv("https://raw.githubusercontent.com/quangvanbui/FIT5145-data/master/melb_walk_wide.csv")
# Alternatively, read it from your working directory
# ped_wide <- read_csv("./melb_walk_wide.csv")
# Print ped_wide
ped_wide
Date_Time | Date | Time | Alfred Place | Birrarung Marr | Bourke St-Russell St (West) | Bourke Street Mall (North) | Bourke Street Mall (South) | Chinatown-Lt Bourke St (South) | Chinatown-Swanston St (North) | ... | Spencer St-Collins St (North) | Spencer St-Collins St (South) | St Kilda Rd-Alexandra Gardens | State Library | The Arts Centre | Tin Alley-Swanston St (West) | Town Hall (West) | Victoria Point | Waterfront City | Webb Bridge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dttm> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
2018-12-31 13:00:00 | 2019-01-01 | 0 | 207 | 2733 | 1745 | 918 | 770 | 494 | 633 | ... | 1867 | 577 | 1360 | 1548 | 2800 | 20 | 3025 | 1217 | 1339 | 762 |
2018-12-31 14:00:00 | 2019-01-01 | 1 | 99 | 1086 | 1722 | 995 | 635 | 361 | 908 | ... | 1838 | 416 | 1896 | 1494 | 1746 | 12 | 3077 | 470 | 352 | 336 |
2018-12-31 15:00:00 | 2019-01-01 | 2 | 60 | 571 | 1113 | 416 | 262 | 304 | 469 | ... | 896 | 222 | 731 | 878 | 940 | 10 | 1927 | 202 | 100 | 138 |
2018-12-31 16:00:00 | 2019-01-01 | 3 | 21 | 208 | 786 | 382 | 194 | 208 | 289 | ... | 686 | 133 | 288 | 309 | 466 | 9 | 998 | 116 | 29 | 95 |
2018-12-31 17:00:00 | 2019-01-01 | 4 | 15 | 83 | 405 | 165 | 106 | 149 | 155 | ... | 346 | 87 | 134 | 133 | 109 | 1 | 472 | 31 | 5 | 25 |
2018-12-31 18:00:00 | 2019-01-01 | 5 | 31 | 46 | 249 | 117 | 53 | 65 | 93 | ... | 284 | 50 | 63 | 110 | 64 | 3 | 209 | 17 | 1 | 30 |
2018-12-31 19:00:00 | 2019-01-01 | 6 | 17 | 48 | 117 | 48 | 55 | 51 | 44 | ... | 285 | 82 | 69 | 42 | 57 | NA | 131 | 27 | 10 | 16 |
2018-12-31 20:00:00 | 2019-01-01 | 7 | 31 | 49 | 76 | 47 | 51 | 72 | 73 | ... | 292 | 70 | 70 | 50 | 84 | 5 | 153 | 9 | 11 | 32 |
2018-12-31 21:00:00 | 2019-01-01 | 8 | 46 | 65 | 99 | 86 | 87 | 25 | 113 | ... | 377 | 70 | 83 | 83 | 124 | 2 | 207 | 15 | 19 | 56 |
2018-12-31 22:00:00 | 2019-01-01 | 9 | 116 | 116 | 131 | 253 | 258 | 98 | 82 | ... | 567 | 127 | 170 | 128 | 268 | 19 | 388 | 28 | 39 | 124 |
2018-12-31 23:00:00 | 2019-01-01 | 10 | 112 | 225 | 294 | 956 | 689 | 368 | 302 | ... | 963 | 208 | 282 | 284 | 536 | 15 | 955 | 92 | 59 | 154 |
2019-01-01 00:00:00 | 2019-01-01 | 11 | 176 | 291 | 530 | 1542 | 1141 | 446 | 471 | ... | 1135 | 193 | 562 | 473 | 1132 | 31 | 1369 | 79 | 136 | 193 |
2019-01-01 01:00:00 | 2019-01-01 | 12 | 100 | 354 | 634 | 2084 | 1634 | 656 | 952 | ... | 926 | 241 | 710 | 702 | 1427 | 22 | 1887 | 75 | 162 | 250 |
2019-01-01 02:00:00 | 2019-01-01 | 13 | 85 | 465 | 802 | 2324 | 1723 | 746 | 1240 | ... | 939 | 194 | 986 | 815 | 1608 | 18 | 2198 | 66 | 187 | 277 |
2019-01-01 03:00:00 | 2019-01-01 | 14 | 61 | 452 | 802 | 2677 | 1831 | 472 | 1214 | ... | 922 | 183 | 1076 | 968 | 1927 | 23 | 2343 | 36 | 169 | 299 |
2019-01-01 04:00:00 | 2019-01-01 | 15 | 135 | 482 | 861 | 2658 | 2057 | 368 | 1207 | ... | 964 | 186 | 1056 | 1063 | 1752 | 25 | 2430 | 59 | 200 | 287 |
2019-01-01 05:00:00 | 2019-01-01 | 16 | 101 | 620 | 882 | 2819 | 1989 | 313 | 1128 | ... | 1120 | 202 | 1077 | 1042 | 1590 | 21 | 2348 | 86 | 168 | 280 |
2019-01-01 06:00:00 | 2019-01-01 | 17 | 101 | 975 | 963 | 2587 | 1727 | 452 | 1090 | ... | 1107 | 203 | 1200 | 1084 | 1244 | 28 | 2239 | 85 | 142 | 258 |
2019-01-01 07:00:00 | 2019-01-01 | 18 | 144 | 1600 | 1150 | 2095 | 994 | 659 | 1260 | ... | 1066 | 217 | 885 | 997 | 731 | 31 | 1829 | 78 | 186 | 211 |
2019-01-01 08:00:00 | 2019-01-01 | 19 | 82 | 374 | 1209 | 1365 | 605 | 740 | 1079 | ... | 795 | 144 | 598 | 1011 | 637 | 19 | 1500 | 103 | 196 | 155 |
2019-01-01 09:00:00 | 2019-01-01 | 20 | 127 | 202 | 1047 | 954 | 450 | 670 | 1035 | ... | 665 | 175 | 502 | 912 | 493 | 26 | 1283 | 48 | 213 | 124 |
2019-01-01 10:00:00 | 2019-01-01 | 21 | 62 | 219 | 976 | 801 | 321 | 536 | 919 | ... | 499 | 165 | 344 | 592 | 471 | 20 | 992 | 51 | 172 | 147 |
2019-01-01 11:00:00 | 2019-01-01 | 22 | 67 | 2211 | 649 | 411 | 222 | 386 | 550 | ... | 540 | 149 | 773 | 343 | 601 | 10 | 826 | 35 | 101 | 73 |
2019-01-01 12:00:00 | 2019-01-01 | 23 | 32 | 237 | 435 | 165 | 103 | 205 | 340 | ... | 468 | 96 | 159 | 184 | 116 | 4 | 451 | 22 | 48 | 42 |
2019-01-01 13:00:00 | 2019-01-02 | 0 | 15 | 35 | 258 | 86 | 43 | 105 | 147 | ... | 224 | 51 | 46 | 61 | 77 | 1 | 196 | 35 | 12 | 20 |
2019-01-01 14:00:00 | 2019-01-02 | 1 | 22 | 25 | 163 | 41 | 32 | 72 | 90 | ... | 109 | 39 | 18 | 46 | 41 | 1 | 82 | 72 | 7 | 18 |
2019-01-01 15:00:00 | 2019-01-02 | 2 | 5 | 9 | 72 | 52 | 18 | 40 | 44 | ... | 52 | 11 | 9 | 24 | 45 | 4 | 41 | 8 | 1 | 1 |
2019-01-01 16:00:00 | 2019-01-02 | 3 | 6 | 19 | 65 | 18 | 16 | 22 | 31 | ... | 23 | 26 | 9 | 17 | 67 | NA | 20 | 4 | 1 | 3 |
2019-01-01 17:00:00 | 2019-01-02 | 4 | 10 | 27 | 44 | 10 | 19 | 34 | 17 | ... | 35 | 17 | 20 | 10 | 91 | 2 | 31 | 5 | 1 | 1 |
2019-01-01 18:00:00 | 2019-01-02 | 5 | 12 | 40 | 30 | 31 | 15 | 28 | 12 | ... | 128 | 32 | 31 | 18 | 288 | 5 | 84 | 2 | 1 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2019-01-30 07:00:00 | 2019-01-30 | 18 | 341 | NA | 900 | 1766 | 1169 | 497 | 827 | ... | 2792 | 518 | 1107 | 1420 | 1115 | 101 | 2271 | 367 | 63 | 333 |
2019-01-30 08:00:00 | 2019-01-30 | 19 | 169 | NA | 749 | 1083 | 586 | 589 | 795 | ... | 1170 | 258 | 748 | 975 | 815 | 110 | 1348 | 170 | 109 | 168 |
2019-01-30 09:00:00 | 2019-01-30 | 20 | 104 | NA | 650 | 683 | 383 | 652 | 682 | ... | 820 | 212 | 420 | 840 | 294 | 50 | 1104 | 89 | 82 | 120 |
2019-01-30 10:00:00 | 2019-01-30 | 21 | 61 | NA | 541 | 457 | 286 | 406 | 401 | ... | 711 | 171 | 294 | 582 | 630 | 29 | 770 | 93 | 43 | 76 |
2019-01-30 11:00:00 | 2019-01-30 | 22 | 110 | NA | 418 | 368 | 175 | 366 | 309 | ... | 604 | 119 | 206 | 337 | 276 | 23 | 481 | 82 | 50 | 68 |
2019-01-30 12:00:00 | 2019-01-30 | 23 | 23 | NA | 319 | 125 | 85 | 282 | 204 | ... | 302 | 56 | 90 | 190 | 88 | 47 | 264 | 48 | 18 | 29 |
2019-01-30 13:00:00 | 2019-01-31 | 0 | 19 | NA | 162 | 64 | 39 | 111 | 74 | ... | 141 | 49 | 64 | 86 | 55 | 3 | 106 | 12 | 7 | 11 |
2019-01-30 14:00:00 | 2019-01-31 | 1 | 14 | NA | 99 | 59 | 23 | 42 | 64 | ... | 29 | 24 | 28 | 41 | 23 | 2 | 57 | 6 | 5 | 3 |
2019-01-30 15:00:00 | 2019-01-31 | 2 | 2 | NA | 61 | 19 | 17 | 30 | 18 | ... | 18 | 12 | 12 | 26 | 28 | 1 | 99 | 8 | 1 | 2 |
2019-01-30 16:00:00 | 2019-01-31 | 3 | 5 | NA | 34 | NA | 11 | 14 | 12 | ... | 36 | 10 | 7 | 4 | 26 | NA | 29 | NA | NA | 2 |
2019-01-30 17:00:00 | 2019-01-31 | 4 | 5 | NA | 17 | NA | 15 | 32 | 9 | ... | 19 | 11 | 10 | 5 | 56 | 2 | 38 | 2 | 1 | 3 |
2019-01-30 18:00:00 | 2019-01-31 | 5 | 19 | NA | 42 | NA | 15 | 12 | 16 | ... | 265 | 57 | 57 | 27 | 94 | 2 | 64 | 21 | 3 | 52 |
2019-01-30 19:00:00 | 2019-01-31 | 6 | 103 | NA | 97 | NA | 71 | 20 | 30 | ... | 969 | 259 | 325 | 107 | 298 | 34 | 217 | 64 | 27 | 120 |
2019-01-30 20:00:00 | 2019-01-31 | 7 | 292 | NA | 196 | NA | 208 | 95 | 105 | ... | 2173 | 662 | 887 | 300 | 801 | 39 | 458 | 287 | 66 | 385 |
2019-01-30 21:00:00 | 2019-01-31 | 8 | 856 | NA | 342 | NA | 595 | 184 | 168 | ... | 3981 | 1573 | 1507 | 643 | 1629 | 193 | 899 | 528 | 86 | 568 |
2019-01-30 22:00:00 | 2019-01-31 | 9 | 683 | NA | 365 | NA | 777 | 181 | 174 | ... | 2816 | 972 | 902 | 612 | 1024 | 203 | 1035 | 433 | 35 | 319 |
2019-01-30 23:00:00 | 2019-01-31 | 10 | 482 | NA | 425 | NA | 1055 | 277 | 299 | ... | 1596 | 600 | 981 | 653 | 896 | 100 | 1381 | 281 | 49 | 165 |
2019-01-31 00:00:00 | 2019-01-31 | 11 | 462 | 161 | 637 | 1316 | 1507 | 390 | 552 | ... | 1555 | 593 | 1166 | 836 | 1215 | 121 | 1931 | 285 | 59 | 227 |
2019-01-31 01:00:00 | 2019-01-31 | 12 | 1239 | 392 | 1483 | 2766 | 2804 | 723 | 1091 | ... | 2429 | 1145 | 1537 | 1355 | 1393 | 147 | 3100 | 528 | 135 | 636 |
2019-01-31 02:00:00 | 2019-01-31 | 13 | 1330 | 258 | 1590 | 2978 | 2965 | 778 | 1139 | ... | 2383 | 1135 | 1726 | 1587 | 1508 | 140 | 3355 | 441 | 113 | 545 |
2019-01-31 03:00:00 | 2019-01-31 | 14 | 700 | NA | 1105 | 2154 | 2351 | 453 | 837 | ... | 1955 | 591 | 1657 | 1253 | 1355 | 149 | 2948 | 298 | 67 | 213 |
2019-01-31 04:00:00 | 2019-01-31 | 15 | 451 | NA | 968 | 2428 | 2180 | 516 | 873 | ... | 2178 | 576 | 1665 | 1246 | 1526 | 115 | 2861 | 307 | 56 | 259 |
2019-01-31 05:00:00 | 2019-01-31 | 16 | 522 | NA | 950 | 2374 | 2233 | 398 | 814 | ... | 3479 | 680 | 1864 | 1508 | 1644 | 128 | 2793 | 613 | 64 | 416 |
2019-01-31 06:00:00 | 2019-01-31 | 17 | 684 | NA | 1162 | 2937 | 2430 | 574 | 901 | ... | 4777 | 1154 | 2500 | 1814 | 2348 | 183 | 3166 | 891 | 66 | 701 |
2019-01-31 07:00:00 | 2019-01-31 | 18 | 377 | NA | 1126 | 2523 | 1756 | 638 | 1065 | ... | 2698 | 612 | 2057 | 1522 | 1960 | 107 | 2577 | 439 | 93 | 471 |
2019-01-31 08:00:00 | 2019-01-31 | 19 | 144 | NA | 1010 | 1560 | 1009 | 747 | 1093 | ... | 1347 | 278 | 1398 | 1056 | 1669 | 69 | 1648 | 183 | 63 | 215 |
2019-01-31 09:00:00 | 2019-01-31 | 20 | 103 | NA | 942 | 1185 | 700 | 527 | 863 | ... | 922 | 272 | 667 | 981 | 436 | 134 | 1439 | 122 | 75 | 128 |
2019-01-31 10:00:00 | 2019-01-31 | 21 | 66 | NA | 763 | 726 | 395 | 499 | 749 | ... | 814 | 210 | 520 | 719 | 791 | 35 | 1066 | 71 | 119 | 125 |
2019-01-31 11:00:00 | 2019-01-31 | 22 | 104 | NA | 618 | 324 | 203 | 358 | 442 | ... | 587 | 135 | 759 | 449 | 1444 | 14 | 702 | 39 | 50 | 62 |
2019-01-31 12:00:00 | 2019-01-31 | 23 | 12 | NA | 394 | 169 | 95 | 198 | 245 | ... | 323 | 80 | 121 | 177 | 187 | 7 | 358 | 29 | 24 | 35 |
Note that when we load the `tidyverse`, R returns messages and warnings telling us which `tidyverse` packages have been loaded into our R session, when some of the packages were built, and so on. R also returns a message after reading melb_walk_wide.csv with the `read_csv()` function to let us know how it has specified each column type of the data.
`read_csv()` or `read.csv()`?¶
While base R provides the `read.csv()` function for reading .csv files into R, the `read_csv()` function (from the `readr` package, which is part of the `tidyverse`) reads .csv files approximately 10 times faster than `read.csv()`. This means a .csv file that would take `read.csv()` 60 minutes to read into R would take `read_csv()` only about 6 minutes.
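The two functions also share a similar call shape. The sketch below is self-contained: it writes a tiny .csv to a temporary file (not the pedestrian data) and reads it back with base R's `read.csv()`; the equivalent `readr` call is shown commented out in case the package is not installed.

```r
# Write a tiny .csv to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("Time,Count", "0,207", "1,99"), tmp)

# Base R: returns a data.frame
base_df <- read.csv(tmp)
print(class(base_df))   # "data.frame"
print(base_df$Count)    # 207 99

# tidyverse (readr): same call shape, returns a tibble and is faster on large files
# library(readr)
# tidy_df <- read_csv(tmp)
```

The speed difference only matters for large files; for a 744-row file like ours, either function is effectively instant.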
Look at `ped_wide`¶
To print out or look at what is inside `ped_wide`, type `ped_wide` in a code chunk and run it.
# Print the first 3 rows of ped_wide
ped_wide %>% head(n = 3)
Date_Time | Date | Time | Alfred Place | Birrarung Marr | Bourke St-Russell St (West) | Bourke Street Mall (North) | Bourke Street Mall (South) | Chinatown-Lt Bourke St (South) | Chinatown-Swanston St (North) | ⋯ | Spencer St-Collins St (North) | Spencer St-Collins St (South) | St Kilda Rd-Alexandra Gardens | State Library | The Arts Centre | Tin Alley-Swanston St (West) | Town Hall (West) | Victoria Point | Waterfront City | Webb Bridge |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dttm> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
2018-12-31 13:00:00 | 2019-01-01 | 0 | 207 | 2733 | 1745 | 918 | 770 | 494 | 633 | ⋯ | 1867 | 577 | 1360 | 1548 | 2800 | 20 | 3025 | 1217 | 1339 | 762 |
2018-12-31 14:00:00 | 2019-01-01 | 1 | 99 | 1086 | 1722 | 995 | 635 | 361 | 908 | ⋯ | 1838 | 416 | 1896 | 1494 | 1746 | 12 | 3077 | 470 | 352 | 336 |
2018-12-31 15:00:00 | 2019-01-01 | 2 | 60 | 571 | 1113 | 416 | 262 | 304 | 469 | ⋯ | 896 | 222 | 731 | 878 | 940 | 10 | 1927 | 202 | 100 | 138 |
There are circumstances when printing out all of `ped_wide` is unnecessary. For example, in a report communicating our analysis, we should never include a table of the entire data set (if you were a project partner reading a report from an analyst, how would you feel about wading through a 744-by-46 table?). You can instead print the head of the data set using the `head()` function, which returns the first 6 rows of the data - a small extract of the data.
Fill out the missing parts of the code chunk (`???`) and then run:
# head(ped_wide)
# or use %>% to pipe the output to head() function
ped_wide %>% head()
Notice that there are 744 rows and 46 columns. The columns in `ped_wide` and their definitions are provided below:

- `Date_Time` - date and time stamp of the recorded pedestrian foot traffic count, in the UTC timezone
- `Date` - date of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
- `Time` - hour of the day (24-hour time) of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
- `Alfred Place` - number of pedestrians counted over a one-hour period by the sensor located at Alfred Place
- `Birrarung Marr` - number of pedestrians counted over a one-hour period by the sensor located at Birrarung Marr
- $\vdots$
- `Webb Bridge` - number of pedestrians counted over a one-hour period by the sensor located at Webb Bridge
Note that the dates and hours in the variables `Date` and `Time` differ from `Date_Time` because of timezone differences.
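The first row of the table above illustrates this: 2018-12-31 13:00:00 UTC is midnight on New Year's Day in Melbourne (UTC+11 in January). A minimal base R sketch of the conversion, assuming your system has the standard timezone database:

```r
# The first Date_Time value, interpreted as UTC
utc <- as.POSIXct("2018-12-31 13:00:00", tz = "UTC")

# Render it in Melbourne's timezone: UTC+11 during January (daylight saving)
melb <- format(utc, "%Y-%m-%d %H:%M", tz = "Australia/Melbourne")
print(melb)  # "2019-01-01 00:00" - matching Date = 2019-01-01, Time = 0
```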
This type of data is called a time series, or temporal data, because it contains information recorded over time. In this example, we have hourly pedestrian counts for a number of locations in Melbourne from January 1 to 31, 2019. Confirm the time period of our data by following the steps below:

- Take `ped_wide` and pipe it into the `arrange()` function.
- Arrange the data by the column `Date`.
- Pipe in the `summarise()` function and return the first and last date with the `first()` and `last()` functions.

Fill out the missing parts of the code chunk (`???`) and then run:
# First and last date in the data
ped_wide %>%
arrange(Date) %>%
summarise(
first_date = first(Date),
last_date = last(Date)
)
first_date | last_date |
---|---|
<date> | <date> |
2019-01-01 | 2019-01-31 |
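Arranging first guarantees that `first()` and `last()` line up with the earliest and latest dates. An equivalent check, sketched here in base R on a toy unsorted vector, uses `min()` and `max()` and needs no sorting at all:

```r
# A small unsorted vector of dates standing in for ped_wide$Date
dates <- as.Date(c("2019-01-17", "2019-01-01", "2019-01-31"))

print(min(dates))  # 2019-01-01
print(max(dates))  # 2019-01-31
```

In the pipeline above, `summarise(first_date = min(Date), last_date = max(Date))` would give the same result without the `arrange()` step.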
Convert to long form¶
It is helpful to think of a data set as either wide or long. The pedestrian count data, `ped_wide`, is presented in wide form, which is to say that the attributes of the data are presented horizontally. Converting `ped_wide` into long form presents the same attributes vertically, i.e., no information is lost by reshaping the data.

So why should we reshape the data into long form? A data set represented in long form is considered a tidy data set and allows us to use all the tools from the `tidyverse`. The tools in the `tidyverse` are designed for us to work in a principled and consistent way, but they require the data to be represented the tidy way (long form). We will see later how the `dplyr` functions for wrangling data and the `ggplot2` package for producing graphics (both part of the `tidyverse`) work seamlessly when the data is in tidy long form.
Of course, there are instances when a wide form representation of the data is necessary (some models need to be trained with data in a wide form).
Follow the steps below to convert `ped_wide` into tidy long form data:

- Take `ped_wide` and pipe it into the `gather()` function.
- Inside `gather()`, specify the `key` as `Sensor` and the `value` as `Count`, and gather all columns in `ped_wide` except `Date_Time`, `Date` and `Time`.
- Store this tidy long form data of pedestrian counts in an object named `ped`.

Fill out the missing parts of the code chunk (`???`) and then run:
# Convert the data into a long form
ped <- ped_wide %>%
gather(
key = Sensor,
value = Count, -Date_Time, -Date, -Time
) %>%
select(Sensor, everything(), Count)
# Print ped
ped %>% head()
Sensor | Date_Time | Date | Time | Count |
---|---|---|---|---|
<chr> | <dttm> | <date> | <dbl> | <dbl> |
Alfred Place | 2018-12-31 13:00:00 | 2019-01-01 | 0 | 207 |
Alfred Place | 2018-12-31 14:00:00 | 2019-01-01 | 1 | 99 |
Alfred Place | 2018-12-31 15:00:00 | 2019-01-01 | 2 | 60 |
Alfred Place | 2018-12-31 16:00:00 | 2019-01-01 | 3 | 21 |
Alfred Place | 2018-12-31 17:00:00 | 2019-01-01 | 4 | 15 |
Alfred Place | 2018-12-31 18:00:00 | 2019-01-01 | 5 | 31 |
While `ped_wide` contains 744 rows and 46 columns of data and `ped` contains 31,992 rows and 5 columns, no information is lost by reshaping the data. In `ped_wide`, the pedestrian count from each sensor was presented in its own column; in `ped`, there is one column containing the sensor name/location and another with its pedestrian count. This means that each row in `ped` captures the number of pedestrians counted over a one-hour time window at a given location.
Note about the arguments in a function¶
It is not essential to type out the name of a function's argument(s) when specifying what that argument should be. For example, the `gather()` call used above specified the `Sensor` variable in the `key` argument and the `Count` variable in the `value` argument:
ped_wide %>%
gather(key = Sensor, value = Count, -Date_Time, -Date, -Time)
We can achieve the same result without explicitly providing the argument names:
ped_wide %>%
gather(Sensor, Count, -Date_Time, -Date, -Time)
This is because the arguments are ordered, i.e., the key goes first, then the value. By providing `Sensor` first and `Count` next, `gather()` knows that `Sensor` is meant for the `key` argument and `Count` for the `value` argument.
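The same positional-matching rule applies to any R function. A minimal sketch with a hypothetical two-argument function (`describe` is invented here for illustration):

```r
# A hypothetical function with two arguments, in order: key, value
describe <- function(key, value) paste("key =", key, "| value =", value)

# Named and positional calls are equivalent when the order matches
a <- describe(key = "Sensor", value = "Count")
b <- describe("Sensor", "Count")
print(identical(a, b))  # TRUE

# Naming the arguments lets you supply them in any order
c <- describe(value = "Count", key = "Sensor")
print(identical(a, c))  # TRUE
```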
State Library¶
We will explore pedestrian activity around the State Library on the 1st of January, 2019. To do this, we will need to filter `ped` for the State Library sensor on the 1st of January, 2019.
- Take `ped` and pipe it into the `filter()` function.
- Filter `Date` to `"2019-01-01"` and `Sensor` to `"State Library"`.
- Store this filtered data in an object named `state_lib_jan_one`.
Fill out the missing parts of the code chunk (`???`) and then run:
# Filter for State Library data on Jan 1, 2019
state_lib_jan_one <- ped %>%
filter(
Date == "2019-01-01",
Sensor == "State Library"
)
# Print state_lib_jan_one
state_lib_jan_one %>% head()
Sensor | Date_Time | Date | Time | Count |
---|---|---|---|---|
<chr> | <dttm> | <date> | <dbl> | <dbl> |
State Library | 2018-12-31 13:00:00 | 2019-01-01 | 0 | 1548 |
State Library | 2018-12-31 14:00:00 | 2019-01-01 | 1 | 1494 |
State Library | 2018-12-31 15:00:00 | 2019-01-01 | 2 | 878 |
State Library | 2018-12-31 16:00:00 | 2019-01-01 | 3 | 309 |
State Library | 2018-12-31 17:00:00 | 2019-01-01 | 4 | 133 |
State Library | 2018-12-31 18:00:00 | 2019-01-01 | 5 | 110 |
This tells R to take `ped`, filter it for data captured by the State Library sensor on the 1st of January, 2019, and store the filtered data in an object named `state_lib_jan_one`. If you have done this successfully, you'll see `state_lib_jan_one` in your RStudio Environment tab, and your `state_lib_jan_one` data should look like the output above.
- How many rows and columns are in `state_lib_jan_one`?
- Explain why there are this many rows. (This may seem obvious, but if you develop a checking mechanism like this, you'll be able to spot data quality or coding issues much sooner, which can save you a lot of time.)
- In which hour is the pedestrian count highest? Explain whether or not this makes sense.
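One way to answer the first question is with `dim()` (or `nrow()` and `ncol()`). The sketch below uses a toy data frame built to have the shape you should expect from `state_lib_jan_one` - one row per hour of the day, with the same five columns:

```r
# Toy frame shaped like state_lib_jan_one: one row per hour, 5 columns
toy <- data.frame(
  Sensor    = "State Library",
  Date_Time = Sys.time() + 3600 * 0:23,
  Date      = as.Date("2019-01-01"),
  Time      = 0:23,
  Count     = 0
)

print(dim(toy))   # 24  5 - 24 hourly rows, 5 columns
print(nrow(toy))  # 24
print(ncol(toy))  # 5
```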
Line plot¶
A better way to understand the pedestrian count around the State Library sensor in each hour of the day (on Jan 1st, 2019) is to produce a visualisation. Line plots are typically used to visualise time-series data sets, with the x-axis representing the time or date (or both) and the y-axis representing some time-series process. To produce a line plot of the pedestrian count around the State Library for each hour of the day:

- Take `state_lib_jan_one` and pipe it into the `ggplot()` function.
- Specify the aesthetics layer, i.e., what should be placed on the x and y-axes. This goes inside `aes()`, which goes inside `ggplot()`.
- Add the geometric (or geom) layer to tell R that the visual element we need for our plot is the line.
Fill out the missing parts of the code chunk (`???`) and then run:
# Line plot of State Library pedestrian count
state_lib_jan_one %>%
ggplot(aes(y = Count, x = Time)) +
geom_line()
Describe the pedestrian count from 0:00 to 23:00 on January 1st, 2019, i.e., when is the peak, trough, steepest decline, etc. Would you expect this pattern to appear the following day?
Bar plot¶
You can copy and run the following code chunk to produce the equivalent plot using bars, i.e., a bar plot of the State Library pedestrian count for each hour of the day.
# Bar plot of count
state_lib_jan_one %>%
ggplot(aes(y = Count, x = Time)) +
geom_bar(stat = "identity")
Side-by-side box plot¶
Suppose we wanted to visualise the distribution of pedestrian count from the State Library sensor for each hour of the day (over the month of January, 2019). That is, we want to know what the central tendency, variability and shape of pedestrian count around the State Library looks like at 0:00, 1:00, 2:00, ..., 23:00. We will begin by filtering the data for only pedestrian counts from the State Library:
- Take `ped` and pipe it into the `filter()` function.
- Filter `Sensor` to `"State Library"`.
- Store this filtered data in an object named `state_lib`.
Fill out the missing parts of the code chunk (`???`) and then run:
# Filter for State Library
state_lib <- ped %>%
filter(Sensor == "State Library")
# Print state_lib
state_lib %>% head()
Sensor | Date_Time | Date | Time | Count |
---|---|---|---|---|
<chr> | <dttm> | <date> | <dbl> | <dbl> |
State Library | 2018-12-31 13:00:00 | 2019-01-01 | 0 | 1548 |
State Library | 2018-12-31 14:00:00 | 2019-01-01 | 1 | 1494 |
State Library | 2018-12-31 15:00:00 | 2019-01-01 | 2 | 878 |
State Library | 2018-12-31 16:00:00 | 2019-01-01 | 3 | 309 |
State Library | 2018-12-31 17:00:00 | 2019-01-01 | 4 | 133 |
State Library | 2018-12-31 18:00:00 | 2019-01-01 | 5 | 110 |
Using `state_lib`, we can plot a side-by-side box plot of the pedestrian count around the State Library with the following steps:

- Take `state_lib` and pipe it into the `ggplot()` function.
- Add the aesthetic layer, with `Time`, `Count` and `Time` specified in the `x`, `y` and `group` arguments inside `aes()`. Note that `aes()` goes inside `ggplot()`.
- Add the geom layer to tell R that the visual element we need for our plot is the boxplot.
Fill out the missing parts of the code chunk (`???`) and then run:
# Side-by-side box plot of pedestrian count for each hour of the day
state_lib %>%
ggplot(
aes(y = Count, x = Time, group = Time)
) + geom_boxplot()
Note that the `group` aesthetic will group the data (`state_lib`) by each hour of the day (`Time`), then create a box plot for each of these groups. Without the `group` aesthetic, `ggplot` will produce a single box plot of the pedestrian count and use the `Time` variable as the width of the boxplot (and R will return a warning, asking whether you might have forgotten the `group` aesthetic).
# Box plot without Time specified as the group aesthetic
state_lib %>%
ggplot(aes(x = Time, y = Count)) +
geom_boxplot()
The reason `ggplot` does not recognise that `Time` needs to be grouped (and we had to tell it explicitly to group the data by `Time`) is that `Time` is a numeric column. `ggplot` assumes that numeric columns are all 'connected', which is why it generates a single box plot when the `group` aesthetic is not specified.
Multiple locations¶
Suppose we are interested in the pedestrian count around Melbourne Central and the State Library (both are located near each other).
Filter for multiple sensors¶
Filter `ped` so that only the pedestrian counts from the Melbourne Central or State Library sensors are kept. This can be done with the following steps:

- Take `ped` and pipe it into the `filter()` function.
- Use the `%in%` operator to filter `Sensor` so that only `"Melbourne Central"` or `"State Library"` are kept.
- Store this filtered data in an object named `mc_sl`.
Fill out the missing parts of the code chunk (`???`) and then run:
# Filter for the Melbourne Central and State Library sensors
mc_sl <- ped %>%
    filter(Sensor %in% c("Melbourne Central", "State Library"))
# Print mc_sl
mc_sl %>% head()
Sensor | Date_Time | Date | Time | Count |
---|---|---|---|---|
<chr> | <dttm> | <date> | <dbl> | <dbl> |
Melbourne Central | 2018-12-31 13:00:00 | 2019-01-01 | 0 | NA |
Melbourne Central | 2018-12-31 14:00:00 | 2019-01-01 | 1 | NA |
Melbourne Central | 2018-12-31 15:00:00 | 2019-01-01 | 2 | NA |
Melbourne Central | 2018-12-31 16:00:00 | 2019-01-01 | 3 | NA |
Melbourne Central | 2018-12-31 17:00:00 | 2019-01-01 | 4 | NA |
Melbourne Central | 2018-12-31 18:00:00 | 2019-01-01 | 5 | NA |
- How many rows and columns are in the data `mc_sl`?
- Explain why there are this many rows in `mc_sl`.
- How would you filter for all sensors except Melbourne Central and State Library? (Hint: There are 31,992 rows in `ped` and 1,488 rows in `mc_sl`, so a data set filtered for all sensors except Melbourne Central and State Library should have 30,504 rows.)
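For the last question, negating the `%in%` condition does the trick. The set logic, sketched in base R on a toy vector of sensor names:

```r
sensors <- c("Melbourne Central", "State Library", "Southbank", "Town Hall (West)")
exclude <- c("Melbourne Central", "State Library")

# ! flips the logical vector returned by %in%
kept <- sensors[!(sensors %in% exclude)]
print(kept)  # "Southbank" "Town Hall (West)"
```

The `dplyr` version would be `ped %>% filter(!(Sensor %in% c("Melbourne Central", "State Library")))`.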
Facetted side-by-side box plots¶
We've seen how a side-by-side box plot provides a visualisation of the distribution of the data. To divide a plot into the different categories/measurements of a column in the data, we simply add the `facet_wrap()` layer onto our `ggplot()` call. Follow the steps below to produce side-by-side box plots separated by the sensors in `mc_sl`, i.e., Melbourne Central and State Library:
- Take `mc_sl` and pipe it into the `ggplot()` function.
- Add the aesthetic layer, with `Time`, `Count` and `Time` specified in the `x`, `y` and `group` arguments inside `aes()`. Note that `aes()` goes inside `ggplot()`.
- Add the geom layer to tell R that the visual element we need for our plot is the boxplot.
- Add the `facet_wrap()` layer to split the plot by `Sensor`.
Fill out the missing parts of the code chunk (`???`) and then run:
# Side-by-side box plot of pedestrian count for each hour of the day facetted by Sensor
mc_sl %>%
ggplot(aes(x = Time, y = Count, group = Time)) +
geom_boxplot() +
facet_wrap(~ Sensor)
Immediately, we notice that it is difficult to compare the side-by-side box plots of the pedestrian count at Melbourne Central and the State Library because of the outliers in the Melbourne Central data. The sensor seems to have picked up moments in the 22nd and 23rd hours of the day where the number of pedestrians far exceeded the maximum value at any other hour. Filtering out these outliers will improve the interpretability of the side-by-side box plots.
It may be easier to compare the pedestrian counts from both locations if the subplots were positioned top-to-bottom instead of left-to-right. You can make this change by setting the number of columns in your facetted plot to 1, i.e., `ncol = 1`.
Fill out the missing parts of the code chunk (`???`) and then run:
# Remove outliers and produce facetted plot with 1 column
mc_sl %>%
filter(Count < 5000) %>%
ggplot(aes(x = Time, y = Count, group = Time)) +
geom_boxplot() +
facet_wrap(~ Sensor, ncol = 1)
Group exercises¶
Returning to `ped`, complete the following exercises, which will require knowledge of the following concepts:

- The pipe operator `%>%`
- `dplyr` wrangling functions, e.g., `filter()`, `group_by()`, `summarise()`, `arrange()`, etc.
- Functions to use inside `summarise()`, e.g., `n_distinct()`, `sum()`, etc.
- `ggplot2` to produce a bar chart.
1. Using `summarise()`¶
Use a wrangling verb to count the number of sensors in `ped`. Do all the sensors have the same number of measurements?
ped %>%
summarise(num_sensors = n_distinct(Sensor))
num_sensors |
---|
<int> |
43 |
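To answer the second part of the question - whether all sensors have the same number of measurements - you could count rows per sensor, e.g., `ped %>% count(Sensor)`. The counting idea, sketched in base R with `table()` on a toy sensor column:

```r
# Toy sensor column: each sensor appears the same number of times
sensor <- rep(c("Alfred Place", "State Library", "Webb Bridge"), each = 4)
counts <- table(sensor)
print(counts)

# All sensors have the same number of rows when the counts are all equal
print(length(unique(as.vector(counts))) == 1)  # TRUE
```

In `ped`, every sensor has 744 rows (31,992 rows / 43 sensors), one per hour of January, although some of those rows hold `NA` counts.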
2. Grouping the data¶
For each sensor, compute the total count for January. Which sensor had the largest count? Which sensor had the smallest count?
ped %>%
group_by(Sensor) %>%
summarise(sum = sum(Count, na.rm = TRUE)) %>%
ungroup() %>%
arrange(desc(sum))
Sensor | sum |
---|---|
<chr> | <dbl> |
Southbank | 1395117 |
Town Hall (West) | 1035715 |
Flinders Street Station Underpass | 1015331 |
Spencer St-Collins St (North) | 910109 |
Bourke Street Mall (North) | 895483 |
The Arts Centre | 884885 |
Princes Bridge | 799066 |
Bourke Street Mall (South) | 704858 |
St Kilda Rd-Alexandra Gardens | 620895 |
Flinders St-Swanston St (West) | 535146 |
State Library | 494944 |
Collins St (North) | 488458 |
Southern Cross Station | 485848 |
Melbourne Central | 473789 |
Melbourne Convention Exhibition Centre | 451455 |
Bourke St-Russell St (West) | 449123 |
Chinatown-Swanston St (North) | 402761 |
QV Market-Elizabeth St (West) | 383737 |
Sandridge Bridge | 360679 |
Lonsdale St (South) | 343747 |
Collins Place (South) | 315492 |
Spencer St-Collins St (South) | 257814 |
Chinatown-Lt Bourke St (South) | 257471 |
Birrarung Marr | 235438 |
Collins Place (North) | 222458 |
New Quay | 216206 |
Queen St (West) | 202057 |
Lygon St (West) | 194706 |
Alfred Place | 181529 |
Lonsdale St-Spring St (West) | 171063 |
Webb Bridge | 150208 |
Grattan St-Swanston St (West) | 121150 |
Victoria Point | 117649 |
Flinders St-Spring St (West) | 114549 |
Lygon St (East) | 108837 |
QV Market-Peel St | 95240 |
Flinders St-Spark La | 94461 |
Monash Rd-Swanston St (West) | 66420 |
Waterfront City | 61481 |
Tin Alley-Swanston St (West) | 38773 |
City Square | 0 |
Flagstaff Station | 0 |
Flinders St-Elizabeth St (East) | 0 |
3. Sum of missing values with `sum(is.na())`¶
For each sensor, compute the total number of missing counts. Which sensor had the most missing counts? Why might this be?
ped %>%
group_by(Sensor) %>%
summarise(tot_missing = sum(is.na(Count))) %>%
ungroup() %>%
arrange(desc(tot_missing))
Sensor | tot_missing |
---|---|
<chr> | <int> |
City Square | 744 |
Flagstaff Station | 744 |
Flinders St-Elizabeth St (East) | 744 |
Birrarung Marr | 416 |
Melbourne Central | 127 |
Monash Rd-Swanston St (West) | 50 |
Grattan St-Swanston St (West) | 38 |
Tin Alley-Swanston St (West) | 25 |
St Kilda Rd-Alexandra Gardens | 24 |
Waterfront City | 21 |
Victoria Point | 12 |
Bourke Street Mall (North) | 8 |
Flinders St-Spark La | 5 |
Alfred Place | 4 |
Webb Bridge | 3 |
Collins Place (North) | 2 |
Flinders St-Spring St (West) | 2 |
Chinatown-Swanston St (North) | 1 |
Lygon St (East) | 1 |
Lygon St (West) | 1 |
New Quay | 1 |
QV Market-Peel St | 1 |
Southern Cross Station | 1 |
Bourke St-Russell St (West) | 0 |
Bourke Street Mall (South) | 0 |
Chinatown-Lt Bourke St (South) | 0 |
Collins Place (South) | 0 |
Collins St (North) | 0 |
Flinders St-Swanston St (West) | 0 |
Flinders Street Station Underpass | 0 |
Lonsdale St (South) | 0 |
Lonsdale St-Spring St (West) | 0 |
Melbourne Convention Exhibition Centre | 0 |
Princes Bridge | 0 |
QV Market-Elizabeth St (West) | 0 |
Queen St (West) | 0 |
Sandridge Bridge | 0 |
Southbank | 0 |
Spencer St-Collins St (North) | 0 |
Spencer St-Collins St (South) | 0 |
State Library | 0 |
The Arts Centre | 0 |
Town Hall (West) | 0 |
4. Filtering multiple sensors and reshaping the data¶
Filter `ped` to contain the counts from the Melbourne Central and State Library sensors only, then use a tidying function to create two columns that contain their counts.
# Reshape from long to wide: one column of counts per sensor
ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  spread(Sensor, Count)
Date_Time | Date | Time | Melbourne Central | State Library |
---|---|---|---|---|
<dttm> | <date> | <dbl> | <dbl> | <dbl> |
2018-12-31 13:00:00 | 2019-01-01 | 0 | NA | 1548 |
2018-12-31 14:00:00 | 2019-01-01 | 1 | NA | 1494 |
2018-12-31 15:00:00 | 2019-01-01 | 2 | NA | 878 |
2018-12-31 16:00:00 | 2019-01-01 | 3 | NA | 309 |
2018-12-31 17:00:00 | 2019-01-01 | 4 | NA | 133 |
2018-12-31 18:00:00 | 2019-01-01 | 5 | NA | 110 |
2018-12-31 19:00:00 | 2019-01-01 | 6 | NA | 42 |
2018-12-31 20:00:00 | 2019-01-01 | 7 | NA | 50 |
2018-12-31 21:00:00 | 2019-01-01 | 8 | NA | 83 |
2018-12-31 22:00:00 | 2019-01-01 | 9 | NA | 128 |
2018-12-31 23:00:00 | 2019-01-01 | 10 | NA | 284 |
2019-01-01 00:00:00 | 2019-01-01 | 11 | NA | 473 |
2019-01-01 01:00:00 | 2019-01-01 | 12 | NA | 702 |
2019-01-01 02:00:00 | 2019-01-01 | 13 | NA | 815 |
2019-01-01 03:00:00 | 2019-01-01 | 14 | NA | 968 |
2019-01-01 04:00:00 | 2019-01-01 | 15 | NA | 1063 |
2019-01-01 05:00:00 | 2019-01-01 | 16 | NA | 1042 |
2019-01-01 06:00:00 | 2019-01-01 | 17 | NA | 1084 |
2019-01-01 07:00:00 | 2019-01-01 | 18 | NA | 997 |
2019-01-01 08:00:00 | 2019-01-01 | 19 | NA | 1011 |
2019-01-01 09:00:00 | 2019-01-01 | 20 | NA | 912 |
2019-01-01 10:00:00 | 2019-01-01 | 21 | 7359 | 592 |
2019-01-01 11:00:00 | 2019-01-01 | 22 | 26969 | 343 |
2019-01-01 12:00:00 | 2019-01-01 | 23 | 1847 | 184 |
2019-01-01 13:00:00 | 2019-01-02 | 0 | NA | 61 |
2019-01-01 14:00:00 | 2019-01-02 | 1 | NA | 46 |
2019-01-01 15:00:00 | 2019-01-02 | 2 | NA | 24 |
2019-01-01 16:00:00 | 2019-01-02 | 3 | NA | 17 |
2019-01-01 17:00:00 | 2019-01-02 | 4 | NA | 10 |
2019-01-01 18:00:00 | 2019-01-02 | 5 | NA | 18 |
... | ... | ... | ... | ... |
2019-01-30 07:00:00 | 2019-01-30 | 18 | 1076 | 1420 |
2019-01-30 08:00:00 | 2019-01-30 | 19 | 1040 | 975 |
2019-01-30 09:00:00 | 2019-01-30 | 20 | 1030 | 840 |
2019-01-30 10:00:00 | 2019-01-30 | 21 | 581 | 582 |
2019-01-30 11:00:00 | 2019-01-30 | 22 | 394 | 337 |
2019-01-30 12:00:00 | 2019-01-30 | 23 | 310 | 190 |
2019-01-30 13:00:00 | 2019-01-31 | 0 | 168 | 86 |
2019-01-30 14:00:00 | 2019-01-31 | 1 | 77 | 41 |
2019-01-30 15:00:00 | 2019-01-31 | 2 | 26 | 26 |
2019-01-30 16:00:00 | 2019-01-31 | 3 | 20 | 4 |
2019-01-30 17:00:00 | 2019-01-31 | 4 | 22 | 5 |
2019-01-30 18:00:00 | 2019-01-31 | 5 | 48 | 27 |
2019-01-30 19:00:00 | 2019-01-31 | 6 | 105 | 107 |
2019-01-30 20:00:00 | 2019-01-31 | 7 | 224 | 300 |
2019-01-30 21:00:00 | 2019-01-31 | 8 | 413 | 643 |
2019-01-30 22:00:00 | 2019-01-31 | 9 | 494 | 612 |
2019-01-30 23:00:00 | 2019-01-31 | 10 | 664 | 653 |
2019-01-31 00:00:00 | 2019-01-31 | 11 | 726 | 836 |
2019-01-31 01:00:00 | 2019-01-31 | 12 | 1216 | 1355 |
2019-01-31 02:00:00 | 2019-01-31 | 13 | 1377 | 1587 |
2019-01-31 03:00:00 | 2019-01-31 | 14 | 1156 | 1253 |
2019-01-31 04:00:00 | 2019-01-31 | 15 | 1353 | 1246 |
2019-01-31 05:00:00 | 2019-01-31 | 16 | 1208 | 1508 |
2019-01-31 06:00:00 | 2019-01-31 | 17 | 1568 | 1814 |
2019-01-31 07:00:00 | 2019-01-31 | 18 | 1478 | 1522 |
2019-01-31 08:00:00 | 2019-01-31 | 19 | 1082 | 1056 |
2019-01-31 09:00:00 | 2019-01-31 | 20 | 957 | 981 |
2019-01-31 10:00:00 | 2019-01-31 | 21 | 927 | 719 |
2019-01-31 11:00:00 | 2019-01-31 | 22 | 583 | 449 |
2019-01-31 12:00:00 | 2019-01-31 | 23 | 333 | 177 |
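A side note on the reshaping function: spread() still works but has been superseded in tidyr. If you are running tidyr 1.0.0 or later, the same reshape can be written with pivot_wider(), which makes the roles of the two columns explicit:

```r
# Modern tidyr equivalent of spread(Sensor, Count):
# names_from gives the column that supplies the new column names,
# values_from gives the column that fills in the cell values
ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  pivot_wider(names_from = Sensor, values_from = Count)
```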
5. Producing a 100 per cent chart¶
Create the following 100 per cent chart to compare the foot traffic at Melbourne Central and the State Library during different hours of the day. We can change the dimensions of our plot by changing the code chunk options.
- By default, an R plot's height and width are set to 5 and 7 inches.
- Set the height and width to 8 and 12 inches by adding fig.height=8 and fig.width=12 inside the code chunk options, i.e., change {r} to {r fig.height=8, fig.width=12}.

Note that R will return a warning to inform you that missing values in the data have been removed.
# position = "fill" rescales each stacked bar to a constant height of 1,
# so the bars show proportions rather than raw counts
ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  ggplot(aes(x = Time, y = Count, fill = Sensor)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~ Date, ncol = 7) +
  labs(
    title = "Comparing foot traffic at Melbourne Central and the State Library during different hours of the day",
    subtitle = "Greater proportion of foot traffic at the State Library than Melbourne Central during the afternoon"
  )
Explain why the first 8 days of January appear this way.
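As an aside (a sketch of an equivalent, not the exercise answer): geom_bar(stat = "identity") can be written more concisely as geom_col(), since geom_col() is ggplot2's shorthand for bars whose heights come directly from the data. The 100 per cent effect comes entirely from position = "fill":

```r
# geom_col() is shorthand for geom_bar(stat = "identity");
# position = "fill" stacks the bars and rescales each to sum to 1
ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  ggplot(aes(x = Time, y = Count, fill = Sensor)) +
  geom_col(position = "fill") +
  facet_wrap(~ Date, ncol = 7)
```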
All of the material is licensed under the Creative Commons BY-SA 4.0 licence.