Pedestrian activity

The City of Melbourne has developed an automated pedestrian counting system to better understand pedestrian activity. Data is captured from counting sensors across various locations in Melbourne's CBD.

We've stored a subset of this data in a comma-separated values (.csv) file called melb_walk_wide.csv on GitHub (please open this link in a new tab). If you have trouble accessing it, download it to your working directory from Moodle. Clicking on the Raw button on GitHub lets you view melb_walk_wide.csv in your web browser:

Reading a .csv file from GitHub

To read (or import) melb_walk_wide.csv into your R session:

  • Load the tidyverse, which contains the read_csv() function (from the readr package) for reading .csv files into R. (A good rule of thumb is to always load the tidyverse before you begin any data analysis.)
  • Copy the GitHub URL of melb_walk_wide.csv and paste it inside the read_csv() function. If you have already downloaded the file into your working directory, just use the file path instead, e.g., ./melb_walk_wide.csv
  • Store the data in an object named ped_wide.

Fill out the missing parts of the code chunk (???) and then run:

# Load tidyverse
library(tidyverse)

# Read melb_walk_wide.csv from GitHub URL and store in object named ped_wide
ped_wide <- read_csv("https://raw.githubusercontent.com/quangvanbui/FIT5145-data/master/melb_walk_wide.csv")
# Alternatively, read it from your working directory
# ped_wide <- ???("./melb_walk_wide.csv")

# Print ped_wide
ped_wide
A spec_tbl_df: 744 x 46 (excerpt; most sensor columns are omitted for readability)
  Date_Time           Date        Time `Alfred Place` `Birrarung Marr` `Bourke St-Russell St (West)`
  <dttm>              <date>     <dbl>          <dbl>            <dbl>                         <dbl>
1 2018-12-31 13:00:00 2019-01-01     0            207             2733                          1745
2 2018-12-31 14:00:00 2019-01-01     1             99             1086                          1722
3 2018-12-31 15:00:00 2019-01-01     2             60              571                          1113
# ... with 741 more rows and 40 more sensor columns, ending in `Webb Bridge`

Note that when we load the tidyverse, R returns messages and warnings listing the tidyverse packages (and their versions) that have been attached to our R session, any function conflicts, etc. R also returns a message after reading melb_walk_wide.csv with read_csv() to report how it has specified each column type of the data.

read_csv() or read.csv()?

While base R provides the read.csv() function to read .csv files into R, the read_csv() function (from the readr package, which is part of the tidyverse) reads .csv files approximately 10 times faster than read.csv(). This means a .csv file that would take read.csv() 60 minutes to read into R would take read_csv() only about 6 minutes.
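Both functions are called in the same way, so you can time them yourself with system.time(). This is a sketch, assuming you have downloaded the file to your working directory; note that melb_walk_wide.csv is small (744 rows), so the speed difference will only be noticeable on much larger files:

```r
# Compare base R's read.csv() with readr's read_csv() on the same file.
# On a file this small the gap is tiny; the ~10x speed-up shows on large files.
library(tidyverse)

system.time(base_df  <- read.csv("./melb_walk_wide.csv"))   # base R
system.time(readr_df <- read_csv("./melb_walk_wide.csv"))   # readr (tidyverse)
```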

Look at ped_wide

To print out or look at what is inside of ped_wide, type ped_wide in a code chunk and run it.

# Print the first few rows of ped_wide
ped_wide %>% head(n=3)
A tibble: 3 x 46 (excerpt; most sensor columns are omitted for readability)
  Date_Time           Date        Time `Alfred Place` `Birrarung Marr`
  <dttm>              <date>     <dbl>          <dbl>            <dbl>
1 2018-12-31 13:00:00 2019-01-01     0            207             2733
2 2018-12-31 14:00:00 2019-01-01     1             99             1086
3 2018-12-31 15:00:00 2019-01-01     2             60              571
# ... with 41 more sensor columns

There are circumstances when printing out all of ped_wide is unnecessary. For example, in a report communicating our analysis, we should never include a table of the entire data set (if you were a project partner reading a report from an analyst, how would you feel about wading through a 744-by-46 table?). Instead, you can print the head of a data set using the head() function, which returns the first 6 rows of the data - a small extract of the data.

Fill out the missing parts of the code chunk (???) and then run:

# head(ped_wide)
# or use %>% to pipe the output to head() function
ped_wide %>% head()

Notice that there are 744 rows and 46 columns. The columns in ped_wide and their definition are provided below:

  • Date_Time - date and time stamp of the recorded pedestrian foot traffic count, in the UTC timezone
  • Date - date of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
  • Time - hour of the day (24-hour time) of the recorded pedestrian foot traffic count, in Melbourne's timezone (UTC+11 or UTC+10, depending on daylight saving)
  • Alfred Place - number of pedestrians counted over a one-hour period by a sensor located in Alfred Place
  • Birrarung Marr - number of pedestrians counted over a one-hour period by a sensor located in Birrarung Marr
  • ⋮
  • Webb Bridge - number of pedestrians counted over a one-hour period by a sensor located at Webb Bridge

Note that the dates and hours in variables Date and Time differ from Date_Time because of timezone differences.
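One way to see the timezone offset for yourself is to convert Date_Time into Melbourne local time with lubridate's with_tz() function. This is a sketch, assuming ped_wide has been read in as above (lubridate is installed with the tidyverse, but you may need to load it explicitly):

```r
# Convert the UTC Date_Time stamps into Melbourne local time
library(lubridate)

ped_wide %>%
  mutate(local_time = with_tz(Date_Time, tzone = "Australia/Melbourne")) %>%
  select(Date_Time, local_time, Date, Time) %>%
  head()
```

The local_time column should line up with the Date and Time columns, confirming that the 11-hour gap is purely a timezone difference.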

This type of data is called a time series, or temporal data, because it contains information recorded over time. In this example, we have hourly pedestrian counts for a number of locations in Melbourne from January 1 to 31, 2019. Confirm the time period of our data by following the steps below:

  • Take ped_wide and pipe in the arrange() function.
  • Arrange the data by the column, Date.
  • Pipe in the summarise() function and return the first and last date with the first() and last() function.

Fill out the missing parts of the code chunk (???) and then run:

# First and last date in the data
ped_wide %>%
  arrange(Date) %>%
  summarise(
    first_date = first(Date),
    last_date = last(Date)
  )
A tibble: 1 x 2
  first_date last_date
  <date>     <date>
1 2019-01-01 2019-01-31

Convert to long form

![Artwork by @allison_horst](images/tidyr_spread_gather.png)

It is helpful to think of a data set as either wide or long. The pedestrian count data, ped_wide, is presented in a wide form, which is to say that the attributes of the data are presented horizontally. Converting ped_wide into a long form presents the same attributes vertically, i.e., no information is lost by reshaping the data.

So why should we reshape the data into a long form? A data set that is represented in a long form is considered a tidy data set and allows us to use all the tools from the tidyverse. The tools created in the tidyverse are designed for us to work in a principled and consistent way but they require that the data be represented the tidy way (long form). We will see later how the dplyr functions to wrangle the data and ggplot2 package to produce graphics (both part of the tidyverse) work seamlessly when the data is in a tidy long form.

Of course, there are instances when a wide form representation of the data is necessary (some models need to be trained with data in a wide form).

Follow the steps below to convert ped_wide into a tidy long-form data set:

  • Take ped_wide and pipe in the gather() function.
  • Inside gather(), specify the key as Sensor and the value as Count, and gather all columns in ped_wide except Date_Time, Date and Time.
  • Store this tidy long form data of pedestrian count in an object named ped.

Fill out the missing parts of the code chunk (???) and then run:

# Convert the data into a long form
ped <- ped_wide %>%
  gather(
    key = Sensor,
    value = Count, -Date_Time, -Date, -Time
  ) %>%
  select(Sensor, everything(), Count)

# Print ped
ped %>% head()
A tibble: 6 x 5
  Sensor       Date_Time           Date        Time Count
  <chr>        <dttm>              <date>     <dbl> <dbl>
1 Alfred Place 2018-12-31 13:00:00 2019-01-01     0   207
2 Alfred Place 2018-12-31 14:00:00 2019-01-01     1    99
3 Alfred Place 2018-12-31 15:00:00 2019-01-01     2    60
4 Alfred Place 2018-12-31 16:00:00 2019-01-01     3    21
5 Alfred Place 2018-12-31 17:00:00 2019-01-01     4    15
6 Alfred Place 2018-12-31 18:00:00 2019-01-01     5    31

While ped_wide contains 744 rows and 46 columns of data and ped contains 31,992 rows and 5 columns, no information is lost by reshaping the data. In ped_wide, the pedestrian count from each sensor was presented in its own column, but in ped, one column contains the sensor name/location and another its pedestrian count. This means that each row in ped captures the number of pedestrians counted over a one-hour window at a given location.
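Note that in current versions of tidyr, gather() has been superseded by pivot_longer(), which produces the same long-form result with arguably clearer argument names. A sketch of the equivalent call:

```r
# Equivalent long-form conversion using pivot_longer() (supersedes gather())
ped_alt <- ped_wide %>%
  pivot_longer(
    cols = -c(Date_Time, Date, Time),  # reshape every column except these three
    names_to = "Sensor",
    values_to = "Count"
  )
```

Either function works for this unit; gather() is used throughout this material.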

Note about the arguments in a function

It is not essential to type out the name of a function's argument(s) when specifying what that argument should be. For example, the gather() function used above specified the Sensor variable in the key argument and the Count variable in the value argument:

  • key = Sensor
  • value = Count
ped_wide %>%
  gather(key = Sensor, value = Count, -Date_Time, -Date, -Time)

We can achieve the same result without explicitly providing the argument names:

ped_wide %>%
  gather(Sensor, Count, -Date_Time, -Date, -Time)

This is because the arguments are ordered, i.e., the key comes first, then the value, so by passing Sensor first and Count second, gather() knows that Sensor should be used as the key argument and Count as the value argument.

State Library

We will explore pedestrian activity around the State Library on the 1st of January, 2019. To do this, we will need to filter ped for the State Library sensor on the 1st of January, 2019.

  • Take ped and pipe in the filter() function.
  • Filter Date to "2019-01-01" and Sensor to "State Library".
  • Store this filtered data in an object named state_lib_jan_one.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for State Library data on Jan 1, 2019
state_lib_jan_one <- ped %>%
  filter(
    Date == "2019-01-01",
    Sensor == "State Library"
  )

# Print state_lib_jan_one
state_lib_jan_one %>% head()
A tibble: 6 x 5
  Sensor        Date_Time           Date        Time Count
  <chr>         <dttm>              <date>     <dbl> <dbl>
1 State Library 2018-12-31 13:00:00 2019-01-01     0  1548
2 State Library 2018-12-31 14:00:00 2019-01-01     1  1494
3 State Library 2018-12-31 15:00:00 2019-01-01     2   878
4 State Library 2018-12-31 16:00:00 2019-01-01     3   309
5 State Library 2018-12-31 17:00:00 2019-01-01     4   133
6 State Library 2018-12-31 18:00:00 2019-01-01     5   110

This tells R to take ped, filter it for data captured by the State Library sensor on the 1st of January, 2019, and store the filtered data in an object named state_lib_jan_one. If you have done this successfully, you'll see state_lib_jan_one in your RStudio Environment tab, and state_lib_jan_one should look like the output above. Now answer the following:

  • How many rows and columns are in state_lib_jan_one?
  • Explain why there are this many rows. (This may seem obvious, but if you develop a checking mechanism like this, you'll be able to spot data quality or coding issues much sooner, which can save you a lot of time.)
  • In which hour is pedestrian count highest? Explain whether or not this makes sense.
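One way to answer these checks in code (a sketch, assuming state_lib_jan_one was created as above):

```r
# Dimensions: expect 24 rows (one per hour of Jan 1) and 5 columns
dim(state_lib_jan_one)

# Hour of the day with the highest pedestrian count
state_lib_jan_one %>%
  filter(Count == max(Count, na.rm = TRUE)) %>%
  select(Time, Count)
```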

Line plot

A better way to understand the pedestrian count around the State Library sensor in each hour of the day (of Jan 1st, 2019) is to produce a visualisation. Line plots are typically used to visualise time-series data sets, with the x-axis representing the time or date (or both) and the y-axis representing some time-series process. To produce a line plot of the pedestrian count around the State Library for each hour of the day:

  • Take state_lib_jan_one and pipe in the ggplot() function.
  • Specify the aesthetics layer, i.e., what should be placed on the x- and y-axes. This goes inside aes(), which goes inside ggplot().
  • Add the geometric (or geom) layer to tell R that the visual element we need for our plot is the line.

Fill out the missing parts of the code chunk (???) and then run:

# Line plot of State Library pedestrian count
state_lib_jan_one %>%
  ggplot(aes(y = Count, x = Time)) +
  geom_line()

Describe the pedestrian count from 0:00 to 23:00 on January 1st, 2019, i.e., when is the peak, trough, steepest decline, etc. Would you expect this pattern to appear the following day?

Bar plot

You can copy and run the following code chunk to produce the equivalent plot using bars, i.e., a bar plot of the State Library pedestrian count for each hour of the day.

# Bar plot of count
state_lib_jan_one %>%
  ggplot(aes(y = Count, x = Time)) +
  geom_bar(stat = "identity")
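ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"), so the same bar plot can be written slightly more concisely:

```r
# Equivalent bar plot using geom_col()
state_lib_jan_one %>%
  ggplot(aes(x = Time, y = Count)) +
  geom_col()
```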

Side-by-side box plot

Suppose we wanted to visualise the distribution of pedestrian count from the State Library sensor for each hour of the day (over the month of January, 2019). That is, we want to know what the central tendency, variability and shape of pedestrian count around the State Library looks like at 0:00, 1:00, 2:00, ..., 23:00. We will begin by filtering the data for only pedestrian counts from the State Library:

  • Take ped and pipe in the filter() function.
  • Filter Sensor to "State Library".
  • Store this filtered data in an object named state_lib.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for State Library
state_lib <- ped %>% 
  filter(Sensor == "State Library")

# Print state_lib
state_lib %>% head()
A tibble: 6 x 5
  Sensor        Date_Time           Date        Time Count
  <chr>         <dttm>              <date>     <dbl> <dbl>
1 State Library 2018-12-31 13:00:00 2019-01-01     0  1548
2 State Library 2018-12-31 14:00:00 2019-01-01     1  1494
3 State Library 2018-12-31 15:00:00 2019-01-01     2   878
4 State Library 2018-12-31 16:00:00 2019-01-01     3   309
5 State Library 2018-12-31 17:00:00 2019-01-01     4   133
6 State Library 2018-12-31 18:00:00 2019-01-01     5   110

Using state_lib, we can plot a side-by-side box plot of the pedestrian count around the State Library with the following steps:

  • Take state_lib and pipe in the ggplot() function
  • Add the aesthetic layer, which should have Time, Count and Time specified in the x, y and group argument inside of aes(). Note that aes() goes inside of ggplot().
  • Add the geom layer to tell R that the visual element we need for our plot is the boxplot.

Fill out the missing parts of the code chunk (???) and then run:

# Side-by-side box plot of pedestrian count for each hour of the day
state_lib %>%
  ggplot(
    aes(y = Count, x = Time, group = Time)
  ) + geom_boxplot()

Note that the group aesthetic will group the data (state_lib) by each hour of the day (Time), then create a box plot for each of these groups. Without the group aesthetic, ggplot will produce a single box plot of pedestrian count and use the Time variable as the width of the boxplot (and R will return a warning, asking you if you might have forgotten the group aesthetic).

# Box plot without Time specified as the group aesthetic 
state_lib %>%
  ggplot(aes(x = Time, y = Count)) +
  geom_boxplot()

The reason ggplot does not recognise that Time needs to be grouped (and we had to explicitly tell it to group the data by Time) is that Time is a numeric column. ggplot treats numeric columns as continuous, which is why it generates a single box plot when the group aesthetic is not specified.
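An alternative to the group aesthetic is to convert Time into a categorical variable, e.g. with factor(), so that ggplot draws one box per hour automatically. A sketch:

```r
# Converting Time to a factor makes ggplot treat each hour as a separate group
state_lib %>%
  ggplot(aes(x = factor(Time), y = Count)) +
  geom_boxplot()
```

Note that this makes the x-axis discrete, so its labels will look slightly different from the numeric version.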

Multiple locations

Suppose we are interested in the pedestrian count around Melbourne Central and the State Library (both are located near each other).

Filter for multiple sensors

Filter ped so that the pedestrian counts from only the Melbourne Central or State Library sensors are kept. This can be done with the following steps:

  • Take ped and pipe in the filter() function.
  • Use the %in% operator to filter Sensor so that only "Melbourne Central" or "State Library" are kept.
  • Store this filtered data in an object named mc_sl.

Fill out the missing parts of the code chunk (???) and then run:

# Filter for the Melbourne Central and State Library sensors
mc_sl <- ped %>% 
  filter(Sensor %in% c("Melbourne Central", "State Library"))

# Print mc_sl
mc_sl %>% head()
A tibble: 6 x 5
  Sensor            Date_Time           Date        Time Count
  <chr>             <dttm>              <date>     <dbl> <dbl>
1 Melbourne Central 2018-12-31 13:00:00 2019-01-01     0    NA
2 Melbourne Central 2018-12-31 14:00:00 2019-01-01     1    NA
3 Melbourne Central 2018-12-31 15:00:00 2019-01-01     2    NA
4 Melbourne Central 2018-12-31 16:00:00 2019-01-01     3    NA
5 Melbourne Central 2018-12-31 17:00:00 2019-01-01     4    NA
6 Melbourne Central 2018-12-31 18:00:00 2019-01-01     5    NA

  • How many rows and columns are in the data mc_sl?
  • Explain why there are this many rows in mc_sl.
  • How would you filter for all sensors except Melbourne Central and State Library? (Hint: There are 31,992 rows in ped and 1,488 rows in mc_sl, so a data set filtered for all sensors except Melbourne Central and State Library should have 30,504 rows.)
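For the last question, one way to perform the negated filter (and confirm the row counts in the hint) is to wrap the %in% condition in the ! operator. A sketch:

```r
# Keep every sensor except Melbourne Central and State Library
not_mc_sl <- ped %>%
  filter(!(Sensor %in% c("Melbourne Central", "State Library")))

# Check: 31,992 - 1,488 = 30,504 rows expected
nrow(not_mc_sl)
```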

Facetted side-by-side box plots

We've seen how a side-by-side box plot provides a visualisation of the distribution of the data. To divide a plot into the different categories/measurements of a column in the data, we simply add the facet_wrap() layer onto our ggplot() call. Follow the steps below to produce side-by-side box plots separated by the sensors in mc_sl, i.e., Melbourne Central and State Library:

  • Take mc_sl and pipe in the ggplot() function
  • Add the aesthetic layer, which should have Time, Count and Time specified in the x, y and group argument inside of aes(). Note that aes() goes inside of ggplot().
  • Add the geom layer to tell R that the visual element we need for our plot is the boxplot.
  • Add the facet_wrap() layer to split the plot by Sensor.

Fill out the missing parts of the code chunk (???) and then run:

# Side-by-side box plot of pedestrian count for each hour of the day facetted by Sensor
mc_sl %>%
  ggplot(aes(x = Time, y = Count, group = Time)) +
  geom_boxplot() +
  facet_wrap(~ Sensor)

Immediately, we notice that it is difficult to compare the side-by-side box plots of the pedestrian count at Melbourne Central and the State Library because of the outliers in the Melbourne Central data. The sensor seems to have picked up moments in the 22nd and 23rd hours of the day where the number of pedestrians far exceeded the maximum at any other hour of the day. Filtering out these outliers will improve the interpretability of the side-by-side box plots.

It may be easier to compare the pedestrian counts from both locations if the subplots were positioned top-to-bottom instead of left-to-right. You can make this change by setting the number of columns in your facetted plot to 1, i.e., ncol = 1.

Fill out the missing parts of the code chunk (???) and then run:

# Remove outliers and produce facetted plot with 1 column
mc_sl %>%
  filter(Count < 5000) %>%
  ggplot(aes(x = Time, y = Count, group = Time)) +
  geom_boxplot() +
  facet_wrap(~ Sensor, ncol = 1)

Group exercises

Returning to ped, complete the following exercises, which will require knowledge of the following concepts:

  • Pipe operator %>%
  • dplyr wrangling functions, e.g., filter(), group_by(), summarise(), arrange(), etc.
  • Functions to use inside of summarise(), e.g., n_distinct(), sum(), etc.
  • ggplot2 to produce a bar chart.

1. Using summarise()

Use a wrangling verb to count the number of sensors in ped. Do all the sensors have the same number of measurements?

ped %>%
  summarise(num_sensors = n_distinct(Sensor))
A tibble: 1 x 1
  num_sensors
        <int>
1          43
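To answer the second part of the question, count() tallies the rows per sensor. Since ped has 31,992 rows and 744 × 43 = 31,992, every sensor should have 744 measurements (though, as the next exercise shows, some of those measurements are NA). A sketch:

```r
# Number of rows (hourly measurements) recorded for each sensor
ped %>% count(Sensor)
```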

2. Grouping the data

For each sensor, compute the total count for January. Which sensor had the largest count? Which sensor had the smallest count?

ped %>%
  group_by(Sensor) %>%
  summarise(sum = sum(Count, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(sum))
A tibble: 43 x 2
  Sensor                                    sum
  <chr>                                   <dbl>
Southbank 1395117
Town Hall (West) 1035715
Flinders Street Station Underpass 1015331
Spencer St-Collins St (North) 910109
Bourke Street Mall (North) 895483
The Arts Centre 884885
Princes Bridge 799066
Bourke Street Mall (South) 704858
St Kilda Rd-Alexandra Gardens 620895
Flinders St-Swanston St (West) 535146
State Library 494944
Collins St (North) 488458
Southern Cross Station 485848
Melbourne Central 473789
Melbourne Convention Exhibition Centre 451455
Bourke St-Russell St (West) 449123
Chinatown-Swanston St (North) 402761
QV Market-Elizabeth St (West) 383737
Sandridge Bridge 360679
Lonsdale St (South) 343747
Collins Place (South) 315492
Spencer St-Collins St (South) 257814
Chinatown-Lt Bourke St (South) 257471
Birrarung Marr 235438
Collins Place (North) 222458
New Quay 216206
Queen St (West) 202057
Lygon St (West) 194706
Alfred Place 181529
Lonsdale St-Spring St (West) 171063
Webb Bridge 150208
Grattan St-Swanston St (West) 121150
Victoria Point 117649
Flinders St-Spring St (West) 114549
Lygon St (East) 108837
QV Market-Peel St 95240
Flinders St-Spark La 94461
Monash Rd-Swanston St (West) 66420
Waterfront City 61481
Tin Alley-Swanston St (West) 38773
City Square 0
Flagstaff Station 0
Flinders St-Elizabeth St (East) 0

3. Sum of missing values with sum(is.na())

For each sensor, compute the total number of missing counts. Which sensor had the most missing counts? Why might this be?

ped %>%
 group_by(Sensor) %>%
 summarise(tot_missing = sum(is.na(Count))) %>%
 ungroup() %>%
 arrange(desc(tot_missing))
A tibble: 43 x 2
  Sensor                            tot_missing
  <chr>                                   <int>
City Square 744
Flagstaff Station 744
Flinders St-Elizabeth St (East) 744
Birrarung Marr 416
Melbourne Central 127
Monash Rd-Swanston St (West) 50
Grattan St-Swanston St (West) 38
Tin Alley-Swanston St (West) 25
St Kilda Rd-Alexandra Gardens 24
Waterfront City 21
Victoria Point 12
Bourke Street Mall (North) 8
Flinders St-Spark La 5
Alfred Place 4
Webb Bridge 3
Collins Place (North) 2
Flinders St-Spring St (West) 2
Chinatown-Swanston St (North) 1
Lygon St (East) 1
Lygon St (West) 1
New Quay 1
QV Market-Peel St 1
Southern Cross Station 1
Bourke St-Russell St (West) 0
Bourke Street Mall (South) 0
Chinatown-Lt Bourke St (South) 0
Collins Place (South) 0
Collins St (North) 0
Flinders St-Swanston St (West) 0
Flinders Street Station Underpass 0
Lonsdale St (South) 0
Lonsdale St-Spring St (West) 0
Melbourne Convention Exhibition Centre 0
Princes Bridge 0
QV Market-Elizabeth St (West) 0
Queen St (West) 0
Sandridge Bridge 0
Southbank 0
Spencer St-Collins St (North) 0
Spencer St-Collins St (South) 0
State Library 0
The Arts Centre 0
Town Hall (West) 0

4. Filtering multiple sensors and reshaping the data

Filter ped to contain the counts from the Melbourne Central and State Library sensors only, then use a tidying function to create two columns that contain their counts.

ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  spread(Sensor, Count)
A tibble: 744 x 5 (excerpt)
  Date_Time           Date        Time `Melbourne Central` `State Library`
  <dttm>              <date>     <dbl>               <dbl>           <dbl>
1 2018-12-31 13:00:00 2019-01-01     0                  NA            1548
2 2018-12-31 14:00:00 2019-01-01     1                  NA            1494
3 2018-12-31 15:00:00 2019-01-01     2                  NA             878
4 2018-12-31 16:00:00 2019-01-01     3                  NA             309
5 2018-12-31 17:00:00 2019-01-01     4                  NA             133
6 2018-12-31 18:00:00 2019-01-01     5                  NA             110
# ... with 738 more rows

5. Producing a 100 per cent chart

Create the following 100 per cent chart to compare the foot traffic at Melbourne Central and the State Library during different hours of the day. We can change the dimensions of our plot by changing the code chunk option.

  • By default, an R plot's height and width are set to 5 and 7 inches, respectively.
  • Set the height and width to 8 and 12 inches by adding fig.height=8 and fig.width=12 inside the code chunk option, i.e., from {r} to {r fig.height=8, fig.width=12}.

Note that R will return a warning to inform you that missing values in the data have been removed.

ped %>%
  filter(Sensor %in% c("Melbourne Central", "State Library")) %>%
  ggplot(aes(x = Time, y = Count, fill = Sensor)) +
  geom_bar(stat = "identity", position = "fill") +
  facet_wrap(~ Date, ncol = 7) +
  labs(
    title = "Comparing foot traffic at Melbourne Central and the State Library during different hours of the day",
    subtitle = "Greater proportion of foot traffic at the State Library than Melbourne Central during the afternoon"
  )

Explain why the first 8 days of January appear this way.

All of the material is licensed under the Creative Commons BY-SA 4.0 license.