Workshop 2¶

This applied class explores cricket statistics from two different sources: an R package and web scraping.

Cricket data¶

cricketdata is an R package from rOpenSci that contains data on all international cricket matches, as provided by ESPNCricinfo.

T20 batting¶

Load the tidyverse
Read the CricketData.csv file
Filter the data for your two favourite countries. Here we have chosen Australia and India.

library(tidyverse)

# dataset
mt20 <- read_csv("CricketData.csv")

# filter for AUS and IND
mt20_aus_ind <- mt20 %>%
  filter(Country %in% c("India", "Australia"))

Look at `mt20_aus_ind`¶

In a game of cricket, teams take turns batting and bowling. The objective of the batting team is to score as many runs as possible, while the bowling team's objective is to prevent the batting team from scoring runs. At the end of an innings, the batting and bowling teams swap. The team with the most runs wins the match.

Look at what is inside of mt20_aus_ind by running it from a code chunk.

# dimensions?
mt20_aus_ind %>% dim()

# column names
mt20_aus_ind %>% colnames()

# head
mt20_aus_ind %>% head()

'Player'
'Country'
'Start'
'End'
'Matches'
'Innings'
'NotOuts'
'Runs'
'HighScore'
'HighScoreNotOut'
'Average'
'BallsFaced'
'StrikeRate'
'Hundreds'
'Fifties'
'Ducks'
'Fours'
'Sixes'

A tibble: 6 × 18
Player	Country	Start	End	Matches	Innings	NotOuts	Runs	HighScore	HighScoreNotOut	Average	BallsFaced	StrikeRate	Hundreds	Fifties	Ducks	Fours	Sixes
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<lgl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
PP Chawla	India	2010	2012	7	1	0	0	0	FALSE	0	1	0	0	0	1	0	0
A Mishra	India	2010	2017	10	1	0	0	0	FALSE	0	0	NA	0	0	1	0	0
AA Noffke	Australia	2007	2008	2	1	0	0	0	FALSE	0	0	NA	0	0	1	0	0
MM Patel	India	2011	2011	3	1	0	0	0	FALSE	0	1	0	0	0	1	0	0
BW Hilfenhaus	Australia	2007	2012	7	3	1	2	2	FALSE	1	10	20	0	0	1	0	0
UT Yadav	India	2012	2019	7	1	0	2	2	FALSE	2	4	50	0	0	0	0	0

How many rows and columns are in mt20_aus_ind?
What does each row in mt20_aus_ind represent?
What function returns the top of mt20_aus_ind, i.e., the first 6 rows?

We can learn more about mt20_aus_ind by visualising it. For continuous numerical variables, e.g.,

average run score
strike rate

or discrete numerical variables that take on a wide range of numbers, e.g.,

player's highest run score
total runs scored

we can visualise their distribution with histograms or box plots. Understanding how these variables are distributed provides us with information about their central tendency (mean, median, mode), variability (standard deviation, IQR, range) and shape (skewness).
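
As a rough illustration of those summaries, the sketch below computes measures of central tendency, variability and shape for a small vector of strike rates. The values are invented for this example and are not taken from the dataset, and comparing the mean with the median is only an informal check for skewness.

```r
# Hypothetical strike rates (invented values, not from CricketData.csv)
strike_rate <- c(50, 80, 95, 100, 110, 120, 125, 130, 145, 210, NA)

# Central tendency
sr_mean   <- mean(strike_rate, na.rm = TRUE)
sr_median <- median(strike_rate, na.rm = TRUE)

# Variability
sr_sd    <- sd(strike_rate, na.rm = TRUE)
sr_iqr   <- IQR(strike_rate, na.rm = TRUE)
sr_range <- range(strike_rate, na.rm = TRUE)

# Informal shape check: a mean above the median hints at right skew
right_skew_hint <- sr_mean > sr_median

print(c(mean = sr_mean, median = sr_median, sd = sr_sd, IQR = sr_iqr))
```

Note the na.rm = TRUE arguments: without them, a single missing value would turn every summary into NA.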

Before performing any analysis, we will need to convert the data into a tidy long form:

Take mt20_aus_ind and select only variables Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes.
Convert this data into a tidy long form using gather()
Gather all columns in mt20_aus_ind except Player and Country, specifying the key and value as Bat_Stats and Value.
Store this tidy long form data of men's T20 batting statistics in an object named mt20_aus_ind_long.
Print mt20_aus_ind_long

Fill out the missing parts of the code chunk (???) and then run:

# Convert mt20_aus_ind to long form
mt20_aus_ind_long <- mt20_aus_ind %>%
  select(Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes) %>%
  gather(Bat_Stats, Value, -Player, -Country)

# Print mt20_aus_ind_long
mt20_aus_ind_long %>% head()

A tibble: 6 x 4
Player	Country	Bat_Stats	Value
<chr>	<chr>	<chr>	<dbl>
PP Chawla	India	NotOuts	0
A Mishra	India	NotOuts	0
AA Noffke	Australia	NotOuts	0
MM Patel	India	NotOuts	0
BW Hilfenhaus	Australia	NotOuts	1
UT Yadav	India	NotOuts	0
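
gather() still works, but tidyr has superseded it with pivot_longer(). The sketch below shows the same kind of reshaping on three rows copied from the head() output earlier; the object names toy_wide and toy_long are made up for this example.

```r
library(tibble)
library(tidyr)

# Three players and three statistics, copied from the head() output
toy_wide <- tribble(
  ~Player,         ~Country,    ~NotOuts, ~HighScore, ~Ducks,
  "PP Chawla",     "India",     0,        0,          1,
  "BW Hilfenhaus", "Australia", 1,        2,          1,
  "UT Yadav",      "India",     0,        2,          0
)

# Every column except Player and Country becomes a (Bat_Stats, Value) pair
toy_long <- pivot_longer(
  toy_wide,
  cols      = -c(Player, Country),
  names_to  = "Bat_Stats",
  values_to = "Value"
)

print(dim(toy_long))
print(toy_long)
```

Unlike gather(), pivot_longer() keeps each player's statistics on adjacent rows rather than stacking one statistic at a time, but the content is the same.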

How many rows and columns are in mt20_aus_ind_long?
What information does the column Bat_Stats contain?
What information does the column Value contain?

Comparing batting statistics¶

We can compare each country's batting statistics by visualising how they are distributed. A good way to do this is with side-by-side box plots:

Take mt20_aus_ind_long and pipe in the ggplot() function.
Add layers to the ggplot call:
- 1st layer: The x aesthetic should be Country and the y aesthetic the values of the batting statistics.
- 2nd layer: Box plots are the visual elements we want to use for our graph, so add geom_boxplot().
- 3rd layer: Another visual element we want to include is jittered data, so add geom_jitter().
- 4th layer: Facet the graph by the variable that represents batting statistics.
- 5th layer: Add labels to your graph with labs().

Fill out the missing parts of the code chunk (???) and then run:

# Box plots of countries by batting statistics
mt20_aus_ind_long %>%
  ggplot(aes(x = Country,  y = Value)) +
  geom_boxplot(outlier.alpha = 0) + # hide the outliers
  geom_jitter(alpha = 0.3) +
  facet_wrap(~ Bat_Stats, scales = "free") +
  labs(
    title = "Distribution of Australian and Indian batting statistics",
    caption = "Source: https://github.com/ropenscilabs/cricketdata"
  )

What do the warning messages tell us about our data?

There are missing values in the data.

Based on the above 9 numerical variables that contain batting statistics from Australian and Indian cricket players, what do you conclude about each country's batting performance?

The distributions of runs for Aussie and Indian players are similar.

Which 2 variables look most symmetrically distributed?
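
One way to follow up on the missing-value warnings is to count the NA values per statistic. The sketch below does this on four rows copied from the head() output earlier (toy_bat is a name made up for this example). In these rows StrikeRate is NA exactly when BallsFaced is 0; whether that explains every missing value in the full data is an assumption you would need to check.

```r
library(dplyr)

# Four rows copied from the head() output
toy_bat <- tibble(
  Player     = c("PP Chawla", "A Mishra", "AA Noffke", "BW Hilfenhaus"),
  BallsFaced = c(1, 0, 0, 10),
  Average    = c(0, 0, 0, 1),
  StrikeRate = c(0, NA, NA, 20)
)

# Count missing values in every numeric column
na_counts <- toy_bat %>%
  summarise(across(where(is.numeric), ~ sum(is.na(.x))))

# Which players have a missing strike rate?
missing_sr <- toy_bat %>%
  filter(is.na(StrikeRate))

print(na_counts)
print(missing_sr)
```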

Computing grouped statistics¶

Comparing countries on total runs alone is not insightful: a country whose players have collectively played many more matches may have higher total runs simply because of that. Instead, we might compare each country's total runs divided by total matches. Keep in mind that in many cricket games (T20 included), some players play a match without batting at all, so matches played overstates the number of batting opportunities. To judge whether a statistic like this is sensible, we need to understand the data that we're analysing. Here, some research on cricket and how a T20 match is played may be helpful.

Returning to the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) and then run:

# Compute total runs divided by total matches for each country
mt20_aus_ind %>%
  group_by(Country) %>%
  summarise(total_runs = sum(Runs, na.rm = TRUE),
            total_matches = sum(Matches, na.rm = TRUE),
            totalruns_totalmatches = round(total_runs / total_matches, 3)) %>%
  ungroup()

A tibble: 2 x 4
Country	total_runs	total_matches	totalruns_totalmatches
<chr>	<dbl>	<dbl>	<dbl>
Australia	19280	1405	13.722
India	19568	1409	13.888
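
Because some players play a match without batting, runs per innings may be a fairer basis for comparison than runs per match. The following is only a sketch of that alternative on four rows copied from the head() output earlier; toy_runs and the runs_per_* column names are made up here, and the result says nothing about the full dataset.

```r
library(dplyr)

# Four rows copied from the head() output
toy_runs <- tibble(
  Player  = c("PP Chawla", "AA Noffke", "BW Hilfenhaus", "UT Yadav"),
  Country = c("India", "Australia", "Australia", "India"),
  Matches = c(7, 2, 7, 7),
  Innings = c(1, 1, 3, 1),
  Runs    = c(0, 0, 2, 2)
)

# Same grouped summary as above, with innings as a second denominator
toy_summary <- toy_runs %>%
  group_by(Country) %>%
  summarise(
    total_runs       = sum(Runs, na.rm = TRUE),
    runs_per_match   = total_runs / sum(Matches, na.rm = TRUE),
    runs_per_innings = total_runs / sum(Innings, na.rm = TRUE)
  )

print(toy_summary)
```

The two denominators can rank countries differently, which is why it helps to know how often players actually bat.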

Relationship between average runs and strike rate¶

Another statistic that we can explore is the strike rate, which represents the average number of runs scored per 100 balls faced.
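
As a quick check of that definition, the sketch below recomputes strike rate as 100 * runs / balls faced for two rows from the head() output earlier and compares the result with the StrikeRate column. It also recomputes the batting average as runs divided by dismissals (innings minus not-outs), which is the usual cricket definition and is consistent with these two rows. The variable names are made up for this example.

```r
# Two rows copied from the head() output
runs        <- c(Hilfenhaus = 2,  Yadav = 2)
balls_faced <- c(Hilfenhaus = 10, Yadav = 4)
innings     <- c(Hilfenhaus = 3,  Yadav = 1)
not_outs    <- c(Hilfenhaus = 1,  Yadav = 0)

# Strike rate: runs scored per 100 balls faced
strike_rate <- 100 * runs / balls_faced

# Batting average: runs per dismissal
average <- runs / (innings - not_outs)

print(strike_rate)  # compare with StrikeRate: 20 and 50
print(average)      # compare with Average: 1 and 2
```

A player who has faced no balls gives 0 / 0 in the first formula, which is one plausible source of the NA strike rates seen in the data.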

Again, using the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) to obtain a scatter plot of average runs by strike rate:

# Scatter plot of average runs and strike rate
mt20_aus_ind %>%
  ggplot(aes(x = StrikeRate, y = Average, colour = Country)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relationship between average runs and strike rate")

How might you inspect values of average runs along the upper limit of the graph (above 50 runs)?

# Answer
mt20_aus_ind %>%
  filter(Average > 50) %>%
  arrange(desc(Average))

A spec_tbl_df: 1 x 18
Player	Country	Start	End	Matches	Innings	NotOuts	Runs	HighScore	HighScoreNotOut	Average	BallsFaced	StrikeRate	Hundreds	Fifties	Ducks	Fours	Sixes
<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<lgl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
ML Hayden	Australia	2005	2007	9	9	3	308	73	TRUE	51.33333	214	143.9252	0	4	0	37	13

ON YA HAYDEN!

Web scraping T20I cricket data¶

The ICC Men's T20I Team Rankings is an international Twenty20 cricket rankings system of the International Cricket Council. We want to scrape the "Current rankings" table on the Wikipedia page at https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings.

`rvest` package for web scraping¶

To scrape the T20I ratings data from the web:

Load the rvest package
Store the T20I URL as an object
- The T20I URL can be stored as an object named t20_url
Use the read_html() function from the rvest package to scrape data from t20_url
- The scraped T20I data can be stored as an object named t20i_page

Fill out the missing parts of the code chunk (???) and then run:

library(rvest)

# Store T20 URL as an object named t20_url
t20_url <- "https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings"

# Scrape T20 data
t20i_page <- rvest::read_html(t20_url)

The HTML table element inside t20i_page can be extracted and returned as a data frame with the html_element() and html_table() functions from the rvest package.

Fill out the missing parts of the code chunk (???) and then run:

t20i_tables <- rvest::html_element(t20i_page, "table.wikitable") %>%
  rvest::html_table()
t20i_tables %>% head()

A tibble: 6 x 5
Team	Matches	Points	Rating
<chr>	<chr>	<chr>	<chr>	<chr>
India	62	16,556	267	NA
Australia	40	10,241	256	NA
England	39	9876	253	NA
West Indies	46	11604	252	NA
South Africa	35	8777	251	NA
New Zealand	49	12,113	247	NA
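
To experiment with the selector without hitting Wikipedia, the same two functions can be tried on a small inline HTML snippet. This is only a sketch: the snippet is a hypothetical stand-in for the real page (two rows copied from the output above), and toy_page and toy_table are names made up for this example.

```r
library(rvest)

# Hypothetical stand-in for the Wikipedia page
toy_page <- minimal_html('
  <table class="wikitable">
    <tr><th>Team</th><th>Matches</th><th>Points</th><th>Rating</th></tr>
    <tr><td>India</td><td>62</td><td>16,556</td><td>267</td></tr>
    <tr><td>Australia</td><td>40</td><td>10,241</td><td>256</td></tr>
  </table>
')

# html_element() returns the first node matching the CSS selector;
# html_table() converts that node into a tibble
toy_table <- html_table(html_element(toy_page, "table.wikitable"))

str(toy_table)
```

html_element() picks only the first match, so on a page with several tables of class wikitable you may need html_elements() and then choose the one you want. The comma in Points is what keeps that column as character.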

We need to do some preprocessing here, because we have an empty column and all of the numeric columns are read in as character, which is undesirable. The steps below should address these issues:

# remove the last column which is empty
# t20i_tables <- t20i_tables[-c(5)]
t20i_tables <- t20i_tables[, -5]

# No need to change column names!

# change the data type of Matches and Rating into numeric
# for Points, strip comma then convert to numeric
t20i_tables <- t20i_tables %>%
  mutate(
    Matches = as.integer(Matches),
    Rating = as.integer(Rating),
    Points = as.integer(gsub(",", "", Points))
  )

# Keep top N countries for simplicity
t20i_tables <- t20i_tables[1:25, ]

t20i_tables %>% head()

A tibble: 6 x 4
Team	Matches	Points	Rating
<chr>	<int>	<int>	<int>
India	62	16556	267
Australia	40	10241	256
England	39	9876	253
West Indies	46	11604	252
South Africa	35	8777	251
New Zealand	49	12113	247
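
An alternative sketch of the same cleaning, for comparison only: it drops any column that is entirely NA instead of relying on its position, and uses readr::parse_number(), which ignores the comma as a grouping mark under the default locale. The toy data are two rows copied from the scraped output, and the column name Empty is a made-up placeholder for the unnamed fifth column.

```r
library(dplyr)
library(readr)

# Two rows copied from the scraped output; Empty stands in for the blank column
toy_raw <- tibble(
  Team    = c("India", "England"),
  Matches = c("62", "39"),
  Points  = c("16,556", "9876"),
  Rating  = c("267", "253"),
  Empty   = c(NA, NA)
)

toy_clean <- toy_raw %>%
  select(where(~ !all(is.na(.x)))) %>%                     # drop all-NA columns
  mutate(across(c(Matches, Points, Rating), parse_number)) # "16,556" becomes 16556

print(toy_clean)
```

Note that parse_number() returns doubles rather than integers, so wrap it in as.integer() if integer columns are required.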

Country's rating in T20I¶

Below is a bar plot of each country's rating in T20I cricket, arranged from highest to lowest. To replicate this plot, you will need to add an x and y aesthetic (with the variable in the x aesthetic ordered using the fct_reorder() function), add a geom layer that tells R to use bars as visual elements for the plot, add a layer to flip the x and y axes (coord_flip()) and add a last layer to label the title and axes of the plot.

t20i_tables %>%
# width 11 inches, height 8 inches
  ggplot(
    aes(
      x = fct_reorder(Team, Rating),
      y = Rating,
      fill = Matches
    )
  ) +
  geom_bar(stat = "identity", alpha = 0.5) +
  coord_flip() +
  labs(title = "Country rating for T20I cricket",
    x = "Country",
    y = "Rating"
  )
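
To see what fct_reorder() contributes to the plot above, the sketch below applies it to three teams and their ratings taken from the cleaned table. The vector names are made up for this example.

```r
library(forcats)

team   <- c("India", "Australia", "England")
rating <- c(267, 256, 253)

# Reorder the factor levels of team by ascending rating
team_ordered <- fct_reorder(team, rating)

print(levels(team_ordered))
```

The levels come out from lowest to highest rating. ggplot2 draws the first level nearest the origin, so after coord_flip() the highest-rated team ends up at the top of the chart.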

FIT5145 Workshop 2

Jevgeni Han

2025-03-18
