Workshop 2

This applied class explores cricket statistics from two different sources: an R package and web scraping.

Cricket data

cricketdata is an R package from rOpenSci that contains data on all international cricket matches, as provided by ESPNCricinfo.

T20 batting

  • Load the tidyverse
  • Read CricketData.csv
  • Filter the data for your two favourite countries. Here we have chosen Australia and India
library(tidyverse)

# dataset
mt20 <- read_csv("CricketData.csv")

# filter for AUS and IND
mt20_aus_ind <- mt20 %>%
  filter(Country %in% c("India", "Australia"))
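
The filter above uses %in% rather than ==. A minimal sketch on a made-up four-row tibble (the toy object and its values are hypothetical, not taken from CricketData.csv) suggests why this matters:

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the filter
toy <- tibble(
  Player  = c("A", "B", "C", "D"),
  Country = c("India", "India", "Australia", "England")
)

# %in% tests every row against the whole set of countries
with_in <- toy %>% filter(Country %in% c("India", "Australia"))

# == recycles the right-hand vector along the rows,
# so whether a row matches depends on its position
with_eq <- toy %>% filter(Country == c("India", "Australia"))

nrow(with_in)  # 3 (players A, B and C)
nrow(with_eq)  # 1 (player A only)
```

With ==, players B and C are silently dropped, and R gives no warning because the number of rows is a multiple of the vector length.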

Look at mt20_aus_ind

In a game of cricket, teams take turns batting and bowling. The objective of the batting team is to score as many runs as possible, while the bowling team's objective is to prevent the batting team from scoring runs. At the end of an innings, the batting and bowling teams swap. The team with the most runs wins the match.

Look at what is inside of mt20_aus_ind by running it from a code chunk.

# dimensions?
mt20_aus_ind %>% dim()

# column names
mt20_aus_ind %>% colnames()

# head
mt20_aus_ind %>% head()
  1. 151
  2. 18
  1. 'Player'
  2. 'Country'
  3. 'Start'
  4. 'End'
  5. 'Matches'
  6. 'Innings'
  7. 'NotOuts'
  8. 'Runs'
  9. 'HighScore'
  10. 'HighScoreNotOut'
  11. 'Average'
  12. 'BallsFaced'
  13. 'StrikeRate'
  14. 'Hundreds'
  15. 'Fifties'
  16. 'Ducks'
  17. 'Fours'
  18. 'Sixes'
A tibble: 6 × 18
Player         Country    Start  End    Matches  Innings  NotOuts  Runs   HighScore  HighScoreNotOut  Average  BallsFaced  StrikeRate  Hundreds  Fifties  Ducks  Fours  Sixes
<chr>          <chr>      <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>  <dbl>      <lgl>            <dbl>    <dbl>       <dbl>       <dbl>     <dbl>    <dbl>  <dbl>  <dbl>
PP Chawla      India      2010   2012   7        1        0        0      0          FALSE            0        1           0           0         0        1      0      0
A Mishra       India      2010   2017   10       1        0        0      0          FALSE            0        0           NA          0         0        1      0      0
AA Noffke      Australia  2007   2008   2        1        0        0      0          FALSE            0        0           NA          0         0        1      0      0
MM Patel       India      2011   2011   3        1        0        0      0          FALSE            0        1           0           0         0        1      0      0
BW Hilfenhaus  Australia  2007   2012   7        3        1        2      2          FALSE            1        10          20          0         0        1      0      0
UT Yadav       India      2012   2019   7        1        0        2      2          FALSE            2        4           50          0         0        0      0      0
  • How many rows and columns are in mt20_aus_ind?
  • What does each row in mt20_aus_ind represent?
  • What function returns the top of mt20_aus_ind, i.e., the first 6 rows?

We can learn more about mt20_aus_ind by visualising it. For continuous numerical variables, e.g.,

  • average run score
  • strike rate

or discrete numerical variables that take on a wide range of numbers, e.g.,

  • player's highest run score
  • total runs scored

we can visualise their distribution with histograms or box plots. Understanding how these variables are distributed provides us with information about their central tendency (mean, median, mode), variability (standard deviation, IQR, range) and shape (skewness).
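
As a rough illustration of these summaries, the sketch below computes them in base R for a made-up vector of strike rates (the values are hypothetical, not taken from the data set):

```r
# Hypothetical strike rates for eight players (made-up values)
strike_rate <- c(95, 110, 118, 120, 125, 131, 140, 210)

# Central tendency
sr_mean   <- mean(strike_rate)    # 131.125
sr_median <- median(strike_rate)  # 122.5

# Variability
sr_sd    <- sd(strike_rate)
sr_iqr   <- IQR(strike_rate)
sr_range <- range(strike_rate)    # 95 210

# Shape: a mean noticeably above the median hints at right skew,
# here caused by the single large value of 210
sr_mean > sr_median               # TRUE
```

Note that base R's mode() returns the storage type of an object, not the statistical mode, so the most frequent value has to be computed another way (for example with table()).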

Before performing any analysis, we will need to convert the data into a tidy long form:

  • Take mt20_aus_ind and select only variables Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes.
  • Convert this data into a tidy long form using gather()
  • Gather all columns in mt20_aus_ind except Player and Country, specifying the key and value as Bat_Stats and Value.
  • Store this tidy long form data of men's T20 batting statistics in an object named mt20_aus_ind_long.
  • Print mt20_aus_ind_long

Fill out the missing parts of the code chunk (???) and then run:

# Convert mt20_aus_ind to long form
mt20_aus_ind_long <- mt20_aus_ind %>%
  select(Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes) %>%
  gather(Bat_Stats, Value, -Player, -Country)

# Print mt20_aus_ind_long
mt20_aus_ind_long %>% head()
A tibble: 6 x 4
Player         Country    Bat_Stats  Value
<chr>          <chr>      <chr>      <dbl>
PP Chawla      India      NotOuts    0
A Mishra       India      NotOuts    0
AA Noffke      Australia  NotOuts    0
MM Patel       India      NotOuts    0
BW Hilfenhaus  Australia  NotOuts    1
UT Yadav       India      NotOuts    0
  • How many rows and columns are in mt20_aus_ind_long?
  • What information does the column Bat_Stats contain?
  • What information does the column Value contain?
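
tidyr's documentation recommends pivot_longer() over gather() for new code. As a hedged sketch, the same reshaping could be written as follows; the two-player toy tibble is made up for illustration and includes only three of the statistics columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-form data (made-up players and values)
toy <- tibble(
  Player  = c("Player A", "Player B"),
  Country = c("Australia", "India"),
  NotOuts = c(1, 0),
  Fours   = c(12, 7),
  Sixes   = c(3, 5)
)

# Same idea as gather(Bat_Stats, Value, -Player, -Country)
toy_long <- toy %>%
  pivot_longer(
    cols      = -c(Player, Country),
    names_to  = "Bat_Stats",
    values_to = "Value"
  )

dim(toy_long)  # 6 rows (2 players x 3 statistics) and 4 columns
```

The rows come out in a different order (pivot_longer() works through each player in turn, gather() stacks one statistic after another), but the contents are the same.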

Comparing batting statistics

We can compare each country's batting statistics by visualising how they are distributed. A good way to do this is with side-by-side box plots:

  • Take mt20_aus_ind_long and pipe in the ggplot() function.
  • Add layers to the ggplot call:
    • 1st layer: The x aesthetic should be Country and the y aesthetic the values of the batting statistics.
    • 2nd layer: Box plots are the visual elements we want to use for our graph, so add geom_boxplot().
    • 3rd layer: Another visual element we want to include is jittered data, so add geom_jitter().
    • 4th layer: Facet the graph by the variable that represents batting statistics.
    • 5th layer: Add labels to your graph with labs().

Fill out the missing parts of the code chunk (???) and then run:

# Box plots of countries by batting statistics
mt20_aus_ind_long %>%
  ggplot(aes(x = Country,  y = Value)) +
  geom_boxplot(outlier.alpha = 0) + # hide the outliers
  geom_jitter(alpha = 0.3) +
  facet_wrap(~ Bat_Stats, scales = "free") +
  labs(
    title = "Distribution of Australian and Indian batting statistics",
    caption = "Source: https://github.com/ropenscilabs/cricketdata"
  )
  • What do the warning messages tell us about our data?

    There are missing values in the data.

  • Based on the above 9 numerical variables that contain batting statistics from Australian and Indian cricket players, what do you conclude about each country's batting performance?

    The distributions of runs for Australian and Indian players are similar.

  • Which 2 variables look most symmetrically distributed?
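
To follow up on the warning messages, one way to see where the missing values sit is to count them per statistic. The sketch below does this on a made-up long-form tibble (toy_long and its values are hypothetical); the same pipeline should work on mt20_aus_ind_long.

```r
library(dplyr)

# Hypothetical long-form data with two missing values
toy_long <- tibble(
  Player    = c("A", "B", "C", "A", "B", "C"),
  Bat_Stats = c("Average", "Average", "Average",
                "StrikeRate", "StrikeRate", "StrikeRate"),
  Value     = c(12.5, NA, 30, 110, NA, 95)
)

# Count the missing values within each batting statistic
missing_by_stat <- toy_long %>%
  group_by(Bat_Stats) %>%
  summarise(n_missing = sum(is.na(Value)))

missing_by_stat
```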

Computing grouped statistics

Comparing countries on total runs alone is not insightful: a country whose players have collectively played more matches may have more total runs simply because of that. Instead, we can compare each country's total runs divided by total matches. Keep in mind that in many cricket games (T20 included), some players play a match without batting at all. Checking a summary like this requires understanding the data being analysed, so some research on cricket and how a T20 match is played may be helpful.

Returning to the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) and then run:

# Compute total runs divided by total matches for each country
mt20_aus_ind %>%
  group_by(Country) %>%
  summarise(total_runs = sum(Runs, na.rm = TRUE),
            total_matches = sum(Matches, na.rm = TRUE),
            totalruns_totalmatches = round(total_runs / total_matches, 3)) %>%
  ungroup()
A tibble: 2 x 4
Country    total_runs  total_matches  totalruns_totalmatches
<chr>      <dbl>       <dbl>          <dbl>
Australia  19280       1405           13.722
India      19568       1409           13.888
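
Because some players appear in a match without batting, one possible variation is to divide by total innings instead of total matches. The sketch below compares the two on made-up numbers; the toy tibble and the runs_per_match and runs_per_innings names are illustrative only and not part of the workshop data.

```r
library(dplyr)

# Hypothetical player totals (made-up values)
toy <- tibble(
  Country = c("Australia", "Australia", "India", "India"),
  Matches = c(10, 8, 12, 6),
  Innings = c(9, 2, 12, 1),
  Runs    = c(270, 10, 330, 5)
)

toy_summary <- toy %>%
  group_by(Country) %>%
  summarise(
    runs_per_match   = sum(Runs) / sum(Matches),  # counts matches with no batting
    runs_per_innings = sum(Runs) / sum(Innings)   # counts only innings batted
  )

toy_summary
```

Since a player cannot bat in more innings than matches played here, the per-innings figure is never smaller than the per-match figure.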

Relationship between average runs and strike rate

Another statistic that we can explore is the strike rate, which represents the average number of runs scored per 100 balls faced.
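
As a quick check of this definition, the sketch below recomputes the strike rate for two players from the head() output earlier (BW Hilfenhaus and UT Yadav), using their Runs and BallsFaced values:

```r
# Runs and BallsFaced for BW Hilfenhaus and UT Yadav, read from the head() output
runs        <- c(2, 2)
balls_faced <- c(10, 4)

# Strike rate: runs scored per 100 balls faced
strike_rate <- 100 * runs / balls_faced
strike_rate  # 20 50, matching the StrikeRate column for these two players
```

In the same output, the players with zero balls faced have NA for StrikeRate, which appears to be because the ratio is undefined when no balls have been faced.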

Again, using the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) to obtain a scatter plot of average runs by strike rate:

# Scatter plot of average runs and strike rate
mt20_aus_ind %>%
  ggplot(aes(x = StrikeRate, y = Average, colour = Country)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relationship between average runs and strike rate")
  • How might you inspect the values of average runs along the upper limit of the graph (above 50 runs)?
# Answer
mt20_aus_ind %>%
  filter(Average > 50) %>%
  arrange(desc(Average))
A spec_tbl_df: 1 x 18
Player     Country    Start  End    Matches  Innings  NotOuts  Runs   HighScore  HighScoreNotOut  Average   BallsFaced  StrikeRate  Hundreds  Fifties  Ducks  Fours  Sixes
<chr>      <chr>      <dbl>  <dbl>  <dbl>    <dbl>    <dbl>    <dbl>  <dbl>      <lgl>            <dbl>     <dbl>       <dbl>       <dbl>     <dbl>    <dbl>  <dbl>  <dbl>
ML Hayden  Australia  2005   2007   9        9        3        308    73         TRUE             51.33333  214         143.9252    0         4        0      37     13
ON YA HAYDEN!

Web scraping T20I cricket data

The ICC Men's T20I Team Rankings is an international Twenty20 cricket rankings system of the International Cricket Council. We want to scrape the "Current rankings" table on the Wikipedia page at https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings.

rvest package for web scraping

To scrape the T20I ratings data from the web:

  • Load the rvest package
  • Store the T20I URL as an object
    • The T20I URL can be stored as an object named t20_url
  • Use the read_html() function from the rvest package to scrape data from t20_url
    • The scraped T20I data can be stored as an object named t20i_page

Fill out the missing parts of the code chunk (???) and then run:

library(rvest)

# Store T20 URL as an object named t20_url
t20_url <- "https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings"

# Scrape T20 data
t20i_page <- rvest::read_html(t20_url)

The HTML table element inside of t20i_page can be extracted and returned as a data frame with the html_element() and html_table() functions from the rvest package.

Fill out the missing parts of the code chunk (???) and then run:

t20i_tables <- rvest::html_element(t20i_page, "table.wikitable") %>%
  rvest::html_table()
t20i_tables %>% head()
A tibble: 6 x 5
Team          Matches  Points  Rating
<chr>         <chr>    <chr>   <chr>   <chr>
India         62       16,556  267     NA
Australia     40       10,241  256     NA
England       39       9876    253     NA
West Indies   46       11604   252     NA
South Africa  35       8777    251     NA
New Zealand   49       12,113  247     NA

We need to do some preprocessing here: the table has an empty column, and all of the numeric columns are read in as character, which is undesirable. The steps below address these issues:

# remove the last column which is empty
# t20i_tables <- t20i_tables[-c(5)]
t20i_tables <- t20i_tables[, -5]

# No need to change column names!

# change the data type of Matches and Rating to integer
# for Points, strip the commas, then convert to integer
t20i_tables <- t20i_tables %>%
  mutate(
    Matches = as.integer(Matches),
    Rating = as.integer(Rating),
    Points = as.integer(gsub(",", "", Points))
  )

# Keep the top 25 countries for simplicity
t20i_tables <- t20i_tables[1:25, ]

t20i_tables %>% head()
A tibble: 6 x 4
Team          Matches  Points  Rating
<chr>         <int>    <int>   <int>
India         62       16556   267
Australia     40       10241   256
England       39       9876    253
West Indies   46       11604   252
South Africa  35       8777    251
New Zealand   49       12113   247
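
The scrape-and-clean steps can also be tried offline. The sketch below is only an illustration: it builds a tiny HTML page by hand with rvest's minimal_html() (the markup is made up, with the two rows copied from the output above) and then applies the same extraction and conversion.

```r
library(rvest)

# Hand-written HTML standing in for the Wikipedia page (an assumption for this sketch)
toy_page <- minimal_html('
  <table class="wikitable">
    <tr><th>Team</th><th>Matches</th><th>Points</th><th>Rating</th></tr>
    <tr><td>India</td><td>62</td><td>16,556</td><td>267</td></tr>
    <tr><td>Australia</td><td>40</td><td>10,241</td><td>256</td></tr>
  </table>
')

# Extract the first table with class "wikitable" and turn it into a data frame
toy_node  <- html_element(toy_page, "table.wikitable")
toy_table <- html_table(toy_node)

# Points contains thousands separators, so strip the commas before converting
toy_table$Points  <- as.integer(gsub(",", "", toy_table$Points))
toy_table$Matches <- as.integer(toy_table$Matches)
toy_table$Rating  <- as.integer(toy_table$Rating)

toy_table
```

Working on a small local page like this makes it easier to check the cleaning logic without depending on the live Wikipedia table, whose layout may change over time.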

Country's rating in T20I

Below is a bar plot of each country's rating in T20I cricket, arranged from highest to lowest. To replicate this plot, you will need to add an x and y aesthetic (with the variable in the x aesthetic ordered using the fct_reorder() function), add a geom layer that tells R to use bars as visual elements for the plot, add a layer to flip the x and y axes (coord_flip()), and add a final layer to label the plot with labs().

t20i_tables %>%
# width 11 inches, height 8 inches
  ggplot(
    aes(
      x = fct_reorder(Team, Rating),
      y = Rating,
      fill = Matches
    )
  ) +
  geom_bar(stat = "identity", alpha = 0.5) +
  coord_flip() +
  labs(title = "Country rating for T20I cricket",
    x = "Country",
    y = "Rating"
  )
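
To see what fct_reorder() does to the ordering in the plot above, here is a minimal sketch on made-up teams and ratings (the names and values are hypothetical):

```r
library(forcats)

# Hypothetical teams and ratings (made-up values)
team   <- c("Team A", "Team B", "Team C")
rating <- c(251, 267, 256)

# Reorder the team levels by rating, lowest first
team_ordered <- fct_reorder(team, rating)

levels(team_ordered)  # "Team A" "Team C" "Team B"
```

After coord_flip(), the last level is drawn at the top of the axis, so the highest-rated team ends up first. Also note that geom_col() is a shorthand for geom_bar(stat = "identity").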