Workshop 2¶
This applied class explores cricket statistics from two different sources: an R package and web scraping.
Cricket data¶
cricketdata is an R package from rOpenSci that contains data on all international cricket matches, as provided by ESPNCricinfo.
T20 batting¶
- Load the tidyverse.
- Read in CricketData.csv.
- Filter the data for your two favourite countries. Here we have chosen Australia and India.
library(tidyverse)
# dataset
mt20 <- read_csv("CricketData.csv")
# filter for AUS and IND
mt20_aus_ind <- mt20 %>%
  filter(Country %in% c("India", "Australia"))
Look at mt20_aus_ind¶
In a game of cricket, teams take turns batting and bowling. The objective of the batting team is to score as many runs as possible, while the bowling team's objective is to prevent the batting team from scoring runs. At the end of an innings, the batting and bowling teams swap. The team with the most runs wins the match.
Look at what is inside mt20_aus_ind by running it from a code chunk.
# dimensions?
mt20_aus_ind %>% dim()
# column names
mt20_aus_ind %>% colnames()
# head
mt20_aus_ind %>% head()
- 151
- 18
- 'Player'
- 'Country'
- 'Start'
- 'End'
- 'Matches'
- 'Innings'
- 'NotOuts'
- 'Runs'
- 'HighScore'
- 'HighScoreNotOut'
- 'Average'
- 'BallsFaced'
- 'StrikeRate'
- 'Hundreds'
- 'Fifties'
- 'Ducks'
- 'Fours'
- 'Sixes'
Player | Country | Start | End | Matches | Innings | NotOuts | Runs | HighScore | HighScoreNotOut | Average | BallsFaced | StrikeRate | Hundreds | Fifties | Ducks | Fours | Sixes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <lgl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
PP Chawla | India | 2010 | 2012 | 7 | 1 | 0 | 0 | 0 | FALSE | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
A Mishra | India | 2010 | 2017 | 10 | 1 | 0 | 0 | 0 | FALSE | 0 | 0 | NA | 0 | 0 | 1 | 0 | 0 |
AA Noffke | Australia | 2007 | 2008 | 2 | 1 | 0 | 0 | 0 | FALSE | 0 | 0 | NA | 0 | 0 | 1 | 0 | 0 |
MM Patel | India | 2011 | 2011 | 3 | 1 | 0 | 0 | 0 | FALSE | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
BW Hilfenhaus | Australia | 2007 | 2012 | 7 | 3 | 1 | 2 | 2 | FALSE | 1 | 10 | 20 | 0 | 0 | 1 | 0 | 0 |
UT Yadav | India | 2012 | 2019 | 7 | 1 | 0 | 2 | 2 | FALSE | 2 | 4 | 50 | 0 | 0 | 0 | 0 | 0 |
- How many rows and columns are in mt20_aus_ind?
- What does each row in mt20_aus_ind represent?
- What function returns the top of mt20_aus_ind, i.e., the first 6 rows?
We can learn more about mt20_aus_ind by visualising it. For continuous numerical variables, e.g.,
- average run score
- strike rate
or discrete numerical variables that take on a wide range of numbers, e.g.,
- player's highest run score
- total runs scored
we can visualise their distribution with histograms or box plots. Understanding how these variables are distributed provides us with information about their central tendency (mean, median, mode), variability (standard deviation, IQR, range) and shape (skewness).
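As a minimal sketch of the numerical summaries mentioned above, the following computes measures of central tendency and variability for one variable, grouped by country. The toy data frame and its values are made up for illustration; with the workshop data you would start from mt20_aus_ind instead.

```r
library(dplyr)

# Hypothetical toy data: strike rates for six made-up players
toy <- tibble(
  Country    = c("Australia", "Australia", "Australia", "India", "India", "India"),
  StrikeRate = c(120, 135, 150, 110, 140, NA)
)

# Summarise centre and spread per country, ignoring missing values
toy_summary <- toy %>%
  group_by(Country) %>%
  summarise(
    mean_sr   = mean(StrikeRate, na.rm = TRUE),
    median_sr = median(StrikeRate, na.rm = TRUE),
    sd_sr     = sd(StrikeRate, na.rm = TRUE),
    iqr_sr    = IQR(StrikeRate, na.rm = TRUE),
    range_sr  = max(StrikeRate, na.rm = TRUE) - min(StrikeRate, na.rm = TRUE)
  )

toy_summary
```

Without na.rm = TRUE, any group containing a missing value would return NA for every summary.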
Before performing any analysis, we will need to convert the data into a tidy long form:
- Take mt20_aus_ind and select only the variables Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes.
- Convert this data into a tidy long form using gather().
- Gather all columns in mt20_aus_ind except Player and Country, specifying the key and value as Bat_Stats and Value.
- Store this tidy long form data of men's T20 batting statistics in an object named mt20_aus_ind_long.
- Print mt20_aus_ind_long.
Fill out the missing parts of the code chunk (???) and then run:
# Convert mt20_aus_ind to long form
mt20_aus_ind_long <- mt20_aus_ind %>%
  select(Player, Country, NotOuts, HighScore, Average, StrikeRate, Hundreds, Fifties, Ducks, Fours, Sixes) %>%
  gather(Bat_Stats, Value, -Player, -Country)
# Print the first rows of mt20_aus_ind_long
mt20_aus_ind_long %>% head()
Player | Country | Bat_Stats | Value |
---|---|---|---|
<chr> | <chr> | <chr> | <dbl> |
PP Chawla | India | NotOuts | 0 |
A Mishra | India | NotOuts | 0 |
AA Noffke | Australia | NotOuts | 0 |
MM Patel | India | NotOuts | 0 |
BW Hilfenhaus | Australia | NotOuts | 1 |
UT Yadav | India | NotOuts | 0 |
- How many rows and columns are in mt20_aus_ind_long?
- What information does the column Bat_Stats contain?
- What information does the column Value contain?
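The reshaping step above uses gather(), which tidyr now documents as superseded by pivot_longer(). As a hedged sketch of the same idea, the following reshapes a small made-up data frame; toy_wide and its values are hypothetical and only stand in for mt20_aus_ind.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per player, one column per statistic
toy_wide <- tibble(
  Player  = c("Player A", "Player B"),
  Country = c("Australia", "India"),
  NotOuts = c(1, 0),
  Fours   = c(12, 7)
)

# Stack every column except Player and Country into name/value pairs
toy_long <- toy_wide %>%
  pivot_longer(
    cols      = -c(Player, Country),
    names_to  = "Bat_Stats",
    values_to = "Value"
  )

toy_long
```

The result has one row per player and statistic. The row order differs from gather(): pivot_longer() keeps each player's statistics together, whereas gather() stacks one statistic after another.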
Comparing batting statistics¶
We can compare each country's batting statistics by visualising how they are distributed. A good way to do this is with side-by-side box plots:
- Take mt20_aus_ind_long and pipe it into the ggplot() function.
- Add layers to the ggplot call:
  - 1st layer: The x aesthetic should be Country and the y aesthetic the values of the batting statistics.
  - 2nd layer: Box plots are the visual elements we want to use for our graph, so add geom_boxplot().
  - 3rd layer: Another visual element we want to include is jittered data, so add geom_jitter().
  - 4th layer: Facet the graph by the variable that represents batting statistics.
  - 5th layer: Add labels to your graph with labs().
Fill out the missing parts of the code chunk (???) and then run:
# Box plots of countries by batting statistics
mt20_aus_ind_long %>%
  ggplot(aes(x = Country, y = Value)) +
  geom_boxplot(outlier.alpha = 0) + # hide the outliers
  geom_jitter(alpha = 0.3) +
  facet_wrap(~ Bat_Stats, scales = "free") +
  labs(
    title = "Distribution of Australian and Indian batting statistics",
    caption = "Source: https://github.com/ropenscilabs/cricketdata"
  )
What do the warning messages tell us about our data?
There are missing values in the data.
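One way to confirm this is to count the missing values in each column. The sketch below uses a small made-up data frame (toy and its values are hypothetical) that mimics the rows shown earlier, where players who faced 0 balls have an NA strike rate.

```r
library(dplyr)

# Hypothetical toy data: a player who faced no balls has no strike rate
toy <- tibble(
  Player     = c("Player A", "Player B", "Player C"),
  BallsFaced = c(1, 0, 10),
  StrikeRate = c(0, NA, 20)
)

# Count the NA values in every column
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Applying the same summarise() call to mt20_aus_ind should show which columns are behind the warnings.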
Based on the above 9 numerical variables that contain batting statistics from Australian and Indian cricket players, what do you conclude about each country's batting performance?
The distributions of runs for Australian and Indian players are similar.
Which 2 variables look most symmetrically distributed?
Computing grouped statistics¶
Comparing each country's batting performance on total runs alone is not insightful: a country whose players have collectively played more matches may have more total runs simply because of that. Instead, we can compare each country's total runs divided by total matches. Keep in mind that in many cricket games (T20 included), some players play a match without batting at all. Sanity checks like this depend on understanding the data being analysed, so some research on cricket and how a T20 match is played may be helpful.
Returning to the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) and then run:
# Compute total runs divided by total matches for each country
mt20_aus_ind %>%
  group_by(Country) %>%
  summarise(total_runs = sum(Runs, na.rm = TRUE),
            total_matches = sum(Matches, na.rm = TRUE),
            totalruns_totalmatches = round(total_runs / total_matches, 3)) %>%
  ungroup()
Country | total_runs | total_matches | totalruns_totalmatches |
---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> |
Australia | 19280 | 1405 | 13.722 |
India | 19568 | 1409 | 13.888 |
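Because some players appear in a match without batting, one possible refinement is to divide by innings rather than matches. The following is only a sketch on made-up numbers (toy and its values are hypothetical), and it assumes the Innings column counts the innings in which a player actually batted.

```r
library(dplyr)

# Hypothetical toy data: some players bat in fewer innings than matches played
toy <- tibble(
  Country = c("Australia", "Australia", "India", "India"),
  Matches = c(10, 4, 8, 6),
  Innings = c(8, 2, 8, 3),
  Runs    = c(200, 20, 240, 30)
)

# Compare runs per match with runs per innings for each country
toy_rates <- toy %>%
  group_by(Country) %>%
  summarise(
    runs_per_match   = sum(Runs) / sum(Matches),
    runs_per_innings = sum(Runs) / sum(Innings)
  )

toy_rates
```

Runs per innings is never lower than runs per match here, because a player cannot bat in more innings than matches played in this format.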
Relationship between average runs and strike rate¶
Another statistic that we can explore is the strike rate, which represents the average number of runs scored per 100 balls faced.
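To make the definition concrete, the sketch below recomputes strike rate as 100 times runs divided by balls faced. It also recomputes the batting average using the usual cricket convention of runs divided by the number of dismissals (innings minus not outs); that formula is not stated in this workshop, so treat it as an assumption. The input values are copied from the ML Hayden and BW Hilfenhaus rows shown in this workshop.

```r
library(dplyr)

# Values copied from two rows of mt20_aus_ind shown in this workshop
toy <- tibble(
  Player     = c("ML Hayden", "BW Hilfenhaus"),
  Innings    = c(9, 3),
  NotOuts    = c(3, 1),
  Runs       = c(308, 2),
  BallsFaced = c(214, 10)
)

toy_rates <- toy %>%
  mutate(
    strike_rate = 100 * Runs / BallsFaced,    # runs per 100 balls faced
    average     = Runs / (Innings - NotOuts)  # runs per dismissal (assumed convention)
  )

toy_rates
```

Both recomputed columns agree with the StrikeRate and Average values in the data for these two players.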
Again, using the wide form data, mt20_aus_ind, fill out the missing parts of the code chunk (???) to obtain a scatter plot of average runs by strike rate:
# Scatter plot of average runs and strike rate
mt20_aus_ind %>%
  ggplot(aes(x = StrikeRate, y = Average, colour = Country)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relationship between average runs and strike rate")
- How might you inspect the values of average runs along the upper limit of the graph (above 50 runs)?
# Answer
mt20_aus_ind %>%
  filter(Average > 50) %>%
  arrange(desc(Average))
Player | Country | Start | End | Matches | Innings | NotOuts | Runs | HighScore | HighScoreNotOut | Average | BallsFaced | StrikeRate | Hundreds | Fifties | Ducks | Fours | Sixes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <lgl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
ML Hayden | Australia | 2005 | 2007 | 9 | 9 | 3 | 308 | 73 | TRUE | 51.33333 | 214 | 143.9252 | 0 | 4 | 0 | 37 | 13 |
Web scraping T20I cricket data¶
The ICC Men's T20I Team Rankings is an international Twenty20 cricket rankings system of the International Cricket Council. We want to scrape the "Current rankings" table on the Wikipedia page at https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings.
rvest package for web scraping¶
To scrape the T20I ratings data from the web:
- Load the rvest package.
- Store the T20I URL as an object.
  - The T20I URL can be stored as an object named t20_url.
- Use the read_html() function from the rvest package to scrape data from t20_url.
  - The scraped T20I data can be stored as an object named t20i_page.
Fill out the missing parts of the code chunk (???) and then run:
library(rvest)
# Store T20 URL as an object named t20_url
t20_url <- "https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings"
# Scrape T20 data
t20i_page <- rvest::read_html(t20_url)
The HTML table element inside t20i_page can be extracted and returned as a data frame with the html_element() and html_table() functions from the rvest package.
Fill out the missing parts of the code chunk (???) and then run:
t20i_tables <- rvest::html_element(t20i_page, "table.wikitable") %>%
  rvest::html_table()
t20i_tables %>% head()
Team | Matches | Points | Rating | |
---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> |
India | 62 | 16,556 | 267 | NA |
Australia | 40 | 10,241 | 256 | NA |
England | 39 | 9876 | 253 | NA |
West Indies | 46 | 11604 | 252 | NA |
South Africa | 35 | 8777 | 251 | NA |
New Zealand | 49 | 12,113 | 247 | NA |
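To experiment with html_element() and html_table() without a network connection, the sketch below parses a small hand-written HTML snippet with rvest's minimal_html(). The table contents are made up; only the "wikitable" class mirrors the selector used above.

```r
library(rvest)

# Hypothetical HTML standing in for the Wikipedia page
toy_page <- minimal_html('
  <table class="wikitable">
    <tr><th>Team</th><th>Matches</th><th>Points</th></tr>
    <tr><td>Team A</td><td>62</td><td>16,556</td></tr>
    <tr><td>Team B</td><td>39</td><td>9876</td></tr>
  </table>
')

# html_element() returns the first match; html_table() converts it to a tibble
toy_table <- toy_page %>%
  html_element("table.wikitable") %>%
  html_table()

toy_table
```

The Points column stays as character here because "16,556" contains a comma, which is the same kind of issue seen in the scraped table. If a page holds several matching tables, html_elements() (plural) returns all of them rather than just the first.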
We need to do some preprocessing here: there is an empty column, and all of the numeric columns have been read in as character, which is undesirable. The steps below address these issues:
# remove the last column which is empty
# t20i_tables <- t20i_tables[-c(5)]
t20i_tables <- t20i_tables[, -5]
# No need to change column names!
# change the data type of Matches and Rating into integer
# for Points, strip the commas, then convert to integer
t20i_tables <- t20i_tables %>%
  mutate(
    Matches = as.integer(Matches),
    Rating = as.integer(Rating),
    Points = as.integer(gsub(",", "", Points))
  )
# Keep the top 25 countries for simplicity
t20i_tables <- t20i_tables[1:25, ]
t20i_tables %>% head()
Team | Matches | Points | Rating |
---|---|---|---|
<chr> | <int> | <int> | <int> |
India | 62 | 16556 | 267 |
Australia | 40 | 10241 | 256 |
England | 39 | 9876 | 253 |
West Indies | 46 | 11604 | 252 |
South Africa | 35 | 8777 | 251 |
New Zealand | 49 | 12113 | 247 |
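As an alternative to stripping commas with gsub(), readr's parse_number() drops grouping marks while parsing. The following is a sketch on made-up values (toy is hypothetical), not a change to the steps above; note that it returns doubles rather than integers.

```r
library(dplyr)
library(readr)

# Hypothetical toy data with comma-grouped numbers stored as text
toy <- tibble(
  Team   = c("Team A", "Team B"),
  Points = c("16,556", "9876")
)

# parse_number() ignores the grouping commas under the default locale
toy_clean <- toy %>%
  mutate(Points = parse_number(Points))

toy_clean
```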
Country's rating in T20I¶
Below is a bar plot of each country's rating in T20I cricket, arranged from highest to lowest. To replicate this plot, you will need to add an x and y aesthetic (with the variable in the x aesthetic ordered using the fct_reorder() function), add a geom layer that tells R to use bars as visual elements for the plot, add a layer to flip the x and y axes (coord_flip()) and add a final layer to label the plot.
t20i_tables %>%
  # width 11 inches, height 8 inches
  ggplot(
    aes(
      x = fct_reorder(Team, Rating),
      y = Rating,
      fill = Matches
    )
  ) +
  geom_bar(stat = "identity", alpha = 0.5) +
  coord_flip() +
  labs(
    title = "Country rating for T20I cricket",
    x = "Country",
    y = "Rating"
  )
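To see what fct_reorder() does to the x aesthetic in the plot above, the sketch below reorders a small made-up vector of teams by rating. The team names and ratings are hypothetical.

```r
library(forcats)

# Hypothetical teams and ratings
team   <- c("Team A", "Team B", "Team C")
rating <- c(250, 267, 240)

# Reorder the factor levels by ascending rating
team_ordered <- fct_reorder(team, rating)

levels(team_ordered)
# "Team C" "Team A" "Team B"
```

The levels run from the lowest rating to the highest. After coord_flip(), the first level is drawn at the bottom of the axis, so the highest-rated team ends up at the top of the plot.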