This document shows how to obtain the pollutant data from the AQO website, convert it to the desired format, calculate p-values, and produce the corresponding plots in the paper by Al-Abadleh et al. (2021).
First let’s load the R packages needed to perform all calculations.
# load packages
require(aq2020)
#> Loading required package: aq2020
require(tidyr)
#> Loading required package: tidyr
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
require(rvest)
#> Loading required package: rvest
require(lubridate)
#> Loading required package: lubridate
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
require(ggplot2)
#> Loading required package: ggplot2
This will create a data frame with the following columns:

- station: The station name.
- station_id: The numeric station ID (for webscraping).
- pollutant: The pollutant name (O3, PM25, NO2, CO, SO2).
- poll_id: The numeric pollutant ID (for webscraping).
- has_poll: TRUE/FALSE, depending on whether the given station has records for the given pollutant.

This data frame has already been saved in the aq2020 package and can be accessed as follows:
pollutant_info
#> # A tibble: 180 x 5
#> station station_id pollutant poll_id has_poll
#> <chr> <dbl> <chr> <dbl> <lgl>
#> 1 Barrie 47045 O3 122 TRUE
#> 2 Barrie 47045 PM25 124 TRUE
#> 3 Barrie 47045 NO2 36 TRUE
#> 4 Barrie 47045 CO 46 FALSE
#> 5 Barrie 47045 SO2 9 FALSE
#> 6 Belleville 54012 O3 122 TRUE
#> 7 Belleville 54012 PM25 124 TRUE
#> 8 Belleville 54012 NO2 36 TRUE
#> 9 Belleville 54012 CO 46 FALSE
#> 10 Belleville 54012 SO2 9 FALSE
#> # … with 170 more rows
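For a quick overview of this table, one can tally how many stations have records for each pollutant (a small illustrative snippet using only dplyr and the packaged pollutant_info):

# number of stations with data for each pollutant
pollutant_info %>%
  filter(has_poll) %>%
  count(pollutant, name = "n_stations")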
The following code shows how to calculate this data from scratch and save it as a CSV file.
WARNING: The dataset pollutant_info in the aq2020 package may differ from a more recent webscrape using the code below, because the AQO periodically updates its station IDs.
# pollutant names and codes
pollutants <- c(O3 = 122, PM25 = 124, NO2 = 36, CO = 46, SO2 = 9)
# station names and ids
stations <- c(
Barrie = 47045,
Belleville = 54012,
Brantford = 21005,
Burlington = 44008,
Brampton = 46090,
Chatham = 13001,
Cornwall = 56051,
Grand_Bend = 15020,
Guelph = 28028,
Hamilton_Downtown = 29000,
Hamilton_West = 29118,
Hamilton_Mountain = 29214,
Kingston = 52023,
Kitchener = 26060,
London = 15026,
Milton = 44029,
Mississauga = 46108,
Newmarket = 48006,
North_Bay = 75010,
Oakville = 44017,
Ottawa_Downtown = 51001,
Oshawa = 45027,
Parry_Sound = 49005,
Peterborough = 59006,
Port_Stanley = 16015,
Sarnia = 14111,
Sault_Ste_Marie = 71078,
St_Catharines = 27067,
Sudbury = 77233,
Thunder_Bay = 63203,
Toronto_Downtown = 31103,
Toronto_East = 33003,
Toronto_West = 35125,
Windsor_Downtown = 12008,
Windsor_West = 12016,
Toronto_North = 34021
)
# keep only the combinations of stations and pollutants
# for which there is data on the AQO website
station_polls <- sapply(stations, function(station_id) {
station <- "http://www.airqualityontario.com/history/station.php?stationid="
station <- paste0(station, station_id)
station <- read_html(station)
station %>%
html_node("#right_column table.resourceTable") %>%
html_table(header = FALSE) %>%
as_tibble() %>%
filter(X1 == "Pollutants Measured:") %>%
pull(X2)
})
statpoll_hasdata <- sapply(c("O3", "PM2[.]5", "NO2", "CO", "SO2"),
function(poll) {
grepl(poll, station_polls)
})
# combine into a tibble
statpoll <- as_tibble(statpoll_hasdata) %>%
mutate(station = names(stations),
station_id = as.numeric(stations)) %>%
pivot_longer(O3:SO2, names_to = "pollutant", values_to = "has_poll") %>%
mutate(pollutant = gsub("\\[[.]\\]", "", pollutant),
poll_id = as.numeric(pollutants[pollutant])) %>%
select(station, station_id, pollutant, poll_id, has_poll)
# uncomment the line below to save this data as a CSV file
## write_csv(statpoll, path = "aq2020-pollutant_info.csv")
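If desired, a fresh scrape can be checked against the packaged dataset to spot stations whose IDs or pollutant coverage have changed since the package was built (a minimal sketch using dplyr's anti_join; statpoll is the tibble constructed above):

# rows of the fresh scrape that do not match the packaged pollutant_info
# (an empty result means the two agree)
statpoll %>%
  anti_join(pollutant_info,
            by = c("station", "station_id", "pollutant", "poll_id", "has_poll"))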
For each station and pollutant combination for which there is data on the AQO website, the following code pulls this data from the website by year, with the option to pick only specific months. Here we’ll take all available data from 2017-2020.
The first step is to save each webscrape as a file of the form Barrie_NO2_2020_1_12.rds. This is a compressed format consisting here of the NO2 data from the Barrie station for 2020, for months January through December.
Warning: The webscraping below may not provide the same results as those saved in the aq2020 package. This is because the AQO periodically updates the data on its website.
# where to save raw queries
data_path <- file.path("data", "pollutant", "raw")
#' Helper function to standardize file names.
#'
#' @param path Path to the file.
#' @param ... Elements of the file name, to be concatenated into a single string separated by "_".
#' @param ext File extension.
#' @return A string representing the file name.
get_filename <- function(path, ..., ext = "rds") {
fname <- paste0(c(...), collapse = "_")
file.path(path, paste0(fname, ".", ext))
}
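For example, the file name quoted earlier for the Barrie NO2 2020 query is generated as:

# file name for the Barrie NO2 2020 webscrape (months 1 through 12)
get_filename(path = data_path, "Barrie", "NO2", 2020, 1, 12)
#> [1] "data/pollutant/raw/Barrie_NO2_2020_1_12.rds"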
# setup webscrape query information
query_dates <- tibble(
year = 2017:2020,
start_month = 1,
end_month = 12
)
# combine with station/pollutant information
query_info <- pollutant_info %>%
filter(has_poll) %>% # keep only stat/poll combo for which there is data
mutate(by = rep(1, n())) %>%
full_join(query_dates %>%
mutate(by = rep(1, n()))) %>%
select(year, start_month, end_month,
station, station_id, pollutant, poll_id)
#> Joining, by = "by"
nquery <- nrow(query_info)
#' Helper function for webscraping.
#'
#' @details Helper function to:
#'
#' - Webscrape a query (given station/pollutant/year).
#' - Format consistently (see [aq2020::get_aqoyr()]).
#' - Save as `data_path/station_pollutant_year_startmonth_endmonth.rds`
#'
#' If the webscrape query fails, it saves the file with the content `NA`.
#'
#' @param query Webscrape specification. A list with elements `station`, `pollutant`, `year`, `station_id` ,`poll_id`, `start_month`, `end_month`.
get_query <- function(query) {
message("Station: ", query$station,
", Pollutant: ", query$pollutant,
", Year: ", query$year)
query_yr <- tryCatch(
get_aqoyr(station_id = query$station_id,
poll_id = query$poll_id,
year = query$year,
start_month = query$start_month,
end_month = query$end_month,
fill = TRUE),
error = function(e) NA)
# save data
saveRDS(query_yr,
file = get_filename(path = data_path,
query$station, query$pollutant,
query$year, query$start_month, query$end_month))
}
# number of seconds to wait between website scrapes
wait_time <- c(min = .5, max = 1)
# query subset
# in this case, we just need 2020 as the other years
# were saved in a previous run
sub_query <- which(query_info$year == 2020)
system.time({
for(ii in sub_query) {
query <- query_info[ii,]
get_query(query)
# wait between queries for a random amount of time
Sys.sleep(runif(1, min = wait_time["min"], max = wait_time["max"]))
}
})
Now let’s check that all stations and pollutant datasets have been successfully queried for years 2017-2020:
# check if we have all yearly datasets
sub_query <- which(query_info$year %in% 2017:2020)
bad_ind <- sapply(sub_query, function(ii) {
query <- query_info[ii,]
query_yr <- readRDS(get_filename(path = data_path,
query$station, query$pollutant,
query$year, query$start_month,
query$end_month))
all(is.na(query_yr))
})
query_info[sub_query[bad_ind],] %>% print(n = Inf)
#> # A tibble: 5 x 7
#> year start_month end_month station station_id pollutant poll_id
#> <int> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 2017 1 12 Milton 44029 O3 122
#> 2 2017 1 12 Milton 44029 PM25 124
#> 3 2017 1 12 Milton 44029 NO2 36
#> 4 2017 1 12 Milton 44029 SO2 9
#> 5 2018 1 12 Oshawa 45027 O3 122
We can see that the webscrape failed in a few years for a few of the station/pollutant combinations. We can manually check the AQO website to see what's going on. For example, to see what happened in Milton for O3 in 2017, take the following URL:
http://www.airqualityontario.com/history/index.php?c=Academic&s={station_id}&y={year}&p={poll_id}&m={start_month}&e={end_month}&t=html&submitter=Search&i=1
and replace the following terms:

- {station_id}: The station ID for Milton, which is 44029.
- {year}: The year in question, which in this case is 2017.
- {poll_id}: The pollutant ID, which in this case is 122.
- {start_month}: The start month as a number, which is 1 (January).
- {end_month}: The end month as a number, which is 12 (December).

So the URL above becomes:
http://www.airqualityontario.com/history/index.php?c=Academic&s=44029&y=2017&p=122&m=1&e=12&t=html&submitter=Search&i=1
At the time of this writing, following this URL above produces the message:
Sorry no results found for Ozone at Milton(44029) between January and December of the year 2017.
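Rather than filling in the URL template by hand, the check URLs for all failed queries can be assembled directly from query_info. The sketch below only formats the URLs with sprintf; it does not fetch them:

# manual-check URLs for every failed query
bad_queries <- query_info[sub_query[bad_ind], ]
sprintf(
  paste0("http://www.airqualityontario.com/history/index.php?",
         "c=Academic&s=%s&y=%s&p=%s&m=%s&e=%s&t=html&submitter=Search&i=1"),
  bad_queries$station_id, bad_queries$year, bad_queries$poll_id,
  bad_queries$start_month, bad_queries$end_month
)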
A raw data file from the webscrape looks like this:
query <- query_info[1,] # look at the first query
query_yr <- readRDS(get_filename(path = data_path,
query$station, query$pollutant,
query$year, query$start_month,
query$end_month))
# display
message("Station: ", query$station,
", Pollutant: ", query$pollutant,
", Year: ", query$year)
#> Station: Barrie, Pollutant: O3, Year: 2017
as_tibble(query_yr)
#> # A tibble: 365 x 26
#> Station Date H01 H02 H03 H04 H05 H06 H07 H08 H09 H10
#> <int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 47045 2017-01-… 30 30 28 26 31 29 25 22 22 30
#> 2 47045 2017-01-… 1 1 5 4 3 3 1 0 5 22
#> 3 47045 2017-01-… 29 29 28 28 26 23 23 23 24 24
#> 4 47045 2017-01-… 16 15 18 20 22 22 21 16 12 23
#> 5 47045 2017-01-… 35 35 36 36 35 34 34 30 30 31
#> 6 47045 2017-01-… 28 28 29 23 25 22 15 9 12 27
#> 7 47045 2017-01-… 26 18 14 4 1 5 4 1 3 17
#> 8 47045 2017-01-… 31 31 29 29 34 34 35 34 32 30
#> 9 47045 2017-01-… 24 30 31 31 30 27 27 25 24 24
#> 10 47045 2017-01-… 16 16 12 15 18 17 17 14 14 19
#> # … with 355 more rows, and 14 more variables: H11 <int>, H12 <int>, H13 <int>,
#> # H14 <int>, H15 <int>, H16 <int>, H17 <int>, H18 <int>, H19 <int>,
#> # H20 <int>, H21 <int>, H22 <int>, H23 <int>, H24 <int>
Cleaning the raw concentration data involves the following steps:

- Replace the missing-value codes -999 and 9999 by NA.
- Average the hourly measurements into a single daily concentration, keeping only the columns Date, Station, Pollutant, and Concentration.

The results of these processing steps are available in the aq2020 package in the pollutant_data object:
pollutant_data
#> # A tibble: 176,039 x 4
#> Date Station Pollutant Concentration
#> <date> <chr> <chr> <dbl>
#> 1 2017-01-01 Barrie NO2 10.5
#> 2 2017-01-01 Barrie O3 26.4
#> 3 2017-01-01 Barrie PM25 7.71
#> 4 2017-01-01 Belleville NO2 4.04
#> 5 2017-01-01 Belleville O3 29.9
#> 6 2017-01-01 Belleville PM25 6.62
#> 7 2017-01-01 Brampton NO2 7.38
#> 8 2017-01-01 Brampton O3 29.4
#> 9 2017-01-01 Brampton PM25 6.88
#> 10 2017-01-01 Brantford NO2 3.26
#> # … with 176,029 more rows
Here is how to recreate this object from scratch:
# merge webscrape files
pollutant_data <- lapply(which(!bad_ind), function(ii) {
query <- query_info[ii,]
readRDS(get_filename(path = data_path,
query$station, query$pollutant,
query$year, query$start_month,
query$end_month)) %>%
as_tibble() %>%
mutate(Station = query$station,
Pollutant = query$pollutant) %>%
select(Date, Station, Pollutant, H01:H24)
}) %>% bind_rows()
# process
pollutant_data <- pollutant_data %>%
mutate(Date = ymd(Date)) %>%
mutate_at(vars(H01:H24), .funs = ~na_if(., 9999)) %>%
mutate_at(vars(H01:H24), .funs = ~na_if(., -999)) %>%
pivot_longer(cols = H01:H24, names_to = "Hour",
values_to = "Concentration") %>%
group_by(Date, Station, Pollutant) %>%
summarize(Concentration = mean(Concentration, na.rm = TRUE),
.groups = "drop")
rownames(pollutant_data) <- NULL
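As a quick sanity check, one can tally the number of daily records by pollutant and year; the exact counts depend on the particular webscrape, so this is only a suggested diagnostic:

# number of daily records by pollutant and year
pollutant_data %>%
  mutate(Year = year(Date)) %>%
  count(Pollutant, Year) %>%
  pivot_wider(names_from = Year, values_from = n)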
Figure 1 plots the daily \(\text{NO}_2\) concentration in Toronto West over weekdays in June 2020, and during reference years 2017-2019.
poll_data <- pollutant_data %>%
# keep only dates in question
filter(
Station == "Toronto_West",
Pollutant == "NO2",
month(Date, label = TRUE, abbr = FALSE) == "June",
!wday(Date, label = TRUE) %in% c("Sat", "Sun"), # remove weekends
!is.na(Concentration) # remove missing values
) %>%
mutate(
# identifier for year
Year = as.character(year(Date)),
# identifier for 2020 or reference years
Period = ifelse(Year == "2020", "2020", "2017-2019"),
)
# plot data
poll_data %>%
ggplot(aes(x = Period, y = Concentration)) +
geom_boxplot(aes(color = Period),
outlier.shape = NA, alpha = .25) +
geom_jitter(aes(fill = Year), width = .2, size = 2, shape = 21) +
ylab(expression(NO[2]*" Concentration "*(ppb))) +
theme_bw()
# calculate median in each group
poll_med <- poll_data %>%
group_by(Period) %>%
summarize(Median_Concentration = median(Concentration))
poll_med
#> # A tibble: 2 x 2
#> Period Median_Concentration
#> <chr> <dbl>
#> 1 2017-2019 12.8
#> 2 2020 11.1
The data indicate that the median weekday \(\text{NO}_2\) concentration in June during the reference years 2017-2019 was 12.8 ppb, whereas the median concentration in June 2020 was only 11.1 ppb. Does this mean that the drop in \(\text{NO}_2\) is due to the provincial lockdown imposed mid-March? Not necessarily, since this drop could simply be due to natural meteorological fluctuations from one year to the next. The statistical methodology in Al-Abadleh et al. (2021) aims to quantify the probability that this drop is simply due to “random chance” as follows:
Suppose that the \(N_{\text{post}}\) post-lockdown concentration values in June 2020 are labeled \(Y_1, \ldots, Y_{N_{\text{post}}}\), and the \(N_{\text{pre}}\) pre-lockdown concentration values in the reference data June 2017-2019 are labeled \(X_1, \ldots, X_{N_{\text{pre}}}\).
Let
\[ \Delta_{\text{obs}} = \left\vert\text{Median}_{\text{pre}} - \text{Median}_{\text{post}} \right\vert \]
denote the absolute median difference observed between pre- and post-lockdown data; in this case, \(|12.8 - 11.1| = 1.7\).
Under the null hypothesis \(H_0\) that there is no pre/post lockdown difference, every permutation of the \(N_{\text{pre}} + N_{\text{post}}\) observations into groups of size \(N_{\text{pre}}\) and \(N_{\text{post}}\) is equally likely.
In order to assess whether \(\Delta_{\text{obs}} = 1.7\) is statistically significant, Al-Abadleh et al. (2021) propose to calculate the randomization p-value
\[\begin{equation} p = \Pr(\Delta_{\text{rand}} > \Delta_{\text{obs}}), \tag{1} \end{equation}\]
where \(\Delta_{\text{rand}}\) is the absolute difference in medians for a random permutation of the observed data.
In principle, the randomization p-value (1) can be calculated exactly by enumerating all \((N_{\text{pre}} + N_{\text{post}})!\) permutations of the data. However, this is typically a very large number, rendering such an approach infeasible in practice. Instead, (1) is estimated to arbitrarily high precision by Monte Carlo simulation, by reporting the fraction of times \(\Delta_{\text{rand}}\) exceeds \(\Delta_{\text{obs}}\) over a large number \(M\) of randomly drawn permutations. A classical result for binomial sampling is that the Monte Carlo standard error for \(M\) random permutations is no more than \(1/\sqrt{4M}\). Thus with \(M = 10000\), the Monte Carlo standard error on the p-value calculation is no more than \(0.005\).
The procedure above is illustrated in Figure 2. As indicated in the Figure, a p-value \(p > 0.05\) indicates that \(\Delta_{\text{obs}}\) is similar to \(\Delta_{\text{rand}}\), and therefore the observed difference in medians can potentially be attributed to random chance. When \(p < 0.05\), the observed difference in medians is much larger than what would be expected by random chance, suggesting that it could be explained by other factors such as the lockdown effect.
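Before turning to the package function, here is a bare-bones sketch of the Monte Carlo estimate of (1) in base R, using the Toronto West June data from above. It is not the package implementation, but it performs conceptually the same calculation as the fisher_pv() function described next:

# observed absolute difference in medians (2017-2019 vs 2020)
value <- poll_data$Concentration
group <- poll_data$Period
delta_obs <- abs(diff(tapply(value, group, median)))
# absolute difference in medians for M random permutations of the labels;
# with M = 10000 the Monte Carlo standard error is at most 1/sqrt(4*M) = 0.005
set.seed(2020)
M <- 10000
delta_rand <- replicate(M, {
  perm <- sample(group)
  abs(diff(tapply(value, perm, median)))
})
# Monte Carlo estimate of the randomization p-value (1)
mean(delta_rand > delta_obs)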
The aq2020 package provides the function fisher_pv() to calculate randomization p-values. As explained in the documentation, fisher_pv() takes the following inputs:

- value: A vector of length \(N_{\text{pre}} + N_{\text{post}}\) of daily median data.
- group: A grouping vector of length \(N_{\text{pre}} + N_{\text{post}}\) indicating whether the elements of value belong to pre- or post-lockdown data. Any vector for which length(unique(group)) == 2 is an appropriate grouping vector, e.g., consisting of integers 1 and 2, a two-level factor, etc.
- Tfun: A function taking arguments value and group which returns the absolute difference in medians between groups. This function is passed to fisher_pv() for greater generality, in case the user wishes to specify a test statistic other than the difference in medians.
- nsim: The number of permutations to randomly select in order to approximate (1).
The calculation of the randomization test p-value for Toronto West June 2020 vs June 2017-2019 is presented below.
# Calculate difference in medians statistic.
med_diff <- function(value, group) {
if(length(unique(group)) != 2) {
stop("group must consist of exactly 2 unique groups.")
}
meds <- tapply(value, group, median)
dmeds <- abs(meds[1] - meds[2])
setNames(dmeds, nm = "med_diff")
}
# compute the randomization test p-value
diff_pv <- fisher_pv(value = poll_data$Concentration,
group = poll_data$Period,
Tfun = med_diff,
nsim = 1000)
diff_pv
#> Tobs pval
#> med_diff 1.711775 0.131
The first element of diff_pv is the difference in medians statistic, \(\Delta_{\text{obs}} = 1.7\). The second element is the (Monte Carlo estimate of the) randomization test p-value, which in this case is \(p = 0.131\). Thus, the drop in median \(\text{NO}_2\) concentration of \(1.7\) ppb in June 2020 compared to the pre-lockdown reference data of June 2017-2019 is not statistically significant at the 5% level.
The statistical method outlined above tests the null hypothesis that there is no difference between pre- and post-lockdown data in a given month. However, the randomization testing framework can also be used to test for differences between more than two groups. Of particular interest here is whether the June data in each of the three years 2017-2019 are indeed similar enough to be grouped together into a common reference set. The procedure is similar to the two-group case:
Suppose there are \(N_{2017}\), \(N_{2018}\) and \(N_{2019}\) daily medians (no weekends) in each of the reference years.
Consider any test statistic on the data in these years which could be used to assess whether the three years are similar. For example:
The range statistic
\[ R_{\text{obs}} = \max\{\text{Median}_{2017}, \text{Median}_{2018}, \text{Median}_{2019}\} - \min\{\text{Median}_{2017}, \text{Median}_{2018}, \text{Median}_{2019}\}. \]
The variance statistic
\[ V_{\text{obs}} = \operatorname{var}\{\text{Median}_{2017}, \text{Median}_{2018}, \text{Median}_{2019}\}. \]
In Al-Abadleh et al. (2021), the range statistic \(R_{\text{obs}}\) is used, since (i) it reduces to the absolute difference in medians when there are only two groups, and (ii) the variance is rarely computed from only three elements. However, with more than three groups, \(V_{\text{obs}}\) may have its merits over \(R_{\text{obs}}\).
The randomization p-value is defined as
\[\begin{equation} p = \Pr(R_{\text{rand}} > R_{\text{obs}}), \tag{2} \end{equation}\]
where \(R_{\text{rand}}\) is the range statistic for a random permutation of the \(N_{2017} + N_{2018} + N_{2019}\) observations to three groups of size \(N_{2017}\), \(N_{2018}\), and \(N_{2019}\).
The reference year p-value is computed for the June 2017-2019 data below.
# Calculate range between group medians statistics.
med_range <- function(value, group) {
if(length(unique(group)) <= 1) {
stop("group must consist of at least two unique groups.")
}
meds <- tapply(value, group, median)
rmeds <- max(meds) - min(meds)
setNames(rmeds, nm = "med_range")
}
# create a dataset with only the 2017-2019 data and group = year
poll_data2 <- poll_data %>%
filter(Period == "2017-2019") %>%
mutate(Year = year(Date))
# compute the randomization test p-value
range_pv <- fisher_pv(value = poll_data2$Concentration,
group = poll_data2$Year,
Tfun = med_range,
nsim = 1000)
range_pv
#> Tobs pval
#> med_range 2.579167 0.121
Thus we see that the randomization p-value for June during 2017-2019 is \(p = 0.121\), which is not statistically significant. Therefore, we cannot reject the null hypothesis that the median concentration during these years is similar, which justifies grouping them together as reference years.
The randomization test p-value proposed by Al-Abadleh et al. (2021) is useful for establishing the statistical significance of differences in air quality measures under minimal modeling assumptions. However, the computation of the p-value can be time-consuming when repeated across numerous pollutants, stations, years, and months.
In a separate vignette, we provide a detailed example of computing these p-values on multiple cores in parallel (e.g., on a server). To see the code, run the command vignette("aq2020-pvalue"). The full set of p-values for the concentration data is provided by aq2020 in the pollutant_pval dataset:
pollutant_pval
#> # A tibble: 2,943 x 7
#> Station Pollutant Period Month Ndays Median Pval
#> <chr> <chr> <chr> <ord> <dbl> <dbl> <dbl>
#> 1 Barrie O3 2017-2019 January 68 24.5 0.341
#> 2 Barrie O3 2017-2019 February 60 29.7 0.245
#> 3 Barrie O3 2017-2019 March 66 34.4 0.0094
#> 4 Barrie O3 2017-2019 April 63 34.4 0.129
#> 5 Barrie O3 2017-2019 May 69 31.7 0.0027
#> 6 Barrie O3 2017-2019 June 63 26.6 0.820
#> 7 Barrie O3 2017-2019 July 66 26.6 0.0777
#> 8 Barrie O3 2017-2019 August 68 23.5 0.957
#> 9 Barrie O3 2017-2019 September 62 20 0.899
#> 10 Barrie O3 2017-2019 October 68 18.6 0.0392
#> # … with 2,933 more rows
The columns in the dataset are:

- Station: Station name.
- Pollutant: Pollutant name.
- Period: 2017-2019 or 2020.
- Month: Name of the month.
- Ndays: Number of days in the given Month and Period.
- Median: Median concentration over all days in the given Month and Period.
- Pval: P-value of the randomization test.

The p-values can be added to Figure 1 as follows:
# format the p-value data for ggplot
# (range_pv labels the 2017-2019 reference box, diff_pv the 2020 box)
pval_data <- tibble(Pvalue = c(range_pv[2], diff_pv[2]),
                    Period = c("2017-2019", "2020"))
poll_data %>%
ggplot(aes(x = Period, y = Concentration), ylim = c(0, Inf)) +
geom_boxplot(aes(color = Period),
outlier.shape = NA, alpha = .25) +
geom_jitter(aes(fill = Year), width = .2, size = 2, shape = 21) +
# p-value labels at bottom of boxplots
geom_text(data = pval_data,
mapping = aes(x = Period, label = Pvalue, y = 2)) +
ylab(expression(NO[2]*" Concentration "*(ppb))) +
theme_bw()
One can also plot multiple p-values across different stations and different months. For this we'll use the precomputed datasets pollutant_data and pollutant_pval.
poll_data <- pollutant_data %>%
filter(
Pollutant == "NO2",
Station %in% c("Grand_Bend", "Kitchener", "Toronto_West"),
month(Date) %in% 1:6,
!wday(Date, label = TRUE) %in% c("Sat", "Sun"), # remove weekends
!is.na(Concentration)
) %>%
mutate(
Year = factor(year(Date)),
Month = month(Date, label = TRUE, abbr = FALSE),
Period = ifelse(Year == "2020", "2020", "2017-2019")
)
pval_data <- pollutant_pval %>%
filter(
Pollutant == "NO2",
Station %in% c("Grand_Bend", "Kitchener", "Toronto_West"),
as.numeric(Month) %in% 1:6
)
poll_data %>%
ggplot(aes(x = Month, y = Concentration), ylim = c(-5, Inf)) +
geom_boxplot(aes(color = Period),
outlier.shape = NA, alpha = .25) +
geom_jitter(mapping = aes(fill = Year, group = Period),
position = position_jitterdodge(0.7, seed = 1),
size = .8, shape = 21, color = "transparent") +
# p-value labels at bottom of boxplots
geom_text(data = pval_data,
mapping = aes(x = Month, label = round(Pval, 2),
group = Period, y = -2),
position = position_dodge(0.8), size = 3) +
facet_wrap(~ Station, nrow = 3, scales = "free_y") +
ylab(expression(NO[2]*" Concentration "*(ppb))) +
theme_bw()
Al-Abadleh, H.A., Lysy, M., Neil, L., Patel, P., Mohammed, W., and Khalaf, Y., 2021. Rigorous quantification of statistical significance of the covid-19 lockdown effect on air quality: The case from ground-based measurements in Ontario, Canada. Journal of Hazardous Materials, 413, 125445.