The randomization test p-value proposed by Al-Abadleh et al. (2021) is useful for establishing the statistical significance of differences in air quality measures under minimal modeling assumptions. However, the p-value computation can be time-consuming when repeated over numerous pollutants, stations, years, months, etc.
The following R code provides an example of computing these p-values on multiple cores in parallel (e.g., on a server). Essentially, this consists of a for-loop over all combinations of pollutants, stations, years, and months in the `pollutant_data` dataset. Each iteration of the for-loop is independent of the others, such that the whole loop is embarrassingly parallel and thus can easily be run on multiple cores.
First we load the required packages:
# packages
require(aq2020)
#> Loading required package: aq2020
require(tidyr)
#> Loading required package: tidyr
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
require(lubridate)
#> Loading required package: lubridate
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
require(parallel)
#> Loading required package: parallel
In large-scale computations, it is best to save results after each iteration of the for-loop rather than wait for the entire loop to terminate. This way, if anything goes wrong partway through, we don’t need to restart from scratch.
# where to save pvalue calculations
data_path <- file.path("data", "pollutant", "pvalue")
#' Helper function to standardize file names.
#'
#' @param path Path to the file.
#' @param ... Elements of the file name, to be concatenated into a single string separated by "_".
#' @param ext File extension.
#' @return A string representing the file name.
get_filename <- function(path, ..., ext = "rds") {
  fname <- paste0(c(...), collapse = "_")
  file.path(path, paste0(fname, ".", ext))
}
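For instance, with the `data_path` defined above, the p-value file for O3 at the Milton station would be named:

get_filename(data_path, "Milton", "O3")
#> [1] "data/pollutant/pvalue/Milton_O3.rds"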
Now let’s create the functions to compute the absolute difference of medians and the range of medians test statistics. For more information on these, please see `vignette("aq2020")`.
#' Calculate the absolute difference in medians statistic.
#'
#' @param value A vector of length `n_obs` of daily median data.
#' @param group A grouping vector of length `n_obs` indicating whether the observation is pre or post lockdown data. Any vector for which `length(unique(group)) == 2` will work.
#' @return The difference in medians statistic.
#' @details When `length(unique(group)) = n_group < 2`, the function returns `NA`. This is so the p-value calculation doesn't crash when e.g., there is no data in one of the groups. When `n_group > 2` an error is thrown.
med_diff <- function(value, group) {
  n_group <- length(unique(group))
  if(n_group > 2) {
    stop("group must consist of at most 2 unique groups.")
  }
  if(n_group < 2) return(NA)
  meds <- tapply(value, group, median)
  dmeds <- abs(meds[1] - meds[2])
  setNames(dmeds, nm = "med_diff")
}
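For example, on toy data consisting of two groups with medians 3 and 4:

# group "a" has values 1, 3, 5 (median 3); group "b" has 7, 2, 4 (median 4)
med_diff(value = c(1, 3, 5, 7, 2, 4),
         group = rep(c("a", "b"), each = 3))
#> med_diff
#>        1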
#' Calculate the range between group medians statistic.
#'
#' @param value A vector of length `n_obs` of daily median data.
#' @param group A grouping vector of length `n_obs` indicating whether the observation is pre or post lockdown data. Any vector for which `length(unique(group)) >= 2` will work.
#' @return The range between group medians statistic.
#' @details When `length(unique(group)) = n_group < 2`, the function returns `NA`. This is so the p-value calculation doesn't crash when e.g., there is no data in one of the groups.
med_range <- function(value, group) {
  if(length(unique(group)) < 2) return(NA)
  meds <- tapply(value, group, median)
  rmeds <- max(meds) - min(meds)
  setNames(rmeds, nm = "med_range")
}
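Similarly, `med_range()` on toy data with three yearly groups whose medians are 15, 35, and 55:

# yearly medians are 15, 35, 55, so the range is 55 - 15 = 40
med_range(value = c(10, 20, 30, 40, 50, 60),
          group = rep(2017:2019, each = 2))
#> med_range
#>        40

These statistics are supplied as the `Tfun` argument to `aq2020::fisher_pv()`, which computes the Monte Carlo randomization p-value. For intuition only, a sketch of such a calculation might look as follows; `rand_pv()` is a hypothetical stand-in for illustration, and `fisher_pv()` itself should be used in practice.

# hypothetical sketch of a Monte Carlo randomization p-value;
# use aq2020::fisher_pv() in practice.
rand_pv <- function(group, value, nsim, Tfun) {
  # observed value of the test statistic
  Tobs <- Tfun(value, group)
  if(is.na(Tobs)) return(setNames(c(NA, NA), nm = c("Tobs", "pval")))
  # statistic recomputed under random relabelings of the groups
  Tsim <- replicate(nsim, Tfun(value, sample(group)))
  # p-value: fraction of relabelings at least as extreme as observed
  setNames(c(Tobs, mean(Tsim >= Tobs)), nm = c("Tobs", "pval"))
}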
Now we define the core function `pval_calc()`, which will be applied at each step of the for-loop to a given station/pollutant combination.
#' Compute the randomization p-value for each combination of station, pollutant and month.
#'
#' @param station Station name.
#' @param pollutant Pollutant name.
#' @param months Vector of integers between 1 and 12.
#' @param no_wknd Logical, whether to exclude weekends.
#' @param nsim Number of permutations to use in the Monte Carlo calculation.
#'
#' @return A tibble with columns:
#' \describe{
#' \item{`Station`}{Station name.}
#' \item{`Pollutant`}{Pollutant name.}
#' \item{`Month`}{Month of the year.}
#' \item{`Period`}{Either "2017-2019" or "2020".}
#' \item{`Stat`}{Name of the computed statistic. These are: `Tobs`, `pval`, `median`, and `n`.}
#' \item{`Value`}{Value of the computed statistic.}
#' }
pval_calc <- function(station, pollutant, months, no_wknd = TRUE, nsim) {
  # helper function for returning multiple named statistics
  # as rows of a tibble inside summarize()
  set_tibble <- function(x, nm) {
    tibble(Value = setNames(x, NULL),
           Stat = nm)
  }
  message("Station: ", station, ", Pollutant: ", pollutant)
  # prepare data for p-values
  poll_data <- pollutant_data %>%
    filter(Station == station & Pollutant == pollutant) %>%
    mutate(Year = year(Date),
           Month = month(Date, label = TRUE, abbr = FALSE),
           Day = day(Date)) %>%
    filter(Month %in% month(months, label = TRUE, abbr = FALSE)) %>%
    filter(!is.na(Concentration))
  if(no_wknd) {
    # remove weekend data
    poll_data <- poll_data %>%
      mutate(Weekday = wday(Date, label = TRUE)) %>%
      filter(!Weekday %in% c("Sat", "Sun")) %>%
      select(-Weekday)
  }
  # pvalue calculations
  stat_names <- c("Tobs", "pval", "median", "n")
  tm <- system.time({
    # 2017-2019: range-of-medians statistic between the reference years
    pv_ref <- poll_data %>%
      filter(Year != 2020) %>%
      group_by(Month) %>%
      summarize(set_tibble(
        x = c(fisher_pv(group = Year,
                        value = Concentration,
                        nsim = nsim,
                        Tfun = med_range),
              median(Concentration),
              length(Concentration)),
        nm = stat_names),
        .groups = "drop") %>%
      mutate(Period = "2017-2019")
    # 2020: difference-of-medians statistic between 2020 and 2017-2019
    pv_2020 <- poll_data %>%
      mutate(Year = ifelse(Year == 2020, "2020", "2017-2019")) %>%
      group_by(Month) %>%
      summarize(set_tibble(
        x = c(fisher_pv(group = Year,
                        value = Concentration,
                        nsim = nsim,
                        Tfun = med_diff),
              median(Concentration[Year == "2020"]),
              length(Concentration[Year == "2020"])),
        nm = stat_names),
        .groups = "drop") %>%
      mutate(Period = "2020")
    pv_data <- bind_rows(pv_ref, pv_2020)
  })
  message("Time: ", round(tm[3], 1), " seconds")
  # append station and pollutant information and return
  pv_data %>%
    mutate(Station = station,
           Pollutant = pollutant) %>%
    select(Station, Pollutant, Month, Period, Stat, Value)
}
Now let’s check what `pval_calc()` computes:
pv <- pval_calc(station = "Milton",
                pollutant = "O3",
                months = 11:12,
                no_wknd = TRUE,
                nsim = 100)
#> Station: Milton, Pollutant: O3
#> Time: 0.1 seconds
pv %>% pivot_wider(names_from = "Stat", values_from = "Value")
#> # A tibble: 4 x 8
#> Station Pollutant Month Period Tobs pval median n
#> <chr> <chr> <ord> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Milton O3 November 2017-2019 NA NA 22.8 21
#> 2 Milton O3 December 2017-2019 1.47 0.7 19.2 28
#> 3 Milton O3 November 2020 1.71 0.75 21.0 21
#> 4 Milton O3 December 2020 1.15 0.42 20.4 23
As we can see, the output consists of the station, pollutant, and months requested, the period, and four computed statistics. These are:

- `Tobs`: The value of the test statistic, which is the difference in medians for the 2020 period and the range of medians for the 2017-2019 period.
- `pval`: The corresponding randomization p-value.
- `median`: The median concentration value during either period.
- `n`: The sample size in either period.

Note that `Tobs` and `pval` in November 2017-2019 are `NA`. Let’s take a closer look at why this is:
pollutant_data %>%
  mutate(Year = factor(year(Date))) %>%
  filter(
    Station == "Milton",
    Pollutant == "O3",
    month(Date) == 11
  ) %>%
  group_by(Year, .drop = FALSE) %>%
  summarize(n = n())
#> # A tibble: 4 x 2
#> Year n
#> <fct> <int>
#> 1 2017 0
#> 2 2018 0
#> 3 2019 30
#> 4 2020 30
As we can see, this is because there is no $\text{O}_3$ pollutant data for Milton in November of 2017 and 2018, such that the reference set 2017-2019 consists of only one year, in which case the range-of-medians statistic is meaningless.
We are now ready to calculate the p-value for all station/pollutant/month combinations. Since `pval_calc()` accepts a vector of months, the for-loop need only run over the 123 possible station/pollutant combinations:
# all station/pollutant combinations
pollutant_info %>% filter(has_poll)
#> # A tibble: 123 x 5
#> station station_id pollutant poll_id has_poll
#> <chr> <dbl> <chr> <dbl> <lgl>
#> 1 Barrie 47045 O3 122 TRUE
#> 2 Barrie 47045 PM25 124 TRUE
#> 3 Barrie 47045 NO2 36 TRUE
#> 4 Belleville 54012 O3 122 TRUE
#> 5 Belleville 54012 PM25 124 TRUE
#> 6 Belleville 54012 NO2 36 TRUE
#> 7 Brantford 21005 O3 122 TRUE
#> 8 Brantford 21005 PM25 124 TRUE
#> 9 Brantford 21005 NO2 36 TRUE
#> 10 Burlington 44008 O3 122 TRUE
#> # … with 113 more rows
First, let’s set up our R session to run the for-loop calculations in parallel:
# parallel job specification
nsim <- 1e4 # number of random permutations
months <- 1:12 # months for which to calculate p-values
no_wknd <- TRUE # exclude weekends
# which station/pollutant combinations
# from pollutant_info to run
job_ind <- which(pollutant_info$has_poll)
# cluster setup
# number of cores to use.
# this uses all available physical cores, but a different number can be set.
ncores <- detectCores(logical = FALSE)
# use a random number generator suitable for parallel processing
RNGkind("L'Ecuyer-CMRG")
# create the parallel cluster
cl <- makeCluster(spec = ncores)
# optionally, give each worker a reproducible RNG stream
# (the seed value here is arbitrary)
clusterSetRNGStream(cl, iseed = 2020)
# load all packages on each cluster worker.
invisible(
  clusterEvalQ(cl, {
    require(aq2020)
    require(tidyr)
    require(dplyr)
    require(lubridate)
    require(parallel)
  })
)
# copy all R objects required for the calculation onto each worker.
# it's often simplest to just copy everything in the workspace.
clusterExport(cl, varlist = ls()[ls() != "cl"])
Now we’re ready to run the p-value computation in parallel. This is done with the function `parallel::parLapply()`. Unlike a regular R for-loop, `parLapply()` does the entire for-loop computation in a single function call, so if an error occurs at any step, all previous computations are lost. Therefore, the parallel computation is set up so that:

- The result of each step is saved to disk as soon as it is computed, rather than only once the entire loop has finished.
- `parLapply()` is instructed to flag an error at each step rather than throw one, using `tryCatch()`. This way `parLapply()` will never throw an error, but you can find out after the fact which steps of the computation failed by inspecting the contents of `bad_ind`.
system.time({
  bad_ind <- parLapply(cl, job_ind, fun = function(ii) {
    station <- pollutant_info$station[ii]
    pollutant <- pollutant_info$pollutant[ii]
    # calculate p-value in tryCatch block to catch/flag any errors
    pv_data <- tryCatch(
      pval_calc(station = station, pollutant = pollutant,
                months = months, no_wknd = no_wknd, nsim = nsim),
      error = function(e) NULL
    )
    # save data
    if(!is.null(pv_data)) {
      saveRDS(pv_data,
              file = get_filename(path = data_path,
                                  station, pollutant))
    }
    # return TRUE if p-value computation failed and FALSE otherwise
    is.null(pv_data)
  })
})
Once the calculation is done, we can check whether any of the jobs failed:
# if you are in an interactive R session, you can simply do this:
## any(unlist(bad_ind))
# if you ran R as a batch job which exits after the computation is done,
# you can regenerate the contents of bad_ind like this:
bad_ind <- lapply(job_ind, function(ii) {
  station <- pollutant_info$station[ii]
  pollutant <- pollutant_info$pollutant[ii]
  pv_data <- tryCatch(
    readRDS(file = get_filename(path = data_path,
                                station, pollutant)),
    error = function(e) NULL
  )
  is.null(pv_data)
})
any(unlist(bad_ind))
#> [1] FALSE
If the result is anything other than `FALSE`, you can rerun `parLapply()` over the jobs in `which(unlist(bad_ind))`, as sketched below.
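For example, suppose the anonymous worker function passed to `parLapply()` above had been saved to a variable, say `pval_job()` (a name used here purely for illustration). The rerun might then look like:

# indices (into pollutant_info) of the failed jobs
redo_ind <- job_ind[unlist(bad_ind)]
# rerun only the failed jobs; pval_job() is assumed to be the same
# worker function as in the original parLapply() call
bad_ind2 <- parLapply(cl, redo_ind, fun = pval_job)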
Finally, when the parallel computation is done, you must shut down the cluster:
stopCluster(cl)
The computations above were saved in a separate file for each station/pollutant combination. The following code puts these together into a single dataset containing the following columns:

- `Station`: Station name.
- `Pollutant`: Pollutant name.
- `Period`: Either "2017-2019" or "2020".
- `Month`: Name of the month.
- `Ndays`: Number of days in the given Month and Period.
- `Median`: Median concentration over all days in the given Month and Period.
- `Pval`: P-value of the randomization test.
pollutant_pval <- lapply(job_ind, function(ii) {
  station <- pollutant_info$station[ii]
  pollutant <- pollutant_info$pollutant[ii]
  pv_data <- readRDS(file = get_filename(path = data_path,
                                         station, pollutant))
  pv_data %>%
    pivot_wider(names_from = "Stat", values_from = "Value") %>%
    select(Station, Pollutant, Period, Month,
           Ndays = n, Median = median, Pval = pval)
}) %>% bind_rows()
This dataset is available in the aq2020 package as `pollutant_pval`:
aq2020::pollutant_pval # dataset inside the aq2020 package
#> # A tibble: 2,943 x 7
#> Station Pollutant Period Month Ndays Median Pval
#> <chr> <chr> <chr> <ord> <dbl> <dbl> <dbl>
#> 1 Barrie O3 2017-2019 January 68 24.5 0.341
#> 2 Barrie O3 2017-2019 February 60 29.7 0.245
#> 3 Barrie O3 2017-2019 March 66 34.4 0.0094
#> 4 Barrie O3 2017-2019 April 63 34.4 0.129
#> 5 Barrie O3 2017-2019 May 69 31.7 0.0027
#> 6 Barrie O3 2017-2019 June 63 26.6 0.820
#> 7 Barrie O3 2017-2019 July 66 26.6 0.0777
#> 8 Barrie O3 2017-2019 August 68 23.5 0.957
#> 9 Barrie O3 2017-2019 September 62 20 0.899
#> 10 Barrie O3 2017-2019 October 68 18.6 0.0392
#> # … with 2,933 more rows
Al-Abadleh, H.A., Lysy, M., Neil, L., Patel, P., Mohammed, W., and Khalaf, Y., 2021. Rigorous quantification of statistical significance of the COVID-19 lockdown effect on air quality: The case from ground-based measurements in Ontario, Canada. Journal of Hazardous Materials, 413, 125445.