\(\trainee\): Student trainee. \(\cotrainee\): Student co-trainee. \(\coraut\): Corresponding author. \(\eqaut\): Equally-contributing authors.
“LocalCop: An R package for local likelihood inference for conditional copulas.”
Acar E, Lysy M, Kuchinsky A (2024). Journal of Open Source Software (accepted). article.
Conditional copula models allow the dependence structure between multiple response variables to be modelled as a function of covariates. LocalCop is an R/C++ package for computationally efficient semiparametric conditional copula modelling using a local likelihood inference framework developed in Acar, Craiu, & Yao (2011), Acar, Craiu, & Yao (2013), and Acar, Czado, & Lysy (2019).
“Statistical Methods for Microrheology of Airway Mucus with Extreme Heterogeneity.”
Caughman\(\cotrainee\) N, Papanikolas M, Markovetz MR, Freeman R, Hill DB, Forest\(\coraut\) MG, Lysy\(\coraut\) M (2024). Journal of Rheology (accepted). DOI: 10.1101/2023.11.20.567244, article.
The mucus lining of the human airway epithelium contains two gel-forming mucins, MUC5B and MUC5AC. During progression of cystic fibrosis (CF), mucus hyper-concentrates as its mucin ratio changes, coinciding with formation of insoluble, dense mucus flakes. We explore rheological heterogeneity of this pathology with reconstituted mucus matching three stages of CF progression and particle-tracking of 200 nm and 1 micron diameter beads. We introduce statistical data analysis methods specific to low signal-to-noise data within flakes. Each bead time series is decomposed into: (i) a fractional Brownian motion (fBm) classifier of the pure time-series signal; (ii) high-frequency static and dynamic noise; and (iii) low-frequency deterministic drift. Subsequent analysis focuses on the denoised fBm classifier ensemble from each mucus sample and bead diameter. Every ensemble fails a homogeneity test, compelling clustering methods to assess levels of heterogeneity. The first binary level detects beads within vs. outside flakes. A second binary level detects within-flake bead signals that can vs. cannot be disentangled from the experimental noise floor. We show all denoised ensembles, within- and outside-flakes, fail a homogeneity test, compelling additional clustering; next, all clusters with sufficient data fail a homogeneity test. These levels of heterogeneity are consistent with outcomes from a stochastic phase-separation process, and dictate applying the generalized Stokes-Einstein relation to each bead per cluster per sample, then frequency-domain averaging to assess rheological heterogeneity. Flakes exhibit a spectrum of gel-like and sol-like domains, outside-flake solutions a spectrum of sol-like domains, painting a rheological signature of the phase-separation process underlying flake-burdened mucus.
“Fast and Scalable Inference for Spatial Extreme Value Models.”
Chen\(\cotrainee\) M, Ramezan R, Lysy\(\coraut\) M (2024). Canadian Journal of Statistics (accepted). article.
The generalized extreme value (GEV) distribution is a popular model for analyzing and forecasting extreme weather data. To increase prediction accuracy, spatial information is often pooled via a latent Gaussian process (GP) on the GEV parameters. Inference for GEV-GP models is typically carried out using Markov chain Monte Carlo (MCMC) methods, or using approximate inference methods such as the integrated nested Laplace approximation (INLA). However, MCMC becomes prohibitively slow as the number of spatial locations increases, whereas INLA is only applicable to a limited subset of GEV-GP models. In this paper, we revisit the original Laplace approximation for fitting spatial GEV models. In combination with a popular sparsity-inducing spatial covariance approximation technique, we show through simulations that our approach accurately estimates the Bayesian predictive distribution of extreme weather events, is scalable to several thousand spatial locations, and is several orders of magnitude faster than MCMC. A case study in forecasting extreme snowfall across Canada is presented.
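The computational core here is classical: replace an intractable integral over the latent field with a Gaussian expansion around its mode. A minimal sketch in Python, using a toy one-dimensional latent Gaussian model (not the paper's GEV-GP implementation) where the exact answer is available for comparison:

```python
# Sketch of a Laplace approximation to a log marginal likelihood.
# Toy model (hypothetical, not the paper's GEV-GP code):
#   z ~ N(0, tau^2) latent, y | z ~ N(z, 1) observed.
import numpy as np
from scipy.optimize import minimize_scalar

tau, y = 2.0, 1.3

def neg_log_joint(z):
    # -log p(y, z), including all Gaussian constants
    return 0.5 * (y - z)**2 + 0.5 * z**2 / tau**2 + np.log(2 * np.pi) + np.log(tau)

z_hat = minimize_scalar(neg_log_joint).x          # joint mode in z

h = 1e-5                                          # numerical Hessian at the mode
hess = (neg_log_joint(z_hat + h) - 2 * neg_log_joint(z_hat)
        + neg_log_joint(z_hat - h)) / h**2

# Laplace: log p(y) ~ log p(y, z_hat) + (1/2) log(2*pi) - (1/2) log |Hessian|
log_marg_laplace = -neg_log_joint(z_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)

# Exact marginal for this Gaussian toy: y ~ N(0, 1 + tau^2)
log_marg_exact = -0.5 * np.log(2 * np.pi * (1 + tau**2)) - 0.5 * y**2 / (1 + tau**2)
print(log_marg_laplace, log_marg_exact)           # identical: the toy is Gaussian
```

For a GEV observation model the two quantities would no longer coincide exactly, which is precisely the approximation the paper studies at scale.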
“Data-Adaptive Probabilistic Likelihood Approximation for Ordinary Differential Equations.”
Wu\(\trainee\) M, Lysy\(\coraut\) M (2024). In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics. article.
Estimating the parameters of ordinary differential equations (ODEs) is of fundamental importance in many scientific applications. While ODEs are typically approximated with deterministic algorithms, new research on probabilistic solvers indicates that they produce more reliable parameter estimates by better accounting for numerical errors. However, many ODE systems are highly sensitive to their parameter values. This produces deep local maxima in the likelihood function – a problem which existing probabilistic solvers have yet to resolve. Here we present a novel probabilistic ODE likelihood approximation, DALTON, which can dramatically reduce parameter sensitivity by learning from noisy ODE measurements in a data-adaptive manner. Our approximation scales linearly in both ODE variables and time discretization points, and is applicable to ODEs with both partially-unobserved components and non-Gaussian measurement models. Several examples demonstrate that DALTON produces more accurate parameter estimates via numerical optimization than existing probabilistic ODE solvers, and even in some cases than the exact ODE likelihood itself.
“Functional Connectivity: Continuous-Time Latent Factor Models for Neural Spike Trains.”
Chen\(\cotrainee\) M, Lysy M, Moorman D, Ramezan\(\coraut\) R (2023). In Proceedings of the 2023 Conference on Cognitive Computational Neuroscience. article.
Modelling the dynamics of interactions in a neuronal ensemble is an important problem in functional connectivity research. One popular framework is latent factor models (LFMs), which have achieved notable success in decoding neuronal population dynamics. However, most LFMs are specified in discrete time, where the choice of bin size significantly impacts inference results. In this work, we present what is, to the best of our knowledge, the first continuous-time multivariate spike train LFM for studying neuronal interactions and functional connectivity. We present an efficient parameter inference algorithm for our biologically justifiable model which (1) scales linearly in the number of simultaneously recorded neurons and (2) bypasses time binning and related issues. Simulation studies show that parameter estimation using the proposed model is highly accurate. Applying our LFM to experimental data from a classical conditioning study on the prefrontal cortex in rats, we found that coordinated neuronal activities are affected by (1) the onset of the cue for reward delivery, and (2) the sub-region within the frontal cortex (OFC/mPFC). These findings shed new light on our understanding of cue and outcome value encoding.
“Always a Bridesmaid: A Machine Learning Approach to Minor Party Identity in Multi-Party Systems.”
French Bourgeois L, Harell A, Stephenson L, Guay P, Lysy M (2023). Statistics, Politics and Policy, 14(1), 91–112. DOI: 10.1515/spp-2022-0009, article.
In multiparty systems, maintaining a distinct and positive partisan identity may be more difficult for those who identify with minor parties, because such parties lack the rich history of success that could reinforce a positive social standing in the political realm. Yet, we know little about the unique nature of minor partisan identities because partisanship tends to be most prominent in single-member plurality systems that tend toward two dominant parties, such as the United States. Canada provides a fascinating case of a single-member plurality electoral system that has consistently led to a multiparty system, ideal for studying minor party identity. We use large datasets of public opinion data, collected in 2019 and 2021 in Canada, and apply Lasso regression, a machine learning technique, to identify the factors that are most important in predicting whether partisans of minor political parties will seek in-group distinctiveness, meaning that they seek a distinct and positive political identity relative to the major political parties they are in competition with, or take part in out-group favouritism, meaning that they seek to become closer to the major political parties. We find that party rating is the most important predictor. The more partisans of the minor party rate their own party favourably, the more they take part in distinctiveness. We also find that the more favourably minor party partisans perceive the major party, the more favouritism they will show towards the major party.
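A minimal sketch of the Lasso variable-ranking step on synthetic data (the feature layout and names are hypothetical; the study's actual survey items and preprocessing differ):

```python
# Sketch: using Lasso to rank predictors of an outcome, on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 2000, 10
X = rng.normal(size=(n, p))
# Pretend only column 0 ("party rating") and column 3 matter.
y = 1.5 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)   # Lasso needs comparable scales
fit = LassoCV(cv=5).fit(X_std, y)

# Nonzero standardized coefficients, sorted by magnitude = importance ranking
for j in np.argsort(-np.abs(fit.coef_)):
    if fit.coef_[j] != 0:
        print(f"feature {j}: coef = {fit.coef_[j]: .3f}")
```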
“The power of weak, transient interactions across biology: A paradigm of emergent behavior.”
Vasquez PA, Walker B, Bloom K, Kolbin D, Caughman N, Freeman R, Lysy M, Hult C, Newhall KA, Papanikolas M, Edelmaier C, Forest MG (2023). Physica D: Nonlinear Phenomena, 454(15), 133866. DOI: 10.1016/j.physd.2023.133866.
A growing list of diverse biological systems and their equally diverse functionalities provides realizations of a paradigm of emergent behavior. In each of these biological systems, pervasive ensembles of weak, short-lived, spatially local interactions act autonomously to convey functionalities at larger spatial and temporal scales. In this article, a range of diverse systems and functionalities are presented in a cursory manner with literature citations for further details. Then two systems and their properties are discussed in more detail: yeast chromosome biology and human respiratory mucus.
“Application of Machine Learning and Statistical Modeling to Identify Sources of Air Pollutant Emissions in Kitchener, Ontario, Canada.”
Mohammed W, Adamescu A, Neil L, Shantz N, Townend T, Lysy\(\coraut\) M, Al-Abadleh\(\coraut\) HA (2022). Environmental Science: Atmospheres (accepted). DOI: 10.1039/D2EA00084A, article.
Machine learning is used across many disciplines to identify complex relations between outcomes and several potential predictors. In the case of air quality research in heavily populated urban centers, such techniques were used to correlate the impacts of Traffic-Related Air Pollutants (TRAP) on vulnerable members of communities, future emission levels, and potential solutions that mitigate adverse effects of poor air quality. However, machine learning tools have not been used to assess the variables that influence measured pollutant levels in a suburban environment. The objective of this study is to apply a novel combination of Random Forest (RF) modeling, a machine learning algorithm, and statistical significance analysis to assess the impacts of anthropogenic and meteorological variables on observed pollutant levels in two separate datasets collected during and after the COVID-19 lockdowns in Kitchener, Ontario, Canada. The results highlight that TRAP levels studied here are linked to meteorology and traffic count/type, with relatively higher sensitivity to the former. Upon taking statistical significance into account when assessing relative importance of variables affecting pollutant levels, our study found that traffic variables had a more discernible influence than many meteorological variables. Additional studies with a larger dataset and spread throughout the year are needed to expand upon these initial findings. The proposed approach outlines a “blueprint” method of quantifying the importance of traffic in mid-size cities experiencing fast population growth and development.
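A minimal sketch of the random forest importance step on synthetic data (variable names and the response are hypothetical; the study additionally assesses the statistical significance of these importances, which is omitted here):

```python
# Sketch: random forest variable importance for a toy pollutant response.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 5000
df = pd.DataFrame({
    "traffic_count": rng.poisson(200, n).astype(float),
    "truck_share": rng.uniform(0, 0.3, n),
    "wind_speed": rng.gamma(2.0, 2.0, n),
    "temperature": rng.normal(10, 8, n),
})
# Toy NO2 response: traffic raises it, wind disperses it.
no2 = (0.05 * df["traffic_count"] + 30 * df["truck_share"]
       - 1.5 * df["wind_speed"] + rng.normal(0, 2, n))

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(df, no2)
for name, imp in sorted(zip(df.columns, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.3f}")   # impurity-based importance ranking
```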
“A Multivariate Point Process Model for Simultaneously Recorded Neural Spike Trains.”
Ramezan R, Chen\(\cotrainee\) M, Lysy M, Marriott P (2022). In Proceedings of the 2022 Conference on Cognitive Computational Neuroscience (accepted). article.
The current state-of-the-art in neurophysiological data collection allows for simultaneous recording of tens to hundreds of neurons, for which point processes are an appropriate statistical modelling framework. However, existing point process models lack multivariate generalizations which are both flexible and computationally tractable. This paper introduces a multivariate generalization of the Skellam process with resetting (SPR), a point process tailored to model individual neural spike trains. The multivariate SPR (MSPR) is biologically justified as it mimics the process of neural integration. Its flexible dependence structure and a fast parameter estimation method make it well-suited for the analysis of simultaneously recorded spike trains from multiple neurons. The strengths and weaknesses of the MSPR are demonstrated through simulation and analysis of experimental data.
“Multimodel Bayesian Analysis of Load Duration Effects in Lumber Reliability.”
Yang\(\cotrainee\) Y, Lysy M, Wong SWK (2022). In Proceedings of the 13th International Conference on Structural Safety and Reliability (accepted). article.
This paper evaluates the reliability of lumber, accounting for the duration-of-load (DOL) effect under different load profiles based on a multimodel Bayesian approach. Three individual DOL models previously used for reliability assessment are considered: the US model, the Canadian model, and the Gamma process model. Procedures for stochastic generation of residential, snow, and wind loads are also described. We propose Bayesian model-averaging (BMA) as a method for combining the reliability estimates of individual models under a given load profile that coherently accounts for statistical uncertainty in the choice of model and parameter values. The method is applied to the analysis of a Hemlock experimental dataset, where the BMA results are illustrated via estimated reliability indices together with 95% interval bands.
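A minimal sketch of the BMA combination step itself, assuming the per-model log marginal likelihoods and reliability indices have already been computed (all numbers below are illustrative, not from the paper):

```python
# Sketch: Bayesian model averaging of reliability estimates from three
# hypothetical DOL models, given made-up log marginal likelihoods.
import numpy as np

log_ml = np.array([-1052.3, -1050.1, -1055.8])  # US, Canadian, Gamma process
beta = np.array([2.8, 3.1, 2.5])                # per-model reliability indices

w = np.exp(log_ml - log_ml.max())
w /= w.sum()                    # posterior model probabilities (equal priors)
beta_bma = np.sum(w * beta)     # model-averaged reliability index
print(w.round(3), beta_bma.round(3))
```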
“Measurement error correction in particle tracking microrheology.”
Ling\(\trainee\) Y, Lysy\(\coraut\) M, Seim I, Newby JM, Hill DB, Cribb J, Forest MG (2022). Annals of Applied Statistics, 16(3), 1747–1773. DOI: 10.1214/21-AOAS1565.
In diverse biological applications, single-particle tracking (SPT) of passive microscopic species has become the experimental measurement of choice, when either the materials are of limited volume or so soft as to deform uncontrollably when manipulated by traditional instruments. In a wide range of SPT experiments, a ubiquitous finding is that of long-range dependence in the particles’ motion. This is characterized by a power-law signature in the mean squared displacement (MSD) of particle positions as a function of time, the parameters of which reveal valuable information about the viscous and elastic properties of various biomaterials. However, MSD measurements are typically contaminated by complex and interacting sources of instrumental noise. As these often affect the high-frequency bandwidth to which MSD estimates are particularly sensitive, inadequate error correction can lead to severe bias in power law estimation and, thereby, the inferred viscoelastic properties. In this article we propose a novel strategy to filter high-frequency noise from SPT measurements. Our filters are shown theoretically to cover a broad spectrum of high-frequency noises and lead to a parametric estimator of MSD power-law coefficients for which an efficient computational implementation is presented. Based on numerous analyses of experimental and simulated data, results suggest our methods perform very well compared to other denoising procedures.
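For context, here is a minimal sketch of the naive MSD power-law fit that instrumental noise distorts, on simulated Brownian data (the paper's estimator is a noise-corrected parametric MLE, not this log-log regression; real data would bias the short lags):

```python
# Sketch: time-averaged MSD and a naive log-log power-law fit.
import numpy as np

def msd(xy, max_lag):
    """Time-averaged MSD of an (n, 2) position array at lags 1..max_lag."""
    return np.array([np.mean(np.sum((xy[lag:] - xy[:-lag])**2, axis=1))
                     for lag in range(1, max_lag + 1)])

rng = np.random.default_rng(0)
dt, n, D = 1 / 60, 1800, 0.1                 # 60 Hz camera, 30 s path
xy = np.cumsum(rng.normal(0, np.sqrt(2 * D * dt), size=(n, 2)), axis=0)

# MSD(t) = 4*D*t^alpha in 2D => slope alpha, intercept log(4D) on log scales
lags = np.arange(1, 61) * dt
alpha, log4D = np.polyfit(np.log(lags), np.log(msd(xy, 60)), 1)
print(f"alpha = {alpha:.2f}, 4D = {np.exp(log4D):.3f}")  # near 1 and 0.4
```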
“Robust and efficient parametric spectral density estimation for high-throughput data.”
Lysy\(\coraut\) M, Zhu\(\trainee\) F, Yates\(\trainee\) B, Labuda A (2022). Technometrics, 64(1), 30–51. DOI: 10.1080/00401706.2021.1884134.
Modern scientific instruments readily record various dynamical phenomena at high frequency and for extended durations. Spanning timescales across several orders of magnitude, such “high-throughput” (HTP) data are routinely analyzed with parametric models in the frequency domain. However, the large size of HTP datasets can render maximum likelihood estimation prohibitively expensive. Moreover, HTP recording devices are operated by extensive electronic circuitry, producing periodic noise to which parameter estimates are highly sensitive. This article proposes to address these issues with a two-stage approach. Preliminary parameter estimates are first obtained by a periodogram variance-stabilizing procedure, for which data compression greatly reduces computational costs with minimal impact to statistical efficiency. Next, a novel test with false discovery rate control eliminates most periodic outliers, to which the second-stage estimator becomes more robust. Extensive simulations and experimental results indicate that for a widely used model in HTP data analysis, a substantial reduction in mean squared error can be expected by applying our methodology.
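A minimal sketch of the first-stage idea under a hypothetical power-law-plus-noise-floor PSD model: bin-averaging (compressing) periodogram ordinates before fitting on the log scale both stabilizes their variance and shrinks the problem size. The model and numbers are assumptions for illustration, not the paper's:

```python
# Sketch: compressed-periodogram fit of S(f) = A / f^alpha + C.
# Synthetic periodogram ordinates are S(f) times unit exponentials.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
f = np.linspace(1.0, 5e3, 2**17)
S_true = 1e-2 / f**0.8 + 1e-5
per = S_true * rng.exponential(size=f.size)    # periodogram ~ S(f) * Exp(1)

B = 128                                        # compression factor
fb = f.reshape(-1, B).mean(axis=1)             # bin-center frequencies
Sb = per.reshape(-1, B).mean(axis=1)           # bin means: near-constant
                                               # variance on the log scale

def resid(par):
    logA, alpha, logC = par
    return np.log(Sb) - np.log(np.exp(logA) / fb**alpha + np.exp(logC))

fit = least_squares(resid, x0=[-4.0, 1.0, -11.0])
print(fit.x)   # alpha component roughly 0.8 (lowest bins carry some bias)
```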
“Machine learning-guided, big data-enabled, biomarker-based systems pharmacology: modeling the stochasticity of natural history and disease progression.”
McComb M, Blair RH, Lysy M, Ramanathan M (2021). Journal of Pharmacokinetics and Pharmacodynamics, 49(1), 65–79. DOI: 10.1007/s10928-021-09786-5.
The incidence of systemic and metabolic co-morbidities increases with aging. The purpose was to investigate a novel paradigm for modeling the orchestrated changes in many disease-related biomarkers that occur during aging. A hybrid strategy that integrates machine learning and stochastic modeling was evaluated for modeling the long-term dynamics of biomarker systems. Bayesian networks (BN) were used to identify quantitative systems pharmacology (QSP)-like models for the inter-dependencies for three disease-related datasets of metabolic (MB), metabolic with leptin (MB-L), and cardiovascular (CVB) biomarkers from the NHANES database. Biomarker dynamics were modeled using discrete stochastic vector autoregression (VAR) equations. BN were used to derive the topological order and connectivity of a data-driven QSP model structure for inter-dependence of biomarkers across the lifespan. The strength and directionality of the connections in the QSP models were evaluated using bootstrapping. VAR models based on QSP model structures from BN were assessed for modeling biomarker system dynamics. BN-restricted VAR models of order 1 were identified as parsimonious and effective for characterizing biomarker system dynamics in the MB, MB-L and CVB datasets. Simulation of annual and triennial data for each biomarker provided good fits and predictions of the training and test datasets, respectively. The novel strategy harnesses machine learning to construct QSP model structures for inter-dependence of biomarkers. Stochastic modeling with the QSP models was effective for predicting the age-varying dynamics of disease-relevant biomarkers over the lifespan.
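A minimal sketch of the VAR(1) ingredient, fitting a sparse transition matrix by least squares on simulated data (illustrative; the paper restricts the nonzero pattern using the Bayesian network, which is not modeled here):

```python
# Sketch: least-squares fit of a VAR(1) model x_t = A x_{t-1} + eps_t.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.8, 0.1, 0.0],
                   [0.0, 0.7, 0.2],
                   [0.1, 0.0, 0.9]])     # sparse, BN-like connectivity
n, d = 400, 3
x = np.zeros((n, d))
for t in range(1, n):
    x[t] = A_true @ x[t - 1] + rng.normal(0, 0.1, d)

# Row form x_t = x_{t-1} A^T, so lstsq recovers A^T.
At_hat, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
print(np.round(At_hat.T, 2))             # close to A_true
```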
“Rigorous Quantification of the Statistical Significance of COVID-19 Lockdown Effect on Air Quality: The Case from Ground-Based Measurements in Ontario, Canada.”
Al-Abadleh\(\coraut\) HA, Lysy\(\coraut\) M, Neil L, Patel P, Mohammed W, Khalaf Y (2021). Journal of Hazardous Materials, 413, 125445(15). DOI: 10.1016/j.jhazmat.2021.125445.
Preliminary analyses of satellite measurements from around the world showed drops in nitrogen dioxide (NO2) coinciding with lockdowns due to the COVID-19 pandemic. Several studies found that these drops correlated with local decreases in transportation and/or industry. None of these studies, however, has rigorously quantified the statistical significance of these drops relative to natural meteorological variability and other factors that influence pollutant levels during similar time periods in previous years. Here, we develop a novel statistical protocol that accounts for seasonal variability, transboundary influences, and new factors such as COVID-19 restrictions in explaining trends in several pollutant levels at 16 ground-based measurement sites in Southern Ontario, Canada. We find statistically significant and temporary drops in NO2 (11 out of 16 sites) and CO (all 4 sites) in April–December 2020, with pollutant levels 20% lower than in the previous three years. Fewer sites (2–3 out of 16) experienced statistically significant drops in O3 and PM2.5. The statistical significance testing framework developed here is the first of its kind applied to air quality data. It highlights the benefit of a rigorous assessment of statistical significance, should analyses of pollutant levels post COVID-19 lockdowns be used to inform policy decisions.
“A Convergence Diagnostic for Bayesian Clustering.”
Lysy M, Asgharian M, Partovi Nia V (2020). WIRES Computational Statistics, 13(4), e1536. DOI: 10.1002/wics.1536.
In many applications of Bayesian clustering, posterior sampling on the discrete state space of cluster allocations is achieved via Markov chain Monte Carlo (MCMC) techniques. As it is typically challenging to design transition kernels to explore this state space efficiently, MCMC convergence diagnostics for clustering applications are especially important. Here we propose a diagnostic tool for discrete-space MCMC, focusing on Bayesian clustering applications where the model parameters have been integrated out. We construct a Hotelling-type statistic on the highest probability states, and use regenerative sampling theory to derive its equilibrium distribution. By leveraging information from the unnormalized posterior, our diagnostic offers added protection against seemingly convergent chains in which the relative frequency of visited states is incorrect. The methodology is illustrated with a Bayesian clustering analysis of genetic mutants of a flowering plant.
“Predicting Population Level Hip Fracture Risk: A Novel Hierarchical Model Incorporating Probabilistic Approaches and Factor of Risk Principles.”
Martel\(\cotrainee\) DR, Lysy M, Laing AC (2020). Computer Methods in Biomechanics and Biomedical Engineering, 23(15), 1201–1214. DOI: 10.1080/10255842.2020.1793331.
Fall-related hip fractures are a major public health issue. While individual-level risk assessment tools exist, population-level predictive models could catalyze innovation in large-scale interventions. This study presents a hierarchical probabilistic model that predicts population-level hip fracture risk based on Factor of Risk (FOR) principles. Model validation demonstrated that FOR output aligned with a published dataset categorized by sex and hip fracture status. The model predicted normalized FOR for 100,000 individuals simulating the Canadian older-adult population. Predicted hip fracture risk was higher for females (by an average of 38%), and increased with age (by 15% per decade). Potential applications are discussed.
“Common-Factor Stochastic Volatility Modeling with Observable Proxy.”
Fang\(\trainee\) Y, Lysy\(\coraut\) M, McLeish D (2020). Canadian Journal of Statistics, 48(1), 36–61. DOI: 10.1002/cjs.11536.
Multi-asset modelling is of fundamental importance to financial applications such as risk management and portfolio selection. In this article, we propose a multivariate stochastic volatility modelling framework with a parsimonious and interpretable correlation structure. Building on well-established evidence of common volatility factors among individual assets, we consider a multivariate diffusion process with a common-factor structure in the volatility innovations. Upon substituting an observable market proxy for the common volatility factor, we markedly improve the estimation of several model parameters and latent volatilities. The model is applied to a portfolio of several important constituents of the S&P500 in the financial sector, with the VIX index as the common-factor proxy. We find that the prediction intervals for asset forecasts are comparable to those of more complex dependence models, but that option-pricing uncertainty can be greatly reduced by adopting a common-volatility structure.
“Flexible Dynamic Vine Copula Models for Multivariate Time Series Data.”
Acar EF, Czado C, Lysy M (2019). Econometrics and Statistics, 12, 181–197. DOI: 10.1016/j.ecosta.2019.03.002.
The representation of temporal patterns is essential to time series analysis. In the case of two or more time series, one needs to account for temporal patterns not only in each univariate series but also in their joint behavior. A multivariate model is proposed here for the specification of time-varying dependence patterns in multivariate time series in a flexible way. The model is built by first addressing the temporal patterns in each series and then modeling the interdependencies among their innovations using a time-varying vine copula model. To specify the vine decomposition, a heuristic model selection tool that accounts for both the magnitude and variation of the empirical Kendall tau across different time intervals is employed. The time variation in the strength of pairwise dependencies is inferred using nonparametric smoothing techniques, and the uncertainty in the resulting estimates is assessed using a parametric bootstrap. The methods are evaluated in a simulation study and used to analyze daily exchange rate returns of seven major currencies from August 2005 to August 2016.
“Technological strategies to estimate and control diffusive passage times through the mucus barrier in mucosal drug delivery.”
Newby JM, Seim I, Lysy M, Ling\(\trainee\) Y, Huckaby J, Lai SK, Forest MG (2018). Advanced Drug Delivery Reviews, 124, 64–81. DOI: 10.1016/j.addr.2017.12.002.
In mucosal drug delivery, two design goals are desirable: 1) ensure drug passage through the mucosal barrier to the epithelium prior to drug removal from the respective organ via mucus clearance; and 2) design carrier particles to achieve a prescribed arrival time and drug uptake schedule at the epithelium. Both goals are achievable if one can control “one-sided” diffusive passage times of drug carrier particles: from deposition at the mucus interface, through the mucosal barrier, to the epithelium. The passage time distribution must be, with high confidence, shorter than the timescales of mucus clearance to maximize drug uptake. For 100 nm and smaller drug-loaded nanoparticulates, as well as pure drug powders or drug solutions, diffusion is normal (i.e., Brownian) and rapid, easily passing through the mucosal barrier prior to clearance. Major challenges in quantitative control over mucosal drug delivery lie with larger drug-loaded nanoparticulates that are comparable to or larger than the pores within the mucus gel network, for which diffusion is not simple Brownian motion and typically much less rapid; in these scenarios, a timescale competition ensues between particle passage through the mucus barrier and mucus clearance from the organ. In the lung, as a primary example, coordinated cilia and air drag continuously transport mucus toward the trachea, where mucus and trapped cargo are swallowed into the digestive tract. Mucus clearance times in lung airways range from minutes to hours or significantly longer depending on deposition in the upper, middle, lower airways and on lung health, giving a wide time window for drug-loaded particle design to achieve controlled delivery to the epithelium. We review the physical and chemical factors (of both particles and mucus) that dictate particle diffusion in mucus, and the technological strategies (theoretical and experimental) required to achieve the design goals. First, we describe an idealized scenario (a homogeneous viscous fluid of uniform depth with a particle undergoing passive normal diffusion), where the theory of Brownian motion affords the ability to rigorously specify particle size distributions to meet a prescribed, one-sided, diffusive passage time distribution. Furthermore, we describe how the theory of Brownian motion provides the scaling of one-sided diffusive passage times with respect to mucus viscosity and layer depth, and under reasonable caveats, one can also prescribe passage time scaling due to heterogeneity in viscosity and layer depth. Small-molecule drugs and muco-inert, drug-loaded carrier particles 100 nm and smaller fall into this class of rigorously controllable passage times for drug delivery. Second, we describe the prevalent scenarios in which drug-loaded carrier particles in mucus violate simple Brownian motion, instead exhibiting anomalous sub-diffusion, for which all theoretical control over diffusive passage times is lost, and experiments are prohibitive if not impossible to measure one-sided passage times. We then discuss strategies to overcome these roadblocks, requiring new particle-tracking experiments and emerging advances in theory and computation of anomalous, sub-diffusive processes that are necessary to predict and control one-sided particle passage times from deposition at the mucosal interface to epithelial uptake. We highlight progress to date, remaining hurdles, and prospects for achieving the two design goals for 200 nm and larger, drug-loaded, non-dissolving nanoparticulates.
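A minimal sketch of the idealized first scenario, assuming pure Brownian motion through a homogeneous layer (a reflecting wall at the mucus interface, an absorbing one at the epithelium); the Monte Carlo mean reproduces the L^2/D diffusive scaling referenced above:

```python
# Sketch: Monte Carlo one-sided passage times through a layer of depth L.
import numpy as np

def passage_times(D, L, n_paths=2000, dt=1e-3, n_max=200_000):
    """Reflecting wall at x = 0, absorbing boundary at x = L."""
    rng = np.random.default_rng(0)
    x = np.zeros(n_paths)
    t = np.full(n_paths, np.nan)
    alive = np.ones(n_paths, dtype=bool)
    for step in range(1, n_max + 1):
        x[alive] = np.abs(x[alive] + rng.normal(0.0, np.sqrt(2 * D * dt),
                                                alive.sum()))
        hit = alive & (x >= L)       # absorbed at the epithelium
        t[hit] = step * dt
        alive &= ~hit
        if not alive.any():
            break
    return t

D, L = 0.1, 1.0
tau = passage_times(D, L)
print(np.nanmean(tau), L**2 / (2 * D))  # simulated mean vs. exact L^2/(2D)
```

For sub-diffusive particles this clean theory is exactly what breaks down, which is the paper's second scenario.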
“Diving into the consumer nutrition environment: A Bayesian spatial factor analysis of neighborhood restaurant environment.”
Luan\(\cotrainee\) H, Law J, Lysy M (2018). Spatial and Spatio-Temporal Epidemiology, 24, 39–51. DOI: 10.1016/j.sste.2017.12.001.
Neighborhood restaurant environment (NRE) plays a vital role in shaping residents’ eating behaviors. While NRE ‘healthfulness’ is a multi-faceted concept, most studies evaluate it based only on restaurant type, thus largely ignoring variations of in-restaurant features. In the few studies that do account for such features, healthfulness scores are simply averaged over accessible restaurants, thereby concealing any uncertainty attributable to neighborhoods’ size or spatial correlation. To address these limitations, this paper presents a Bayesian Spatial Factor Analysis for assessing NRE healthfulness in the city of Kitchener, Canada. Several in-restaurant characteristics are included. By treating NRE healthfulness as a spatially correlated latent variable, the adopted modeling approach can: (i) identify specific indicators most relevant to NRE healthfulness, (ii) provide healthfulness estimates for neighborhoods without accessible restaurants, and (iii) readily quantify uncertainties in the healthfulness index. Implications of the analysis for intervention program development and community food planning are discussed.
“Model comparison and assessment for single particle tracking in biological fluids.”
Lysy M, Pillai NS, Hill DB, Forest MG, Mellnik JW, Vasquez PA, McKinley SA (2016). Journal of the American Statistical Association, 111(516), 1413–1426. DOI: 10.1080/01621459.2016.1158716.
State-of-the-art techniques in passive particle-tracking microscopy provide high-resolution path trajectories of diverse foreign particles in biological fluids. For particles on the order of 1 μm diameter, these paths are generally inconsistent with simple Brownian motion. Yet, despite an abundance of data confirming these findings and their wide-ranging scientific implications, stochastic modeling of the complex particle motion has received comparatively little attention. Even among posited models, there is virtually no literature on likelihood-based inference, model comparisons, and other quantitative assessments. In this article, we develop a rigorous and computationally efficient Bayesian methodology to address this gap. We analyze two of the most prevalent candidate models for 30-sec paths of 1 μm diameter tracer particles in human lung mucus: fractional Brownian motion (fBM) and a Generalized Langevin Equation (GLE) consistent with viscoelastic theory. Our model comparisons distinctly favor GLE over fBM, with the former describing the data remarkably well up to the timescales for which we have reliable information. Supplementary materials for this article are available online.
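A minimal sketch of the fBM half of such an analysis: the exact (dense) Gaussian likelihood of fractional Gaussian noise (fBM increments) with a grid MLE for the Hurst parameter H. Since MSD ~ t^(2H), H < 1/2 gives the subdiffusion seen in mucus. This O(n^3) toy is far from the paper's full Bayesian model-comparison machinery:

```python
# Sketch: profile likelihood for the Hurst parameter of fractional
# Gaussian noise, via dense Cholesky factorizations.
import numpy as np
from scipy.linalg import toeplitz, cho_factor, cho_solve

def fgn_acf(n, H):
    k = np.arange(n, dtype=float)
    return 0.5 * ((k + 1)**(2*H) - 2 * k**(2*H) + np.abs(k - 1)**(2*H))

def fgn_profile_loglik(dx, H):
    """Gaussian log-likelihood with the scale parameter profiled out."""
    c = cho_factor(toeplitz(fgn_acf(dx.size, H)))
    quad = dx @ cho_solve(c, dx)
    logdet = 2 * np.sum(np.log(np.diag(c[0])))
    n = dx.size
    return -0.5 * (n * np.log(quad / n) + logdet + n)

# Simulate fGn with H = 0.3 by Cholesky, then recover H by grid search.
rng = np.random.default_rng(3)
n, H_true = 500, 0.3
dx = np.linalg.cholesky(toeplitz(fgn_acf(n, H_true))) @ rng.normal(size=n)

H_grid = np.linspace(0.05, 0.95, 91)
H_hat = H_grid[np.argmax([fgn_profile_loglik(dx, H) for H in H_grid])]
print(f"H_hat = {H_hat:.2f} (true {H_true})")
```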
“Comment on Article by Chkrebtii, Campbell, Calderhead, and Girolami.”
Lysy M (2016). Bayesian Analysis, 11(4), 1269–1273. DOI: 10.1214/16-BA1036, article.
The authors present an ingenious probabilistic numerical solver for deterministic differential equations (DEs). The true solution is progressively identified via model interrogations, in a formal framework of Bayesian updating. I have attempted to extend the authors’ ideas to stochastic differential equations (SDEs), and discuss two challenges encountered in this endeavor: (i) the non-differentiability of SDE sample paths, and (ii) the sampling of diffusion bridges, typically required of solutions to the SDE inverse problem.
“Maximum likelihood estimation for single particle, passive microrheology data with drift.”
Mellnik JW, Lysy M, Vasquez PA, Pillai NS, Hill DB, Cribb J, McKinley SA, Forest MG (2016). Journal of Rheology, 60(3), 379–392. DOI: 10.1122/1.4943988.
Volume limitations and low yield thresholds of biological fluids have led to widespread use of passive microparticle rheology. The mean-squared-displacement (MSD) statistics of bead position time series (bead paths) are either applied directly to determine the creep compliance [Xu et al., Rheol. Acta 37, 387–398 (1998)] or transformed to determine dynamic storage and loss moduli [Mason and Weitz, Phys. Rev. Lett. 74, 1250–1253 (1995)]. A prevalent hurdle arises when there is a nondiffusive experimental drift in the data. Commensurate with the magnitude of drift relative to diffusive mobility, quantified by a Péclet number, the MSD statistics are distorted, and thus the path data must be “corrected” for drift. The standard approach is to estimate and subtract the drift from particle paths, and then calculate MSD statistics. We present an alternative, parametric approach using maximum likelihood estimation that simultaneously fits drift and diffusive model parameters from the path data; the MSD statistics (and consequently the compliance and dynamic moduli) then follow directly from the best-fit model. We illustrate and compare both methods on simulated path data over a range of Péclet numbers, where exact answers are known. We choose fractional Brownian motion as the numerical model, because it affords tunable, subdiffusive MSD statistics consistent with typical 30 s long, experimental observations of microbeads in several biological fluids. Finally, we apply and compare both methods on data from human bronchial epithelial cell culture mucus.
“Calibration of higher eigenmodes of cantilevers.”
Labuda A, Kocuń M, Lysy M, Walsh T, Meinhold J, Proksch T, Meinhold W, Anderson C, Proksch R (2016). Review of Scientific Instruments, 87(7), 073705. DOI: 10.1063/1.4955122, http://alekslabuda.com/sites/default/files/publications/[2016-07]%20Calibration%20of%20higher%20eigenmodes%20of%20cantilevers.pdf.
A method is presented for calibrating the higher eigenmodes (resonant modes) of atomic force microscopy cantilevers that can be performed prior to any tip-sample interaction. The method leverages recent efforts in accurately calibrating the first eigenmode by providing the higher-mode stiffness as a ratio to the first mode stiffness. A one-time calibration routine must be performed for every cantilever type to determine a power-law relationship between stiffness and frequency, which is then stored for future use on similar cantilevers. Then, future calibrations only require a measurement of the ratio of resonant frequencies and the stiffness of the first mode. This method is verified through stiffness measurements using three independent approaches: interferometric measurement, AC approach-curve calibration, and finite element analysis simulation. Power-law values for calibrating higher-mode stiffnesses are reported for several cantilever models. Once the higher-mode stiffnesses are known, the amplitude of each mode can also be calibrated from the thermal spectrum by application of the equipartition theorem.
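A minimal sketch of the resulting calibration arithmetic, k_n = k_1 (f_n / f_1)^p; all numbers below are hypothetical stand-ins, with p coming from the one-time per-cantilever-model calibration the paper describes:

```python
# Sketch: higher-eigenmode stiffness from the power-law recipe.
k1 = 2.0                 # first-mode stiffness from standard calibration [N/m]
f1, f2 = 70e3, 440e3     # measured resonant frequencies of modes 1 and 2 [Hz]
p = 2.0                  # illustrative power-law exponent for this cantilever type

k2 = k1 * (f2 / f1) ** p
print(f"second-mode stiffness ~ {k2:.0f} N/m")
```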
“SoF: Soft-cluster matrix factorization for probabilistic clustering.”
Zhao\(\cotrainee\) H, Poupart P, Zhang Y, Lysy M (2015). In Proceedings of the 29th AAAI Conference on Artificial Intelligence, 29(1), 3188–3195. article.
We propose SoF (Soft-cluster matrix Factorization), a probabilistic clustering algorithm which softly assigns each data point into clusters. Unlike model-based clustering algorithms, SoF does not make assumptions about the data density distribution. Instead, we take an axiomatic approach to define four properties that the probability of co-clustered pairs of points should satisfy. Based on the properties, SoF utilizes a distance measure between pairs of points to induce the conditional co-cluster probabilities. The objective function in our framework establishes an important connection between probabilistic clustering and constrained symmetric Nonnegative Matrix Factorization (NMF), hence providing a theoretical interpretation for NMF-based clustering algorithms. To optimize the objective, we derive a sequential minimization algorithm using a penalty method. Experimental results on both synthetic and real-world datasets show that SoF significantly outperforms previous NMF-based algorithms and that it is able to detect non-convex patterns as well as cluster boundaries.
“A new probabilistic method for quantifying \(n\)-dimensional ecological niches and niche overlap.”
Swanson\(\eqaut\) HK, Lysy\(\eqaut\) M, Power M, Stasko\(\cotrainee\) AD, Johnson JD, Reist JD (2015). Ecology, 96(2), 318–324. DOI: 10.1890/14-0235.1.
Considerable progress has been made in the development of statistical tools to quantify trophic relationships using stable isotope ratios, including tools that address size and overlap of isotopic niches. We build upon recent progress and propose a new probabilistic method for determining niche region and pairwise niche overlap that can be extended beyond two dimensions, provides directional estimates of niche overlap, accounts for species-specific distributions in niche space, and, unlike geometric methods, produces consistent and unique bivariate projections of multivariate data. We define the niche region (\(\mathrm{N}_\mathrm{R}\)) as a given 95% (or user-defined \(\alpha\)) probability region in multivariate space. Overlap is calculated as the probability that an individual from species \(A\) is found in the \(\mathrm{N}_\mathrm{R}\) of species \(B\). Uncertainty is accounted for in a Bayesian framework, and is the only aspect of the methodology that depends on sample size. Application is illustrated with three-dimensional stable isotope data, but practitioners could use any continuous indicator of ecological niche in any number of dimensions. We suggest that this represents an advance in our ability to quantify and compare ecological niches in a way that is more consistent with Hutchinson’s concept of an “\(n\)-dimensional hypervolume”.
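A minimal sketch of the overlap definition for Gaussian niches on synthetic three-dimensional "isotope" data, using plug-in estimates rather than the paper's Bayesian treatment of uncertainty:

```python
# Sketch: directional niche overlap, P(individual of A lies in the 95%
# niche region of B), estimated by Monte Carlo for Gaussian niches.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
A = rng.multivariate_normal([0, 0, 0], np.eye(3), size=4000)
B = rng.multivariate_normal([1, 0.5, 0], 1.5 * np.eye(3), size=4000)

# 95% niche region of B: Mahalanobis distance within the chi-squared cutoff
mu_B, S_B = B.mean(0), np.cov(B.T)
cutoff = chi2.ppf(0.95, df=3)

d = A - mu_B
maha = np.einsum("ij,jk,ik->i", d, np.linalg.inv(S_B), d)
print(f"P(A in N_R of B) = {np.mean(maha <= cutoff):.2f}")
```

Swapping the roles of A and B gives the other direction of overlap, which in general differs, hence "directional" estimates.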
“Functional derivatives applied to error propagation of uncertainties in topography to large-aperture scintillometer-derived heat fluxes.”
Gruber M, Fochesatto G, Hartogensis O, Lysy M (2014). Atmospheric Measurement Techniques, 7, 2361–2371. DOI: 10.5194/amt-7-2361-2014, article.
Scintillometer measurements allow for estimations of the refractive index structure parameter \(C_n^2\) over large areas in the atmospheric surface layer. Turbulent fluxes of heat and momentum are inferred through coupled sets of equations derived from the Monin–Obukhov similarity hypothesis. One-dimensional sensitivity functions have been produced that relate the sensitivity of heat fluxes to uncertainties in single values of beam height over flat terrain. However, real field sites include variable topography. We develop here, using functional derivatives, the first analysis of the sensitivity of scintillometer-derived sensible heat fluxes to uncertainties in spatially distributed topographic measurements. Sensitivity is shown to be concentrated in areas near the center of the beam path and where the underlying topography is closest to the beam height. Relative uncertainty contributions to the sensible heat flux from uncertainties in topography can reach 20% of the heat flux in some cases. Uncertainty may be greatly reduced by focusing accurate topographic measurements in these specific areas. A new two-dimensional variable terrain sensitivity function is developed for quantitative error analysis. This function is compared with the previous one-dimensional sensitivity function for the same measurement strategy over flat terrain. Additionally, a new method of solution to the set of coupled equations is produced that eliminates computational error.
“A multiresolution method for parameter estimation of diffusion processes.”
Kou S, Olding BP, Lysy M, Liu JS (2012). Journal of the American Statistical Association, 107(500), 1558–1574. DOI: 10.1080/01621459.2012.720899.
Diffusion process models are widely used in science, engineering and finance. Most diffusion processes are described by stochastic differential equations in continuous time. In practice, however, data is typically only observed at discrete time points. Except for a few very special cases, no analytic form exists for the likelihood of such discretely observed data. For this reason, parametric inference is often achieved by using discrete-time approximations, with accuracy controlled through the introduction of missing data. We present a new multiresolution Bayesian framework to address the inference difficulty. The methodology relies on the use of multiple approximations and extrapolation, and is significantly faster and more accurate than known strategies based on Gibbs sampling. We apply the multiresolution approach to three data-driven inference problems – one in biophysics and two in finance – one of which features a multivariate diffusion model with an entirely unobserved component.
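A minimal sketch of the baseline that multiresolution inference refines: the Euler pseudo-likelihood for a discretely observed Ornstein-Uhlenbeck diffusion (a toy example; the paper's scheme additionally introduces missing data between observations and extrapolates across resolutions):

```python
# Sketch: Euler pseudo-likelihood MLE for dX = -theta X dt + sigma dB,
# observed every dt time units.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
theta_true, sigma_true, dt, n = 1.5, 0.8, 0.1, 2000
x = np.zeros(n)
for i in range(1, n):                # exact OU transitions for the fake data
    m = x[i - 1] * np.exp(-theta_true * dt)
    v = sigma_true**2 / (2 * theta_true) * (1 - np.exp(-2 * theta_true * dt))
    x[i] = m + np.sqrt(v) * rng.normal()

def neg_euler_loglik(par):
    theta, sigma = np.exp(par)       # log-parametrization enforces positivity
    # Euler: X_{t+dt} | X_t = x ~ N(x - theta*x*dt, sigma^2 * dt)
    resid = x[1:] - (x[:-1] - theta * x[:-1] * dt)
    return 0.5 * np.sum(resid**2 / (sigma**2 * dt)
                        + np.log(2 * np.pi * sigma**2 * dt))

fit = minimize(neg_euler_loglik, np.log([1.0, 1.0]))
print(np.exp(fit.x))   # near (1.5, 0.8), up to discretization bias
```

The discretization bias visible here at coarse dt is exactly what controlling accuracy via missing data, and then extrapolating across resolutions, is designed to remove.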
“Stochastic noise in atomic force microscopy.”
Labuda\(\eqaut\) A, Lysy\(\eqaut\) M, Paul W, Miyahara Y, Grütter P, Bennewitz R, Sutton M (2012). Physical Review E, 86(3), 031104. DOI: 10.1103/PhysRevE.86.031104, http://www.alekslabuda.com/sites/default/files/publications/[2012-09]%20Stochastic%20noise%20in%20AFM.pdf.
Having reached the quantum and thermodynamic limits of detection, atomic force microscopy (AFM) experiments are routinely being performed at the fundamental limit of signal to noise. A critical understanding of the statistical properties of noise leads to more accurate interpretation of data, optimization of experimental protocols, advancements in instrumentation, and new measurement techniques. Furthermore, accurate simulation of cantilever dynamics requires knowledge of stochastic behavior of the system, as stochastic noise may exceed the deterministic signals of interest, and even dominate the outcome of an experiment. In this article, the power spectral density (PSD), used to quantify stationary stochastic processes, is introduced in the context of a thorough noise analysis of the light source used to detect cantilever deflections. The statistical properties of PSDs are then outlined for various stationary, nonstationary, and deterministic noise sources in the context of AFM experiments. Following these developments, a method for integrating PSDs to provide an accurate standard deviation of linear measurements is described. Lastly, a method for simulating stochastic Gaussian noise from any arbitrary power spectral density is presented. The result demonstrates that mechanical vibrations of the AFM can cause a logarithmic velocity dependence of friction and induce multiple slip events in the atomic stick-slip process, as well as predicts an artifactual temperature dependence of friction measured by AFM.
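A minimal sketch of that last ingredient, simulating stationary Gaussian noise from a prescribed one-sided PSD via random Fourier phases (a standard construction under the stated assumptions, not necessarily the paper's exact procedure):

```python
# Sketch: Gaussian noise with a target one-sided PSD via the inverse FFT.
import numpy as np

def noise_from_psd(psd_func, n, fs, rng):
    """Return n samples at rate fs whose one-sided PSD follows psd_func(f)."""
    f = np.fft.rfftfreq(n, d=1.0 / fs)
    amp = np.sqrt(psd_func(f) * fs * n / 2.0)    # target |rfft| magnitudes
    z = rng.normal(size=f.size) + 1j * rng.normal(size=f.size)
    spec = amp * z / np.sqrt(2.0)                # random amplitude and phase
    spec[0] = 0.0                                # enforce zero mean
    return np.fft.irfft(spec, n)

rng = np.random.default_rng(0)
# Example: pink (1/f) noise above 1 Hz with a flat low-frequency shoulder.
x = noise_from_psd(lambda f: 1.0 / np.maximum(f, 1.0), n=2**16, fs=1e3, rng=rng)
print(x.std())   # approximately the square root of the integrated PSD
```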
“Stochastic simulation of tip-sample interactions in atomic force microscopy.”
Labuda\(\eqaut\) A, Lysy\(\eqaut\) M, Grütter P (2012). Applied Physics Letters, 101(11), 113105. DOI: 10.1063/1.4745781, http://www.alekslabuda.com/sites/default/files/publications/[2012-09]%20Stochastic%20simulation%20of%20tip-sample%20interactions%20in%20AFM.pdf.
Atomic force microscopy (AFM) simulators, which are used to gain insight into tip-sample physics and data interpretation, so far have been optimized for modeling deterministic cantilever dynamics. In this article, we demonstrate a method for semi-empirical simulation of the stochastic dynamics of tip-sample interactions. The detection, force, and displacement noises are separately generated directly from their numerically defined power spectral densities and used to simulate a force spectroscopy experiment in water at the mica interface. Mechanical noise of the AFM is shown to dominate over thermal noise of the cantilever upon interaction with the last two hydration layers.
“Shrinkage estimation in multilevel normal models.”
Morris CN, Lysy M (2012). Statistical Science, 27(1), 115–134. DOI: 10.1214/11-STS363, article.
This review traces the evolution of theory that started when Charles Stein in 1955 showed that using each separate sample mean from \(k \ge 3\) Normal populations to estimate its own population mean \(\mu_i\) can be improved upon uniformly for every possible \(\mu = (\mu_1,\ldots,\mu_k)'\). The dominating estimators, referred to here as being “Model-I minimax”, can be found by shrinking the sample means toward any constant vector. Admissible minimax shrinkage estimators were derived by Stein and others as posterior means based on a random effects model, “Model-II” here, wherein the \(\mu_i\) values have their own distributions. Section 2 centers on Figure 2, which organizes a wide class of priors on the unknown Level-II hyperparameters that have been proved to yield admissible Model-I minimax shrinkage estimators in the “equal variance case”. Putting a flat prior on the Level-II variance is unique in this class for its scale-invariance and for its conjugacy, and it induces Stein’s harmonic prior (SHP) on \(\mu_i\).
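A minimal sketch of the Model-I flavor of shrinkage, using the positive-part James-Stein estimator that shrinks toward the origin (the review's random-effects, Model-II machinery goes well beyond this):

```python
# Sketch: positive-part James-Stein shrinkage of k sample means toward zero,
# with known unit variances; dominates the raw means whenever k >= 3.
import numpy as np

rng = np.random.default_rng(0)
k = 10
mu = rng.normal(0, 0.5, size=k)        # true means
x = rng.normal(mu, 1.0)                 # one observation per population

shrink = max(0.0, 1 - (k - 2) / np.sum(x**2))
mu_js = shrink * x

# Squared-error losses: the shrunken estimate is usually smaller.
print(np.sum((x - mu)**2), np.sum((mu_js - mu)**2))
```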
“A heteroskedastic accelerated failure time model for survival analysis.”
Wang\(\trainee\) Y, You\(\trainee\) T, Lysy\(\coraut\) M (2019). University of Waterloo. article.
Nonparametric and semiparametric methods are commonly used in survival analysis to mitigate the bias due to model misspecification. However, such methods often cannot estimate upper-tail survival quantiles when a sizable proportion of the data are censored, in which case parametric likelihood-based estimators present a viable alternative. In this article, we extend a popular family of parametric survival models which make the Accelerated Failure Time (AFT) assumption to account for heteroscedasticity in the survival times, yielding a heteroskedastic AFT (HAFT) model. The conditional variances can depend on arbitrary covariates, thus adding considerable flexibility to the homoscedastic model. We present an Expectation-Conditional-Maximization (ECM) algorithm to efficiently compute the HAFT maximum likelihood estimator with right-censored data. The methodology is applied to the heavily censored data from a colon cancer clinical trial, for which a new type of highly stringent model residuals is proposed. Based on these, the HAFT model was found to eliminate most outliers from its homoscedastic counterpart.
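A minimal sketch of the homoscedastic starting point: maximum likelihood for a right-censored lognormal AFT model (the HAFT extension would let the scale depend on covariates; the ECM details are omitted here):

```python
# Sketch: right-censored lognormal AFT log-likelihood and its MLE.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
log_t = 1.0 + 0.5 * x + 0.7 * rng.normal(size=n)      # true AFT model
c = np.log(rng.exponential(15.0, size=n))             # censoring times
y, event = np.minimum(log_t, c), (log_t <= c)

def negloglik(par):
    b0, b1, log_s = par
    z = (y - b0 - b1 * x) / np.exp(log_s)
    # density contribution for events, survival for censored observations
    ll = np.where(event, norm.logpdf(z) - log_s, norm.logsf(z))
    return -np.sum(ll)

fit = minimize(negloglik, np.zeros(3))
print(fit.x)   # approximately (1.0, 0.5, log 0.7)
```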
“Evidence that self-similar microrheology of highly entangled polymeric hydrogels scales robustly with, and is tunable by, polymer concentration.”
Seim I, Cribb JA, Newby JM, Vasquez P, Lysy M, Hill DB, Forest MG (2017). University of North Carolina, Chapel Hill. article.
We report observations of a remarkable scaling behavior with respect to concentration in the passive microbead rheology of two highly entangled polymeric solutions, polyethylene oxide (PEO) and hyaluronic acid (HA). This behavior was reported previously [Hill et al., PLOS ONE (2014)] for human lung mucus, a complex biological hydrogel, motivating the current study for synthetic polymeric solutions PEO and HA. The strategy is to identify, and focus within, a wide range of lag times \(\tau\) for which passive micron diameter beads exhibit self-similar (fractional, power law) mean-squared-displacement (MSD) statistics. For lung mucus, PEO at three different molecular weights (Mw), and HA at one Mw, we find ensemble-averaged MSDs of the form \(\langle \Delta r^2(\tau)\rangle = 4D_\alpha \tau^\alpha\), all within a common band, [1/60 sec, 3 sec], of lag times \(\tau\). We employ the MSD power law parameters \((D_\alpha, \alpha)\) to classify each polymeric solution over a range of highly entangled concentrations. By the generalized Stokes-Einstein relation, power law MSD implies power law elastic \(G'(\omega)\) and viscous \(G''(\omega)\) moduli for frequencies \(1/\tau\), [0.33 sec\(^{-1}\), 60 sec\(^{-1}\)]. A natural question surrounds the polymeric properties that dictate \(D_\alpha\) and \(\alpha\), e.g. polymer concentration \(c\), Mw, and stiffness (persistence length). In [Hill et al., PLOS ONE (2014)], we showed the MSD exponent \(\alpha\) varies linearly, while the pre-factor \(D_\alpha\) varies exponentially, with concentration, i.e. the semi-log plot, \((\log(D_\alpha),\alpha)(c)\) of the classifier data is collinear. Here we show the same result for three distinct Mw PEO and HA at a single Mw. Future studies are required to explore the generality of these results for polymeric solutions, and to understand this scaling behavior with polymer concentration.
“No-Means Clustering: A Stochastic Variant of \(k\)-Means.”
Partovi Nia V, Lysy\(\coraut\) M, Mouret G (2017). Polytechnique Montréal. article.
Simple, intuitive, and scalable to large problems, \(k\)-means clustering is perhaps the most frequently-used technique for unsupervised learning. However, global optimization of the \(k\)-means objective function is challenging, as the clustering algorithm is highly sensitive to its initial value. Exploiting the connection between \(k\)-means and Bayesian clustering, we explore the benefits of stochastic optimization to address this issue. Our “no-means” algorithm has provably superior mixing time to a natural Gibbs sampler with auxiliary cluster centroids. Yet, it retains the same computational complexity as the original \(k\)-means approach. Comparisons on two benchmark datasets indicate that stochastic search usually produces more homogeneous clusters than the steepest descent algorithm employed by \(k\)-means, as the objective function has multiple modes which are not too far apart.
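A minimal sketch of the general idea, a generic stochastic variant of \(k\)-means that samples labels instead of taking argmins (illustrative only; the paper's no-means sampler and its mixing-time analysis are specific to its Bayesian clustering objective):

```python
# Sketch: k-means with stochastic label assignments, sampled with
# probability proportional to exp(-beta * squared distance).
import numpy as np

def stochastic_kmeans(X, k, beta=5.0, iters=50, rng=None):
    rng = rng or np.random.default_rng(0)
    z = rng.integers(k, size=X.shape[0])
    for _ in range(iters):
        # Recompute centroids; re-seed any empty cluster at a random point.
        mu = np.stack([X[z == j].mean(0) if np.any(z == j)
                       else X[rng.integers(X.shape[0])] for j in range(k)])
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        p = np.exp(-beta * (d2 - d2.min(1, keepdims=True)))
        p /= p.sum(1, keepdims=True)
        # Sample each label from its categorical distribution.
        z = (p.cumsum(1) > rng.random((X.shape[0], 1))).argmax(1)
    return z

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
print(np.bincount(stochastic_kmeans(X, 2)))   # two clusters of ~100 points
```

Raising beta recovers ordinary (greedy) \(k\)-means; moderate beta lets the sampler escape the shallow local modes that trap steepest descent.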
“Statistical inference for stochastic differential equations with memory.”
Lysy M, Pillai NS (2013). University of Waterloo. article.
In this paper we construct a framework for doing statistical inference for discretely observed stochastic differential equations (SDEs) where the driving noise has ‘memory’. Classical SDE models for inference assume the driving noise to be Brownian motion, or “white noise”, thus implying a Markov assumption. We focus on the case when the driving noise is a fractional Brownian motion, which is a common continuous-time modeling device for capturing long-range memory. Since the likelihood is intractable, we proceed via data augmentation, adapting a familiar discretization and missing data approach developed for the white noise case. In addition to the other SDE parameters, we take the Hurst index to be unknown and estimate it from the data. Posterior sampling is performed via a Hybrid Monte Carlo algorithm on both the parameters and the missing data simultaneously so as to improve mixing. We point out that, due to the long-range correlations of the driving noise, careful discretization of the underlying SDE is necessary for valid inference. Our approach can be adapted to other types of rough-path driving processes such as Gaussian “colored” noise. The methodology is used to estimate the evolution of the memory parameter in US short-term interest rates.