Elements of Applied Biostatistics
Preface
Why bother with linear models – aren’t t-tests and ANOVA good enough?
What is unusual about this book?
A Table of Models
including mapping between linear models and classical tests
1 Analyzing experimental data with a linear model
1.1 This text is about using linear models to estimate treatment effects and the uncertainty in our estimates. This raises the question: what is “an effect”?
Part I: Getting Started
2 Getting Started – R Projects and R Markdown
2.1 R vs R Studio
2.2 Download and install R and R Studio
2.3 Open R Studio and modify the workspace preferences
2.4 If you didn’t modify the workspace preferences from the previous section, go back and do it
2.5 R Markdown in a nutshell
2.6 Install R Markdown
2.7 Importing Packages
2.8 Create an R Studio Project for this textbook
2.9 Working on a project, in a nutshell
2.10 Create an R Markdown file for this Chapter
2.10.1 Modify the YAML header
2.10.2 Modify the “setup” chunk
2.10.3 Create a “fake-data” chunk
2.10.4 Create a “plot” chunk
2.10.5 Knit
Part III: R fundamentals
3 Data – Reading, Wrangling, and Writing
3.1 The way data should be
3.2 Importing – best practices
3.3 Learning from this chapter
3.4 Working in R
3.4.1 Importing data
3.5 Data wrangling
3.5.1 Reshaping data – Wide to long
3.5.2 Reshaping data – Transpose (turning the columns into rows)
3.5.3 Combining data
3.5.4 Subsetting data
3.5.5 Wrangling columns
3.5.6 Missing data
3.6 Saving data
3.7 Exercises
4 Plotting Models
4.1 Pretty good plots show the model and the data
4.1.1 Pretty good plot component 1: Modeled effects plot
4.1.2 Pretty good plot component 2: Modeled mean and CI plot
4.1.3 Combining Effects and Modeled mean and CI plots – an Effects and response plot
4.1.4 Some comments on plot components
4.2 Working in R
4.2.1 Source data
4.2.2 How to plot the model
4.2.3 Be sure ggplot_the_model is in your R folder
4.2.4 How to use the Plot the Model functions
4.2.5 How to generate a Response Plot using ggpubr
4.2.6 How to generate a Response Plot with a grid of treatments using ggplot2
4.2.7 How to generate an Effects Plot
4.2.8 How to combine the response and effects plots
4.2.9 How to add the interaction effect to response and effects plots
Part IV: Some Fundamentals of Statistical Modeling
5 Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
5.1 The sample standard deviation vs. the standard error of the mean
5.1.1 Sample standard deviation
5.1.2 Standard error of the mean
5.2 Using Google Sheets to generate fake data to explore the standard error
5.2.1 Steps
5.3 Using R to generate fake data to explore the standard error
5.3.1 Part I
5.3.2 Part II – means
5.3.3 Part III – how do SD and SE change as sample size (n) increases?
5.3.4 Part IV – Generating fake data with for-loops
5.4 Bootstrapped standard errors
5.4.1 An example of bootstrapped standard errors using vole data
5.5 Confidence Interval
5.5.1 Interpretation of a confidence interval
6 P-values
6.1 A p-value is the probability of sampling a value as or more extreme than the test statistic if sampling from a null distribution
6.2 Pump your intuition – Creating a null distribution
6.3 A null distribution of t-values – the t distribution
6.4 P-values from the perspective of permutation
6.5 Parametric vs. non-parametric statistics
6.6 Frequentist probability and the interpretation of p-values
6.6.1 Background
6.6.2 This book covers frequentist approaches to statistical modeling; when a probability arises, such as the p-value of a test statistic, it will be a frequentist probability
6.6.3 Two interpretations of the p-value
6.6.4 NHST
6.7 Some major misconceptions of the p-value
6.7.1 Misconception: p is the probability that the null is true and \(1-p\) is the probability that the alternative is true
6.7.2 Misconception: a p-value is repeatable
6.7.3 Misconception: 0.05 is the lifetime rate of false discoveries
6.7.4 Misconception: a low p-value indicates an important effect
6.7.5 Misconception: a low p-value indicates high model fit or high predictive capacity
6.8 What the p-value does not mean
6.9 Recommendations
6.9.1 Primary sources for recommendations
6.10 Problems
7 Errors in inference
7.1 Classical NHST concepts of wrong
7.1.1 Type I error
7.1.2 Power
7.2 A non-Neyman-Pearson concept of power
7.2.1 Estimation error
7.2.2 Coverage
7.2.3 Type S error
7.2.4 Type M error
Part V: Introduction to Linear Models
8 An introduction to linear models
8.1 Two specifications of a linear model
8.1.1 The “error draw” specification
8.1.2 The “conditional draw” specification
8.1.3 Comparing the error-draw and conditional-draw ways of specifying the linear model
8.1.4 ANOVA notation of a linear model
8.2 A linear model can be fit to data with continuous, discrete, or categorical \(X\) variables
8.2.1 Fitting linear models to experimental data in which the \(X\) variable is continuous or discrete
8.2.2 Fitting linear models to experimental data in which the \(X\) variable is categorical
8.3 Statistical models are used for prediction, explanation, and description
8.4 What do we call the \(X\) and \(Y\) variables?
8.5 Modeling strategy
8.6 Predictions from the model
8.7 Inference from the model
8.7.1 Assumptions for inference with a statistical model
8.7.2 Specific assumptions for inference with a linear model
8.8 “linear model”, “regression model”, or “statistical model”?
9 Linear models with a single, continuous X (“regression”)
9.1 A linear model with a single, continuous X is classical “regression”
9.1.1 Analysis of “green-down” data
9.1.2 Learning from the green-down example
9.1.3 Using a regression model for “explanation” – causal models
9.1.4 Using a regression model for prediction – prediction models
9.1.5 Using a regression model for creating a new response variable – comparing slopes of longitudinal data
9.1.6 Using a regression model for calibration
9.2 Working in R
9.2.1 Fitting the linear model
9.2.2 Getting to know the linear model: the summary function
9.2.3 Inference – the coefficient table
9.2.4 How good is our model? – Model checking
9.2.5 Plotting models with continuous X
9.2.6 Creating a table of predicted values and 95% prediction intervals
9.3 Hidden code
9.3.1 Import and plot of fig2c (ecosystem warming experimental) data
9.3.2 Import and plot of efig_3d (ecosystem warming observational) data
9.3.3 Import and plot of fig1f (methionine restriction) data
9.4 Try it
9.4.1 A prediction model from the literature
9.5 Intuition pumps
9.5.1 Correlation and \(R^2\)
10 Linear models with a single, categorical X (“t-tests” and “ANOVA”)
10.1 A linear model with a single, categorical X variable estimates the effects of the levels of X on the response
10.1.1 Example 1 (fig3d) – two treatment levels (“groups”)
10.1.2 Understanding the analysis with two treatment levels
10.1.3 Example 2 – three treatment levels (“groups”)
10.1.4 Understanding the analysis with three (or more) treatment levels
10.2 Working in R
10.2.1 Fit the model
10.2.2 Controlling the output in tables using the coefficient table as an example
10.2.3 Using the emmeans function
10.2.4 Using the contrast function
10.2.5 How to generate ANOVA tables
10.3 Hidden Code
10.3.1 Importing and wrangling the fig_3d data for example 1
10.3.2 Importing and wrangling the fig2a data for example 2
11 Model Checking
11.1 All statistical analyses should be followed by model checking
11.2 Linear model assumptions
11.2.1 A bit about IID
11.3 Diagnostic plots use the residuals from the model fit
11.3.1 Residuals
11.3.2 A Normal Q-Q plot is used to check for characteristic departures from Normality
11.3.3 Mapping Q-Q plot departures from Normality
11.3.4 Model checking homoskedasticity
11.4 Using R
11.5 Hidden Code
11.5.1 Normal Q-Q plots
12 Violations of independence, homogeneity, or Normality
12.1 Lack of independence
12.1.1 Example 1 (exp1b) – a paired t-test is a special case of a linear mixed model
12.1.2 Example 2 (diHOME exp2a) – A repeated measures ANOVA is a special case of a linear mixed model
12.2 Heterogeneity of variances
12.2.1 When groups of the focal test have >> variance
12.3 The conditional response isn’t Normal
12.3.1 Example 1 (fig6f) – Linear models for non-normal count data
12.3.2 My data aren’t normal, what is the best practice?
12.4 Hidden Code
12.4.1 Importing and wrangling the exp1b data
12.4.2 Importing and wrangling the exp2a data
12.4.3 Importing and wrangling the fig6f data
13 Issues in inference
13.1 Replicated experiments – include \(\texttt{Experiment}\) as a random factor (better than one-way ANOVA of means)
13.1.1 Multiple experiments Example 1 (wound healing Exp4d)
13.1.2 Models for combining replicated experiments
13.1.3 Understanding Model exp4d_m1
13.1.4 The univariate model is equivalent to a linear mixed model of the aggregated data (Model exp4d_m2)
13.1.5 A linear mixed model of the full data
13.1.6 Analysis of the experiment means has less precision and power
13.1.7 Don’t do this – a t-test/fixed-effect ANOVA of the full data
13.2 Comparing change from baseline (pre-post)
13.2.1 Pre-post example 1 (DPP4 fig4c)
13.2.2 What if the data in example 1 were from an experiment where the treatment was applied prior to the baseline measure?
13.2.3 Pre-post example 2 (XX males fig1c)
13.2.4 Regression to the mean
13.3 Longitudinal designs with more than one post-baseline measure
13.3.1 Area under the curve (AUC)
13.4 Normalization – the analysis of ratios
13.4.1 Kinds of ratios in experimental biology
13.4.2 Example 1 – The ratio is a density (number of something per area)
13.4.3 Example 2 – The ratio is normalizing for size differences
13.5 Don’t do this stuff
13.5.1 Normalize the response so that all control values are equal to 1
13.6 A difference in significance is not necessarily significant
13.7 Researcher degrees of freedom
13.8 Hidden code
13.8.1 Import exp4d vimentin cell count data (replicate experiments example)
13.8.2 Import Fig4c data
13.8.3 XX males fig1c
13.8.4 Generation of fake data to illustrate regression to the mean
13.8.5 Import fig3f
13.8.6 Import exp3b
13.8.7 Plot the model of exp3b (glm offset data)
Part VI: More than one \(X\) – Multivariable Models
14 Linear models with added covariates (“ANCOVA”)
14.1 Adding covariates can increase the precision of the effect of interest
14.2 Understanding a linear model with an added covariate – heart necrosis data
14.2.1 Fit the model
14.2.2 Plot the model
14.2.3 Interpretation of the model coefficients
14.2.4 Everything adds up
14.2.5 Interpretation of the estimated marginal means
14.2.6 Interpretation of the contrasts
14.2.7 Adding the covariate improves inference
14.3 Understanding interaction effects with covariates
14.3.1 Fit the model
14.3.2 Plot the model with interaction effect
14.3.3 Interpretation of the model coefficients
14.3.4 What is the effect of a treatment if interactions are modeled? – it depends
14.3.5 Which model do we use, \(\mathcal{M}_1\) or \(\mathcal{M}_2\)?
14.4 Understanding ANCOVA tables
14.5 Working in R
14.5.1 Importing the heart necrosis data
14.5.2 Fitting the model
14.5.3 Using the emmeans function
14.5.4 ANCOVA tables
14.5.5 Plotting the model
14.6 Best practices
14.6.1 Do not use a ratio of part:whole as a response variable – instead add the denominator as a covariate
14.6.2 Do not use change from baseline as a response variable – instead add the baseline measure as a covariate
14.6.3 Do not “test for balance” of baseline measures
14.7 Best practices 2: Use a covariate instead of normalizing a response
15 Linear models with two categorical \(X\) – Factorial linear models (“two-way ANOVA”)
15.1 A linear model with crossed factors estimates interaction effects
15.1.1 An interaction is a difference in simple effects
15.1.2 A linear model with crossed factors includes interaction effects
15.1.3 Factorial experiments are frequently analyzed as flattened linear models in the experimental biology literature
15.2 Example 1 – Estimation of a treatment effect relative to a control effect (“Something different”) (Experiment 2j glucose uptake data)
15.2.1 Understand the experimental design
15.2.2 Fit the linear model
15.2.3 Inference
15.2.4 Plot the model
15.3 Understanding the linear model with crossed factors 1
15.3.1 What the coefficients are
15.3.2 The interaction effect is something different
15.3.3 Why we want to compare the treatment effect to a control effect
15.3.4 The order of the factors in the model tells the same story differently
15.3.5 Power for the interaction effect is less than that for simple effects
15.3.6 Planned comparisons vs. post-hoc tests
15.4 Example 2 – Estimation of the effect of background condition on an effect (“it depends”) (Experiment 3e lesion area data)
15.4.1 Understand the experimental design
15.4.2 Fit the linear model
15.4.3 Check the model
15.4.4 Inference from the model
15.4.5 Plot the model
15.5 Understanding the linear model with crossed factors 2
15.5.1 Conditional and marginal means
15.5.2 Simple (conditional) effects
15.5.3 Marginal effects
15.5.4 The additive model
15.5.5 Reduce models for the right reason
15.5.6 The marginal means of an additive linear model with two factors can be weird
15.6 Example 3 – Estimation of synergy (“More than the sum of the parts”) (Experiment 1c JA data)
15.6.1 Examine the data
15.6.2 Fit the model
15.6.3 Model check
15.6.4 Inference from the model
15.6.5 Plot the model
15.6.6 Alternative plot
15.7 Understanding the linear model with crossed factors 3
15.7.1 Thinking about the coefficients of the linear model
15.8 Issues in inference
15.8.1 For pairwise contrasts, it doesn’t matter if you fit a factorial or a flattened linear model
15.8.2 For interaction contrasts, it doesn’t matter if you fit a factorial or a flattened linear model
15.8.3 Adjusting p-values for multiple tests
15.9 Two-way ANOVA
15.9.1 How to read a two-way ANOVA table
15.9.2 What do the main effects in an ANOVA table mean?
15.10 More issues in inference
15.10.1 Longitudinal experiments – include Time as a random factor (better than repeated measures ANOVA)
15.11 Working in R
15.11.1 Model formula
15.11.2 Using the emmeans function
15.11.3 Contrasts
15.11.4 Practice safe ANOVA
15.11.5 Better to avoid these
15.12 Hidden Code
15.12.1 Import exp2j (Example 1)
15.12.2 Import exp3e lesion area data (Example 2)
15.12.3 Import Exp1c JA data (Example 3)
Part VII: Expanding the Linear Model
16 Models with random factors – linear mixed models
16.1 Example 1 – A random intercepts and slopes explainer (demo1)
16.1.1 Batched measurements result in clustered residuals
16.1.2 Clustered residuals result in correlated error
16.1.3 In blocked designs, clustered residuals add a variance component that masks treatment effects
16.1.4 Linear mixed models are linear models with added random factors
16.1.5 What the random effects are
16.1.6 In a blocked design, a linear model with added random effects increases precision of treatment effects
16.1.7 The correlation among random intercepts and slopes
16.1.8 Clustered residuals create heterogeneity among treatments
16.1.9 Linear mixed models are flexible
16.1.10 A random intercept only model
16.1.11 A model including an interaction intercept
16.1.12 AIC and model selection – which model to report?
16.1.13 The specification of random effects matters
16.1.14 Mixed Effect and Repeated Measures ANOVA
16.1.15 Pseudoreplication
16.2 Example 2 – Experiments without subsampling replication (exp6g)
16.2.1 Understand the data
16.2.2 Model fit and inference
16.2.3 The model exp6g_m1 adds a random intercept but not a random slope
16.2.4 The fixed effect coefficients of model exp6g_m1
16.2.5 The random intercept coefficients of exp6g_m1
16.2.6 The random and residual variance and the intraclass correlation of model exp6g_m1
16.2.7 The linear mixed model exp6g_m1 increases precision of treatment effects, relative to a fixed effects model
16.2.8 Alternative models for exp6g
16.2.9 Paired t-tests and repeated measures ANOVA are special cases of linear mixed models
16.2.10 Classical (“univariate model”) repeated measures ANOVA of exp6g
16.2.11 “Multivariate model” repeated measures ANOVA
16.2.12 Linear mixed models vs. repeated measures ANOVA
16.2.13 Modeling \(\texttt{mouse_id}\) as a fixed effect
16.3 Example 3 – Factorial experiments and no subsampling replicates (exp5c)
16.3.1 Understand the data
16.3.2 Examine the data
16.3.3 Model fit and inference
16.3.4 Why we care about modeling batch in exp5c
16.3.5 The linear mixed model exp5c_m1 adds two random intercepts
16.3.6 The fixed effect coefficients of model exp5c_m1
16.3.7 The random effect coefficients of model exp5c_m1
16.3.8 Alternative models for exp5c
16.3.9 Classical (“univariate model”) repeated measures ANOVA
16.3.10 “Multivariate model” repeated measures ANOVA of exp5c
16.3.11 Modeling \(\texttt{donor}\) as a fixed effect
16.4 Example 4 – Experiments with subsampling replication (exp1g)
16.4.1 Understand the data
16.4.2 Examine the data
16.4.3 Fit the model
16.4.4 Inference from the model
16.4.5 Plot the model
16.4.6 Alternaplot the model
16.4.7 Understanding the alternative models
16.4.8 The VarCorr matrix of models exp1g_m1a and exp1g_m1b
16.4.9 The linear mixed model has more precision and power than the fixed effect model of batch means
16.4.10 Fixed effect models and pseudoreplication
16.4.11 Mixed-effect ANOVA
16.5 Working in R
16.5.1 Fitting linear mixed models
16.5.2 Plotting models fit to batched data
16.5.3 Repeated measures ANOVA (randomized complete block with no subsampling)
16.6 Hidden code
16.6.1 Import exp5c
16.6.2 Import exp1g
17 Linear models for longitudinal experiments – I. pre-post designs
17.1 Best practice models
17.2 Common alternatives that are not recommended
17.3 Advanced models
17.4 Understanding the alternative models
17.4.1 (M1) Linear model with the baseline measure as the covariate (ANCOVA model)
17.4.2 (M2) Linear model of the change score (change-score model)
17.4.3 (M3) Linear model of post-baseline values without the baseline as a covariate (post model)
17.4.4 (M4) Linear model with factorial fixed effects (fixed-effects model)
17.4.5 (M5) Repeated measures ANOVA
17.4.6 (M6) Linear mixed model
17.4.7 (M7) Linear model with correlated error
17.4.8 (M8) Constrained fixed effects model with correlated error (cLDA model)
17.4.9 Comparison table
17.5 Example 1 – a single post-baseline measure (pre-post design)
17.6 Working in R
17.7 Hidden code
17.7.1 Import and wrangle mouse sociability data
18 Linear models for counts, binary responses, skewed responses, and ratios – Generalized Linear Models
18.1 Introducing Generalized Linear Models using count data examples
18.1.1 The Generalized Linear Model (GLM)
18.1.2 Kinds of data that are modeled by a GLM
18.2 Example 1 – GLM models for count responses (“angiogenic sprouts” exp3a)
18.2.1 Understand the data
18.2.2 Model fit and inference
18.3 Understanding Example 1
18.3.1 Modeling strategy
18.3.2 Model checking fits to count data
18.3.3 Biological count data are rarely fit well by a Poisson GLM. Instead, fit a quasi-Poisson or negative binomial GLM
18.3.4 A GLM is a linear model on the link scale
18.3.5 Coefficients of a Generalized Linear Model with a log-link function are on the link scale
18.3.6 Modeled means in the emmeans table of a Generalized Linear Model can be on the link scale or response scale – Report the response scale
18.3.7 Some consequences of fitting a linear model to count data
18.4 Example 2 – Use a GLM with an offset instead of a ratio of some measurement per area (“dna damage” data exp3b)
18.4.1 exp3b (“dna damage”) data
18.4.2 Understand the data
18.4.3 Model fit and inference
18.5 Understanding Example 2
18.5.1 An offset is an added covariate with a coefficient fixed at 1
18.5.2 A count GLM with an offset models the area-normalized means
18.5.3 Compare an offset to an added covariate with an estimated coefficient
18.5.4 Issues with plotting
18.6 Example 3 – GLM models for binary responses
18.7 Working in R
18.7.1 Fitting GLMs to count data
18.7.2 Fitting a GLM to a continuous conditional response with right skew
18.7.3 Fitting a GLM to a binary (success or failure, presence or absence, survived or died) response
18.7.4 Fitting Generalized Linear Mixed Models
18.8 Model checking GLMs
18.9 Hidden code
18.9.1 Import Example 1 data (exp3a – “angiogenic sprouts”)
19 Linear models with heterogeneous variance
Appendix: An example set of analyses of experimental data with linear models
20 Background physiology to the experiments in Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
21 Analyses for Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
21.1 Setup
21.2 Data source
21.3 Control the color palette
21.4 Useful functions
21.5 Figure 2b – Effect of ASK1 deletion on growth (body weight)
21.5.1 Figure 2b – import
21.5.2 Figure 2b – exploratory plots
21.6 Figure 2c – Effect of ASK1 deletion on final body weight
21.6.1 Figure 2c – import
21.6.2 Figure 2c – check own computation of weight change vs. imported value
21.6.3 Figure 2c – exploratory plots
21.6.4 Figure 2c – fit the model: m1 (lm)
21.6.5 Figure 2c – check the model: m1
21.6.6 Figure 2c – fit the model: m2 (gamma glm)
21.6.7 Figure 2c – check the model: m2
21.6.8 Figure 2c – inference from the model
21.6.9 Figure 2c – plot the model
21.6.10 Figure 2c – report
21.7 Figure 2d – Effect of ASK1 KO on glucose tolerance (whole curve)
21.7.1 Figure 2d – import
21.7.2 Figure 2d – exploratory plots
21.7.3 Figure 2d – fit the model
21.7.4 Figure 2d – check the model
21.7.5 Figure 2d – inference
21.7.6 Figure 2d – plot the model
21.8 Figure 2e – Effect of ASK1 deletion on glucose tolerance (summary measure)
21.8.1 Figure 2e – massage the data
21.8.2 Figure 2e – exploratory plots
21.8.3 Figure 2e – fit the model
21.8.4 Figure 2e – check the model
21.8.5 Figure 2e – inference from the model
21.8.6 Figure 2e – plot the model
21.9 Figure 2f – Effect of ASK1 deletion on glucose infusion rate
21.9.1 Figure 2f – import
21.9.2 Figure 2f – exploratory plots
21.9.3 Figure 2f – fit the model
21.9.4 Figure 2f – check the model
21.9.5 Figure 2f – inference
21.9.6 Figure 2f – plot the model
21.10 Figure 2g – Effect of ASK1 deletion on tissue-specific glucose uptake
21.10.1 Figure 2g – import
21.10.2 Figure 2g – exploratory plots
21.10.3 Figure 2g – fit the model
21.10.4 Figure 2g – check the model
21.10.5 Figure 2g – inference
21.10.6 Figure 2g – plot the model
21.11 Figure 2h
21.12 Figure 2i – Effect of ASK1 deletion on liver TG
21.12.1 Figure 2i – fit the model
21.12.2 Figure 2i – check the model
21.12.3 Figure 2i – inference
21.12.4 Figure 2i – plot the model
21.12.5 Figure 2i – report the model
21.13 Figure 2j
22 Simulations – Count data (alternatives to a t-test)
22.1 Use data similar to Figure 6f from Example 1
22.2 Functions
22.3 Simulations
22.3.1 Type I, Pseudo-Normal distribution
22.3.2 Type I, neg binom, equal n
22.3.3 Type I, neg binom, equal n, small theta
22.3.4 Type I, neg binom, unequal n
22.3.5 Power, Pseudo-Normal distribution, equal n
22.3.6 Power, neg binom, equal n
22.3.7 Power, neg binom, small theta
22.3.8 Power, neg binom, unequal n
22.3.9 Power, neg binom, unequal n, unequal theta
22.3.10 Type I, neg binom, equal n, unequal theta
22.4 Save it, Read it
22.5 Analysis
Appendix 1: Getting Started with R
22.6 Get your computer ready
22.6.1 Start here
22.6.2 Install R
22.6.3 Install R Studio
22.6.4 Install R Markdown
22.6.5 (optional) Alternative LaTeX installations
22.7 Start learning R Studio
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R