Elements of Applied Biostatistics
Preface
0.1 Why bother with linear models – aren’t t-tests and ANOVA good enough?
0.2 What is unusual about this book?
A Table of Models including mapping between linear models and classical tests
Part I: Getting Started
1 Getting Started – R Projects and R Markdown
1.1 R vs R Studio
1.2 Download and install R and R Studio
1.3 Open R Studio and modify the workspace preference
1.4 If you didn’t modify the workspace preferences from the previous section, go back and do it
1.5 R Markdown in a nutshell
1.6 Install R Markdown
1.7 Importing Packages
1.8 Create an R Studio Project for this textbook
1.9 Working on a project, in a nutshell
1.10 Create an R Markdown file for this Chapter
1.10.1 Modify the yaml header
1.10.2 Modify the “setup” chunk
1.10.3 Create a “fake-data” chunk
1.10.4 Create a “plot” chunk
1.10.5 Knit
Part II: An introduction to the analysis of experimental data with a linear model
2 Analyzing experimental data with a linear model
2.1 This text is about using linear models to estimate treatment effects and the uncertainty in our estimates. This raises the question, what is “an effect?”
3 Background physiology to the experiments in Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
4 Analyses for Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
4.1 Setup
4.2 Data source
4.3 Control the color palette
4.4 Useful functions
4.5 Figure 2b – Effect of ASK1 deletion on growth (body weight)
4.5.1 Figure 2b – import
4.5.2 Figure 2b – exploratory plots
4.6 Figure 2c – Effect of ASK1 deletion on final body weight
4.6.1 Figure 2c – import
4.6.2 Figure 2c – check own computation of weight change vs. imported value
4.6.3 Figure 2c – exploratory plots
4.6.4 Figure 2c – fit the model: m1 (lm)
4.6.5 Figure 2c – check the model: m1
4.6.6 Figure 2c – fit the model: m2 (gamma glm)
4.6.7 Figure 2c – check the model: m2
4.6.8 Figure 2c – inference from the model
4.6.9 Figure 2c – plot the model
4.6.10 Figure 2c – report
4.7 Figure 2d – Effect of ASK1 KO on glucose tolerance (whole curve)
4.7.1 Figure 2d – import
4.7.2 Figure 2d – exploratory plots
4.7.3 Figure 2d – fit the model
4.7.4 Figure 2d – check the model
4.7.5 Figure 2d – inference
4.7.6 Figure 2d – plot the model
4.8 Figure 2e – Effect of ASK1 deletion on glucose tolerance (summary measure)
4.8.1 Figure 2e – massage the data
4.8.2 Figure 2e – exploratory plots
4.8.3 Figure 2e – fit the model
4.8.4 Figure 2e – check the model
4.8.5 Figure 2e – inference from the model
4.8.6 Figure 2e – plot the model
4.9 Figure 2f – Effect of ASK1 deletion on glucose infusion rate
4.9.1 Figure 2f – import
4.9.2 Figure 2f – exploratory plots
4.9.3 Figure 2f – fit the model
4.9.4 Figure 2f – check the model
4.9.5 Figure 2f – inference
4.9.6 Figure 2f – plot the model
4.10 Figure 2g – Effect of ASK1 deletion on tissue-specific glucose uptake
4.10.1 Figure 2g – import
4.10.2 Figure 2g – exploratory plots
4.10.3 Figure 2g – fit the model
4.10.4 Figure 2g – check the model
4.10.5 Figure 2g – inference
4.10.6 Figure 2g – plot the model
4.11 Figure 2h
4.12 Figure 2i – Effect of ASK1 deletion on liver TG
4.12.1 Figure 2i – fit the model
4.12.2 Figure 2i – check the model
4.12.3 Figure 2i – inference
4.12.4 Figure 2i – plot the model
4.12.5 Figure 2i – report the model
4.13 Figure 2j
Part III: R Fundamentals
5 Data – Reading, Wrangling, and Writing
5.1 Learning from this chapter
5.2 Working in R
5.2.1 Importing data
5.3 Data wrangling
5.3.1 Reshaping data – Wide to long
5.3.2 Reshaping data – Transpose (turning the columns into rows)
5.3.3 Combining data
5.3.4 Subsetting data
5.3.5 Wrangling columns
5.3.6 Missing data
5.4 Saving data
5.5 Exercises
6 Plotting Models
6.1 Pretty good plots show the model and the data
6.1.1 Pretty good plot component 1: Modeled effects plot
6.1.2 Pretty good plot component 2: Modeled mean and CI plot
6.1.3 Combining Effects and Modeled mean and CI plots – an Effects and Response plot
6.1.4 Some comments on plot components
6.2 Working in R
6.2.1 Source data
6.2.2 How to plot the model
6.2.3 Be sure ggplot_the_model is in your R folder
6.2.4 How to use the Plot the Model functions
6.2.5 How to generate a Response Plot using ggpubr
6.2.6 How to generate a Response Plot with a grid of treatments using ggplot2
6.2.7 How to generate an Effects Plot
6.2.8 How to combine the response and effects plots
6.2.9 How to add the interaction effect to response and effects plots
Part IV: Some Fundamentals of Statistical Modeling
7 Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
7.1 The sample standard deviation vs. the standard error of the mean
7.1.1 Sample standard deviation
7.1.2 Standard error of the mean
7.2 Using Google Sheets to generate fake data to explore the standard error
7.2.1 Steps
7.3 Using R to generate fake data to explore the standard error
7.3.1 Part I
7.3.2 Part II – means
7.3.3 Part III – how do SD and SE change as sample size (n) increases?
7.3.4 Part IV – Generating fake data with for-loops
7.4 Bootstrapped standard errors
7.4.1 An example of bootstrapped standard errors using vole data
7.5 Confidence Interval
7.5.1 Interpretation of a confidence interval
8 P-values
8.1 A p-value is the probability of sampling a value as or more extreme than the test statistic if sampling from a null distribution
8.2 Pump your intuition – Creating a null distribution
8.3 A null distribution of t-values – the t distribution
8.4 P-values from the perspective of permutation
8.5 Parametric vs. non-parametric statistics
8.6 Frequentist probability and the interpretation of p-values
8.6.1 Background
8.6.2 This book covers frequentist approaches to statistical modeling, and when a probability arises, such as the p-value of a test statistic, this will be a frequentist probability
8.6.3 Two interpretations of the p-value
8.6.4 NHST
8.7 Some major misconceptions of the p-value
8.7.1 Misconception: p is the probability that the null is true and \(1-p\) is the probability that the alternative is true
8.7.2 Misconception: a p-value is repeatable
8.7.3 Misconception: 0.05 is the lifetime rate of false discoveries
8.7.4 Misconception: a low p-value indicates an important effect
8.7.5 Misconception: a low p-value indicates high model fit or high predictive capacity
8.8 What the p-value does not mean
8.9 Recommendations
8.9.1 Primary sources for recommendations
8.10 Problems
9 Errors in inference
9.1 Classical NHST concepts of wrong
9.1.1 Type I error
9.1.2 Power
9.2 A non-Neyman-Pearson concept of power
9.2.1 Estimation error
9.2.2 Coverage
9.2.3 Type S error
9.2.4 Type M error
Part V: Introduction to Linear Models
10 An introduction to linear models
10.1 Two specifications of a linear model
10.1.1 The “error draw” specification
10.1.2 The “conditional draw” specification
10.1.3 Comparing the error-draw and conditional-draw ways of specifying the linear model
10.1.4 ANOVA notation of a linear model
10.2 A linear model can be fit to data with continuous, discrete, or categorical \(X\) variables
10.2.1 Fitting linear models to experimental data in which the \(X\) variable is continuous or discrete
10.2.2 Fitting linear models to experimental data in which the \(X\) variable is categorical
10.3 Statistical models are used for prediction, explanation, and description
10.4 What do we call the \(X\) and \(Y\) variables?
10.5 Modeling strategy
10.6 Predictions from the model
10.7 Inference from the model
10.7.1 Assumptions for inference with a statistical model
10.7.2 Specific assumptions for inference with a linear model
10.8 “linear model”, “regression model”, or “statistical model”?
11 Linear models with a single, continuous X (“regression”)
11.1 A linear model with a single, continuous X is classical “regression”
11.1.1 Analysis of “green-down” data
11.1.2 Learning from the green-down example
11.1.3 Using a regression model for “explanation” – causal models
11.1.4 Using a regression model for prediction – prediction models
11.1.5 Using a regression model for creating a new response variable – comparing slopes of longitudinal data
11.1.6 Using a regression model for calibration
11.2 Working in R
11.2.1 Fitting the linear model
11.2.2 Getting to know the linear model: the summary function
11.2.3 Inference – the coefficient table
11.2.4 How good is our model? – Model checking
11.2.5 Plotting models with continuous X
11.2.6 Creating a table of predicted values and 95% prediction intervals
11.3 Hidden code
11.3.1 Import and plot of fig2c (ecosystem warming experimental) data
11.3.2 Import and plot of efig_3d (ecosystem warming observational) data
11.3.3 Import and plot of fig1f (methionine restriction) data
11.4 Try it
11.4.1 A prediction model from the literature
11.5 Intuition pumps
11.5.1 Correlation and \(R^2\)
12 Linear models with a single, categorical X (“t-tests” and “ANOVA”)
12.1 A linear model with a single, categorical X variable estimates the effects of the levels of X on the response
12.1.1 Example 1 (fig3d) – two treatment levels (“groups”)
12.1.2 Understanding the analysis with two treatment levels
12.1.3 Example 2 – three treatment levels (“groups”)
12.1.4 Understanding the analysis with three (or more) treatment levels
12.2 Working in R
12.2.1 Fit the model
12.2.2 Controlling the output in tables using the coefficient table as an example
12.2.3 Using the emmeans function
12.2.4 Using the contrast function
12.2.5 How to generate ANOVA tables
12.3 Hidden code
12.3.1 Importing and wrangling the fig_3d data for example 1
12.3.2 Importing and wrangling the fig2a data for example 2
13 Model Checking
13.1 All statistical analyses should be followed by model checking
13.2 Linear model assumptions
13.2.1 A bit about IID
13.3 Diagnostic plots use the residuals from the model fit
13.3.1 Residuals
13.3.2 A Normal Q-Q plot is used to check for characteristic departures from Normality
13.3.3 Mapping Q-Q plot departures from Normality
13.3.4 Model checking homoskedasticity
13.4 Using R
13.5 Hidden code
13.5.1 Normal Q-Q plots
14 Violations of independence, homogeneity, or Normality
14.1 Lack of independence
14.1.1 Example 1 (exp1b) – a paired t-test is a special case of a linear mixed model
14.1.2 Example 2 (diHOME exp2a) – a repeated measures ANOVA is a special case of a linear mixed model
14.2 Heterogeneity of variances
14.2.1 When groups of the focal test have >> variance
14.3 The conditional response isn’t Normal
14.3.1 Example 1 (fig6f) – Linear models for non-normal count data
14.3.2 My data aren’t normal, what is the best practice?
14.4 Hidden code
14.4.1 Importing and wrangling the exp1b data
14.4.2 Importing and wrangling the exp2a data
14.4.3 Importing and wrangling the fig6f data
15 Issues in inference
15.1 Replicated experiments – include \(\texttt{Experiment}\) as a random factor (better than one-way ANOVA of means)
15.1.1 Multiple experiments Example 1 (wound healing Exp4d)
15.1.2 Models for combining replicated experiments
15.1.3 Understanding Model exp4d_m1
15.1.4 The univariate model is equivalent to a linear mixed model of the aggregated data (Model exp4d_m2)
15.1.5 A linear mixed model of the full data
15.1.6 Analysis of the experiment means has less precision and power
15.1.7 Don’t do this – a t-test/fixed-effect ANOVA of the full data
15.2 Comparing change from baseline (pre-post)
15.2.1 Pre-post example 1 (DPP4 fig4c)
15.2.2 What if the data in example 1 were from an experiment where the treatment was applied prior to the baseline measure?
15.2.3 Pre-post example 2 (XX males fig1c)
15.2.4 Regression to the mean
15.3 Longitudinal designs with more than one post-baseline measure
15.3.1 Area under the curve (AUC)
15.4 Normalization – the analysis of ratios
15.4.1 Kinds of ratios in experimental biology
15.4.2 Example 1 – The ratio is a density (number of something per area)
15.5 Don’t do this stuff
15.5.1 Normalize the response so that all control values are equal to 1
15.6 A difference in significance is not necessarily significant
15.7 Researcher degrees of freedom
15.8 Hidden code
15.8.1 Import exp4d vimentin cell count data (replicate experiments example)
15.8.2 Import fig4c data
15.8.3 XX males fig1c
15.8.4 Generation of fake data to illustrate regression to the mean
15.8.5 Import fig3f
15.8.6 Import exp3b
15.8.7 Plot the model of exp3b (glm offset data)
Part VI: More than one \(X\) – Multivariable Models
16 Linear models with added covariates (“ANCOVA”)
16.1 Adding covariates can increase the precision of the effect of interest
16.2 Understanding a linear model with an added covariate – heart necrosis data
16.2.1 Fit the model
16.2.2 Plot the model
16.2.3 Interpretation of the model coefficients
16.2.4 Everything adds up
16.2.5 Interpretation of the estimated marginal means
16.2.6 Interpretation of the contrasts
16.2.7 Adding the covariate improves inference
16.3 Understanding interaction effects with covariates
16.3.1 Fit the model
16.3.2 Plot the model with interaction effect
16.3.3 Interpretation of the model coefficients
16.3.4 What is the effect of a treatment, if interactions are modeled? – it depends
16.3.5 Which model do we use, \(\mathcal{M}_1\) or \(\mathcal{M}_2\)?
16.4 Understanding ANCOVA tables
16.5 Working in R
16.5.1 Importing the heart necrosis data
16.5.2 Fitting the model
16.5.3 Using the emmeans function
16.5.4 ANCOVA tables
16.5.5 Plotting the model
16.6 Best practices
16.6.1 Do not use a ratio of part:whole as a response variable – instead add the denominator as a covariate
16.6.2 Do not use change from baseline as a response variable – instead add the baseline measure as a covariate
16.6.3 Do not “test for balance” of baseline measures
16.7 Best practices 2: Use a covariate instead of normalizing a response
17 Linear models with two categorical \(X\) – Factorial linear models (“two-way ANOVA”)
17.1 A linear model with crossed factors estimates interaction effects
17.1.1 An interaction is a difference in simple effects
17.1.2 A linear model with crossed factors includes interaction effects
17.1.3 Factorial experiments are frequently analyzed as flattened linear models in the experimental biology literature
17.2 Example 1 – Estimation of a treatment effect relative to a control effect (“Something different”) (Experiment 2j glucose uptake data)
17.2.1 Understand the experimental design
17.2.2 Fit the linear model
17.2.3 Inference
17.2.4 Plot the model
17.3 Understanding the linear model with crossed factors 1
17.3.1 What the coefficients are
17.3.2 The interaction effect is something different
17.3.3 Why we want to compare the treatment effect to a control effect
17.3.4 The order of the factors in the model tells the same story differently
17.3.5 Power for the interaction effect is less than that for simple effects
17.3.6 Planned comparisons vs. post-hoc tests
17.4 Example 2: Estimation of the effect of background condition on an effect (“it depends”) (Experiment 3e lesion area data)
17.4.1 Understand the experimental design
17.4.2 Fit the linear model
17.4.3 Check the model
17.4.4 Inference from the model
17.4.5 Plot the model
17.5 Understanding the linear model with crossed factors 2
17.5.1 Conditional and marginal means
17.5.2 Simple (conditional) effects
17.5.3 Marginal effects
17.5.4 The additive model
17.5.5 Reduce models for the right reason
17.5.6 The marginal means of an additive linear model with two factors can be weird
17.6 Example 3: Estimation of synergy (“More than the sum of the parts”) (Experiment 1c JA data)
17.6.1 Examine the data
17.6.2 Fit the model
17.6.3 Model check
17.6.4 Inference from the model
17.6.5 Plot the model
17.6.6 Alternative plot
17.7 Understanding the linear model with crossed factors 3
17.7.1 Thinking about the coefficients of the linear model
17.8 Issues in inference
17.8.1 For pairwise contrasts, it doesn’t matter if you fit a factorial or a flattened linear model
17.8.2 For interaction contrasts, it doesn’t matter if you fit a factorial or a flattened linear model
17.8.3 Adjusting p-values for multiple tests
17.9 Two-way ANOVA
17.9.1 How to read a two-way ANOVA table
17.9.2 What do the main effects in an ANOVA table mean?
17.10 More issues in inference
17.10.1 Longitudinal experiments – include Time as a random factor (better than repeated measures ANOVA)
17.11 Working in R
17.11.1 Model formula
17.11.2 Using the emmeans function
17.11.3 Contrasts
17.11.4 Practice safe ANOVA
17.11.5 Better to avoid these
17.12 Hidden code
17.12.1 Import exp2j (Example 1)
17.12.2 Import exp3e lesion area data (Example 2)
17.12.3 Import exp1c JA data (Example 3)
Part VII: Expanding the Linear Model
18 Models with random factors – linear mixed models
18.1 Example 1 – A random intercepts and slopes explainer (demo1)
18.1.1 Batched measurements result in clustered residuals
18.1.2 Clustered residuals result in correlated error
18.1.3 In blocked designs, clustered residuals add a variance component that masks treatment effects
18.1.4 Linear mixed models are linear models with added random factors
18.1.5 What the random effects are
18.1.6 In a blocked design, a linear model with added random effects increases precision of treatment effects
18.1.7 The correlation among random intercepts and slopes
18.1.8 Clustered residuals create heterogeneity among treatments
18.1.9 Linear mixed models are flexible
18.1.10 A random intercept only model
18.1.11 A model including an interaction intercept
18.1.12 AIC and model selection – which model to report?
18.1.13 The specification of random effects matters
18.1.14 Mixed Effect and Repeated Measures ANOVA
18.1.15 Pseudoreplication
18.2 Example 2 – experiments without subsampling replication (exp6g)
18.2.1 Understand the data
18.2.2 Model fit and inference
18.2.3 The model exp6g_m1 adds a random intercept but not a random slope
18.2.4 The fixed effect coefficients of model exp6g_m1
18.2.5 The random intercept coefficients of exp6g_m1
18.2.6 The random and residual variance and the intraclass correlation of model exp6g_m1
18.2.7 The linear mixed model exp6g_m1 increases precision of treatment effects, relative to a fixed effects model
18.2.8 Alternative models for exp6g
18.2.9 Paired t-tests and repeated measures ANOVA are special cases of linear mixed models
18.2.10 Classical (“univariate model”) repeated measures ANOVA of exp6g
18.2.11 “Multivariate model” repeated measures ANOVA
18.2.12 Linear mixed models vs. repeated measures ANOVA
18.2.13 Modeling \(\texttt{mouse_id}\) as a fixed effect
18.3 Example 3 – Factorial experiments and no subsampling replicates (exp5c)
18.3.1 Understand the data
18.3.2 Examine the data
18.3.3 Model fit and inference
18.3.4 Why we care about modeling batch in exp5c
18.3.5 The linear mixed model exp5c_m1 adds two random intercepts
18.3.6 The fixed effect coefficients of model exp5c_m1
18.3.7 The random effect coefficients of model exp5c_m1
18.3.8 Alternative models for exp5c
18.3.9 Classical (“univariate model”) repeated measures ANOVA
18.3.10 “Multivariate model” repeated measures ANOVA of exp5c
18.3.11 Modeling \(\texttt{donor}\) as a fixed effect
18.4 Example 4 – Experiments with subsampling replication (exp1g)
18.4.1 Understand the data
18.4.2 Examine the data
18.4.3 Fit the model
18.4.4 Inference from the model
18.4.5 Plot the model
18.4.6 An alternative plot of the model
18.4.7 Understanding the alternative models
18.4.8 The VarCorr matrix of models exp1g_m1a and exp1g_m1b
18.4.9 The linear mixed model has more precision and power than the fixed effect model of batch means
18.4.10 Fixed effect models and pseudoreplication
18.4.11 Mixed-effect ANOVA
18.5 Working in R
18.5.1 Fitting linear mixed models
18.5.2 Plotting models fit to batched data
18.5.3 Repeated measures ANOVA (randomized complete block with no subsampling)
18.6 Hidden code
18.6.1 Import exp5c
18.6.2 Import exp1g
19 Linear models for longitudinal experiments – I. pre-post designs
19.1 Best practice models
19.2 Common alternatives that are not recommended
19.3 Advanced models
19.4 Understanding the alternative models
19.4.1 (M1) Linear model with the baseline measure as the covariate (ANCOVA model)
19.4.2 (M2) Linear model of the change score (change-score model)
19.4.3 (M3) Linear model of post-baseline values without the baseline as a covariate (post model)
19.4.4 (M4) Linear model with factorial fixed effects (fixed-effects model)
19.4.5 (M5) Repeated measures ANOVA
19.4.6 (M6) Linear mixed model
19.4.7 (M7) Linear model with correlated error
19.4.8 (M8) Constrained fixed effects model with correlated error (cLDA model)
19.4.9 Comparison table
19.5 Example 1 – a single post-baseline measure (pre-post design)
19.6 Working in R
19.7 Hidden code
19.7.1 Import and wrangle mouse sociability data
20 Linear models for counts, binary responses, skewed responses, and ratios – Generalized Linear Models
20.1 Introducing Generalized Linear Models using count data examples
20.1.1 The Generalized Linear Model (GLM)
20.1.2 Kinds of data that are modeled by a GLM
20.2 Example 1 – GLM models for count responses (“angiogenic sprouts” exp3a)
20.2.1 Understand the data
20.2.2 Model fit and inference
20.3 Understanding Example 1
20.3.1 Modeling strategy
20.3.2 Model checking fits to count data
20.3.3 Biological count data are rarely fit well by a Poisson GLM. Instead, fit a quasi-Poisson or negative binomial GLM.
20.3.4 A GLM is a linear model on the link scale
20.3.5 Coefficients of a Generalized Linear Model with a log-link function are on the link scale
20.3.6 Modeled means in the emmeans table of a Generalized Linear Model can be on the link scale or response scale – report the response scale
20.3.7 Some consequences of fitting a linear model to count data
20.4 Example 2 – Use a GLM with an offset instead of a ratio of some measurement per area (“DNA damage” data exp3b)
20.4.1 exp3b (“DNA damage”) data
20.4.2 Understand the data
20.4.3 Model fit and inference
20.5 Understanding Example 2
20.5.1 An offset is an added covariate with a coefficient fixed at 1
20.5.2 A count GLM with an offset models the area-normalized means
20.5.3 Compare an offset to an added covariate with an estimated coefficient
20.5.4 Issues with plotting
20.6 Example 3 – GLM models for binary responses
20.7 Working in R
20.7.1 Fitting GLMs to count data
20.7.2 Fitting a GLM to a continuous conditional response with right skew
20.7.3 Fitting a GLM to a binary (success or failure, presence or absence, survived or died) response
20.7.4 Fitting Generalized Linear Mixed Models
20.8 Model checking GLMs
20.9 Hidden code
20.9.1 Import Example 1 data (exp3a – “angiogenic sprouts”)
21 Linear models with heterogeneous variance
22 Simulations – Count data (alternatives to a t-test)
22.1 Use data similar to Figure 6f from Example 1
22.2 Functions
22.3 Simulations
22.3.1 Type I, Pseudo-Normal distribution
22.3.2 Type I, neg binom, equal n
22.3.3 Type I, neg binom, equal n, small theta
22.3.4 Type I, neg binom, unequal n
22.3.5 Power, Pseudo-Normal distribution, equal n
22.3.6 Power, neg binom, equal n
22.3.7 Power, neg binom, small theta
22.3.8 Power, neg binom, unequal n
22.3.9 Power, neg binom, unequal n, unequal theta
22.3.10 Type I, neg binom, equal n, unequal theta
22.4 Save it, Read it
22.5 Analysis
Appendix 1: Getting Started with R
22.6 Get your computer ready
22.6.1 Start here
22.6.2 Install R
22.6.3 Install R Studio
22.6.4 Install R Markdown
22.6.5 (optional) Alternative LaTeX installations
22.7 Start learning R Studio
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R