4 Module 4: Multiple Regression

Here are some code chunks that setup this chapter

# Here are the libraries I used
library(tidyverse) # standard
library(knitr) # need for a couple things to make knitted document to look nice
library(readr) # need to read in data
library(ggpubr) # allows for stat_cor in ggplots
library(ggfortify) # Needed for autoplot to work on lm()
library(gridExtra) # allows me to organize the graphs in a grid
library(car) # need for some regression stuff like vif
library(GGally)
library(Hmisc) # Needed for some visuals
library(rms) # needed for some data reduction tech.
library(pcaPP)
library(see)
library(performance)

# This changes the default theme of ggplot
old.theme <- theme_get()
theme_set(theme_bw())

4.1 SENIC Data

We will now begin the examining data from the Study on the Efficcacy of Nosocomial Infection Control (SENIC project). The general objective, and therefore project name, was to examine how effective infection surveillance and control programs were at reducing hopsital acquired (nosocomial) diseases. Data was obtained through the test Applied Linear Statistical Models 5th edition (Neter et al).

senic <- read_csv(here::here("datasets", 
                             'SENIC.csv'))

stayLength: Average length of stay of all patients in hospital (in days)
age: Average age of patients (in years)
infectionRisk: Average estimated probability of acquiring infection in hospital (in percent)
cultureRatio: Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
xrayRatio: Ratio of number of X-rays performed to number of patients , without signs or symptoms of pneumonia, times 100
beds: Average number of beds in hospital during study period
school Med school affiliation (Yes or No)
region: Geographic region (NE, NC, S, W)
patients: Average number of patients in hospital per day during study period
nurses: Average number of full-time equivalent registered and licensed practical nurses during study period (number full-time plus one half the number part time)
facilities: Percent of 35 potential facilities and services that are provided by the hospital

4.2 Infection Risk

There are a lot of potential variables that may relate to infectionRisk, but let’s just concentrate on stayLength and cultureRatio.

#isc ending indicates the variables used: 
#(i)nfectionRisk, (s)tayLength, (c)ultureRatio

senicisc <- select(senic, infectionRisk,
                   stayLength, cultureRatio)

4.2.1 Relation of `infectionRisk` with `stayLength` and `cultureRatio`

4.2.2 Model with `stayLength`

Here is the linear regression output for infectionRisk with stayLength.

infStayLm <- lm(infectionRisk ~ stayLength, senicisc)
summary(infStayLm)


Call:
lm(formula = infectionRisk ~ stayLength, data = senicisc)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7823 -0.7039  0.1281  0.6767  2.5859 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.74430    0.55386   1.344    0.182    
stayLength   0.37422    0.05632   6.645 1.18e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.139 on 111 degrees of freedom
Multiple R-squared:  0.2846,    Adjusted R-squared:  0.2781 
F-statistic: 44.15 on 1 and 111 DF,  p-value: 1.177e-09

Our regression equation for predicting infectionRisk is:

\[\widehat{risk} = 0.744 + 0.374\cdot stayLength\]

4.2.3 Model with `cultureRatio`

Here is the linear regression output for infectionRisk with cultureRatio.

infCuLm <- lm(infectionRisk ~ cultureRatio, senicisc)
summary(infCuLm)


Call:
lm(formula = infectionRisk ~ cultureRatio, data = senicisc)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6759 -0.7133  0.1593  0.7966  3.1860 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.19790    0.19377  16.504  < 2e-16 ***
cultureRatio  0.07326    0.01031   7.106 1.22e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.117 on 111 degrees of freedom
Multiple R-squared:  0.3127,    Adjusted R-squared:  0.3065 
F-statistic: 50.49 on 1 and 111 DF,  p-value: 1.218e-10

The regression equation is then:

\[\widehat{risk} = 3.198 + 0.073\cdot cultureRatio\]

4.3 Linear Regression with Two Variables

We can “easily” create a model that accounts for a linear relationship between two variables. If we have one response variable \(y\) and two predictor variables \(x_1\) and \(x_2\), then the model is:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]

Where we assume that we have error terms \(\epsilon \sim N(0,\sigma_\epsilon)\) which are independent. An additional assumption is that the two predictor variables, \(x_1\) and \(x_2\) are independent of each other.

We still have two parts to the model.

The linear relation/conditional mean \(\mu_{y|x} = \beta_0 + \beta_1 x_1 + \beta_2 x_2\)
The error \(\epsilon\)

4.4 Model for `infectionRisk` using two variables

The way to get the estimated regression equation is the same. We just add another variable into the lm() formula.

riskLm <- lm(infectionRisk ~ stayLength + cultureRatio, 
             data=senicisc)

summary(riskLm)


Call:
lm(formula = infectionRisk ~ stayLength + cultureRatio, data = senicisc)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1822 -0.7275  0.1040  0.6847  2.7143 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.805491   0.487756   1.651    0.102    
stayLength   0.275472   0.052465   5.251 7.46e-07 ***
cultureRatio 0.056451   0.009798   5.761 7.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 110 degrees of freedom
Multiple R-squared:  0.4504,    Adjusted R-squared:  0.4404 
F-statistic: 45.07 on 2 and 110 DF,  p-value: 5.04e-15

So our model for estimated risk is

\[\widehat{risk} = 0.805 + 0.275 \cdot stayLength + 0.056\cdot cultureRatio\]

4.4.1 Graphing the relationship

4.4.2 Interpreting the Coefficients

The interpretation, so now we have to figure out to interpret three different estimated coefficients: \(\widehat \beta_{0}, \widehat \beta_{1}, \widehat \beta_{2}\).

\(\widehat \beta_{0}\) is the estimated y-intercept. The idea behind the intercept coefficient is similar, except for now it is \(\hat \mu_{y|x}hat\), the estimated mean value of the \(y\) variable, when both \(x_1\) and \(x_2\) are 0.
\(\widehat \beta_{1}\) is the amount that \(\hat \mu_{y|x}\) increases when \(x_1\) increases by 1 unit AND \(x_2\) is constant.
\(\widehat \beta_{2}\) is the amount that \(\hat \mu_{y|x}\) increases when \(x_2\) increases by 1 unit AND \(x_1\) is constant.

So for our estimated regression model:

\[\widehat{risk} = 0.805 + 0.275 \cdot stayLength + 0.056\cdot cultureRatio\]

We estimate that the mean infection risk of hospitals is 0.805% among hospitals with an average length of stay if 0 days and average culture ratio of 0. (Does this make sense?)
We estimate that the mean infection risk of hospitals increases by 0.275% among hospitals with average length of stay 1 day longer, and average culture ratio saying constant.
We estimate that the mean infection risk of hospitals increases by 0.056% among hospitals where the average culture ratio rate is 1 unit higher, and average length of stay remains constant.

4.5 Inference on the regression coefficients

Just like before, there are several columns for each coefficient. The columns are for the following hypothesis test.

\[H_0: \beta_k = 0\] \[H_1: \beta_k \ne 0\]

The test statistic is:

\[t = \frac{\hat \beta_k}{SE_{\hat \beta_k}}\]

We simple divide our coefficient estimate by its standard error.

The test statistic is assumed to be from a \(t\) distribution with degrees of freedom \(df = n - (p+1)\). Where \(p\) is the total number of predictor variables. And the p-value is the probability of obtaining

When looking at our model summary we interpret the p-values similarly.

summary(riskLm)


Call:
lm(formula = infectionRisk ~ stayLength + cultureRatio, data = senicisc)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1822 -0.7275  0.1040  0.6847  2.7143 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.805491   0.487756   1.651    0.102    
stayLength   0.275472   0.052465   5.251 7.46e-07 ***
cultureRatio 0.056451   0.009798   5.761 7.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 110 degrees of freedom
Multiple R-squared:  0.4504,    Adjusted R-squared:  0.4404 
F-statistic: 45.07 on 2 and 110 DF,  p-value: 5.04e-15

There is weak evidence that the true linear model has a non-zero intercept.
There is extremely strong evidence that there is a linear component in the relationship between stayLength and infectionRisk (\(p<0.0001\)) when cultureRatio is included in the model.
There is extremely strong evidence that there is a linear component in the relationship between cultureRatio and infectionRisk (\(p<0.0001\)) when stayLength is included in the model.

The test for the intercept is not important (usually). The tests for the slopes are more important though not terribly so by themselves.

4.6 Understanding the Difference in Nominal Significance Error (DINS)

4.6.1 What is DINS?

The Difference in Nominal Significance Error (DINS) occurs when we incorrectly conclude that two regression coefficients differ significantly from each other based solely on their individual statistical significance tests. In simpler terms, we make this error when we observe:

Coefficient A is statistically significant (p < 0.05)
Coefficient B is not statistically significant (p > 0.05)
We then incorrectly conclude that A and B must be significantly different from each other

4.6.2 Why is this an error?

This reasoning is flawed because the difference between “significant” and “not significant” is not itself necessarily significant. Two coefficients can have different p-values (one below 0.05 and one above 0.05), yet the difference between the coefficients themselves might not be statistically significant.

4.6.3 How to properly compare coefficients

To correctly determine if two coefficients differ significantly:

Calculate the difference between the coefficients
Calculate the standard error of this difference
Perform a formal hypothesis test on this difference
Or construct a confidence interval for the difference

4.6.4 Example in multiple regression

Consider a regression model with two predictors X_1_ and X_2_:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon \]

If Beta1 = 0.5 with p = 0.03 (significant) and Beta2 = 0.3 with p = 0.08 (not significant), we cannot automatically conclude that Beta1 is significantly different from Beta2. We need to explicitly test whether Beta1- Beta2 differs significantly from zero.

4.6.5 Example of a test

Both significant… and significantly different… but does that matter?!?

We could normalize to the SD of the predictor… but does is that matter!?!

Let’s consider a model from earlier…

summary(riskLm)


Call:
lm(formula = infectionRisk ~ stayLength + cultureRatio, data = senicisc)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1822 -0.7275  0.1040  0.6847  2.7143 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.805491   0.487756   1.651    0.102    
stayLength   0.275472   0.052465   5.251 7.46e-07 ***
cultureRatio 0.056451   0.009798   5.761 7.70e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.003 on 110 degrees of freedom
Multiple R-squared:  0.4504,    Adjusted R-squared:  0.4404 
F-statistic: 45.07 on 2 and 110 DF,  p-value: 5.04e-15

library(marginaleffects)
# Coefficients
avg_comparisons(riskLm)


         Term Estimate Std. Error    z Pr(>|z|)    S  2.5 % 97.5 %
 cultureRatio   0.0565     0.0098 5.76   <0.001 26.8 0.0372 0.0757
 stayLength     0.2755     0.0525 5.25   <0.001 22.7 0.1726 0.3783

Type: response
Comparison: +1

# Compare Coefficients
avg_comparisons(riskLm,
                hypothesis = "cultureRatio = stayLength")


              Hypothesis Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
 cultureRatio=stayLength   -0.219     0.0564 -3.88   <0.001 13.2 -0.33 -0.108

Type: response

# Compare Standardized Coefficients using Marginal Effects
avg_comparisons(riskLm, 
                variables = list(cultureRatio = "sd", 
                                 stayLength = "sd"))


         Term Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
 cultureRatio    0.578        0.1 5.76   <0.001 26.8 0.381  0.774
 stayLength      0.527        0.1 5.25   <0.001 22.7 0.330  0.723

Type: response
Comparison: (x + sd/2) - (x - sd/2)

avg_comparisons(
  riskLm,
  variables = list(cultureRatio = "sd", stayLength = "sd"),
  hypothesis = "cultureRatio = stayLength"
)


              Hypothesis Estimate Std. Error     z Pr(>|z|)   S  2.5 % 97.5 %
 cultureRatio=stayLength   0.0512      0.163 0.313    0.754 0.4 -0.269  0.371

Type: response

4.6.6 Implications for interpretation

When interpreting multiple regression coefficients:

Avoid comparing significance levels directly
Use proper statistical tests for coefficient differences
Consider scale of the predictor and what your research question is!

Recognizing and avoiding the DINS error helps ensure your statistical conclusions are valid and well-supported by your data.

4.7 Categorical and Ordinal Variables in Regression Models

4.7.1 Introduction

In regression analysis, predictor variables can be:

Continuous: Numerical variables that can take any value within a range (e.g., age, income, temperature)
Categorical: Variables with distinct categories or groups (e.g., gender, color, treatment type)
Ordinal: Categorical variables with a natural ordering (e.g., education level, satisfaction rating)

4.7.2 Categorical Variables (Nominal)

4.7.2.1 Key Characteristics

No inherent ordering between categories
Examples: gender (male/female), color (red/blue/green), treatment (A/B/C)
Cannot be used directly in regression as numerical values

4.7.2.2 Dummy Variable Encoding

To include categorical variables in regression, we create dummy variables (also called indicator variables):

# Example: Color variable with 3 categories (red, blue, green)
data <- data.frame(
  y = c(10, 15, 12, 18, 14, 16),
  color = c("red", "blue", "green", 
            "red", "blue", "green")
)

# Create dummy variables manually
data$color_blue <- ifelse(data$color == "blue", 1, 0)
data$color_green <- ifelse(data$color == "green", 1, 0)
# Note: red is the reference category (omitted)

# Or use R's factor system (automatic dummy creation)
data$color <- factor(data$color)
model <- lm(y ~ color, data = data)
summary(model)


Call:
lm(formula = y ~ color, data = data)

Residuals:
   1    2    3    4    5    6 
-4.0  0.5 -2.0  4.0 -0.5  2.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   14.500      2.598   5.581   0.0114 *
colorgreen    -0.500      3.674  -0.136   0.9004  
colorred      -0.500      3.674  -0.136   0.9004  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.674 on 3 degrees of freedom
Multiple R-squared:  0.008163,  Adjusted R-squared:  -0.6531 
F-statistic: 0.01235 on 2 and 3 DF,  p-value: 0.9878

4.7.2.3 Reference Category

One category is always omitted as the reference category
Coefficients represent the difference from the reference category
R automatically chooses the first alphabetical category as reference

# Change reference category
data$color <- relevel(factor(data$color), ref = "blue")
model2 <- lm(y ~ color, data = data)

4.7.3 Ordinal Variables

4.7.3.1 Key Characteristics

Categories have a natural ordering
Examples: education (high school < bachelor’s < master’s < PhD), satisfaction (low < medium < high)
Can be treated as categorical OR utilize the ordering information

4.7.3.2 Approach 1: Treat as Categorical

# Create example dataset with ordinal variable
set.seed(123)
n <- 100

# Create ordinal education variable
education_levels <- c("High School", "Some College", "Bachelor", "Graduate")
data <- data.frame(
  salary = rnorm(n, 50000, 15000),
  education = sample(education_levels, n, replace = TRUE),
  experience = runif(n, 0, 20)
)

# Method 1: Treat as regular categorical (dummy variables)
data$education_cat <- factor(data$education)
model_cat <- lm(salary ~ education_cat + experience, data = data)
summary(model_cat)


Call:
lm(formula = salary ~ education_cat + experience, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-32390  -9723   -886   9345  32913 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                46414.8     3764.9  12.328   <2e-16 ***
education_catGraduate       6697.0     4127.6   1.622    0.108    
education_catHigh School    -824.6     3988.1  -0.207    0.837    
education_catSome College   1775.5     4164.3   0.426    0.671    
experience                   315.4      238.6   1.322    0.189    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13560 on 95 degrees of freedom
Multiple R-squared:  0.05917,   Adjusted R-squared:  0.01955 
F-statistic: 1.494 on 4 and 95 DF,  p-value: 0.2103

4.7.3.3 Approach 2: Ordered Factors

# Method 2: Ordered factor with polynomial contrasts
data$education_ord <- factor(data$education, 
                            levels = c("High School", "Some College", 
                                     "Bachelor", "Graduate"),
                            ordered = TRUE)

model_ord <- lm(salary ~ education_ord + experience, data = data)
summary(model_ord)


Call:
lm(formula = salary ~ education_ord + experience, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-32390  -9723   -886   9345  32913 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      48326.8     2680.9  18.026   <2e-16 ***
education_ord.L   4648.6     2627.4   1.769    0.080 .  
education_ord.Q   2048.5     2794.0   0.733    0.465    
education_ord.C   2872.9     2914.1   0.986    0.327    
experience         315.4      238.6   1.322    0.189    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13560 on 95 degrees of freedom
Multiple R-squared:  0.05917,   Adjusted R-squared:  0.01955 
F-statistic: 1.494 on 4 and 95 DF,  p-value: 0.2103

4.7.3.4 Approach 3: Set a different contrast coding

Proceed with caution

# Method 3: Manual contrast coding
contrasts(data$education_cat) = contr.helmert(4)
model_num <- lm(salary ~ education_cat + experience, data = data)
summary(model_num)


Call:
lm(formula = salary ~ education_cat + experience, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-32390  -9723   -886   9345  32913 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    48326.83    2680.91  18.026   <2e-16 ***
education_cat1  3348.52    2063.82   1.622    0.108    
education_cat2 -1391.04    1075.88  -1.293    0.199    
education_cat3   -45.49     802.99  -0.057    0.955    
experience       315.40     238.60   1.322    0.189    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13560 on 95 degrees of freedom
Multiple R-squared:  0.05917,   Adjusted R-squared:  0.01955 
F-statistic: 1.494 on 4 and 95 DF,  p-value: 0.2103

4.7.4 Key Considerations

4.7.4.1 Model Interpretation

Categorical coefficients: Difference in outcome between that category and reference category
Ordinal coefficients:
- Linear term: overall linear trend across ordered categories
- Quadratic term: curvature in the relationship
- Higher-order terms: more complex patterns

4.7.4.2 Choosing the Approach

Use categorical approach when ordering is not meaningful or you want maximum flexibility
Use ordinal approach when ordering is meaningful and you want to capture trends
Be cautious with custom contrasts unless you’re confident about your interpretation

4.7.5 Degrees of Freedom

Categorical variable with k categories uses k-1 degrees of freedom
Ordinal polynomial contrasts also use k-1 degrees of freedom but partition them into trend components

4.7.6 Summary

Categorical variables require dummy variable encoding for regression
Reference category interpretation is crucial
Ordinal variables offer multiple encoding options depending on research goals
Always examine your contrasts and ensure they align with your research questions
R’s factor system handles much of the encoding automatically, but understanding the underlying mechanics is essential for proper interpretation

4.8 Interaction Terms in Regression Models

4.8.1 What are Interaction Terms?

In regression analysis, an interaction term allows the effect of one predictor variable to depend on the value of another predictor variable. Without interactions, we assume that predictors have independent, additive effects on the outcome.

4.8.1.1 Mathematical Foundation

Additive Model (No Interaction): \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\)

Model with Interaction: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3(X_1 \times X_2) + \varepsilon\)

The interaction coefficient \(\beta_3\) tells us how much the effect of \(X_1\) on \(Y\) changes for each one-unit increase in \(X_2\).

4.8.2 The Parallel Slopes Concept

4.8.2.1 Without Interaction (Parallel Slopes)

When there’s no interaction between a continuous and categorical variable, the regression lines for different groups have the same slope but different intercepts - they are parallel.

# Example: No interaction between income and education level
model_additive <- lm(salary ~ income + education_level, data = df)

4.8.2.2 With Interaction (Non-Parallel Slopes)

When an interaction exists, the slopes differ between groups - the lines are not parallel.

# Example: Interaction between income and education level
model_interaction <- lm(salary ~ income * education_level, data = df)
# Note: income * education_level expands to:
# income + education_level + income:education_level

4.8.3 Types of Interactions

4.8.3.1 1. Continuous × Categorical Interaction

# Load example data
library(ggplot2)
data(mtcars)

# Create interaction model: mpg predicted by weight and transmission type
model <- lm(mpg ~ wt * am, data = mtcars)
summary(model)


Call:
lm(formula = mpg ~ wt * am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  31.4161     3.0201  10.402 4.00e-11 ***
wt           -3.7859     0.7856  -4.819 4.55e-05 ***
am           14.8784     4.2640   3.489  0.00162 ** 
wt:am        -5.2984     1.4447  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

# Visualize the interaction
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", 
       y = "Miles per Gallon",
       color = "Transmission\n(0=Auto, 1=Manual)") +
  theme_minimal()

Interpretation: The effect of weight on fuel efficiency differs between automatic and manual transmissions.

4.8.3.2 2. Continuous × Continuous Interaction

# Example with two continuous variables
model_cont <- lm(mpg ~ wt * hp, data = mtcars)
summary(model_cont)


Call:
lm(formula = mpg ~ wt * hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0632 -1.6491 -0.7362  1.4211  4.5513 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
wt          -8.21662    1.26971  -6.471 5.20e-07 ***
hp          -0.12010    0.02470  -4.863 4.04e-05 ***
wt:hp        0.02785    0.00742   3.753 0.000811 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.153 on 28 degrees of freedom
Multiple R-squared:  0.8848,    Adjusted R-squared:  0.8724 
F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13

# The interaction term wt:hp tells us how the effect of weight 
# changes as horsepower increases (or vice versa)

4.8.3.3 3. Categorical × Categorical Interaction

# Example with factors
model_cat <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)
summary(model_cat)


Call:
lm(formula = mpg ~ factor(cyl) * factor(am), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6750 -1.1000  0.1125  1.6875  5.8250 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                22.900      1.751  13.081 6.06e-13 ***
factor(cyl)6               -3.775      2.316  -1.630 0.115155    
factor(cyl)8               -7.850      1.957  -4.011 0.000455 ***
factor(am)1                 5.175      2.053   2.521 0.018176 *  
factor(cyl)6:factor(am)1   -3.733      3.095  -1.206 0.238553    
factor(cyl)8:factor(am)1   -4.825      3.095  -1.559 0.131069    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.032 on 26 degrees of freedom
Multiple R-squared:  0.7877,    Adjusted R-squared:  0.7469 
F-statistic: 19.29 on 5 and 26 DF,  p-value: 5.179e-08

4.8.4 Interpreting Interaction Coefficients

Consider the model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3(X_1 \times X_2)\)

4.8.4.1 For Continuous × Categorical:

\(\beta_1\): Effect of \(X_1\) when \(X_2 = 0\) (reference category)
\(\beta_2\): Difference in intercept between categories when \(X_1 = 0\)
\(\beta_3\): How much the slope of \(X_1\) differs between categories

4.8.4.2 Example Interpretation:

# If our model is: salary ~ experience * gender
# Where gender: 0 = male, 1 = female

# Coefficients might be:
# (Intercept): 40000  - Starting salary for males with 0 experience
# experience:   2000  - Salary increase per year experience for males
# gender:       5000  - Salary difference for females at 0 experience
# experience:gender: -500 - How much less the experience effect is for females

# This means:
# Males: salary = 40000 + 2000(experience)
# Females: salary = 45000 + 1500(experience)

4.9 Practical Example in R

# Complete example with interpretation
library(ggplot2)

# Simulate data
set.seed(123)
n <- 100
experience <- runif(n, 0, 20)
gender <- rbinom(n, 1, 0.5)  # 0 = male, 1 = female
salary <- 35000 + 2500*experience + 8000*gender - 400*experience*gender + rnorm(n, 0, 5000)

df <- data.frame(salary, experience, gender = factor(gender, labels = c("Male", "Female")))

# Fit interaction model
model <- lm(salary ~ experience * gender, data = df)
summary(model)


Call:
lm(formula = salary ~ experience * gender, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9034.0 -3041.0  -624.4  3056.7 16264.0 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              34936.2     1596.4  21.885  < 2e-16 ***
experience                2500.9      134.7  18.563  < 2e-16 ***
genderFemale              6819.7     2032.2   3.356  0.00113 ** 
experience:genderFemale   -376.9      176.2  -2.139  0.03494 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4884 on 96 degrees of freedom
Multiple R-squared:  0.8786,    Adjusted R-squared:  0.8748 
F-statistic: 231.6 on 3 and 96 DF,  p-value: < 2.2e-16

# Visualize
ggplot(df, aes(x = experience, y = salary, color = gender)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Salary by Experience and Gender (with Interaction)",
       x = "Years of Experience", 
       y = "Salary ($)") +
  theme_minimal()

# Extract slopes for each group
coef(model)["experience"]  # Slope for males (reference)

experience 
  2500.915

coef(model)["experience"] + coef(model)["experience:genderFemale"]  # Slope for females

experience 
  2124.041

4.9.1 Key Takeaways

Interactions allow effects to vary: The impact of one variable depends on another variable’s value
Parallel slopes assumption: No interaction means equal slopes across groups
Interpretation changes: With interactions, main effects represent conditional effects
Visualization helps: Plotting interaction effects aids interpretation and communication

4.9.2 Common Pitfalls

Centering continuous variables: Consider mean-centering for easier interpretation of main effects
Higher-order terms: Three-way interactions and beyond become very difficult to interpret
Sample size: Interactions require larger samples to detect reliably
Multiple comparisons: Be cautious when testing many potential interactions

4.9.3 Confidence Intervals for Coefficients

We can likewise get confidence intervals for the coefficients. They take the general form:

\[\hat \beta_i \pm t_{\alpha/2, df} \cdot SE_{\hat \beta_i}\]

Again the \(t_{\alpha/2}\) quantile depends on the value for \(\alpha\) of the desired nominal confidence level \(1 - \alpha\).

To get the confidence intervals we use the confint() function just before.

confint(riskLm, level = 0.99)

                  0.5 %     99.5 %
(Intercept)  -0.4730459 2.08402798
stayLength    0.1379482 0.41299604
cultureRatio  0.0307672 0.08213574

The estimated change in the mean of infectionRisk is:

From 0.138 to 0.413 at 99% confidence when increasing stayLength by 1 day. (And keeping cultureRatio constant.)
From 0.031 to 0.082 at 99% confidence when increasing cultureRatio by 1 (ratios don’t really have units). (Keeping stayLength constant.)

4.10 Estimating the Mean/Predicting Future Observations

Getting predictions is similar to before. You use the predict() function, and you need to specify what values you want predictions for.

You make a data.frame with values for each predictor variable.

You must specify all variables that are used as predictors.
Each variable must have the same number of points.
If you screw this up, you will get predictions for all of the original points in the data.

Lets say you want to predict/estimate infectionRisk for a hospital that have an average length of stay of 5, 10, and 15 days, and a average culture ratio of 5, 15, 25. But which way are we combining those?

newdata <- data.frame(stayLength = c(5, 10, 15),
                      cultureRatio = c(5, 15, 25))
predict(riskLm, newdata)

       1        2        3 
2.465109 4.406984 6.348859

Or what about?

newdata <- data.frame(stayLength = c(5, 10, 15), 
                      cultureRatio = c(25, 15, 5))
predict(riskLm, newdata)

       1        2        3 
3.594138 4.406984 5.219830

Or…?

newdata <- data.frame(stayLength = c(15, 5, 10), 
                      cultureRatio = c(15, 25, 5))
predict(riskLm, newdata)

       1        2        3 
5.784345 3.594138 3.842469

You have to know which exact combination of predictor variable values you want.

4.10.1 Confidence Intervals for the Mean and Prediction Intervals for Future Observations

This is the same as well.

Confidence intervals (CIs) are establishing a range of plausible values for the conditional mean: \(\mu_{y|x} = \beta_0 + \beta_1 x_1 + \widehat \beta_{2} x_2\)
Prediction intervals (PIs) establish a range of plausible values for individual observations: \(y = \beta_0 + \beta_1 x_1 + \widehat \beta_{2} x_2 + \epsilon\)

Let’s make 99% CIs and PIs.

newdata <- data.frame(stayLength = c(15, 5, 10), 
                      cultureRatio = c(15, 25, 5))
CIs <- predict(riskLm, newdata, 
               interval = "confidence", level = 0.99)
PIs <- predict(riskLm, newdata,
               interval = "prediction", level = 0.99)

CIs

       fit      lwr      upr
1 5.784345 5.001363 6.567327
2 3.594138 2.803874 4.384403
3 3.842469 3.456304 4.228635

PIs

       fit       lwr      upr
1 5.784345 3.0409109 8.527779
2 3.594138 0.8486173 6.339659
3 3.842469 1.1849346 6.500004

Interpretations are now relative to the value of both predictors variables.

CI Interpretation: There is 99% confidence that the mean infection risk of hospitals with average length of stay of 15 days and culture ratio of 15 is between 5.00% to 6.57%.

PI Interpretation: There is 99% confidence that the infection risk of an Individual hospital with average length of stay of 15 days and culture ratio of 15 is between 3.04% to 8.53%.

4.11 Residual Analysis

To assess validity of the model assumptions we can examine the residuals in the same manner as before. Create plots of the residuals using autoplot() and assess them in the same way as before.

autoplot(riskLm)

4.12 Adding more variables!

The linear model is not restricted to one or two variables. We can have as many predictor variables as we want! Let \(p\) denote the total number of predictor variables we use. The linear model is now:

\[y = \widehat \beta_0 + \widehat \beta_1 x_1 + \widehat \beta_{2} x_2 + ... + \widehat \beta_{p} + \epsilon = \beta_0 + \sum_{k=1}^p\beta_k x_k + \epsilon\]

4.12.1 `facilities` and `infectionRisk`?

Here is a graph between facilities and infectionRisk.

ggplot(senic, aes(x = facilities, y = infectionRisk)) +
  geom_point()

4.12.2 Adding `facilities` to the `infectionRisk` model.

riskierLm <- lm(infectionRisk ~ stayLength + cultureRatio + facilities, senic)
summary(riskierLm)


Call:
lm(formula = infectionRisk ~ stayLength + cultureRatio + facilities, 
    data = senic)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26400 -0.59873  0.01723  0.56650  2.64517 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.491332   0.481636   1.020  0.30992    
stayLength   0.223907   0.053366   4.196 5.56e-05 ***
cultureRatio 0.054200   0.009479   5.718 9.55e-08 ***
facilities   0.019630   0.006454   3.042  0.00295 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9674 on 109 degrees of freedom
Multiple R-squared:  0.4934,    Adjusted R-squared:  0.4795 
F-statistic: 35.39 on 3 and 109 DF,  p-value: 4.769e-16

4.12.3 Remember to always check your residuals!

autoplot(riskierLm)

4.12.4 The Model Analysis of Variance: Global F-Test

Assuming there are \(p\) predictor variables

\(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\)

\(H_1:\) At least one \(\beta_k\) is not \(0\).

To test this, we use that same breakdown of model variability as before:

\[SSTO = \sum_{i=1}^n(y_i - \bar y)^2\]

This variability in the variable can be broken up into two pieces:

\[SSR = \sum_{i=1}^n(\hat y_i - \bar y)^2 = \sum_{i=1}^n(\widehat \mu_{y|x} - \widehat\mu_y)^2\]

\[SSE = \sum_{i=1}^n(y_i - \widehat y_i)^2\]

The two sums of squares combined are the total variability \(SSTO\).

\[SSTO = SSR + SSE\] The Regression Degrees of Freedom is now \(p\), the error degrees of freedom is \(n - (p + 1)\).

This gives the following mean squares

Mean Square Regression: \(MSR = SSR/p\)
Mean Square Error: \(MSE = SSE/[n-(p+1)]\)

The test statistic is \(F_t = MSR/MSE\) and we use the \(F(p, n-(p +1))\) distribution to compute the p-value.

\[ F_t = \frac{SSR \div p}{MSE} = \frac{MSR}{MSE} \]

4.12.5 F-Test for `infectionRisk` model with 3 predictors

This test is available via the model summary in the very last row.

summary(riskierLm)


Call:
lm(formula = infectionRisk ~ stayLength + cultureRatio + facilities, 
    data = senic)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26400 -0.59873  0.01723  0.56650  2.64517 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.491332   0.481636   1.020  0.30992    
stayLength   0.223907   0.053366   4.196 5.56e-05 ***
cultureRatio 0.054200   0.009479   5.718 9.55e-08 ***
facilities   0.019630   0.006454   3.042  0.00295 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9674 on 109 degrees of freedom
Multiple R-squared:  0.4934,    Adjusted R-squared:  0.4795 
F-statistic: 35.39 on 3 and 109 DF,  p-value: 4.769e-16

Does this really matter? Ask yourself what the null hypothesis would and if it would matter.

4.12.6 Experiment: What happens to the `stayLength` slope?

Let’s look at the coefficients for the three models involving stayLength as a predictor.

By itself.

term	estimate	std.error	statistic	p.value
(Intercept)	0.7443037	0.5538573	1.343855	0.1817359
stayLength	0.3742169	0.0563195	6.644530	0.0000000

With cultureRatio.

coeff.summary(riskLm)

term	estimate	std.error	statistic	p.value
(Intercept)	0.8054910	0.4877558	1.651423	0.1015044
stayLength	0.2754721	0.0524647	5.250615	0.0000007
cultureRatio	0.0564515	0.0097984	5.761279	0.0000001

And finally adding in facilities.

coeff.summary(riskierLm)

term	estimate	std.error	statistic	p.value
(Intercept)	0.4913323	0.4816361	1.020132	0.3099249
stayLength	0.2239075	0.0533656	4.195726	0.0000556
cultureRatio	0.0542000	0.0094793	5.717698	0.0000001
facilities	0.0196303	0.0064539	3.041603	0.0029482

4.13 Transformations

A simple way to check the relationship between the variables would be to use the pairs() function. The pairs() function produces scatterplots between all variables you specify.

The basic syntax is pairs(formula, data) where formula is the formula of the regression model you are considering data is the dataset you are using.

pairs(infectionRisk ~ stayLength + cultureRatio + facilities, 
      senic)

This is a scatterplot matrix. Notice that the vertical axis determined by the variable named to the left or right of a plot, and horizontal axis is determined by the variable named above or below a plot.

4.13.1 Finding the right transformations

pairs(infectionRisk ~ log(stayLength) + 
        log(cultureRatio) + 
        facilities, senic)

4.13.2 Incorporating them into the model

friskyLm <- lm(infectionRisk ~ log(stayLength) + 
                 log(cultureRatio) + 
                 facilities, senic)
summary(friskyLm)


Call:
lm(formula = infectionRisk ~ log(stayLength) + log(cultureRatio) + 
    facilities, data = senic)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.31059 -0.63488  0.04808  0.54228  2.43225 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -4.255154   1.110040  -3.833 0.000212 ***
log(stayLength)    2.534644   0.540984   4.685 8.12e-06 ***
log(cultureRatio)  0.924657   0.134138   6.893 3.70e-10 ***
facilities         0.012736   0.006254   2.036 0.044147 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9138 on 109 degrees of freedom
Multiple R-squared:  0.548, Adjusted R-squared:  0.5355 
F-statistic: 44.05 on 3 and 109 DF,  p-value: < 2.2e-16

4.13.3 Residuals!

autoplot(friskyLm)

Then you have to reinterpret everything, and so on and so on.

4.13.4 Which log?

friskyLm2 <- lm(infectionRisk ~ log2(stayLength) + 
                  log2(cultureRatio) + 
                  facilities, senic)
summary(friskyLm2)


Call:
lm(formula = infectionRisk ~ log2(stayLength) + log2(cultureRatio) + 
    facilities, data = senic)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.31059 -0.63488  0.04808  0.54228  2.43225 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -4.255154   1.110040  -3.833 0.000212 ***
log2(stayLength)    1.756881   0.374982   4.685 8.12e-06 ***
log2(cultureRatio)  0.640923   0.092977   6.893 3.70e-10 ***
facilities          0.012736   0.006254   2.036 0.044147 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9138 on 109 degrees of freedom
Multiple R-squared:  0.548, Adjusted R-squared:  0.5355 
F-statistic: 44.05 on 3 and 109 DF,  p-value: < 2.2e-16

4.13.5 Can you spot the difference in residuals?

autoplot(friskyLm2)

autoplot(friskyLm)

4.14 Variable Selection, Data Reduction, and Model Comparison

4.15 Explainable statistical learning in public health for policy development: the case of real-world suicide data

We will work with data made available from this paper:

https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0796-7

If you want to go really in-depth of how you deal with data, this article goes into a lot of detail.

phe <- read_csv(here::here("datasets","phe.csv"))

4.15.1 Variables. A LOT!

Variables
2014 Suicide (age-standardised rate per 100,000 - outcome measure)
2013 Adult social care users who have as much social contact as they would like (% of adult social care users)
2013 Adults in treatment at specialist alcohol misuse services (rate per 1000 population)
2013 Adults in treatment at specialist drug misuse services (rate per 1000 population)
2013 Alcohol-related hospital admission (female) (directly standardised rate per 100,000 female population)
2013 Alcohol-related hospital admission (male) (directly standardised rate per 100,000 male population)
2013 Alcohol-related hospital admission (directly standardised rate per 100,000 population)
2013 Children in the youth justice system (rate per 1,000 aged 10–18)
2013 Children leaving care (rate per 10,000 < 18 population)
2013 Depression recorded prevalence (% of adults with a new diagnosis of depression who had a bio-psychosocial assessment)
2013 Domestic abuse incidents (rate per 1,000 population)
2013 Emergency hospital admissions for intentional self-harm (female) (directly age-standardised rate per 100,000 women)
2013 Emergency hospital admissions for intentional self-harm (male) (directly age-standardised rate per 100,000 men)
2013 Emergency hospital admissions for intentional self-harm (directly age-and-sex-standardised rate per 100,000)
2013 Looked after children (rate per 10,000 < 18 population)
2013 Self-reported well-being - high anxiety (% of people)
2013 Severe mental illness recorded prevalence (% of practice register [all ages])
2013 Social care mental health clients receiving services (rate per 100,000 population)
2013 Statutory homelessness (rate per 1000 households)
2013 Successful completion of alcohol treatment (% who do not represent within 6 months)
2013 Successful completion of drug treatment - non-opiate users (% who do not represent within 6 months)
2013 Successful completion of drug treatment - opiate users (% who do not represent within 6 months)
2013 Unemployment (% of working-age population)
2012 Adult carers who have as much social contact as they would like (18+ yrs) (% of 18+ carers)
2012 Adult carers who have as much social contact as they would like (all ages) (% of adult carers)
2011 Estimated prevalence of opiates and/or crack cocaine use (rate per 1,000 aged 15–64)
2011 Long-term health problems or disability (% of people whose day-to-day activities are limited by their health or disability)
2011 Marital breakup (% of adults whose current marital status is separated or divorced)
2011 Older people living alone (% of households occupied by a single person aged 65 or over)
2011 People living alone (% of all households occupied by a single person)
Mental Health Service users with crisis plans: % of people in contact with services with a crisis plan in place (end of quarter snapshot)
Older people
2011 Self-reported well-being - low happiness (% of people with a low happiness score)

4.16 How do we choose variables?

Objective: Find a way to predict suicide rates.

We could just use all of the predictors in a linear model.

fullLm <- lm(suicide_rate ~ ., phe)

summary(fullLm)


Call:
lm(formula = suicide_rate ~ ., data = phe)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0118 -1.0447 -0.2702  0.9903  4.1993 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.0155877  3.5059381   0.575    0.566
children_youth_justice         -0.0235629  0.0247292  -0.953    0.343
adult_carers_isolated_18        0.0336685  0.0262501   1.283    0.202
adult_carers_isolated_all_ages -0.0226404  0.0424220  -0.534    0.595
adult_carers_not_isolated       0.3786174  0.2379446   1.591    0.114
alcohol_rx_18                  -0.3001189  0.2515821  -1.193    0.235
alcohol_rx_all_ages            -0.0130574  0.0148093  -0.882    0.380
alcohol_admissions_f           -0.0060296  0.0127911  -0.471    0.638
alchol_admissions_m             0.0167345  0.0268020   0.624    0.534
alcohol_admissions_p           -0.0600782  0.0895445  -0.671    0.504
children_leaving_care           0.0216349  0.0288644   0.750    0.455
depression                     -0.1169805  0.1328876  -0.880    0.380
domestic_abuse                  0.0269278  0.0419236   0.642    0.522
self_harm_female                0.0369915  0.0946913   0.391    0.697
self_harm_male                  0.0318458  0.0944774   0.337    0.737
self_harm_persons              -0.0658812  0.1902674  -0.346    0.730
opiates                         0.2041520  0.1375203   1.485    0.140
lt_health_problems              0.0318830  0.1456081   0.219    0.827
lt_unepmloyment                -0.6342130  1.2588572  -0.504    0.615
looked_after_children           0.0174365  0.0146237   1.192    0.236
marital_breakup                 0.1606036  0.1723378   0.932    0.353
old_pople_alone                 0.3779967  0.4371635   0.865    0.389
alone                          -0.1004305  0.1386281  -0.724    0.470
self_reported_well_being        0.0605038  0.0681410   0.888    0.376
smi                             2.1708482  1.3401643   1.620    0.108
social_care_mh                  0.0006815  0.0005270   1.293    0.199
homeless                       -0.1550536  0.1051424  -1.475    0.143
alcohol_rx                      0.0170587  0.0282631   0.604    0.547
drug_rx_non_opiate             -0.0408008  0.0271119  -1.505    0.135
drug_rx_opiate                  0.1068925  0.0848400   1.260    0.210
unemployment                    0.0690206  0.2925910   0.236    0.814

Residual standard error: 1.687 on 118 degrees of freedom
Multiple R-squared:  0.5031,    Adjusted R-squared:  0.3767 
F-statistic: 3.982 on 30 and 118 DF,  p-value: 3.823e-08

autoplot(fullLm)

If we are going by statistical significance, each predictor variable fails but the global F-test says that a linear model is working (sort of).

And that \(R^2\) is pretty good for “prediciting” something related to human behavior.

Residuals look good.

Overall, this is bad!

You do not just use all the variables you have

Though there are exceptions: read Harrell’s Regression Model Strategies (Chapter 4, Section 12, 4.12.1 - 4.12.3)

4.16.1 The scope of the problem

How many variables are there to use…? How well do they work?

We could try cor(phe) and see which variables are most correlated with suicide_rate. (You’ll get a lot of output… that’s \(31\times 31 = 961\) correlations)

I’ll save you the pain and just show the correlations with just suicide_rate.

                               suicide_rate
children_youth_justice           0.05847330
adult_carers_isolated_18         0.24939518
adult_carers_isolated_all_ages   0.20369843
adult_carers_not_isolated        0.45228878
alcohol_rx_18                    0.38997284
alcohol_rx_all_ages              0.33117982
alcohol_admissions_f             0.29906808
alchol_admissions_m              0.31868168
alcohol_admissions_p             0.11433511
children_leaving_care            0.36700280
depression                       0.32509700
domestic_abuse                   0.14835220
self_harm_female                 0.49025136
self_harm_male                   0.52004798
self_harm_persons                0.51686092
opiates                          0.41195709
lt_health_problems               0.47825399
lt_unepmloyment                  0.19643927
looked_after_children            0.51243274
marital_breakup                  0.39971208
old_pople_alone                  0.39870200
alone                            0.31037882
self_reported_well_being         0.12949288
smi                              0.18390513
social_care_mh                   0.20125278
homeless                        -0.32105739
alcohol_rx                      -0.07513134
drug_rx_non_opiate              -0.14207504
drug_rx_opiate                  -0.14345307
unemployment                     0.18827138

4.16.2 Just use the best correlations?

These are a few of the strongest correlations.

                               suicide_rate

self_harm_female                 0.49025136
self_harm_male                   0.52004798
self_harm_persons                0.51686092
looked_after_children            0.51243274

Let’s make a model.

badIdea <- lm(suicide_rate ~ self_harm_female + self_harm_male + self_harm_persons + looked_after_children, phe)

summary(badIdea)


Call:
lm(formula = suicide_rate ~ self_harm_female + self_harm_male + 
    self_harm_persons + looked_after_children, data = phe)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1974 -1.2196  0.0677  1.1319  6.4354 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.521275   0.485989  13.419  < 2e-16 ***
self_harm_female       0.059297   0.087465   0.678 0.498889    
self_harm_male         0.053601   0.087682   0.611 0.541954    
self_harm_persons     -0.106747   0.175961  -0.607 0.545038    
looked_after_children  0.031342   0.008442   3.713 0.000292 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.763 on 144 degrees of freedom
Multiple R-squared:  0.3379,    Adjusted R-squared:  0.3195 
F-statistic: 18.37 on 4 and 144 DF,  p-value: 3.246e-12

What does the model have to say about how the self harm variables are related to suicide rate?

4.17 Multicollinearity

pairs(suicide_rate ~ self_harm_persons + self_harm_female + self_harm_male, phe)

cor(phe$self_harm_female,phe$self_harm_male)

[1] 0.8865727

cor(phe$self_harm_persons,phe$self_harm_male)

[1] 0.9601094

cor(phe$self_harm_persons,phe$self_harm_female)

[1] 0.9804772

When we have predictor variables that are linearly related to each other, we have what is called multi-collinearity.

These variables essentially all contribute the same information over and over to the model. This makes a feedback loop that kicks everything around.

What about a model with just self_harm_persons (Why that one?) and looked_after_children?

notAsBadLm <- lm(suicide_rate ~ self_harm_persons + looked_after_children, phe)

summary(notAsBadLm)


Call:
lm(formula = suicide_rate ~ self_harm_persons + looked_after_children, 
    data = phe)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3450 -1.2323  0.0677  1.0792  6.2763 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.704515   0.430054  15.590  < 2e-16 ***
self_harm_persons     0.008385   0.002124   3.947 0.000123 ***
looked_after_children 0.027803   0.007281   3.818 0.000198 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.756 on 146 degrees of freedom
Multiple R-squared:  0.3337,    Adjusted R-squared:  0.3246 
F-statistic: 36.56 on 2 and 146 DF,  p-value: 1.344e-13

4.17.1 But what about self harm and looking after children?

Relations may not be as blaringly obvious as that. What about self harm and looking after children?

ggplot(phe, aes(x = self_harm_persons, y = looked_after_children)) +
  geom_point() +
  stat_cor()

This isn’t nearly as bad and wouldn’t be considered a detrimental relation.

4.18 Measuring multi-collinearity

4.18.1 A linear model for the predictors

Linear relation between many predictor variables.

\[x_k =\beta_{0}^* + \beta_{1}^* x_1 + \beta_{2}^*x_2 + \dots + \beta_p^* x_p + \epsilon^*\]

This is like any model and has it’s own \(R^2\) \(R^2_k\).

If this \(R^2\) is high, it means that a predictor variable is redundant.

There are a few ways to measure “high”.

Some sources would say that \(0.8 \le R_k^2 \le 0.9\) would be problematic, and you have extreme multicollinearity issues for anything higher.

4.18.2 Tolerance

Tolerance is \(1 - R^2_k\).

That’s it.

So \(0.2\) is the “problematic” cutoff and \(0.9\) is the “extreme” cutoff.

It may be referred to as TOL sometimes.

4.18.3 Variance Inflation Factors

The Variance Inflation Factor is:

\[VIF_k = \frac{1}{1 - R^2_k}\]

The VIF is considered “problematic” if greater than 5, and “extreme” if greater than 10.

4.18.4 Getting TOL or VIF: `performance` package

A nifty package that will help us deal with all the nasty bits of linear regression in R is the performance package. It is part of the easystats universe of packages.

To get tolerance and VIFs, create the model using lm(), and then plug it into the check_collinearity function.

check_collinearity(model)

library(performance)
library(see)

# badidea was the model with all the self harm variables

check_collinearity(badIdea)

# Check for Multicollinearity

Low Correlation

                  Term  VIF          VIF 95% CI adj. VIF Tolerance
 looked_after_children 2.04 [   1.66,     2.63]     1.43      0.49
 Tolerance 95% CI
     [0.38, 0.60]

High Correlation

              Term      VIF          VIF 95% CI adj. VIF Tolerance
  self_harm_female  3811.50 [2809.27,  5171.42]    61.74  2.62e-04
    self_harm_male  1863.65 [1373.68,  2528.51]    43.17  5.37e-04
 self_harm_persons 10400.43 [7665.39, 14111.48]   101.98  9.61e-05
 Tolerance 95% CI
     [0.00, 0.00]
     [0.00, 0.00]
     [0.00, 0.00]

Versus

check_collinearity(notAsBadLm)

# Check for Multicollinearity

Low Correlation

                  Term  VIF   VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
     self_harm_persons 1.53 [1.29, 1.96]     1.24      0.65     [0.51, 0.78]
 looked_after_children 1.53 [1.29, 1.96]     1.24      0.65     [0.51, 0.78]

4.18.5 Plot `check_collinearity` checks

plot(check_collinearity(badIdea))

Versus

plot(check_collinearity(notAsBadLm))

4.18.6 VIF and Tolerance in the full model

You do not get rid of a variable because it has a small tolerance/large VIF!
You would investigate which variables are most correlated with eachother.
You need to use your own reasoning to decide which variables to remove based on the correlations between predictors AND the response.

Tolerance and VIF only indicate that the overall model has a problem, not the individual variable.

check_collinearity(fullLm)

# Check for Multicollinearity

Low Correlation

                           Term  VIF           VIF 95% CI adj. VIF Tolerance
         children_youth_justice 1.48 [    1.29,     1.80]     1.22      0.68
       adult_carers_isolated_18 1.88 [    1.59,     2.30]     1.37      0.53
 adult_carers_isolated_all_ages 1.84 [    1.57,     2.26]     1.36      0.54
      adult_carers_not_isolated 3.11 [    2.55,     3.88]     1.76      0.32
           alcohol_admissions_p 2.05 [    1.73,     2.52]     1.43      0.49
          children_leaving_care 4.32 [    3.49,     5.42]     2.08      0.23
                     depression 2.73 [    2.25,     3.38]     1.65      0.37
                 domestic_abuse 1.90 [    1.61,     2.33]     1.38      0.53
                marital_breakup 2.36 [    1.97,     2.91]     1.54      0.42
                          alone 4.82 [    3.88,     6.06]     2.19      0.21
       self_reported_well_being 1.42 [    1.24,     1.72]     1.19      0.71
                            smi 3.62 [    2.94,     4.52]     1.90      0.28
                 social_care_mh 1.29 [    1.15,     1.58]     1.14      0.77
                       homeless 3.11 [    2.55,     3.87]     1.76      0.32
                     alcohol_rx 3.27 [    2.68,     4.08]     1.81      0.31
             drug_rx_non_opiate 3.21 [    2.63,     4.00]     1.79      0.31
                 drug_rx_opiate 1.99 [    1.68,     2.44]     1.41      0.50
 Tolerance 95% CI
     [0.56, 0.78]
     [0.43, 0.63]
     [0.44, 0.64]
     [0.26, 0.39]
     [0.40, 0.58]
     [0.18, 0.29]
     [0.30, 0.44]
     [0.43, 0.62]
     [0.34, 0.51]
     [0.16, 0.26]
     [0.58, 0.81]
     [0.22, 0.34]
     [0.63, 0.87]
     [0.26, 0.39]
     [0.24, 0.37]
     [0.25, 0.38]
     [0.41, 0.60]

Moderate Correlation

                  Term  VIF           VIF 95% CI adj. VIF Tolerance
       lt_unepmloyment 5.30 [    4.26,     6.68]     2.30      0.19
 looked_after_children 6.68 [    5.33,     8.45]     2.58      0.15
          unemployment 8.68 [    6.89,    11.02]     2.95      0.12
 Tolerance 95% CI
     [0.15, 0.23]
     [0.12, 0.19]
     [0.09, 0.15]

High Correlation

                 Term      VIF           VIF 95% CI adj. VIF Tolerance
        alcohol_rx_18    18.48 [   14.52,    23.59]     4.30      0.05
  alcohol_rx_all_ages   351.34 [  273.84,   450.86]    18.74  2.85e-03
 alcohol_admissions_f   978.20 [  762.20,  1255.50]    31.28  1.02e-03
  alchol_admissions_m  2345.59 [ 1827.47,  3010.68]    48.43  4.26e-04
     self_harm_female  4877.78 [ 3800.20,  6261.00]    69.84  2.05e-04
       self_harm_male  2362.51 [ 1840.66,  3032.40]    48.61  4.23e-04
    self_harm_persons 13277.66 [10344.20, 17043.10]   115.23  7.53e-05
              opiates    12.46 [    9.83,    15.87]     3.53      0.08
   lt_health_problems    11.71 [    9.25,    14.91]     3.42      0.09
      old_pople_alone    11.18 [    8.84,    14.23]     3.34      0.09
 Tolerance 95% CI
     [0.04, 0.07]
     [0.00, 0.00]
     [0.00, 0.00]
     [0.00, 0.00]
     [0.00, 0.00]
     [0.00, 0.00]
     [0.00, 0.00]
     [0.06, 0.10]
     [0.07, 0.11]
     [0.07, 0.11]

You have to jump through a lot of hoops and then make justifiable choices.

4.18.7 Detecting Multicollinearity

The following our indicators of multicollinearity:

“Significant” correlations between pairs of independent variables.
Non-significant t-tests for all or nearly all of the predictor variables, but a significant global F-test.
The coefficients have opposite signs than what would be expected (by logic or compared to the individual correlation with the response variable).
A single VIF of 10, more than one VIF of 5 or above. Several VIF 3 or over. Or corresponding tolerance values. Or corresponding \(R^2_k\) values

The cutoffs for VIF, tolerance, or \(R^2_k\) are guidelines not solid rules. Theory should always trump these more arbitrary statistical rules.

4.19 Variable screening the PHE data

The paper has various methods for variable selection to remove multicollinearity issues.

They don’t really document it that well… Here’s my best take.

# I created this dataset after going through things over and over
# I removed variables from the csv as I was working with it in R
# Lots of fun!

# also, I only use this dataset for demonstration so I don't have to type out
# all of the variables I left in the data into the formula. 

pheRed <- read_csv(here::here("datasets",
                              'phe_reduced.csv'))

redModel <- lm(suicide_rate ~ ., pheRed)

check_collinearity(redModel)

# Check for Multicollinearity

Low Correlation

                           Term  VIF   VIF 95% CI adj. VIF Tolerance
         children_youth_justice 1.13 [1.04, 1.51]     1.06      0.88
 adult_carers_isolated_all_ages 1.48 [1.27, 1.85]     1.22      0.68
           alcohol_admissions_p 1.78 [1.49, 2.23]     1.33      0.56
          children_leaving_care 2.58 [2.09, 3.30]     1.61      0.39
                     depression 2.21 [1.81, 2.80]     1.49      0.45
                 domestic_abuse 1.42 [1.23, 1.78]     1.19      0.70
              self_harm_persons 2.35 [1.92, 2.99]     1.53      0.43
                        opiates 2.14 [1.76, 2.70]     1.46      0.47
                marital_breakup 1.83 [1.53, 2.30]     1.35      0.55
                          alone 1.49 [1.28, 1.87]     1.22      0.67
       self_reported_well_being 1.24 [1.10, 1.57]     1.11      0.81
                 social_care_mh 1.17 [1.05, 1.51]     1.08      0.86
                       homeless 2.25 [1.84, 2.86]     1.50      0.44
                     alcohol_rx 1.43 [1.24, 1.80]     1.20      0.70
                 drug_rx_opiate 1.62 [1.37, 2.03]     1.27      0.62
 Tolerance 95% CI
     [0.66, 0.97]
     [0.54, 0.79]
     [0.45, 0.67]
     [0.30, 0.48]
     [0.36, 0.55]
     [0.56, 0.81]
     [0.33, 0.52]
     [0.37, 0.57]
     [0.43, 0.65]
     [0.54, 0.78]
     [0.64, 0.91]
     [0.66, 0.95]
     [0.35, 0.54]
     [0.56, 0.81]
     [0.49, 0.73]

Some variables I removed because I could not find the correct definition.
I am not an expert on this stuff so I had to do my best based on my mediocre knowledge of the subject matter, and trying to figure out what the authors of the paper were saying.
When it is okay to stop removing variables is entirely dependent on the situation. I don’t even have a guideline for it.

4.20 Next step: Choosing an actual model

Supposing the multicollinearity issue has been solved, how do we move on?

Do we use all the variables in the model?

summary(redModel)


Call:
lm(formula = suicide_rate ~ ., data = pheRed)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8689 -1.1154 -0.0917  0.8901  4.2030 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)  
(Intercept)                     4.0832859  3.0285950   1.348   0.1799  
children_youth_justice         -0.0136101  0.0220822  -0.616   0.5387  
adult_carers_isolated_all_ages  0.0000509  0.0387482   0.001   0.9990  
alcohol_admissions_p           -0.0623441  0.0849678  -0.734   0.4644  
children_leaving_care           0.0422261  0.0227752   1.854   0.0659 .
depression                     -0.1622739  0.1220073  -1.330   0.1858  
domestic_abuse                  0.0247253  0.0369822   0.669   0.5049  
self_harm_persons               0.0066734  0.0025813   2.585   0.0108 *
opiates                         0.1010386  0.0580642   1.740   0.0842 .
marital_breakup                 0.2937180  0.1548091   1.897   0.0600 .
alone                           0.0336382  0.0786620   0.428   0.6696  
self_reported_well_being        0.0345748  0.0650050   0.532   0.5957  
social_care_mh                  0.0007603  0.0005101   1.490   0.1385  
homeless                       -0.2377671  0.0912485  -2.606   0.0102 *
alcohol_rx                     -0.0047326  0.0190770  -0.248   0.8045  
drug_rx_opiate                  0.0305590  0.0779842   0.392   0.6958  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.72 on 133 degrees of freedom
Multiple R-squared:  0.4176,    Adjusted R-squared:  0.3519 
F-statistic: 6.358 on 15 and 133 DF,  p-value: 5.163e-10

Well now at least some of the predictor’s are significant.
We should not just use all predictors since that ruins the generalizibility of the model.

4.20.0.1 The principle of parsimony: Models should have as few variables/parameters as possible while retaining adequacy.

4.21 Methods for model assessment

We have many variables to choose from, and it’s really easy to make a model.

Should we just pick the variables most highly correlated with suicide rate?
What is our cutoff then?

                               suicide_rate
children_youth_justice           0.05847330
adult_carers_isolated_all_ages   0.20369843
alcohol_admissions_p             0.11433511
children_leaving_care            0.36700280
depression                       0.32509700
domestic_abuse                   0.14835220
self_harm_persons                0.51686092
opiates                          0.41195709
marital_breakup                  0.39971208
alone                            0.31037882
self_reported_well_being         0.12949288
social_care_mh                   0.20125278
homeless                        -0.32105739
alcohol_rx                      -0.07513134
drug_rx_opiate                  -0.14345307

cor1 <- lm(suicide_rate ~ self_harm_persons, phe)
cor2 <- lm(suicide_rate ~ self_harm_persons + opiates, phe)
cor3 <- lm(suicide_rate ~ self_harm_persons + opiates + marital_breakup, phe)
cor4 <- lm(suicide_rate ~ self_harm_persons + opiates + marital_breakup + 
             children_leaving_care, phe)
cor5 <- lm(suicide_rate ~ self_harm_persons + opiates + marital_breakup + 
             children_leaving_care + depression, phe)
cor6 <- lm(suicide_rate ~ self_harm_persons + opiates + marital_breakup + 
             children_leaving_care + depression + homeless, phe)

Is the last model the best?
Should we add more variables?
Different variables?

4.21.1 Just “significant” variables?

Lets use any of the variables that had a p-value below 0.1.

# Changing model name method since now we will be looking at several models

sig1 <- lm(suicide_rate ~ children_leaving_care + self_harm_persons + opiates + marital_breakup + homeless, phe)

summary(sig1)


Call:
lm(formula = suicide_rate ~ children_leaving_care + self_harm_persons + 
    opiates + marital_breakup + homeless, data = phe)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8194 -1.1301 -0.1541  0.9383  5.6420 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)            4.855937   1.449776   3.349  0.00104 **
children_leaving_care  0.037193   0.019703   1.888  0.06109 . 
self_harm_persons      0.005501   0.002307   2.385  0.01840 * 
opiates                0.123432   0.050315   2.453  0.01536 * 
marital_breakup        0.218171   0.134664   1.620  0.10741   
homeless              -0.212926   0.075547  -2.818  0.00551 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.705 on 143 degrees of freedom
Multiple R-squared:  0.385, Adjusted R-squared:  0.3635 
F-statistic: 17.91 on 5 and 143 DF,  p-value: 9.201e-14

autoplot(sig1)

Wait… now marital breakup isn’t significant?

# Changing model name method since now we will be looking at several models

sig2 <- lm(suicide_rate ~ children_leaving_care + self_harm_persons + opiates + 
             homeless, phe)

summary(sig2)


Call:
lm(formula = suicide_rate ~ children_leaving_care + self_harm_persons + 
    opiates + homeless, data = phe)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8777 -1.1053  0.0011  1.0022  5.8723 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            7.033367   0.546679  12.866  < 2e-16 ***
children_leaving_care  0.044142   0.019338   2.283  0.02392 *  
self_harm_persons      0.006671   0.002203   3.028  0.00292 ** 
opiates                0.121234   0.050580   2.397  0.01782 *  
homeless              -0.227131   0.075459  -3.010  0.00309 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.714 on 144 degrees of freedom
Multiple R-squared:  0.3738,    Adjusted R-squared:  0.3564 
F-statistic: 21.49 on 4 and 144 DF,  p-value: 6.473e-14

autoplot(sig2)

Relying on “statistical” significance means you will end-up shooting yourself in the foot, metaphorically speaking, i.e., make a bad model.

4.21.2 \(R^2\)? (Don’t use it to choose models).

Recall the definition of \(R^2\): It is the proportion of variability in the response variable explained by your predictor variables in the linear model.

Higher \(R^2\) is better, so maybe we should choose the model with the highest \(R^2\)?

The full model has an \(R^2\) of 0.5031.
After whittling down some variables due to multicollinearity, model1 has an \(R^2\) of 0.385.
model2 has an \(R^2\) of 0.3738.

So the full model is best! Right? All this stuff about variable elimination is pointless. I wasted hours of my life to get to this point…. No!

Never determine which is the best model via \(R^2\).

\(R^2\) can be arbitrarily high when using many variables that may not be actually related to the response variable in any meaningful way.

n = 500
a = 10; b = 10

### MAKING DATA WITH A TRUE LINEAR MODEL
x <- rnorm(n, 10, 5)
y = a + b*x + rnorm(n,0 ,80)

data <- data.frame(y,x)

ggplot(data, aes(x,y)) +
  geom_point() +
  geom_smooth(method = 'lm')

summary(lm(y ~ x, data))


Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-212.297  -56.472   -0.473   47.840  252.631 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.6028     8.2493   2.376   0.0179 *  
x             9.4893     0.7394  12.833   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 81.85 on 498 degrees of freedom
Multiple R-squared:  0.2485,    Adjusted R-squared:  0.247 
F-statistic: 164.7 on 1 and 498 DF,  p-value: < 2.2e-16

The R-squared here for the perfectly correct model is 0.2485242
- R-squared can be low even when you use the “correct” model.
Now let’s throw in some irrelevant data and graph what happens to R-squared.

# Extra worthless predictors

p = 300

extra.predictors <- data.frame(matrix(rnorm(n*p), nrow=n))

full.data <- data.frame(data, extra.predictors)# need code to combine true and extra data

rsq = rep(NA, p+1)

for(iter in 1:(p+1)){
form <-formula(paste('y ~ ', paste(names(full.data)[2:(iter+1)], collapse=' + ')))
fit.mdl <- lm(form, data=full.data)
rsq[iter] <- summary(fit.mdl)$r.squared # Get the rsq value for the model.
}
ggplot(data.frame(nvars=1:(p+1), rsq),aes(x=nvars,y=rsq)) +
geom_path()

The beginning of the line on the left represents the R-squared value when using the one predictor variable that actually has any true relation with the response variable.
When we add on additional predictors (useless ones), R-squared always increases.

4.21.3 Adjusted \(R^2\) (More conservative)

\[R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}\] This is supposed to correct for the fact that \(R^2\) always increases with a greater number of variables (penalizes for the complexity of the model).

Here’s a table for most of the models we’ve discussed so far. It has both the adjusted and unadjusted form.

     model       Rsq    AdjRsq
1   fullLm 0.5030783 0.3767423
2 redModel 0.4176079 0.3519246
3     cor1 0.2671452 0.2621598
4     cor2 0.3275478 0.3183361
5     cor3 0.3476621 0.3341654
6     cor4 0.3508866 0.3328556
7     cor5 0.3514271 0.3287497
8     cor6 0.3943409 0.3687496

But… it doesn’t necessarily. Don’t trust it either.

4.21.4 Predicted \(R^2\) (Even more conservative)

Predicted R-squared or as I will write \(R^2{pred}\) is another play on the same idea as \(R^2\), except for it is based on taking out an observation, computing the regression model without it, then seeing how well the model predicts the left out value.

You can get it via the olsrr package using the ols_pred_rsq() function. The input is your model.

For instance, let us compare this to \(R^2\) from the performance package.

r2(fullLm)

# R2 for Linear Regression
       R2: 0.503
  adj. R2: 0.377

library(olsrr)
ols_pred_rsq(fullLm)

[1] 0.1258262

That’s a lot worse than the 0.503 for the unadjusted R-Squared

Here’s a new table including predicted R-squared

     model       Rsq    AdjRsq   PredRsq
1   fullLm 0.5030783 0.3767423 0.1258262
2 redModel 0.4176079 0.3519246 0.2264297
3     cor1 0.2671452 0.2621598 0.2445699
4     cor2 0.3275478 0.3183361 0.2929179
5     cor3 0.3476621 0.3341654 0.3003604
6     cor4 0.3508866 0.3328556 0.2948558
7     cor5 0.3514271 0.3287497 0.2818233
8     cor6 0.3943409 0.3687496 0.3218613

\(R^2\) only tells how well the model works on the data you have (overfitting kills its utility).
\(R^2_{adj}\) tries to make an attempt at saying how well the model would work on the wider population/future observations.
\(R^2_{pred}\) is probably the best measure of how well your model will actually work with new data.
NONE of these are perfect so use these tools with caution and care.

4.21.5 Akaike Information Criterion AIC

One last point of discussion will be the Akaike Information Criterion.

\[AIC = 2(p+1) + n\log(SSE/n) - C\]

There’s a lot of theory behind this one and the Bayesian Information Criterion (BIC)

\[BIC = (p+1)\ln(n) + n\log(SSE/n) - C\]

This is where things get more confusing since AIC and BIC are numbers where SMALLER IS BETTER.

We will just be looking at AIC

     model       Rsq    AdjRsq   PredRsq      AIC
1   fullLm 0.5030783 0.3767423 0.1258262 183.0882
2 redModel 0.4176079 0.3519246 0.2264297 176.7362
3     cor1 0.2671452 0.2621598 0.2445699 182.9770
4     cor2 0.3275478 0.3183361 0.2929179 172.1605
5     cor3 0.3476621 0.3341654 0.3003604 169.6356
6     cor4 0.3508866 0.3328556 0.2948558 170.8973
7     cor5 0.3514271 0.3287497 0.2818233 172.7732
8     cor6 0.3943409 0.3687496 0.3218613 164.5730

Different measures will tell you different things.

Here \(R^2_{pred}\) and AIC agree that the cor6 model is the best.

cor6 <- lm(suicide_rate ~ self_harm_persons + opiates + marital_breakup + 
             children_leaving_care + depression + homeless, phe)

So is that the one we use?

It’s not that simple.

4.22 Variable Selection Methods: Problems and Pitfalls

Variable selection = methods for choosing which predictor variables to include in statistical models
Focus on stepwise regression as a common but problematic approach
Stepwise selection: automated process of adding/removing variables based on statistical significance (p-values and model fit)

4.22.1 Major Problems with Stepwise Selection

4.22.1.1 1. Biased R-squared Values

R-squared values are inflated
Makes model appear to explain more variation than it actually does
Gives false confidence in model’s predictive ability

4.22.1.2 2. Invalid Statistical Inference

P-values are too small (not valid)
Confidence intervals are too narrow
Doesn’t account for multiple testing problem
Standard errors are biased low

4.22.1.3 3. Biased Regression Coefficients

Coefficients are biased away from zero
More likely to select variables with overestimated effects
Selection process favors chance findings
True effects may be much smaller than estimated

4.22.1.4 4. Model Instability

Results highly dependent on sample
Small changes in data can lead to different variables being selected
Poor reproducibility
Different samples likely to produce different models

4.22.2 Example: House Price Prediction

Scenario: 20 potential predictors of house prices
First sample might select:
- Square footage
- Number of bathrooms
- Lot size
Second sample might select completely different variables:
- Age of house
- Number of bedrooms
- Distance to downtown
Demonstrates instability of selection process

4.22.3 Better Alternatives

4.22.3.1 1. Theory-Driven Selection

Use subject matter knowledge
Select variables based on theoretical importance
Include known important predictors regardless of significance

4.22.3.2 2. Include More Variables

Keep theoretically important variables
Don’t eliminate based solely on statistical significance
Better to include too many than too few important variables
- Degrees of freedom permitting of course!

4.22.3.3 3. Alternative Dimension Reduction Methods

Principal Components Analysis
Regularization methods (LASSO, Ridge regression)
Data reduction techniques that don’t use outcome variable

4.22.4 Key Takeaways

Avoid automated variable selection methods
Don’t let computational convenience override good statistical practice
Complex but theoretically sound models preferred over overly simplified ones
Statistical significance shouldn’t be the sole criterion for variable inclusion

5 Prespecification of Predictor Complexity in Statistical Modeling

5.1 I. Introduction to Linear Relationships

Truly linear relationships are rare in real-world data
Notable exception: Same measurements at different timepoints
- Example: Blood pressure before and after treatment
Most relationships between predictors and outcomes are nonlinear
Linear modeling often chosen due to data limitations, not reality

5.2 Problems with Post-Hoc Simplification

5.2.1 Common but Problematic Approaches:

Examining scatter plots
Checking descriptive statistics
Using informal assessments
Modifying model based on these observations

5.2.2 Key Issue:

Creates “phantom degrees of freedom”
Informal assessments use degrees of freedom not accounted for in:
- Standard errors
- P-values
- R² values

5.3 The Prespecification Approach

5.3.1 Core Principles:

Decide on predictor complexity before examining relationships
Base decisions on:
- Effective sample size
- Prior knowledge
- Expected predictor importance
Maintain decisions regardless of analysis results

5.3.2 Benefits:

More reliable statistical inference
Better representation of uncertainty
Prevention of bias from data-driven simplification

5.4 Practical Implementation

5.4.1 Guidelines for Complexity:

Allow more complex representations for:
1. Stronger expected relationships
2. Larger effective sample sizes
3. More important predictors

5.4.2 Examples of Implementation:

Using splines with more knots for important predictors
Scaling complexity to sample size
Matching complexity to theoretical importance

5.5 Validation and Testing

5.5.1 Allowed Practices:

Graphing estimated relationships
Performing nonlinearity tests
Presenting results to readers

5.5.2 Important Rule:

Maintain prespecified complexity even if simpler relationships appear adequate

5.6 The Directional Principle

5.6.1 Key Concepts:

Moving simple → complex
- Degrees of freedom properly increase
- Statistical tests maintain distribution
Moving complex → simple
- Requires special adjustments
- May compromise statistical validity

5.7 Importance and Impact

5.7.1 Benefits of Prespecification:

Prevents optimistic performance estimates
Maintains valid statistical inference
Provides reliable predictions
Avoids data-driven simplification bias

5.7.2 Trade-offs:

May appear conservative
Slight overfitting preferred to spurious precision

5.8 Summary

Prespecify predictor complexity based on prior knowledge
Avoid data-driven simplification
Maintain statistical validity through consistent approach
Better to slightly overfit than create spuriously precise estimates

5.9 Sample Size Requirements & Overfitting in Regression Models

5.9.1 Definition

Overfitting occurs when model complexity exceeds data information content
Results in:
- Inflated measures of model performance (e.g., R²)
- Poor prediction on new data
- Model fits noise rather than signal

5.9.1.1 Visual Example

Consider two models of the same data:
Simple model:    y ~ x
Overfit model:   y ~ x + x² + x³ + x⁴ + ...

5.9.2 The m/15 Rule

5.9.2.1 Limiting Sample Size (m)

Type of Response	Limiting Sample Size (m)
Continuous	Total sample size (n)
Binary	min(n₁, n₂) smaller group
Ordinal (k categories)	n - (1/n)Σnᵢ³
Survival time	Number of events/failures

5.9.2.2 Basic Rule

Number of parameters (p) should be < m/15
Some situations require more conservative p < m/20
Less conservative p < m/10 possible

5.9.3 Counting Parameters

5.9.3.1 Include ALL of these in your count:

Main predictor variables
Nonlinear terms
Interaction terms
Dummy variables for categorical predictors (k-1)

5.9.3.2 Example Parameter Count

Model components:
- Age (nonlinear, 3 knots) = 2 parameters
- Sex (binary) = 1 parameter
- Treatment (3 categories) = 2 parameters
- Age × Treatment interaction = 4 parameters
Total = 9 parameters

5.9.4 Special Considerations

5.9.4.1 Need More Conservative Ratios When:

Predictors have narrow distributions
- e.g., age range 30-45 years only
Highly unbalanced categorical variables
- e.g., 95% in one category
Clustered measurements
Small effect sizes expected

5.9.5 Practical Example

5.9.5.1 Binary Outcome Study

Study population:
- 1000 total patients
- 100 heart attacks
- 900 no heart attacks

Calculations:
m = min(100, 900) = 100
Maximum parameters = 100/15 ≈ 6-7

5.9.6 Alternative Approaches

5.9.6.1 Other Methods to Assess/Prevent Overfitting:

Shrinkage estimates
Cross-validation
Bootstrap validation
Penalized regression methods

5.9.7 Sample Size for Variance Estimation

Need ~70 observations for ±20% precision in \(\sigma\) estimate
Affects all standard errors and p-values
Critical for reliable inference

5.9.8 Key Takeaways

Count Everything: Include all terms in parameter count
Be Conservative: Use m/15 as starting point
Consider Context: Adjust for data peculiarities
Validate: Use shrinkage or cross-validation
Simplify: Prefer simpler models when in doubt

5.9.9 Practice Problems

Calculate maximum parameters for:
- 500 patients, continuous outcome
- 1000 patients, 150 events (survival)
- 300 patients, 50/250 binary outcome

Count parameters in model:

y ~ age + age² + sex + treatment*race
where treatment has 3 levels and race has 4 levels

5.10 Shrinkage in Statistical Models: Understanding the Basics

5.10.1 Introduction

Statistical models can suffer from overfitting
Overfitting: model performs well on training data but poorly on new data
Similar to memorizing test answers without understanding concepts
Need methods to make models more reliable and generalizable

5.10.2 What is Shrinkage?

Technique to prevent overfitting
Makes model predictions more conservative and reliable
Acts like a “leash” on model coefficients
Helps handle regression to the mean

5.10.3 Example

Study of 10 different medical treatments
Some treatments appear very effective by chance
When tested on new patients, effects usually less extreme
Natural tendency for extreme results to move toward average
This movement toward average is “shrinkage”

5.10.4 Key Shrinkage Methods

5.10.4.1 1. Ridge Regression

Adds penalty to prevent large coefficients
Characteristics:
- Like rubber band pulling coefficients toward zero
- Keeps all variables in model
- Reduces size of effects
- Ideal for correlated predictors

5.10.4.2 2. LASSO Regression

Least Absolute Shrinkage and Selection Operator
Characteristics:
- Can force coefficients to exactly zero
- Performs variable selection
- Creates simpler models
- Good for identifying important variables

5.10.5 Benefits of Shrinkage

More stable predictions
Better performance on new data
Protection against overfitting
More reliable assessment of variable importance

5.10.6 Key Takeaway

Shrinkage represents statistical humility
Prevents extreme predictions based on limited data
Makes models more reliable and practical

5.11 Data Reduction Methods

5.11.1 Definition

Process of reducing number of parameters in statistical models
Focus on dimension reduction without using response variable Y
“Unsupervised” approach to prevent overfitting

5.11.2 Purpose

Improve model stability
Reduce overfitting
Maintain statistical inference validity
Handle situations with many predictors relative to sample size

5.11.3 Redundancy Analysis

Core Concept
- Identifies predictors well-predicted by other predictors
- Removes redundant variables systematically
Implementation Process
- Convert predictors to appropriate forms
  - Continuous → restricted cubic splines
  - Categorical → dummy variables
- Use OLS for prediction
- Remove highest R² predictors
- Iterate until threshold reached

5.11.4 Variable Clustering

Purpose
- Group related predictors
- Identify independent dimensions
- Simplify model structure
Methods
- Statistical clustering with correlations
- Principal component analysis (oblique rotation)
- Hierarchical cluster analysis
Important Considerations
- Use robust measures for skewed variables
- Consider rank-based measures
- Hoeffding’s D for non-monotonic relationships

5.11.5 Variable Transformation and Scaling

Key Methods
- Maximum Total Variance (MTV)
- Maximum Generalized Variance (MGV)
Process Goals
- Optimize variable transformations
- Maximize relationships between predictors
- Reduce complexity
Benefits
- Fewer nonlinear terms needed
- Better interpretability
- More meaningful combinations

5.11.6 Simple Scoring of Variable Clusters

Approaches
- First principal component
- Weighted sums
- Expert-assigned severity points
Common Applications
- Binary predictor groups
- Hierarchical scoring systems
- Implementation-focused solutions

5.12 Implementation Guidelines

5.12.1 Best Practices

Prioritize subject matter knowledge
Validate without response variable
Use independent data for validation
Document reduction decisions
Balance simplicity vs. information

5.12.2 Recommended Workflow

Start with redundancy analysis
Apply variable clustering
Transform within clusters if needed
Create simple scores where appropriate
Validate each step

5.13 Key Considerations

5.13.1 Advantages

Prevents overfitting
Maintains statistical validity
Improves model stability
Enhances interpretability

5.13.2 Limitations

Potential information loss
Trade-off with complexity
Need for validation

5.14 Discussion Points

5.14.1 Critical Questions

Choosing between methods
Determining optimal clusters
Balancing complexity and interpretability

5.14.2 Implementation Challenges

Deciding on thresholds
Handling mixed variable types
Validating reduction decisions

5.14.3 Remarks

Essential for stable modeling with many predictors
Requires thoughtful method combination
Focus on maintaining predictive power
Ensure statistical validity

5.15 Data Reduction Techniques Examples

First, let’s load the required packages and prepare our data:

library(Hmisc)
library(rms)
library(pcaPP)

# Read the data
df <- read.csv(here::here(
  "datasets",
  "phe.csv"))

We have 149 rows of data
We want to investigate the continuous outcome of suicide_rate
m/15 rule: we can only have about 10 parameters in the model (we could be more liberal in loosen it to 15 variables)

5.15.1 1. Redundancy Analysis

Redundancy analysis helps identify predictors that can be well-predicted from other variables:

# Perform redundancy analysis
df2 = df %>% dplyr::select(-suicide_rate)
redun_result <- redun(~ .,
                      data=df, r2=0.6)

# Print results
redun_result


Redundancy Analysis

~.

n: 149  p: 31   nk: 3 

Number of NAs:   0 

Transformation of target variables forced to be linear

R-squared cutoff: 0.6   Type: ordinary 

R^2 with which each variable can be predicted from all other variables:

        children_youth_justice       adult_carers_isolated_18 
                         0.474                          0.634 
adult_carers_isolated_all_ages      adult_carers_not_isolated 
                         0.612                          0.769 
                 alcohol_rx_18            alcohol_rx_all_ages 
                         0.965                          0.998 
          alcohol_admissions_f            alchol_admissions_m 
                         0.999                          1.000 
          alcohol_admissions_p          children_leaving_care 
                         0.686                          0.832 
                    depression                 domestic_abuse 
                         0.738                          0.729 
              self_harm_female                 self_harm_male 
                         1.000                          1.000 
             self_harm_persons                        opiates 
                         1.000                          0.952 
            lt_health_problems                lt_unepmloyment 
                         0.945                          0.876 
         looked_after_children                marital_breakup 
                         0.889                          0.706 
               old_pople_alone                          alone 
                         0.956                          0.906 
      self_reported_well_being                            smi 
                         0.520                          0.859 
                social_care_mh                       homeless 
                         0.513                          0.790 
                    alcohol_rx             drug_rx_non_opiate 
                         0.761                          0.787 
                drug_rx_opiate                   suicide_rate 
                         0.658                          0.639 
                  unemployment 
                         0.926 

Rendundant variables:

self_harm_persons alchol_admissions_m alcohol_rx_all_ages alcohol_rx_18
old_pople_alone self_harm_male unemployment looked_after_children
lt_health_problems smi drug_rx_non_opiate opiates children_leaving_care
alcohol_admissions_f homeless self_harm_female


Predicted from variables:

children_youth_justice adult_carers_isolated_18
adult_carers_isolated_all_ages adult_carers_not_isolated
alcohol_admissions_p depression domestic_abuse lt_unepmloyment
marital_breakup alone self_reported_well_being social_care_mh
alcohol_rx drug_rx_opiate suicide_rate

        Variable Deleted   R^2
1      self_harm_persons 1.000
2    alchol_admissions_m 1.000
3    alcohol_rx_all_ages 0.978
4          alcohol_rx_18 0.960
5        old_pople_alone 0.950
6         self_harm_male 0.927
7           unemployment 0.896
8  looked_after_children 0.866
9     lt_health_problems 0.802
10                   smi 0.783
11    drug_rx_non_opiate 0.727
12               opiates 0.718
13 children_leaving_care 0.704
14  alcohol_admissions_f 0.684
15              homeless 0.663
16      self_harm_female 0.600
                                                            R^2 after later deletions
1           1 1 1 1 0.987 0.987 0.986 0.986 0.986 0.985 0.985 0.984 0.981 0.981 0.625
2   0.997 0.997 0.997 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.996 0.7 0.696 0.665
3       0.978 0.978 0.977 0.977 0.976 0.974 0.973 0.972 0.972 0.972 0.717 0.713 0.677
4             0.959 0.955 0.952 0.951 0.951 0.947 0.947 0.792 0.787 0.755 0.751 0.742
5                    0.949 0.941 0.941 0.83 0.809 0.801 0.765 0.765 0.753 0.731 0.722
6                          0.925 0.923 0.921 0.92 0.917 0.915 0.912 0.895 0.891 0.636
7                                0.892 0.888 0.883 0.883 0.878 0.867 0.856 0.85 0.848
8                                      0.862 0.86 0.859 0.853 0.777 0.755 0.751 0.722
9                                           0.794 0.792 0.783 0.782 0.768 0.753 0.733
10                                                0.781 0.746 0.741 0.724 0.713 0.704
11                                                      0.724 0.718 0.717 0.716 0.702
12                                                            0.712 0.691 0.689 0.673
13                                                                  0.689 0.674 0.637
14                                                                        0.681 0.656
15                                                                              0.657
16

As with our section on multicollinarity, we can see that a number of variables are redundant. If we were to remove all these variables we would at least satisfy m/10 which is a bit more liberal than the m/15 guidance.

5.15.2 2. Variable Clustering

Let’s perform hierarchical clustering on our variables to identify related groups:

# Perform variable clustering
vc <- varclus(~ .,
              data=df2, sim= 'hoeffding')

# Plot dendrogram
plot(vc)

We can see a lot of the variables clustering together

5.15.3 3. Principal Components Analysis

Let’s examine the data structure using PCA:

# Perform PCA
pca_result <- princomp(df2, cor=TRUE)

# Scree plot
plot(pca_result, type="lines", main="Scree Plot")
abline(h=1, lty=2, col="red")  # Kaiser criterion line

# Print variance explained by first few components
summary(pca_result, loadings=TRUE, cutoff=0.3)

Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     3.1327307 2.2436841 1.51857984 1.33713169 1.21224371
Proportion of Variance 0.3271334 0.1678039 0.07686949 0.05959737 0.04898449
Cumulative Proportion  0.3271334 0.4949373 0.57180682 0.63140419 0.68038869
                           Comp.6     Comp.7    Comp.8     Comp.9    Comp.10
Standard deviation     1.07625280 1.01836888 0.9747646 0.92320127 0.86023787
Proportion of Variance 0.03861067 0.03456917 0.0316722 0.02841002 0.02466697
Cumulative Proportion  0.71899936 0.75356853 0.7852407 0.81365075 0.83831772
                          Comp.11    Comp.12    Comp.13    Comp.14    Comp.15
Standard deviation     0.76164102 0.74326524 0.70575177 0.67895928 0.65072428
Proportion of Variance 0.01933657 0.01841477 0.01660285 0.01536619 0.01411474
Cumulative Proportion  0.85765429 0.87606907 0.89267192 0.90803811 0.92215284
                          Comp.16    Comp.17    Comp.18    Comp.19     Comp.20
Standard deviation     0.62502778 0.57314172 0.56636534 0.55357002 0.478567350
Proportion of Variance 0.01302199 0.01094971 0.01069232 0.01021466 0.007634224
Cumulative Proportion  0.93517484 0.94612455 0.95681687 0.96703153 0.974665755
                           Comp.21     Comp.22     Comp.23     Comp.24
Standard deviation     0.473155304 0.413788556 0.340046283 0.302671397
Proportion of Variance 0.007462531 0.005707366 0.003854382 0.003053666
Cumulative Proportion  0.982128287 0.987835652 0.991690035 0.994743701
                           Comp.25     Comp.26     Comp.27      Comp.28
Standard deviation     0.250796490 0.224198463 0.174227171 0.1176538993
Proportion of Variance 0.002096629 0.001675498 0.001011837 0.0004614147
Cumulative Proportion  0.996840330 0.998515828 0.999527665 0.9999890799
                            Comp.29      Comp.30
Standard deviation     1.669952e-02 6.980650e-03
Proportion of Variance 9.295803e-06 1.624316e-06
Cumulative Proportion  9.999984e-01 1.000000e+00

Loadings:
                               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
children_youth_justice                               0.378                0.580
adult_carers_isolated_18                             0.413                0.307
adult_carers_isolated_all_ages                                                 
adult_carers_not_isolated                                                      
alcohol_rx_18                                                                  
alcohol_rx_all_ages                                                            
alcohol_admissions_f                                                           
alchol_admissions_m                                                            
alcohol_admissions_p                                                           
children_leaving_care                                                          
depression                                                                     
domestic_abuse                                              0.342              
self_harm_female                                                               
self_harm_male                                                                 
self_harm_persons                                                              
opiates                                                                        
lt_health_problems                                                             
lt_unepmloyment                                                                
looked_after_children                                                          
marital_breakup                                                                
old_pople_alone                        0.338                                   
alone                                               -0.434                     
self_reported_well_being                                    0.381              
smi                                                 -0.366                     
social_care_mh                                              0.488              
homeless                              -0.325                                   
alcohol_rx                                   -0.543                            
drug_rx_non_opiate                           -0.558                            
drug_rx_opiate                               -0.397                            
unemployment                                                                   
                               Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
children_youth_justice                                                      
adult_carers_isolated_18                                                    
adult_carers_isolated_all_ages         0.351         -0.499  -0.301         
adult_carers_not_isolated                                                   
alcohol_rx_18                                                               
alcohol_rx_all_ages                                                         
alcohol_admissions_f                                                        
alchol_admissions_m                                                         
alcohol_admissions_p                  -0.378                                
children_leaving_care                                                       
depression                                                    0.340   0.373 
domestic_abuse                               -0.397           0.304         
self_harm_female                                                            
self_harm_male                                                              
self_harm_persons                                                           
opiates                                                               0.335 
lt_health_problems                                                          
lt_unepmloyment                                                             
looked_after_children                                                       
marital_breakup                       -0.316         -0.366   0.442         
old_pople_alone                                                      -0.344 
alone                                                                       
self_reported_well_being        0.605 -0.374                                
smi                                                                         
social_care_mh                 -0.618                                       
homeless                                                                    
alcohol_rx                                                                  
drug_rx_non_opiate                                                          
drug_rx_opiate                                                0.312  -0.320 
unemployment                                                                
                               Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19
children_youth_justice                  0.376                                 
adult_carers_isolated_18       -0.505  -0.378                                 
adult_carers_isolated_all_ages                                  0.304         
adult_carers_not_isolated      -0.390           0.510                  -0.459 
alcohol_rx_18                                                                 
alcohol_rx_all_ages                                                           
alcohol_admissions_f                                                          
alchol_admissions_m                                                           
alcohol_admissions_p                            0.376  -0.366                 
children_leaving_care                                   0.393                 
depression                             -0.304           0.406          -0.387 
domestic_abuse                                                          0.311 
self_harm_female                                                              
self_harm_male                                                                
self_harm_persons                                                             
opiates                                                -0.305                 
lt_health_problems                                                            
lt_unepmloyment                                -0.448                  -0.334 
looked_after_children                                                         
marital_breakup                                                               
old_pople_alone                                                               
alone                                                                         
self_reported_well_being                                                      
smi                                                                           
social_care_mh                                                                
homeless                                                        0.483         
alcohol_rx                                                                    
drug_rx_non_opiate                                              0.338         
drug_rx_opiate                                                 -0.382         
unemployment                                                                  
                               Comp.20 Comp.21 Comp.22 Comp.23 Comp.24 Comp.25
children_youth_justice                                                        
adult_carers_isolated_18                                                      
adult_carers_isolated_all_ages                                                
adult_carers_not_isolated                                                     
alcohol_rx_18                                                                 
alcohol_rx_all_ages                                                           
alcohol_admissions_f                                                          
alchol_admissions_m                                                           
alcohol_admissions_p                                                          
children_leaving_care                                  -0.469                 
depression                                                                    
domestic_abuse                                                                
self_harm_female                                                        0.495 
self_harm_male                                                         -0.668 
self_harm_persons                                                             
opiates                                                                       
lt_health_problems              0.423                                         
lt_unepmloyment                                                 0.422         
looked_after_children                                   0.634   0.429         
marital_breakup                -0.370                                         
old_pople_alone                 0.307                                         
alone                                  -0.441                                 
self_reported_well_being                                                      
smi                                     0.561  -0.314                         
social_care_mh                                                                
homeless                        0.349                                         
alcohol_rx                                      0.525                         
drug_rx_non_opiate                     -0.316  -0.454                         
drug_rx_opiate                                                                
unemployment                                                   -0.659         
                               Comp.26 Comp.27 Comp.28 Comp.29 Comp.30
children_youth_justice                                                
adult_carers_isolated_18                                              
adult_carers_isolated_all_ages                                        
adult_carers_not_isolated                                             
alcohol_rx_18                   0.342  -0.625                         
alcohol_rx_all_ages                            -0.760                 
alcohol_admissions_f                            0.590   0.513         
alchol_admissions_m                                    -0.806         
alcohol_admissions_p                                                  
children_leaving_care                                                 
depression                                                            
domestic_abuse                                                        
self_harm_female                                                0.487 
self_harm_male                         -0.353                   0.338 
self_harm_persons                                              -0.804 
opiates                        -0.379   0.435                         
lt_health_problems              0.479                                 
lt_unepmloyment                                                       
looked_after_children                                                 
marital_breakup                                                       
old_pople_alone                -0.488                                 
alone                                                                 
self_reported_well_being                                              
smi                                                                   
social_care_mh                                                        
homeless                                                              
alcohol_rx                                                            
drug_rx_non_opiate                                                    
drug_rx_opiate                                                        
unemployment                   -0.312

5.15.4 4. Sparse Principal Components Analysis

Using pcaPP for robust sparse PCA.

# Perform sparse PCA
sparse_pca <- sPCAgrid(df2, k=10, method = 'sd' ,
center =mean , scale =sd , scores =TRUE ,
maxiter =10)

# Plot variance explained
plot(sparse_pca,type = 'lines' , main= ' ')

# Print loadings
print(sparse_pca$loadings)


Loadings:
                               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
children_youth_justice                                             1.000       
adult_carers_isolated_18                                                       
adult_carers_isolated_all_ages         0.464                                   
adult_carers_not_isolated       0.252                                          
alcohol_rx_18                   0.295                                          
alcohol_rx_all_ages             0.305                                          
alcohol_admissions_f            0.295                                          
alchol_admissions_m             0.303                                          
alcohol_admissions_p            0.161                                          
children_leaving_care           0.248                                          
depression                      0.111                                          
domestic_abuse                  0.106                                     0.791
self_harm_female                0.131               -0.264                     
self_harm_male                  0.222                                          
self_harm_persons               0.176                                          
opiates                         0.278                                          
lt_health_problems              0.210                                          
lt_unepmloyment                 0.226                                          
looked_after_children           0.300                                          
marital_breakup                 0.113                                          
old_pople_alone                        0.636                                   
alone                           0.122                                    -0.607
self_reported_well_being                             0.965                     
smi                             0.122                                          
social_care_mh                                              1.000              
homeless                              -0.613                                   
alcohol_rx                                    0.707                            
drug_rx_non_opiate                            0.707                            
drug_rx_opiate                                                                 
unemployment                    0.234                                          
                               Comp.8 Comp.9 Comp.10
children_youth_justice                              
adult_carers_isolated_18                            
adult_carers_isolated_all_ages                      
adult_carers_not_isolated                           
alcohol_rx_18                                       
alcohol_rx_all_ages                                 
alcohol_admissions_f                                
alchol_admissions_m                                 
alcohol_admissions_p                         -0.574 
children_leaving_care                               
depression                      0.903               
domestic_abuse                                      
self_harm_female                                    
self_harm_male                                      
self_harm_persons                      0.444        
opiates                                             
lt_health_problems                                  
lt_unepmloyment                                     
looked_after_children                               
marital_breakup                               0.819 
old_pople_alone                                     
alone                                               
self_reported_well_being                            
smi                                                 
social_care_mh                                      
homeless                                            
alcohol_rx                                          
drug_rx_non_opiate                                  
drug_rx_opiate                         0.896        
unemployment                   -0.426               

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.033  0.033  0.033  0.033  0.033  0.033  0.033  0.033  0.033
Cumulative Var  0.033  0.067  0.100  0.133  0.167  0.200  0.233  0.267  0.300
               Comp.10
SS loadings      1.000
Proportion Var   0.033
Cumulative Var   0.333

Like are variable clutstering, we can see that lots of variables load onto the first principal component. We could add component 1 to our regression model and keep the rest of the variables (not those loaded on comp. 1).

However, let’s instead use stepwise regression only on the preditors in combination with sparse PCA to select variables for 10 components in our model.

Stepwise here tells us which predictors that load onto each PC are needed to predict that pc

spca1 <- princmp(df2, sw = TRUE ,
                 k = 10,
                 kapprox = 10,
                 method = "sparse", 
                 cor = TRUE, 
                 nvmax = 30)

print(spca1)

Sparse Principal Components Analysis

Stepwise Approximations to PCs With Cumulative R^2

PC 1 
alchol_admissions_m (0.975) + self_harm_persons (0.999) +
looked_after_children (1)

PC 2 
adult_carers_isolated_all_ages (0.759) + homeless (0.978) +
old_pople_alone (1)

PC 3 
drug_rx_non_opiate (0.89) + alcohol_rx (1)

PC 4 
self_harm_female (1)

PC 5 
adult_carers_isolated_18 (0.861) + self_harm_female (1)

PC 6 
self_harm_female (0.993) + self_reported_well_being (1)

PC 7 
social_care_mh (1)

PC 8 
children_youth_justice (1)

PC 9 
domestic_abuse (0.872) + alone (1)

PC 10 
depression (0.875) + unemployment (0.999) + old_pople_alone (1)

We get 10 components, some with many predictors, some with only one.

5.15.5 Put it all together

Using theory, we could decide to create cluster scores for some of these components, while others with only 2 variables we might want to just eliminate one of the variables.

For those that we want to have a composite score, we can use the first principal component of their scores. Or, if we want to use this scoring outside of this context we can get a regression model on the first PC.

Let me demonstrate with our first component from the sparse PCA.

# Get pc scores for the first sparse PC variables
pc1_scores = princmp(df2[,c(names(spca1$sw[[1]]))],
                     method = "regular",
                     cor = TRUE)$scores[,1]
df2$pc1_scores = pc1_scores

mod1 = lm(pc1_scores~alchol_admissions_m + self_harm_persons , data = df2)

parameters::parameters(mod1) |> print_md()

Parameter	Coefficient	SE	95% CI	t(146)	p
(Intercept)	-6.53	0.18	(-6.88, -6.17)	-36.38	< .001
alchol admissions m	3.63e-03	1.47e-04	(3.34e-03, 3.92e-03)	24.70	< .001
self harm persons	9.09e-03	4.38e-04	(8.22e-03, 9.95e-03)	20.75	< .001

Now, we could report those regression coefficients to get a pretty reasonable estimate of the scoring for that component of the model.

Now, let’s get the second and third scores as well.

# Get pc scores for the second sparse PC variables
pc2_scores = princmp(df2[,c(names(spca1$sw[[2]]))],
                     method = "regular",
                     cor = TRUE)$scores[,1]

df2$pc2_scores = pc2_scores

pc3_scores = princmp(df2[,c(names(spca1$sw[[3]]))],
                     method = "regular",
                     cor = TRUE)$scores[,1]

df2$pc3_scores = pc3_scores

For the rest, we will simplify things a bit and just use the first variable of the remaining 10 sparse PCA components.

Let’s now fit a model based using our 3 PCs and the 7 other variables we selected.

df_all = df2 %>%
  select(pc1_scores, pc2_scores, pc3_scores,
         self_harm_female,
         adult_carers_isolated_18,
         self_harm_female,
         social_care_mh,
         children_youth_justice,
         domestic_abuse,
         depression)

df_all$suicide_rate = df$suicide_rate

# 10 predictors, 1 outcome, 149 observations
model_all = lm(
  data = df_all,
  suicide_rate ~ .
)

summary(model_all)


Call:
lm(formula = suicide_rate ~ ., data = df_all)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2067 -1.1131 -0.1051  0.9371  4.1430 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               9.6296409  1.6223906   5.935 2.22e-08 ***
pc1_scores                0.5747571  0.1555263   3.696 0.000315 ***
pc2_scores                0.3436984  0.1388871   2.475 0.014539 *  
pc3_scores               -0.1009508  0.1084578  -0.931 0.353579    
self_harm_female          0.0025597  0.0024544   1.043 0.298819    
adult_carers_isolated_18  0.0239721  0.0243409   0.985 0.326408    
social_care_mh            0.0010556  0.0004836   2.183 0.030729 *  
children_youth_justice   -0.0310969  0.0241586  -1.287 0.200165    
domestic_abuse            0.0266173  0.0359734   0.740 0.460599    
depression               -0.1202582  0.1114831  -1.079 0.282584    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.729 on 139 degrees of freedom
Multiple R-squared:  0.3853,    Adjusted R-squared:  0.3455 
F-statistic: 9.681 on 9 and 139 DF,  p-value: 2.122e-11

Now let’s check our assumptions:

check_model(model_all)

5.15.6 Analysis Summary

From the redundancy analysis, we can identify variables that are highly predictable from others, helping reduce dimensionality while maintaining information.
The variable clustering dendrogram shows natural groupings in our data, which can guide feature selection or creation of composite scores.
The PCA results show how many components are needed to explain a certain percentage of variance in the data.
Sparse PCA provides a more interpretable solution by forcing some loadings to zero while maintaining most of the explained variance.

5.15.7 Recommendations for Data Reduction

Based on these analyses, we can recommend:

Consider combining highly correlated variables within the same cluster into composite scores
Use the first few principal components if dimension reduction is needed while maintaining maximum variance
For interpretability, consider using the sparse PCA solution which provides clearer variable groupings
Remove redundant variables identified in the redundancy analysis on an as needed basis. Justify your choices!

These techniques provide different perspectives on data reduction, and the choice depends on the specific needs of the analysis (interpretability, variance preservation, or prediction accuracy).

5.16 Our model building process

Look at the full model. Coefficients, residuals, and all.
Investigate for multicollinearity and find ways to remove variables that may be problematic.
Look at the full model of the reduced data: coefficients, residuals, and all. You may need to apply a transformation to the \(y\) variable.
If you think it’s a good fit and can justify using it, you’re done.
Otherwise you need to start eliminating variables via knowledge of the data or specific experimental questions ideally. If you are purely aiming for a predictive approach, you may try data reduction methods.
Once you have found a reduced model, examine the residuals for outliers and violation assumptions. Remove outliers only if you can justify it.
If you don’t have outliers and your assumptions look good, you’re done.
Check if your variables still remain relevant. You may have to remove or add variables that are now relevant. Use experimental questions or data reduction again…
Check residuals and outliers again. Hopefully everything looks. Good.

Every time you messed with something in the model, you need to go back through and check residuals, validity of transformations, etc.

I’ll throw you softballs in this class for the sake of your sanity.