2  Data and Models

knitr::opts_chunk$set(echo = FALSE, tidy = TRUE, 
                      cache = FALSE, 
                      message = FALSE, WARNING = FALSE)
# Very standard packages
library(graphics)
library(ggplot2)
library(tidyverse)
library(knitr)
library(readr)
library(MASS)
library(plotly)
library(flextable)
# I do not want default theme.
old.theme <- theme_get()
theme_set(theme_bw())

2.1 Data

2.1.1 Variables and Observations

Let’s talk about what we mean by data, in this course.

Data is composed of variables and observations.

Example: We have a patient. We measure their blood pressure. It is observed to be 133/86.

Variable: Blood pressure

Observation: 133/86

Or we could reformulate this

Variables: Systolic blood pressure and diastolic blood pressure.

Observation(s): 133 and 86

2.1.2 Heart data introduction

heart <- read_csv(here::here("datasets", "Heart.csv"))

as_flextable(head(heart))

age

sex

chestPain

restSysBP

cholesterol

fastBldSgr

restECG

maxHR

exAng

slope

majorVessels

disease

numeric

character

character

numeric

numeric

numeric

numeric

numeric

numeric

numeric

numeric

character

63

Male

typical

145

233

1

2

150

0

3

0

No

67

Male

asymptomatic

160

286

0

2

108

1

2

3

Yes

67

Male

asymptomatic

120

229

0

2

129

1

2

2

Yes

37

Male

nonanginal

130

250

0

0

187

0

3

0

No

41

Female

nontypical

130

204

0

2

172

0

1

0

No

56

Male

nontypical

120

236

0

0

178

0

1

0

No

n: 6

2.1.3 Heart Disease Data Dictionary

A data dictionary explains what the “names” of variables in a dataset mean. For the heart data:

  • age: The patient’s age in years

  • sex: The patient’s sex, Male or Female.

  • chestPain: The chest pain experienced

    • typical: typical angina
    • nontypical’: abnormal angina
    • nonaginal: non-anginal pain
    • asymptomatic: no pain
  • restSysBP: systolic blood pressure upon admission to hospital in mm Hg

  • cholesterol: The patient’s cholesterol measurement in mg/dl

  • fastBldSgr: indicator for whether the paitient’s fasting blood sugar was greater than 120 mg/dl: 1 if yes, 0 if no.

  • restECG: Resting electrocardiographic measurement

    • 0: normal
    • 1: having ST-T wave abnormality
    • 2 showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • maxHR: The patient’s maximum heart rate achieved during controlled exercise

  • exAng: Exercise induced angina: 1 if yes, 0 if no

  • slope: the slope of the peak exercise ST segment

    • 1 if slope is positive
    • 2 if slope is approximately 0
    • 3 if slope is negative
  • majorVessels: The number of major vessels (0-3) colored in fluoroscopy.

  • disease: Indicates whether a patient had heart disease: Yes if yes, no if no.

2.2 Mathematical Models

2.2.1 input and output

The whole

The general form of a model is deceptively simple:

\[input \to output.\]

We have some information, the input, we use some process, \(\to\), in order to get some information, the output.

Model: put a quarter in the gumball machine, turn the knob, and a gumball comes out.

Figure 2.1: We have quarter as input to the process, gumball machine, which allows to yield a

 

2.2.2 Mathematical models

We are doing math here. We will represent our input as \(x\), and our output as \(y\). We put our input \(x\) into some function \(g()\).

\[y = g(x)\]

  • \(x\) is the input
  • \(g\) is the \(\to\)
  • \(y\) is the output

\[ y = a + b \cdot x \]

2.2.3 Heart Model

The Mayo Clinic says “You can calculate your maximum heart rate by subtracting your age from 220”.

  • \(maxHR = 220 - age\)
  • \(y = 220 - x\)
    • y is maxHR
    • x is age

2.2.3.1 What about with the real data?

 

 

2.3 Statistical models and Error

\[y = g(x) + \epsilon\]

  • y = response
  • x = predictor
  • g(x) = function
  • \(\epsilon\) = error or variability in the model

2.3.1 Heart example

The Mayo Clinic also specifies “You may have a higher or lower maximum heart rate, sometimes by as much as 15 to 20 beats per minute”.

\[MaxHR = 220 - age \pm 20\]

What’s going on here?

  1. The guidelines from the Mayo Clinic apply mainly to the overall population of adults (age 16+).
  2. This data is based on a study about heart disease.
  3. Primarily, there are two groups in the data: those with heart disease, and those without.
  4. Maybe heart disease has an effect.

2.3.1.1 Grouping Means

 

2.3.2 Conditional Means vs Unconditional Means

Let’s concentrate on the formula for basic statistical models.

\[y = g(x) + \epsilon\]

2.3.2.1 Simplest Example

Unconditional Mean

\[y = \mu + \epsilon\]

2.3.2.2 Simple Model With Disease

\[y = 140 \pm 23\] Why might we use this model?

Percent 140 +/- 23
82.01439

2.3.2.3 Model for Those Without Disease

The average maxHR should be about 214 - age, the standard deviation of that model is about 16

We denote this as:

\[y = 214 - x \pm 16\]

Percent 214 - x +/- 16
0.7256098

2.4 Linear models

2.4.1 Simple linear models: one predictor variable.

The simple linear model says the \(\mu_{y|x}\) is a line that depends on \(x\) based on a \(y\)-intercept which we denote by \(\beta_0\) and a slope which we denote \(\beta_1\).

\[\mu_{y|x} = \beta_0 + \beta_1 x\]

  • Conditional mean (deterministic)
  • \(\beta_0\): y-intercept
  • \(\beta_1\): slope

\[ y = \beta_0 + \beta_1 x + \epsilon = \mu_{y|x} + \epsilon \]

  • \(\epsilon\) is the error in the model (noted as “randomness” in some lecture videos1)

2.4.2 Linear models with more than one predictor variable

A linear model in general means a model that can be written as a sum of variables and coefficients:

\[y = \beta_0 + \sum_{i=1}^k\beta_i x_i + \epsilon\]

  • \(\mu_{y | x}\) deterministic
  • \(\epsilon\) is the model error

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + ... + \beta_kx_k + \epsilon \]


  1. We should ↩︎