1.3 Probability, Statistics, and Data Thinking

0. Orientation

This module is not “how to do homework problems.” It is how to keep inference honest when models are approximations, data are biased, and decisions have asymmetric costs.

Named lineages: Laplace • Kolmogorov • Fisher • Neyman–Pearson • Tukey • Box • de Finetti • Jeffreys • Wald • Gelman • Pearl • Rubin

Starter imprint (intuition → skepticism) Books / Films

How Not to Be Wrong (Ellenberg) selection effects, regression, linear-model traps, probability as x-ray goggles
The Signal and the Noise (Silver) forecasting, calibration instincts, institutional prediction culture
The Drunkard’s Walk (Mlodinow) randomness literacy, regression to the mean, misread stochasticity
The Joy of Stats (Hans Rosling) EDA at population scale: time, multivariate structure, storytelling vs data

1. Architecture of Uncertainty

1.1 Kolmogorov’s base

Probability as measure theory: a triple (Ω, 𝒜, P) of sample space, σ-algebra of events, and probability measure with P(Ω) = 1. No dice required, only a structure that makes P(Event) coherent and countably additive.

1.2 Random variables as measurable projections

A random variable is a measurable map X : (Ω, 𝒜) → (S, 𝔖). A distribution is the pushforward P_X. Choosing X is choosing what slice of the world you compress into something you can bet on.
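
On a finite sample space the pushforward can be computed by direct enumeration. The following is a minimal sketch, assuming three fair coin flips as Ω; the names omega, pushforward, law_heads, and law_first are illustrative and not part of the module.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three fair coin flips (an illustrative choice).
omega = list(product("HT", repeat=3))
P = {w: Fraction(1, len(omega)) for w in omega}  # uniform measure on omega

def pushforward(P, X):
    """Law of X: P_X({s}) = P({w : X(w) = s})."""
    law = defaultdict(Fraction)
    for w, p in P.items():
        law[X(w)] += p
    return dict(law)

# Two random variables on the same space, i.e. two different compressions.
law_heads = pushforward(P, lambda w: w.count("H"))  # number of heads
law_first = pushforward(P, lambda w: w[0] == "H")   # is the first flip heads?

print(law_heads)
print(law_first)
```

Both laws come from the same (Ω, 𝒜, P); they differ only in what X keeps and what it throws away.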

1.3 Randomness vs ignorance

Same machinery, different ontology: frequencies in stable physical processes vs uncertainty in strategic, social, mis-specified environments.

Foundations (formal) Primary

Kolmogorov — Foundations of the Theory of Probability axioms that everything else sits on
Laplace — A Philosophical Essay on Probabilities symmetry, rational belief, early Bayesian posture

Courses (machinery baseline) Video / Notes

Harvard Stat 110 — Joe Blitzstein random variables, conditioning, LLN/CLT, clean setups
MIT OCW 18.05 — Introduction to Probability and Statistics probability + inference + explicit Bayes segment

2. Distributions & Tail Law

2.1 pmf, pdf, CDF

Keep the distinctions sharp: a pmf sums to probabilities, a pdf integrates to them, and the CDF always exists. Mixed and singular laws exist; don’t assume a density.

2.2 Moments—and when they fail

Heavy tails break the comfort props: sometimes the mean does not exist (Cauchy); sometimes the mean exists but the variance does not (Pareto with shape 1 < α ≤ 2). “Uncertainty = variance” fails in tail-dominant domains. Quantiles and full distribution shape matter.
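
One way to see the failure is to check whether averaging shrinks the spread. The sketch below, under the assumption of standard Cauchy versus standard normal draws, measures the interquartile range of the sample mean across many replications; the helper names (cauchy, iqr, spread_of_means) are hypothetical. The mean of n i.i.d. standard Cauchy draws is again standard Cauchy, so its spread should not shrink with n.

```python
import math
import random
import statistics

def cauchy(rng):
    # Inverse-CDF draw from a standard Cauchy: tan(pi * (U - 1/2)).
    return math.tan(math.pi * (rng.random() - 0.5))

def gauss(rng):
    return rng.gauss(0.0, 1.0)

def iqr(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def spread_of_means(draw, n, reps, rng):
    """IQR of the sample mean across `reps` independent samples of size n."""
    means = [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return iqr(means)

rng = random.Random(0)
cauchy_iqr_1 = spread_of_means(cauchy, 1, 2000, rng)
cauchy_iqr_100 = spread_of_means(cauchy, 100, 2000, rng)
normal_iqr_1 = spread_of_means(gauss, 1, 2000, rng)
normal_iqr_100 = spread_of_means(gauss, 100, 2000, rng)

print("Cauchy  n=1, n=100:", cauchy_iqr_1, cauchy_iqr_100)
print("Normal  n=1, n=100:", normal_iqr_1, normal_iqr_100)
```

For the normal case the spread should fall by roughly a factor of ten; for the Cauchy case it should stay near 2 at both sample sizes, which is why a quantile-based summary is used here rather than a variance.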

2.3 Joint / marginal / conditional

Almost every model is a conditional statement: Y | X ~ P(· | X; θ).

Tail intuition (risk in the world) Film / Doc

Tails You Win: The Science of Chance (Spiegelhalter) risk communication, everyday tail exposure
The Big Short mis-specification + tail risk + model worship failure modes (as parable)

Distributions made legible YouTube

StatQuest (Josh Starmer) common distributions, likelihood, regression—intuition first, symbols second
StatQuest — distributions playlists (search) fast retrieval for specific families (Normal, Poisson, Binomial, etc.)

3. Estimation, Identifiability & Information

3.1 Identifiability (precondition)

Before “estimate θ,” ask whether P_θ = P_θ′ ⇒ θ = θ′. If not, you can’t resolve θ no matter how much data you collect.
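
A minimal sketch of a non-identifiable parameterization, assuming a hypothetical model X_i ~ Normal(a + b, 1): the data only ever see the sum a + b, so distinct (a, b) pairs give the same likelihood on any dataset. The function log_lik and the data values are made up for illustration.

```python
import math

def log_lik(a, b, xs):
    """Gaussian log-likelihood for the hypothetical model X_i ~ Normal(a + b, 1)."""
    mu = a + b
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in xs)

xs = [2.9, 3.4, 2.7, 3.1, 3.3]  # illustrative data

# Three different parameter values, one and the same distribution P_theta.
ll_1 = log_lik(1.0, 2.0, xs)
ll_2 = log_lik(0.0, 3.0, xs)
ll_3 = log_lik(-5.0, 8.0, xs)

print(ll_1, ll_2, ll_3)
```

More data cannot separate these three points; only a reparameterization (estimate a + b) or outside information (a constraint or prior on a) can.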

3.2 Frequentist properties

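Bias/variance/MSE, consistency, asymptotic normality. Under regularity conditions and correct specification, the MLE is consistent, asymptotically normal, and asymptotically efficient: its asymptotic variance attains the Cramér–Rao bound, the inverse Fisher information.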

3.3 Fisher, likelihood & KL (in M-open reality)

In M-open worlds, the MLE converges (under regularity conditions) to the parameter θ* that minimizes the KL divergence KL(P_true ‖ P_θ) from the truth to the model class: θ becomes “best approximation,” not “truth.”
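
A small simulation can make the "best approximation" reading concrete. This sketch assumes, purely for illustration, that the truth is Gamma(shape 2, scale 1) and the working model is Exponential(rate λ), which does not contain it. For this pair the KL-closest rate is 1 / E[X] = 0.5, and the exponential MLE is 1 / sample mean.

```python
import math
import random

rng = random.Random(1)

# "Truth" (assumed for this sketch): Gamma(shape=2, scale=1), so E[X] = 2.
# Working model: Exponential(rate=lam), which does not contain the truth.
TRUE_MEAN = 2.0
xs = [rng.gammavariate(2.0, 1.0) for _ in range(50_000)]

# The exponential MLE has a closed form: lam_hat = 1 / sample mean.
lam_hat = len(xs) / sum(xs)

def cross_entropy(lam):
    """E_truth[-log q_lam(X)] = -log(lam) + lam * E[X].

    This equals KL(truth || q_lam) up to a constant that does not depend on lam.
    """
    return -math.log(lam) + lam * TRUE_MEAN

# KL-closest exponential, found by brute-force grid search over the rate.
grid = [i / 1000 for i in range(1, 3001)]
lam_star = min(grid, key=cross_entropy)

print("MLE:", lam_hat, " KL minimizer:", lam_star)
```

The estimate settles on λ* even though no exponential law generated the data; what it means is "closest member of the class," nothing more.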

3.4 Sufficiency (compression)

A sufficient statistic T is principled compression: the conditional distribution of the data given T does not depend on θ, so once T is known the rest of the data carries no additional information about θ.
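
For i.i.d. Bernoulli data the number of successes is sufficient. The sketch below (function names and data are illustrative) checks that two sequences with the same count give the same likelihood at every θ on a grid, and that the count alone reproduces it. Exact rational arithmetic is used so the comparison is not blurred by floating-point rounding.

```python
from fractions import Fraction

def likelihood(theta, xs):
    """Bernoulli likelihood of the full sequence xs at parameter theta."""
    out = Fraction(1)
    for x in xs:
        out *= theta if x == 1 else 1 - theta
    return out

def likelihood_from_T(theta, n, t):
    """Same likelihood computed from the sufficient statistic T = sum(xs) alone."""
    return theta**t * (1 - theta) ** (n - t)

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [0, 0, 0, 0, 1, 1, 1, 1]  # different order, same T = 4
thetas = [Fraction(k, 10) for k in range(1, 10)]

same = all(
    likelihood(t, a) == likelihood(t, b) == likelihood_from_T(t, 8, 4)
    for t in thetas
)
print(same)
```

Note that sufficiency is relative to the model: the order of the flips is irrelevant for θ under i.i.d. Bernoulli, but it is exactly what you would inspect to question the i.i.d. assumption itself.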

Fisher / likelihood / experimental inference Canonical

Fisher — Statistical Methods for Research Workers estimation + the birth of significance culture
Fisher — The Design of Experiments design logic you must understand (and critique)
The Lady Tasting Tea (Salsburg) history bridge into Fisher/NP/Wald without losing the stakes

Decision spine (actions under uncertainty) Wald

Wald — Statistical Decision Functions loss, risk, minimax, Bayes rules (unifies inference as action)
Moneyball cinematic decision theory: constraints, priors, model-based action

4. Testing, Error & Decision

4.1 Neyman–Pearson

Type I/II errors, size α, power. By the Neyman–Pearson lemma, the likelihood ratio test is the most powerful test of a given size α for simple-vs-simple hypotheses.
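
A worked instance, assuming n i.i.d. Normal(μ, 1) observations with H₀: μ = 0 against H₁: μ = μ₁ > 0. Here the likelihood ratio is increasing in the sample mean, so the likelihood ratio test reduces to "reject when the mean exceeds a cutoff." The function np_test and the chosen numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def np_test(n, mu1, alpha):
    """Size-alpha likelihood ratio test of H0: mu=0 vs H1: mu=mu1 > 0, sigma=1 known."""
    cutoff = Z.inv_cdf(1 - alpha) / sqrt(n)      # reject when mean(x) > cutoff
    size = 1 - Z.cdf(cutoff * sqrt(n))           # P(reject | H0), Type I error rate
    power = 1 - Z.cdf((cutoff - mu1) * sqrt(n))  # P(reject | H1) = 1 - Type II rate
    return cutoff, size, power

cutoff, size, power = np_test(n=9, mu1=1.0, alpha=0.05)
print(cutoff, size, power)
```

Size is fixed by design; power is what you buy with sample size and effect size. Changing n or mu1 in the call shows the trade directly.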

4.2 p-values and their misuse

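A p-value is a tail area computed under H₀: the probability, if H₀ holds, of a result at least as extreme as the one observed. It is not P(H₀ true), not the probability that the result is “due to chance,” and not an effect size. Its nominal guarantees collapse under multiplicity and analytic flexibility.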
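
The multiplicity collapse is simple arithmetic. A sketch, assuming m independent tests with every null true and continuous test statistics (so each p-value is Uniform(0, 1)); the function names are illustrative, and the Bonferroni line is one standard correction, not the only one.

```python
import random

def familywise_error(m, alpha):
    """P(at least one p < alpha) across m independent tests when every null is true."""
    return 1 - (1 - alpha) ** m

def simulate(m, alpha, reps, rng):
    # Under a true null with a continuous test statistic, the p-value is Uniform(0, 1).
    hits = sum(
        any(rng.random() < alpha for _ in range(m))
        for _ in range(reps)
    )
    return hits / reps

exact = familywise_error(20, 0.05)
approx = simulate(20, 0.05, 20_000, random.Random(0))
bonferroni = familywise_error(20, 0.05 / 20)  # per-test threshold alpha / m

print(exact, approx, bonferroni)
```

With twenty looks at pure noise, "something significant at 0.05" is the expected outcome rather than the surprising one.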

4.3 Wald: loss & risk

The real object is the decision rule δ, evaluated by risk R(θ, δ)=E_θ[L(δ(X),θ)].
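
Risk is a function of θ, so two rules usually cannot be ranked without saying where θ lives or how losses are weighed. The sketch below assumes X ~ Binomial(n, θ) with squared-error loss and compares two illustrative rules: the sample proportion and a shrunken estimate (x + 1) / (n + 2). The risk is computed exactly by summing over x.

```python
from math import comb

def risk(delta, theta, n):
    """R(theta, delta) = E_theta[(delta(X) - theta)^2] for X ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x) * (delta(x, n) - theta) ** 2
        for x in range(n + 1)
    )

def mle(x, n):
    return x / n

def shrunk(x, n):
    # Posterior mean under a uniform prior; used here only as a second decision rule.
    return (x + 1) / (n + 2)

n = 10
for theta in (0.05, 0.25, 0.5):
    print(theta, risk(mle, theta, n), risk(shrunk, theta, n))
```

Neither rule dominates: shrinkage wins near θ = 0.5 and loses near the boundary. Choosing between them is a decision about loss and plausible θ, which is the point of the risk framing.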

4.4 Likelihood principle tension

Frequentist procedures depend on what could have happened; likelihood/Bayes centers on what did happen via L(θ|x).

Primary collision texts (testing & its collapse modes) Papers / Essays

Neyman & Pearson (1933) — Most Efficient Tests canonical NP error-control framework
Box (1976) — Science and Statistics iterative modeling; anti-idolatry in original form
Gelman & Loken — The Statistical Crisis in Science data-dependent analysis; forking paths as structural mechanism
Gelman & Loken — Garden of Forking Paths (search) researcher degrees of freedom (even without intent)
McShane et al. (2019) — Abandon Statistical Significance explicit attack on threshold ritual (p < 0.05)
Amrhein, Greenland, McShane (2019) — Retire Statistical Significance mainstream admission of NHST breakdown

Audio framing (institution pressure → inference failure) Podcast

EconTalk — Andrew Gelman episode (search) p-values, replication, model mis-spec in live discourse

5. Bayesian Updating, Priors & Exchangeability

5.1 Prior → posterior

π(θ|x) ∝ p(x|θ)π(θ). The posterior predictive then integrates over parameter uncertainty: p(x̃|x) = ∫ p(x̃|θ) π(θ|x) dθ.
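
The conjugate Beta-Binomial case makes both steps explicit. A minimal sketch, assuming a Beta(1, 1) prior and 7 successes in 10 trials (numbers chosen for illustration); update and predictive_success are hypothetical helper names.

```python
from fractions import Fraction

def update(a, b, successes, failures):
    """Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def predictive_success(a, b):
    """P(next trial succeeds) = E[theta] under Beta(a, b) = a / (a + b)."""
    return Fraction(a, a + b)

a_post, b_post = update(1, 1, successes=7, failures=3)
p_next = predictive_success(a_post, b_post)  # integrates over theta
plug_in = Fraction(7, 10)                    # ignores uncertainty in theta

print(a_post, b_post, p_next, plug_in)
```

The predictive probability is pulled toward the prior mean relative to the plug-in estimate; the gap shrinks as data accumulate.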

5.2 de Finetti & exchangeability

By de Finetti’s theorem, an infinite exchangeable sequence has an i.i.d.-conditional-on-θ representation; θ becomes a latent summary of an exchangeable process, not metaphysical “true essence.”

5.3 Priors: robustness & sensitivity

Priors can stabilize or smuggle bias. Practice requires sensitivity analysis and prior–data conflict checks.

5.4 Gelman workflow

Generative models + posterior predictive checks. If the model cannot generate data that look like reality in key ways, it is wrong in practice.

Bayesian canon Primary

The Theory That Would Not Die (McGrayne) Bayes history as institutional conflict map
de Finetti — Foresight (Springer chapter page) probability as coherent betting; subjective foundations
de Finetti — Foresight (PDF) direct text for exchangeability posture
Jeffreys — Theory of Probability Jeffreys priors; objective/subjective hybrid
Bayesian Data Analysis (BDA3) modern Bayes workflow: modeling + checking + computation
Regression and Other Stories (Gelman, Hill, Vehtari) regression as generative modeling + criticism

Courses + workflow practice Modern

Aalto — Bayesian Data Analysis course (Vehtari et al.) BDA3 keyed; hierarchical models; checking; predictive evaluation
Visualization in Bayesian Workflow (2019) (search) workflow operationalized: plots drive model building and checking

6. EDA as Adversarial Audit

6.1 Summaries & visual structure

EDA is pre-model inspection: shape, asymmetry, clusters, nonlinearity, heteroskedasticity, outliers.

6.2 Sampling, selection & measurement

Interrogate how data came to exist: sampling design, selection bias, censoring, measurement error, unit inconsistencies.

6.3 Time & nonstationarity

Break i.i.d. illusions: regime shifts, trends, seasonality, change points.

6.4 High-dimensional EDA

PCA/embeddings expose structure and anomalies when the number of variables p is large; treat as reconnaissance, not truth.

Tukey line Primary

Tukey — Exploratory Data Analysis EDA philosophy + resistant summaries + transformations
Tukey (1962) — The Future of Data Analysis (JSTOR) data analysis as its own discipline (prefigures data science)
The Joy of Stats (Rosling) EDA as a visual weapon: time + multivariate structure

Fast EDA mechanics YouTube

StatQuest quick clarity on plots, regression assumptions, diagnostics

7. Models in an M-open World

7.1 M-closed vs M-open vs M-complete

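M-closed: the true data-generating process is one of the candidate models. M-complete: a true model can be conceived but is not among the candidates you can actually fit. M-open: no such true model can be written down. Real systems are M-open: truth not inside {P_θ}. Models are tools; θ is a coordinate in an approximation.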

7.2 Predictive vs causal models

Predictive success does not imply causal correctness. “What will happen?” differs from “what if we intervene?”

7.3 Dependence: time, space, groups, networks

Ignoring dependence inflates effective sample size and creates overconfidence.
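
For a concrete number, take a stationary AR(1) series with lag-one correlation ρ, where the lag-k correlation is ρ^k. The sketch below computes how much the variance of the sample mean is inflated relative to the i.i.d. formula, and the implied effective sample size. The AR(1) choice and the function names are assumptions for illustration; for large n the result approaches n(1 − ρ)/(1 + ρ).

```python
def variance_inflation(rho, n):
    """Var(mean of n stationary AR(1) draws) divided by the i.i.d. value sigma^2 / n."""
    return 1 + 2 * sum((1 - k / n) * rho**k for k in range(1, n))

def effective_sample_size(rho, n):
    """Number of independent draws that would give the same variance for the mean."""
    return n / variance_inflation(rho, n)

n = 1000
for rho in (0.0, 0.5, 0.8, 0.95):
    print(rho, round(effective_sample_size(rho, n), 1))
```

At ρ = 0.8, a thousand observations carry roughly the information of a little over a hundred independent ones; an interval computed as if n = 1000 is about three times too narrow.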

7.4 Evaluation, calibration, regularization

Holdouts/CV, proper scoring rules, calibration, AIC/BIC/WAIC/LOO, shrinkage priors, penalization.
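
What makes a scoring rule "proper" can be checked numerically: the expected score should be optimized by reporting the true probability. This sketch assumes a binary event with true probability 0.7 and searches a grid of forecasts under three losses. The Brier and log losses are proper; absolute error is included as an improper contrast. All names are illustrative.

```python
import math

P_TRUE = 0.7
GRID = [i / 100 for i in range(1, 100)]  # candidate forecasts 0.01 .. 0.99

def expected_score(loss, q, p=P_TRUE):
    """Expected loss of forecasting q when the event occurs with probability p."""
    return p * loss(q, 1) + (1 - p) * loss(q, 0)

def brier(q, y):
    return (q - y) ** 2

def log_loss(q, y):
    return -math.log(q if y == 1 else 1 - q)

def abs_loss(q, y):
    return abs(q - y)

best = {
    f.__name__: min(GRID, key=lambda q: expected_score(f, q))
    for f in (brier, log_loss, abs_loss)
}
print(best)
```

Under absolute error the best forecast is the most extreme one available, so a forecaster evaluated that way is paid to be overconfident. Proper rules remove that incentive, which is why they pair naturally with calibration checks.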

Anti-idolatry anchors Box / forecasting

Box (1976) — Science and Statistics model-building as iteration + criticism (not dogma)
The Signal and the Noise (Silver) forecasting, calibration instincts, model failure patterns
The Big Short stress-test your assumptions: Gaussian comfort is not reality

8. Causality, Adaptivity & Data Reuse

8.1 Causal structure & identifiability

“Control for variables” is not causality. Graph structure matters: confounders, colliders, mediators. Identifiability is a structural claim, not a computational trick.
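
Collider bias is easy to reproduce. In this sketch X and Y are generated independently, and C is a common effect of both (C = X + Y + noise, an assumed structure for illustration). Selecting on C, which is what "controlling for" it amounts to here, manufactures an association between X and Y that no causal path supports.

```python
import random
from math import sqrt

def corr(xs, ys):
    """Pearson correlation, written out to keep the example dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]                   # independent of x
c = [xi + yi + rng.gauss(0, 1) for xi, yi in zip(x, y)]   # collider: x -> c <- y

corr_all = corr(x, y)

# Keep only units with a high value of the collider (e.g. "admitted", "published").
selected = [(xi, yi) for xi, yi, ci in zip(x, y, c) if ci > 1.0]
corr_selected = corr([s[0] for s in selected], [s[1] for s in selected])

print(corr_all, corr_selected)
```

The full-sample correlation should sit near zero and the selected-sample correlation should be clearly negative. Adding C to a regression has the same effect as this filter, which is why the graph, not the variable list, decides what to adjust for.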

8.2 Adaptivity, bandits, adversarial environments

When sampling is adaptive or adversarial, classical p-values and intervals lose their nominal calibration, because they assume a sample size and design fixed in advance. You need methods built for feedback loops.

8.3 Data reuse & forking paths

Exploration → model choice → “confirmatory” reporting without adjustment inflates false discovery. Mitigate via preregistration, sample splitting, selective inference.
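
Sample splitting can be sketched on pure noise. Each simulated study below has an outcome and k candidate features that are all independent of it. The feature with the strongest correlation is chosen on an exploration half, then re-evaluated on a held-out half. The setup (n = 100 per half, k = 20, 100 studies) and the function names are assumptions for illustration.

```python
import random
from math import sqrt

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def one_study(rng, n=100, k=20):
    """Pure-noise study: pick the best of k features on one half, re-check on the other."""
    y_a = [rng.gauss(0, 1) for _ in range(n)]  # outcome, exploration half
    y_b = [rng.gauss(0, 1) for _ in range(n)]  # outcome, held-out half
    best_in, best_out = 0.0, 0.0
    for _ in range(k):
        f_a = [rng.gauss(0, 1) for _ in range(n)]  # feature, exploration half
        f_b = [rng.gauss(0, 1) for _ in range(n)]  # same feature, held-out half
        r_in = abs(corr(f_a, y_a))
        if r_in > best_in:
            best_in, best_out = r_in, abs(corr(f_b, y_b))
    return best_in, best_out

rng = random.Random(0)
results = [one_study(rng) for _ in range(100)]
mean_in = sum(r[0] for r in results) / len(results)
mean_out = sum(r[1] for r in results) / len(results)

print("selected |r| on exploration data:", mean_in)
print("same feature on held-out data:   ", mean_out)
```

The gap between the two numbers is the selection effect. Reporting the exploration-half correlation as if it were confirmatory is the forking-paths error in its simplest form.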

Pearl line (graphs / do-calculus) Causal DAGs

Pearl (1995) — Causal Diagrams for Empirical Research (Biometrika) foundational bridge: diagrams as causal language
Pearl — Causal Diagrams (open PDF via eScholarship) accessible full text copy
Pearl — Causality (book) formal SCM spine; do-calculus
Pearl — The Book of Why conceptual entry + limitations of associational ML

Rubin line (potential outcomes / design-before-analysis) Causal inference

Rubin (1984) — Bayesianly Justifiable & Relevant Frequency Calculations (Project Euclid) Bayes/frequency reconciliation posture
Causal Inference for Statistics, Social, and Biomedical Sciences (Imbens & Rubin) RCM definitive textbook treatment
Lex Fridman Podcast — Judea Pearl conceptual limits of association; causal claims under pressure

9. Data Thinking: Generative, Adversarial, Sovereign vs Synthetic

9.1 The generative stance (loop)

Question → mechanism → selection/measurement → model family → inference → diagnostics → evaluation → decision → feedback → revision.

9.2 Two practices with the same math

Synthetic/statist practice: defaults-as-ritual, unexamined sampling, model class treated as truth, predictive success mistaken for causal understanding, data reuse ignored.

Sovereign/critical practice: explicit priors/loss/model class, adversarial diagnostics, M-open realism, predictive vs causal separation, dependence/adaptivity accounted for, exploration separated from confirmation.

Imprint (decision + model criticism) Practice

Regression and Other Stories generative modeling + checking as standard operating procedure
Moneyball loss functions in disguise: choose actions, not “answers”

Compressed Map (one-page mental model)

Random variables: measurable compressions X: Ω → S
Distributions: pushforward P_X, including tails + dependence
Estimation: identifiability, MSE, asymptotics, Fisher info, sufficiency
Testing: NP error control, p-values, likelihood ratios, decision theory (loss/risk)
Bayes: priors/posteriors, exchangeability, PPCs, hierarchical models
EDA: adversarial audit of shape, bias, measurement, nonstationarity
Models: M-open approximations; predictive vs causal; evaluation + calibration + regularization
Meta: causality, adaptivity, data reuse; likelihood principle tension


