1.3 — Probability, Statistics, and Data Thinking
Thinking and acting under uncertainty in a noisy, adaptive, sometimes adversarial world.
0. Orientation
This module is not “how to do homework problems.” It is how to keep inference honest when models are approximations, data are biased, and decisions have asymmetric costs.
- How Not to Be Wrong (Ellenberg): selection effects, regression, linear-model traps, probability as x-ray goggles
- The Signal and the Noise (Silver): forecasting, calibration instincts, institutional prediction culture
- The Drunkard’s Walk (Mlodinow): randomness literacy, regression to the mean, misread stochasticity
- The Joy of Stats (Hans Rosling): EDA at population scale (time, multivariate structure, storytelling vs data)
1. Architecture of Uncertainty
1.1 Kolmogorov’s base
Probability as measure theory: (Ω, 𝒜, P). No dice required—only a structure that makes P(Event) coherent and countably additive.
1.2 Random variables as measurable projections
A random variable is a measurable map X : (Ω, 𝒜) → (S, 𝔖). A distribution is the pushforward P_X. Choosing X is choosing what slice of the world you compress into something you can bet on.
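A minimal sketch of the pushforward on a finite sample space. The choice of Ω (three fair coin flips) and of X (the head count) is an illustrative assumption, not something fixed by the text above.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three coin flips, each with mass 1/8 (assumed fair).
omega = list(product("HT", repeat=3))
P = {w: Fraction(1, len(omega)) for w in omega}

def X(w):
    """Random variable: number of heads in the outcome w."""
    return w.count("H")

def pushforward(P, X):
    """Distribution of X: P_X(s) = P({w : X(w) = s})."""
    dist = defaultdict(Fraction)
    for w, mass in P.items():
        dist[X(w)] += mass
    return dict(dist)

P_X = pushforward(P, X)
for s in sorted(P_X):
    print(s, P_X[s])
```

Swapping in a different X (say, "first flip is heads") changes P_X without touching (Ω, 𝒜, P): the compression is a modeling choice.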
1.3 Randomness vs ignorance
Same machinery, different ontology: frequencies in stable physical processes vs uncertainty in strategic, social, mis-specified environments.
- Kolmogorov — Foundations of the Theory of Probability: axioms that everything else sits on
- Laplace — A Philosophical Essay on Probabilities: symmetry, rational belief, early Bayesian posture
- Harvard Stat 110 — Joe Blitzstein: random variables, conditioning, LLN/CLT, clean setups
- MIT OCW 18.05 — Introduction to Probability and Statistics: probability + inference + explicit Bayes segment
2. Distributions & Tail Law
2.1 pmf, pdf, CDF
Keep the distinctions sharp: a pmf assigns probabilities to points, a pdf must be integrated to give probabilities, and the CDF always exists. Mixed and singular laws exist; don’t assume a density.
2.2 Moments—and when they fail
Heavy tails break the usual comforts: sometimes the mean doesn’t exist (Cauchy); sometimes the variance doesn’t (Pareto with tail index α ≤ 2). “Uncertainty = variance” fails in tail-dominant domains. Quantiles and full distribution shape matter.
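A small simulation sketch of a law with no mean, assuming a standard Cauchy as the example. The seed and sample sizes are arbitrary choices. The sample mean has nothing to converge to, while the sample median still estimates the center.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

def cauchy_draw():
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))

draws = [cauchy_draw() for _ in range(20001)]

# Compare the mean and the median on growing prefixes of the same sample.
for n in (100, 1000, 10000, 20001):
    head = draws[:n]
    print(n, round(statistics.fmean(head), 3), round(statistics.median(head), 3))

sample_median = statistics.median(draws)
```

In runs like this the median column tends to settle near 0 while the mean column can lurch whenever a single extreme draw arrives. That is the practical content of "quantiles matter."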
2.3 Joint / marginal / conditional
Almost every model is a conditional statement: Y | X ~ P(· | X; θ).
- Tails You Win: The Science of Chance (Spiegelhalter): risk communication, everyday tail exposure
- The Big Short: mis-specification + tail risk + model worship failure modes (as parable)
- StatQuest (Josh Starmer): common distributions, likelihood, regression; intuition first, symbols second
- StatQuest — distributions playlists (search): fast retrieval for specific families (Normal, Poisson, Binomial, etc.)
3. Estimation, Identifiability & Information
3.1 Identifiability (precondition)
Before “estimate θ,” ask whether P_θ = P_θ′ ⇒ θ = θ′. If not, you can’t resolve θ no matter how much data you collect.
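A minimal sketch of a non-identifiable model, assuming the toy family Y ~ N(a + b, 1). The family and the observations are made up for illustration. Only the sum a + b enters the likelihood, so distinct (a, b) pairs are observationally equivalent.

```python
import math

def log_likelihood(a, b, data):
    """Gaussian log-likelihood for the model Y ~ N(a + b, 1)."""
    mu = a + b
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)

data = [2.1, 3.4, 2.9, 3.8, 2.6]  # made-up observations

# Three different parameter pairs with the same sum a + b = 3.
candidates = [(1.0, 2.0), (0.0, 3.0), (-5.0, 8.0)]
values = [log_likelihood(a, b, data) for a, b in candidates]
print(values)
```

More data would sharpen the estimate of a + b but would never separate a from b; that takes a constraint or a different design.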
3.2 Frequentist properties
Bias/variance/MSE, consistency, asymptotic normality. Under regularity conditions and correct specification, the MLE is asymptotically efficient: its asymptotic variance attains the inverse Fisher information (the Cramér–Rao bound).
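A simulation sketch of the bias/variance/MSE trade-off, assuming N(0, 1) data with n = 5 and two variance estimators (divide by n, divide by n − 1). The sample size, repetition count, and seed are arbitrary choices.

```python
import random

random.seed(1)
n, reps, true_var = 5, 20000, 1.0

def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

biased, unbiased = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ss = sum_sq_dev(xs)
    biased.append(ss / n)          # divides by n (the Gaussian MLE)
    unbiased.append(ss / (n - 1))  # divides by n - 1

def mean(v):
    return sum(v) / len(v)

def mse(estimates):
    return mean([(e - true_var) ** 2 for e in estimates])

print("mean:", round(mean(biased), 3), round(mean(unbiased), 3))
print("MSE: ", round(mse(biased), 3), round(mse(unbiased), 3))
```

For Gaussian data the divide-by-n estimator is biased low (expectation (n − 1)/n) yet has the smaller MSE, a reminder that unbiasedness alone is not the goal.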
3.3 Fisher, likelihood & KL (in M-open reality)
In M-open worlds, the MLE converges to the parameter whose model distribution is closest in KL divergence to the true data-generating distribution: θ becomes “best approximation,” not “truth.”
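In symbols, writing p_0 for the true density and p_θ for the model density (notation assumed here for illustration), the limit point under the usual regularity conditions is:

```latex
\theta^{*}
  = \arg\min_{\theta} \mathrm{KL}(p_0 \,\|\, p_\theta)
  = \arg\min_{\theta} \int p_0(x) \log \frac{p_0(x)}{p_\theta(x)} \, dx
  = \arg\max_{\theta} \mathbb{E}_{p_0}\!\left[ \log p_\theta(X) \right].
```

The last equality holds because the term ∫ p_0 log p_0 does not depend on θ, which is why maximizing the average log-likelihood targets θ* even when no θ is "true."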
3.4 Sufficiency (compression)
A sufficient statistic T is principled compression: given T, the rest of the data carries no additional information about θ.
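A minimal sketch for Bernoulli data, where T = number of successes is sufficient. The two sequences below are made up: they differ in order but share T = 3, so their likelihood functions coincide at every θ.

```python
import math

def likelihood(theta, data):
    """Bernoulli likelihood of a 0/1 sequence."""
    return math.prod(theta if x == 1 else 1 - theta for x in data)

# Two made-up sequences with different orderings but the same T = sum = 3.
data_a = [1, 1, 0, 0, 1]
data_b = [0, 1, 1, 1, 0]

grid = [0.1, 0.3, 0.5, 0.7, 0.9]
lik_a = [likelihood(t, data_a) for t in grid]
lik_b = [likelihood(t, data_b) for t in grid]
print(lik_a)
print(lik_b)
```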
- Fisher — Statistical Methods for Research Workers: estimation + the birth of significance culture
- Fisher — The Design of Experiments: design logic you must understand (and critique)
- The Lady Tasting Tea (Salsburg): history bridge into Fisher/NP/Wald without losing the stakes
- Wald — Statistical Decision Functions: loss, risk, minimax, Bayes rules (unifies inference as action)
- Moneyball: cinematic decision theory (constraints, priors, model-based action)
4. Testing, Error & Decision
4.1 Neyman–Pearson
Type I/II errors, size α, power. For simple-vs-simple hypotheses, the likelihood ratio test is most powerful among all tests of size at most α (the Neyman–Pearson lemma).
4.2 p-values and their misuse
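A p-value is a tail area computed assuming H₀: the probability, under H₀, of data at least as extreme as those observed. It is not P(H₀ true), not the probability that the result is “due to chance,” and not an effect size. Its guarantees collapse under multiplicity and analytic flexibility.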
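The multiplicity problem as arithmetic, assuming m independent tests with every null true. The function names are illustrative, and Bonferroni is shown as one simple correction, not the only one.

```python
def familywise_error(m, alpha=0.05):
    """P(at least one p < alpha) across m independent tests when every null is true."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(m, alpha=0.05):
    """Per-test threshold that keeps the family-wise error at or below alpha."""
    return alpha / m

for m in (1, 5, 20, 100):
    print(m, round(familywise_error(m), 3), bonferroni_threshold(m))
```

With 20 looks at pure noise, the chance of at least one "significant" result is already about 64%.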
4.3 Wald: loss & risk
The real object is the decision rule δ, evaluated by risk R(θ, δ)=E_θ[L(δ(X),θ)].
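A sketch of comparing decision rules by risk, assuming X ~ Binomial(n, p), squared-error loss, and two estimators of p: the MLE X/n and a rule shrunk toward 1/2. The setup and the names `mle` and `shrunk` are illustrative assumptions.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def risk(delta, p, n):
    """R(p, delta) = E_p[(delta(X) - p)^2] for X ~ Binomial(n, p), squared-error loss."""
    return sum((delta(k, n) - p) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

def mle(k, n):
    return k / n

def shrunk(k, n):
    """Pulls the estimate toward 1/2; its risk is constant in p under this loss."""
    root = math.sqrt(n)
    return (k + root / 2) / (n + root)

n = 16
for p in (0.1, 0.3, 0.5):
    print(p, round(risk(mle, p, n), 5), round(risk(shrunk, p, n), 5))
```

Neither rule dominates: the MLE wins near p = 0.1, the shrunk rule wins near p = 0.5. Choosing between them requires a criterion over θ (minimax, Bayes risk), which is the point of treating inference as decision.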
4.4 Likelihood principle tension
Frequentist procedures depend on what could have happened; likelihood/Bayes centers on what did happen via L(θ|x).
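The standard textbook illustration, sketched under assumed data: 9 heads and 3 tails, testing θ = 1/2 against θ > 1/2. Whether the experimenter planned 12 flips or planned to stop at the third tail changes the p-value, but not the likelihood function up to a constant.

```python
from fractions import Fraction
from math import comb

heads, tails = 9, 3
n = heads + tails
half = Fraction(1, 2)

# Design A: flip exactly n = 12 times, count heads (binomial).
p_binomial = sum(comb(n, k) for k in range(heads, n + 1)) * half**n

# Design B: flip until the 3rd tail, count heads (negative binomial).
# "At least 9 heads before the 3rd tail" means at most 2 tails in the first 11 flips.
m = heads + tails - 1
p_negbinomial = sum(comb(m, j) for j in range(tails)) * half**m

def lik_binomial(theta):
    return comb(n, heads) * theta**heads * (1 - theta) ** tails

def lik_negbinomial(theta):
    return comb(m, heads) * theta**heads * (1 - theta) ** tails

print(float(p_binomial), float(p_negbinomial))  # about 0.073 vs about 0.033
print([lik_binomial(t) / lik_negbinomial(t) for t in (0.2, 0.5, 0.8)])
```

Same data, same likelihood shape, different verdicts at the 0.05 threshold: the disagreement comes entirely from outcomes that did not occur.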
- Neyman & Pearson (1933) — Most Efficient Tests: canonical NP error-control framework
- Box (1976) — Science and Statistics: iterative modeling; anti-idolatry in original form
- Gelman & Loken — The Statistical Crisis in Science: data-dependent analysis; forking paths as structural mechanism
- Gelman & Loken — Garden of Forking Paths (search): researcher degrees of freedom (even without intent)
- McShane et al. (2019) — Abandon Statistical Significance: explicit attack on threshold ritual (p < 0.05)
- Amrhein, Greenland, McShane (2019) — Retire Statistical Significance: mainstream admission of NHST breakdown
- EconTalk — Andrew Gelman episode (search): p-values, replication, model mis-spec in live discourse
5. Bayesian Updating, Priors & Exchangeability
5.1 Prior → posterior
π(θ|x) ∝ p(x|θ)π(θ). The posterior predictive, p(x̃|x) = ∫ p(x̃|θ) π(θ|x) dθ, then integrates over parameter uncertainty.
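A minimal conjugate sketch, assuming a Beta prior on a Bernoulli success probability. The prior, the data, and the function names are illustrative.

```python
from fractions import Fraction

def update(a, b, successes, failures):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def predictive_success(a, b):
    """P(next trial succeeds) = E[theta] under Beta(a, b) = a / (a + b)."""
    return Fraction(a, a + b)

prior = (1, 1)  # uniform prior, an illustrative choice
posterior = update(*prior, successes=7, failures=3)  # made-up data
print(posterior, predictive_success(*posterior))
```

The predictive probability here is 2/3, not the plug-in 7/10: averaging over the posterior for θ is what separates the posterior predictive from substituting a point estimate.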
5.2 de Finetti & exchangeability
By de Finetti’s theorem, an infinite exchangeable sequence can be represented as i.i.d. conditional on θ; θ becomes a latent summary of an exchangeable process, not a metaphysical “true essence.”
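For an infinite exchangeable sequence of binary variables, the representation can be written as follows, where μ is a probability measure on [0, 1] (the symbol μ is notation assumed here):

```latex
P(X_1 = x_1, \dots, X_n = x_n)
  = \int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, \mu(d\theta),
  \qquad x_i \in \{0, 1\}, \quad n = 1, 2, \dots
```

Read right to left, the "prior" μ and the "parameter" θ are consequences of a symmetry judgment about observables, not extra assumptions layered on top.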
5.3 Priors: robustness & sensitivity
Priors can stabilize or smuggle bias. Practice requires sensitivity analysis and prior–data conflict checks.
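A sketch of a sensitivity pass plus a crude prior–data conflict check, assuming binomial data and a handful of Beta priors. The data, the prior list, and the use of a prior predictive tail probability as the conflict signal are all illustrative choices.

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prior_predictive_pmf(k, n, a, b):
    """Beta-binomial pmf: P(K = k) when theta ~ Beta(a, b) and K | theta ~ Binomial(n, theta)."""
    return math.comb(n, k) * math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def posterior_mean(k, n, a, b):
    return (a + k) / (a + b + n)

k, n = 7, 10  # made-up data
priors = {
    "uniform": (1, 1),
    "Jeffreys": (0.5, 0.5),
    "tight at 0.5": (20, 20),
    "tight at 0.1": (2, 18),
}

summary = {}
for name, (a, b) in priors.items():
    # How surprising is "k or more successes" before seeing any data?
    tail = sum(prior_predictive_pmf(j, n, a, b) for j in range(k, n + 1))
    summary[name] = (posterior_mean(k, n, a, b), tail)
    print(f"{name:>13}: posterior mean {summary[name][0]:.3f}, prior P(K >= {k}) {tail:.4f}")
```

If the posterior mean swings widely across defensible priors, the data are not doing the work. If the prior predictive tail is tiny, prior and data disagree and one of them needs scrutiny.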
5.4 Gelman workflow
Generative models + posterior predictive checks. If the model cannot generate data that look like reality in key ways, it is wrong in practice.
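A small posterior predictive check, sketched under assumptions: made-up count data, a Poisson model with a conjugate Gamma prior, and variance/mean as the test quantity. None of these choices come from the text above; they only illustrate the loop of fit, simulate, compare.

```python
import math
import random

random.seed(2)

def poisson_draw(lam):
    """Knuth's multiplication method; adequate for small lam."""
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

def dispersion(xs):
    """Sample variance divided by sample mean; near 1 for Poisson data."""
    m = sum(xs) / len(xs)
    if m == 0:
        return 0.0
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1) / m

# Made-up counts with many zeros and a few large values.
y = [0, 0, 0, 1, 0, 12, 0, 2, 0, 9, 0, 0, 15, 1, 0, 0, 7, 0, 0, 3]
a0, b0 = 1.0, 1.0  # Gamma(shape, rate) prior on the Poisson rate, assumed for illustration

observed = dispersion(y)
exceed, draws = 0, 2000
for _ in range(draws):
    # Conjugate posterior: Gamma(a0 + sum(y), rate b0 + n); gammavariate takes a scale.
    lam = random.gammavariate(a0 + sum(y), 1.0 / (b0 + len(y)))
    y_rep = [poisson_draw(lam) for _ in y]
    exceed += dispersion(y_rep) >= observed
ppc_p = exceed / draws
print(round(observed, 2), ppc_p)
```

The fitted Poisson almost never reproduces the observed overdispersion, so it fails in a way that matters for any tail-sensitive use, however well it matches the mean.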
- The Theory That Would Not Die (McGrayne): Bayes history as institutional conflict map
- de Finetti — Foresight (Springer chapter page): probability as coherent betting; subjective foundations
- de Finetti — Foresight (PDF): direct text for exchangeability posture
- Jeffreys — Theory of Probability: Jeffreys priors; objective/subjective hybrid
- Bayesian Data Analysis (BDA3): modern Bayes workflow (modeling + checking + computation)
- Regression and Other Stories (Gelman, Hill, Vehtari): regression as generative modeling + criticism
- Aalto — Bayesian Data Analysis course (Vehtari et al.): keyed to BDA3; hierarchical models; checking; predictive evaluation
- Visualization in Bayesian Workflow (2019) (search): workflow operationalized; plots drive model building and checking
6. EDA as Adversarial Audit
6.1 Summaries & visual structure
EDA is pre-model inspection: shape, asymmetry, clusters, nonlinearity, heteroskedasticity, outliers.
6.2 Sampling, selection & measurement
Interrogate how data came to exist: sampling design, selection bias, censoring, measurement error, unit inconsistencies.
6.3 Time & nonstationarity
Break i.i.d. illusions: regime shifts, trends, seasonality, change points.
6.4 High-dimensional EDA
PCA/embeddings expose structure and anomalies when the number of variables p is large; treat them as reconnaissance, not truth.
- Tukey — Exploratory Data Analysis: EDA philosophy + resistant summaries + transformations
- Tukey (1962) — The Future of Data Analysis (JSTOR): data analysis as its own discipline (prefigures data science)
- The Joy of Stats (Rosling): EDA as a visual weapon (time + multivariate structure)
- StatQuest: quick clarity on plots, regression assumptions, diagnostics
7. Models in an M-open World
7.1 M-closed vs M-open vs M-complete
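Real systems are M-open: truth is not inside {P_θ}. (M-closed: the true model is among the candidates. M-complete: a true model can be conceived but is too unwieldy to include among them.) Models are tools; θ is a coordinate in an approximation.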
7.2 Predictive vs causal models
Predictive success does not imply causal correctness. “What will happen?” differs from “what if we intervene?”
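A simulation sketch of the gap, assuming a toy structure in which a hidden Z drives both X and Y and X has no effect on Y at all. The coefficients, sample size, and seed are arbitrary.

```python
import random

random.seed(3)
n = 20000

def slope(xs, ys):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Observational world: Z drives both X and Y; X has no effect on Y.
z = [random.gauss(0, 1) for _ in range(n)]
x_obs = [zi + random.gauss(0, 1) for zi in z]
y_obs = [2 * zi + random.gauss(0, 1) for zi in z]

# Interventional world: X is assigned at random, independent of Z.
z2 = [random.gauss(0, 1) for _ in range(n)]
x_do = [random.gauss(0, 1) for _ in range(n)]
y_do = [2 * zi + random.gauss(0, 1) for zi in z2]

slope_obs, slope_do = slope(x_obs, y_obs), slope(x_do, y_do)
print(round(slope_obs, 2), round(slope_do, 2))
```

The observational slope (about 1 in this setup) is a perfectly good predictor of Y from X, and a perfectly wrong answer to "what happens to Y if we change X?", where the slope is about 0.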
7.3 Dependence: time, space, groups, networks
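Positively correlated observations carry less information than the same number of independent ones. Ignoring dependence overstates the effective sample size and creates overconfidence.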
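A back-of-envelope sketch for one common case, assuming an AR(1) series with lag-1 correlation ρ and using the large-n approximation n_eff ≈ n(1 − ρ)/(1 + ρ) for the sample mean. Other dependence structures need other corrections.

```python
import math

def effective_sample_size(n, rho):
    """Large-n approximation for the mean of an AR(1) series with lag-1 correlation rho."""
    return n * (1 - rho) / (1 + rho)

def standard_errors(n, sigma, rho):
    """Standard error of the mean: naive i.i.d. formula vs the AR(1)-adjusted one."""
    naive = sigma / math.sqrt(n)
    adjusted = sigma / math.sqrt(effective_sample_size(n, rho))
    return naive, adjusted

for rho in (0.0, 0.5, 0.9):
    n_eff = effective_sample_size(1000, rho)
    naive, adjusted = standard_errors(1000, 1.0, rho)
    print(rho, round(n_eff, 1), round(naive, 4), round(adjusted, 4))
```

At ρ = 0.9, a thousand observations are worth roughly fifty independent ones, so naive intervals are more than four times too narrow.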
7.4 Evaluation, calibration, regularization
Holdouts/CV, proper scoring rules, calibration, AIC/BIC/WAIC/LOO, shrinkage priors, penalization.
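A minimal sketch of two proper scoring rules, the Brier score and log loss, applied to made-up outcomes and two hypothetical forecasters. Lower is better for both.

```python
import math

def brier(forecasts, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)

def log_loss(forecasts, outcomes):
    """Mean negative log probability assigned to what actually happened."""
    return -sum(math.log(p if y == 1 else 1 - p) for p, y in zip(forecasts, outcomes)) / len(outcomes)

outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # made-up events, 7 of 10 occur
calibrated = [0.7] * 10                    # matches the observed frequency
overconfident = [0.99] * 10                # same direction, far too sure

for name, f in (("calibrated", calibrated), ("overconfident", overconfident)):
    print(name, round(brier(f, outcomes), 4), round(log_loss(f, outcomes), 4))
```

Both forecasters "call" the majority outcome, so accuracy cannot separate them; the scoring rules can, and log loss punishes the confident misses hardest.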
- Box (1976) — Science and Statistics: model-building as iteration + criticism (not dogma)
- The Signal and the Noise (Silver): forecasting, calibration instincts, model failure patterns
- The Big Short: stress-test your assumptions; Gaussian comfort is not reality
8. Causality, Adaptivity & Data Reuse
8.1 Causal structure & identifiability
“Control for variables” is not causality. Graph structure matters: confounders, colliders, mediators. Identifiability is a structural claim, not a computational trick.
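A simulation sketch of collider bias, assuming two independent causes X and Y of a common effect C = X + Y. The threshold, sample size, and seed are arbitrary. Conditioning on the collider manufactures an association that does not exist in the population.

```python
import random

random.seed(4)
n = 20000

def corr(xs, ys):
    """Pearson correlation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# X and Y are independent causes of the collider C = X + Y.
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

# "Controlling for" C by keeping only units with large C.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi + yi > 1]
corr_all = corr(x, y)
corr_kept = corr([p[0] for p in kept], [p[1] for p in kept])
print(round(corr_all, 3), round(corr_kept, 3))
```

Adjusting for a confounder removes bias; adjusting for a collider creates it. The data alone do not say which kind of variable C is; the graph does.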
8.2 Adaptivity, bandits, adversarial environments
When sampling is adaptive or adversarial, classical p-values/intervals miscalibrate. You need methods built for feedback loops.
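A simulation sketch of the simplest adaptive design, optional stopping: test after every batch and stop at the first nominally significant result. The batch size, number of looks, and seed are arbitrary, and the data are generated under H₀ throughout.

```python
import math
import random

random.seed(5)

def rejects_with_peeking(looks=20, batch=10, z_crit=1.96):
    """Test H0: mean = 0 (known sd 1) after every batch; stop at the first |z| > z_crit."""
    total, count = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += random.gauss(0, 1)  # data are generated under H0
            count += 1
        if abs(total / math.sqrt(count)) > z_crit:
            return True
    return False

reps = 2000
peeking_rate = sum(rejects_with_peeking() for _ in range(reps)) / reps
single_look_rate = sum(rejects_with_peeking(looks=1, batch=200) for _ in range(reps)) / reps
print(peeking_rate, single_look_rate)
```

A single planned look holds the false-positive rate near the nominal 5%; twenty peeks at the same threshold push it several times higher. Sequential designs and always-valid methods exist because the fixed-n calculation no longer describes what was done.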
8.3 Data reuse & forking paths
Exploration → model choice → “confirmatory” reporting without adjustment inflates false discovery. Mitigate via preregistration, sample splitting, selective inference.
- Pearl (1995) — Causal Diagrams for Empirical Research (Biometrika): foundational bridge; diagrams as causal language
- Pearl — Causal Diagrams (open PDF via eScholarship): accessible full text copy
- Pearl — Causality (book): formal SCM spine; do-calculus
- Pearl — The Book of Why: conceptual entry + limitations of associational ML
- Rubin (1984) — Bayesianly Justifiable & Relevant Frequency Calculations (Project Euclid): Bayes/frequency reconciliation posture
- Causal Inference for Statistics, Social, and Biomedical Sciences (Imbens & Rubin): definitive textbook treatment of the Rubin causal model (RCM)
- Lex Fridman Podcast — Judea Pearl: conceptual limits of association; causal claims under pressure
9. Data Thinking: Generative, Adversarial, Sovereign vs Synthetic
9.1 The generative stance (loop)
Question → mechanism → selection/measurement → model family → inference → diagnostics → evaluation → decision → feedback → revision.
9.2 Two practices with the same math
- Regression and Other Stories: generative modeling + checking as standard operating procedure
- Moneyball: loss functions in disguise; choose actions, not “answers”
Compressed Map (one-page mental model)
Distributions: pushforward P_X, including tails + dependence
Estimation: identifiability, MSE, asymptotics, Fisher info, sufficiency
Testing: NP error control, p-values, likelihood ratios, decision theory (loss/risk)
Bayes: priors/posteriors, exchangeability, PPCs, hierarchical models
EDA: adversarial audit of shape, bias, measurement, nonstationarity
Models: M-open approximations; predictive vs causal; evaluation + calibration + regularization
Meta: causality, adaptivity, data reuse; likelihood principle tension
If you want this as a printable one-pager, keep only this section and hide the rest with CSS (e.g., a print stylesheet).
Resource Library (all links)
Grouped by medium. Items appear above at their “best insertion point,” and are mirrored here for retrieval.
Books (worldview, history, workflow)
- How Not to Be Wrong — Jordan Ellenberg
- The Signal and the Noise — Nate Silver
- The Drunkard’s Walk — Leonard Mlodinow
- The Lady Tasting Tea — David Salsburg
- The Theory That Would Not Die — Sharon Bertsch McGrayne
- Bayesian Data Analysis (BDA3) — Gelman et al.
- Regression and Other Stories — Gelman, Hill, Vehtari
- Foundations of the Theory of Probability — Kolmogorov
- A Philosophical Essay on Probabilities — Laplace
- Statistical Decision Functions — Abraham Wald
- Causality — Judea Pearl
- The Book of Why — Judea Pearl
- Causal Inference for Statistics, Social, and Biomedical Sciences — Imbens & Rubin
Courses & Channels
Papers / Essays
- Neyman & Pearson (1933) — Most Efficient Tests
- Box (1976) — Science and Statistics
- Gelman & Loken — The Statistical Crisis in Science
- McShane et al. (2019) — Abandon Statistical Significance
- Amrhein, Greenland, McShane (2019) — Retire Statistical Significance
- Rubin (1984) — Bayesianly Justifiable & Relevant Frequency Calculations
- Pearl (1995) — Causal Diagrams for Empirical Research (Biometrika)
- Pearl — Causal Diagrams (open PDF)
- Tukey (1962) — The Future of Data Analysis (JSTOR)
- de Finetti — Foresight (Springer)
- de Finetti — Foresight (PDF)