
1.3 — Probability, Statistics, and Data Thinking

Thinking and acting under uncertainty in a noisy, adaptive, sometimes adversarial world.

Core spine: Kolmogorov → Fisher/NP → Tukey/Box → Gelman. Themes: identifiability • tail law • decision • workflow.

0. Orientation

This module is not “how to do homework problems.” It is how to keep inference honest when models are approximations, data are biased, and decisions have asymmetric costs.

Named lineages: Laplace • Kolmogorov • Fisher • Neyman–Pearson • Tukey • Box • de Finetti • Jeffreys • Wald • Gelman • Pearl • Rubin
Starter imprint (intuition → skepticism) Books / Films

1. Architecture of Uncertainty

1.1 Kolmogorov’s base

Probability as measure theory: (Ω, 𝒜, P). No dice required—only a structure that makes P(Event) coherent and countably additive.

1.2 Random variables as measurable projections

A random variable is a measurable map X : (Ω, 𝒜) → (S, 𝔖). A distribution is the pushforward P_X. Choosing X is choosing what slice of the world you compress into something you can bet on.
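The pushforward idea is easy to make concrete. A minimal sketch in plain Python (the fair-dice sample space is chosen purely for illustration): Ω is the 36 ordered face pairs, X compresses each outcome to its sum, and P_X is computed by pushing the probabilities through X.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of fair-die faces, uniform P on 36 outcomes.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

# X : Omega -> S, here the sum of the two faces (a measurable map).
def X(w):
    return w[0] + w[1]

# Pushforward: P_X(s) = P({w : X(w) = s}).
P_X = {}
for w, p in P.items():
    P_X[X(w)] = P_X.get(X(w), Fraction(0)) + p

assert P_X[7] == Fraction(6, 36)   # six outcomes map to 7
assert sum(P_X.values()) == 1      # P_X is again a probability measure
```

Choosing a different X (say, the maximum face) pushes the same P forward to a different distribution: the slice you compress determines what you can bet on.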

1.3 Randomness vs ignorance

Same machinery, different ontology: frequencies in stable physical processes vs uncertainty in strategic, social, mis-specified environments.

Foundations (formal) Primary
Courses (machinery baseline) Video / Notes

2. Distributions & Tail Law

2.1 pmf, pdf, CDF

Keep the distinctions sharp: densities integrate to probabilities; the CDF always exists even when a density does not. Mixed and singular laws exist; don't assume a density.

2.2 Moments—and when they fail

Heavy tails break the comfort props: sometimes the mean doesn't exist; sometimes the variance doesn't. "Uncertainty = variance" fails in tail-dominant domains; quantiles and the full distribution shape matter.
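A quick simulation makes "the mean doesn't exist" tangible. This sketch (NumPy, seeded for reproducibility) contrasts Cauchy and Gaussian samples: tail events that are routine for one are essentially impossible for the other, and the Cauchy running mean never settles.

```python
import numpy as np

rng = np.random.default_rng(0)
x_cauchy = rng.standard_cauchy(100_000)
x_normal = rng.standard_normal(100_000)

# Tail mass: |x| > 10 happens ~6% of the time for the Cauchy,
# essentially never for the standard Gaussian.
assert (np.abs(x_cauchy) > 10).mean() > 0.03
assert (np.abs(x_normal) > 10).mean() < 1e-3

# Running means: the Gaussian one settles (law of large numbers);
# the Cauchy one never does, because its mean does not exist.
n = np.arange(1, 100_001)
run_normal = np.cumsum(x_normal) / n
run_cauchy = np.cumsum(x_cauchy) / n
assert np.abs(run_normal[-1]) < 0.05  # stabilized near the true mean 0
```

Plotting `run_cauchy` shows the characteristic jumps: a single draw can drag the entire running average, at any sample size.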

2.3 Joint / marginal / conditional

Almost every model is a conditional statement: Y | X ~ P(· | X; θ).

Tail intuition (risk in the world) Film / Doc
Distributions made legible YouTube

3. Estimation, Identifiability & Information

3.1 Identifiability (precondition)

Before “estimate θ,” ask whether P_θ = P_θ′ ⇒ θ = θ′. If not, you can’t resolve θ no matter how much data you collect.
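A toy non-identifiable model, as a sketch (the Bernoulli(θ₁·θ₂) parameterization is invented for illustration): only the product θ₁θ₂ touches the data, so distinct parameter pairs with the same product yield identical likelihoods on any dataset.

```python
import numpy as np

def loglik(theta1, theta2, x):
    # x_i ~ Bernoulli(theta1 * theta2): the data only see the product.
    p = theta1 * theta2
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.4, size=10_000)

# Two different parameter pairs with the same product give identical
# likelihoods, so no amount of data can distinguish them.
assert np.isclose(loglik(0.5, 0.8, x), loglik(0.8, 0.5, x))
```

This is the failure mode to rule out before estimation: here only the product is identifiable, and any attempt to estimate θ₁ and θ₂ separately is resolving noise.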

3.2 Frequentist properties

Bias/variance/MSE, consistency, asymptotic normality. MLE: under regularity + correct spec, asymptotic efficiency and Fisher information.

3.3 Fisher, likelihood & KL (in M-open reality)

In M-open worlds, MLE converges to the parameter minimizing KL divergence to truth within the model class: θ becomes “best approximation,” not “truth.”
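A numerical sketch of the KL-projection view (the Exp(1) truth and Gaussian model class are arbitrary choices for illustration): fit a Gaussian by MLE to exponential data. There is no true Gaussian to recover, but the MLE converges to the KL-closest Gaussian, which matches the mean and variance of the data-generating law (both equal to 1 for Exp(1)).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=200_000)  # truth: Exp(1), not Gaussian

# Gaussian MLE: mu_hat = sample mean, sigma2_hat = sample variance.
# These converge to the parameters of the KL-minimizing Gaussian,
# i.e. the moment-matching one: "best approximation," not "truth."
mu_hat = x.mean()
sigma2_hat = x.var()

assert abs(mu_hat - 1.0) < 0.02
assert abs(sigma2_hat - 1.0) < 0.05
```

The fitted Gaussian is symmetric while the truth is skewed: θ̂ is a coordinate of the projection, and any interpretation of it as "the truth" imports the model's false assumptions.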

3.4 Sufficiency (compression)

A sufficient statistic T is principled compression: given T, the rest of the data carries no additional information about θ.
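The Bernoulli case makes this concrete, as a minimal sketch: for i.i.d. coin flips, T = number of successes is sufficient. Two sequences with the same T have a likelihood ratio of exactly 1 at every θ, so the arrangement of the flips tells you nothing further about θ.

```python
import numpy as np

def lik(theta, x):
    # Bernoulli likelihood depends on the data only through T = x.sum().
    return theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

# Two sequences with the same sufficient statistic T = 2:
x1 = np.array([1, 1, 0, 0, 0])
x2 = np.array([0, 0, 1, 0, 1])

# Their likelihood ratio is 1 for every theta: once T is known,
# the particular arrangement carries no information about theta.
for theta in [0.1, 0.37, 0.5, 0.9]:
    assert np.isclose(lik(theta, x1) / lik(theta, x2), 1.0)
```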

Fisher / likelihood / experimental inference Canonical
Decision spine (actions under uncertainty) Wald

4. Testing, Error & Decision

4.1 Neyman–Pearson

Type I/II errors, size α, power. The likelihood ratio test is optimal for simple-vs-simple hypotheses (the Neyman–Pearson lemma).
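A worked sketch of the simple-vs-simple case (the specific hypotheses and n are chosen for illustration): testing H₀: N(0,1) against H₁: N(1,1) with n i.i.d. observations, the likelihood ratio is monotone in the sample mean, so the NP-optimal test rejects when x̄ exceeds a threshold set by the size α.

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# H0: N(0,1) vs H1: N(1,1), n i.i.d. observations, size alpha = 0.05.
n = 9
z = 1.6449                        # upper 5% point of N(0,1)
threshold = z / sqrt(n)           # reject H0 when xbar > threshold

size = 1.0 - Phi(threshold * sqrt(n))            # P(reject | H0) = alpha
power = 1.0 - Phi((threshold - 1.0) * sqrt(n))   # P(reject | H1)

assert abs(size - 0.05) < 0.001
assert abs(power - 0.912) < 0.005  # type II error ~ 8.8% at this n
```

Everything in the trade-off is visible here: shrinking α pushes the threshold up and power down; growing n buys both.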

4.2 p-values and their misuse

p-values are conditional-on-H₀ tail areas; they are not P(H₀ true), not “due to chance,” not effect size. They collapse under multiplicity and flexibility.
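The multiplicity collapse is easy to simulate, as a sketch (20 tests per "study" is an arbitrary illustrative number): under a true null, p-values are Uniform(0,1), so the chance that at least one of 20 independent tests comes out "significant" at 0.05 is 1 − 0.95²⁰ ≈ 0.64.

```python
import numpy as np

rng = np.random.default_rng(3)

# 10,000 simulated "studies", each running 20 independent tests of a
# true null hypothesis. Under H0, each p-value is Uniform(0,1).
n_studies, n_tests = 10_000, 20
p = rng.uniform(size=(n_studies, n_tests))

# Fraction of studies reporting at least one p < 0.05:
any_significant = (p.min(axis=1) < 0.05).mean()

# Per-test error rate is 5%, but the study-level rate is ~64%.
assert abs(any_significant - (1 - 0.95 ** 20)) < 0.02
```

This is the statistical core of forking paths: the 5% guarantee attaches to one pre-specified test, not to the minimum over a garden of tests.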

4.3 Wald: loss & risk

The real object is the decision rule δ, evaluated by its risk R(θ, δ) = E_θ[L(δ(X), θ)].
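Risk makes "no uniformly best rule" concrete. A small Monte Carlo sketch (the shrinkage factor 0.8 is an arbitrary illustrative choice): under squared-error loss for a Normal mean, shrinking the sample mean toward 0 beats the plain mean when θ is near 0 and loses when θ is far away, so the risk curves cross.

```python
import numpy as np

rng = np.random.default_rng(4)

def risk(delta, theta, n=10, reps=50_000):
    # Monte Carlo estimate of R(theta, delta) = E_theta[(delta(X) - theta)^2].
    X = rng.normal(theta, 1.0, size=(reps, n))
    return np.mean((delta(X) - theta) ** 2)

mean_rule = lambda X: X.mean(axis=1)          # the "obvious" estimator
shrink_rule = lambda X: 0.8 * X.mean(axis=1)  # shrink toward 0

# Neither rule dominates: the comparison depends on where theta lives,
# which is exactly why loss and prior beliefs enter the decision.
assert risk(shrink_rule, 0.0) < risk(mean_rule, 0.0)   # shrinkage wins at 0
assert risk(shrink_rule, 3.0) > risk(mean_rule, 3.0)   # and loses far away
```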

4.4 Likelihood principle tension

Frequentist procedures depend on what could have happened; likelihood/Bayes centers on what did happen via L(θ|x).

Primary collision texts (testing & its collapse modes) Papers / Essays
Audio framing (institution pressure → inference failure) Podcast

5. Bayesian Updating, Priors & Exchangeability

5.1 Prior → posterior

π(θ|x) ∝ p(x|θ) π(θ). The posterior predictive then integrates over parameter uncertainty: p(x̃|x) = ∫ p(x̃|θ) π(θ|x) dθ.
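The Beta-Bernoulli pair is the standard closed-form illustration (the prior and data below are made up): conjugacy turns the update into counting, and the posterior predictive for the next flip reduces to the posterior mean.

```python
import numpy as np

# Beta(a, b) prior on theta; Bernoulli likelihood. Conjugacy gives a
# closed-form posterior: Beta(a + successes, b + failures).
a, b = 2.0, 2.0
x = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # 6 successes, 2 failures

a_post = a + x.sum()
b_post = b + len(x) - x.sum()
assert (a_post, b_post) == (8.0, 4.0)

# Posterior predictive P(next = 1) = integral of theta over the posterior,
# which for a Beta posterior is simply its mean.
p_next = a_post / (a_post + b_post)
assert abs(p_next - 8 / 12) < 1e-12
```

Note that p_next ≈ 0.67 sits between the raw frequency 6/8 and the prior mean 0.5: the prior acts as pseudo-counts, and its pull fades as data accumulate.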

5.2 de Finetti & exchangeability

Exchangeability gives i.i.d.-conditional-on-θ representation; θ becomes a latent summary of an exchangeable process, not metaphysical “true essence.”

5.3 Priors: robustness & sensitivity

Priors can stabilize or smuggle bias. Practice requires sensitivity analysis and prior–data conflict checks.

5.4 Gelman workflow

Generative models + posterior predictive checks. If the model cannot generate data that look like reality in key ways, it is wrong in practice.
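A sketch of the check in miniature (using a plug-in predictive at the MLE as a simplified stand-in for a full posterior predictive check, and an arbitrary heavy-tailed "reality"): fit a Normal to t-distributed data, then ask whether the fitted model can regenerate the observed extremes.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.standard_t(df=2, size=500)  # heavy-tailed "reality"

# Fit a Normal by MLE (plug-in simplification of the full Bayesian check).
mu, sigma = data.mean(), data.std()

# Test statistic: the largest absolute observation.
T_obs = np.abs(data).max()

# Replicate datasets from the fitted model and recompute the statistic.
T_rep = np.array([
    np.abs(rng.normal(mu, sigma, size=len(data))).max()
    for _ in range(1_000)
])
ppc_pvalue = (T_rep >= T_obs).mean()

# The Normal model essentially never reproduces the observed extreme:
# it is wrong in practice for tail behavior, whatever its fit in the bulk.
assert ppc_pvalue < 0.05
```

The discipline is choosing test statistics that target the aspects of reality your decision depends on (here, tails), not statistics the fit is guaranteed to match.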

Bayesian canon Primary
Courses + workflow practice Modern

6. EDA as Adversarial Audit

6.1 Summaries & visual structure

EDA is pre-model inspection: shape, asymmetry, clusters, nonlinearity, heteroskedasticity, outliers.

6.2 Sampling, selection & measurement

Interrogate how data came to exist: sampling design, selection bias, censoring, measurement error, unit inconsistencies.

6.3 Time & nonstationarity

Break i.i.d. illusions: regime shifts, trends, seasonality, change points.

6.4 High-dimensional EDA

PCA and embeddings expose structure and anomalies when p is large; treat them as reconnaissance, not truth.
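A minimal reconnaissance sketch (synthetic data with planted low-rank structure, so the "right answer" is known): PCA via SVD of the centered data matrix reveals that 50 observed dimensions are driven by 2 latent factors plus noise.

```python
import numpy as np

rng = np.random.default_rng(6)

# 1000 samples in p = 50 dimensions, secretly driven by k = 2 factors.
n, p, k = 1000, 50, 2
Z = rng.normal(size=(n, k))          # hidden factors
W = rng.normal(size=(k, p)) * 3.0    # factor loadings
X = Z @ W + rng.normal(size=(n, p))  # observed data = signal + noise

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var_explained = (s ** 2) / (s ** 2).sum()

# The scree collapses after two components: low-dimensional structure exposed.
assert var_explained[:2].sum() > 0.7
```

On real data the drop is rarely this clean, and components are rotations of variance, not mechanisms: a map of where to look, not a finding.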

Tukey line Primary
Fast EDA mechanics YouTube
  • StatQuest quick clarity on plots, regression assumptions, diagnostics

7. Models in an M-open World

7.1 M-closed vs M-open vs M-complete

M-closed: the truth is in {P_θ}. M-complete: a true model exists but you will not use it. M-open: no usable true model at all. Real systems are M-open; models are tools, and θ is a coordinate in an approximation.

7.2 Predictive vs causal models

Predictive success does not imply causal correctness. “What will happen?” differs from “what if we intervene?”

7.3 Dependence: time, space, groups, networks

Ignoring dependence overstates the effective sample size and creates overconfidence.
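A sketch of how badly (AR(1) with ρ = 0.9 chosen as an illustrative strong-dependence case): each simulated series has n = 1000 points, but the variance of its mean is roughly 19 times what the i.i.d. formula promises.

```python
import numpy as np

rng = np.random.default_rng(7)

# AR(1) series with rho = 0.9 and marginal variance 1: strong positive
# dependence, but each series still "looks like" n = 1000 data points.
rho, n, reps = 0.9, 1_000, 2_000
innov_sd = np.sqrt(1 - rho ** 2)  # keeps the marginal variance at 1
x = np.empty((reps, n))
x[:, 0] = rng.normal(size=reps)
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + innov_sd * rng.normal(size=reps)

naive = 1.0 / n                  # i.i.d. formula: Var(xbar) = sigma^2 / n
actual = x.mean(axis=1).var()    # what actually happens under dependence

# True Var(xbar) is ~ (1 + rho) / (1 - rho) times the naive answer (19x here):
# the effective sample size is about n * (1 - rho) / (1 + rho) ≈ 53, not 1000.
assert actual > 10 * naive
```

Confidence intervals built on `naive` would be far too narrow: the overconfidence is structural, not a small correction.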

7.4 Evaluation, calibration, regularization

Holdouts/CV, proper scoring rules, calibration, AIC/BIC/WAIC/LOO, shrinkage priors, penalization.

Anti-idolatry anchors Box / forecasting

8. Causality, Adaptivity & Data Reuse

8.1 Causal structure & identifiability

“Control for variables” is not causality. Graph structure matters: confounders, colliders, mediators. Identifiability is a structural claim, not a computational trick.
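Collider bias is the cleanest demonstration that conditioning can create dependence. A sketch (the talent/luck/hiring story is a standard illustrative example): two independent causes feed a common effect, and conditioning on that effect manufactures a correlation between them.

```python
import numpy as np

rng = np.random.default_rng(8)

# Talent and luck are independent; "hired" is a collider they both cause.
n = 100_000
talent = rng.normal(size=n)
luck = rng.normal(size=n)
hired = (talent + luck) > 1.0

# Marginally, the causes are uncorrelated...
r_all = np.corrcoef(talent, luck)[0, 1]
assert abs(r_all) < 0.02

# ...but among hires, a strong spurious negative correlation appears:
# conditioning on a collider opens a path instead of closing one.
r_hired = np.corrcoef(talent[hired], luck[hired])[0, 1]
assert r_hired < -0.3
```

This is why "control for everything" is not a causal strategy: whether a variable should be conditioned on depends on its position in the graph, not on its availability.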

8.2 Adaptivity, bandits, adversarial environments

When sampling is adaptive or adversarial, classical p-values/intervals miscalibrate. You need methods built for feedback loops.
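Optional stopping is the simplest feedback loop, and a sketch suffices to show the miscalibration (the look schedule and n_max are arbitrary illustrative choices): run a z-test at every interim look under a true null and stop at the first |z| > 1.96. The nominal 5% test fires far more often.

```python
import numpy as np

rng = np.random.default_rng(9)

def peeking_trial(n_max=1_000, check_every=50, z_crit=1.96):
    # Collect data under a true null, testing at every interim look and
    # "stopping for significance" the first time |z| exceeds 1.96.
    x = rng.normal(size=n_max)  # H0 is true: mean 0, sd 1
    for n in range(check_every, n_max + 1, check_every):
        z = x[:n].mean() * np.sqrt(n)
        if abs(z) > z_crit:
            return True  # declared "significant"
    return False

false_positives = np.mean([peeking_trial() for _ in range(2_000)])

# Nominal size is 5%, but 20 peeks push the false-positive rate far higher:
# the p-value's guarantee assumed a fixed, pre-specified sample size.
assert false_positives > 0.10
```

Sequential designs (group-sequential boundaries, always-valid inference, bandit-specific estimators) repair this by building the feedback loop into the error accounting.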

8.3 Data reuse & forking paths

Exploration → model choice → “confirmatory” reporting without adjustment inflates false discovery. Mitigate via preregistration, sample splitting, selective inference.

Pearl line (graphs / do-calculus) Causal DAGs
Rubin line (potential outcomes / design-before-analysis) Causal inference

9. Data Thinking: Generative, Adversarial, Sovereign vs Synthetic

9.1 The generative stance (loop)

Question → mechanism → selection/measurement → model family → inference → diagnostics → evaluation → decision → feedback → revision.

9.2 Two practices with the same math

Synthetic/statist practice: defaults-as-ritual, unexamined sampling, model class treated as truth, predictive success mistaken for causal understanding, data reuse ignored.
Sovereign/critical practice: explicit priors/loss/model class, adversarial diagnostics, M-open realism, predictive vs causal separation, dependence/adaptivity accounted for, exploration separated from confirmation.
Imprint (decision + model criticism) Practice

Compressed Map (one-page mental model)

Random variables: measurable compressions X: Ω → S
Distributions: pushforward P_X, including tails + dependence
Estimation: identifiability, MSE, asymptotics, Fisher info, sufficiency
Testing: NP error control, p-values, likelihood ratios, decision theory (loss/risk)
Bayes: priors/posteriors, exchangeability, PPCs, hierarchical models
EDA: adversarial audit of shape, bias, measurement, nonstationarity
Models: M-open approximations; predictive vs causal; evaluation + calibration + regularization
Meta: causality, adaptivity, data reuse; likelihood principle tension


Resource Library (all links)

Grouped by medium. Items appear above at their “best insertion point,” and are mirrored here for retrieval.

Books (worldview, history, workflow)
Courses & Channels
Papers / Essays
Films / Documentaries
Podcasts