The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century

Motivation

  • To understand God’s thoughts, we must study statistics, for these are the measure of his purpose. —Florence Nightingale

She used statistical methods to improve healthcare during the Crimean War, demonstrating the power of data visualization.

Error function

Laplace did not need God in his formulation, but he did need something he called the “error function”.

–Predictions do not fit observations exactly (e.g., the positions of planets and comets predicted by the formulas), owing to effects such as the earth’s atmosphere and human error

–use error function to account for slight discrepancies between the observed and the predicted

–early 19th-century science was in the grip of philosophical determinism–the belief that everything that happens is determined in advance by the initial conditions of the universe and the mathematical formulas that describe its motions

–more precise measurement did not lead to less error; on the contrary, the formulas Newton and Laplace had used were proving to be only rough approximations

—new paradigm: statistical model of reality

–by the end of the 20th century almost all of science had shifted to using statistical models.

ideas and expressions have drifted into the popular vocabulary

–profound shift in philosophical view: what are these statistical models? How did they come about? What do they mean in real life?

3 mathematical ideas

randomness

  • past: unpredictability: one cannot go searching for something which is found at random

  • modern scientists: probability distribution: allows us to put constraints on this randomness and gives us a limited ability to predict future but random events; randomness has a structure that can be described mathematically.

Probability

From Aristotle: “it is the nature of probability that improbable things will happen”

19th century: probability consisted primarily of sophisticated tricks but lacked a solid theoretical foundation

Early 20th Century: Ronald A. Fisher (rejected by Pearson at Biometrika, after which he went to an agricultural experiment station) revolutionized statistics with his work on experimental design, maximum likelihood estimation (MLE), multiple comparisons, and analysis of variance (ANOVA). Jerzy Neyman and Egon Pearson developed hypothesis testing and confidence intervals.

Mid-20th Century: the development of computers in the 1950s and 1960s enabled more complex data analysis

The Bernoullis, Fermat, and De Moivre (bringing in calculus) discerned some deep fundamental theorems, such as the “laws of large numbers”.

Pascal

games of chance, counting equally probable events

statistics

population with density \(f(x;\theta)\), where \(\theta\) is unknown

old/classical: Bayesian estimation; moment estimation

Moment estimation

  • For moment estimation, we do not know the true density (the parameter \(\theta\) is unknown), so we cannot calculate the expectation directly. Instead, we use the law of large numbers (LLN) to approximate the expectation by the sample mean. We then have

\[ m_1(\theta) = E[X] \approx \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \]

Then \[ \hat{\theta} = m_1^{-1}\left(\frac{1}{n} \sum_{i=1}^n X_i\right). \]

For the multiple-parameter case, we can use multiple moments to estimate the parameters (k equations to solve for k unknown parameters). Again we do not know the precise distribution, so we cannot calculate the moments directly; instead, we use the LLN to approximate the population moments by the sample moments.

Suppose \(X_1, X_2, \ldots, X_n\) are iid samples from the distribution \(f(x;\theta_1, \theta_2, \ldots, \theta_k)\). Then, by the LLN, we have the approximate moment equations:

\[ E[X^j] = m_j(\theta_1, \theta_2, \ldots, \theta_k) \approx \frac{1}{n} \sum_{i=1}^n X_i^j, \quad j = 1, 2, \ldots, k \]

Then we can solve these equations to get the estimators.

The solutions \(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k\) are the moment estimators of the parameters.
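
As a quick illustration (my own sketch, not part of the original notes), the following Python snippet applies the first two moment equations to a normal sample: the sample moments approximate \(E[X]\) and \(E[X^2]\), and solving the two equations gives estimates of the mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # iid sample, true mu = 2, sigma = 1.5

# First two sample moments approximate E[X] and E[X^2] by the LLN.
m1 = np.mean(x)        # ≈ E[X]   = mu
m2 = np.mean(x**2)     # ≈ E[X^2] = sigma^2 + mu^2

# Solve the two moment equations for the two unknown parameters.
mu_hat = m1
sigma2_hat = m2 - m1**2

print(mu_hat, sigma2_hat)  # close to 2 and 1.5**2 = 2.25
```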

Bayesian estimation

\(f(x|\theta)\) is regarded as a conditional density, with the unknown parameter \(\theta\) as the condition. \(\theta\) is a random variable with prior distribution \(\pi(\theta)\).

\((X, \theta)\) has joint distribution \(f(x,\theta)=f(x|\theta)\pi(\theta)\)

Then we get the marginal distribution of \(X\):

\(f_X(x) = \int_{-\infty}^{+\infty} f(x|\theta)\pi(\theta)d\theta=\int_{-\infty}^{+\infty} f(x;\theta)\pi(\theta)d\theta\)

Bayes’ formula is

\[ P(C_i|A) = \frac{P(A|C_i)P(C_i)}{\sum_j P(A|C_j)P(C_j)} \]

\[ f(\theta|x)=\frac{f(x|\theta)\pi(\theta)}{f_X(x)} = \frac{f(x;\theta)\pi(\theta)}{\int_{-\infty}^{+\infty} f(x;\theta)\pi(\theta)d\theta} \]

Above we have used Bayes’ theorem to obtain the posterior distribution of \(\theta\) given \(X = x\).

This is the updating of your belief, i.e., learning.

Bayesian statistics may seem unclear and ambiguous, but it leaves us no choice but to incorporate the most fundamental knowledge we have (unclear, yet we have to use it). We then update our understanding through experimental data. If it cannot be accomplished in one attempt, we iterate multiple times; eventually, we gain a solid understanding of the parameters. The learning process is one of repeated iteration; there is no need to expect it to be completed in a single attempt.

When we know nothing about the parameter, we can use a uniform prior to represent our ignorance; that is, every value is equally likely.

From the posterior \(f(\theta|x)\) we can calculate the expectation, the mode, and credible intervals.
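
A minimal worked example (my own sketch, not from the notes), assuming a Bernoulli likelihood with a uniform Beta(1, 1) prior: the posterior is Beta(1 + k, 1 + n - k), from which the posterior mean, mode, and a credible interval can be read off. `scipy` is assumed to be available.

```python
from scipy.stats import beta

# Data: n coin flips with k heads; the likelihood in theta is Bernoulli/binomial.
n, k = 20, 14

# Uniform prior Beta(1, 1) represents ignorance: every theta in [0, 1] equally likely.
a0, b0 = 1.0, 1.0

# Conjugate update: posterior is Beta(a0 + k, b0 + n - k).
a_post, b_post = a0 + k, b0 + n - k
posterior = beta(a_post, b_post)

post_mean = posterior.mean()                      # posterior expectation
post_mode = (a_post - 1) / (a_post + b_post - 2)  # posterior mode
ci_low, ci_high = posterior.interval(0.95)        # 95% equal-tailed credible interval

print(post_mean, post_mode, (ci_low, ci_high))
```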

modern statistics

\(L(\theta) = E[R(\theta,X)]\)

\(\hat{\theta} = \arg\min_\theta L(\theta)\)

where \(R(\theta, X)\) is the loss function and \(L(\theta)\) is its expectation (the risk).
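
As a small sketch of this risk-minimization view (my own illustration, assuming a squared-error loss \(R(\theta, X) = (X - \theta)^2\)), the empirical risk can be scanned over a grid of candidate \(\theta\); its minimizer is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=1_000)

def empirical_risk(theta, x):
    # (1/n) * sum of squared-error losses R(theta, x_i) = (x_i - theta)^2
    return np.mean((x - theta) ** 2)

# Scan a grid of candidate theta values and pick the minimizer.
grid = np.linspace(0.0, 6.0, 601)
risks = [empirical_risk(t, x) for t in grid]
theta_hat = grid[int(np.argmin(risks))]

print(theta_hat, x.mean())  # both close to 3
```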

Fisher: MLE

Fisher’s ambition: he did not want to rely on the Bayesian approach; when choosing the prior, it must not contain any personal belief, so choose the objective uniform distribution.

\(\pi(\theta)\,d\theta = d\theta\), i.e., a flat (uniform) prior

\(f(\theta|x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n;\theta)\pi(\theta)}{\int f(x_1, x_2, \ldots, x_n;\theta)\pi(\theta)\,d\theta}\)

\[ = \frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{\int f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)\, d\theta} \propto L(\theta) \]

Mode:

Find the maximum of \(f(\theta|x_1, x_2, \ldots, x_n)\), i.e., of \(L(\theta)\):

\(\hat{\theta} = \arg\max_\theta L(\theta) = \arg\max_\theta \frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{f_X(x_1, x_2, \ldots, x_n)}\)

\(= \arg\max_\theta f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)\), and the product \(\prod_i f(x_i;\theta)\) is the likelihood function \(L(\theta)\).

When Fisher presented his ideas to the Royal Society, he received harsh comments in the Society’s discussion. That discussion was a good tradition, but today’s conferences do not keep it: there is only a presentation, and the comments do not appear in the published paper.

Then we can use the log-likelihood to simplify the differentiation needed to find the maximum likelihood estimator.
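
A minimal numerical sketch (my own, not from the notes), assuming an exponential model \(f(x;\theta) = \theta e^{-\theta x}\): maximizing the log-likelihood numerically recovers the closed-form MLE \(\hat{\theta} = 1/\bar{X}\). `numpy` and `scipy` are assumed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 3.0, size=5_000)  # true rate theta = 3

def neg_log_likelihood(theta, x):
    # -log L(theta) = -sum_i log f(x_i; theta), with f(x; theta) = theta * exp(-theta x)
    return -np.sum(np.log(theta) - theta * x)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0),
                      args=(x,), method="bounded")

print(res.x, 1 / x.mean())  # numerical MLE vs closed-form MLE, both near 3
```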

modern: Loss function

Modern Statistics

The loss function is a “scanning apparatus”: it takes a special value, such as 0 or a minimum, at the true parameter.

Empirical Kullback-Leibler Divergence Loss and MLE

(Empirical loss means that, since we do not know the true distribution, we use the sample to approximate the loss function.)

Consider the empirical loss function:

\[l(\theta) = \mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \log \frac{f(x;\theta_0)}{f(x;\theta)} \right]\]

\[= \int \frac{f(x;\theta_0)}{f(x;\theta)} \log \frac{f(x;\theta_0)}{f(x;\theta)} f(x;\theta) dx\]

\[= \int \left[ \log \frac{f(x;\theta_0)}{f(x;\theta)} \right] f(x;\theta_0) dx\]

\[= \mathbb{E}_{\theta_0} \left[ \log \frac{f(x;\theta_0)}{f(x;\theta)} \right]\]

\[\approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{f(x_i;\theta_0)}{f(x_i;\theta)}\]

\[= \frac{1}{n} \sum_{i=1}^{n} \log f(x_i;\theta_0) - \frac{1}{n} \sum_{i=1}^{n} \log f(x_i;\theta)\]

The first sum does not depend on \(\theta\), so minimizing the empirical loss over \(\theta\) gives

\[\hat{\theta} = \arg\min_\theta \left[ -\frac{1}{n} \sum_{i=1}^{n} \log f(x_i;\theta) \right]\]

\[= \arg\max_\theta \left[ \sum_{i=1}^{n} \log f(x_i;\theta) \right]\]

\[= \arg\max_\theta \left[ \log \prod_{i=1}^{n} f(x_i;\theta) \right]\]

KL-divergence Property

Let \(\mathbb{E}_{\theta}\) denote the expectation with respect to \(f(x;\theta)\).

The true model is \(f(x;\theta_0)\), and the proposed model is \(f(x;\theta)\).

KL (entropy) divergence:

\[\mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \log \frac{f(x;\theta_0)}{f(x;\theta)} \right]\]

Let \(\varphi(x) = x \log x\); then \(\varphi'(x) = \log x + 1\) and \(\varphi''(x) = \frac{1}{x} > 0\).

So \(\varphi(x)\) is a convex function over \((0,+\infty)\).

By Jensen’s inequality:

\[\varphi\left( \mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \right] \right) \leq \mathbb{E}_{\theta} \left[ \varphi\left( \frac{f(x;\theta_0)}{f(x;\theta)} \right) \right]\]

\[\mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \right] = \int_{-\infty}^{\infty} \frac{f(x;\theta_0)}{f(x;\theta)} f(x;\theta) dx\]

\[= \int_{-\infty}^{\infty} f(x;\theta_0) dx = 1\]

Thus

\[\varphi(1) = 1 \times \log 1 = 0 \leq \mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \log \frac{f(x;\theta_0)}{f(x;\theta)} \right]\]

Loss function

\[l(\theta) = \mathbb{E}_{\theta} \left[ \frac{f(x;\theta_0)}{f(x;\theta)} \log \frac{f(x;\theta_0)}{f(x;\theta)} \right] \geq 0\]

and \(l(\theta_0) = 0\), so \(\theta_0\) is a minimum point of \(l(\theta)\).

Its empirical estimator is:

\[l(\theta) \approx \frac{1}{n} \sum_{i=1}^n \frac{f(x_i;\theta_0)}{f(x_i;\theta)} \log \frac{f(x_i;\theta_0)}{f(x_i;\theta)}\]

Conclusion: Maximum likelihood estimation is equivalent to minimizing the KL divergence, which gives MLE an information-theoretic foundation.
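
A small numerical check of this equivalence (my own sketch, assuming a normal location family with known variance): over a grid of candidate \(\theta\), the empirical KL loss and the negative average log-likelihood differ only by a constant, so they share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 1.5                                   # true location parameter
x = rng.normal(loc=theta0, scale=1.0, size=10_000)

def log_f(x, theta):
    # log density of N(theta, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2

grid = np.linspace(0.0, 3.0, 301)
# Empirical KL loss: (1/n) sum log f(x_i; theta0) - (1/n) sum log f(x_i; theta)
kl_loss = np.array([np.mean(log_f(x, theta0)) - np.mean(log_f(x, t)) for t in grid])
# Negative average log-likelihood
neg_ll = np.array([-np.mean(log_f(x, t)) for t in grid])

print(grid[np.argmin(kl_loss)], grid[np.argmin(neg_ll)])  # same minimizer, near 1.5
```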

DeepSeek: define a loss function, feed in the data, run gradient descent, and out comes an LLM.

statistical distribution

A distribution function is used to examine the question; like Laplace’s error function but much more complicated, it uses probability theory to describe what might be expected from future data taken at random from the same population of people.

Statistical Reasoning for Everyday Life (3rd Edition), by Jeffrey O. Bennett, William L. Briggs, and Mario F. Triola (read in Chinese translation)

Reading this kind of book cultivates my ability to tell stories, which helps me understand statistics at a deeper level.

TEN GREAT IDEAS ABOUT CHANCE

Written by PERSI DIACONIS & BRIAN SKYRMS

Measurement

Nov.1st

Judgement

Statistical inference by George Casella and Roger L. Berger

(Casella and Berger 2024) like Sir Arthur Conan Doyle and Sherlock Holmes, and so do I (an old memory). Finding stories in a theory book is a good way to make the book more interesting. I also like the way they use examples to explain theorems, which makes them easier to understand; I think this is a good way to learn statistics. Sir Arthur Conan Doyle and Sherlock Holmes are indeed timeless classics. Using their stories to explain statistical theories is like adding a dollop of intrigue to your math. Imagine Holmes cracking a case with Bayesian inference or predicting the next crime using regression analysis; it is like transforming a dry textbook into a thrilling adventure! But this book is definitely not a walk in the park. It is more like a marathon through a maze of numbers and theorems. Yes, it is often really difficult for me. Just remember, even Holmes had his tough cases, but he always figured them out in the end. So keep at it, and maybe one day you will be the Sherlock of statistics, solving the most baffling statistical mysteries.

Differential Equations with Applications and Historical Notes by George F. Simmons

The preface really attracts me (the author loves the bell curve very much!)

  • how the differential equation for this curve arises from very simple considerations and can be solved to obtain the equation of the curve itself (a brief reconstruction follows)
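
A brief reconstruction of that idea (my own sketch, not taken from the preface): suppose the curve’s relative rate of decrease is proportional to \(x\); separating variables then yields the bell curve.

\[ \frac{dy}{dx} = -\frac{x}{\sigma^2}\, y \;\Longrightarrow\; \frac{dy}{y} = -\frac{x}{\sigma^2}\, dx \;\Longrightarrow\; \ln y = -\frac{x^2}{2\sigma^2} + C \;\Longrightarrow\; y = A e^{-x^2/(2\sigma^2)}, \]

and requiring \(\int_{-\infty}^{\infty} y\, dx = 1\) gives \(A = \frac{1}{\sigma\sqrt{2\pi}}\), the normal (bell) curve.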

Wisdom of the West

  • “No man’s knowledge here can go beyond his experience.”

    • Statisticians should gain experience beyond immediate theoretical boundaries and immerse themselves in the problems of biology, economics, medicine, and more. By doing so, they expand their understanding and ensure their work remains beneficial.
  • “Once a science becomes solidly grounded, it proceeds more or less independently”

    • In this sense, the philosophical meaning of biostatistics still stands well.

Martin Buber I and Thou

Good Quotes with My Feelings (relevant to my understanding of my major)

“It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.”

Einstein’s words have given me the ultimate goal in mastering a field of study (for me, statistics), constantly reminding me to keep moving forward. This is also an example of the power that books bring to me. Thanks to the book “The Model Thinker” by Scott Page for introducing me to this sentence.

  • “Youth means a temperamental predominance of courage over timidity, of the appetite for adventure over the love of ease.”

    • This quote is from the poem “Youth”, which could serve as a call to young statisticians to be bold: venture out of the comfort of theoretical work, take risks in collaborating across fields, and engage with messy, real-world data. This spirit of adventure is what will keep the field vibrant and ensure its continued relevance in the evolving landscape of science and technology.
  • “Theorems go, theorems come, only examples live forever.”

    • This highlights the enduring value of real-world impact over theoretical intricacies. It is through practical examples, solving real problems, that statistical methods prove their lasting worth.

      (Thinking more concisely and wisely by leveraging examples in our beautiful nature.)

      (Examples remain fixed points of reference, giving continuity in the landscape of mathematical thought.)

  • Feeling: During university, when laying the foundation of knowledge, it is essential to focus on theoretical fundamentals, including mathematical analysis (such as Fourier analysis, real analysis, complex analysis, and functional analysis) and a profound understanding of linear algebra. As statisticians venture out to apply their skills, they should carry these foundational principles with them, dedicating themselves more to real-world applications that benefit society. For me, this is especially true in fields like medicine, which is particularly relevant for a student of biostatistics. Balancing deep theoretical work with practical applications during the university life of a statistics major will keep the field dynamic and ensure its continued relevance in the evolving landscape of science and technology.

  • “The most incomprehensible thing about the universe is that it is comprehensible.” –Albert Einstein

    • The ability to make sense of complex data is at the heart of what makes the discipline so valuable, not only in embracing theoretical rigor, which is inherently demanding and critical, but also in seeking out opportunities to collaborate across disciplines and tackle real-world data challenges.

Dealing with living people when studying

It is a really lucky thing to learn from the writer of a book or a paper face-to-face and to discuss problems with them. Chatting with them can really eliminate confusion. Even though some good mathematicians and statisticians are also very good at writing because of their humanistic quality, it is still not the same as communicating directly. For example, learning from teachers during office hours is much more helpful than only reading their notes, especially for notes that carry deeper thinking.

reference list

Casella, George, and Roger L. Berger. 2024. Statistical Inference. CRC Press.