APH101 Biostatistics and R + MTH113 Intro to Probability and Statistics + APH003 Exploring the World Through Data

I am deeply grateful to Dr. Daiyun Huang for his instruction in the “Biostatistics and R” course. This course has marked the beginning of my formal journey into the world of statistics. Dr. Huang’s comprehensive lectures, accompanied by detailed PDFs, and his well-structured labs, complete with code and solutions, have been incredibly valuable. They have not only enhanced my intuitive grasp of key statistical concepts, such as hypothesis testing, but also significantly improved my proficiency in R programming.

Also, thanks to the course “An Introduction to Statistical Learning” by Trevor Hastie and Robert Tibshirani. The examples in their online video lectures are abundant and easy for beginners to understand, and their book is also very helpful.

Review and Introduction to Probability

Simple Random Sampling

We can get information about the population by taking a sample from it. The sample should be representative of the population.

A simple random sample of size n is a sample selected from a population without replacement, in such a way that every possible sample of size n has the same chance of being selected.

Unbiased estimator

An estimator is a statistic used to estimate a population parameter. An estimator is unbiased if the expected value of the estimator equals the population parameter.

\(E(\hat \theta) = \theta\)

where \(\hat \theta\) is the estimator and \(\theta\) is the population parameter.

e.g.

\[ \mathbb{E}[\hat{p}]=\mathbb{E}\left[\frac{X_1+X_2+\cdots+X_n}{n}\right]=\frac{1}{n}\left(\mathbb{E}\left[X_1\right]+\cdots+\mathbb{E}\left[X_n\right]\right)=p \]

Variance

\[ \operatorname{Var}[X]=\mathbb{E}\left[X^2\right]-(\mathbb{E}[X])^2 \]

\[ \begin{aligned} \mathbb{E}\left[\hat{p}^2\right] & =\mathbb{E}\left[\left(\frac{X_1+X_2+\cdots+X_n}{n}\right)^2\right] \\ & =\frac{1}{n^2} \mathbb{E}\left[X_1^2+\cdots+X_n^2+2\left(X_1 X_2+X_1 X_3+\cdots+X_{n-1} X_n\right)\right] \\ & =\frac{1}{n^2}\left(n \mathbb{E}\left[X_1^2\right]+2\binom{n}{2} \mathbb{E}\left[X_1 X_2\right]\right) \\ & =\frac{1}{n} \mathbb{E}\left[X_1^2\right]+\frac{n-1}{n} \mathbb{E}\left[X_1 X_2\right] \end{aligned} \]

\[ \mathbb{E}\left[\hat{p}^2\right]=\frac{1}{n} \mathbb{E}\left[X_1^2\right]+\frac{n-1}{n} \mathbb{E}\left[X_1 X_2\right] \]

Since \(X_1\) is 0 or 1, \(X_1=X_1^2\). Then \(\mathbb{E}\left[X_1^2\right]=\mathbb{E}\left[X_1\right]=p\).

Notice: \(X_1\) and \(X_2\) are not independent.

\[ \mathbb{E}\left[X_1 X_2\right]=\mathbb{P}\left[X_1=1, X_2=1\right]=\mathbb{P}\left[X_1=1\right] \mathbb{P}\left[X_2=1 \mid X_1=1\right] \]

\[ \mathbb{P}\left[X_1=1\right]=p, \quad \mathbb{P}\left[X_2=1 \mid X_1=1\right]=\frac{N p-1}{N-1} \]

\[ \begin{aligned} \operatorname{Var}[\hat{p}] & =\mathbb{E}\left[\hat{p}^2\right]-(\mathbb{E}[\hat{p}])^2 \\ & =\frac{1}{n} p+\frac{n-1}{n} p\left(\frac{N p-1}{N-1}\right)-p^2 \\ & =\left(\frac{1}{n}-\frac{n-1}{n} \frac{1}{N-1}\right) p+\left(\frac{n-1}{n} \frac{N}{N-1}-1\right) p^2 \\ & =\frac{N-n}{n(N-1)} p+\frac{n-N}{n(N-1)} p^2 \\ & =\frac{p(1-p)}{n} \frac{N-n}{N-1}=\frac{p(1-p)}{n}\left(1-\frac{n-1}{N-1}\right) \end{aligned} \] When N is much bigger than n, it is \(\frac{p(1-p)}{n}\), which is like we sample n things with replacement (independently).

Sampling distribution

The sampling distribution of a statistic is the probability distribution of that statistic based on a random sample.

Sample mean of i.i.d. normals

The sample mean of n i.i.d. \(N(\mu, \sigma^2)\) observations follows a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).

Simulation

The code below illustrates the central limit theorem (CLT) and the sampling distribution of the sample proportion.

library(ggplot2)

set.seed(111)

population_size <- 12141897

p <- 0.54

num_simulations <- 50

sample_size <- 500

p_hat_values <- replicate(num_simulations, {
  # replicate(num_simulations, {...}) runs the code in braces num_simulations
  # times and stores each result in a vector.
  # rep() builds a vector containing population_size * p ones and
  # population_size * (1 - p) zeros, simulating the population.
  sample <- sample(c(rep(1, population_size * p), rep(0, population_size * (1 - p))),
                   sample_size, replace = FALSE)
  mean(sample)  # calculate the p_hat for each sample
})


  
histogram <- ggplot(data.frame(p = p_hat_values), aes(x=p))+
  geom_histogram(binwidth = 0.01, fill ="blue", color = "black")+
  labs(title =" ",
       x = "p_hat",
       y= "Frequency")+
  theme_minimal()

print(histogram)

Moment generating functions

Moment generating function (MGF) and characteristic function are powerful functions that describe the underlying features of a random variable.

Definition

\(M_X(t) = \mathbb E (e^{tX})\)

Theorems about MGF

  1. If \(X\) and \(Y\) are random variables with the same MGF, which is finite on \(\left[-t_0, t_0\right]\) for some \(t_0>0\), then \(X\) and \(Y\) have the same distribution.

MGFs can be used as a tool to determine if two random variables have the identical CDF.

  2. Let \(X_1, \cdots, X_n\) be independent random variables, with MGFs \(M_{X_1}, \cdots, M_{X_n}\). Then the MGF of their sum is given by \[ M_{X_1+\cdots+X_n}(t)=M_{X_1}(t) \cdots M_{X_n}(t) \]

Example of Gamma and Exponential MGF

A gamma distribution with shape \(r=1\) is an exponential distribution. If \(X \sim \operatorname{Gamma}\left(r_X, \lambda\right)\) and \(Y \sim \operatorname{Gamma}\left(r_Y, \lambda\right)\) are independent, then we have \(X+Y \sim \operatorname{Gamma}\left(r_X+r_Y, \lambda\right)\). As a special case, if \(X_1, X_2, \cdots, X_n\) are i.i.d. with \(\operatorname{Exp}(\lambda)\) distribution, then \(X_1+X_2+\cdots+X_n\) has \(\operatorname{Gamma}(n, \lambda)\) distribution.

Suppose \(X \sim \operatorname{Exp}(\lambda)\), for \(\lambda>0\). Then \[ M_X(t)=\mathbb{E}\left[e^{t X}\right]=\int_0^{\infty} e^{t x} \lambda e^{-\lambda x} d x=\lambda \int_0^{\infty} e^{(t-\lambda) x} d x \]

Similar to Gamma MGF, the integral of Exponential MGF converges only for \(t<\lambda\). For \(t<\lambda\), we can integrate: \[ M_X(t)=\lambda\left[\frac{e^{(t-\lambda) x}}{t-\lambda}\right]_0^{\infty}=\frac{\lambda}{\lambda-t} \]

Since the Gamma MGF is \(M_X(t)=\frac{\lambda^r}{(\lambda-t)^r}\) for any \(t<\lambda\), for shape \(r=1\) the Exponential MGF equals the Gamma MGF.
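As a quick check of this Gamma–Exponential relationship, here is a small simulation sketch (not from the course labs; the values lambda = 2 and n = 5 are arbitrary choices for illustration):

set.seed(1)
lambda <- 2   # arbitrary rate, for illustration
n <- 5        # number of exponentials to sum
sums <- replicate(10000, sum(rexp(n, rate = lambda)))
# The histogram of the sums should match the Gamma(n, lambda) density
hist(sums, breaks = 50, freq = FALSE, main = "Sum of 5 Exp(2) vs Gamma(5, 2)")
curve(dgamma(x, shape = n, rate = lambda), add = TRUE, col = "red", lwd = 2)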

Distribution Transformation

Universality of the Uniform—From uniform you can get everything

Let \(U\sim\operatorname{unif}(0,1)\) and let \(F\) be a CDF (assume \(F\) is strictly increasing and continuous). Then there is a theorem: if \[ X = F^{-1}(U) \] then \[ X \sim F. \] (Proof: \[ P(X\leq x)=P(F^{-1}(U)\leq x)=P(F(F^{-1}(U))\leq F(x))=P(U\leq F(x))=F(x) \])

You can convert random uniforms into whatever distribution you want to simulate. One example is the simulation of the Logistic distribution: \(F(x)=e^x/(1+e^x)\).

With \(U\sim\operatorname{unif}(0,1)\), \[ X = F^{-1}(U)=\log\left(\frac{U}{1-U}\right) \] and \(X\sim F\).
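A small simulation sketch of this inverse-transform idea (my own check, not part of the lecture code):

set.seed(123)
u <- runif(10000)
x <- log(u / (1 - u))   # apply the inverse logistic CDF to the uniforms
# Compare with R's built-in logistic sampler: the QQ-plot should hug the line y = x
qqplot(x, rlogis(10000), xlab = "Inverse-transform sample", ylab = "rlogis sample")
abline(0, 1, col = "red")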

Another example is that we could try to use uniform to simulate normal distribution:

The Box-Muller transform generates pairs of independent standard normally distributed (zero mean, unit variance) random numbers, given a source of uniformly distributed random numbers.

Given two independent random variables \(U_1\) and \(U_2\) that are uniformly distributed on the interval (0, 1), we can generate two independent standard normal random variables \(Z_0\) and \(Z_1\) using the following formulas:

\[ Z_0 = \sqrt{-2 \ln U_1} \cos(2 \pi U_2) \]

\[ Z_1 = \sqrt{-2 \ln U_1} \sin(2 \pi U_2) \]
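A minimal R sketch of the Box-Muller transform (my own illustration, not from the course materials):

set.seed(42)
u1 <- runif(5000)
u2 <- runif(5000)
z0 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
z1 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)
# Both z0 and z1 should look standard normal
qqnorm(z0); qqline(z0, col = "red")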

Inversely, the example could be: let \(Z_0\) and \(Z_1\) be standard normal random variables with the following values: \(Z_0\) = 0.5 and \(Z_1\) = -1.0.

  1. Compute CDF values:

    For standard normal distribution, the CDF \[ \Phi(z) \] is given by:

    \[ \Phi(z) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{z}{\sqrt{2}} \right) \right] \]

    (ps:\[ \text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt \])

    • For \[ Z_0 = 0.5 \]:

      \[ \Phi(0.5) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{0.5}{\sqrt{2}} \right) \right] \]

    • For \[ Z_1 = -1.0 \]:

      \[ \Phi(-1.0) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{-1.0}{\sqrt{2}} \right) \right] \]

  2. Compute Uniform values:

    Since the CDF values \[ \Phi(Z_0) \] and \[ \Phi(Z_1) \] are in the range [0, 1], we can directly use them as uniform random variables \(U_0\) and \(U_1\).

    • \[U_0 = \Phi(0.5)\]
    • \[U_1 = \Phi(-1.0)\]
  3. Result:

    • Using the error function values:
      • \[ \Phi(0.5) \approx 0.6915 \]
      • \[ \Phi(-1.0) \approx 0.1587 \]

    Thus, the corresponding uniform distribution values are:

    • \[ U_0 \approx 0.6915 \]
    • \[ U_1 \approx 0.1587 \]
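These CDF values can be checked directly in R, since pnorm() is the standard normal CDF \(\Phi\):

pnorm(0.5)    # 0.6914625, i.e. U_0
pnorm(-1.0)   # 0.1586553, i.e. U_1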

Another example: we can pick a function with the required properties of F in the theorem, for example \[ u=F(x)=1-e^{-x}, \quad x>0, \] and then simulate \(X\sim F\) by \(X=-\ln(1-U)\).

Also, if \[ X \sim F \]

Then \[ F(X) \sim \operatorname{unif}(0,1). \] e.g. let \(X\sim F\); if \[ F(x_0)=1/3,\] then \[P(F(X)\leq 1/3)=P(X\leq x_0)=F(x_0)=1/3, \] which is exactly the uniform(0,1) behaviour: the CDF value 1/3 is not exceeded with probability 1/3. (Uniform distribution: probability (CDF) is proportional to length.)

Normal Distribution

From Standard Normal Distribution to Normal Distribution

\(f(z)=ce^{-z^2/2}\) is a function with good qualities, such as symmetry about 0.

To find the normalization constant c (so that the total probability equals 1), we cannot compute \(\int e^{-z^2/2}\,dz\) with elementary antiderivatives; instead we square the integral, turn it into a double integral, and switch to polar coordinates: \[ \left(\int_{-\infty}^{\infty}\exp(-z^2/2)\, dz\right)^2=\int_{-\infty}^{\infty}\exp(-x^2/2)\,dx\int_{-\infty}^{\infty}\exp(-y^2/2)\,dy=\int_{0}^{2\pi}\!\int_{0}^{\infty}\exp(-r^2/2)\,r\, dr\, d\theta=2\pi, \] so \[ \int_{-\infty}^{\infty}\exp(-z^2/2)\, dz=\sqrt{2\pi}, \qquad c=\frac{1}{\sqrt{2\pi}}. \]

From the standard normal \(Z\), a general normal \(X\sim N(\mu,\sigma^2)\) is obtained by the location-scale transformation \(X=\mu+\sigma Z\).

Exponential distribution

Its only parameter is the rate parameter \(\lambda\). The probability density function (pdf) of an exponential distribution with rate parameter \(\lambda > 0\) is given by:

\[ f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0, \\ 0 & \text{for } x < 0. \end{cases} \] The cumulative distribution function (cdf) of an exponential distribution with rate parameter \(\lambda > 0\) is given by:

\[ F_X(x) = \begin{cases} 1 - e^{-\lambda x} & \text{for } x \geq 0, \\ 0 & \text{for } x < 0. \end{cases} \] Let \(Y=\lambda X\); then \(Y \sim \operatorname{Expo}(1)\).

Proof: since \[ P(Y\leq y)=P(X\leq y/\lambda)=1-e^{-y} \] (just plug \(y/\lambda\) into the cdf of \(X\)). We can also check that \(E[Y]=\operatorname{Var}[Y]=1\), so \(X=Y/\lambda\) has \(E[X]=1/\lambda\) and \(\operatorname{Var}[X]=1/\lambda^2\).

e.g. Memoryless property (this property implies that the remaining lifetime distribution does not depend on how much time has already elapsed; the exponential distribution is the only continuous distribution that has the memoryless property): \[ P(X\geq s+t|X\geq s)=P(X\geq t),\] which is indeed satisfied by the exponential distribution, and we can prove it (though it intuitively makes sense). Here \[ P(X\geq s)=1-P(X\leq s)=e^{-\lambda s} \] \[ P(X \geq s+t \mid X \geq s)=\frac{P(X \geq s+t \text { and } X \geq s)}{P(X \geq s)} \]

Since \(X \geq s+t\) implies \(X \geq s\), we can simplify the numerator: \[ P(X \geq s+t \mid X \geq s)=\frac{P(X \geq s+t)}{P(X \geq s)} \]

Now, substitute the survival function for the exponential distribution: \[ P(X \geq s+t \mid X \geq s)=\frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} \]

Simplify the expression: \[ P(X \geq s+t \mid X \geq s)=\frac{e^{-\lambda s} \cdot e^{-\lambda t}}{e^{-\lambda s}}=e^{-\lambda t} \]

Notice that \(e^{-\lambda t}=P(X \geq t)\) : \[ P(X \geq s+t \mid X \geq s)=P(X \geq t) \] usefulness of it: \[ X\sim Expo(\lambda),E(X|X>a)=a+E(X-a|X>a)=a+1/\lambda \]
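A quick simulation sketch of that last identity (the values lambda = 0.5 and a = 3 are arbitrary, chosen only for illustration):

set.seed(5)
lambda <- 0.5
a <- 3
x <- rexp(1e6, rate = lambda)
mean(x[x > a])   # conditional mean given X > a
a + 1 / lambda   # should be close to the simulated value (here 5)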

e.g.2 The hazard rate (or failure rate) for an exponential distribution is constant over time. For an exponential random variable \(X\) with rate parameter \(\lambda\), the hazard rate is: \[ h(t)=\frac{f_X(t)}{1-F_X(t)}=\lambda . \]

A constant hazard rate implies that the event is equally likely to occur at any point in time, which is a reasonable assumption for many processes, such as the lifetime of certain electronic components or the occurrence of certain types of random failures.

e.g.3 The exponential distribution is closely related to the Poisson process, which is a process that models the occurrence of events happening independently at a constant average rate. If the times between consecutive events in a Poisson process are independent and identically distributed, then these interarrival times follow an exponential distribution. This relationship makes the exponential distribution a natural choice in contexts where events occur randomly over time, such as phone calls arriving at a switchboard or buses arriving at a bus stop.

Gamma

If r.v.X~N(0,1), then \(X^2\)~\(Ga\)(1/2,1/2)

Proof: if r.v. \(X \sim N(0,1)\), then \(X^2 \sim \operatorname{Ga}\left(\frac{1}{2}, \frac{1}{2}\right)\).

Let \(X\) be a random variable such that \(X \sim N(0,1)\).

The probability density function (pdf) of (X) is: \[ f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty. \]

First, we find the pdf of \(Y = X^2\).

The cumulative distribution function (CDF) of (Y) is given by: \[ F_Y(y) = P(Y \leq y) = P(X^2 \leq y). \]

Since \(X^2 \geq 0\), we only consider \(y \geq 0\): \[ F_Y(y) = P(-\sqrt{y} \leq X \leq \sqrt{y}). \]

Using the CDF of the normal distribution, we have: \[ F_Y(y) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \, dx. \]

The pdf of (Y) is the derivative of the CDF: \[ f_Y(y) = \frac{d}{dy} F_Y(y). \]

\[ F_Y(y) = 2 \int_{0}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \, dx. \]

Substituting \(u = x^2/2\) (so \(x=\sqrt{2u}\), \(dx = \frac{du}{\sqrt{2u}}\), and the upper limit \(x=\sqrt{y}\) becomes \(u=y/2\)):

\[ F_Y(y) = 2 \int_{0}^{y/2} \frac{1}{\sqrt{2\pi}} e^{-u} \frac{du}{\sqrt{2u}} = \frac{2}{\sqrt{2\pi}} \int_{0}^{y/2} e^{-u} \frac{du}{\sqrt{2u}}. \]

Thus, \[ f_Y(y) = \frac{d}{dy} \left( \frac{2}{\sqrt{2\pi}} \int_{0}^{y/2} e^{-u} \frac{du}{\sqrt{2u}} \right). \]

Differentiating with respect to y (by the fundamental theorem of calculus: evaluate the integrand at \(u=y/2\) and multiply by \(\frac{1}{2}\)) gives: \[ f_Y(y) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{y}{2}} \cdot y^{-\frac{1}{2}}. \]

Simplifying, we get: \[ f_Y(y) = \frac{1}{\sqrt{2\pi}} y^{-\frac{1}{2}} e^{-\frac{y}{2}}. \]

\[ Y = X^2 \sim \text{Ga}\left(\frac{1}{2}, \frac{1}{2}\right). \]

Chi-square

Chi-square is a distribution built as a sum of squared independent standard normal variables; it is a special case of the Gamma distribution.

1. Let \(Z_1, Z_2, \cdots, Z_i\) be independent standard normal random variables (i.e. \(Z_j \sim N(0,1)\)); then the random variable \(X\) defined by

\[ X = Z_1^2 + Z_2^2 + \cdots + Z_i^2 \] (\(Z_j\)~iid.N(0,1))

follows a chi-square distribution with \(i\) degrees of freedom. So a chi-square with 1 degree of freedom is the same thing as a Gamma(1/2, 1/2), and a chi-square with \(n\) degrees of freedom is Gamma(\(n/2\), 1/2).

2. If \[ Z_i = \frac{x_i - \mu}{\sigma}, \] the sum of squared standardized deviations is:

\[ \sum_{i=1}^n Z_i^2 = \sum_{i=1}^n \left( \frac{x_i - \mu}{\sigma} \right)^2 = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]

Let \(x_1, x_2, \cdots, x_n\) be samples of \(N\left(\mu, \sigma^2\right)\) , \(\mu\) is a known constant, find the distribution of statistics: \[ T=\sum_{i=1}^n\left(x_i-\mu\right)^2 \]

Solution: let \(y_i=\left(x_i-\mu\right) / \sigma, i=1,2, \cdots, n\); then \(y_1, y_2, \cdots, y_n\) are i.i.d. r.v.s from \(N(0,1)\), so \[ \frac{T}{\sigma^2}=\sum_{i=1}^n\left(\frac{x_i-\mu}{\sigma}\right)^2=\sum_{i=1}^n y_i^2 \sim \chi^2(n) \] (i.e. \(T=\sigma^2 \sum_{i=1}^n y_i^2\), a scaled \(\chi^2(n)\) variable).

Besides, \(T\)’s PDF is \[ p(t)=\frac{1}{\left(2 \sigma^2\right)^{n / 2} \Gamma(n / 2)} t^{\frac{n}{2}-1} \mathrm{e}^{-\frac{t}{2 \sigma^2}}, \quad t>0, \]

which is the Gamma distribution \(G a\left(\frac{n}{2}, \frac{1}{2 \sigma^2}\right)\).

3. The chi-square distribution is useful because of the theorem below (a quick simulation check follows the proof): let \(x_1, x_2, \cdots, x_n\) be samples from \(N\left(\mu, \sigma^2\right)\), whose sample mean and sample variance are \[ \bar{x}=\frac{1}{n} \sum_{i=1}^n x_i \text { and } s^2=\frac{1}{n-1} \sum_{i=1}^n\left(x_i-\bar{x}\right)^2, \]

then we can get: (1) \(\bar{x}\) and \(s^2\) are independent; (2) \(\bar{x} \sim N\left(\mu, \sigma^2 / n\right)\); (3) \(\frac{(n-1) s^2}{\sigma^2} \sim \chi^2(n-1)\).

Proof: \[ p\left(x_1, x_2, \cdots, x_n\right)=\left(2 \pi \sigma^2\right)^{-n / 2} \mathrm{e}^{-\sum_{i=1}^n \frac{\left(x_i-\mu\right)^2}{2 \sigma^2}}=\left(2 \pi \sigma^2\right)^{-n / 2} \exp \left\{-\frac{\sum_{i=1}^n x_i^2-2 n \bar{x} \mu+n \mu^2}{2 \sigma^2}\right\} \]

Denote \(\boldsymbol{X}=\left(x_1, x_2, \cdots, x_n\right)^{\mathrm{T}}\). We then create an \(n\)-dimensional orthogonal matrix \(\boldsymbol{A}\) whose first-row elements are all \(1 / \sqrt{n}\), such as \[ A=\left(\begin{array}{ccccc} \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\ \frac{1}{\sqrt{2 \cdot 1}} & -\frac{1}{\sqrt{2 \cdot 1}} & 0 & \cdots & 0 \\ \frac{1}{\sqrt{3 \cdot 2}} & \frac{1}{\sqrt{3 \cdot 2}} & -\frac{2}{\sqrt{3 \cdot 2}} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & -\frac{n-1}{\sqrt{n(n-1)}} \end{array}\right), \] Let \(\boldsymbol{Y}=\left(y_1, y_2, \cdots, y_n\right)^{\mathrm{T}}=\boldsymbol{A} \boldsymbol{X}\); the Jacobian determinant of this transformation is 1, and we can find that \[ \begin{gathered} \bar{x}=\frac{1}{\sqrt{n}} y_1, \\ \sum_{i=1}^n y_i^2=\boldsymbol{Y}^{\mathrm{T}} \boldsymbol{Y}=\boldsymbol{X}^{\mathrm{T}} \boldsymbol{A}^{\mathrm{T}} \boldsymbol{A} \boldsymbol{X}=\sum_{i=1}^n x_i^2, \end{gathered} \]

so\(y_1, y_2, \cdots, y_n\) ’s joint density function is \[ \begin{aligned} p\left(y_1, y_2, \cdots, y_n\right) & =\left(2 \pi \sigma^2\right)^{-n / 2} \exp \left\{-\frac{\sum_{i=1}^n y_i^2-2 \sqrt{n} y_1 \mu+n \mu^2}{2 \sigma^2}\right\} \\ & =\left(2 \pi \sigma^2\right)^{-n / 2} \exp \left\{-\frac{\sum_{i=2}^n y_i^2+\left(y_1-\sqrt{n} \mu\right)^2}{2 \sigma^2}\right\} \end{aligned} \]

So the components of \(\boldsymbol{Y}=\left(y_1, y_2, \cdots, y_n\right)^{\mathrm{T}}\) are independently normally distributed with variances all equal to \(\sigma^2\), but their means are not all the same: the means of \(y_2, \cdots, y_n\) are \(0\), while \(y_1\)’s mean is \(\sqrt{n} \mu\). Since \(\bar{x}=y_1/\sqrt{n}\), this proves (2). Moreover, \[ (n-1) s^2=\sum_{i=1}^n\left(x_i-\bar{x}\right)^2=\sum_{i=1}^n x_i^2-(\sqrt{n} \bar{x})^2=\sum_{i=1}^n y_i^2-y_1^2=\sum_{i=2}^n y_i^2, \]

This proves conclusion (1). Since \(y_2, \cdots, y_n\) are independently and identically distributed as \(N\left(0, \sigma^2\right)\), we have: \[ \frac{(n-1) s^2}{\sigma^2}=\sum_{i=2}^n\left(\frac{y_i}{\sigma}\right)^2 \sim \chi^2(n-1) . \]

The theorem is proved. A similar calculation to the proof above, which may be easier to follow: let \(\left(Y_1, Y_2, \cdots, Y_n\right)^{\top}=A\left(X_1, \cdots, X_n\right)^{\top}\). Then \[ \sum_{i=1}^n Y_i^2=\left(Y_1, \cdots, Y_n\right)\left(Y_1, \cdots, Y_n\right)^{\top}=\left[A\left(X_1, \cdots, X_n\right)^{\top}\right]^{\top}\left[A\left(X_1, \cdots, X_n\right)^{\top}\right]=\left(X_1, \cdots, X_n\right) A^{\top} A\left(X_1, \cdots, X_n\right)^{\top}=\left(X_1, \cdots, X_n\right) E\left(X_1, \cdots, X_n\right)^{\top}=\sum_{i=1}^n X_i^2 . \] Besides, \[ Y_1=\frac{1}{\sqrt{n}} X_1+\cdots+\frac{1}{\sqrt{n}} X_n=\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i=\sqrt{n} \cdot \frac{1}{n} \sum_{i=1}^n X_i=\sqrt{n} \bar{X}, \quad \text{so } \bar{X}=\frac{1}{\sqrt{n}} Y_1, \] and \[ S^2=\frac{1}{n-1} \sum_{i=1}^n\left(X_i-\bar{X}\right)^2=\frac{1}{n-1}\left[\sum_{i=1}^n X_i^2-n \bar{X}^2\right]=\frac{1}{n-1}\left[\sum_{i=1}^n Y_i^2-Y_1^2\right]=\frac{1}{n-1} \sum_{i=2}^n Y_i^2 . \] The joint density of the sample is \[ L=(\sqrt{2 \pi} \sigma)^{-n} \exp \left[-\frac{1}{2 \sigma^2} \sum_{i=1}^n\left(X_i-\mu\right)^2\right]=(\sqrt{2 \pi} \sigma)^{-n} \exp \left[-\frac{1}{2 \sigma^2}\left(\sum_{i=1}^n X_i^2-2 \mu n \bar{X}+n \mu^2\right)\right]=(\sqrt{2 \pi} \sigma)^{-n} \exp \left[-\frac{1}{2 \sigma^2}\left(\sum_{i=1}^n Y_i^2-2 \mu \sqrt{n}\, Y_1+n \mu^2\right)\right] \] \[ =(\sqrt{2 \pi} \sigma)^{-1} \exp \left[-\frac{1}{2 \sigma^2}\left(Y_1-\sqrt{n}\, \mu\right)^2\right] \times(\sqrt{2 \pi} \sigma)^{-1} \exp \left[-\frac{1}{2 \sigma^2} Y_2^2\right] \times \cdots \times(\sqrt{2 \pi} \sigma)^{-1} \exp \left[-\frac{1}{2 \sigma^2} Y_n^2\right]. \]

So \(L\) is the joint density function of \(Y_1, \cdots, Y_n\), which are therefore independent. Besides, we have shown that the sample mean is \(\bar{X}=\frac{1}{\sqrt{n}} Y_1\) and \(S^2=\frac{1}{n-1} \sum_{i=2}^n Y_i^2\), so a normal sample’s mean and variance are independent.
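A small simulation sketch of the theorem (not from the original notes; n, mu and sigma are arbitrary values chosen for illustration):

set.seed(7)
n <- 10; mu <- 5; sigma <- 2
sims <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  c(xbar = mean(x), stat = (n - 1) * var(x) / sigma^2)
})
mean(sims["xbar", ]); sd(sims["xbar", ])   # approx mu and sigma / sqrt(n), checking (2)
cor(sims["xbar", ], sims["stat", ])        # approx 0, consistent with independence in (1)
mean(sims["stat", ]); var(sims["stat", ])  # approx n - 1 and 2(n - 1), the chi-square(n-1) moments in (3)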

When the random variable \(\chi^2 \sim \chi^2(n)\), for a given \(\alpha\) (where \(0<\) \(\alpha<1\) ), the value \(\chi_{1-\alpha}^2(n)\) satisfying the probability equation \(P\left(\chi^2 \leqslant \chi_{1-\alpha}^2(n)\right)=1-\) \(\alpha\) is called the \(1-\alpha\) quantile of the \(\chi^2\) distribution with \(n\) degrees of freedom.

Suppose the random variables \(X_1 \sim \chi^2(m)\) and \(X_2 \sim \chi^2(n)\), and \(X_1\) and \(X_2\) are independent. Then the distribution of \(F=\frac{X_1 / m}{X_2 / n}\) is called the \(\mathrm{F}\) distribution with \(m\) and \(n\) degrees of freedom, denoted as \(F \sim F(m, n)\). Here, \(m\) is called the numerator degrees of freedom and \(n\) the denominator degrees of freedom. We derive the density function of the \(\mathrm{F}\) distribution in two steps. First, we derive the density function of \(Z=\frac{X_1}{X_2}\). Let \(p_1(x)\) and \(p_2(x)\) be the density functions of \(\chi^2(m)\) and \(\chi^2(n)\) respectively. According to the formula for the distribution of the quotient of independent random variables, the density function of \(Z\) is: \[ \begin{gathered} p_Z(z)=\int_0^{\infty} x_2 p_1\left(z x_2\right) p_2\left(x_2\right) \mathrm{d} x_2 \\ =\frac{z^{\frac{m}{2}-1}}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right) 2^{\frac{m+n}{2}}} \int_0^{\infty} x_2^{\frac{n}{2}-1} e^{-\frac{x_2}{2}(1+z)} \mathrm{d} x_2 . \end{gathered} \]

Using the transformation \(u=\frac{x_2}{2}(1+z)\), we get: \[ p_Z(z)=\frac{z^{\frac{m}{2}-1}(1+z)^{-\frac{m+n}{2}}}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \int_0^{\infty} u^{\frac{n}{2}-1} e^{-u} \mathrm{~d} u \]

The final integral is the gamma function \(\Gamma\left(\frac{n}{2}\right)\), so: \[ p_Z(z)=\frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} z^{\frac{m}{2}-1}(1+z)^{-\frac{m+n}{2}}, \quad z \geq 0 . \]

Second, we derive the density function of \(F=\frac{n}{m} Z\). Let the value of \(F\) be \(y\). For \(y \geq 0\), we have: \[ \begin{aligned} p_F(y) & =p_Z\left(\frac{m}{n} y\right) \cdot \frac{m}{n}=\frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)}\left(\frac{m}{n} y\right)^{\frac{m}{2}-1}\left(1+\frac{m}{n} y\right)^{-\frac{m+n}{2}} \cdot \frac{m}{n} \\ & =\frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)}\left(\frac{m}{n}\right)\left(\frac{m}{n} y\right)^{\frac{m}{2}-1}\left(1+\frac{m}{n} y\right)^{-\frac{m+n}{2}} \end{aligned} \]

When the random variable \(F \sim F(m, n)\), for a given \(\alpha\) (where \(0<\alpha<1\) ), the value \(F_{1-\alpha}(m, n)\) satisfying the probability equation \(P\left(F \leqslant F_{1-\alpha}(m, n)\right)=1-\alpha\) is called the \(1-\alpha\) quantile of the \(\mathrm{F}\) distribution with \(m\) and \(n\) degrees of freedom. By the construction of the \(\mathrm{F}\) distribution, if \(F \sim F(m, n)\), then \(1 / F \sim F(n, m)\). Therefore, for a given \(\alpha\) (where \(0<\alpha<1\) ), \[ \alpha=P\left(\frac{1}{F} \leqslant F_\alpha(n, m)\right)=P\left(F \geqslant \frac{1}{F_\alpha(n, m)}\right) . \]

Thus, \[ P\left(F \leqslant \frac{1}{F_\alpha(n, m)}\right)=1-\alpha \]

This implies \[ F_\alpha(n, m)=\frac{1}{F_{1-\alpha}(m, n)} . \]

Corollary Suppose \(x_1, x_2, \cdots, x_m\) is a sample from \(N\left(\mu_1, \sigma_1^2\right)\) and \(y_1, y_2, \cdots, y_n\) is a sample from \(N\left(\mu_2, \sigma_2^2\right)\), and these two samples are independent. Let: \[ s_x^2=\frac{1}{m-1} \sum_{i=1}^m\left(x_i-\bar{x}\right)^2, \quad s_y^2=\frac{1}{n-1} \sum_{i=1}^n\left(y_i-\bar{y}\right)^2, \] where \[ \bar{x}=\frac{1}{m} \sum_{i=1}^m x_i, \quad \bar{y}=\frac{1}{n} \sum_{i=1}^n y_i \] then \[ F=\frac{s_x^2 / \sigma_1^2}{s_y^2 / \sigma_2^2} \sim F(m-1, n-1) . \]

In particular, if \(\sigma_1^2=\sigma_2^2\), then \(F=\frac{s_x^2}{s_y^2} \sim F(m-1, n-1)\). Proof: Since the two samples are independent, \(s_x^2\) and \(s_y^2\) are independent. According to the theorem above, we have \[ \frac{(m-1) s_x^2}{\sigma_1^2} \sim \chi^2(m-1), \quad \frac{(n-1) s_y^2}{\sigma_2^2} \sim \chi^2(n-1) . \]

By the definition of the \(\mathrm{F}\) distribution, \(F \sim F(m-1, n-1)\). Corollary: Suppose \(x_1, x_2, \cdots, x_n\) is a sample from a normal distribution \(N\left(\mu, \sigma^2\right)\), and let \(\bar{x}\) and \(s^2\) denote the sample mean and sample variance of the sample, respectively. Then \[ t=\frac{\sqrt{n}(\bar{x}-\mu)}{s} \sim t(n-1) . \]

Proof: From a Theorem we obtain \[ \frac{\bar{x}-\mu}{\sigma / \sqrt{n}} \sim N(0,1) \]

Then, \[ \frac{\sqrt{n}(\bar{x}-\mu)}{s}=\frac{\frac{\bar{x}-\mu}{\sigma / \sqrt{n}}}{\sqrt{\frac{(n-1) s^2 / \sigma^2}{n-1}}} \]

Since the numerator is a standard normal variable and the denominator’s square root contains a \(\chi^2\) variable with \(n-1\) degrees of freedom divided by its degrees of freedom, and they are independent, by the definition of the \(t\) distribution, \(t \sim t(n-1)\). The proof is complete.

Corollary: In the notation of Corollary , assume \(\sigma_1^2=\sigma_2^2=\sigma^2\), and let \[ s_w^2=\frac{(m-1) s_x^2+(n-1) s_y^2}{m+n-2}=\frac{\sum_{i=1}^m\left(x_i-\bar{x}\right)^2+\sum_{i=1}^n\left(y_i-\bar{y}\right)^2}{m+n-2} \]

Then \[ \frac{(\bar{x}-\bar{y})-\left(\mu_1-\mu_2\right)}{s_w \sqrt{\frac{1}{m}+\frac{1}{n}}} \sim t(m+n-2) \]

Proof: Since \(\bar{x} \sim N\left(\mu_1, \frac{\sigma^2}{m}\right), \bar{y} \sim N\left(\mu_2, \frac{\sigma^2}{n}\right)\), and \(\bar{x}\) and \(\bar{y}\) are independent, we have

\[ \bar{x}-\bar{y} \sim N\left(\mu_1-\mu_2,\left(\frac{1}{m}+\frac{1}{n}\right) \sigma^2\right) . \]

Thus, \[ \frac{(\bar{x}-\bar{y})-\left(\mu_1-\mu_2\right)}{\sigma \sqrt{\frac{1}{m}+\frac{1}{n}}} \sim N(0,1) . \]

By a Theorem , we know that \(\frac{(m-1) s_x^2}{\sigma^2} \sim \chi^2(m-1)\) and \(\frac{(n-1) s_y^2}{\sigma^2} \sim \chi^2(n-1)\), and they are independent. By additivity, we have \[ \frac{(m+n-2) s_w^2}{\sigma^2}=\frac{(m-1) s_x^2+(n-1) s_y^2}{\sigma^2} \sim \chi^2(m+n-2) . \]

Since \(\bar{x}-\bar{y}\) and \(s_w^2\) are independent, by the definition of the \(\mathrm{t}\) distribution, we get the desired result. \(\square\)

One interesting example shows how the distributions above can be combined cleverly to solve problems: let the r.v.s \(X_1, X_2, X_3, X_4\) be independently and identically distributed (i.i.d.) as \(N\left(0, \sigma^2\right)\), and let \(Z=\left(X_1^2+X_2^2\right) /\left(X_1^2+X_2^2+X_3^2+X_4^2\right)\). Prove: \(Z \sim U(0,1)\). \[ \begin{aligned} & \text { Solution: let } Y=\frac{X_3^2+X_4^2}{X_1^2+X_2^2}=\frac{\left[\left(\frac{X_3}{\sigma}\right)^2+\left(\frac{X_4}{\sigma}\right)^2\right] / 2}{\left[\left(\frac{X_1}{\sigma}\right)^2+\left(\frac{X_2}{\sigma}\right)^2\right] / 2} \sim F(2,2), \\ & \text { i.e. } f_Y(y)=\frac{1}{(1+y)^2}, \quad y>0 . \\ & \text { Then } P(Z \leq z)=P\left(\frac{1}{1+Y} \leq z\right)=P\left(Y \geqslant \frac{1}{z}-1\right) \\ & =\int_{\frac{1}{z}-1}^{+\infty} \frac{1}{(1+y)^2} d y=z, \quad 0<z<1, \\ & \therefore Z \sim U(0,1). \end{aligned} \] For reference, the \(F\left(n_1, n_2\right)\) density and moments are:

\[\begin{aligned} & f(x)=\frac{\Gamma\left(\frac{n_1+n_2}{2}\right)}{\Gamma\left(\frac{n_1}{2}\right) \Gamma\left(\frac{n_2}{2}\right)}\left(\frac{n_1}{n_2}\right)\left(\frac{n_1}{n_2} x\right)^{\frac{n_1}{2}-1}\left(1+\frac{n_1}{n_2} x\right)^{-\frac{1}{2}\left(n_1+n_2\right)}, \quad x>0, \\ & \text {with } E(X)=\frac{n_2}{n_2-2} \text { when } n_2>2, \text { and } \operatorname{Var}(X)=\frac{2 n_2^2\left(n_1+n_2-2\right)}{n_1\left(n_2-2\right)^2\left(n_2-4\right)} \text { when } n_2>4 . \end{aligned}\]


PDF transformations (from textbooks)

Standard normal distribution

Pdf : \(f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\)

Chi-square distribution

From standard normal distribution \(f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\), we want to get Chi-square distribution \(f(x; k) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}\)

Assuming that \(Z \sim N(0, 1)\)
\(W=Z^2\)
So we can deduce the pdf of the variable W by the change-of-variables formula: \[ \begin{align} f_W(w)& =f_Z(\sqrt{w}) \left| \frac{d(\sqrt{w})}{dw} \right| + f_Z(-\sqrt{w}) \left| \frac{d(-\sqrt{w})}{dw} \right|\\ &=2 \cdot \frac{1}{\sqrt{2\pi}} e^{-w / 2} \cdot \frac{1}{2\sqrt{w}} = \frac{1}{\sqrt{2\pi w}} e^{-w / 2}\\ \end{align} \]

\(\Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt\)
Since \(\Gamma(1/2)=\sqrt{\pi}\), we can see that \(W\sim \chi^2(1)\).

Now, we can introduce the variable Y.
\(Y=\sum_{i=1}^k Z_i^2\) (\(Z_i\) is independent of each other)
We want to get the MGF of Y: \[ \begin{align} M_Y(t)&=E[e^{tY}]\\ & =E[e^{tZ_1^2}e^{tZ_2^2}\cdots e^{tZ_k^2}]\\ & =E[e^{tZ_1^2}]E[e^{tZ_2^2}]\cdots E[e^{tZ_k^2}] \\ & = \prod_{i=1}^k M_{Z_i^2}(t)\\ &=(1-2t)^{-r_1/2}(1-2t)^{-r_2/2}\cdots(1-2t)^{-r_k/2}\\ & =(1-2t)^{-\sum_{i=1}^k r_i/2}\\ \end{align} \] because the MGF of a chi-square variable is \((1-2t)^{-r/2}\), where r is the degrees of freedom of that chi-square variable.
So \(Y\sim \chi^2(r_1+r_2+\cdots+r_k)\), where \(r_1, r_2, \cdots, r_k\) are the degrees of freedom of the individual terms.
Here, df = 1 for every term, so \(Y \sim \chi^2(k)\).

Student- t distribution

From standard normal distribution \(f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\), we want to get Student- t distribution \(f(t) = \frac{\Gamma \left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \, \Gamma \left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}\), where \(\Gamma\) is the gamma function and \(\nu\) is the degrees of freedom.

Given two independent variables Z and V (\(Z \sim N(0, 1)\) and \(V\sim \chi^2(\nu)\)), we construct a new variable \(T=\frac{Z}{\sqrt{V/\nu}}\).
The joint pdf is \[ g(z,v)=\frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}\frac{1}{2^{\nu/2} \Gamma(\nu/2)} v^{\nu/2 - 1} e^{-v/2} \] The cdf of T is given by \[ \begin{align} F(t) & =Pr\left(\frac{Z}{\sqrt{V/\nu}}\leq t\right)\\ & =Pr\left(Z\leq \sqrt{V/\nu}\,t\right)\\ & =\int_{0}^{\infty} \int_{-\infty}^{\sqrt{v/\nu}\,t} g(z,v) \, dz \, dv \end{align} \] Simplifying F(t): \[ F(t)=\frac{1}{\sqrt{\pi}\Gamma(\nu/2)}\int_{0}^{\infty}\left[\int_{-\infty}^{\sqrt{v/\nu}\,t} \frac{e^{-z^2/2}}{2^\frac{\nu+1}{2}}dz\right]v^{\frac{\nu}{2}-1}e^{-\frac{v}{2}}dv \]

To get the pdf, we differentiate F(t): \[ \begin{align} f(t) &=F'(t)=\frac{1}{\sqrt{\pi}\Gamma(\nu/2)}\int_{0}^{\infty} \frac{e^{-(v/2)(t^2/\nu)}}{2^{\frac{\nu+1}{2}}}\sqrt\frac{v}{\nu}\,v^{\nu/2-1}e^{-\frac{v}{2}}dv\\ &=\frac{1}{\sqrt{\pi\nu}\Gamma(\nu/2)}\int_{0}^{\infty}\frac{v^{(\nu+1)/2-1}}{2^{(\nu+1)/2}}e^{-(v/2)(1+t^2/\nu)}dv \end{align} \] We need to make the change of variables: \(y=(1+t^2/\nu)v\)
And we need to change dv: \(\frac{dv}{dy}=\frac{1}{1+t^2/\nu}\)

\[ \begin{align} f(t)&=\frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\Gamma(\nu/2)}[\frac{1}{(1+t^2/\nu)^{(\nu+1)/2}}]\int_{0}^{\infty}\frac{y^{(\nu+1)/2-1}}{\Gamma[(\nu+1)/2]2^{(\nu+1)/2}}e^{-y/2}dy \end{align} \] This part \(\int_{0}^{\infty}\frac{y^{(\nu+1)/2-1}}{\Gamma[(\nu+1)/2]2^{(\nu+1)/2}}e^{-y/2}dy\) is equal to 1, because this part is the whole area under the chi-square distribution with \(\nu+1\) degrees of freedom. So, the pdf for T can be written as follows \[ f(t)=\frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\Gamma(\nu/2)}\frac{1}{(1+t^2/\nu)^{(\nu+1)/2}} \]

F-distribution

We will do some transformations on the chi-square distribution to get the F-distribution.

Assuming that we have two independent random variables \[ X \sim \chi^2(n_1) \quad and\quad Y \sim \chi^2(n_2)\ \] Now,we will define a new variable F \[ F = \frac{(X / n_1)}{(Y / n_2)} \] This looks a little complex, so let’s do some simplification.

\[ U = \frac{X}{n_1} \quad \text{and} \quad V = \frac{Y}{n_2}\ \]

So, \(F = \frac{U}{V}\).
Because, X and Y are independent of each other. Obviously, U and V are also independent of each other.

\[ f_{U,V}(u,v) = f_U(u) f_V(v) \] Since \(U = X/n_1\) and \(V = Y/n_2\), each rescaling contributes a Jacobian factor (\(n_1\) and \(n_2\) respectively), so \[ f_U(u) = \frac{n_1 (n_1 u)^{n_1/2 - 1} e^{-n_1 u/2}}{2^{n_1/2} \Gamma(n_1/2)} \] \[ f_V(v) = \frac{n_2 (n_2 v)^{n_2/2 - 1} e^{-n_2 v/2}}{2^{n_2/2} \Gamma(n_2/2)} \] To find the joint density function of F & V, we use the Jacobian transformation \[ J =\left| \frac{\partial(U,V)}{\partial(F,V)} \right| = \left| \begin{matrix} \frac{\partial U}{\partial F} & \frac{\partial U}{\partial V} \\ \frac{\partial V}{\partial F} & \frac{\partial V}{\partial V} \end{matrix} \right| = \left| \begin{matrix} V & F \\ 0 & 1 \end{matrix} \right| = V \] \[ f_{F,V}(f,v) = f_{U,V}(u,v) \left| \frac{\partial(u,v)}{\partial(f,v)} \right|= f_{U,V}(u,v)\,v \] Then, we substitute \(f_U(u)\) and \(f_V(v)\) into \(f_{F,V}(f,v)\) and use the condition \(u=fv\): \[ \begin{align} f_{F,V}(f,v) &= \frac{n_1(n_1 fv)^{n_1/2 - 1} e^{-n_1 fv/2}}{2^{n_1/2} \Gamma(n_1/2)} \cdot \frac{n_2(n_2 v)^{n_2/2 - 1} e^{-n_2 v/2}}{2^{n_2/2} \Gamma(n_2/2)} \cdot v\\ & = \frac{n_1^{n_1/2} f^{n_1/2 - 1} v^{n_1/2 - 1} e^{-n_1 f v/2}}{2^{n_1/2} \Gamma(n_1/2)} \cdot \frac{n_2^{n_2/2} v^{n_2/2 - 1} e^{-n_2 v/2}}{2^{n_2/2} \Gamma(n_2/2)} \cdot v\\ & = \frac{n_1^{n_1/2} n_2^{n_2/2} f^{n_1/2 - 1} v^{(n_1 + n_2)/2 - 1} e^{-(n_1 f + n_2) v/2}}{2^{(n_1 + n_2)/2} \Gamma(n_1/2) \Gamma(n_2/2)} \end{align} \] We will integrate this density function with respect to V \[ \begin{align} f_F(f) &= \int_0^\infty f_{F,V}(f,v) dv\\ &= \int_0^\infty \frac{n_1^{n_1/2} n_2^{n_2/2} f^{n_1/2 - 1} v^{(n_1 + n_2)/2 - 1} e^{-(n_1 f + n_2) v/2}}{2^{(n_1 + n_2)/2} \Gamma(n_1/2) \Gamma(n_2/2)} dv\\ \end{align} \] Let’s do a substitution to make it look simpler \[ \begin{align} c = \frac{n_1 f + n_2}{2} \end{align} \] By the definition of the Gamma function (after substituting \(w=cv\)), \[ \int_0^\infty v^{(n_1 + n_2)/2 - 1} e^{-c v} dv = \frac{\Gamma((n_1 + n_2)/2)}{c^{(n_1 + n_2)/2}} \]

Then \[ f_F(f) = \frac{n_1^{n_1/2} n_2^{n_2/2} f^{n_1/2 - 1}}{2^{(n_1 + n_2)/2} \Gamma(n_1/2) \Gamma(n_2/2)} \cdot \frac{\Gamma((n_1 + n_2)/2)}{\left(\frac{n_1 f + n_2}{2}\right)^{(n_1 + n_2)/2}} = \frac{\Gamma((n_1 + n_2)/2)}{\Gamma(n_1/2) \Gamma(n_2/2)}\, n_1^{n_1/2}\, n_2^{n_2/2}\, f^{n_1/2 - 1} \left(n_1 f + n_2\right)^{-(n_1 + n_2)/2} \] Pulling out the factor \(n_2^{-(n_1+n_2)/2}\) gives the usual form of the F-distribution pdf \[ f_F(f) = \frac{\Gamma((n_1 + n_2)/2)}{\Gamma(n_1/2) \Gamma(n_2/2)} \left(\frac{n_1}{n_2}\right)^{n_1/2} f^{n_1/2 - 1} \left(1 + \frac{n_1}{n_2} f\right)^{-(n_1 + n_2)/2} \]
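A simulation sketch comparing this construction with R’s built-in F density df() (the degrees of freedom 5 and 8 are arbitrary choices of mine):

set.seed(8)
n1 <- 5; n2 <- 8
f_sim <- (rchisq(10000, df = n1) / n1) / (rchisq(10000, df = n2) / n2)
hist(f_sim, breaks = 80, freq = FALSE, xlim = c(0, 6), main = "Simulated F(5, 8) vs df()")
curve(df(x, df1 = n1, df2 = n2), add = TRUE, col = "red", lwd = 2)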

beta and Gamma

When considering the product of two Gamma functions \[(\Gamma(a) \Gamma(b))\], we can write it as two independent integrals:

\[ \Gamma(a) \Gamma(b) = \left( \int_0^\infty t^{a-1} e^{-t} \, dt \right) \left( \int_0^\infty s^{b-1} e^{-s} \, ds \right) \]

To convert this product into a double integral, we can use a change of variables. Consider using polar coordinates in the two independent integrals, which allows us to use certain integration techniques to simplify them. First, we transform the integration region from Cartesian coordinates to polar coordinates:

\[ t = r \cos \theta, \quad s = r \sin \theta \]

The Jacobian determinant is \[r\], so the integral can be rewritten as:

\[ \Gamma(a) \Gamma(b) = \int_0^\infty \int_0^\infty t^{a-1} e^{-t} s^{b-1} e^{-s} \, dt \, ds \]

Using the polar coordinate transformation, the integral becomes:

\[ = \int_0^\frac{\pi}{2} \int_0^\infty (r \cos \theta)^{a-1} e^{-r \cos \theta} (r \sin \theta)^{b-1} e^{-r \sin \theta} r \, dr \, d\theta \]

Note that the \(r\) part cannot be pulled out as a separate factor, because the exponent \(e^{-r(\cos \theta + \sin \theta)}\) still depends on \(\theta\). So we first carry out the inner \(r\) integral for each fixed \(\theta\):

\[ = \int_0^\frac{\pi}{2} (\cos \theta)^{a-1} (\sin \theta)^{b-1} \left( \int_0^\infty r^{a+b-1} e^{-r(\cos \theta + \sin \theta)} \, dr \right) d\theta \]

First, compute the \(r\) integral part:

\[ \int_0^\infty r^{a+b-1} e^{-r(\cos \theta + \sin \theta)} \, dr = \frac{\Gamma(a+b)}{(\cos \theta + \sin \theta)^{a+b}} \]

Next, compute the remaining \(\theta\) integral. Substituting \(v = \frac{\cos \theta}{\cos \theta + \sin \theta}\), so that \(1-v = \frac{\sin \theta}{\cos \theta + \sin \theta}\) and \(dv = -\frac{d\theta}{(\cos \theta + \sin \theta)^2}\), gives

\[ \int_0^\frac{\pi}{2} \frac{(\cos \theta)^{a-1} (\sin \theta)^{b-1}}{(\cos \theta + \sin \theta)^{a+b}} \, d\theta = \int_0^1 v^{a-1} (1-v)^{b-1} \, dv = B(a, b), \]

where \(B(a, b)\) is the Beta function, defined as:

\[ B(a, b) = \int_0^1 t^{a-1} (1-t)^{b-1} \, dt \]

Therefore, we get:

\[ \Gamma(a) \Gamma(b) = \Gamma(a+b) B(a, b) \]

Thus, the double integral form of the Gamma function product directly comes from the polar transformation and the application of the Beta function. This is an important mathematical technique used to simplify complex integrals and functional relationships.
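A quick numeric check of this identity in R, using the built-in gamma() and beta() functions (the values a and b are arbitrary):

a <- 2.5; b <- 4
gamma(a) * gamma(b)         # left-hand side
gamma(a + b) * beta(a, b)   # right-hand side, same value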

Why is the integral result as shown?

The given integral,

\[ \int_0^\infty r^{a+b-1} e^{-r (\cos \theta + \sin \theta)} \, dr, \]

can be evaluated using the definition of the Gamma function. The Gamma function \[\Gamma(z)\] is defined as:

\[ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt. \]

To see why the integral can be expressed in terms of the Gamma function, let’s rewrite the integral in a form that matches the Gamma function’s definition. The integral has the form:

\[ \int_0^\infty r^{a+b-1} e^{-r (\cos \theta + \sin \theta)} \, dr. \]

Let’s set \[t = r (\cos \theta + \sin \theta)\], then \[r = \frac{t}{\cos \theta + \sin \theta}\] and \[dr = \frac{dt}{\cos \theta + \sin \theta}\]. Substituting these into the integral gives:

\[ \int_0^\infty \left( \frac{t}{\cos \theta + \sin \theta} \right)^{a+b-1} e^{-t} \cdot \frac{dt}{\cos \theta + \sin \theta}. \]

Simplifying inside the integral:

\[ \int_0^\infty \frac{t^{a+b-1}}{(\cos \theta + \sin \theta)^{a+b}} e^{-t} \, dt. \]

Since \[(\cos \theta + \sin \theta)^{a+b}\] is a constant with respect to \[t\], it can be factored out of the integral:

\[ \frac{1}{(\cos \theta + \sin \theta)^{a+b}} \int_0^\infty t^{a+b-1} e^{-t} \, dt. \]

The integral

\[ \int_0^\infty t^{a+b-1} e^{-t} \, dt \]

is recognized as the Gamma function \[\Gamma(a+b)\]. Thus, we have:

\[ \frac{1}{(\cos \theta + \sin \theta)^{a+b}} \Gamma(a+b). \]

Therefore,

\[ \int_0^\infty r^{a+b-1} e^{-r (\cos \theta + \sin \theta)} \, dr = \frac{\Gamma(a+b)}{(\cos \theta + \sin \theta)^{a+b}}. \]

This demonstrates why the integral is evaluated as shown:

\[ \int_0^\infty r^{a+b-1} e^{-r (\cos \theta + \sin \theta)} \, dr = \frac{\Gamma(a+b)}{(\cos \theta + \sin \theta)^{a+b}}. \] (method 2 for integral calculation: x=uv,y=u(1-v) J=-u)

Three Main Sampling Distributions

Chi-square distribution

Suppose \(X_1, \cdots, X_n \stackrel{\text { i.i.d. }}{\sim} \mathcal{N}(0,1)\). The distribution of the statistic \[ X_1^2+\cdots+X_n^2 \] is called a chi-square distribution with \(n\) degrees of freedom, denoted by \(\chi^2(n)\).

Besides, random variable \(X_i^2 \sim \operatorname{Gamma}\left(\frac{1}{2}, \frac{1}{2}\right)\) corresponds to the chi-squared distribution with 1 degree of freedom, denoted as \(\chi_1^2\).

This is derived by the MGF:

Since \[ M_{X_1^2+\cdots+X_n^2}(t)=M_{X_1^2}(t) \times \cdots \times M_{X_n^2}(t)= \begin{cases}\infty & t \geq \frac{1}{2} \\ (1-2 t)^{-\frac{n}{2}} & t<\frac{1}{2}\end{cases} \]

This is the MGF of the \(\operatorname{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right)\) distribution, so \(X_1^2+\cdots+X_n^2 \sim \operatorname{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right)\). This is called the chi-squared distribution with \(\mathbf{n}\) degree of freedom, denoted \(\chi_n^2\).

Properties

  • If \(W_1, \ldots, W_n\) are independent \(\chi^2\) random variables with, respectively, \(v_1, \cdots, v_n\) degrees of freedom, then the random variable \(W_1+\cdots+W_n\) follows a \(\chi^2\)-distribution with \(v_1+\cdots+v_n\) degree of freedom.

  • The random variable \(\frac{(\bar{X}-\mu)^2}{\sigma^2 / n}\) follows a \(\chi^2\)-distribution with 1 degree of freedom when \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).

Code to plot examples of Chi-square distribution

library(ggplot2)

# Create a sequence of values
x <- seq(0, 20, length.out = 200)

# Calculate the density for different degrees of freedom
df4 <- dchisq(x, df = 4)
df8 <- dchisq(x, df = 8)
df12 <- dchisq(x, df = 12)

chi_data <- data.frame(x, df4, df8, df12)
ggplot(chi_data, aes(x)) +
  geom_line(aes(y = df4, color = "df=4")) +
  geom_line(aes(y = df8, color = "df=8")) +
  geom_line(aes(y = df12, color = "df=12")) +
  labs(title = "Chi-Square Distribution",
       x = "Value",
       y = "Density") +
  scale_color_manual(name = "Degrees of Freedom", values = c("df=4" = "blue","df=8" = "red", "df=12" = "green")) +
  theme_minimal()

Application

Chi-square distribution is primarily used in testing:

  • Goodness-of-fit

  • Independence in contingency tables

Student’s t-distribution

Construction

The statistic \(T=\frac{\bar{X}-\mu}{S / \sqrt{n}}\) follows a \(t\)-distribution with \(v=n-1\) degrees of freedom when \(X_1, \cdots, X_n\) are i.i.d. normal RVs.

\[ \bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \frac{\bar{X}-\mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0,1) \quad \frac{(n-1) s^2}{\sigma^2} \sim \chi_{n-1}^2 \]

If we know the population variance \(\sigma^2\), we can easily do inference using the statistic \(\frac{\bar{X}-\mu}{\sigma / \sqrt{n}}\). However, \(\sigma^2\) is usually unknown in practice.

\[ \bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \quad \frac{\bar{X}-\mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0,1) \quad \frac{(n-1) s^2}{\sigma^2} \sim \chi_{n-1}^2 \]

We can construct the \(t\)-statistic using the sample variance \(S^2\) : \[ T=\frac{\frac{\bar{X}-\mu}{\sigma / \sqrt{n}}}{\sqrt{\frac{(n-1) s^2}{\sigma^2} /(n-1)}}=\frac{\bar{X}-\mu}{S / \sqrt{n}} \]

Notice the sample mean \(\bar{X}\) and the sample variance \(S^2\) are independent (the proof is beyond the scope of this course). So the \(T\) is now a ratio of a standard normal variable and the square root of a \(\chi^2 \mathrm{RV}\) divided by its degrees of freedom. This is the definition of a \(t\)-distribution with \(n-1\) degrees of freedom.

Properties

The \(t\)-distribution is primarily used in contexts where the underlying population is assumed to be normally distributed, especially when the sample size is small. Used extensively in problems that deal with inference about population mean \(\mu\) when population variance \(\sigma^2\) is unknown; problems where one is trying to determine if means from two samples are significantly different when population variances \(\sigma_1^2\) and \(\sigma_2^2\) are unknown.

Code to plot examples of t-distribution

# Load necessary libraries
library(ggplot2)

# Create a sequence of values
x <- seq(-4, 4, length.out = 1000)

# Calculate the density for different degrees of freedom
df_values <- c(1, 2, 3, 5, 10, 30)
distributions <- lapply(df_values, function(df) dt(x, df = df))

# Create a data frame to store the data
df_data <- data.frame(x = x)
for (i in seq_along(df_values)) {
  df_data[paste0("df", df_values[i])] <- distributions[[i]]
}

# Plot the densities
ggplot(df_data, aes(x = x)) +
  geom_line(aes(y = df1, color = "df=1")) +
  geom_line(aes(y = df2, color = "df=2")) +
  geom_line(aes(y = df3, color = "df=3")) +
  geom_line(aes(y = df5, color = "df=5")) +
  geom_line(aes(y = df10, color = "df=10")) +
  geom_line(aes(y = df30, color = "df=30")) +
  geom_line(aes(y = dnorm(x), color = "Standard Normal")) +
  scale_color_manual(name = "Distribution", 
                     values = c("df=1" = "red", "df=2" = "red", "df=3" = "red", 
                                "df=5" = "green", "df=10" = "green", "df=30" = "green", 
                                "Standard Normal" = "blue")) +
  labs(title = "Density of the t-distribution compared to the standard normal distribution",
       x = "x",
       y = "Density") +
  theme_minimal()

F-distribution

Let \(U\) and \(V\) be two independent RVs following \(\chi^2\) distributions with \(\nu_1\) and \(\nu_2\) degrees of freedom, respectively. Then the distribution of the random variable \(F=\frac{U / \nu_1}{V / \nu_2}\) is known as \(F\)-distribution.

Example

If \(S_1^2\) and \(S_2^2\) are the variances of independent random samples of size \(n_1\) and \(n_2\) taken from normal populations with variances \(\sigma_1^2\) and \(\sigma_2^2\) respectively, then \[ F=\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}=\frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} \] follows an \(F\)-distribution with \(\nu_1=n_1-1\) and \(\nu_2=n_2-1\) degrees of freedom.

Code to plot examples of F-distribution

# Load necessary libraries
library(ggplot2)

# Create a sequence of values
x <- seq(0, 5, length.out = 1000)

# Calculate the density for different degrees of freedom
df1_values <- c(1, 2, 5, 10, 100)
df2_values <- c(1, 1, 2, 1, 100)
distributions <- lapply(1:length(df1_values), function(i) df(x, df1 = df1_values[i], df2 = df2_values[i]))

# Create a data frame to store the data
df_data <- data.frame(x = x)
for (i in seq_along(df1_values)) {
  df_data[paste0("df1", df1_values[i], "_df2", df2_values[i])] <- distributions[[i]]
}

# Plot the densities
ggplot(df_data, aes(x = x)) +
  geom_line(aes(y = df11_df21, color = "d1=1, d2=1")) +
  geom_line(aes(y = df12_df21, color = "d1=2, d2=1")) +
  geom_line(aes(y = df15_df22, color = "d1=5, d2=2")) +
  geom_line(aes(y = df110_df21, color = "d1=10, d2=1")) +
  geom_line(aes(y = df1100_df2100, color = "d1=100, d2=100")) +
  scale_color_manual(name = "", 
                     values = c("d1=1, d2=1" = "red", "d1=2, d2=1" = "black", 
                                "d1=5, d2=2" = "blue", "d1=10, d2=1" = "green", 
                                "d1=100, d2=100" = "grey")) +
  labs(title = "F-distribution with different degrees of freedom",
       x = "x",
       y = "Density") +
  theme_minimal() +
  theme(legend.position = "top")

Application

Analysis of variance (ANOVA)

Estimation of Population Characteristic

Point Estimation

A point estimate of a population characteristic is a single number that is based on sample data and represents a plausible value of the characteristic.

Interval Estimation—Confidence interval(CI)

An interval estimate of a parameter \(\theta\) is an interval of the form \(\hat{\theta}_L<\theta<\hat{\theta}_U\), where \(\hat{\theta}_L\) and \(\hat{\theta}_U\) depend on the value of \(\hat{\theta}\) for a particular sample and also on the sampling distribution of \(\hat{\Theta}\).

Introduction

  • If we were to construct a 95% confidence interval for some population characteristics (population proportion p or population mean \(\mu\)), we would be using a method that is successful 95% of the time.

  • This is also about the question relevant to “How to choose a sample size”

Definition

A \(100(1-\alpha) \%\) confidence interval is an interval of the form \(\hat{\theta}_L<\theta<\hat{\theta}_U\), where \(\hat{\theta}_L\) and \(\hat{\theta}_U\) are respectively values of \(\widehat{\Theta}_L\) and \(\widehat{\Theta}_U\) obtained for a particular sample, based on \[ P\left(\widehat{\Theta}_L<\theta<\widehat{\Theta}_U\right)=1-\alpha \quad ; \quad 0<\alpha<1 \] in the estimation of population parameter \(\theta\).

Interpretation

For confidence level of 95% CI for any normal distribution: About 95% of the values are within 1.96 standard deviations of the mean. (Recall the concept of Z-scores)

That is, if \(\text{Estimate} \pm (z \times \text{standard error})\) were used to generate an interval estimate over and over again with different samples, in the long run 95% of the resulting intervals would include the actual value of the characteristic being estimated.

  • The confidence level 95% refers to the method used to construct the interval rather than to any particular interval, such as the one we obtained.
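A small simulation sketch of this interpretation (all numbers here are made up): generate many samples, build a z-interval from each, and check how often the interval covers the true mean.

set.seed(1)
mu <- 10; sigma <- 2; n <- 25
covered <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  ci <- mean(x) + c(-1, 1) * 1.96 * sigma / sqrt(n)  # z-interval with sigma known
  ci[1] < mu & mu < ci[2]
})
mean(covered)   # close to 0.95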

eg

Proportion

  • check to make sure that the three necessary conditions are met: the sample is a random sample, \(n\hat p \ge 10\), and \(n(1-\hat p) \ge 10\). Then \(\frac {\hat p -p }{\sqrt{\hat p(1-\hat p)/n}}\) is approximately standard normal, and 1.96 is the z-value that captures the central 95%, so the 95% CI is \(\hat p \pm 1.96\sqrt{\frac{\hat p(1-\hat p)}{n}}\).

  • sample size question

Using the sample proportion with a 95% confidence interval, if we want the margin of error (ME) to be no more than 5%, we can solve for n from the following equation:

0.05=1.96 \(\sqrt {\frac {\hat p(1-\hat p) }{ n}}\)
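Solving that equation for n in R (using the conservative guess \(\hat p = 0.5\), which is an assumption for illustration, not part of the original notes):

p_hat <- 0.5                  # conservative value that maximizes p_hat * (1 - p_hat)
n_needed <- 1.96^2 * p_hat * (1 - p_hat) / 0.05^2
ceiling(n_needed)             # 385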

CI on Mean

Here, \(\bar x\) is the sample mean from a simple random sample.

\(\mu\) is the population mean which we are interested in estimating.

CI on \(\mu\) with \(\sigma\) known n \(\ge 30\) or the population is normal — Use z-statistics

CI: \(\bar x \pm z_{\alpha/2} \frac {\sigma}{\sqrt n}\), for example, 95% CI is \(\bar x \pm 1.96 \frac {\sigma}{\sqrt n}\)

One-side Confidence Bound on \(\mu\) with \(\sigma\) known n \(\ge 30\) or the population is normal — Use z-statistics

Upper one-side bound: \(\mu <\bar x + z_{\alpha} \frac {\sigma}{\sqrt n}\)

Lower one-side bound: \(\mu >\bar x - z_{\alpha} \frac {\sigma}{\sqrt n}\)

For example, 95% Confidence bound on \(\mu\) is \(\bar x \pm 1.645 \frac {\sigma}{\sqrt n}\)

CI on \(\mu\) with \(\sigma\) unknown and the population is normal — Use t-statistics (use s as the estimate for σ (t-statistics with df = n-1))

CI: \(\bar x \pm t_{\alpha/2} \frac {s}{\sqrt n}\), for example, 95% CI is \(\bar x \pm t_{0.025} \frac {s}{\sqrt n}\) and df = n-1

Remark: The distribution of t is more spread out than the standard normal distribution but when n \(\ge 30\), t and z are very close to each other.

CI for \(\mu_1 - \mu_2\), both \(\sigma_1^2\) and \(\sigma_2^2\) are known

CI of \(\mu_1-\mu_2\): \((\bar x_1 - \bar x_2) \pm z_{\alpha/2} \sqrt {\frac {\sigma_1^2}{n_1} + \frac {\sigma_2^2}{n_2}}\)

CI for \(\mu_1 - \mu_2\), both \(\sigma_1^2\) and \(\sigma_2^2\) are unknown but assumed equal

CI of \(\mu_1-\mu_2\): \((\bar x_1 - \bar x_2) \pm t_{\alpha/2} s_p \sqrt {\frac {1}{n_1} + \frac {1}{n_2}}\) with df = \(n_1+n_2-2\) where \(s_p = \sqrt {\frac {(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\)
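A short R sketch of this pooled-variance interval on made-up data (the numbers and the t.test() cross-check are mine, not from the notes):

set.seed(9)
x1 <- rnorm(12, mean = 10, sd = 2)
x2 <- rnorm(15, mean = 9, sd = 2)
n1 <- length(x1); n2 <- length(x2)
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))  # pooled SD
(mean(x1) - mean(x2)) + c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * sp * sqrt(1/n1 + 1/n2)
# t.test(x1, x2, var.equal = TRUE)$conf.int gives the same 95% interval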

CI for paired observations

Previous, we have two independent samples, now we have two dependent samples. We can use the difference between the two samples to construct a confidence interval.

CI of \(\mu_d\): \(\bar d \pm t_{\alpha/2} \frac {s_d}{\sqrt n}\) with df = n-1 where \(s_d\) is the sample standard deviation of the differences \(d_i = x_{1i} - x_{2i}\) and \(\bar d\) is the sample mean of the differences.

Estimating \(\sigma\)

a \(100(1-\alpha)\%\) CI for \(\sigma^2\) is \(\left(\frac{(n-1) S^2}{\chi_{\alpha / 2, n-1}^2}, \frac{(n-1) S^2}{\chi_{1-\alpha / 2, n-1}^2}\right)\)

where \(S^2\) is the sample variance and \(\chi_{\alpha / 2, n-1}^2\) and \(\chi_{1-\alpha / 2, n-1}^2\) are the critical values of the chi-square distribution with \(n-1\) degrees of freedom.

Estimating \(\sigma_1^2/ \sigma_2^2\)

A \(100(1-\alpha)\%\) CI for \(\frac{\sigma_1^2}{\sigma_2^2}\) using F-statistics with \(f_{1-\alpha/2}(n_1-1,n_2-1)=1/f_{\alpha/2}(n_1-1,n_2-1)\) is \[ \frac{s_1^2}{s_2^2} \frac{1}{f_{\alpha / 2}\left(n_1-1, n_2-1\right)}<\frac{\sigma_1^2}{\sigma_2^2}<\frac{s_1^2}{s_2^2} f_{\alpha / 2}\left(n_2-1, n_1-1\right) \] where \(f_{\alpha / 2}\left(v_1, v_2\right)\) is an \(F\)-value with \(v_1\) and \(v_2\) degrees of freedom, leaving an area of \(\alpha / 2\) to the right, and \(f_{\alpha / 2}\left(v_2, v_1\right)\) is a similar \(F\)-value with \(v_2\) and \(v_1\) degrees of freedom.
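A sketch of this interval in R using qf(), with made-up data; note that \(f_{\alpha/2}(v_1, v_2)\) in the notation above (area \(\alpha/2\) to the right) corresponds to qf(1 - alpha/2, v1, v2) in R:

set.seed(2)
x <- rnorm(15, sd = 2)
y <- rnorm(12, sd = 3)
alpha <- 0.05
ratio <- var(x) / var(y)                                        # s1^2 / s2^2
lower <- ratio / qf(1 - alpha/2, length(x) - 1, length(y) - 1)
upper <- ratio * qf(1 - alpha/2, length(y) - 1, length(x) - 1)
c(lower, upper)   # 95% CI for sigma1^2 / sigma2^2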

Hypothesis Testing

  • z-test

Suppose \(X_1, \cdots, X_n \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}\left(\mu, \sigma^2\right)\), where \(\mu\) is unknown and where \(\sigma^2=\sigma_0^2\) is known.

Suppose we wish to test \(H_0: \mu=\mu_0\) against \(H_1: \mu>\mu_0\). Then we can use the test statistic \[ Z=\frac{\bar{X}-\mu_0}{\sigma_0 / \sqrt{n}} \]

If \(H_0\) is true then \(Z \sim \mathcal{N}(0,1)\). Let \[ z_{\mathrm{obs}}=\frac{\bar{x}-\mu_0}{\sigma_0 / \sqrt{n}} \]

A large value of \(z_{\text {obs }}\) casts doubt on the validity of \(H_0\) and indicates a departure from \(H_0\) in the direction of \(H_1\). So the \(p\)-value for testing \(H_0\) against \(H_1\) is \[ \begin{aligned} p & =P\left(Z \geq Z_{\mathrm{obs}} \mid H_0\right) \\ & =P\left(\mathcal{N}(0,1) \geq Z_{\mathrm{obs}}\right) \\ & =1-\Phi\left(Z_{\mathrm{obs}}\right) \end{aligned} \]

The z-test of \(H_0: \mu=\mu_0\) against the alternative \(H_1: \mu<\mu_0\) is similar but this time a small, i.e. very negative, value of \(Z_{\text {obs }}\) casts doubt on \(H_0\). So the \(p\)-value is \[ \begin{aligned} p & =P\left(Z \leq Z_{\mathrm{obs}} \mid H_0\right) \\ & =P\left(\mathcal{N}(0,1) \leq Z_{\mathrm{obs}}\right) \\ & =\Phi\left(Z_{\mathrm{obs}}\right) \end{aligned} \]

Finally, consider testing \(H_0: \mu=\mu_0\) against the alternative \(H_1: \mu \neq\) \(\mu_0\). Let \(z_0=\left|z_{\text {obs }}\right|\). A large value of \(z_0\) indicates a departure from \(H_0\), so the \(p\)-value is \[ \begin{aligned} p & =P\left(|Z| \geq z_0 \mid H_0\right) \\ & =P\left(\mathcal{N}(0,1) \geq z_0\right)+P\left(\mathcal{N}(0,1) \leq-z_0\right) \\ & =2\left(1-\Phi\left(z_0\right)\right) \end{aligned} \]
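In R, \(\Phi\) is pnorm(), so the three p-values can be computed directly (z_obs below is just a hypothetical value for illustration):

z_obs <- 2.1                   # hypothetical observed z statistic
1 - pnorm(z_obs)               # H1: mu > mu0
pnorm(z_obs)                   # H1: mu < mu0
2 * (1 - pnorm(abs(z_obs)))    # H1: mu != mu0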

  • t-test

We can use the test statistic \[ T=\frac{\bar{X}-\mu_0}{S / \sqrt{n}} \]

If \(H_0\) is true then \(T \sim t_{n-1}\). Let \(t_{\mathrm{obs}}=t(x)=\frac{\bar{x}-\mu_0}{s / \sqrt{n}}\) and \(t_0=\left|t_{\mathrm{obs}}\right|\). Similarly to the z-test: for \(H_1: \mu>\mu_0\) the \(p\)-value is \(P\left(t_{n-1} \geq t_{\mathrm{obs}}\right)\); for \(H_1: \mu<\mu_0\) the \(p\)-value is \(P\left(t_{n-1} \leq t_{\mathrm{obs}}\right)\); for \(H_1: \mu \neq \mu_0\) the \(p\)-value is \(2 P\left(t_{n-1} \geq t_0\right)\).

Example of one-sample t-test

A marine biologist is studying a species of fish known to have an average length of 20 cm in ocean populations. A new population in a freshwater lake is being analyzed to determine if the environmental differences have altered the fish’s average length. The biologist measures the lengths of 10 randomly selected fish, yielding the following data:

22, 23, 21, 24, 22, 20, 25, 19, 23, 22

Assuming the data satisfy the assumption of normality, please address the following using a significance level of 0.1:

a

  • null hypothesis: The mean length of fish is 20 cm (\(H_0: \mu = 20\)).

  • alternative hypothesis: The mean length of fish is not 20 cm (\(H_1: \mu \ne 20\)).

b

Since the data is supposed to be normally distributed, the sampling distribution of the sample mean follows t-distribution. The t-test statistic is calculated as follows:

\[t = \frac{\bar x - \mu}{s / \sqrt n}\]

The t-test statistic is calculated as follows:

data2_4 <- c(22, 23, 21, 24, 22, 20, 25, 19, 23, 22)

x_bar <- mean(data2_4)

s <- sd(data2_4)

t <- (x_bar - 20) / (s / sqrt(length(data2_4)))

t
[1] 3.705882

The t-test statistic is 3.705882…

c

Using pt() function, the p-value is calculated as follows:

p_value <- 2 * pt(-t, df = 9)
p_value
[1] 0.004875954

The p-value is approximately 0.005. Since the p-value is less than 0.1, we reject the null hypothesis.

Therefore, there is sufficient evidence to conclude that the population mean is not equal to 20 which means the environmental differences have altered the fish’s average length.

d

Using qt() function to find the critical value for a two-tailed test with 90% confidence level:

t_critical <- qt(0.95, df = 9)
t_critical
[1] 1.833113

The critical value for a two-tailed test with 90% confidence level is about 1.833113.

Since the t-test statistic 3.705882 is greater than the critical value 1.833113, it falls in the critical region. Therefore, we reject the null hypothesis.

Also, we could use confidence interval to verify the result. The 90% confidence interval for the population mean is calculated as follows:

ci_4 <- c(x_bar - t_critical * s / sqrt(10), x_bar + t_critical * s / sqrt(10))
ci_4
[1] 21.06124 23.13876

So the 90% confidence interval for the population mean is about (21.1, 23.1).

The confidence interval does not contain the hypothesized value 20. Therefore, we reject the null hypothesis.

Therefore, there is sufficient evidence to conclude that the population mean is not equal to 20 which means the environmental differences have altered the fish’s average length.

Example of unpaired t-test

Suppose \(\sigma_1^2\) and \(\sigma_2^2\) are unknown but assumed equal. We want to test the null hypothesis \(H_0: \mu_1 = \mu_2\) against the alternative hypothesis \(H_1: \mu_1 \ne \mu_2\). The test statistic is given by

\[ t = \frac{\bar x_1 - \bar x_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

where \(s_p\) is the pooled sample standard deviation, given by

\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
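A minimal sketch of this test in R on made-up data; var.equal = TRUE matches the equal-variance assumption above:

set.seed(10)
g1 <- rnorm(12, mean = 20, sd = 2)
g2 <- rnorm(14, mean = 22, sd = 2)
t.test(g1, g2, var.equal = TRUE)   # pooled two-sample t-test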

Example of one-sample Variance Test

\(\chi^2\)-test for variance. Suppose \(X_1, \cdots, X_n\) are i.i.d. normal random variables with mean \(\mu\) and variance \(\sigma^2\). We want to test the null hypothesis \(H_0: \sigma^2 = \sigma_0^2\) against the alternative hypothesis \(H_1: \sigma^2 \ne \sigma_0^2\). The test statistic is given by

\[ \chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} \]
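A small R sketch of this test (the data and \(\sigma_0^2\) are made up; the two-sided p-value doubles the smaller tail probability):

set.seed(11)
x <- rnorm(20, mean = 0, sd = 1.5)
sigma0_sq <- 1                                   # hypothesized variance under H0
chi_sq <- (length(x) - 1) * var(x) / sigma0_sq   # test statistic
p_value <- 2 * min(pchisq(chi_sq, df = length(x) - 1),
                   1 - pchisq(chi_sq, df = length(x) - 1))
c(chi_sq, p_value)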

Example of two-sample Variance Test

Test of \(H_0: \sigma_1^2=\sigma_2^2\), with test statistic \(f=\frac{s_1^2}{s_2^2}\):

  • \(H_1: \sigma_1^2<\sigma_2^2\), rejection region \(f<f_\alpha\left(\nu_1, \nu_2\right)\)

  • \(H_1: \sigma_1^2>\sigma_2^2\), rejection region \(f>f_{1-\alpha}\left(\nu_1, \nu_2\right)\)

  • \(H_1: \sigma_1^2 \neq \sigma_2^2\), rejection region \(f<f_{\alpha / 2}\left(\nu_1, \nu_2\right)\) or \(f>f_{1-\alpha / 2}\left(\nu_1, \nu_2\right)\)

Here \(\nu_1=n_1-1\) and \(\nu_2=n_2-1\) are the two degrees of freedom.

\[ F=\frac{\frac{\left(n_1-1\right) S_1^2}{\sigma_1^2} /\left(n_1-1\right)}{\frac{\left(n_2-1\right) S_2^2}{\sigma_2^2} /\left(n_2-1\right)}=\frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} \]

If \(\sigma_1^2=\sigma_2^2\), we have \[ F=\frac{S_1^2}{S_2^2} \sim F_{n_1-1, n_2-1} \]
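In R, this F-test is carried out by var.test(); a minimal sketch with simulated samples:

set.seed(42)
x1 <- rnorm(20, mean = 10, sd = 2)   # simulated sample 1
x2 <- rnorm(25, mean = 10, sd = 3)   # simulated sample 2
var.test(x1, x2)                     # H0: sigma1^2 = sigma2^2; F = s1^2/s2^2 with (19, 24) df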

ANOVA– Analysis of Variance

  • one-way ANOVA

We need to test the null hypothesis that the group population means are all the same against the alternative that at least one group population mean differs from the others. That is,

\(H_0: \mu_1=\mu_2=\cdots=\mu_k\) against \(H_1: \text { at least one } \mu_i \text { differs from the others.}\)

ANOVA Table

Source DF Sum Sq Mean Sq F value p value
Factor m-1 11.84 (SS between) 2.9587 (MSB) 8.074 (MSB/MSW) \(5.38 \mathrm{e}-05\) (p-value)
Error n-m 16.49 (SS Within) 0.3664 (MSW)
Total n-1 28.33 (SS Total)

Source means “the source of the variation in the data.” the possible sources for a one-factor study are Factor, Residuals, and Total.

Factor means “the variability due to the factor of interest.” In a drug study, for example, the factor would be the drug administered; in a learning study, the factor would be the method of learning. Sometimes the row heading is labeled as Between.

Error (or Residuals) means “the variability within the groups” or “unexplained random error.” Sometimes the row heading is labeled as Within.

Total means “the total variation in the data from the grand mean”.

DF means “the degrees of freedom in the source.”

Sum Sq means “the sum of squares due to the source.”

Mean Sq means “the mean sum of squares due to the source.”

F value means “the F-statistic.”

P value means “the P-value.”

SS(Total)=SS(Between)+SS(Within), where

SS(Between) is the sum of squares between the group means and the grand mean. As the name suggests, it quantifies the variability between the groups of interest.

SS(Within) is the sum of squares between the data and the group means. It quantifies the variability within the groups of interest.

SS(Total) is the sum of squares between the n data points and the grand mean. As the name suggests, it quantifies the total variability in the observed data.
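In R, a one-way ANOVA table of this form is produced by aov(); a minimal sketch using the built-in PlantGrowth data (plant weight measured under one factor, group):

fit <- aov(weight ~ group, data = PlantGrowth)  # one-way ANOVA
summary(fit)  # rows: group (Between) and Residuals (Within), with Df, Sum Sq, Mean Sq, F value, Pr(>F)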

  • two-way ANOVA

We can extend the idea of a one-way ANOVA, which tests the effects of one factor on a response variable, to a two-way ANOVA which tests the effects of two factors and their interaction on a response variable.

Source DF Sum Sq Mean Sq F
Cells \(ab-1\) \(\mathrm{SS(Cells)}=n\sum_{i=1}^a \sum_{j=1}^b\left(\bar{X}_{ij\cdot}-\bar{X}_{\cdots}\right)^2\)
A \(a-1\) \(\mathrm{SS(A)}=bn\sum_{i=1}^a\left(\bar{X}_{i\cdot\cdot}-\bar{X}_{\cdots}\right)^2\) \(\mathrm{MS(A)}=\mathrm{SS(A)}/(a-1)\) \(\mathrm{MS(A)}/\mathrm{MS(Error)}\)
B \(b-1\) \(\mathrm{SS(B)}=an\sum_{j=1}^b\left(\bar{X}_{\cdot j\cdot}-\bar{X}_{\cdots}\right)^2\) \(\mathrm{MS(B)}=\mathrm{SS(B)}/(b-1)\) \(\mathrm{MS(B)}/\mathrm{MS(Error)}\)
\(\mathrm{A} \times \mathrm{B}\) \((a-1)(b-1)\) \(\mathrm{SS(AB)}=\mathrm{SS(Cells)}-\mathrm{SS(A)}-\mathrm{SS(B)}\) \(\mathrm{MS(AB)}=\mathrm{SS(AB)}/[(a-1)(b-1)]\) \(\mathrm{MS(AB)}/\mathrm{MS(Error)}\)
Error \(ab(n-1)\) \(\mathrm{SS(Error)}=\sum_{i=1}^a \sum_{j=1}^b \sum_{l=1}^n\left(X_{ijl}-\bar{X}_{ij\cdot}\right)^2\) \(\mathrm{MS(Error)}=\mathrm{SS(Error)}/[ab(n-1)]\)
Total \(abn-1\) \(\mathrm{SS(Total)}=\sum_{i=1}^a \sum_{j=1}^b \sum_{l=1}^n\left(X_{ijl}-\bar{X}_{\cdots}\right)^2\)
  • \(F=\frac{\mathrm{MS}(\mathrm{A})}{\mathrm{MS}(\text { Error })}\), for \(H_0\) : no effect of factor A on response variable,

  • \(F=\frac{\mathrm{MS}(\mathrm{B})}{\mathrm{MS}(\text{Error})}\), for \(H_0\) : no effect of factor B on response variable,

  • \(F=\frac{\mathrm{MS}(\mathrm{A} \times \mathrm{B})}{\mathrm{MS}(\text { Error })}\), for \(H_0\) : no effect of interaction on response variable.

We reject any \(H_0\) if \(F \geq F_{\text {critical }}\); otherwise, we do not reject \(H_0\).

Example of two-way ANOVA

Two-way ANOVA. In this question, we will use the built-in R data set ToothGrowth to perform two-way ANOVA test. ToothGrowth includes information from a study on the effects of vitamin C on tooth growth in Guinea pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC). Assuming the data satisfy the assumptions of normality and equal variance, please address the following using a significance level of 0.05

a

  • The effects of vitamin C on tooth growth in guinea pigs:

  • null hypothesis: \(H_0\): mean tooth growth for all doses of vitamin C are equal

  • alternative hypothesis: \(H_1\): at least one of the means of all doses of vitamin C is different from the others

  • The effects of delivery method on tooth growth in guinea pigs:

  • null hypothesis: \(H_0\): mean tooth growth for the delivery method of orange juice and ascorbic acid are equal.

  • alternative hypothesis \(H_1\): mean tooth growth for the delivery method of orange juice and ascorbic acid are different.

  • The interaction effects of the dose of vitamin C and delivery method on tooth growth in guinea pigs:

  • null hypothesis: \(H_0:\) there is no interaction between the dose of vitamin C and delivery method on tooth growth in guinea pigs, meaning that the relationship between vitamin C and tooth growth is the same for both delivery methods (similarly, the relationship between delivery method and tooth growth is the same for all doses of vitamin C).

  • alternative hypothesis: \(H_1\): there is an interaction between the dose vitamin C and delivery method on tooth growth in guinea pigs, meaning that the relationship between vitamin C and tooth growth is different for both delivery methods (similarly, the relationship between delivery method and tooth growth depends on the dose of vitamin C).

b

We can plot the relationship one by one using two plots

library(ggplot2)
data(ToothGrowth)
head(ToothGrowth)
   len supp dose
1  4.2   VC  0.5
2 11.5   VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6 10.0   VC  0.5
# potential effects of vitamin C on tooth growth.

ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
  geom_boxplot() +
  labs(x = "Dose (mg/day)", y = "Tooth Growth (len)", title = "Tooth Growth by Dose of vitamin C")

# potential effects of delivery method on tooth growth.

ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot() +
  labs(x = "Delivery Method", y = "Tooth Growth (len)", title = "Tooth Growth by Delivery Method") 

or just one:

library(ggplot2)

# potential effects of vitamin C and delivery method. 

# OJ represents orange juice and VC represents ascorbic acid.

ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() +
  labs(x = "Dose (mg/day)", y = "Tooth Growth (len)", title = "Tooth Growth by Dose and Delivery Method") 

c

# Perform two-way ANOVA

anova_result <- aov(len ~ supp * dose, data = ToothGrowth)

summary(anova_result)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  12.317 0.000894 ***
dose         1 2224.3  2224.3 133.415  < 2e-16 ***
supp:dose    1   88.9    88.9   5.333 0.024631 *  
Residuals   56  933.6    16.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since all p-values are less than 0.05, we reject all null hypotheses. Therefore, there is sufficient evidence to conclude that the dose of vitamin C, delivery method, and their interaction have significant effects on tooth growth in guinea pigs.

d

library(car)
Loading required package: carData
qqPlot(anova_result$residuals, main = "QQ-plot of residuals")

[1] 22 50

Non-parametric tests

Application of testing the goodness of fit

Testing whether there is a “good fit” between the observed data and the assumed probability model amounts to testing \(H_0\): the data follow the assumed probability model (the category probabilities equal their hypothesized values) against \(H_1\): they do not.

Construction of test statistics with an example of 2 categories

Population is 60% female and 40% male. Then, if a sample of 100 students yields 53 females and 47 males, can we conclude that the sample is (random and) representative of the population? That is, how “good” do the data “fit” the assumed probability model of 60% female and 40% male?

Here, let \(Y_1\) denote the number of females selected, \(Y_1 \sim B(n,p_1)\) and let \(Y_2\) denote males selected, \(Y_2 = (n-Y_1)\sim B(n,p_2)=B(n,1-p_1)\).

For samples satisfying the general rule of thumb (the expected number of successes and the expected number of failures must each be at least 5), we can use the normal approximation to the binomial distribution. The test statistic is given by

\[ Z = \frac{Y_1 - np_1}{\sqrt{np_1(1-p_1)}}\sim N(0,1) \]

which is at least approximately normally distributed.

and \[ Z^2 =Q_1= \frac{(Y_1 - np_1)^2}{np_1(1-p_1)}\sim \chi^2(1) \]

which is an approximate chi-square distribution with one degree of freedom.

Now we can multiply \(Q_1\) by 1 = \((1-p_1)+p_1\) to get

\[ Q_1 = \frac{(Y_1 - np_1)^2(1-p_1)}{np_1(1-p_1)} + \frac{(Y_1 - np_1)^2p_1}{np_1(1-p_1)}\sim \chi^2(1) \]

Since \(Y_1=n-Y_2\) and \(p_1=1-p_2\), after simplifying, we have

\[ Q_1 = \frac{(Y_1-np_1)^2}{np_1} + \frac{(-(Y_2-np_2))^2}{np_2}\sim \chi^2(1) \]

which is \(Q_1=\sum_{i=1}^2 \frac{\left(Y_i-n p_i\right)^2}{n p_i}=\sum_{i=1}^2 \frac{(\text { Observed }- \text { Expected })^2}{\text { Expected }}\sim \chi^2(1)\)

Hence, if the observed counts are very different from the expected counts, the test statistic will be large. We therefore reject the null hypothesis when \(Q_1\) is large, with the cutoff determined by the critical value of the chi-square distribution with one degree of freedom.

The statistic \(Q_1\) is called the chi-square goodness-of-fit statistic.

Going back to the example,

  • \(H_0\):\(p_F = 0.6\)

  • \(H_1\):\(p_F \ne 0.6\)

we can calculate the test statistic using a significance level of \(\alpha = 0.05\) (critical value \(\chi^2_{0.05,1}=3.84\)) as follows:

\[ Q_1 = \frac{(53-60)^2}{60} + \frac{(47-40)^2}{40} = 2.04 \]

Since \(Q_1=2.04<3.84\), we do not reject the null hypothesis. Therefore, we conclude that the sample is (random and) representative of the population.

This can be extended to k categories

Construction of test statistics with an example of k categories

For categories more than 2, i.e.

Categories 1 2 \(\cdots\) \(k-1\) \(k\)
Observed \(Y_1\) \(Y_2\) \(\cdots\) \(Y_{k-1}\) \(n-Y_1-Y_2-\cdots-Y_{k-1}\)
Expected \(n p_1\) \(n p_2\) \(\cdots\) \(n p_{k-1}\) \(n p_k\)

Karl Pearson showed that the chi-square statistic \(Q_{k-1}\) defined as: \[ Q_{k-1}=\sum_{i=1}^k \frac{\left(Y_i-n p_i\right)^2}{n p_i} \] follows approximately a chi-square random variable with \(k-1\) degrees of freedom. Let’s try it out on an example.

  • Example:
Categories Brown Yellow Orange Green Coffee Total
Observed \(y_i\) 224 119 130 48 59 580
Assumed \(H_0\left(p_i\right)\) 0.4 0.2 0.2 0.1 0.1 1.0
Expected \(n p_i\) 232 116 116 58 58 580

\(Q_4=\frac{(224-232)^2}{232}+\frac{(119-116)^2}{116}+\frac{(130-116)^2}{116}+\frac{(48-58)^2}{58}+\frac{(59-58)^2}{58}=3.784\)

Because there are \(k=5\) categories, we compare our chi-square statistic \(Q_4\) to a chi-square distribution with \(k-1=5-1=4\) degrees of freedom. Since \[ Q_4=3.784<\chi_{4,0.05}^2=9.488 \] we fail to reject the null hypothesis.
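The same calculation can be done in R with chisq.test(), passing the observed counts and the probabilities assumed under \(H_0\):

observed <- c(224, 119, 130, 48, 59)
p0 <- c(0.4, 0.2, 0.2, 0.1, 0.1)
chisq.test(observed, p = p0)   # X-squared ≈ 3.784, df = 4, p-value ≈ 0.44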

Application of testing for homogeneity

This is to look at a method for testing whether two or more multinomial distributions are equal.

  • Example:

Test the hypothesis that the acceptances of males and females are distributed equally among the four schools:

(Acceptances) Bus Eng L Arts Sci (FIXED) Total
Male 240 (20%) 480 (40%) 120 (10%) 360 (30%) 1200
Female 240 (30%) 80 (10%) 320 (40%) 160 (20%) 800
Total 480 (24%) 560 (28%) 440 (22%) 520 (26%) 2000

Here,

\(H_0: p_{M B}=p_{F B}, p_{M E}=p_{F E}, p_{M L}=p_{F L}\), and \(p_{M S}=p_{F S}\)

\(H_1: p_{M B} \neq p_{F B}\) or \(p_{M E} \neq p_{F E}\) or \(p_{M L} \neq p_{F L}\), or \(p_{M S} \neq p_{F S}\)

where:

  • \(p_{M j}\) is the proportion of males accepted into school \(j=B, E, L, S\).

  • \(p_{F j}\) is the proportion of females accepted into school \(j=B, E, L, S\).

In conducting such a hypothesis test, we’re comparing the proportions of two multinomial distributions.

#(Acc) Bus ( \(j=1\) ) Eng ( \(j=2\) ) L Arts ( \(j=3\) ) Sci \((j=4)\) (FIXED) Total
\(\mathrm{M}(i=1)\) \(y_{11}\left(\hat{p}_{11}\right)\) \(y_{12}\left(\hat{p}_{12}\right)\) \(y_{13}\left(\hat{p}_{13}\right)\) \(y_{14}\left(\hat{p}_{14}\right)\) \(n_1=\sum_{j=1}^k y_{1 j}\)
F \((i=2)\) \(y_{21}\left(\hat{p}_{21}\right)\) \(y_{22}\left(\hat{p}_{22}\right)\) \(y_{23}\left(\hat{p}_{23}\right)\) \(y_{24}\left(\hat{p}_{24}\right)\) \(n_2=\sum_{j=1}^k y_{2 j}\)
Total \(y_{11}+y_{21}\left(\hat{p}_1\right)\) \(y_{12}+y_{22}\left(\hat{p}_2\right)\) \(y_{13}+y_{23}\left(\hat{p}_3\right)\) \(y_{14}+y_{24}\left(\hat{p}_4\right)\) \(n_1+n_2\)

The chi-square test statistic for testing the equality of two multinomial distributions: \[ Q=\sum_{i=1}^2 \sum_{j=1}^k \frac{\left(y_{i j}-n_i \hat{p}_j\right)^2}{n_i \hat{p}_j} \] follows an approximate chi-square distribution with \(k-1\) degrees of freedom (if the male and female distributions were truly equal, the expected count in cell \((i,j)\) would be \(n_i\hat p_j\)). Reject the null hypothesis of equal proportions if \(Q\) is large: \[ Q \geq \chi_{\alpha, k-1}^2 \]

(the derivation of the above \(Q\) is omitted here)

Generally,

\[ Q=\sum_{i=1}^h \sum_{j=1}^k \frac{\left(y_{i j}-n_i \hat{p}_j\right)^2}{n_i \hat{p}_j}\sim \chi^2_{(h-1)(k-1)} \]

Further example

The head of a surgery department at a university medical center was concerned that surgical residents in training applied unnecessary blood transfusions at a different rate than the more experienced attending physicians. Therefore, he ordered a study of the 49 Attending Physicians and 71 Residents in Training with privileges at the hospital. For each of the 120 surgeons, the number of blood transfusions prescribed unnecessarily in a one-year period was recorded. Based on the number recorded, a surgeon was identified as either prescribing unnecessary blood transfusions Frequently, Occasionally, Rarely, or Never. Here’s a summary table (or “contingency table”) of the resulting data:

Physician Frequent Occasionally Rarely Never Total
Attending 2 3 31 13 49
Resident 15 28 23 5 71
Total 17 31 54 18 120

Here,

\(H_0: p_{R F}=p_{A F}, p_{R O}=p_{A O}, p_{R R}=p_{A R}\), and \(p_{R N}=p_{A N}\)

\(H_1: p_{R F} \neq p_{A F}\) or \(p_{R O} \neq p_{A O}\) or \(p_{R R} \neq p_{A R}\), or \(p_{R N} \neq p_{A N}\)

We should also calculate the expected counts under the null hypothesis. The expected counts are calculated as follows:

Physician Frequent Occasionally Rarely Never Total
Attending 6.942 12.658 22.05 7.35 49
Resident 10.058 18.342 31.95 10.65 71
Total 17 31 54 18 120

where, for example, \(6.942=\frac{17}{120} \times 49\) and \(10.058=\frac{17}{120} \times 71\). Now that we have the observed and expected counts, calculating the chi-square statistic is a straightforward exercise: \[ Q=\frac{(2-6.942)^2}{6.942}+\cdots+\frac{(5-10.65)^2}{10.65}=31.88 \]

The chi-square test rejects the null hypothesis, at the 0.05 level, if \(Q\) exceeds the critical value of a chi-square random variable with 3 degrees of freedom. Since \(Q=31.88>7.815\), we reject the null hypothesis.
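As a check, chisq.test() applied to the 2 × 4 table of observed counts above reproduces this statistic:

observed <- matrix(c(2, 3, 31, 13,
                     15, 28, 23, 5),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Attending", "Resident"),
                                   c("Frequent", "Occasionally", "Rarely", "Never")))
chisq.test(observed)  # X-squared ≈ 31.88, df = 3, p-value far below 0.05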

Application of testing for independence

This is to look at whether two or more categorical variables are independent.

(Previously, the sampling scheme involved taking two random (and therefore independent) samples with \(n_1\) and \(n_2\) fixed in advance, and observing into which of the \(k\) categories the first sample falls and into which of the \(k\) categories the second sample falls.)

Let's consider a different example to illustrate an alternative sampling scheme. Suppose 395 people are randomly selected and are “cross-classified” into one of eight cells, depending on which age category they fall into and whether or not they support legalizing marijuana:

(the sampling scheme involves: Taking one random sample of size n, with n fixed in advance, and then “cross-classifying” each subject into one and only one of the mutually exclusive and exhaustive \(A_i\cap B_j\) cells.)

Marijuana Support (Variable A) Variable B (Age)
OBSERVED \((18-24)\, B_1\) \((25-34)\, B_2\) \((35-49)\, B_3\) \((50-64)\, B_4\) Total
(YES) \(A_1\) 60 54 46 41 201
(NO) \(A_2\) 40 44 53 57 194
Total 100 98 99 98 \(n=395\)

Here,

H0: Variable A is independent of Variable B, that is, \(P(A_i\cap B_j)=P(A_i) \times P(B_j)\) for all i and j

H1: Variable A is not independent of variable B.

Generally,

Suppose we have \(k\) (column) levels of Variable B indexed by the letter \(j\), and \(h\) (row) levels of Variable \(A\) indexed by the letter \(i\). Then, we can summarize the data and probability model in tabular format, as follows:

Variable B
Variable A \(B_1(j=1)\) \(B_2(j=2)\) \(B_3(j=3)\) \(B_4(j=4)\) Total
\(A_1(i=1)\) \(Y_{11}\left(p_{11}\right)\) \(Y_{12}\left(p_{12}\right)\) \(Y_{13}\left(p_{13}\right)\) \(Y_{14}\left(p_{14}\right)\) \(\left(p_{1 .}\right)\)
\(A_2(i=2)\) \(Y_{21}\left(p_{21}\right)\) \(Y_{22}\left(p_{22}\right)\) \(Y_{23}\left(p_{23}\right)\) \(Y_{24}\left(p_{24}\right)\) \(\left(p_{2 .}\right)\)
Total \(\left(p_{.1}\right)\) \(\left(p_{.2}\right)\) \(\left(p_{.3}\right)\) \(\left(p_{.4}\right)\) \(n\)

where \(p_{i j}=Y_{i j} / n, p_i .=\sum_{j=1}^k p_{i j}\), and \(p_{. j}=\sum_{i=1}^h p_{i j}\)

\(Q=\sum_{j=1}^k \sum_{i=1}^h \frac{\left(y_{i j}-\frac{y_i \cdot y_{\cdot j}}{n}\right)^2}{\frac{y_i \cdot y_{\cdot j}}{n}}\sim \chi^2_{(h-1)(k-1)}\)

Are chi-square statistic for homogeneity and the chi-square statistic for independence equivalent?

Although their chi-square statistics are equivalent, the two tests are not equivalent since their sampling experiment designs are different.

Here’s the table of expected counts:

Marijuana Support (Variable A) Variable B (Age)
Variable A EXPECTED 18-24 25-34 35-49 50-64 Total
YES 50.886 49.868 50.377 49.868 201
NO 49.114 48.132 48.623 48.132 194
Total 100 98 99 98 395

\[ Q=\frac{(60-50.886)^2}{50.886}+\cdots+\frac{(57-48.132)^2}{48.132}=8.006 \]

The chi-square test tells us to reject the null hypothesis, at the 0.05 level, since \(Q\) is greater than a chi-square random variable with 3 degrees of freedom, that is, \(Q=8.006>7.815\).
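A quick verification in R: the expected counts under independence are just the row totals times the column totals divided by \(n\), and chisq.test() reproduces \(Q\):

observed <- matrix(c(60, 54, 46, 41,
                     40, 44, 53, 57), nrow = 2, byrow = TRUE)
outer(rowSums(observed), colSums(observed)) / sum(observed)  # expected counts (50.886, ...)
chisq.test(observed)$statistic                               # X-squared ≈ 8.006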

Summary

Parametric tests make assumptions that aspects of the data follow some sort of theoretical probability distribution. Non-parametric tests or distribution free methods do not, and are used when the distributional assumptions for a parametric test are not met. While this is an advantage, it often comes at a cost of power (in the sense they are less likely to be able to detect a difference when a true difference exists).

Most non-parametric tests are just hypothesis tests; there is no estimation of a confidence interval.

Most non-parametric methods are based on ranking the values of a variable in ascending order and then calculating a test statistic based on the sums of these ranks.

Non-parametric tests include:

  • Two-sample independent t-test - Wilcoxon rank-sum test or Mann-Whitney U test

  • Paired t-test - Wilcoxon signed-rank test

  • One-way ANOVA - Kruskal-Wallis test

  • Normality tests - Shapiro-Wilk test and Kolmogorov-Smirnov test
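For example, the rank-based analogue of the two-sample t-test is wilcox.test(); a minimal sketch with hypothetical samples x and y:

x <- c(1.8, 2.1, 2.5, 3.0, 3.2)    # hypothetical group 1
y <- c(2.9, 3.4, 3.6, 4.1, 4.4)    # hypothetical group 2
wilcox.test(x, y)                  # Wilcoxon rank-sum / Mann-Whitney U test
# wilcox.test(x, y, paired = TRUE) gives the signed-rank test for paired data
# kruskal.test() is the rank-based analogue of one-way ANOVA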

Linear Regression

Simple linear regression

Brief introduction

Some people think simple methods are bad and prefer complicated ones, but simple is actually very good: SLR works very well in lots of situations.

SLR is used to answer questions such as:

  • Is there a relationship between the predictor and the response?

  • How strong is the relationship?

  • Which variables contribute to the relationship?

  • How accurately can we predict the response variable?

  • Is the relationship linear?

  • Is there a synergy (interaction) among the independent variables?

This is a model with two random variables, X and Y, where we are trying to predict Y from X. Here are the model’s assumptions:

  • The distribution of X is arbitrary; it may even be non-random;

  • If X=x, then \(Y = \beta_0+\beta_1x+\epsilon\) for some constants \(\beta_0, \beta_1\) and some random noise variable \(\epsilon\)

  • \(\epsilon\) has mean 0, a constant variance \(\sigma^2\), and is uncorrelated with X and uncorrelated across observations Cov\((\epsilon_i, \epsilon_j)=0\) for \(i \ne j\)

Using Least Squares, we can estimate \(\hat \beta_0\) and \(\hat \beta_1\), which are unbiased estimates of \(\beta_0\) and \(\beta_1\).

  • Gaussian-Noise Simple Linear Regression Model

Now we further assume that the distribution of \(\epsilon\) is normal, i.e. \(\epsilon \sim N(0, \sigma^2)\), independent of X.

These assumptions tell us, exactly, the probability distribution of Y given X, and so let us derive exact distributions for predictions and for other inferential statistics.

Maximum Likelihood Estimation (MLE)

Introduction to MLE

Likelihood is a fundamental concept in statistics that measures how well a particular set of parameters (e.g., the mean of a distribution) explains observed data. Think of it as a “score” that tells you which parameter values make your data most plausible.

Compared to probability, which answers: “What’s the chance of seeing this data if we assume specific parameters?” , likelihood answers: “Given this data, how plausible are these parameters?”

If the parameters are \(b_0, b_1, s^2\) (reserving the Greek letters for their true values), then \(Y \mid X=x \sim N\left(b_0+b_1 x, s^2\right)\), and \(Y_i\) and \(Y_j\) are independent given \(X_i\) and \(X_j\), so the overall likelihood is \[ \prod_{i=1}^n \frac{1}{\sqrt{2 \pi s^2}} e^{-\frac{\left(y_i-\left(b_0+b_1 x_i\right)\right)^2}{2 s^2}} \]

As usual, we work with the log-likelihood, which gives us the same information but replaces products with sums: \[ L\left(b_0, b_1, s^2\right)=-\frac{n}{2} \ln \left(2 \pi s^2\right)-\frac{1}{2 s^2} \sum_{i=1}^n\left(y_i-\left(b_0+b_1 x_i\right)\right)^2 \]

maximize it:

\(\begin{aligned} \frac{\partial L}{\partial b_0} & =-\frac{1}{2 s^2} \sum_{i=1}^n 2\left(y_i-\left(b_0+b_1 x_i\right)\right)(-1) \\ \frac{\partial L}{\partial b_1} & =-\frac{1}{2 s^2} \sum_{i=1}^n 2\left(y_i-\left(b_0+b_1 x_i\right)\right)\left(-x_i\right)\end{aligned}\)

Same result of MLE as least squares in linear regression

Notice that when we set these derivatives to zero, all the multiplicative constants - in particular, the prefactor of \(1 / 2 s^2\) - go away. We are left with \[ \begin{aligned} \sum_{i=1}^n\left(y_i-\left(\hat{\beta_0}+\hat{\beta_1} x_i\right)\right) & =0 \\ \sum_{i=1}^n\left(y_i-\left(\hat{\beta_0}+\hat{\beta_1} x_i\right)\right) x_i & =0 \end{aligned} \]

These are, up to a factor of \(1 / n\), exactly the equations we got from the method of least squares. That means that the least squares solution is the maximum likelihood estimate under the Gaussian noise model.

Maximum likelihood estimates of the regression curve coincide with least-squares estimates when the noise around the curve is additive, Gaussian, of constant variance, and both independent of \(X\) and of other noise terms. If any of those assumptions fail, maximum likelihood and least squares estimates can diverge.
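A quick numerical check of this equivalence (a sketch on simulated data; names and values are purely illustrative): maximizing the Gaussian log-likelihood with optim() recovers essentially the same coefficients as lm().

set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)           # simulated data
negloglik <- function(par) {                   # par = (b0, b1, log s)
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
fit_mle <- optim(c(0, 0, 0), negloglik)
fit_mle$par[1:2]                               # MLE of (b0, b1)
coef(lm(y ~ x))                                # least-squares estimates, essentially identical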

Hypothesis testing for estimates with unknown \(\sigma^2\)

Residual sum of squares (RSS) and t-statistics construction in linear regression

It can be shown that (the proof is beyond the scope of this course) \[ \frac{R S S}{\sigma^2}=\frac{(n-2) \hat{\sigma}^2}{\sigma^2} \sim \chi_{n-2}^2 \]

This allows us to construct a \(t\)-value \[ t=\frac{\hat{\beta}-\beta}{s_{\hat{\beta}}} \sim t_{n-2} \]

Under the normality assumption of the error terms, the estimator of the slope coefficient will itself be normally distributed with mean \(\beta_i\) and variance \(\operatorname{Var}\left[\hat{\beta_i}\right]\). For \(\hat{\beta_1}\), its mean is \(\beta_1\) and its variance is \(\sigma^2 / \sum\left(x_i-\bar{x}\right)^2\). When \(\sigma^2\) is known, we know \[ \frac{\hat{\beta}_1-\beta_1}{\frac{\sigma}{\sqrt{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}}} \] follows a standard normal distribution. However, in practice, \(\sigma^2\) is often unknown. We then divide this standard-normal term by \[ \sqrt{\frac{(n-2) \hat{\sigma}^2}{(n-2) \sigma^2}}=\frac{\hat{\sigma}}{\sigma} \]

Therefore, when we write \[ s_{\hat{\beta_1}}=\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}} \] we construct a \(t\)-statistic for \(\hat{\beta_1}\) with degrees of freedom \(n-2\). This then allows us to construct a \(100(1-\alpha) \%\) confidence interval for \(\beta_1\) : \[ \hat{\beta_1} \pm t_{n-2, \alpha / 2} \times \boldsymbol{s}_{\hat{\beta_1}} \]

We can also do similar calculation to get the \(t\)-statistic and confidence interval for \(\beta_0\).

Hypothesis Testing

\[ t =\frac{\hat {\beta_i}-\beta_i}{s_{\hat{\beta_i}}} \sim t_{n-2} \] \(H_0: \beta_i =0\)

\(H_1: \beta_i \ne 0\)

Code for linear regression with confidence intervals

lmodel <- lm(Petal.Length ~ Petal.Width, data=iris)
summary(lmodel)

Call:
lm(formula = Petal.Length ~ Petal.Width, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.33542 -0.30347 -0.02955  0.25776  1.39453 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.08356    0.07297   14.85   <2e-16 ***
Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4782 on 148 degrees of freedom
Multiple R-squared:  0.9271,    Adjusted R-squared:  0.9266 
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
confint(lmodel)
                2.5 %   97.5 %
(Intercept) 0.9393664 1.227750
Petal.Width 2.1283752 2.331506
conf_interval <- predict(lmodel, newdata=iris, interval='confidence', level=0.95) # newdata= (not data=) supplies the points at which to predict
plot(iris$Petal.Width, iris$Petal.Length,
     xlab='Petal.Width', ylab='Petal.Length',
     main='Simple Linear Regression')
abline(lmodel, col='lightblue')
matlines(iris$Petal.Width, conf_interval[,2:3], col='blue', lty=2)

# Using ggplot2
library(ggplot2)
ggplot(iris, aes(x=Petal.Width, y=Petal.Length)) +
  geom_point() +
  geom_smooth(method=stats::lm, se=T, level=0.95)
`geom_smooth()` using formula = 'y ~ x'

Linear Regression and ANOVA

fev_dat <- read.table('fev_dat.txt', header=T)
fev_dat_subset <- fev_dat[fev_dat$age >= 6 & fev_dat$age <= 10,]
ggplot(fev_dat_subset, aes(x=age, y=FEV)) +
  geom_point() +
  geom_smooth(method=stats::lm, se=T, level=0.95)
summary(aov(FEV ~ age, data=fev_dat_subset))
summary(lm(FEV ~ age, data=fev_dat_subset))
anova(lm(FEV ~ age, data=fev_dat_subset))

\(R^2\)–the fraction of variability explained by the regression

\[ R^2 =1-\frac{RSS}{SSTO} \] where \(RSS\) is the residual sum of squares and \(SSTO\) is the total sum of squares about the mean.
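As a quick check against the value R reports, the same quantity can be computed by hand from the lmodel fit in the earlier iris example:

rss  <- sum(resid(lmodel)^2)                                  # residual sum of squares
ssto <- sum((iris$Petal.Length - mean(iris$Petal.Length))^2)  # total sum of squares
1 - rss / ssto                                                # ≈ 0.9271
summary(lmodel)$r.squared                                     # same value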

Multiple linear regression (MLR)

\(y=X\beta+\epsilon\), where y is an n × 1 column vector of responses, X is an n × (k + 1) design matrix, and β is a (k + 1) × 1 column vector of coefficients, for all n observations.

A potential problem in practice –multicollinearity

When multicollinearity exists, any of the following pitfalls can be exacerbated:

  • The estimated regression coefficient of any one variable depends on which other predictors are included in the model

  • The precision of the estimated regression coefficients decreases as more predictors are added to the model

  • The marginal contribution of any one predictor variable in reducing the error sum of squares depends on which other predictors are already in the model

  • Hypothesis tests for βj = 0 may yield different conclusions depending on which predictors are in the model

Perfect multicollinearity

Perfect multicollinearity refers to a situation where the predictive variables have an exact linear relationship. When there is perfect collinearity, the design matrix X has less than full rank, and therefore the moment matrix X′X cannot be inverted. In this situation, the parameter estimates of the regression are not well-defined, as the system of equations has infinitely many solutions.
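A tiny illustration (hypothetical data): if one predictor is an exact linear combination of another, lm() cannot separate their effects and reports NA for the aliased coefficient:

x1 <- 1:10
x2 <- 2 * x1                  # exactly collinear with x1
y  <- 1 + x1 + rnorm(10)
coef(lm(y ~ x1 + x2))         # the coefficient for x2 is NA (aliased)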

Imperfect multicollinearity

Imperfect multicollinearity refers to a situation where the predictive variables have a nearly exact linear relationship.

For simple linear regression, \(R^2=r^2\), where r is the Pearson correlation coefficient between the predictor and the response.

Adjusted R-squared

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a model. It provides a more accurate measure of the model’s explanatory power, penalizing for the addition of irrelevant predictors. This helps in comparing models with different numbers of predictors.
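Concretely (a standard formula, stated here for reference), with \(n\) observations and \(p\) predictors:

\[ R_{\mathrm{adj}}^2 = 1-\left(1-R^2\right) \frac{n-1}{n-p-1} \]

so \(R_{\mathrm{adj}}^2\) increases only when a newly added predictor improves the fit by more than would be expected by chance.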

Logistic Regression

This is a regression of categorical outcome.

In regression analysis with a categorical outcome, such as predicting a binary variable (yes or no), simple linear regression is not ideal. This is because:

  • The predicted values may fall outside the range of 0 to 1, which is not meaningful for probabilities.

  • Small changes in the predictors can lead to relatively small fluctuations in the predicted probabilities near the 0.5 mark (natural threshold), which is actually where decision-making is most critical.

So we need an S-shaped curve (the logistic function) that satisfies both requirements.

curve(1 / (1 + exp(-x)), from = -6, to = 6, xlab = "x", ylab = "f(x)",
main = "Logistic Function: f(x) = 1 / (1 + exp(-x))", col = "blue", lwd = 2)

An example of a logistic model with poor predictive power (stock market direction is hard to predict)

library(ISLR2)
attach(Smarket)
glm.fits <- glm(
    Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
    data = Smarket, family = binomial
  )
summary(glm.fits)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
        1         2         3         4         5         6         7         8 
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
        9        10 
0.5176135 0.4888378 
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"
table(glm.pred, Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
(507 + 145) / 1250
[1] 0.5216
mean(glm.pred == Direction)
[1] 0.5216

Odds

Odds are another way of quantifying the probability of an event, commonly used in gambling (and logistic regression).

For some event \(E\), \[ \operatorname{odds}(E)=\frac{P(E)}{P\left(E^c\right)}=\frac{P(E)}{1-P(E)} \]

The odds ratio (OR) is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. If the probabilities of the event in the two groups are \(p_1\) (first group) and \(p_2\) (second group), then the odds ratio is: \[ \mathrm{OR}=\frac{p_1 /\left(1-p_1\right)}{p_2 /\left(1-p_2\right)} \]
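In logistic regression the fitted coefficients are on the log-odds scale, so exponentiating a coefficient gives the odds ratio associated with a one-unit increase in that predictor; for the Smarket fit above:

exp(coef(glm.fits))   # odds ratios per one-unit increase in each predictor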

Generalized Linear Models

Under the hood, we’re still using a linear model \(\left(\beta_0+\beta_1 x\right)\), but now it’s embedded in a function that ensures valid probabilities. This is the essence of logistic regression - a generalized linear model (GLM) designed for binary outcomes.

Given a predictor \(X\) and an outcome \(Y\), a GLM is defined by three components:

  • A random component, which specifies a distribution for \(Y \mid X\)

  • A systematic component, which relates a parameter \(\eta\) to the predictor \(X\): \[ \eta=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k \]

  • A link function, which connects the random and systematic components

Random Component

The random component specifies a distribution for the outcome variable (conditional on \(X\) ). In the case of linear regression, we assume that \(Y \mid X \sim \mathcal{N}\left(\mu, \sigma^2\right)\), for some mean \(\mu\) and variance \(\sigma^2\). In the case of logistic regression, we assume that \(Y \mid X \sim \operatorname{Bern}(p)\) for some probability \(p\).

In a generalized model, we are allowed to assume that \(Y \mid X\) has a probability density function or probability mass function of the form \[ f(y ; \theta, \phi)=\exp \left(\frac{y \theta-b(\theta)}{a(\phi)}+c(y, \phi)\right) \]

Here \(\theta, \phi\) are parameters, and \(a, b, c\) are functions. Any density of the above form is called an exponential family density. The parameter \(\theta\) is called the natural parameter, and the parameter \(\phi\) the dispersion parameter.

Exponential Family

Exponential families include many of the most common distributions. For example:

  • Exponential: \[ f(y ; \lambda)=\lambda e^{-\lambda y}=\exp (-y \lambda+\ln \lambda) \] where \(\theta=-\lambda\), \(\phi=1\), \(b(\theta)=-\ln \lambda=-\ln(-\theta)\), \(a(\phi)=1\), and \(c(y, \phi)=0\)

  • Poisson: \[ f(y ; \lambda)=\frac{e^{-\lambda} \lambda^y}{y!}=\exp (y \ln \lambda-\lambda-\ln (y!)) \] where \(\theta=\ln \lambda\), \(\phi=1\), \(b(\theta)=e^\theta=\lambda\), \(a(\phi)=1\), and \(c(y, \phi)=-\ln (y!)\)

Example

Gaussian-noise Linear Regression

  • Random Component: \(Y \mid X \sim \mathcal{N}\left(\mu, \sigma^2\right)\) and \(\mathbb{E}[Y \mid X]=\mu\)
  • Systematic Component: \(\eta=X \beta\)
  • Link Component: \(g(\mu)=\mu\), so that \(\mu=\eta=X \beta\)

Bernoulli

Suppose that \(Y \in\{0,1\}\), and we model the distribution of \(Y \mid X\) as Bernoulli with success probability \(p\). Then the probability mass function (not a density, since \(Y\) is discrete) is \[ f(y)=p^y(1-p)^{1-y} \]

We can rewrite to fit the exponential family form as \[ \begin{aligned} f(y) & =\exp (y \log p+(1-y) \log (1-p)) \\ & =\exp (y \log (p /(1-p))+\log (1-p)) \end{aligned} \]

\[ f(y ; \theta, \phi)=\exp \left(\frac{y \theta-b(\theta)}{a(\phi)}+c(y, \phi)\right) \]

Here we would identify \(\theta=\log (p /(1-p))\) as the natural parameter. Note that the mean here is \(\mu=p\), and using the inverse of the above relationship, we can directly write the mean \(p\) as a function of \(\theta\), as in \(p=e^\theta /\left(1+e^\theta\right)\). Hence \(b(\theta)=-\log (1-p)=\log \left(1+e^\theta\right)\). There is no dispersion parameter, so we can set \(a(\phi)=1\). Also, \(c(y, \phi)=0\).
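Putting the three components together for the Bernoulli case gives logistic regression: the canonical link is the logit, which is exactly the natural parameter above, so

\[ \log \frac{p}{1-p}=\eta=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k, \quad p=\frac{e^\eta}{1+e^\eta} \]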

Survival Analysis

Survival analysis is used to analyze data in which the time until the event is of interest. The response is often referred to as a failure time, survival time, or event time.

Some definitions

Hazard ratios: ratios of hazard functions between different groups (e.g., exposed vs. unexposed), often estimated while adjusting for confounders.

Censoring occurs when the survival time is only partially known.

  • Fixed type I censoring occurs when a study is designed to end after C years of follow-up. In this case, everyone who does not have an event observed during the course of the study is censored at C years.

  • In random type I censoring, the study is designed to end after C years, but censored subjects do not all have the same censoring time. This is the main type of right-censoring we will be concerned with.

  • In type II censoring, a study ends when there is a pre-specified number of events.

Kaplan-Meier estimate

library(ISLR2)
names(BrainCancer)
[1] "sex"       "diagnosis" "loc"       "ki"        "gtv"       "stereo"   
[7] "status"    "time"     
attach(BrainCancer)
table(status)
status
 0  1 
53 35 
library(survival)
fit.surv <- survfit(Surv(time, status) ~ 1)
plot(fit.surv, xlab = "Months",
    ylab = "Estimated Probability of Survival")

Cox proportional hazards model

\(S(t) = P(T>t)=1-F(t)\) is the survival function.

\(h(t)=\lim _{\Delta t \rightarrow 0} \frac{P(t<T \leq t+\Delta t \mid T>t)}{\Delta t}\)
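The Cox model assumes the hazard for a subject with covariates \(x\) is a baseline hazard scaled by a covariate-dependent factor that does not change over time (the proportional hazards assumption):

\[ h(t \mid x)=h_0(t) \exp \left(\beta_1 x_1+\cdots+\beta_k x_k\right) \]

so \(e^{\beta_j}\) (the exp(coef) column in the output below) is the hazard ratio associated with a one-unit increase in \(x_j\).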

fit.all <- coxph(
Surv(time, status) ~ sex + diagnosis + loc + ki + gtv +
   stereo)
fit.all
Call:
coxph(formula = Surv(time, status) ~ sex + diagnosis + loc + 
    ki + gtv + stereo)

                       coef exp(coef) se(coef)      z        p
sexMale             0.18375   1.20171  0.36036  0.510  0.61012
diagnosisLG glioma  0.91502   2.49683  0.63816  1.434  0.15161
diagnosisHG glioma  2.15457   8.62414  0.45052  4.782 1.73e-06
diagnosisOther      0.88570   2.42467  0.65787  1.346  0.17821
locSupratentorial   0.44119   1.55456  0.70367  0.627  0.53066
ki                 -0.05496   0.94653  0.01831 -3.001  0.00269
gtv                 0.03429   1.03489  0.02233  1.536  0.12466
stereoSRT           0.17778   1.19456  0.60158  0.296  0.76760

Likelihood ratio test=41.37  on 8 df, p=1.776e-06
n= 87, number of events= 35 
   (1 observation deleted due to missingness)
modaldata <- data.frame(
     diagnosis = levels(diagnosis),
     sex = rep("Female", 4),
     loc = rep("Supratentorial", 4),
     ki = rep(mean(ki), 4),
     gtv = rep(mean(gtv), 4),
     stereo = rep("SRT", 4)
     )
survplots <- survfit(fit.all, newdata = modaldata)
plot(survplots, xlab = "Months",
    ylab = "Survival Probability", col = 2:5)
legend("bottomleft", levels(diagnosis), col = 2:5, lty = 1)

Final review of R codes

Calculation

log(exp(1)) # natural log; base e is the default (R has no constant named e, so use exp(1))
[1] 1

Vectors

x <- c(1,2,3,4)
class(x)
[1] "numeric"
x %*%x # scalar ("inner") product (but default in R as an 1*1 matrix)
     [,1]
[1,]   30
rep(c('a','b'),3)
[1] "a" "b" "a" "b" "a" "b"
rep(c(2, 4, 8), each = 3)
[1] 2 2 2 4 4 4 8 8 8
y <- seq(from = 1, to = 4, by =1)
class(x)
[1] "numeric"
str(x)
 num [1:4] 1 2 3 4
x <- c(-5:5)
str(x)
 int [1:11] -5 -4 -3 -2 -1 0 1 2 3 4 ...
1:4 # c(1,2,3,4) and 1:4 are the same in.R
[1] 1 2 3 4
seq(1,5, length.out=11)
 [1] 1.0 1.4 1.8 2.2 2.6 3.0 3.4 3.8 4.2 4.6 5.0
# vac<-c((1,2,3),(3,4,5)) is wrong, but the below is true
vec1 <- c(1,2,3)
vec2 <- c(4,5,6)
vec3 <- c(vec1, vec2)
vec3[1] == vec1[1]
[1] TRUE
vec3[3:5];vec3[c(2,3)]
[1] 3 4 5
[1] 2 3
vec3[-1] # everything but the first element
[1] 2 3 4 5 6
vec3[-2*c(1,2)]
[1] 1 3 5 6
x <- -5:5

abs(-5:5)
 [1] 5 4 3 2 1 0 1 2 3 4 5
x <- c(1, 2, 3, 4, 5,6)
y <- c(10,11)
result <- x + y 
print(result)
[1] 11 13 13 15 15 17
# x * y 
c(1,2)*c(2,3)
[1] 2 6
c(1,2)%*%c(2,3)
     [,1]
[1,]    8
#####Application
# Calculate the sample(var(x)) and population variance
x <- c(1, 2, 3, 4, 5)
n <- length(x)
# Calculate the sample variance using R's var() function
sample_variance <-var(x)
sample_variance
[1] 2.5
population_variance <- sample_variance * (n - 1) / n
population_variance
[1] 2

Rmd knowledge

{r, echo = FALSE} –Hidden Code

{r, eval = FALSE} –Do not run this code

{r, message = FALSE} –Do not show the message

{r, warning = FALSE} –Do not show the warning

{r, results='hide'} –Do not show the results

Probability in R

dnorm(0) # density at 0
[1] 0.3989423
pnorm(-1) # cumulative probability at -1
[1] 0.1586553
pnorm(-1,lower.tail = F) # cumulative probability at -1, upper tail
[1] 0.8413447
pnorm(0)
[1] 0.5
qnorm(0) # quantile at 0 (with the cumulative probability of 0)
[1] -Inf
pnorm(1.645) 
[1] 0.9500151
qnorm(0.95) # norm quantile at 0.95 (with the cumulative probability of 0.95)
[1] 1.644854
pnorm(1.96)
[1] 0.9750021
qnorm(0.975)
[1] 1.959964
library(mosaic)
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'
The following objects are masked from 'package:dplyr':

    count, do, tally
The following object is masked from 'package:Matrix':

    mean
The following objects are masked from 'package:car':

    deltaMethod, logit
The following object is masked from 'package:ggplot2':

    stat
The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var
The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
plotDist('norm',mean=10,sd=2)

plotDist('binom',size=10,prob=.2) # x-axis is the number of success from 1-10

plotDist("pois", lambda = 2)

sum(dpois(0:3, lambda = 2)) # the probability that a Poisson(2) variable is less than or equal to 3
[1] 0.8571235
ppois(3, lambda = 2) 
[1] 0.8571235
ppois(3, lambda = 2) == sum(dpois(0:3, lambda = 2)) # FALSE because of floating-point precision
[1] FALSE
#xpnorm(-1)  # P(Z ≤ -1) = 0.1586553 -package mosaic
#xpnorm(-1, lower.tail = FALSE)  # P(Z ≥ 1.5) = 0.0668072

R basic

paste("Good", "afternoon", "ladies", "and", "gentlemen")
[1] "Good afternoon ladies and gentlemen"
paste0("Good", "afternoon", "ladies", "and", "gentlemen")
[1] "Goodafternoonladiesandgentlemen"
x <- -10:10
which(x>0)
 [1] 12 13 14 15 16 17 18 19 20 21
x[which(x>0)]
 [1]  1  2  3  4  5  6  7  8  9 10
vowels <- c('a','e','i','o','u')
which(is.element(letters, vowels))
[1]  1  5  9 15 21

Data class

Numeric

Integer

y <- 42L
class(y)
[1] "integer"

Character

x <- "123"
class(x)
[1] "character"
# [1] "character"

x <- as.numeric(x)
class(x)
[1] "numeric"
# [1] "numeric"

Logical

# Note that logical elements are NOT in quotes.
z = c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
[1] "character"
as.logical(z)
[1]  TRUE FALSE  TRUE FALSE
# TRUE = 1 and FALSE = 0. sum() and mean() work on logical vectors

# remember:
TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE
TRUE | FALSE
[1] TRUE

Factor

y <- c('B','B','A','A','C')
z <- factor(y)
str(z)
 Factor w/ 3 levels "A","B","C": 2 2 1 1 3
as.numeric(z)
[1] 2 2 1 1 3
levels(z)
[1] "A" "B" "C"
z <- factor(z,                       # vector of data levels to convert 
            levels=c('B','A','C'),   # Order of the levels
            labels=c("B Group", "A Group", "C Group")) # Pretty labels to use
z
[1] B Group B Group A Group A Group C Group
Levels: B Group A Group C Group
#####Application
### eg of use in the plot's x-axis name label

iris$Species <- factor(iris$Species,
                       levels = c('versicolor','setosa','virginica'),
                       labels = c('Versicolor','Setosa','Virginica'))
#boxplot(Sepal.Length ~ Species, data=iris)

### another eg
#age_category <- ifelse(ages >= 18, "Adult", "Minor")
#age_factor <- factor(age_category, levels = c("Minor", "Adult"))
#age_factor

### transform a continuous numerical vector into a factor

x <- 1:10
cut(x, breaks = c(0, 2.5, 5.0, 7.5, 10))
 [1] (0,2.5]  (0,2.5]  (2.5,5]  (2.5,5]  (2.5,5]  (5,7.5]  (5,7.5]  (7.5,10]
 [9] (7.5,10] (7.5,10]
Levels: (0,2.5] (2.5,5] (5,7.5] (7.5,10]
x<-cut(x, breaks=3, labels=c('Low','Medium','High'))
str(x)
 Factor w/ 3 levels "Low","Medium",..: 1 1 1 1 2 2 2 3 3 3

Date and time

The following symbols can be used with the format() function to print dates.

  • %d day as a number (0-31) 01-31
  • %a abbreviated weekday Mon
  • %A unabbreviated weekday Monday
  • %m month (00-12) 00-12
  • %b abbreviated month Jan
  • %B unabbreviated month January
  • %y 2-digit year 07
  • %Y 4-digit year 2007
mydates <- as.Date(c("2023-04-07", "2023-01-01"))
mydates
[1] "2023-04-07" "2023-01-01"
days <- mydates[1] - mydates[2]; days
Time difference of 96 days
today <- Sys.Date()
format(today, format="%B %d %Y")
[1] "July 04 2025"

Data frame

Data_Frame <- data.frame(Tr =c("1","2","3"),
                         Pu =c(11,21,32),
                         Dur=c(22,222,1))
Data_Frame
  Tr Pu Dur
1  1 11  22
2  2 21 222
3  3 32   1
summary(Data_Frame)
      Tr                  Pu             Dur        
 Length:3           Min.   :11.00   Min.   :  1.00  
 Class :character   1st Qu.:16.00   1st Qu.: 11.50  
 Mode  :character   Median :21.00   Median : 22.00  
                    Mean   :21.33   Mean   : 81.67  
                    3rd Qu.:26.50   3rd Qu.:122.00  
                    Max.   :32.00   Max.   :222.00  
### table and data frame
table(mpg$class)

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
df <- as.data.frame(table(mpg$class))
df
        Var1 Freq
1    2seater    5
2    compact   47
3    midsize   41
4    minivan   11
5     pickup   33
6 subcompact   35
7        suv   62

List

# List ---Can hold vectors, strings, matrices, models, list of other list, lists upon lists!

mylist <- list(letters=c("a","b","c"),
               numbers=1:3,matrix(1:25,ncol=5))
head(mylist)
$letters
[1] "a" "b" "c"

$numbers
[1] 1 2 3

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
# Can reference data using $ (if the elements are named), or using [], or [[]]

mylist[1] # list
$letters
[1] "a" "b" "c"
mylist["letters"] # list
$letters
[1] "a" "b" "c"
mylist[[1]] # vector
[1] "a" "b" "c"
mylist$letters == mylist[["letters"]]
[1] TRUE TRUE TRUE
mylist[[3]][1:2,1:2]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
class(mylist[[3]][1:2,1:2])
[1] "matrix" "array" 
x = c(0, 2, 2, 3, 4); 2 %in% x
[1] TRUE
# eg of using list
x <- c(5.1, 4.9, 5.6, 4.2, 4.8, 4.5, 5.3, 5.2)   # some toy data
results <- t.test(x, alternative='less', mu=5)   # do a t-test
str(results)    
List of 10
 $ statistic  : Named num -0.314
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 7
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.381
 $ conf.int   : num [1:2] -Inf 5.25
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 4.95
  ..- attr(*, "names")= chr "mean of x"
 $ null.value : Named num 5
  ..- attr(*, "names")= chr "mean"
 $ stderr     : num 0.159
 $ alternative: chr "less"
 $ method     : chr "One Sample t-test"
 $ data.name  : chr "x"
 - attr(*, "class")= chr "htest"
results$p.value
[1] 0.3813385

Matrix

# Matrices

n=1:9
mat = matrix(n,nrow=3)
mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
y <- diag(n)

# use %*% as the product of matrices

## Eigenvalue and Eigenvector

A <- matrix(c(13, -4, 2, -4, 11, -2, 2, -2, 8), 3, 3, byrow=TRUE)
ev <- eigen(A)

(values <- ev$values)
[1] 17  8  7
(vectors <- ev$vectors)
           [,1]       [,2]      [,3]
[1,]  0.7453560  0.6666667 0.0000000
[2,] -0.5962848  0.6666667 0.4472136
[3,]  0.2981424 -0.3333333 0.8944272
## Data selection --row then column

mat[1, 1]
[1] 1
mat[1,]
[1] 1 4 7
mat[,1] 
[1] 1 2 3
class(mat[1, ]) # Note that the class of the returned object is no longer a matrix
[1] "integer"

Example 1

db_data <- list(
  drugs = list(
    general_information = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005"),
      name = c("Aspirin", "Ibuprofen", "Paracetamol", "Insulin", "Morphine"),
      type = c("small molecule", "small molecule", "small molecule", "biotech", "small molecule"),
      created = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01")),
      stringsAsFactors = FALSE
    ),
    drug_classification = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005"),
      classification = c("Analgesic", "Anti-inflammatory", "Analgesic", "Hormone", "Analgesic"),
      stringsAsFactors = FALSE
    ),
    experimental_properties = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005", "DB001", "DB002", "DB003", "DB004", "DB005"),
      kind = c("logP", "logP", "logP", "logP", "logP", "Molecular Weight", "Molecular Weight", "Molecular Weight", "Molecular Weight", "Molecular Weight"),
      value = c("1.2", "1.5", "0.8", "2.1", "1.8", "180.1", "206.3", "151.2", "5800.0", "281.5"),
      stringsAsFactors = FALSE
    )
  )
)
db_data
$drugs
$drugs$general_information
  drugbank_id        name           type    created
1       DB001     Aspirin small molecule 2020-01-01
2       DB002   Ibuprofen small molecule 2020-02-01
3       DB003 Paracetamol small molecule 2020-03-01
4       DB004     Insulin        biotech 2020-04-01
5       DB005    Morphine small molecule 2020-05-01

$drugs$drug_classification
  drugbank_id    classification
1       DB001         Analgesic
2       DB002 Anti-inflammatory
3       DB003         Analgesic
4       DB004           Hormone
5       DB005         Analgesic

$drugs$experimental_properties
   drugbank_id             kind  value
1        DB001             logP    1.2
2        DB002             logP    1.5
3        DB003             logP    0.8
4        DB004             logP    2.1
5        DB005             logP    1.8
6        DB001 Molecular Weight  180.1
7        DB002 Molecular Weight  206.3
8        DB003 Molecular Weight  151.2
9        DB004 Molecular Weight 5800.0
10       DB005 Molecular Weight  281.5
general_information <- db_data$drugs$general_information

print(general_information)
  drugbank_id        name           type    created
1       DB001     Aspirin small molecule 2020-01-01
2       DB002   Ibuprofen small molecule 2020-02-01
3       DB003 Paracetamol small molecule 2020-03-01
4       DB004     Insulin        biotech 2020-04-01
5       DB005    Morphine small molecule 2020-05-01
# 20. Number of drugs in the general_information dataframe
general_information <- db_data$drugs$general_information
nrow(general_information)
[1] 5
# 21. Filter drugs of type "biotech"
general_information[general_information$type == 'biotech',]
  drugbank_id    name    type    created
4       DB004 Insulin biotech 2020-04-01
# 22. Sort by the created column and display the first 5 rows
general_information$created <- as.Date(general_information$created)
sorted_df <- general_information[order(general_information$created), ]
head(sorted_df, 5)
  drugbank_id        name           type    created
1       DB001     Aspirin small molecule 2020-01-01
2       DB002   Ibuprofen small molecule 2020-02-01
3       DB003 Paracetamol small molecule 2020-03-01
4       DB004     Insulin        biotech 2020-04-01
5       DB005    Morphine small molecule 2020-05-01
# 23. Subset with specific columns and display the first 5 rows
subset_df <- general_information[, c("drugbank_id", "name")]
head(subset_df, 5)
  drugbank_id        name
1       DB001     Aspirin
2       DB002   Ibuprofen
3       DB003 Paracetamol
4       DB004     Insulin
5       DB005    Morphine
# 24. Merge dataframes and count rows
drug_classification <- db_data$drugs$drug_classification
merged_df <- merge(general_information, drug_classification, by = "drugbank_id")
nrow(merged_df)
[1] 5
# 25. Count unique experimental properties (kind)
experimental_properties <- db_data$drugs$experimental_properties
unique_kinds <- unique(experimental_properties$kind)
length(unique_kinds)
[1] 2
# 26. Filter for kind "logP" and count rows
logP_df <- experimental_properties[experimental_properties$kind == "logP", ]
nrow(logP_df)
[1] 5
# 27. Convert value column to numeric and calculate mean
logP_df$value <- as.numeric(logP_df$value)
mean(logP_df$value, na.rm = TRUE)
[1] 1.48
# 28. Calculate summary statistics for logP values
summary(logP_df$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.80    1.20    1.50    1.48    1.80    2.10 
sd(logP_df$value, na.rm = TRUE)
[1] 0.5069517
# 29. Create a histogram of molecular weight values
molecular_weight <- experimental_properties[experimental_properties$kind == "Molecular Weight", ]
molecular_weight$value <- as.numeric(molecular_weight$value)
# clean based on 3 sigma rule
molecular_weight_clean <- molecular_weight[
  abs(molecular_weight$value - mean(molecular_weight$value, na.rm = TRUE)) <= 3 * sd(molecular_weight$value, na.rm = TRUE),]
hist(molecular_weight_clean$value, main = "Histogram of Molecular Weight", xlab = "Molecular Weight", ylab = "Frequency", col = "lightblue",breaks = 20)

# 30. Filter for kind "Water Solubility" and count unique values
water_solubility_df <- experimental_properties[experimental_properties$kind == "Water Solubility", ]
length(unique(water_solubility_df$value))
[1] 0

Data visualization (mainly: ggplot2)

Box plots

library(ggplot2)
data(iris)
boxplot(iris$Sepal.Length ~ iris$Species)

ggplot(mpg, aes(x=class, y=hwy)) + 
  geom_boxplot() +
  scale_y_continuous(breaks = seq(10, 45, by=5))  #---diy scale in y-axis 

# ggplot(mpg, aes(x=class, y=hwy)) + geom_boxplot() +scale_y_continuous(breaks = seq(10, 45, by=5), minor_breaks = NULL)

Histogram plots

hist(iris$Sepal.Length)

plot(Petal.Length ~ Sepal.Length, data=iris)
abline(lm(Petal.Length ~ Sepal.Length, data=iris), col="red")

data(mpg, package='ggplot2')
ggplot(data=mpg, aes(x=class)) +
  geom_bar()

# By default, the geom_bar() just counts the number of cases and displays how many observations were in each factor level. If we have a data frame that we have already summarized, geom_col will allow us to set the height of the bar by a y column.
table(mpg$class)

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
df <- as.data.frame(table(mpg$class))
df
        Var1 Freq
1    2seater    5
2    compact   47
3    midsize   41
4    minivan   11
5     pickup   33
6 subcompact   35
7        suv   62
ggplot(df, aes(Var1, Freq)) +
  geom_col()

ggplot(mpg,aes(x=hwy)) +geom_histogram(binwidth = 2)

Density plots

p1 <- ggplot(mpg, aes(x=hwy, y=after_stat(density))) + 
  geom_histogram(bins=8, fill="blue", alpha=0.5) +
  labs(title="Histogram of Highway MPG density")
p2 <- ggplot(mpg, aes(x=hwy)) + 
  geom_density(fill='red', alpha=0.5) +
  labs(title="Density Plot of Highway MPG")
library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
grid.arrange(p1, p2, ncol = 2)

  • multiple plots in one figure
ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(aes(fill = Species,color=Species), alpha = 0.5) +
  labs(title = "Density plot") +
  labs(x = "Sepal Length", y = "Density") +
  labs(fill = "area",color="line")  # fill is the area,color is the line or dot

Scatter plots

mtcars$cyl <- factor(mtcars$cyl) 

ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point() +
  labs(title='Weight vs Miles per Gallon') +
  labs(x="Weight in tons (2000 lbs)", y="Miles per Gallon (US)" ) +
  labs(color="Cylinders") + # color is dot or line
  scale_color_manual(values=c('blue', 'darkmagenta', 'aquamarine')) # diy color

Scatter plots with regression line

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length,color=Species))+
  geom_point()+# Anything set inside an aes() command will be of the form attribute=Column_Name and will change based on the data. Anything set outside an aes() command will be in the form attribute=value and will be fixed.
   geom_smooth(method="lm") #By default, geom_smooth(method="lm") fits a linear regression line for each Species separately because Species are mapped to colors, and geom_smooth automatically draws a line for each category.
`geom_smooth()` using formula = 'y ~ x'

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length)) +
  geom_point(aes(color=Species,shape=Species))+
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

# Zooming in/out--Danger!  This removes the data points first!
#ggplot(trees, aes(x=Girth, y=Volume)) + 
  #geom_point() +
  #geom_smooth(method='lm') +
  #xlim( 8, 19 ) + ylim(0, 60)
library(palmerpenguins)
ggplot(penguins, aes(x=flipper_length_mm, y=body_mass_g)) + geom_point(aes(color=bill_depth_mm, shape=species), na.rm=T) + geom_smooth(na.rm=T, se=F) + scale_color_gradient2(low='yellow', mid='green', high='blue', midpoint = 17) + labs(x='Flipper length (millimeters)', y='Body mass (grams)', color='Bill depth (millimeters)') + theme_bw()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Heat map

# data
mine.table <- data.frame(
  Sample.name = rep(paste0("Sample", 1:5), each = 3),
  Class = rep(c("Class1", "Class2", "Class3"), times = 5),
  Abundance = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5),
  Depth = c(0.5,0.5,0.5, 0.6,0.6,0.6, 0.7,0.7,0.7, 0.8,0.8,0.8, 0.9,0.9,0.9)
)
print(mine.table)
   Sample.name  Class Abundance Depth
1      Sample1 Class1       0.1   0.5
2      Sample1 Class2       0.2   0.5
3      Sample1 Class3       0.3   0.5
4      Sample2 Class1       0.4   0.6
5      Sample2 Class2       0.5   0.6
6      Sample2 Class3       0.6   0.6
7      Sample3 Class1       0.7   0.7
8      Sample3 Class2       0.8   0.7
9      Sample3 Class3       0.9   0.7
10     Sample4 Class1       1.0   0.8
11     Sample4 Class2       1.1   0.8
12     Sample4 Class3       1.2   0.8
13     Sample5 Class1       1.3   0.9
14     Sample5 Class2       1.4   0.9
15     Sample5 Class3       1.5   0.9
mine.heatmap <- ggplot(data = mine.table, mapping = aes(x = Sample.name, y = Class, fill = Abundance)) + 
  geom_tile() + # create the heatmap with tiles
  scale_y_discrete(limits = rev(levels(factor(mine.table$Class)))) +  # reverse the y-axis order so it reads Class1 to Class3 from top to bottom (otherwise Class3 to Class1)
  scale_fill_gradient(low = "white", high = "blue") +  # color
  theme_minimal() +  # control the theme of the plot
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate the X-axis label for better display
  labs(x = "Sample Name",  # x-axis label
       y = "Class",  # y-axis label
       fill = "Abundance")+ # fill legend label by "Abundance"
       ggtitle("Heatmap of Abundance by Sample and Class")+ # add title
       geom_text(aes(label = round(Abundance, 2)), color = "black", size = 3) # add text labels to the tiles


print(mine.heatmap)

Create heat map using facet_grid to show the data in different panels by depth

mine.heatmap <- ggplot(data = mine.table, mapping = aes(x = Sample.name, y = Class, fill = Abundance)) + 
  geom_tile() + # create the heatmap with tiles
  facet_grid(~ Depth, switch = 'x', scales='free', space='free') + # split the plot into panels by Depth; switch='x' moves the panel labels below the x-axis
  scale_y_discrete(limits = rev(levels(factor(mine.table$Class)))) +  # reverse the y-axis order so it reads Class1 to Class3 from top to bottom
  scale_fill_gradient(low="#FFFFFF", high="#012345")+  # color
  theme_minimal() +  # control the theme of the plot
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate the X-axis label for better display
  labs(x = "Sample Name",  # x-axis label
       y = "Class",  # y-axis label
       fill = "Abundance")+ # fill legend label by "Abundance"
       ggtitle("Heatmap of Abundance by Sample and Class")+ # add title
       geom_text(aes(label = round(Abundance, 2)), color = "black", size = 3) # add text labels to the tiles
  
       


print(mine.heatmap)

Faceting (make many panels of graphics where each panel shows the same relationship between variables, but something changes from panel to panel)

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point() +
  facet_grid(.~Species) # or facet_grid(Species~.) -- the Species panels would then be stacked vertically

  • Another example
library(reshape)

Attaching package: 'reshape'
The following object is masked from 'package:dplyr':

    rename
The following object is masked from 'package:Matrix':

    expand
data(tips, package='reshape')
head(tips, 3)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
ggplot(tips, aes(x = total_bill, y = tip / total_bill)) +
  geom_point() +
  facet_grid( smoker ~ day )

# 'free_y' means the y-axis scale of each panel is adjusted independently
ggplot(tips, aes(x = total_bill, y = tip / total_bill)) +
  geom_point() +
  facet_wrap( ~ day, scales='free_y')

# log scales --- scale_y_log10() is a wrapper around scale_y_continuous() with a log10 transformation
# ggplot(ACS, aes(x=Age, y=Income)) + geom_point() +
# scale_y_log10(breaks=c(1, 10, 100),
#            minor=c(1:10,
#                 seq(10, 100, by=10 ),
#                seq(100, 1000, by=100))) +
#  ylab('Income (1000s of dollars)')
  • Multi-plot
p1 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet, group=Chick)) +
    geom_line() +
    ggtitle("Growth curve for individual chicks")
# Second plot
p2 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet)) +
    geom_point(alpha=.3) +
    geom_smooth(alpha=.2, linewidth=1) +
    ggtitle("Fitted growth curve per diet")
# Third plot 
p3 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, colour=Diet)) +
    geom_density() +
    ggtitle("Final weight, by diet")


# to realize:
# plot1 plot2 plot2
# plot1 plot2 plot2
# plot1 plot3 plot3

my.layout = cbind( c(1,1,1), c(2,2,3), c(2,2,3) ) # each c() is a column of the layout matrix, and 1, 2, 3 refer to p1, p2, p3
library(Rmisc)
Loading required package: plyr
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'
The following objects are masked from 'package:reshape':

    rename, round_any
The following object is masked from 'package:mosaic':

    count
The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize
Rmisc::multiplot( p1, p2, p3, layout=my.layout)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# OR use library(ggpubr) (https://rpkgs.datanovia.com/ggpubr/)

library(ggpubr)

Attaching package: 'ggpubr'
The following object is masked from 'package:plyr':

    mutate
# Box plot (bp)
bxp <- ggboxplot(ToothGrowth, x = "dose", y = "len",
                 color = "dose", palette = "jco")
# Dot plot (dp)
dp <- ggdotplot(ToothGrowth, x = "dose", y = "len",
                 color = "dose", palette = "jco", binwidth = 1)
mtcars$name <- rownames(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
bp <- ggbarplot(mtcars, x = "name", y = "mpg",
          fill = "cyl",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate x axis text vertically
          ) + font("x.text", size = 8)
# Scatter plots (sp)
sp <- ggscatter(mtcars, x = "wt", y = "mpg",
                add = "reg.line",               # Add regression line
                conf.int = TRUE,                # Add confidence interval
                color = "cyl", palette = "jco", # Color by groups "cyl"
                shape = "cyl"                   # Change point shape by groups "cyl"
                ) + 
  stat_cor(aes(color = cyl), label.x = 3)       # Add correlation coefficient

ggarrange(bxp, dp, bp + rremove("x.text"),
          labels = c("A", "B", "C"),
          ncol = 2, nrow = 2)

# Themes
# Rmisc::multiplot( p1 + theme_bw(),          # Black and white
#                   p1 + theme_minimal(),   
#                   p1 + theme_dark(),        
#                   p1 + theme_light(),
#                   cols=2 )

#ggsave('p1.png', width=6, height=3, dpi=350)

Data manipulation

library(dplyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ plyr::arrange()      masks dplyr::arrange()
✖ gridExtra::combine() masks dplyr::combine()
✖ purrr::compact()     masks plyr::compact()
✖ plyr::count()        masks mosaic::count(), dplyr::count()
✖ purrr::cross()       masks mosaic::cross()
✖ plyr::desc()         masks dplyr::desc()
✖ mosaic::do()         masks dplyr::do()
✖ tidyr::expand()      masks reshape::expand(), Matrix::expand()
✖ plyr::failwith()     masks dplyr::failwith()
✖ dplyr::filter()      masks stats::filter()
✖ plyr::id()           masks dplyr::id()
✖ dplyr::lag()         masks stats::lag()
✖ ggpubr::mutate()     masks plyr::mutate(), dplyr::mutate()
✖ tidyr::pack()        masks Matrix::pack()
✖ dplyr::recode()      masks car::recode()
✖ plyr::rename()       masks reshape::rename(), dplyr::rename()
✖ purrr::some()        masks car::some()
✖ lubridate::stamp()   masks reshape::stamp()
✖ mosaic::stat()       masks ggplot2::stat()
✖ plyr::summarise()    masks dplyr::summarise()
✖ plyr::summarize()    masks dplyr::summarize()
✖ mosaic::tally()      masks dplyr::tally()
✖ tidyr::unpack()      masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

apply

# apply
# Summarize each row by calculating the mean (MARGIN=1 applies the function over rows).
apply(iris[,-5],        # what object do we want to apply the function to
      MARGIN=1,    # rows = 1, columns = 2 (same order as [rows, cols])
      FUN=mean     # what function do we want to apply     
     ) %>% head(10)
 [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400
average <- apply( 
  iris[,-5],        # what object do we want to apply the function to
  MARGIN=2,    # rows = 1, columns = 2 (same order as [rows, cols])
  FUN=mean     # what function do we want to apply
)
iris <- rbind(iris[,-5], average)   # note: this overwrites the built-in iris (it can be restored later with data(iris))
iris %>% head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2

There are several variants of the apply() function, and the most frequently used ones are lapply() and sapply(). These two functions apply a given function to each element of a list or vector and return a corresponding list or vector of results.

#lapply
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
x
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$beta
[1]  0.04978707  0.13533528  0.36787944  1.00000000  2.71828183  7.38905610
[7] 20.08553692

$logic
[1]  TRUE FALSE FALSE  TRUE
lapply(x, quantile, probs = 1:3/4) # list 
$a
 25%  50%  75% 
3.25 5.50 7.75 

$beta
      25%       50%       75% 
0.2516074 1.0000000 5.0536690 

$logic
25% 50% 75% 
0.0 0.5 1.0 
sapply(x, quantile, probs = 1:3/4) # matrix
       a      beta logic
25% 3.25 0.2516074   0.0
50% 5.50 1.0000000   0.5
75% 7.75 5.0536690   1.0

Tibbles

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

data <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() - 1:3)
data
  a b          c
1 1 a 2025-07-03
2 2 b 2025-07-02
3 3 c 2025-07-01
as_tibble(data)
# A tibble: 3 × 3
      a b     c         
  <int> <chr> <date>    
1     1 a     2025-07-03
2     2 b     2025-07-02
3     3 c     2025-07-01

%>%

The pipe operator %>% is used to pass the result of one function to the next function in a chain, making the code more readable and concise. For example, if we wanted to start with x, and first apply function f(), then g(), and then h(), the usual R command would be h(g(f(x))) which is hard to read because you have to start reading at the innermost set of parentheses. Using the pipe command %>%, this sequence of operations becomes x %>% f() %>% g() %>% h().
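
For instance, here is the same small computation written both ways (a toy illustration, not from the lab itself):

sqrt(mean(1:10))            # nested: read from the inside out
1:10 %>% mean() %>% sqrt()  # piped: read left to right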

select


# Correct usage of select() within a pipeline
starwars %>% select(-ends_with('color'))
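
A couple of other common selection patterns (illustrative only; starwars ships with dplyr):

starwars %>% select(name, height, mass) %>% head(3)   # pick columns by name
starwars %>% select(name:mass) %>% head(3)            # a range of adjacent columns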

filter

library(dplyr)

# Filter rows where species is "Droid" and mass is less than 100
filtered_data <- starwars %>% filter(species == "Droid", mass < 100)
print(filtered_data)
# A tibble: 3 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
3 R5-D4     97    32 <NA>       white, red  red               NA none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

slice

This function is used to select rows by their position in the data frame. It can be used to select specific rows or a range of rows.

filter rows based on row number:

starwars %>% slice(2:4)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
2 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
3 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

arrange

This function is used to sort the rows of a data frame by one or more columns. The default sort order is ascending, but you can use the desc() function to sort in descending order.

starwars %>% arrange(desc(name)) # desc() reverses the default ascending order, so names are sorted Z to A
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Zam Wes…    168    55 blonde     fair, gre… yellow            NA fema… femin…
 2 Yoda         66    17 white      green      brown            896 male  mascu…
 3 Yarael …    264    NA none       white      yellow            NA male  mascu…
 4 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
 5 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 6 Wedge A…    170    77 brown      fair       hazel             21 male  mascu…
 7 Watto       137    NA black      blue, grey yellow            NA male  mascu…
 8 Wat Tam…    193    48 none       green, gr… unknown           NA male  mascu…
 9 Tion Me…    206    80 none       grey       black             NA male  mascu…
10 Taun We     213    NA none       grey       black             NA fema… femin…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars %>% arrange(desc(height)) %>% head(3)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
2 Tarfful      234   136 brown      brown      blue              NA male  mascu…
3 Lama Su      229    88 none       grey       black             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
dd <- data.frame(
  Trt = factor(c("High", "Med", "High", "Low"),        
               levels = c("Low", "Med", "High")), # level
  y = c(8, 3, 9, 9),      
  z = c(1, 1, 1, 2)) 
dd %>% arrange(Trt, desc(y))
   Trt y z
1  Low 9 2
2  Med 3 1
3 High 9 1
4 High 8 1

mutate

This function is used to create new columns or modify existing columns in a data frame. It allows you to perform calculations and transformations on the data.

# base R way: create a new column from existing columns using $
starwars$bmi = starwars$mass / ((starwars$height / 100) ^ 2)
starwars %>% select(name, bmi) %>% head(3)

# mutate avoids all the starwars$

starwars$bmi <- NULL
starwars %>% 
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%  
  select(name, bmi) %>% head(3)

# mutate_at() and mutate_if() allow us to apply a function to a particular column and save the output.

subset <- starwars %>% 
  mutate(square_height = (height / 100) ^ 2,
         bmi = mass / square_height) %>%
  select(name, square_height, bmi)
subset %>% head(3)

subset %>% mutate_if(is.numeric, round, digits=0) # here, is.numeric is the condition

subset %>% mutate_at(2:3, round, digits=0) %>% head() # column 2 3

# Apply the transformation to columns 2 and 3 for rows 1 to 3
result <- subset %>% 
  mutate_at(2:3, ~ifelse(row_number() %in% 1:3, round(., digits = 0), .))

subset %>% mutate(avg.example = select(., square_height:bmi) %>% rowMeans())

summarise

This function is used to create a summary table. It reduces the data frame to a single row containing summary statistics.

starwars %>% summarise(mean.height=mean(height, na.rm=T), sd.height=sd(height, na.rm=T))

# apply the same statistic to each column
starwars %>% select(height:mass) %>% summarise_all(list(min=min, max=max), na.rm=T)
starwars %>% summarise_if(is.numeric, list(min=min, max=max), na.rm = T)

group_by

This function is used to group the data frame by one or more columns. It is often used in combination with summarise() to calculate summary statistics for each group.


library(dplyr)
library(palmerpenguins)
table(penguins$sex, penguins$species)
penguins %>% 
  filter(!is.na(sex)) %>%
  group_by(sex, species) %>%           
  summarise(n = n(), 
            mean.flipper = mean(flipper_length_mm),
            sd.flipper = sd(flipper_length_mm),
            .groups='keep') %>%
  head(3)
  

examples

Find the flight with the longest departure delay among flights from the same origin and destination (use filter()). Relocate the origin, destination, and departure delay to the first three columns and sort by origin and dest.

flights %>% 
  filter(!is.na(dep_delay)) %>% 
  group_by(origin, dest) %>% 
  filter(dep_delay == max(dep_delay)) %>% 
  relocate(origin, dest, dep_delay) %>% 
  arrange(origin, dest)

Find the flight with the longest departure delay among flights from the same origin and destination (use top_n() or slice_max()). Relocate the origin, destination, and departure delay to the first three columns and sort by origin and dest.

flights %>% 
  filter(!is.na(dep_delay)) %>% 
  group_by(origin, dest) %>% 
  top_n(1, dep_delay) %>%  # or using slice_max(dep_delay) %>% 
  relocate(origin, dest, dep_delay) %>% 
  arrange(origin, dest)

How do departure delays vary at different times of the day? Summarize the average departure delay by hour and create a new column named dep_delay_level that uses cut() to bin the average departure delays into three levels (low, median, and high).

flights %>% 
  group_by(hour) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(dep_delay_level = cut(avg_dep_delay, breaks=3, c('low', 'median', 'high')))

How do departure delays vary at different times of the day? Illustrate your answer with a geom_smooth() plot.

flights %>% 
  group_by(hour) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = avg_dep_delay)) + geom_smooth()
# ways to deal with blanks (spaces) in column names: rename them
students %>%
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  ) %>% head(3)


read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
library(tidyr)
grade.book <- rbind(
  data.frame(name='Alison',  HW.1=8, HW.2=5, HW.3=8, HW.4=4),
  data.frame(name='Brandon', HW.1=5, HW.2=3, HW.3=6, HW.4=9),
  data.frame(name='Charles', HW.1=9, HW.2=7, HW.3=9, HW.4=10))
grade.book
     name HW.1 HW.2 HW.3 HW.4
1  Alison    8    5    8    4
2 Brandon    5    3    6    9
3 Charles    9    7    9   10
tidy.scores <- grade.book %>%
  pivot_longer(
    cols = starts_with("HW"),
    names_to = "Homework",
    values_to = "Score"
  )
tidy.scores
# A tibble: 12 × 3
   name    Homework Score
   <chr>   <chr>    <dbl>
 1 Alison  HW.1         8
 2 Alison  HW.2         5
 3 Alison  HW.3         8
 4 Alison  HW.4         4
 5 Brandon HW.1         5
 6 Brandon HW.2         3
 7 Brandon HW.3         6
 8 Brandon HW.4         9
 9 Charles HW.1         9
10 Charles HW.2         7
11 Charles HW.3         9
12 Charles HW.4        10
tidy.scores %>% pivot_wider(names_from=Homework, values_from=Score)
# A tibble: 3 × 5
  name     HW.1  HW.2  HW.3  HW.4
  <chr>   <dbl> <dbl> <dbl> <dbl>
1 Alison      8     5     8     4
2 Brandon     5     3     6     9
3 Charles     9     7     9    10
# table joins

Fish.Data <- tibble(
  Lake_ID = c('A','A','B','B','C','C'), 
  Fish.Weight=rnorm(6, mean=260, sd=25) ) # make up some data
Fish.Data
# A tibble: 6 × 2
  Lake_ID Fish.Weight
  <chr>         <dbl>
1 A              237.
2 A              247.
3 B              262.
4 B              245.
5 C              300.
6 C              278.
Lake.Data <- tibble(
  Lake_ID = c('B','C','D'),   
  Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),   
  pH=c(6.5, 6.3, 6.1),
  area = c(40, 210, 240),
  avg_depth = c(8, 10, 38))
Lake.Data
# A tibble: 3 × 5
  Lake_ID Lake_Name      pH  area avg_depth
  <chr>   <chr>       <dbl> <dbl>     <dbl>
1 B       Lake Elaine   6.5    40         8
2 C       Mormon Lake   6.3   210        10
3 D       Lake Mary     6.1   240        38
full_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 7 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 A              237. <NA>         NA      NA        NA
2 A              247. <NA>         NA      NA        NA
3 B              262. Lake Elaine   6.5    40         8
4 B              245. Lake Elaine   6.5    40         8
5 C              300. Mormon Lake   6.3   210        10
6 C              278. Mormon Lake   6.3   210        10
7 D               NA  Lake Mary     6.1   240        38
left_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 6 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 A              237. <NA>         NA      NA        NA
2 A              247. <NA>         NA      NA        NA
3 B              262. Lake Elaine   6.5    40         8
4 B              245. Lake Elaine   6.5    40         8
5 C              300. Mormon Lake   6.3   210        10
6 C              278. Mormon Lake   6.3   210        10
inner_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 4 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 B              262. Lake Elaine   6.5    40         8
2 B              245. Lake Elaine   6.5    40         8
3 C              300. Mormon Lake   6.3   210        10
4 C              278. Mormon Lake   6.3   210        10
# fragment: parse_number() (readr) extracts the numeric part of a string, e.g. "wk1" -> 1;
# it is used inside a pipeline such as  df %>% mutate(week = parse_number(week))

how many data points are in the data set

gender_year <- Survey %>% 
  filter(!is.na(Year)) %>% 
  group_by(Sex, Year) %>% 
  count() %>% 
  rename(nu=n)
gender_year
gender_year %>% pivot_wider(names_from = Year, values_from = nu)
who2 %>% 
  head(3)
who2 <- who2 %>% 
  pivot_longer(
    cols = !(country:year), 
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  ) %>% 
  filter(!is.na(count))
who2
left_join(feb14_VX, airports, by=c('dest'='faa')) # join on key columns with different names: 'dest' in feb14_VX matches 'faa' in airports

library(psych)
drug_prop <- drug_prop %>% 
  filter(class == 'Carboxylic acids and derivatives') 
drug_prop %>% 
  select(logP, logS, water_solubility) %>% 
  pairs.panels()

PS: a comparison of the same operation written with and without %>%

# %>% 
penguins %>% 
  filter(!is.na(sex)) %>%
  group_by(sex, species) %>%           
  mutate(Sum.Sq.Cells = (flipper_length_mm - mean(flipper_length_mm))^2)  %>%  
  select(sex, species, flipper_length_mm, Sum.Sq.Cells) %>% head()

# without %>% (nested function calls, read from the inside out)
head(
  select(mutate(group_by(filter(penguins, !is.na(sex)), sex, species),
                Sum.Sq.Cells = (flipper_length_mm - mean(flipper_length_mm))^2),
       sex, species, flipper_length_mm, Sum.Sq.Cells))
library(nycflights13)
str(nycflights13::flights)
# the order of group_by and summarize matters
flights %>% 
  group_by(carrier) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_dep_delay))

Control flow

while loop

x <- 1
while (x < 10) {
  print(x)
  x <- x + 1
}

for loop

Fibonacci sequence

F <- rep(0, 10)        
F[1] <- 0             
F[2] <- 1              
cat('F = ', F, '\n') 
F =  0 1 0 0 0 0 0 0 0 0 
for( n in 3:10 ){
  F[n] <- F[n-1] + F[n-2]
  cat('F = ', F, '\n')    
}
F =  0 1 1 0 0 0 0 0 0 0 
F =  0 1 1 2 0 0 0 0 0 0 
F =  0 1 1 2 3 0 0 0 0 0 
F =  0 1 1 2 3 5 0 0 0 0 
F =  0 1 1 2 3 5 8 0 0 0 
F =  0 1 1 2 3 5 8 13 0 0 
F =  0 1 1 2 3 5 8 13 21 0 
F =  0 1 1 2 3 5 8 13 21 34 

bootstrap estimate of a sampling distribution

library(dplyr)
library(ggplot2)
SampDist <- data.frame()

for (i in 1:1000){
  SampDist <- trees %>% 
    slice_sample(n=30, replace =TRUE) %>% 
    dplyr::summarise(xbar=mean(Height)) %>% 
    rbind(SampDist)
}
ggplot(SampDist,aes(x=xbar)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
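
From the simulated sampling distribution we can then read off, for example, a bootstrap standard error and a rough 95% percentile interval (illustration only):

sd(SampDist$xbar)                          # bootstrap estimate of the standard error of the mean
quantile(SampDist$xbar, c(0.025, 0.975))   # percentile-based 95% interval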

Functions

Functions construction
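
A minimal sketch of defining and calling a function with a default argument (not from the lab):

pow <- function(x, p = 2){   # p defaults to 2
  x^p                        # the last evaluated expression is returned
}
pow(3)      # 9
pow(3, 3)   # 27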

copy of x

k <- 3
example.func <- function(x){
  x <- sort(x)
  if (k > 1){
    print(x)
  }
}
x <- c(3,1,5,4,2)
example.func(x)
[1] 1 2 3 4 5
x # x was changed inside the function but not outside it
[1] 3 1 5 4 2

Ellipses

# a function that draws the regression line and confidence interval
# notice it doesn't return anything, all it does is draw a plot
show.lm <- function(m, interval.type='confidence', fill.col='light grey', ...){
  x <- m$model[,2]       # extract the predictor variable
  y <- m$model[,1]       # extract the response
  pred <- predict(m, interval=interval.type)
  plot(x, y, ...)
  polygon( c(x, rev(x)),                        # draw the ribbon: polygon fills the region given
           c(pred[,'lwr'], rev(pred[,'upr'])),  # by a set of vertices -- lwr left to right, then
           col=fill.col)                        # upr reversed so the shape closes up nicely
  lines(x, pred[, 'fit'])                       # add the fitted regression line on top
  points(x, y)                                  # redraw the points over the ribbon
} 
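
A usage sketch (assuming the built-in trees data; sorting by the predictor keeps the ribbon tidy). The extra arguments are passed through ... to plot():

m <- lm(Volume ~ Girth, data = trees[order(trees$Girth), ])
show.lm(m, main = 'Tree volume vs girth', pch = 19)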

Data class

df<-("R is the statistical analysis language")
strsplit(df, split = " ")
strsplit(df, split = "")
df<-"all16i5need6is4a9long8vacation"
?strsplit
strsplit(df,split = "[0-9]+")
paste("Good", "afternoon", "ladies", "and", "gentlemen")
paste0("Good", "afternoon", "ladies", "and", "gentlemen")
6 != 10
x <- -10:10
which(x>0)
x[which(x>0)]
vowels <- c('a','e','i','o','u')
which(is.element(letters, vowels))

# Note that logical elements are NOT in quotes.
z = c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
as.logical(z)

# TRUE = 1 and FALSE = 0. sum() and mean() work on logical vectors
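
# e.g. (small illustration): counting with logicals, using x <- -10:10 from above
sum(x > 0)    # number of positive entries  -> 10
mean(x > 0)   # proportion of positive entries -> 10/21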

# remember:
TRUE & TRUE
TRUE & FALSE
TRUE | FALSE


## Factor

y <- c('B','B','A','A','C')
z <- factor(y)
str(z)          # a factor with 3 levels, stored internally as integer codes
as.numeric(z)   # the underlying codes (levels are alphabetical by default: A=1, B=2, C=3)
levels(z)       # "A" "B" "C"

z <- factor(z,                       # vector of data levels to convert 
            levels=c('B','A','C'),   # Order of the levels
            labels=c("B Group", "A Group", "C Group")) # Pretty labels to use
z

### eg of use for the plot's x-axis labels

iris$Species <- factor(iris$Species,
                       levels = c('versicolor','setosa','virginica'),
                       labels = c('Versicolor','Setosa','Virginica'))
#boxplot(Sepal.Length ~ Species, data=iris)

### another eg
#age_category <- ifelse(ages >= 18, "Adult", "Minor")
#age_factor <- factor(age_category, levels = c("Minor", "Adult"))
#age_factor

# transform a continuous numerical vector into a factor

x <- 1:10

# cut(x,breaks=3)
cut(x, breaks = c(0, 2.5, 5.0, 7.5, 10))
cut(x, breaks=3, labels=c('Low','Medium','High'))

# dates
mydates <- as.Date(c("2023-04-07", "2023-01-01"))
mydates

days <- mydates[1] - mydates[2]; days

The following symbols can be used with the format() function to print dates.

  • %d  day as a number (01-31)
  • %a  abbreviated weekday, e.g. Mon
  • %A  unabbreviated weekday, e.g. Monday
  • %m  month as a number (01-12)
  • %b  abbreviated month, e.g. Jan
  • %B  unabbreviated month, e.g. January
  • %y  2-digit year, e.g. 07
  • %Y  4-digit year, e.g. 2007
today <- Sys.Date()
format(today, format="%B %d %Y")
[1] "July 04 2025"
# or library(lubridate)
#x = c("2014-02-4 05:02:00","2016/09/24 14:02:00") 
#ymd_hms(x)

# POSIXct
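# POSIXct stores date-times (date plus time of day) as seconds since 1970-01-01 UTC,
# e.g. (illustration): as.POSIXct("2023-04-07 12:30:00", tz = "UTC")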
# Matrices

n=1:9
mat = matrix(n,nrow=3)
mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
y <- diag(n)

# use %*% as the product of matrices
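
# e.g. (small illustration with mat from above):
mat %*% diag(3)                # multiplying by the 3x3 identity returns mat unchanged
mat %*% matrix(1:3, ncol = 1)  # 3x3 times 3x1 gives a 3x1 column vector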

## Eigenvalue and Eigenvector

A <- matrix(c(13, -4, 2, -4, 11, -2, 2, -2, 8), 3, 3, byrow=TRUE)
ev <- eigen(A)

(values <- ev$values)
[1] 17  8  7
(vectors <- ev$vectors)
           [,1]       [,2]      [,3]
[1,]  0.7453560  0.6666667 0.0000000
[2,] -0.5962848  0.6666667 0.4472136
[3,]  0.2981424 -0.3333333 0.8944272
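
As a quick check (illustration only), each eigenvector satisfies A v = lambda v:

A %*% ev$vectors[,1]           # equals 17 * the first eigenvector
ev$values[1] * ev$vectors[,1]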
## Data selection --row then column

mat[1, 1]
[1] 1
mat[1,]
[1] 1 4 7
mat[,1] 
[1] 1 2 3
class(mat[1, ]) # Note that the class of the returned object is no longer a matrix
[1] "integer"
# Data frames

Data_Frame <- data.frame(Tr =c("1","2","3"),
                         Pu =c(11,21,32),
                         Dur=c(22,222,1))
Data_Frame
  Tr Pu Dur
1  1 11  22
2  2 21 222
3  3 32   1
summary(Data_Frame)
      Tr                  Pu             Dur        
 Length:3           Min.   :11.00   Min.   :  1.00  
 Class :character   1st Qu.:16.00   1st Qu.: 11.50  
 Mode  :character   Median :21.00   Median : 22.00  
                    Mean   :21.33   Mean   : 81.67  
                    3rd Qu.:26.50   3rd Qu.:122.00  
                    Max.   :32.00   Max.   :222.00  
# List --- can hold vectors, strings, matrices, models, lists of other lists, lists upon lists!

mylist <- list(letters=c("a","b","c"),
               numbers=1:3,matrix(1:25,ncol=5))
head(mylist)
$letters
[1] "a" "b" "c"

$numbers
[1] 1 2 3

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
# Can reference data using $ (if the elements are named), or using [], or [[]]

mylist[1] # list
$letters
[1] "a" "b" "c"
mylist["letters"] # list
$letters
[1] "a" "b" "c"
mylist[[1]] # vector
[1] "a" "b" "c"
mylist$letters == mylist[["letters"]]
[1] TRUE TRUE TRUE
mylist[[3]][1:2,1:2]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
class(mylist[[3]][1:2,1:2])
[1] "matrix" "array" 
x = c(0, 2, 2, 3, 4); 2 %in% x
[1] TRUE
# eg: many model outputs are stored as lists (here, the htest object returned by t.test)
x <- c(5.1, 4.9, 5.6, 4.2, 4.8, 4.5, 5.3, 5.2)   # some toy data
results <- t.test(x, alternative='less', mu=5)   # do a t-test
str(results)    
List of 10
 $ statistic  : Named num -0.314
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 7
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.381
 $ conf.int   : num [1:2] -Inf 5.25
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 4.95
  ..- attr(*, "names")= chr "mean of x"
 $ null.value : Named num 5
  ..- attr(*, "names")= chr "mean"
 $ stderr     : num 0.159
 $ alternative: chr "less"
 $ method     : chr "One Sample t-test"
 $ data.name  : chr "x"
 - attr(*, "class")= chr "htest"
results$p.value
[1] 0.3813385

Practice of Lab 5

db_data <- list(
  drugs = list(
    general_information = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005"),
      name = c("Aspirin", "Ibuprofen", "Paracetamol", "Insulin", "Morphine"),
      type = c("small molecule", "small molecule", "small molecule", "biotech", "small molecule"),
      created = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01")),
      stringsAsFactors = FALSE
    ),
    drug_classification = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005"),
      classification = c("Analgesic", "Anti-inflammatory", "Analgesic", "Hormone", "Analgesic"),
      stringsAsFactors = FALSE
    ),
    experimental_properties = data.frame(
      drugbank_id = c("DB001", "DB002", "DB003", "DB004", "DB005", "DB001", "DB002", "DB003", "DB004", "DB005"),
      kind = c("logP", "logP", "logP", "logP", "logP", "Molecular Weight", "Molecular Weight", "Molecular Weight", "Molecular Weight", "Molecular Weight"),
      value = c("1.2", "1.5", "0.8", "2.1", "1.8", "180.1", "206.3", "151.2", "5800.0", "281.5"),
      stringsAsFactors = FALSE
    )
  )
)

general_information <- db_data$drugs$general_information

print(general_information)
  drugbank_id        name           type    created
1       DB001     Aspirin small molecule 2020-01-01
2       DB002   Ibuprofen small molecule 2020-02-01
3       DB003 Paracetamol small molecule 2020-03-01
4       DB004     Insulin        biotech 2020-04-01
5       DB005    Morphine small molecule 2020-05-01
# 20. Number of drugs in the general_information dataframe
general_information <- db_data$drugs$general_information
nrow(general_information)
[1] 5
# 21. Filter drugs of type "biotech"
general_information[general_information$type == 'biotech',]
  drugbank_id    name    type    created
4       DB004 Insulin biotech 2020-04-01
# 22. Sort by the created column and display the first 5 rows
general_information$created <- as.Date(general_information$created)
sorted_df <- general_information[order(general_information$created), ]
head(sorted_df, 5)
  drugbank_id        name           type    created
1       DB001     Aspirin small molecule 2020-01-01
2       DB002   Ibuprofen small molecule 2020-02-01
3       DB003 Paracetamol small molecule 2020-03-01
4       DB004     Insulin        biotech 2020-04-01
5       DB005    Morphine small molecule 2020-05-01
# 23. Subset with specific columns and display the first 5 rows
subset_df <- general_information[, c("drugbank_id", "name")]
head(subset_df, 5)
  drugbank_id        name
1       DB001     Aspirin
2       DB002   Ibuprofen
3       DB003 Paracetamol
4       DB004     Insulin
5       DB005    Morphine
# 24. Merge dataframes and count rows
drug_classification <- db_data$drugs$drug_classification
merged_df <- merge(general_information, drug_classification, by = "drugbank_id")
nrow(merged_df)
[1] 5
# 25. Count unique experimental properties (kind)
experimental_properties <- db_data$drugs$experimental_properties
unique_kinds <- unique(experimental_properties$kind)
length(unique_kinds)
[1] 2
# 26. Filter for kind "logP" and count rows
logP_df <- experimental_properties[experimental_properties$kind == "logP", ]
nrow(logP_df)
[1] 5
# 27. Convert value column to numeric and calculate mean
logP_df$value <- as.numeric(logP_df$value)
mean(logP_df$value, na.rm = TRUE)
[1] 1.48
# 28. Calculate summary statistics for logP values
summary(logP_df$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.80    1.20    1.50    1.48    1.80    2.10 
sd(logP_df$value, na.rm = TRUE)
[1] 0.5069517
# 29. Create a histogram of molecular weight values
molecular_weight <- experimental_properties[experimental_properties$kind == "Molecular Weight", ]
molecular_weight$value <- as.numeric(molecular_weight$value)
# clean based on 3 sigma rule
molecular_weight_clean <- molecular_weight[
  abs(molecular_weight$value - mean(molecular_weight$value, na.rm = TRUE)) <= 3 * sd(molecular_weight$value, na.rm = TRUE),]
hist(molecular_weight_clean$value, main = "Histogram of Molecular Weight", xlab = "Molecular Weight", ylab = "Frequency", col = "lightblue",breaks = 20)

# 30. Filter for kind "Water Solubility" and count unique values
water_solubility_df <- experimental_properties[experimental_properties$kind == "Water Solubility", ]
length(unique(water_solubility_df$value))
[1] 0

LAB 5 – data visualization

library(ggplot2)
data(iris)
boxplot(iris$Sepal.Length ~ iris$Species)

hist(iris$Sepal.Length)

plot(Petal.Length ~ Sepal.Length, data=iris)
abline(lm(Petal.Length ~ Sepal.Length, data=iris), col="red")

data(mpg, package='ggplot2')
ggplot(data=mpg, aes(x=class)) +
  geom_bar()

table(mpg$class)

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
df <- as.data.frame(table(mpg$class))
df
        Var1 Freq
1    2seater    5
2    compact   47
3    midsize   41
4    minivan   11
5     pickup   33
6 subcompact   35
7        suv   62
# By default, the geom_bar() just counts the number of cases and displays how many observations were in each factor level. If we have a data frame that we have already summarized, geom_col will allow us to set the height of the bar by a y column.

ggplot(df, aes(Var1, Freq)) +
  geom_col()

ggplot(mpg,aes(x=hwy)) +geom_histogram(binwidth = 2)

# density instead of the count 

p1 <- ggplot(mpg, aes(x=hwy, y=after_stat(density))) + 
  geom_histogram(bins=8, fill="blue", alpha=0.5) +
  labs(title="Histogram of Highway MPG density")
p2 <- ggplot(mpg, aes(x=hwy)) + 
  geom_density(fill='red', alpha=0.5) +
  labs(title="Density Plot of Highway MPG")
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

#Scatterplots
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length,color=Species))+
  geom_point()+# Anything set inside an aes() command will be of the form attribute=Column_Name and will change based on the data. Anything set outside an aes() command will be in the form attribute=value and will be fixed.
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length)) +
  geom_point(aes(color=Species,shape=Species))+
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(aes(color = Species, fill = Species), alpha = 0.5) +
  labs(title = "Density plot") +
  labs(x = "Sepal Length", y = "Density")

ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(aes(fill = Species,color=Species), alpha = 0.5) +
  labs(title = "Density plot") +
  labs(x = "Sepal Length", y = "Density") +
  labs(fill = "Revise the name",color="Revise the name")  # fill is the area,color is the line or dot

mtcars$cyl <- factor(mtcars$cyl) 

ggplot(mtcars, aes(x=wt, y=mpg, col=cyl)) + geom_point() +
  labs(title='Weight vs Miles per Gallon') +
  labs(x="Weight in tons (2000 lbs)", y="Miles per Gallon (US)" ) +
  labs(color="Cylinders") + # color is dot or line
  scale_color_manual(values=c('blue', 'darkmagenta', 'aquamarine')) # diy color

rainbow(3);rainbow(6)
[1] "#FF0000" "#00FF00" "#0000FF"
[1] "#FF0000" "#FFFF00" "#00FF00" "#00FFFF" "#0000FF" "#FF00FF"
library(colorspace)   # these two packages have some decent 
library(grDevices)    # color palettes functions.
library(MDplot) # Ramachandran plots are often used to show the sampling of the 𝜙/𝜓 protein backbone dihedral angles in order to assign secondary-structure propensities to the protein of interest
Loading required package: MASS

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

The following object is masked from 'package:ISLR2':

    Boston

Loading required package: RColorBrewer
Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

Loading required package: gtools

Attaching package: 'gtools'

The following object is masked from 'package:mosaic':

    logit

The following object is masked from 'package:car':

    logit
exp_data <- load_ramachandran(system.file( "extdata/ramachandran_example_GROMACS.txt.gz",package = "MDplot" ), mdEngine = "GROMACS")
exp_data <- data.frame(exp_data)
colnames(exp_data) <- c('phi', 'psi')
head(exp_data)
        phi      psi
1 -73.75620 115.6350
2  65.97530 -57.9671
3 -56.51010 130.7000
4 -34.58640 158.7300
5  -3.98544 -79.4994
6 -71.90600 -79.6882
#ggplot(exp_data, aes(x=phi, y=psi) ) + geom_bin_2d(bins=100)
ggplot(exp_data, aes(x=phi, y=psi) ) + geom_hex(bins=100)

ggplot(exp_data, aes(x=phi, y=psi) ) + geom_hex(bins=100) +
  scale_fill_continuous(type = "viridis") # scale_fill_gradient(low = "blue", high = "red")

ggplot(exp_data, aes(x=phi, y=psi) ) + geom_hex(bins=100) +
  scale_fill_gradient2(low = "blue", mid='purple', high = "red", midpoint=10) 

#ggplot(exp_data, aes(x=phi, y=psi) ) + geom_hex(bins=100) +
  #scale_fill_gradientn(colours = terrain.colors(5))

ggplot(exp_data, aes(x=phi, y=psi) ) + geom_density_2d_filled()

# Adjusting axes

ggplot(mpg, aes(x=class, y=hwy)) + 
  geom_boxplot() +
  scale_y_continuous(breaks = seq(10, 45, by=5))  #---diy scale in y-axis 

# ggplot(mpg, aes(x=class, y=hwy)) + geom_boxplot() +scale_y_continuous(breaks = seq(10, 45, by=5), minor_breaks = NULL)

# Zooming in/out--Danger!  This removes the data points first!
#ggplot(trees, aes(x=Girth, y=Volume)) + 
  #geom_point() +
  #geom_smooth(method='lm') +
  #xlim( 8, 19 ) + ylim(0, 60)
  • Example (https://github.com/jcoliver/learn-r/blob/gh-pages/data/mine-microbe-class-data.csv)
mine.data <- data.frame(
  Site = c(1, 2, 3, 1, 2, 3, 2, 3),
  Depth = c(0.5, 0.5, 0.5, 3.5, 3.5, 3.5, 25, 25),
  Sample.name = c('1-S', '2-S', '3-S', '1-M', '2-M', '3-M', '2-B', '3-B'),
  Actinobacteria = c(373871, 332856, 326695, 409809, 319778, 445572, 128251, 96304),
  Cytophagia = c(8052, 28561, 10468, 4481, 15885, 7302, 4732, 5566),
  Flavobacteriia = c(0, 0, 0, 0, 5230, 6218, 5917, 6353),
  Sphingobacteriia = c(0, 10013, 4918, 0, 8274, 8284, 0, 0),
  Nitrospira = c(0, 0, 0, 0, 0, 0, 4609, 0),
  Planctomycetia = c(4553, 10008, 0, 0, 0, 0, 56836, 67380),
  Alphaproteobacteria = c(143534, 70575, 105890, 110746, 52504, 45000, 133851, 95580),
  Betaproteobacteria = c(124454, 170161, 187673, 87245, 146073, 91711, 90204, 85707),
  Deltaproteobacteria = c(0, 0, 0, 0, 0, 0, 4260, 0),
  Gammaproteobacteria = c(8426, 9005, 12935, 7025, 110825, 69452, 31956, 165572)
)
  3. Check the structure of the data frame.
str(mine.data)
'data.frame':   8 obs. of  13 variables:
 $ Site               : num  1 2 3 1 2 3 2 3
 $ Depth              : num  0.5 0.5 0.5 3.5 3.5 3.5 25 25
 $ Sample.name        : chr  "1-S" "2-S" "3-S" "1-M" ...
 $ Actinobacteria     : num  373871 332856 326695 409809 319778 ...
 $ Cytophagia         : num  8052 28561 10468 4481 15885 ...
 $ Flavobacteriia     : num  0 0 0 0 5230 ...
 $ Sphingobacteriia   : num  0 10013 4918 0 8274 ...
 $ Nitrospira         : num  0 0 0 0 0 ...
 $ Planctomycetia     : num  4553 10008 0 0 0 ...
 $ Alphaproteobacteria: num  143534 70575 105890 110746 52504 ...
 $ Betaproteobacteria : num  124454 170161 187673 87245 146073 ...
 $ Deltaproteobacteria: num  0 0 0 0 0 0 4260 0
 $ Gammaproteobacteria: num  8426 9005 12935 7025 110825 ...

The data frame has 8 rows (“obs.”) and 13 columns (“variables”). The first three columns have information about the observation (Site, Depth, Sample.name), and the remaining columns have the abundance for each of 10 classes of bacteria.

We ultimately want a heatmap where the different sites are shown along the x-axis, the classes of bacteria are shown along the y-axis, and the shading of the cell reflects the abundance. This latter value, the abundance of a bacterial class at a specific depth and site, is really just a third dimension. However, instead of creating a 3-dimensional plot that can be difficult to visualize, we instead use shading for our “z-axis”. To this end, we need our data formatted so we have a column corresponding to each of these three dimensions:

  • X: Sample identity
  • Y: Bacterial class
  • Z: Abundance

The challenge is that our data are not formatted like this. While the Sample.name column corresponds to what we would like for our x-axis, we do not have columns that correspond to what is needed for the y- and z-axes. All the data are in our data frame, but we need to take a table that looks like this:

Sample.name  Actinobacteria  Cytophagia
1-S          373871          8052
2-S          332856          28561
…            …               …

And transform it to one with a column for bacterial class and a column for abundance, like this:

Sample.name  Class           Abundance
1-S          Actinobacteria  373871
1-S          Cytophagia      8052
2-S          Actinobacteria  332856
…            …               …

In later labs, you will learn a convenient package and function to achieve this. But for now, we just use some basic functions to do so.

  4. Remove the first two columns (Site and Depth).
mine.data <- mine.data[,-c(1,2)]
mine.data
  Sample.name Actinobacteria Cytophagia Flavobacteriia Sphingobacteriia
1         1-S         373871       8052              0                0
2         2-S         332856      28561              0            10013
3         3-S         326695      10468              0             4918
4         1-M         409809       4481              0                0
5         2-M         319778      15885           5230             8274
6         3-M         445572       7302           6218             8284
7         2-B         128251       4732           5917                0
8         3-B          96304       5566           6353                0
  Nitrospira Planctomycetia Alphaproteobacteria Betaproteobacteria
1          0           4553              143534             124454
2          0          10008               70575             170161
3          0              0              105890             187673
4          0              0              110746              87245
5          0              0               52504             146073
6          0              0               45000              91711
7       4609          56836              133851              90204
8          0          67380               95580              85707
  Deltaproteobacteria Gammaproteobacteria
1                   0                8426
2                   0                9005
3                   0               12935
4                   0                7025
5                   0              110825
6                   0               69452
7                4260               31956
8                   0              165572
  5. Use “colnames()” to get a vector of class names and assign it to an object named “class.name”.
class.name <- colnames(mine.data)[-1]
class.name
 [1] "Actinobacteria"      "Cytophagia"          "Flavobacteriia"     
 [4] "Sphingobacteriia"    "Nitrospira"          "Planctomycetia"     
 [7] "Alphaproteobacteria" "Betaproteobacteria"  "Deltaproteobacteria"
[10] "Gammaproteobacteria"
  6. For Sample.name 1-S, use the as.numeric() function to convert its corresponding row to a vector. You should first exclude the 1-S entry in that row. Store this vector in an object named “abundance”.
abundance <- as.numeric(mine.data[mine.data$Sample.name == '1-S', -1])
abundance
 [1] 373871   8052      0      0      0   4553 143534 124454      0   8426
  7. Combine the Sample.name 1-S, the class.name vector, and the abundance vector into a 10-row, 3-column (Sample.name, Class, Abundance) data.frame as shown in the table above.
df1s <- data.frame(Sample.name = '1-S',
                   Class=class.name,
                   Abundance=abundance)
df1s
   Sample.name               Class Abundance
1          1-S      Actinobacteria    373871
2          1-S          Cytophagia      8052
3          1-S      Flavobacteriia         0
4          1-S    Sphingobacteriia         0
5          1-S          Nitrospira         0
6          1-S      Planctomycetia      4553
7          1-S Alphaproteobacteria    143534
8          1-S  Betaproteobacteria    124454
9          1-S Deltaproteobacteria         0
10         1-S Gammaproteobacteria      8426
  8. Repeat questions 6 and 7 for all values of Sample.name, and rbind() the results into a data frame named “mine.table”.
df2s <- data.frame(Sample.name = '2-S',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '2-S', -1]))
df3s <- data.frame(Sample.name = '3-S',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '3-S', -1]))
df1m <- data.frame(Sample.name = '1-M',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '1-M', -1]))
df2m <- data.frame(Sample.name = '2-M',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '2-M', -1]))
df3m <- data.frame(Sample.name = '3-M',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '3-M', -1]))
df2b <- data.frame(Sample.name = '2-B',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '2-B', -1]))
df3b <- data.frame(Sample.name = '3-B',
                   Class=class.name,
                   Abundance=as.numeric(mine.data[mine.data$Sample.name == '3-B', -1]))
mine.table <- rbind(df1s, df2s, df3s, df1m, df2m, df3m, df2b, df3b)
mine.table
   Sample.name               Class Abundance
1          1-S      Actinobacteria    373871
2          1-S          Cytophagia      8052
3          1-S      Flavobacteriia         0
4          1-S    Sphingobacteriia         0
5          1-S          Nitrospira         0
6          1-S      Planctomycetia      4553
7          1-S Alphaproteobacteria    143534
8          1-S  Betaproteobacteria    124454
9          1-S Deltaproteobacteria         0
10         1-S Gammaproteobacteria      8426
11         2-S      Actinobacteria    332856
12         2-S          Cytophagia     28561
13         2-S      Flavobacteriia         0
14         2-S    Sphingobacteriia     10013
15         2-S          Nitrospira         0
16         2-S      Planctomycetia     10008
17         2-S Alphaproteobacteria     70575
18         2-S  Betaproteobacteria    170161
19         2-S Deltaproteobacteria         0
20         2-S Gammaproteobacteria      9005
21         3-S      Actinobacteria    326695
22         3-S          Cytophagia     10468
23         3-S      Flavobacteriia         0
24         3-S    Sphingobacteriia      4918
25         3-S          Nitrospira         0
26         3-S      Planctomycetia         0
27         3-S Alphaproteobacteria    105890
28         3-S  Betaproteobacteria    187673
29         3-S Deltaproteobacteria         0
30         3-S Gammaproteobacteria     12935
31         1-M      Actinobacteria    409809
32         1-M          Cytophagia      4481
33         1-M      Flavobacteriia         0
34         1-M    Sphingobacteriia         0
35         1-M          Nitrospira         0
36         1-M      Planctomycetia         0
37         1-M Alphaproteobacteria    110746
38         1-M  Betaproteobacteria     87245
39         1-M Deltaproteobacteria         0
40         1-M Gammaproteobacteria      7025
41         2-M      Actinobacteria    319778
42         2-M          Cytophagia     15885
43         2-M      Flavobacteriia      5230
44         2-M    Sphingobacteriia      8274
45         2-M          Nitrospira         0
46         2-M      Planctomycetia         0
47         2-M Alphaproteobacteria     52504
48         2-M  Betaproteobacteria    146073
49         2-M Deltaproteobacteria         0
50         2-M Gammaproteobacteria    110825
51         3-M      Actinobacteria    445572
52         3-M          Cytophagia      7302
53         3-M      Flavobacteriia      6218
54         3-M    Sphingobacteriia      8284
55         3-M          Nitrospira         0
56         3-M      Planctomycetia         0
57         3-M Alphaproteobacteria     45000
58         3-M  Betaproteobacteria     91711
59         3-M Deltaproteobacteria         0
60         3-M Gammaproteobacteria     69452
61         2-B      Actinobacteria    128251
62         2-B          Cytophagia      4732
63         2-B      Flavobacteriia      5917
64         2-B    Sphingobacteriia         0
65         2-B          Nitrospira      4609
66         2-B      Planctomycetia     56836
67         2-B Alphaproteobacteria    133851
68         2-B  Betaproteobacteria     90204
69         2-B Deltaproteobacteria      4260
70         2-B Gammaproteobacteria     31956
71         3-B      Actinobacteria     96304
72         3-B          Cytophagia      5566
73         3-B      Flavobacteriia      6353
74         3-B    Sphingobacteriia         0
75         3-B          Nitrospira         0
76         3-B      Planctomycetia     67380
77         3-B Alphaproteobacteria     95580
78         3-B  Betaproteobacteria     85707
79         3-B Deltaproteobacteria         0
80         3-B Gammaproteobacteria    165572

This can also be realized with tidyr::pivot_longer(). The snippet below shows the general pattern on another table, exp.data, which has subject and treatment identifier columns:

exp.long <- pivot_longer(data = exp.data, 
                         cols = -c(subject, treatment),
                         names_to = "gene", 
                         values_to = "expression")

Better-looking labels (exp.heatmap below is the heatmap object from that same example):


exp.heatmap + theme(axis.title.y = element_blank(), # Remove the y-axis title
                    axis.text.x = element_text(angle = 45, vjust = 0.5)) # Rotate the x-axis labels

Now the data are ready - on to the plot!

  9. To plot a heatmap, we are going to use the ggplot2 package. Install and load ggplot2.

For this plot, we are going to first create the heatmap object with the ggplot function, then print the plot. We create the object by assigning the output of the ggplot call to the variable mine.heatmap, then entering the name of this object to print it to the screen.

  • Plot heatmap with x=Sample.name and y=Class. Fill the heatmap using Abundance value.
  • Choose the right geom_ function to draw the heatmap
mine.heatmap <- ggplot(data = mine.table, mapping = aes(x = Sample.name,
                                                        y = Class,
                                                        fill = Abundance)) + 
  geom_tile() # use geom_tile to draw the heatmap
mine.heatmap

  10. First let’s deal with a scaling issue. Most values are dark blue, but this is partly caused by the distribution of values - there are a few really high values in the Abundance column. We can transform the data for display by taking the square root of abundance and plotting that. To do this, we need to add another column to our data frame and update our call to ggplot to reference this new column, Sqrt.abundance.
mine.table$Sqrt.abundance <- sqrt(mine.table$Abundance)
mine.heatmap <- ggplot(data = mine.table, mapping = aes(x = Sample.name,
                                                        y = Class,
                                                        fill = Sqrt.abundance)) +
  geom_tile() +
  xlab(label = "Sample")
mine.heatmap

  11. Change the shading so low values are light and high values are dark. Use white (#FFFFFF) for the low values and a dark blue (#012345) for high values. You can also pick whatever color you like.
mine.heatmap <- mine.heatmap + scale_fill_gradient(low="#FFFFFF", high="#012345")
mine.heatmap
  12. Add a title “Microbe Class Abundance” to this plot.
mine.heatmap <- mine.heatmap + ggtitle(label = "Microbe Class Abundance")
mine.heatmap
  13. Reverse the order of the y-axis so ‘Actinobacteria’ is at the top. Use the rev() function to get the reverse of the levels of Class, and use the obtained character vector as the limits of the scale_y_discrete() function.
mine.heatmap <- mine.heatmap + scale_y_discrete(limits=rev(levels(as.factor(mine.table$Class))))
mine.heatmap

  14. Use “geom_text()” to add abundance values on top of the heatmap. Is the text too big to display on top of the tiles? What can we do? Consider removing decimals and resizing the text.
mine.heatmap + geom_text(aes(label = round(Sqrt.abundance, 0)), size=2)

  15. Congratulations on completing the above tasks, but you may have noticed that the x-axis Sample.name values are not in the correct order. A good idea is to visualize the different categories of data in separate panels; we will learn this function next lab. But if you have some time, you may try it now by reading the help information for facet_grid.

This approach, called “faceting”, requires one additional layer in the plotting command, facet_grid. With facet_grid we indicate which column contains the categories we want to use for the panels. But before we do that, we need to add Depth back to the data frame: for Sample.name %in% c(“1-S”, “2-S”, “3-S”), add Depth=0.5 in a new column named Depth; for Sample.name values ending with M, add Depth=3.5; and 25 for those ending with B.

mine.table$Depth <- NULL
mine.table[mine.table$Sample.name %in% c("1-S", "2-S", "3-S"), 'Depth'] = 0.5
mine.table[mine.table$Sample.name %in% c("1-M", "2-M", "3-M"), 'Depth'] = 3.5
mine.table[mine.table$Sample.name %in% c("2-B", "3-B"), 'Depth'] = 25
mine.table
   Sample.name               Class Abundance Sqrt.abundance Depth
1          1-S      Actinobacteria    373871      611.44992   0.5
2          1-S          Cytophagia      8052       89.73294   0.5
3          1-S      Flavobacteriia         0        0.00000   0.5
4          1-S    Sphingobacteriia         0        0.00000   0.5
5          1-S          Nitrospira         0        0.00000   0.5
6          1-S      Planctomycetia      4553       67.47592   0.5
7          1-S Alphaproteobacteria    143534      378.85881   0.5
8          1-S  Betaproteobacteria    124454      352.78038   0.5
9          1-S Deltaproteobacteria         0        0.00000   0.5
10         1-S Gammaproteobacteria      8426       91.79325   0.5
11         2-S      Actinobacteria    332856      576.93674   0.5
12         2-S          Cytophagia     28561      169.00000   0.5
13         2-S      Flavobacteriia         0        0.00000   0.5
14         2-S    Sphingobacteriia     10013      100.06498   0.5
15         2-S          Nitrospira         0        0.00000   0.5
16         2-S      Planctomycetia     10008      100.03999   0.5
17         2-S Alphaproteobacteria     70575      265.65956   0.5
18         2-S  Betaproteobacteria    170161      412.50576   0.5
19         2-S Deltaproteobacteria         0        0.00000   0.5
20         2-S Gammaproteobacteria      9005       94.89468   0.5
21         3-S      Actinobacteria    326695      571.57239   0.5
22         3-S          Cytophagia     10468      102.31324   0.5
23         3-S      Flavobacteriia         0        0.00000   0.5
24         3-S    Sphingobacteriia      4918       70.12845   0.5
25         3-S          Nitrospira         0        0.00000   0.5
26         3-S      Planctomycetia         0        0.00000   0.5
27         3-S Alphaproteobacteria    105890      325.40744   0.5
28         3-S  Betaproteobacteria    187673      433.21242   0.5
29         3-S Deltaproteobacteria         0        0.00000   0.5
30         3-S Gammaproteobacteria     12935      113.73214   0.5
31         1-M      Actinobacteria    409809      640.16326   3.5
32         1-M          Cytophagia      4481       66.94027   3.5
33         1-M      Flavobacteriia         0        0.00000   3.5
34         1-M    Sphingobacteriia         0        0.00000   3.5
35         1-M          Nitrospira         0        0.00000   3.5
36         1-M      Planctomycetia         0        0.00000   3.5
37         1-M Alphaproteobacteria    110746      332.78522   3.5
38         1-M  Betaproteobacteria     87245      295.37265   3.5
39         1-M Deltaproteobacteria         0        0.00000   3.5
40         1-M Gammaproteobacteria      7025       83.81527   3.5
41         2-M      Actinobacteria    319778      565.48917   3.5
42         2-M          Cytophagia     15885      126.03571   3.5
43         2-M      Flavobacteriia      5230       72.31874   3.5
44         2-M    Sphingobacteriia      8274       90.96153   3.5
45         2-M          Nitrospira         0        0.00000   3.5
46         2-M      Planctomycetia         0        0.00000   3.5
47         2-M Alphaproteobacteria     52504      229.13751   3.5
48         2-M  Betaproteobacteria    146073      382.19498   3.5
49         2-M Deltaproteobacteria         0        0.00000   3.5
50         2-M Gammaproteobacteria    110825      332.90389   3.5
51         3-M      Actinobacteria    445572      667.51180   3.5
52         3-M          Cytophagia      7302       85.45174   3.5
53         3-M      Flavobacteriia      6218       78.85430   3.5
54         3-M    Sphingobacteriia      8284       91.01648   3.5
55         3-M          Nitrospira         0        0.00000   3.5
56         3-M      Planctomycetia         0        0.00000   3.5
57         3-M Alphaproteobacteria     45000      212.13203   3.5
58         3-M  Betaproteobacteria     91711      302.83824   3.5
59         3-M Deltaproteobacteria         0        0.00000   3.5
60         3-M Gammaproteobacteria     69452      263.53747   3.5
61         2-B      Actinobacteria    128251      358.12149  25.0
62         2-B          Cytophagia      4732       68.78953  25.0
63         2-B      Flavobacteriia      5917       76.92204  25.0
64         2-B    Sphingobacteriia         0        0.00000  25.0
65         2-B          Nitrospira      4609       67.88962  25.0
66         2-B      Planctomycetia     56836      238.40302  25.0
67         2-B Alphaproteobacteria    133851      365.85653  25.0
68         2-B  Betaproteobacteria     90204      300.33981  25.0
69         2-B Deltaproteobacteria      4260       65.26868  25.0
70         2-B Gammaproteobacteria     31956      178.76241  25.0
71         3-B      Actinobacteria     96304      310.32886  25.0
72         3-B          Cytophagia      5566       74.60563  25.0
73         3-B      Flavobacteriia      6353       79.70571  25.0
74         3-B    Sphingobacteriia         0        0.00000  25.0
75         3-B          Nitrospira         0        0.00000  25.0
76         3-B      Planctomycetia     67380      259.57658  25.0
77         3-B Alphaproteobacteria     95580      309.16015  25.0
78         3-B  Betaproteobacteria     85707      292.75758  25.0
79         3-B Deltaproteobacteria         0        0.00000  25.0
80         3-B Gammaproteobacteria    165572      406.90539  25.0
  1. Facet the plot by Depth; set scales and space to "free" and set switch to "x".
mine.table$Sqrt.abundance <- sqrt(mine.table$Abundance)
mine.heatmap <- ggplot(data = mine.table, mapping = aes(x = Sample.name,
                                                        y = Class,
                                                        fill = Sqrt.abundance)) +
  geom_tile() +
  facet_grid(~Depth, switch = 'x', scales='free', space='free') + 
  xlab(label = "Sample") + 
  scale_fill_gradient(low="#FFFFFF", high="#012345") + 
  ggtitle(label = "Microbe Class Abundance") + 
  scale_y_discrete(limits=rev(levels(as.factor(mine.table$Class)))) + 
  geom_text(aes(label = round(Sqrt.abundance, 0)), size=2)
mine.heatmap

library(palmerpenguins)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = bill_depth_mm, shape = species), na.rm = TRUE) +
  geom_smooth(na.rm = TRUE, se = FALSE) +
  scale_color_gradient2(low = 'yellow', mid = 'green', high = 'blue', midpoint = 17) +
  labs(x = 'Flipper length (millimeters)', y = 'Body mass (grams)',
       color = 'Bill depth (millimeters)') +
  theme_bw()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

(The original call used se=F and failed with "the condition has length > 1", most likely because F had been reassigned to a vector elsewhere in the session; spelling out se = FALSE avoids that trap.)

  • Faceting (make many panels of graphics where each panel shows the same relationship between variables, but something changes between each panel)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point() +
  facet_grid(.~Species) # or facet_grid(Species~.) to arrange the Species panels vertically instead of side by side

  • Another example
library(reshape)
data(tips, package='reshape')
head(tips, 3)
  total_bill  tip    sex smoker day   time size
1      16.99 1.01 Female     No Sun Dinner    2
2      10.34 1.66   Male     No Sun Dinner    3
3      21.01 3.50   Male     No Sun Dinner    3
ggplot(tips, aes(x = total_bill, y = tip / total_bill)) +
  geom_point() +
  facet_grid( smoker ~ day )

# 'free_y' means the scale of different panels are adjusted by themselves
ggplot(tips, aes(x = total_bill, y = tip / total_bill)) +
  geom_point() +
  facet_wrap( ~ day, scales='free_y')

# log scales: scale_y_log10() is a convenience wrapper around scale_y_continuous()
# with a log10 transformation (see also trans_new())
# ggplot(ACS, aes(x=Age, y=Income)) + geom_point() +
# scale_y_log10(breaks=c(1, 10, 100),
#            minor=c(1:10,
#                 seq(10, 100, by=10 ),
#                seq(100, 1000, by=100))) +
#  ylab('Income (1000s of dollars)')
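The ACS data is not loaded here, so as a runnable sketch of the same idea, here is a log10 y-axis on the built-in msleep data from ggplot2 (the breaks are chosen arbitrarily):

ggplot(msleep, aes(x = sleep_total, y = bodywt)) +
  geom_point() +
  scale_y_log10(breaks = c(0.01, 1, 100, 10000)) +
  ylab('Body weight (kg, log10 scale)')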
  • Multi-plot
p1 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet, group=Chick)) +
    geom_line() +
    ggtitle("Growth curve for individual chicks")
# Second plot
p2 <- ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet)) +
    geom_point(alpha=.3) +
    geom_smooth(alpha=.2, linewidth=1) +
    ggtitle("Fitted growth curve per diet")
# Third plot 
p3 <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, colour=Diet)) +
    geom_density() +
    ggtitle("Final weight, by diet")


# to realize:
# plot1 plot2 plot2
# plot1 plot2 plot2
# plot1 plot3 plot3

my.layout = cbind( c(1,1,1), c(2,2,3), c(2,2,3) ) # each c() is one column of the layout matrix; 1, 2, 3 refer to p1, p2, p3
library(Rmisc)
Rmisc::multiplot( p1, p2, p3, layout=my.layout)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# OR use the ggpubr package (https://rpkgs.datanovia.com/ggpubr/)

library(ggpubr)
# Box plot (bp)
bxp <- ggboxplot(ToothGrowth, x = "dose", y = "len",
                 color = "dose", palette = "jco")
# Dot plot (dp)
dp <- ggdotplot(ToothGrowth, x = "dose", y = "len",
                 color = "dose", palette = "jco", binwidth = 1)
mtcars$name <- rownames(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
bp <- ggbarplot(mtcars, x = "name", y = "mpg",
          fill = "cyl",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate vertically x axis texts
          ) + font("x.text", size = 8)
# Scatter plots (sp)
sp <- ggscatter(mtcars, x = "wt", y = "mpg",
                add = "reg.line",               # Add regression line
                conf.int = TRUE,                # Add confidence interval
                color = "cyl", palette = "jco", # Color by groups "cyl"
                shape = "cyl"                   # Change point shape by groups "cyl"
                ) + 
  stat_cor(aes(color = cyl), label.x = 3)       # Add correlation coefficient

ggarrange(bxp, dp, bp + rremove("x.text"),
          labels = c("A", "B", "C"),
          ncol = 2, nrow = 2)

# Themes
# Rmisc::multiplot( p1 + theme_bw(),          # Black and white
#                   p1 + theme_minimal(),   
#                   p1 + theme_dark(),        
#                   p1 + theme_light(),
#                   cols=2 )

#ggsave('p1.png', width=6, height=3, dpi=350)

  • data manipulation
library(dplyr)
library(tidyverse)
# apply
# Summarize each row by calculating its mean (MARGIN = 1 gives one value per flower).
apply(iris[,-5],        # what object do we want to apply the function to
      MARGIN=1,    # rows = 1, columns = 2 (same order as [rows, cols])
      FUN=mean     # what function do we want to apply     
     ) %>% head(10)
 [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400
average <- apply( 
  iris[,-5],        # what object do we want to apply the function to
  MARGIN=2,    # rows = 1, columns = 2 (same order as [rows, cols])
  FUN=mean     # what function do we want to apply
)
iris <- rbind(iris[,-5], average)   # append the column means as an extra row (note: this overwrites iris in the workspace)
iris %>% head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
# There are several variants of the apply() function; the most frequently used are lapply() and sapply(). These two functions apply a given function to each element of a list or vector and return a corresponding list or vector of results.

#lapply
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
x
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$beta
[1]  0.04978707  0.13533528  0.36787944  1.00000000  2.71828183  7.38905610
[7] 20.08553692

$logic
[1]  TRUE FALSE FALSE  TRUE
lapply(x, quantile, probs = 1:3/4) # list 
$a
 25%  50%  75% 
3.25 5.50 7.75 

$beta
      25%       50%       75% 
0.2516074 1.0000000 5.0536690 

$logic
25% 50% 75% 
0.0 0.5 1.0 
sapply(x, quantile, probs = 1:3/4) # matrix
       a      beta logic
25% 3.25 0.2516074   0.0
50% 5.50 1.0000000   0.5
75% 7.75 5.0536690   1.0

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

data <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() - 1:3)
data
  a b          c
1 1 a 2025-07-03
2 2 b 2025-07-02
3 3 c 2025-07-01
as_tibble(data)
# A tibble: 3 × 3
      a b     c         
  <int> <chr> <date>    
1     1 a     2025-07-03
2     2 b     2025-07-02
3     3 c     2025-07-01
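A small sketch of the "complain more" behaviour (the column name val is hypothetical): a data.frame silently partial-matches column names with $, while a tibble refuses.

df <- data.frame(value = 1:3)
tb <- as_tibble(df)
df$val   # partial matching: silently returns the 'value' column
tb$val   # NULL, plus a warning about the unknown column 'val'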
  • %>%

For example, if we wanted to start with x, and first apply function f(), then g(), and then h(), the usual R command would be h(g(f(x))) which is hard to read because you have to start reading at the innermost set of parentheses. Using the pipe command %>%, this sequence of operations becomes x %>% f() %>% g() %>% h().
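A tiny sketch with concrete functions standing in for f(), g(), and h() (abs, sqrt, and round are just for illustration):

x <- c(-4.2, 9.7, -1.3)
round(sqrt(abs(x)), 1)                 # nested form: read from the inside out
x %>% abs() %>% sqrt() %>% round(1)    # piped form: read left to right, same result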

  • select

library(dplyr)

# Correct usage of select() within a pipeline
starwars %>% select(-ends_with('color'))
  • filter
library(dplyr)

# Filter rows where species is "Droid" and mass is less than 100
filtered_data <- starwars %>% filter(species == "Droid", mass < 100)
print(filtered_data)
# A tibble: 3 × 14
  name  height  mass hair_color skin_color  eye_color birth_year sex   gender   
  <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>    
1 C-3PO    167    75 <NA>       gold        yellow           112 none  masculine
2 R2-D2     96    32 <NA>       white, blue red               33 none  masculine
3 R5-D4     97    32 <NA>       white, red  red               NA none  masculine
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • slice

filter rows based on row number:

starwars %>% slice(2:4)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
2 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
3 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
  • arrange
starwars %>% arrange(desc(name)) # arrange() sorts in ascending order by default; desc() reverses it
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Zam Wes…    168    55 blonde     fair, gre… yellow            NA fema… femin…
 2 Yoda         66    17 white      green      brown            896 male  mascu…
 3 Yarael …    264    NA none       white      yellow            NA male  mascu…
 4 Wilhuff…    180    NA auburn, g… fair       blue              64 male  mascu…
 5 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 6 Wedge A…    170    77 brown      fair       hazel             21 male  mascu…
 7 Watto       137    NA black      blue, grey yellow            NA male  mascu…
 8 Wat Tam…    193    48 none       green, gr… unknown           NA male  mascu…
 9 Tion Me…    206    80 none       grey       black             NA male  mascu…
10 Taun We     213    NA none       grey       black             NA fema… femin…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
starwars %>% arrange(desc(height)) %>% head(3)
# A tibble: 3 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Yarael P…    264    NA none       white      yellow            NA male  mascu…
2 Tarfful      234   136 brown      brown      blue              NA male  mascu…
3 Lama Su      229    88 none       grey       black             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
dd <- data.frame(
  Trt = factor(c("High", "Med", "High", "Low"),        
               levels = c("Low", "Med", "High")), # level
  y = c(8, 3, 9, 9),      
  z = c(1, 1, 1, 2)) 
dd %>% arrange(Trt, desc(y))
   Trt y z
1  Low 9 2
2  Med 3 1
3 High 9 1
4 High 8 1
  • mutate
# base R approach: create the new column with $, then select it
starwars$bmi = starwars$mass / ((starwars$height / 100) ^ 2)
starwars %>% select(name, bmi) %>% head(3)

# mutate avoids all the starwars$

starwars$bmi <- NULL
starwars %>% 
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%  
  select(name, bmi) %>% head(3)

# mutate_at() and mutate_if() allow us to apply a function to a particular column and save the output.

subset <- starwars %>% 
  mutate(square_height = (height / 100) ^ 2,
         bmi = mass / square_height) %>%
  select(name, square_height, bmi)
subset %>% head(3)

subset %>% mutate_if(is.numeric, round, digits=0) # here, is.numeric is the condition

subset %>% mutate_at(2:3, round, digits=0) %>% head() # column 2 3

# Apply the transformation to columns 2 and 3 for rows 1 to 3
result <- subset %>% 
  mutate_at(2:3, ~ifelse(row_number() %in% 1:3, round(., digits = 0), .))

subset %>% mutate(avg.example = select(., square_height:bmi) %>% rowMeans())
  • summarise

This function is used to create a summary table. It reduces the data frame to a single row containing summary statistics.

starwars %>% summarise(mean.height=mean(height, na.rm=T), sd.height=sd(height, na.rm=T))

# apply the same statistic to each column
starwars %>% select(height:mass) %>% summarise_all(list(min=min, max=max), na.rm=T)
starwars %>% summarise_if(is.numeric, list(min=min, max=max), na.rm = T)
  • group_by

library(dplyr)
library(palmerpenguins)
table(penguins$sex, penguins$species)
penguins %>% 
  filter(!is.na(sex)) %>%
  group_by(sex, species) %>%           
  summarise(n = n(), 
            mean.flipper = mean(flipper_length_mm),
            sd.flipper = sd(flipper_length_mm),
            .groups='keep') %>%
  head(3)
  
  • a comparison of the same operation with and without %>%
# %>% 
penguins %>% 
  filter(!is.na(sex)) %>%
  group_by(sex, species) %>%           
  mutate(Sum.Sq.Cells = (flipper_length_mm - mean(flipper_length_mm))^2)  %>%  
  select(sex, species, flipper_length_mm, Sum.Sq.Cells) %>% head()

# not use %>% 
head(
  select(mutate(group_by(filter(penguins, !is.na(sex)), sex, species),
                Sum.Sq.Cells = (flipper_length_mm - mean(flipper_length_mm))^2),
       sex, species, flipper_length_mm, Sum.Sq.Cells))
library(nycflights13)
str(nycflights13::flights)
# the order of group_by and summarize matters
flights %>% 
  group_by(carrier) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_dep_delay))
  1. Find the flight with the longest departure delay among flights from the same origin and destination (use filter()). Relocate the origin, destination, and departure delay to the first three columns and sort by origin and dest.
flights %>% 
  filter(!is.na(dep_delay)) %>% 
  group_by(origin, dest) %>% 
  filter(dep_delay == max(dep_delay)) %>% 
  relocate(origin, dest, dep_delay) %>% 
  arrange(origin, dest)
  1. Find the flight with the longest departure delay among flights from the same origin and destination (use top_n() or slice_max()). Relocate the origin, destination, and departure delay to the first three columns and sort by origin and dest.
flights %>% 
  filter(!is.na(dep_delay)) %>% 
  group_by(origin, dest) %>% 
  top_n(1, dep_delay) %>%  # or using slice_max(dep_delay) %>% 
  relocate(origin, dest, dep_delay) %>% 
  arrange(origin, dest)
  1. How do departure delays vary at different times of the day? Summarize the average departure delay by hour and create a new column named dep_delay_level that uses cut() to divide the average delays into three levels (low, median, and high).
flights %>% 
  group_by(hour) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(dep_delay_level = cut(avg_dep_delay, breaks=3, c('low', 'median', 'high')))
  1. How do departure delays vary at different times of the day? Illustrate your answer with a geom_smooth() plot.
flights %>% 
  group_by(hour) %>% 
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = avg_dep_delay)) + geom_smooth()
# rename columns whose names contain blanks (backticks are needed to refer to them)
students %>%
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  ) %>% head(3)


read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
library(tidyr)
grade.book <- rbind(
  data.frame(name='Alison',  HW.1=8, HW.2=5, HW.3=8, HW.4=4),
  data.frame(name='Brandon', HW.1=5, HW.2=3, HW.3=6, HW.4=9),
  data.frame(name='Charles', HW.1=9, HW.2=7, HW.3=9, HW.4=10))
grade.book
     name HW.1 HW.2 HW.3 HW.4
1  Alison    8    5    8    4
2 Brandon    5    3    6    9
3 Charles    9    7    9   10
tidy.scores <- grade.book %>%
  pivot_longer(
    cols = starts_with("HW"),
    names_to = "Homework",
    values_to = "Score"
  )
tidy.scores
# A tibble: 12 × 3
   name    Homework Score
   <chr>   <chr>    <dbl>
 1 Alison  HW.1         8
 2 Alison  HW.2         5
 3 Alison  HW.3         8
 4 Alison  HW.4         4
 5 Brandon HW.1         5
 6 Brandon HW.2         3
 7 Brandon HW.3         6
 8 Brandon HW.4         9
 9 Charles HW.1         9
10 Charles HW.2         7
11 Charles HW.3         9
12 Charles HW.4        10
tidy.scores %>% pivot_wider(names_from=Homework, values_from=Score)
# A tibble: 3 × 5
  name     HW.1  HW.2  HW.3  HW.4
  <chr>   <dbl> <dbl> <dbl> <dbl>
1 Alison      8     5     8     4
2 Brandon     5     3     6     9
3 Charles     9     7     9    10
# table joins

Fish.Data <- tibble(
  Lake_ID = c('A','A','B','B','C','C'), 
  Fish.Weight=rnorm(6, mean=260, sd=25) ) # make up some data
Fish.Data
# A tibble: 6 × 2
  Lake_ID Fish.Weight
  <chr>         <dbl>
1 A              301.
2 A              243.
3 B              277.
4 B              309.
5 C              259.
6 C              254.
Lake.Data <- tibble(
  Lake_ID = c('B','C','D'),   
  Lake_Name = c('Lake Elaine', 'Mormon Lake', 'Lake Mary'),   
  pH=c(6.5, 6.3, 6.1),
  area = c(40, 210, 240),
  avg_depth = c(8, 10, 38))
Lake.Data
# A tibble: 3 × 5
  Lake_ID Lake_Name      pH  area avg_depth
  <chr>   <chr>       <dbl> <dbl>     <dbl>
1 B       Lake Elaine   6.5    40         8
2 C       Mormon Lake   6.3   210        10
3 D       Lake Mary     6.1   240        38
full_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 7 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 A              301. <NA>         NA      NA        NA
2 A              243. <NA>         NA      NA        NA
3 B              277. Lake Elaine   6.5    40         8
4 B              309. Lake Elaine   6.5    40         8
5 C              259. Mormon Lake   6.3   210        10
6 C              254. Mormon Lake   6.3   210        10
7 D               NA  Lake Mary     6.1   240        38
left_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 6 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 A              301. <NA>         NA      NA        NA
2 A              243. <NA>         NA      NA        NA
3 B              277. Lake Elaine   6.5    40         8
4 B              309. Lake Elaine   6.5    40         8
5 C              259. Mormon Lake   6.3   210        10
6 C              254. Mormon Lake   6.3   210        10
inner_join(Fish.Data, Lake.Data)
Joining with `by = join_by(Lake_ID)`
# A tibble: 4 × 6
  Lake_ID Fish.Weight Lake_Name      pH  area avg_depth
  <chr>         <dbl> <chr>       <dbl> <dbl>     <dbl>
1 B              277. Lake Elaine   6.5    40         8
2 B              309. Lake Elaine   6.5    40         8
3 C              259. Mormon Lake   6.3   210        10
4 C              254. Mormon Lake   6.3   210        10
# parse_number() strips non-numeric characters from a column; this snippet is a fragment
# meant to sit inside a pipeline, i.e. <data> %>% mutate(week = parse_number(week))
mutate(
    week = parse_number(week)
  )

Count how many data points fall in each Sex/Year group of the data set:

gender_year <- Survey %>% 
  filter(!is.na(Year)) %>% 
  group_by(Sex, Year) %>% 
  count() %>% 
  rename(nu=n)
gender_year
gender_year %>% pivot_wider(names_from = Year, values_from = nu)
who2 %>% 
  head(3)
who2 <- who2 %>% 
  pivot_longer(
    cols = !(country:year), 
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  ) %>% 
  filter(!is.na(count))
who2
left_join(feb14_VX, airports, by=c('dest'='faa'))

library(psych)
drug_prop <- drug_prop %>% 
  filter(class == 'Carboxylic acids and derivatives') 
drug_prop %>% 
  select(logP, logS, water_solubility) %>% 
  pairs.panels()
  • for loop
F <- rep(0, 10)        # note: F is normally shorthand for FALSE; reassigning it can break later code that relies on se=F, na.rm=F, etc.
F[1] <- 0             
F[2] <- 1              
cat('F = ', F, '\n') 

for( n in 3:10 ){
  F[n] <- F[n-1] + F[n-2]
  cat('F = ', F, '\n')    
}
  • bootstrap estimate of a sampling distribution
library(dplyr)
library(ggplot2)
SampDist <- data.frame()

for (i in 1:1000){
  SampDist <- trees %>% 
    slice_sample(n=30, replace =TRUE) %>% 
    dplyr::summarise(xbar=mean(Height)) %>% 
    rbind(SampDist)
}
ggplot(SampDist,aes(x=xbar)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ellipsis (`...`)

# a function that draws the regression line and confidence interval
# notice it doesn't return anything, all it does is draw a plot
show.lm <- function(m, interval.type='confidence', fill.col='light grey', ...){
  x <- m$model[,2]       # extract the predictor variable
  y <- m$model[,1]       # extract the response
  pred <- predict(m, interval=interval.type)
  plot(x, y, ...)
  polygon( c(x,rev(x)),                         # polygon() fills the region defined by a set of
           c(pred[,'lwr'], rev(pred[,'upr'])),  # vertices: here the ribbon between lwr and upr
           col=fill.col)                        # (upr is reversed so the vertices trace a closed loop)
  lines(x, pred[, 'fit'])                       # fitted regression line
  points(x, y)                                  # redraw the points on top of the ribbon
} 
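A possible usage sketch on the built-in trees data; the extra arguments travel through ... straight to plot():

# sort by the predictor first so the confidence ribbon is drawn left to right
m <- lm(Volume ~ Girth, data = trees[order(trees$Girth), ])
show.lm(m, xlab = 'Girth (in)', ylab = 'Volume (cu ft)', pch = 19,
        main = 'Volume vs Girth with confidence band')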

Copy of x (functions modify a local copy, not the caller's object)

k <- 3
example.func <- function(x){
  x <- sort(x)
  if (k > 1){
    print(x)
  }
}
x <- c(3,1,5,4,2)
example.func(x)
[1] 1 2 3 4 5
x # x is changed inside the function but not outside the function
[1] 3 1 5 4 2

LAB 10

Part 1

  1. Load the bloodpress.txt
bloodpress <- read.table("bloodpress.txt", header=T)
bloodpress
   Pt  BP Age Weight  BSA  Dur Pulse Stress
1   1 105  47   85.4 1.75  5.1    63     33
2   2 115  49   94.2 2.10  3.8    70     14
3   3 116  49   95.3 1.98  8.2    72     10
4   4 117  50   94.7 2.01  5.8    73     99
5   5 112  51   89.4 1.89  7.0    72     95
6   6 121  48   99.5 2.25  9.3    71     10
7   7 121  49   99.8 2.25  2.5    69     42
8   8 110  47   90.9 1.90  6.2    66      8
9   9 110  49   89.2 1.83  7.1    69     62
10 10 114  48   92.7 2.07  5.6    64     35
11 11 114  47   94.4 2.07  5.3    74     90
12 12 115  49   94.1 1.98  5.6    71     21
13 13 114  50   91.6 2.05 10.2    68     47
14 14 106  45   87.1 1.92  5.6    67     80
15 15 125  52  101.3 2.19 10.0    76     98
16 16 114  46   94.5 1.98  7.4    69     95
17 17 106  46   87.0 1.87  3.6    62     18
18 18 113  46   94.5 1.90  4.3    70     12
19 19 110  48   90.5 1.88  9.0    71     99
20 20 122  56   95.7 2.09  7.0    75     99
  1. Use the pairs.panels() function from the psych package to draw scatterplots and histograms and to calculate correlations between the variables.
library(psych)

Attaching package: 'psych'
The following object is masked from 'package:gtools':

    logit
The following objects are masked from 'package:mosaic':

    logit, rescale
The following object is masked from 'package:car':

    logit
The following objects are masked from 'package:ggplot2':

    %+%, alpha
pairs.panels(bloodpress[, -1])

  1. Fit a simple linear regression model of BP vs Stress. Is Stress significant?
model.1 <- lm(BP ~ Stress, data=bloodpress)
summary(model.1)

Call:
lm(formula = BP ~ Stress, data = bloodpress)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6394 -3.3014  0.0722  2.2181  9.9287 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 112.71997    2.19345  51.389   <2e-16 ***
Stress        0.02399    0.03404   0.705     0.49    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.502 on 18 degrees of freedom
Multiple R-squared:  0.02686,   Adjusted R-squared:  -0.0272 
F-statistic: 0.4969 on 1 and 18 DF,  p-value: 0.4899
  1. Fit a simple linear regression model of BP vs Weight.
model.2 <- lm(BP ~ Weight, data=bloodpress)
summary(model.2)

Call:
lm(formula = BP ~ Weight, data = bloodpress)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6933 -0.9318 -0.4935  0.7703  4.8656 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.20531    8.66333   0.255    0.802    
Weight       1.20093    0.09297  12.917 1.53e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.74 on 18 degrees of freedom
Multiple R-squared:  0.9026,    Adjusted R-squared:  0.8972 
F-statistic: 166.9 on 1 and 18 DF,  p-value: 1.528e-10
  1. Fit a simple linear regression model of BP vs BSA.
model.3 <- lm(BP ~ BSA, data=bloodpress)
summary(model.3)

Call:
lm(formula = BP ~ BSA, data = bloodpress)

Residuals:
   Min     1Q Median     3Q    Max 
-5.314 -1.963 -0.197  1.934  4.831 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   45.183      9.392   4.811  0.00014 ***
BSA           34.443      4.690   7.343 8.11e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.79 on 18 degrees of freedom
Multiple R-squared:  0.7497,    Adjusted R-squared:  0.7358 
F-statistic: 53.93 on 1 and 18 DF,  p-value: 8.114e-07
  1. Fit a multiple linear regression model of BP vs Weight + BSA. Is BSA still significant? Why?
model.4 <- lm(BP ~ Weight + BSA, data=bloodpress)
summary(model.4)

Call:
lm(formula = BP ~ Weight + BSA, data = bloodpress)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8932 -1.1961 -0.4061  1.0764  4.7524 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.6534     9.3925   0.602    0.555    
Weight        1.0387     0.1927   5.392 4.87e-05 ***
BSA           5.8313     6.0627   0.962    0.350    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.744 on 17 degrees of freedom
Multiple R-squared:  0.9077,    Adjusted R-squared:  0.8968 
F-statistic: 83.54 on 2 and 17 DF,  p-value: 1.607e-09
  1. Predict BP for Weight=92 and BSA=2 for the two simple linear regression models and the multiple linear regression model, by hand and by predict() function.
2.20531 + 1.20093 * 92
[1] 112.6909
predict(model.2,
        newdata=data.frame(Weight=92))
      1 
112.691 
45.183 + 34.443 * 2
[1] 114.069
predict(model.3,
        newdata=data.frame(BSA=2))
       1 
114.0689 
5.6534 + 1.0387 * 92 + 5.8313 * 2
[1] 112.8764
predict(model.4,
        newdata=data.frame(Weight=92, BSA=2))
       1 
112.8794 
  1. Fit a multiple linear regression model of BP vs Age + Weight. Set the arguments x and y to TRUE. Save the output of lm() as model.5. How do we interpret each estimated coefficient?
model.5 <- lm(BP ~ Age + Weight, data=bloodpress, x=TRUE, y=TRUE)
summary(model.5)

Call:
lm(formula = BP ~ Age + Weight, data = bloodpress, x = TRUE, 
    y = TRUE)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89968 -0.35242  0.06979  0.35528  0.82781 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16.57937    3.00746  -5.513 3.80e-05 ***
Age           0.70825    0.05351  13.235 2.22e-10 ***
Weight        1.03296    0.03116  33.154  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5327 on 17 degrees of freedom
Multiple R-squared:  0.9914,    Adjusted R-squared:  0.9904 
F-statistic: 978.2 on 2 and 17 DF,  p-value: < 2.2e-16
  1. Use the plot_ly function in the plotly package to create a 3D scatterplot of the data with the fitted plane for a multiple linear regression model of BP vs Age + Weight.
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:MASS':

    select
The following objects are masked from 'package:plyr':

    arrange, mutate, rename, summarise
The following object is masked from 'package:reshape':

    rename
The following object is masked from 'package:mosaic':

    do
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
plot_ly(x=bloodpress$Age, y=bloodpress$Weight, z=bloodpress$BP, type='scatter3d', mode='markers', color=bloodpress$BP)
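The call above only draws the points; one possible way to overlay the fitted plane from model.5 is to predict BP on a grid of Age and Weight values and add it as a surface trace (the grid size here is arbitrary):

age.grid <- seq(min(bloodpress$Age), max(bloodpress$Age), length.out = 20)
weight.grid <- seq(min(bloodpress$Weight), max(bloodpress$Weight), length.out = 20)
bp.grid <- outer(age.grid, weight.grid,
                 function(a, w) predict(model.5, newdata = data.frame(Age = a, Weight = w)))
plot_ly(x = bloodpress$Age, y = bloodpress$Weight, z = bloodpress$BP,
        type = 'scatter3d', mode = 'markers') %>%
  add_surface(x = age.grid, y = weight.grid, z = t(bp.grid), showscale = FALSE)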
  1. Extract the matrix x and y of model.5 and assign it to a new object X and y. Remember, if you save the output of lm() as an object, this object contains many elements. After we set x=TRUE and y=TRUE in question 8, we can find x and y in this list.
X <- model.5$x
y <- model.5$y
  1. Calculate \(X^{T}X\), \(X^{T}y\), \((X^{T}X)^{-1}\), and \((X^{T}X)^{-1}X^{T}y\). Use t() for the transpose, %*% for matrix multiplication, and solve() for the inverse of a matrix. For the last one, is your result the same as the estimated values you obtained in question 7? Of course!
t(X) %*% X
            (Intercept)     Age   Weight
(Intercept)        20.0   972.0   1861.8
Age               972.0 47358.0  90566.6
Weight           1861.8 90566.6 173665.4
t(X) %*% y
                [,1]
(Intercept)   2280.0
Age         110978.0
Weight      212666.1
solve(t(X) %*% X)
            (Intercept)          Age       Weight
(Intercept)  31.8748075 -0.267669593 -0.202127676
Age          -0.2676696  0.010092130 -0.002393468
Weight       -0.2021277 -0.002393468  0.003420885
solve(t(X) %*% X) %*% (t(X) %*% y)
                   [,1]
(Intercept) -16.5793694
Age           0.7082515
Weight        1.0329611
  1. Use the anova() function to display the ANOVA table with sequential (type I) sums of squares for model.5.

\[
SS_{\text{Variable}} = \sum_{i=1}^n (\hat{y}_{\text{Variable}, i} - \bar{y})^2
\]

where \(\hat{y}_{\text{Variable}, i}\) is the \(i\)-th fitted value from the model including only that variable. (With sequential, type I, sums of squares, each later variable's SS is the extra variation explained after the variables listed before it.)

\[
F = \frac{MS_{\text{Variable}}}{MS_{\text{Residuals}}}
\]

If \(F \approx 1\): the sizes of \(MS_{\text{Variable}}\) and \(MS_{\text{Residuals}}\) are approximately the same, suggesting that the explanatory power of the independent variable for the dependent variable is comparable to the random error, and the null hypothesis \(H_0\) cannot be rejected.

If \(F \gg 1\): \(MS_{\text{Variable}}\) is significantly greater than \(MS_{\text{Residuals}}\), suggesting that the independent variable has a significant influence on the dependent variable, and \(H_0\) can be rejected.

anova(model.5)
Analysis of Variance Table

Response: BP
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Age        1 243.266 243.266  857.29 5.481e-16 ***
Weight     1 311.910 311.910 1099.20 < 2.2e-16 ***
Residuals 17   4.824   0.284                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
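As a quick check, each F value in this table is simply that row's mean square divided by the residual mean square:

tab <- anova(model.5)
tab[c("Age", "Weight"), "Mean Sq"] / tab["Residuals", "Mean Sq"]  # about 857.3 and 1099.2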
# remark
sum((model.5$y-mean(model.5$y))^2) == 243.266+311.910+4.824
[1] TRUE

\[
SS_{\text{Total}} = SS_{\text{Age}} + SS_{\text{Weight}} + SS_{\text{Residuals}}
\]

  1. Use the residuals element of the fitted model or the residuals() function to extract the residuals. Calculate the sum of squares of these residual values. Extract the df.residual element of the fitted model and use these quantities to calculate the MSE. Is your result the same as the anova() output?
sum((model.5$residuals)^2)/model.5$df.residual
[1] 0.2837604
  1. Fit a multiple linear regression model of BP vs Age + Weight + Pulse. Save the output of lm() as model.6.
model.6 <- lm(BP ~ Age + Weight + Pulse, data=bloodpress)
summary(model.6)

Call:
lm(formula = BP ~ Age + Weight + Pulse, data = bloodpress)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.71174 -0.45422 -0.01909  0.41745  0.88743 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16.69000    2.93761  -5.681 3.40e-05 ***
Age           0.75018    0.06074  12.350 1.36e-09 ***
Weight        1.06135    0.03695  28.722 3.40e-15 ***
Pulse        -0.06566    0.04852  -1.353    0.195    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5201 on 16 degrees of freedom
Multiple R-squared:  0.9923,    Adjusted R-squared:  0.9908 
F-statistic: 684.7 on 3 and 16 DF,  p-value: < 2.2e-16
  1. Use the anova() function to obtain the ANOVA table for model.6. We may consider model.6 the full model and model.5 the reduced model in this question. Based on the obtained ANOVA table and the output of question 11, calculate the F-statistic for testing the reduced model by hand. You may use the Residuals Sum Sq and the corresponding Residuals Df from both tables. Then calculate the p-value using the pf() function; don't forget about lower.tail.
anova(model.6)
Analysis of Variance Table

Response: BP
          Df  Sum Sq Mean Sq   F value    Pr(>F)    
Age        1 243.266 243.266  899.2446 1.726e-15 ***
Weight     1 311.910 311.910 1152.9909 2.433e-16 ***
Pulse      1   0.496   0.496    1.8319    0.1947    
Residuals 16   4.328   0.271                        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fstat <- (4.824-4.328)/(17-16) / (4.328/16)
Fstat
[1] 1.833641
pf(Fstat, 1, 16, lower.tail = F)
[1] 0.1945157
  1. Use the anova() function to do the F-test on model.5 and model.6. Compare the output with your answers to question 14. What is the conclusion of the F-test?

\[
\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2
\]

\[
\text{Sum of Sq} = \text{RSS(Model 1)} - \text{RSS(Model 2)}
\]

(It represents the additional variation in the dependent variable explained by the newly added variable Pulse.)

\[ F = \frac{\text{Sum of Sq} / \text{Df}}{\text{RSS(Model 2)} / \text{Res.Df(Model 2)}} \\F = \frac{0.49557}{0.270525} \approx 1.8319\]

anova(model.5, model.6)
Analysis of Variance Table

Model 1: BP ~ Age + Weight
Model 2: BP ~ Age + Weight + Pulse
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     17 4.8239                           
2     16 4.3284  1   0.49557 1.8319 0.1947
  1. Plot the qqPlot for residuals of model.5. What is the x-axis and y-axis of the qqPlot? What can we say about the qqPlot?
library(car)
qqPlot(model.5$residuals)

[1] 19 10
  1. Plot the residual vs fitted plot of model.5. You may extract fitted.values from model.5 and use it as x in plot().
plot(x=model.5$fitted.values, y=model.5$residuals)

  1. Directly use plot() function on model.5.
plot(model.5)

Part 2

  1. Load the hospital_infct.txt data and select observations with Stay <= 14.
infectionrisk <- read.table("/Users/luyu/Desktop/APH——course/NOTEsAPH101/hospital_infct.txt", header=T)
infectionrisk <- infectionrisk[infectionrisk$Stay<=14,]
infectionrisk
     ID  Stay  Age InfctRsk Culture  Xray Beds MedSchool Region Census Nurses
1     1  7.13 55.7      4.1     9.0  39.6  279         2      4    207    241
2     2  8.82 58.2      1.6     3.8  51.7   80         2      2     51     52
3     3  8.34 56.9      2.7     8.1  74.0  107         2      3     82     54
4     4  8.95 53.7      5.6    18.9 122.8  147         2      4     53    148
5     5 11.20 56.5      5.7    34.5  88.9  180         2      1    134    151
6     6  9.76 50.9      5.1    21.9  97.0  150         2      2    147    106
7     7  9.68 57.8      4.6    16.7  79.0  186         2      3    151    129
8     8 11.18 45.7      5.4    60.5  85.8  640         1      2    399    360
9     9  8.67 48.2      4.3    24.4  90.8  182         2      3    130    118
10   10  8.84 56.3      6.3    29.6  82.6   85         2      1     59     66
11   11 11.07 53.2      4.9    28.5 122.0  768         1      1    591    656
12   12  8.30 57.2      4.3     6.8  83.8  167         2      3    105     59
13   13 12.78 56.8      7.7    46.0 116.9  322         1      1    252    349
14   14  7.58 56.7      3.7    20.8  88.0   97         2      2     59     79
15   15  9.00 56.3      4.2    14.6  76.4   72         2      3     61     38
16   16 11.08 50.2      5.5    18.6  63.6  387         2      3    326    405
17   17  8.28 48.1      4.5    26.0 101.8  108         2      4     84     73
18   18 11.62 53.9      6.4    25.5  99.2  133         2      1    113    101
19   19  9.06 52.8      4.2     6.9  75.9  134         2      2    103    125
20   20  9.35 53.8      4.1    15.9  80.9  833         2      3    547    519
21   21  7.53 42.0      4.2    23.1  98.9   95         2      4     47     49
22   22 10.24 49.0      4.8    36.3 112.6  195         2      2    163    170
23   23  9.78 52.3      5.0    17.6  95.9  270         1      1    240    198
24   24  9.84 62.2      4.8    12.0  82.3  600         2      3    468    497
25   25  9.20 52.2      4.0    17.5  71.1  298         1      4    244    236
26   26  8.28 49.5      3.9    12.0 113.1  546         1      2    413    436
27   27  9.31 47.2      4.5    30.2 101.3  170         2      1    124    173
28   28  8.19 52.1      3.2    10.8  59.2  176         2      1    156     88
29   29 11.65 54.5      4.4    18.6  96.1  248         2      1    217    189
30   30  9.89 50.5      4.9    17.7 103.6  167         2      2    113    106
31   31 11.03 49.9      5.0    19.7 102.1  318         2      1    270    335
32   32  9.84 53.0      5.2    17.7  72.6  210         2      2    200    239
33   33 11.77 54.1      5.3    17.3  56.0  196         2      1    164    165
34   34 13.59 54.0      6.1    24.2 111.7  312         2      1    258    169
35   35  9.74 54.4      6.3    11.4  76.1  221         2      2    170    172
36   36 10.33 55.8      5.0    21.2 104.3  266         2      1    181    149
37   37  9.97 58.2      2.8    16.5  76.5   90         2      2     69     42
38   38  7.84 49.1      4.6     7.1  87.9   60         2      3     50     45
39   39 10.47 53.2      4.1     5.7  69.1  196         2      2    168    153
40   40  8.16 60.9      1.3     1.9  58.0   73         2      3     49     21
41   41  8.48 51.1      3.7    12.1  92.8  166         2      3    145    118
42   42 10.72 53.8      4.7    23.2  94.1  113         2      3     90    107
43   43 11.20 45.0      3.0     7.0  78.9  130         2      3     95     56
44   44 10.12 51.7      5.6    14.9  79.1  362         1      3    313    264
45   45  8.37 50.7      5.5    15.1  84.8  115         2      2     96     88
46   46 10.16 54.2      4.6     8.4  51.5  831         1      4    581    629
48   48 10.90 57.2      5.5    10.6  71.9  593         2      2    446    211
49   49  7.67 51.7      1.8     2.5  40.4  106         2      3     93     35
50   50  8.88 51.5      4.2    10.1  86.9  305         2      3    238    197
51   51 11.48 57.6      5.6    20.3  82.0  252         2      1    207    251
52   52  9.23 51.6      4.3    11.6  42.6  620         2      2    413    420
53   53 11.41 61.1      7.6    16.6  97.9  535         2      3    330    273
54   54 12.07 43.7      7.8    52.4 105.3  157         2      2    115     76
55   55  8.63 54.0      3.1     8.4  56.2   76         2      1     39     44
56   56 11.15 56.5      3.9     7.7  73.9  281         2      1    217    199
57   57  7.14 59.0      3.7     2.6  75.8   70         2      4     37     35
58   58  7.65 47.1      4.3    16.4  65.7  318         2      4    265    314
59   59 10.73 50.6      3.9    19.3 101.0  445         1      2    374    345
60   60 11.46 56.9      4.5    15.6  97.7  191         2      3    153    132
61   61 10.42 58.0      3.4     8.0  59.0  119         2      1     67     64
62   62 11.18 51.0      5.7    18.8  55.9  595         1      2    546    392
63   63  7.93 64.1      5.4     7.5  98.1   68         2      4     42     49
64   64  9.66 52.1      4.4     9.9  98.3   83         2      2     66     95
65   65  7.78 45.5      5.0    20.9  71.6  489         2      3    391    329
66   66  9.42 50.6      4.3    24.8  62.8  508         2      1    421    528
67   67 10.02 49.5      4.4     8.3  93.0  265         2      2    191    202
68   68  8.58 55.0      3.7     7.4  95.9  304         2      3    248    218
69   69  9.61 52.4      4.5     6.9  87.2  487         2      3    404    220
70   70  8.03 54.2      3.5    24.3  87.3   97         2      1     65     55
71   71  7.39 51.0      4.2    14.6  88.4   72         2      2     38     67
72   72  7.08 52.0      2.0    12.3  56.4   87         2      3     52     57
73   73  9.53 51.5      5.2    15.0  65.7  298         2      3    241    193
74   74 10.05 52.0      4.5    36.7  87.5  184         1      1    144    151
75   75  8.45 38.8      3.4    12.9  85.0  235         2      2    143    124
76   76  6.70 48.6      4.5    13.0  80.8   76         2      4     51     79
77   77  8.90 49.7      2.9    12.7  86.9   52         2      1     37     35
78   78 10.23 53.2      4.9     9.9  77.9  752         1      2    595    446
79   79  8.88 55.8      4.4    14.1  76.8  237         2      2    165    182
80   80 10.30 59.6      5.1    27.8  88.9  175         2      2    113     73
81   81 10.79 44.2      2.9     2.6  56.6  461         1      2    320    196
82   82  7.94 49.5      3.5     6.2  92.3  195         2      2    139    116
83   83  7.63 52.1      5.5    11.6  61.1  197         2      4    109    110
84   84  8.77 54.5      4.7     5.2  47.0  143         2      4     85     87
85   85  8.09 56.9      1.7     7.6  56.9   92         2      3     61     61
86   86  9.05 51.2      4.1    20.5  79.8  195         2      3    127    112
87   87  7.91 52.8      2.9    11.9  79.5  477         2      3    349    188
88   88 10.39 54.6      4.3    14.0  88.3  353         2      2    223    200
89   89  9.36 54.1      4.8    18.3  90.6  165         2      1    127    158
90   90 11.41 50.4      5.8    23.8  73.0  424         1      3    359    335
91   91  8.86 51.3      2.9     9.5  87.5  100         2      3     65     53
92   92  8.93 56.0      2.0     6.2  72.5   95         2      3     59     56
93   93  8.92 53.9      1.3     2.2  79.5   56         2      2     40     14
94   94  8.15 54.9      5.3    12.3  79.8   99         2      4     55     71
95   95  9.77 50.2      5.3    15.7  89.7  154         2      2    123    148
96   96  8.54 56.1      2.5    27.0  82.5   98         2      1     57     75
97   97  8.66 52.8      3.8     6.8  69.5  246         2      3    178    177
98   98 12.01 52.8      4.8    10.8  96.9  298         2      1    237    115
99   99  7.95 51.8      2.3     4.6  54.9  163         2      3    128     93
100 100 10.15 51.9      6.2    16.4  59.2  568         1      3    452    371
101 101  9.76 53.2      2.6     6.9  80.1   64         2      4     47     55
102 102  9.89 45.2      4.3    11.8 108.7  190         2      1    141    112
103 103  7.14 57.6      2.7    13.1  92.6   92         2      4     40     50
104 104 13.95 65.9      6.6    15.6 133.5  356         2      1    308    182
105 105  9.44 52.5      4.5    10.9  58.5  297         2      3    230    263
106 106 10.80 63.9      2.9     1.6  57.4  130         2      3     69     62
107 107  7.14 51.7      1.4     4.1  45.7  115         2      3     90     19
108 108  8.02 55.0      2.1     3.8  46.5   91         2      2     44     32
109 109 11.80 53.8      5.7     9.1 116.9  571         1      2    441    469
110 110  9.50 49.3      5.8    42.0  70.9   98         2      3     68     46
111 111  7.70 56.9      4.4    12.2  67.9  129         2      4     85    136
113 113  9.41 59.5      3.1    20.6  91.7   29         2      3     20     22
    Facilities
1         60.0
2         40.0
3         20.0
4         40.0
5         40.0
6         40.0
7         40.0
8         60.0
9         40.0
10        40.0
11        80.0
12        40.0
13        57.1
14        37.1
15        17.1
16        57.1
17        37.1
18        37.1
19        37.1
20        77.1
21        17.1
22        37.1
23        57.1
24        57.1
25        57.1
26        57.1
27        37.1
28        37.1
29        37.1
30        37.1
31        57.1
32        54.3
33        34.3
34        54.3
35        54.3
36        54.3
37        34.3
38        34.3
39        54.3
40        14.3
41        34.3
42        34.3
43        34.3
44        54.3
45        34.3
46        74.3
48        51.4
49        11.4
50        51.4
51        51.4
52        71.4
53        51.4
54        31.4
55        31.4
56        51.4
57        31.4
58        51.4
59        51.4
60        31.4
61        31.4
62        68.6
63        28.6
64        28.6
65        48.6
66        48.6
67        48.6
68        48.6
69        48.6
70        28.6
71        28.6
72        28.6
73        48.6
74        68.6
75        48.6
76        28.6
77        28.6
78        68.6
79        48.6
80        45.7
81        65.7
82        45.7
83        45.7
84        25.7
85        45.7
86        45.7
87        65.7
88        65.7
89        45.7
90        45.7
91        25.7
92        25.7
93         5.7
94        25.7
95        25.7
96        45.7
97        45.7
98        45.7
99        42.9
100       62.9
101       22.9
102       42.9
103       22.9
104       62.9
105       42.9
106       22.9
107       22.9
108       22.9
109       62.9
110       22.9
111       62.9
113       22.9
  1. Create new dummy/indicator columns (i1, i2, i3, i4) for regions using ifelse() function. For example, i1 = 1 when Region = 1 and i1 = 0 when Region is not equal to 1; i2 = 1 when Region = 2 and i2 = 0 when Region is not equal to 2; …
infectionrisk$i1 <- ifelse(infectionrisk$Region == 1, 1, 0)
infectionrisk$i2 <- ifelse(infectionrisk$Region == 2, 1, 0)
infectionrisk$i3 <- ifelse(infectionrisk$Region == 3, 1, 0)
infectionrisk$i4 <- ifelse(infectionrisk$Region == 4, 1, 0)
  1. Fit a multiple linear regression model of InfctRsk on Stay + Xray + i2 + i3 + i4.
model.7 <- lm(InfctRsk ~ Stay + Xray + i2 + i3 + i4, data=infectionrisk)
summary(model.7)

Call:
lm(formula = InfctRsk ~ Stay + Xray + i2 + i3 + i4, data = infectionrisk)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.66492 -0.65420  0.04265  0.64034  2.51391 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.134259   0.877347  -2.433  0.01668 *  
Stay         0.505394   0.081455   6.205 1.11e-08 ***
Xray         0.017587   0.005649   3.113  0.00238 ** 
i2           0.171284   0.281475   0.609  0.54416    
i3           0.095461   0.288852   0.330  0.74169    
i4           1.057835   0.378077   2.798  0.00612 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.036 on 105 degrees of freedom
Multiple R-squared:  0.4198,    Adjusted R-squared:  0.3922 
F-statistic: 15.19 on 5 and 105 DF,  p-value: 3.243e-11
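An equivalent sketch that lets lm() build the dummies from a factor; Region 1 becomes the reference level, so the coefficients match model.7:

model.7b <- lm(InfctRsk ~ Stay + Xray + factor(Region), data=infectionrisk)
summary(model.7b)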
  1. Can we include i1 + i2 + i3 + i4 in this multiple linear regression? Why?

No. The four indicators always satisfy i1 + i2 + i3 + i4 = 1, which duplicates the intercept column, so the design matrix is rank-deficient (the "dummy variable trap"). When using dummy variables for categorical data in regression analysis, it's essential to designate a reference category: its effect is absorbed into the intercept (coefficient of zero), and the coefficients for the other categories are interpreted as deviations from this reference point. This is why lm() below reports one coefficient as NA ("not defined because of singularities").
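A quick check of the collinearity:

# each hospital is in exactly one region, so the four indicators always sum to 1,
# duplicating the intercept column of the design matrix
with(infectionrisk, all(i1 + i2 + i3 + i4 == 1))  # TRUE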

model.8 <- lm(InfctRsk ~ Stay + Xray + i1 + i2 + i3 + i4, data=infectionrisk)
summary(model.8)

Call:
lm(formula = InfctRsk ~ Stay + Xray + i1 + i2 + i3 + i4, data = infectionrisk)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.66492 -0.65420  0.04265  0.64034  2.51391 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.076424   0.721361  -1.492  0.13864    
Stay         0.505394   0.081455   6.205 1.11e-08 ***
Xray         0.017587   0.005649   3.113  0.00238 ** 
i1          -1.057835   0.378077  -2.798  0.00612 ** 
i2          -0.886551   0.339887  -2.608  0.01042 *  
i3          -0.962374   0.323365  -2.976  0.00362 ** 
i4                 NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.036 on 105 degrees of freedom
Multiple R-squared:  0.4198,    Adjusted R-squared:  0.3922 
F-statistic: 15.19 on 5 and 105 DF,  p-value: 3.243e-11
  1. Conduct an F-test (use the anova() function) to see whether at least one of i2, i3, and i4 is useful.
model.9 <- lm(InfctRsk ~ Stay + Xray, data=infectionrisk)
anova(model.7, model.9)
Analysis of Variance Table

Model 1: InfctRsk ~ Stay + Xray + i2 + i3 + i4
Model 2: InfctRsk ~ Stay + Xray
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    105 112.71                              
2    108 123.56 -3   -10.849 3.3687 0.02135 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Confusion Matrix

  • Total population \(= \mathrm{P} + \mathrm{N}\); columns: Predicted Positive (PP) and Predicted Negative (PN).
  • Positive (P): True positive (TP, hit) vs. False negative (FN, miss, underestimation). True positive rate (TPR, recall, sensitivity (SEN), probability of detection, hit rate, power) \(= \frac{\mathrm{TP}}{\mathrm{P}} = 1 - \mathrm{FNR}\); False negative rate (FNR, miss rate, type II error) \(= \frac{\mathrm{FN}}{\mathrm{P}} = 1 - \mathrm{TPR}\).
  • Negative (N): False positive (FP, false alarm, overestimation) vs. True negative (TN, correct rejection). False positive rate (FPR, probability of false alarm, fall-out, type I error) \(= \frac{\mathrm{FP}}{\mathrm{N}} = 1 - \mathrm{TNR}\); True negative rate (TNR, specificity (SPC), selectivity) \(= \frac{\mathrm{TN}}{\mathrm{N}} = 1 - \mathrm{FPR}\).
  • Informedness, bookmaker informedness (BM) \(= \mathrm{TPR} + \mathrm{TNR} - 1\); Prevalence threshold (PT) \(= \frac{\sqrt{\mathrm{TPR} \times \mathrm{FPR}} - \mathrm{FPR}}{\mathrm{TPR} - \mathrm{FPR}}\).
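A tiny sketch with made-up counts showing how the rates are computed from a 2x2 table (rows = actual, columns = predicted):

CM <- matrix(c(50, 10,    # actual 0: TN, FP
                5, 35),   # actual 1: FN, TP
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
c(TPR = CM["1", "1"] / sum(CM["1", ]),   # TP / P  (sensitivity, recall)
  FPR = CM["0", "1"] / sum(CM["0", ]))   # FP / N  (1 - specificity)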

example

Lung Cancer Classification (https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer?select=survey+lung+cancer.csv)

An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data were collected from an online lung cancer prediction system website.

Total no. of attributes: 16. No. of instances: 284.

Attribute information:

  • Gender: M (male), F (female)
  • Age: age of the patient
  • Smoking, Yellow fingers, Anxiety, Peer_pressure, Chronic Disease, Fatigue, Allergy, Wheezing, Alcohol, Coughing, Shortness of Breath, Swallowing Difficulty, Chest pain: YES=2, NO=1
  • Lung Cancer: YES, NO

Goal: It is your job to classify Lung Cancer using other variables

Example Code

#Load the dataset 
library(readr)
data = read_csv('/Users/luyu/Desktop/APH——course/survey_lung_cancer.csv', show_col_types = FALSE)
data$LUNG_CANCER <- ifelse(data$LUNG_CANCER=="YES", 1, 0)
summary(data)
    GENDER               AGE           SMOKING      YELLOW_FINGERS
 Length:309         Min.   :21.00   Min.   :1.000   Min.   :1.00  
 Class :character   1st Qu.:57.00   1st Qu.:1.000   1st Qu.:1.00  
 Mode  :character   Median :62.00   Median :2.000   Median :2.00  
                    Mean   :62.67   Mean   :1.563   Mean   :1.57  
                    3rd Qu.:69.00   3rd Qu.:2.000   3rd Qu.:2.00  
                    Max.   :87.00   Max.   :2.000   Max.   :2.00  
    ANXIETY      PEER_PRESSURE   CHRONIC DISEASE    FATIGUE     
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
 Median :1.000   Median :2.000   Median :2.000   Median :2.000  
 Mean   :1.498   Mean   :1.502   Mean   :1.505   Mean   :1.673  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
 Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
    ALLERGY         WHEEZING     ALCOHOL CONSUMING    COUGHING    
 Min.   :1.000   Min.   :1.000   Min.   :1.000     Min.   :1.000  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000     1st Qu.:1.000  
 Median :2.000   Median :2.000   Median :2.000     Median :2.000  
 Mean   :1.557   Mean   :1.557   Mean   :1.557     Mean   :1.579  
 3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000     3rd Qu.:2.000  
 Max.   :2.000   Max.   :2.000   Max.   :2.000     Max.   :2.000  
 SHORTNESS OF BREATH SWALLOWING DIFFICULTY   CHEST PAIN     LUNG_CANCER    
 Min.   :1.000       Min.   :1.000         Min.   :1.000   Min.   :0.0000  
 1st Qu.:1.000       1st Qu.:1.000         1st Qu.:1.000   1st Qu.:1.0000  
 Median :2.000       Median :1.000         Median :2.000   Median :1.0000  
 Mean   :1.641       Mean   :1.469         Mean   :1.557   Mean   :0.8738  
 3rd Qu.:2.000       3rd Qu.:2.000         3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :2.000       Max.   :2.000         Max.   :2.000   Max.   :1.0000  
library(ggplot2)
ggplot(data, aes(x = factor(SMOKING), fill = factor(LUNG_CANCER))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "salmon")) +
  theme_minimal()
### Data SAMPLING ####
library(caret)

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
The following object is masked from 'package:mosaic':

    dotPlot
The following object is masked from 'package:survival':

    cluster
set.seed(101)
split = createDataPartition(data$LUNG_CANCER, p = 0.80, list = FALSE)
train_data = data[split,]
test_data = data[-split,]
nrow(train_data)
[1] 248
nrow(test_data)
[1] 61
#error metrics -- Confusion Matrix
err_metric=function(CM)
{
  # CM is assumed to be table(actual, predicted) with classes 0 and 1
  TN =CM[1,1]   # true negatives:  actual 0, predicted 0
  TP =CM[2,2]   # true positives:  actual 1, predicted 1
  FP =CM[1,2]   # false positives: actual 0, predicted 1
  FN =CM[2,1]   # false negatives: actual 1, predicted 0
  precision =(TP)/(TP+FP)
  recall_score =(TP)/(TP+FN)
  f1_score=2*((precision*recall_score)/(precision+recall_score))
  accuracy_model  =(TP+TN)/(TP+TN+FP+FN)
  False_positive_rate =(FP)/(FP+TN)
  False_negative_rate =(FN)/(FN+TP)
  print(paste("Precision value of the model: ",round(precision,2)))
  print(paste("Accuracy of the model: ",round(accuracy_model,2)))
  print(paste("Recall value of the model: ",round(recall_score,2)))
  print(paste("False Positive rate of the model: ",round(False_positive_rate,2)))
  print(paste("False Negative rate of the model: ",round(False_negative_rate,2)))
  print(paste("F1 score of the model: ",round(f1_score,2)))
}
# Logistic regression
logit_m =glm(formula = LUNG_CANCER ~ ., data = train_data, family = 'binomial')
summary(logit_m)

Call:
glm(formula = LUNG_CANCER ~ ., family = "binomial", data = train_data)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -32.13449    6.57393  -4.888 1.02e-06 ***
GENDERM                  -0.84716    0.82793  -1.023 0.306203    
AGE                       0.01537    0.03490   0.440 0.659756    
SMOKING                   1.10495    0.82246   1.343 0.179119    
YELLOW_FINGERS            1.10971    0.82005   1.353 0.175982    
ANXIETY                   2.08645    1.08610   1.921 0.054725 .  
PEER_PRESSURE             1.94159    0.74009   2.623 0.008704 ** 
`CHRONIC DISEASE`         3.91378    1.12190   3.489 0.000486 ***
FATIGUE                   2.62906    0.89032   2.953 0.003148 ** 
ALLERGY                   1.42702    0.83764   1.704 0.088454 .  
WHEEZING                  1.07933    0.90761   1.189 0.234357    
`ALCOHOL CONSUMING`       2.41116    0.98495   2.448 0.014365 *  
COUGHING                  3.14783    1.22386   2.572 0.010110 *  
`SHORTNESS OF BREATH`    -0.22681    0.84245  -0.269 0.787757    
`SWALLOWING DIFFICULTY`   2.24613    1.24347   1.806 0.070865 .  
`CHEST PAIN`              0.89356    0.73930   1.209 0.226791    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.230  on 247  degrees of freedom
Residual deviance:  77.353  on 232  degrees of freedom
AIC: 109.35

Number of Fisher Scoring iterations: 8
# Logistic regression
logit_m2 =glm(formula = LUNG_CANCER ~ ANXIETY+PEER_PRESSURE+`CHRONIC DISEASE`+FATIGUE+ALLERGY+`ALCOHOL CONSUMING`+COUGHING+`SWALLOWING DIFFICULTY`, data = train_data, family = 'binomial')
summary(logit_m2)

Call:
glm(formula = LUNG_CANCER ~ ANXIETY + PEER_PRESSURE + `CHRONIC DISEASE` + 
    FATIGUE + ALLERGY + `ALCOHOL CONSUMING` + COUGHING + `SWALLOWING DIFFICULTY`, 
    family = "binomial", data = train_data)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -27.3830     5.5206  -4.960 7.05e-07 ***
ANXIETY                   2.5514     0.8659   2.947 0.003213 ** 
PEER_PRESSURE             2.1822     0.7245   3.012 0.002593 ** 
`CHRONIC DISEASE`         3.5120     0.9696   3.622 0.000292 ***
FATIGUE                   2.4939     0.6979   3.573 0.000353 ***
ALLERGY                   2.0104     0.7428   2.707 0.006796 ** 
`ALCOHOL CONSUMING`       2.5084     0.8653   2.899 0.003744 ** 
COUGHING                  3.2220     0.9540   3.377 0.000732 ***
`SWALLOWING DIFFICULTY`   2.5273     1.0096   2.503 0.012309 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 198.230  on 247  degrees of freedom
Residual deviance:  83.501  on 239  degrees of freedom
AIC: 101.5

Number of Fisher Scoring iterations: 8
library(dplyr)
logit_P_prob = predict(logit_m, newdata = select(test_data, -LUNG_CANCER), type = 'response')
logit_P_prob[1:3]
        1         2         3 
0.9912600 0.9997708 0.9980692 
logit_P <- ifelse(logit_P_prob > 0.5, 1, 0) # Probability check
logit_P[1:3]
1 2 3 
1 1 1 
CM = table(test_data$LUNG_CANCER, logit_P)
print(CM)
   logit_P
     0  1
  0  3  2
  1  1 55
err_metric(CM)
[1] "Precision value of the model:  0.96"
[1] "Accuracy of the model:  0.95"
[1] "Recall value of the model:  0.98"
[1] "False Positive rate of the model:  0.4"
[1] "False Negative rate of the model:  0.02"
[1] "F1 score of the model:  0.97"
#ROC-curve using pROC library
library(pROC)
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following object is masked from 'package:colorspace':

    coords
The following objects are masked from 'package:mosaic':

    cov, var
The following objects are masked from 'package:stats':

    cov, smooth, var
roc_score=roc(test_data$LUNG_CANCER, logit_P_prob) #AUC score
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_score, main = "ROC curve -- Logistic Regression")
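The reduced model logit_m2 was fitted above but never evaluated on the test set. Below is a minimal sketch of doing so, reusing the 0.5 threshold and the err_metric() helper defined earlier (the object names logit_P_prob2, logit_P2, CM2, roc_score2 are new names introduced here; if the model predicted only one class on this small test set, table() would drop a column and err_metric() would need adjusting):

# Evaluate the reduced model (logit_m2) on the same test set
logit_P_prob2 = predict(logit_m2, newdata = select(test_data, -LUNG_CANCER), type = 'response')
logit_P2 <- ifelse(logit_P_prob2 > 0.5, 1, 0)   # classify with the same 0.5 threshold
CM2 = table(test_data$LUNG_CANCER, logit_P2)    # rows: actual, columns: predicted
err_metric(CM2)
roc_score2 = roc(test_data$LUNG_CANCER, logit_P_prob2)
plot(roc_score2, main = "ROC curve -- Reduced Logistic Regression")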

Below are further remarks.

margin of error

For a given level of confidence c, the margin of error is the greatest possible distance between the point estimate and the value of the parameter it is estimating.

Sometimes called the maximum error of estimate or error tolerance.

Why n-1 in sampling variance

  • degree of freedom

  • unbiased estimation (meaning that, averaged over all possible samples, the estimate equals the parameter; only then can we call it "unbiased")

  • sampling proportion is unbiased

\(E(\hat p) = E(X/n) = np/n = p\) (using the binomial distribution). Also for the variance: what if the requirement on the expectation is satisfied but the one on the variance is not?

Simple random sample

We know the total population but do not know p of our interests.

Based on a hypergeometric probability calculation we can estimate p as \(\hat p = \text{number of interest}/n = (X_1+\cdots+X_n)/n\), where \(X_i \sim\) Bernoulli(p) and \(\hat p\) is a random variable. Therefore, \(E[\hat p] = p\) (since \(E[X_i]=p\) for a Bernoulli distribution).

  • If \(X_i\) are independent then E(\(X_1+X_2\)) = 2E(\(X_1\))

Here, e.g. \(E[\hat p^2] = \frac{1}{n}E[X_1^2]+\frac{n-1}{n}E[X_1 X_2]\).

Since \(X_1\) is 0 or 1, \(X_1^2=X_1\),

and \(E[X_1 X_2] = P(X_1=1, X_2=1)\).

Finally, we can look at the distribution of \(\hat p\). Suppose we know the true p is 0.54; we can use simulation to randomly sample \(X_1,\ldots,X_n\) from a population with Np people who support (coded 1) and N(1-p) people who do not (coded 0).

  • God’s perspective
library(ggplot2)

set.seed(111)

population_size <- 12141897

p <- 0.54

num_simulations <- 500

sample_size <- 1000

p_hat_values <- replicate(num_simulations, {
  # Simulate sampling without replacement from the population:
  # rep() builds a vector with population_size * p ones and population_size * (1-p) zeros,
  # and replicate() repeats the sampling num_simulations times, storing each result in a vector.
  sample <- sample(c(rep(1, population_size * p), rep(0, population_size * (1 - p))), sample_size, replace = FALSE)
  mean(sample)  # calculate the p_hat for each sample
})


  
histogram <- ggplot(data.frame(p = p_hat_values), aes(x=p))+
  geom_histogram(binwidth = 0.01, fill ="blue", color = "black")+
  labs(title =" ",
       x = "p_hat",
       y= "Frequency")+
  theme_minimal()

print(histogram)

Doing this 500 times, the histogram of \(\hat p\) is approximately normal, as expected from the central limit theorem.
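As a further check on the simulation, the spread of the 500 simulated \(\hat p\) values can be compared with the theoretical standard deviation \(\sqrt{\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}}\) for sampling without replacement; a small sketch reusing the objects defined in the simulation code above:

# Compare the simulated spread of p_hat with the theoretical standard deviation
sd(p_hat_values)   # empirical standard deviation across the 500 simulations
sqrt(p * (1 - p) / sample_size *
       (population_size - sample_size) / (population_size - 1))   # theoretical value, about 0.0158 here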

Statistical inference and Probability distribution

  • estimation

  • confidence intervals

  • hypothesis testing

Sample Space and Probability Measure

A \(\sigma\)-algebra \(\mathcal F\) is a collection of subsets of the sample space satisfying:

it contains the full set and the null (empty) set

it is closed under complements

it is closed under countable unions

Measurable: we can find a function that takes the elements of \(\mathcal F\) and outputs a real number.

A probability measure is a mapping P: \(\mathcal F\) –> \(\mathbb R\) satisfying the following 3 axioms: \(P(A)\geq 0\) for every event A, \(P(S)=1\), and

countable additivity: for mutually exclusive events \(A_1, A_2,...\in \mathcal F\), \(P\left(\bigcup_i A_i\right)=\sum_i P(A_i)\)

interpret the probability

  • frequentist view

  • Bayesian view

A Borel set is a set that can be built from open sets (of some space) by countable unions, countable intersections, and complements.

library(ggplot2)
library(dplyr)
library(readr)
library(magrittr)

circ <- read_csv("Charm_City_Circulator_Ridership.csv")
## take just average ridership per day
avg = circ %>% 
  filter(type == "Average")
# keep non-missing data
avg = avg %>% 
  filter(!is.na(number))

Moment Generating Functions

\(M_X (t)=E(e^{tX})\)

case 1

\(M_X\) may not exist. When it exists in a neighborhood of 0, we can use a Taylor expansion:

\[e^{tx}=1+tX+(tX)^2/2+...\]

\[M_X(t)=1+t\mu_1+\frac{t^2 \mu_2}{2!}+\cdots\]

\(\mu_j = E(X^j)\) is the j-th moment of X. Therefore,

\[E(X^j)=M_X^{(j)}(0)\]

e.g. we can also get the variance by taking the second-order derivative: \(\operatorname{Var}(X)=M_X''(0)-(M_X'(0))^2\).

case 2

For a continuous random variable, \(M_X(t)=\int e^{tx} f_X(x)\,\mathrm{d}x\).

eg. normal

  • X~N(0,1)

Idea: rearrange the integrand so that it contains the pdf of some known distribution, whose integral is 1. Here the integrand is rearranged into the pdf of N(t, 1), and what is left over gives \(M_Z(t)=e^{t^2/2}\).

  • X~N(\(\mu, \sigma^2\))

Then X=\(\mu+\sigma Z\) where Z~N(0,1)

\[M_x(t)=E[e^{tx}]=e^{\mu t} E[e^{\sigma t Z}]=e^{\mu t} M_Z(\sigma t)=e^{\mu t +\sigma^2 t^2/2}\]
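A quick Monte Carlo sanity check of this formula for the standard normal case (the choice of t and the number of draws below are arbitrary):

set.seed(1)
t <- 0.7
z <- rnorm(1e6)        # draws from N(0, 1)
mean(exp(t * z))       # Monte Carlo estimate of E[e^{tZ}]
exp(t^2 / 2)           # theoretical value M_Z(t) = e^{t^2/2}, about 1.278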

Gamma distribution

The family of gamma distributions generalizes the family of exponential distributions. The gamma distribution has shape r and rate \(\lambda\).

Addition rule: for independent gamma random variables with the same rate \(\lambda\), the shape parameters add,

Gamma\((r_1, \lambda)\) + Gamma\((r_2, \lambda)\) \(\sim\) Gamma\((r_1+r_2, \lambda)\),

while \(\lambda\) stays the same.

Why is the MGF useful? To determine whether two random variables have the same CDF, and to prove addition properties of distributions.

The MGF encodes all the characteristics of a distribution; from it we can recover the pdf, cdf, expectation, and variance.

  • Thm If X and Y are random variables with the same MGF, which is finite on [-t, t ] for some t >0 then X and Y have the same distribution

A gamma distribution with shape r =1 is an exponential distribution

A more general function than MGF is the characteristic function.

\(\phi_X (t) = E(e^{itX})\)

Normal –> Chi-squared distribution

MGF

The \(\chi^2_1\) distribution has the same MGF as the Gamma(1/2, 1/2) distribution, so \(\chi^2_1\) = Gamma(1/2, 1/2).

The sum of squares of independent standard normal random variables follows a \(\chi^2\) distribution (with degrees of freedom equal to the number of terms).

The addition rule: for independent chi-squared random variables, the degrees of freedom add.

Consider \((n-1)S^2/\sigma^2\) for a random sample of size n from a normal population with variance \(\sigma^2\): it follows a \(\chi^2_{n-1}\) distribution.
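A simulation sketch of this last fact (the sample size, mean, and standard deviation below are arbitrary choices):

set.seed(2)
n <- 10; mu <- 5; sigma <- 2
stat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (n - 1) * var(x) / sigma^2    # the pivotal quantity (n-1)S^2 / sigma^2
})
mean(stat)   # close to the chi-squared mean, n - 1 = 9
var(stat)    # close to the chi-squared variance, 2(n - 1) = 18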

Preface

The first section here is the knowledge that most caught my interest in this module, followed by the lecture notes I typed along the way.

  • Motivation

The history of the development of statistics has helped me gain a deeper understanding of this fascinating and practical field of knowledge. Each advancement represents a leap from specific facts to broader, generalized conclusions. What truly captivates me is the remarkable alignment between natural phenomena and statistical principles; there always seems to be a coincidence where reality and theory intersect.

Initially, I only grasped the surface of this knowledge. Then, as I delved deeper, I began to understand the underlying principles through observable phenomena. Eventually, almost miraculously, I realized how incredibly useful these theories are and how they coincide with natural occurrences. This process of moving closer and closer to generalization has profoundly illustrated to me the truth in the saying: “Mathematics is the art of giving the same name to different things.”

Theorems come, theorems go. Only examples are lying forever. (Practice to use statistics to interpret real world examples)

Put the formula derivations in a separate section (later).

error?

  • Independent = Disjoint? NO!

    Independent vs Disjoint: If P(E) > 0 and P(F) > 0, then E and F can NOT be both independent and disjoint.

    Subsets are dependent. If E ⊂ F and neither P(E) = 0 nor P(F) = 1, then E and F are dependent.

    Complements are dependent. If neither P(E) = 0 nor P(E) = 1, then E and \(E^C\) are dependent.

My reading during this course:

Ten Great Ideas About Chance (Chinese edition, CITIC Press), Diaconis

Psychological Statistics

The lady tasting tea

Probability Theory and Mathematical Statistics (3rd edition), by Mao Shisong et al.

Probability….

bivariate data

linear regression

background

  • Regression to the mean (Galton’s thinking)

It is not reliable to predict values outside the range of our data sample (extrapolation).

Residual plot should have no pattern

Across the whole range, the residuals should not show any trend or specific shape.

Positive and negative residuals should be roughly evenly balanced.

# Data
x <- c(50, 55, 50, 79, 44, 37, 70, 45, 49)  # Rock surface area
y <- c(152, 48, 22, 35, 38, 171, 13, 185, 25)  # Algae colony density

# (a) Compute the least-squares regression equation
model <- lm(y ~ x)
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
-65.53 -63.91 -14.47  46.99  84.39 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  232.258     92.390   2.514   0.0402 *
x             -2.926      1.690  -1.731   0.1271  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.32 on 7 degrees of freedom
Multiple R-squared:  0.2998,    Adjusted R-squared:  0.1997 
F-statistic: 2.997 on 1 and 7 DF,  p-value: 0.1271
# Extract the regression coefficients
intercept <- coef(model)[1]
slope <- coef(model)[2]
cat("Least-squares regression equation: y =", intercept, "+", slope, "* x\n")
Least-squares regression equation: y = 232.2575 + -2.925507 * x
# (b) Compute the R^2 value and interpret it
r_squared <- summary(model)$r.squared
cat("R^2 value:", r_squared, "\n")
R^2 value: 0.2997552 
cat("Interpretation: R^2 means that", round(r_squared * 100, 2), "% of the variation in y can be explained by x.\n")
Interpretation: R^2 means that 29.98 % of the variation in y can be explained by x.
# (c) Compute the residual standard error s_e
se <- summary(model)$sigma
cat("Residual standard error s_e:", se, "\n")
Residual standard error s_e: 63.31527 
cat("Interpretation: s_e is the typical size of the regression model's prediction error; the smaller it is, the more precise the predictions.\n")
Interpretation: s_e is the typical size of the regression model's prediction error; the smaller it is, the more precise the predictions.
# (d) Determine the direction and strength of the linear relationship
correlation <- cor(x, y)
cat("Correlation coefficient r:", correlation, "\n")
Correlation coefficient r: -0.547499 
if (correlation > 0) {
  direction <- "positive"
} else {
  direction <- "negative"
}

if (abs(correlation) > 0.7) {
  strength <- "strong"
} else if (abs(correlation) > 0.3) {
  strength <- "moderate"
} else {
  strength <- "weak"
}

cat("Linear relationship:", direction, "and", strength, "\n")
Linear relationship: negative and moderate 
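To connect this fit with the earlier remark that a residual plot should show no pattern, a short sketch that plots the residuals of model against x (reusing the objects defined above):

# Residual plot: look for the absence of any trend or shape across the range of x
plot(x, resid(model),
     xlab = "Rock surface area", ylab = "Residual",
     main = "Residuals vs. x")
abline(h = 0, lty = 2)   # residuals should scatter roughly evenly around this line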
# Data
quality_rating <- c(111, 113, 93, 130, 170, 87, 83, 117, 135, 109)
satisfaction_rating <- c(832, 845, 794, 854, 836, 842, 877, 745, 797, 795)

# Compute the correlation coefficient
correlation_coefficient <- cor(quality_rating, satisfaction_rating)
print(paste("Correlation coefficient r:", correlation_coefficient))
[1] "Correlation coefficient r: -0.115403519735578"
# Draw the scatterplot
plot(quality_rating, satisfaction_rating,
     main = "Scatterplot of Quality Rating vs. Satisfaction Rating",
     xlab = "Quality Rating",
     ylab = "Satisfaction Rating",
     pch = 19, col = "blue")
abline(lm(satisfaction_rating ~ quality_rating), col = "red")  # add the regression line

skew

Data skewed to the left (negatively skewed) have a longer left tail, and the mean and median are to the left of the mode.

Data skewed to the right (positively skewed) have a longer right tail, and the mean and median are to the right of the mode.

The mean is much more sensitive than the median to extreme values (it is not a resistant measure of center), so outliers drag the mean toward the tail, i.e. in the direction of the skew……

Compare \(Q_2-Q_1\) and \(Q_3-Q_2\) to decide left/right skew or symmetry.

ps: the trimmed mean is more resistant.

As long as we know the mean and standard deviation of the data, we can bound the proportion of the data falling within a given number of standard deviations of the mean, without knowing the specific shape of the distribution.
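This is presumably a reference to Chebyshev's inequality, which makes the statement precise: for any distribution with mean \(\mu\) and standard deviation \(\sigma\),

\[
P(|X-\mu| \geq k\sigma) \leq \frac{1}{k^2}, \qquad k>0,
\]

so at least \(1-1/k^2\) of the data lie within \(k\) standard deviations of the mean, whatever the shape of the distribution (for example, at least 75% within 2 standard deviations).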

outlier (eg of normal distribution)

a mild outlier if it lies more than 1.5(iqr) away from the nearest quartile (the nearest end of the box);

an extreme outlier if it lies more than 3(iqr) away from the nearest quartile.

(These definitions and distances are based on the hypothetical Normal distribution (bell shaped, symmetric, normal tails). When there is a reason to suspect that the distribution is skewed, the bounds should be changed.)

quartile

\(Q_1\)(First quartile): at least 25% of the sorted values are less than or equal to \(Q_1\) and at least 75% of the values are greater than or equal to \(Q_1\)

\(Q_3\)(Third quartile): at least 75% of the sorted values are less than or equal to \(Q_3\) and at least 25% of the values are greater than or equal to \(Q_3\)

modified boxplot

A modified boxplot is a box plot where the whiskers only extend to the largest (or smallest) observation that is not an outlier and the outliers are plotted using a full circle (mild) or empty circle (extreme).

If there are no outliers, then the whiskers end at the maximum (or minimum).

# Data
ratios <- c(0.553, 0.570, 0.576, 0.601, 0.606, 0.606, 0.609, 0.611, 
            0.615, 0.628, 0.654, 0.662, 0.668, 0.670, 0.672, 0.690, 
            0.693, 0.749, 0.844, 0.933)

# Compute the quartiles, the IQR, and the limits for mild and extreme outliers
Q1 <- quantile(ratios, 0.25)
Q3 <- quantile(ratios, 0.75)
median_val <- median(ratios)
iqr <- Q3 - Q1
mild_outlier_limit <- 1.5 * iqr
extreme_outlier_limit <- 3 * iqr

lower_mild <- Q1 - mild_outlier_limit
lower_extreme <- Q1 - extreme_outlier_limit
upper_mild <- Q3 + mild_outlier_limit
upper_extreme <- Q3 + extreme_outlier_limit

# Identify mild and extreme outliers
mild_outliers <- ratios[ratios > upper_mild & ratios <= upper_extreme | ratios < lower_mild & ratios >= lower_extreme]
extreme_outliers <- ratios[ratios > upper_extreme | ratios < lower_extreme]

# Draw the boxplot and mark mild and extreme outliers (x position 1 is where the single box is drawn)
boxplot(ratios, main = "Modified Boxplot of Width-to-Length Ratios", ylim = c(0.3, 1))
points(rep(1, length(mild_outliers)), mild_outliers, pch = 16, col = "blue")    # filled circles: mild outliers
points(rep(1, length(extreme_outliers)), extreme_outliers, pch = 1, col = "red") # open circles: extreme outliers

# Print the quartiles, median, and IQR

# First quartile (Q1): with the data sorted in increasing order, the value at or below which about 25% of the data fall.
# Third quartile (Q3): with the data sorted in increasing order, the value at or below which about 75% of the data fall.

cat("Q1:", Q1, "\n")
Q1: 0.606 
cat("Median:", median_val, "\n")
Median: 0.641 
cat("Q3:", Q3, "\n")
Q3: 0.6765 
cat("IQR:", iqr, "\n")
IQR: 0.0705 
cat("Lower mild outlier limit:", lower_mild, "\n")
Lower mild outlier limit: 0.50025 
cat("Lower extreme outlier limit:", lower_extreme, "\n")
Lower extreme outlier limit: 0.3945 
cat("Upper mild outlier limit:", upper_mild, "\n")
Upper mild outlier limit: 0.78225 
cat("Upper extreme outlier limit:", upper_extreme, "\n")
Upper extreme outlier limit: 0.888 
# Load the ggplot2 package
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)

# Data
ratios <- c(0.553, 0.570, 0.576, 0.601, 0.606, 0.606, 0.609, 0.611, 
            0.615, 0.628, 0.654, 0.662, 0.668, 0.670, 0.672, 0.690, 
            0.693, 0.749, 0.844, 0.933)

# Compute the quartiles, the IQR, and the limits for mild and extreme outliers
Q1 <- quantile(ratios, 0.25)
Q3 <- quantile(ratios, 0.75)
median_val <- median(ratios)
iqr <- Q3 - Q1
mild_outlier_limit <- 1.5 * iqr
extreme_outlier_limit <- 3 * iqr

lower_mild <- Q1 - mild_outlier_limit
lower_extreme <- Q1 - extreme_outlier_limit
upper_mild <- Q3 + mild_outlier_limit
upper_extreme <- Q3 + extreme_outlier_limit

# Identify mild and extreme outliers
mild_outliers <- ratios[ratios > upper_mild & ratios <= upper_extreme | ratios < lower_mild & ratios >= lower_extreme]
extreme_outliers <- ratios[ratios > upper_extreme | ratios < lower_extreme]

# Create a data frame
data <- data.frame(ratios = ratios)
data$outlier_type <- ifelse(data$ratios %in% mild_outliers, "Mild Outlier",
                            ifelse(data$ratios %in% extreme_outliers, "Extreme Outlier", "Normal"))

# Plot with ggplot2
ggplot(data, aes(x = "", y = ratios)) +
  geom_boxplot(outlier.shape = NA, fill = "lightblue") +  # hide the default outlier points
  geom_point(data = subset(data, outlier_type == "Mild Outlier"), aes(y = ratios), color = "blue", size = 3, shape = 16) + # mild outliers: filled circles
  geom_point(data = subset(data, outlier_type == "Extreme Outlier"), aes(y = ratios), color = "red", size = 3, shape = 1) + # extreme outliers: open circles
  labs(title = "Modified Boxplot of Width-to-Length Ratios", y = "Width-to-Length Ratios") +
  theme_minimal() +
  theme(axis.title.x = element_blank()) + # remove the x-axis label
  coord_cartesian(ylim = c(0.3, 1))      # set the y-axis range

Measures of relative standing

z Scores

e.g. height: compare a height among women or among men (two different populations) using z scores, instead of just comparing the raw heights themselves.
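A small illustrative sketch of this kind of comparison; the means and standard deviations below are made-up values used only for illustration, not data from the course:

# Hypothetical example: compare a 175 cm woman with a 185 cm man using z scores
z_woman <- (175 - 162) / 6   # assumed female mean 162 cm, sd 6 cm
z_man   <- (185 - 176) / 7   # assumed male mean 176 cm, sd 7 cm
z_woman   # about 2.17 standard deviations above the female mean
z_man     # about 1.29 standard deviations above the male mean
# Relative to their own populations, the woman is the more unusually tall, even though 175 < 185.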

tutorial

A dotplot has no vertical axis in R (vs. a scatter plot).

price of a textbook is discrete

zip code is categorical

dotplot and scatter plot

do not manipulate data—experimental

table 2.1

11

for huge data, shuffling the order is ok

but ..

the likelihood is totally different

The role of statistics and the Data Analysis Process

Intro

Statistics is a large field of mathematics involving the collection, organization, analysis, interpretation, and presentation of data (a collection of observations on one or more variables, a variable being a characteristic whose value may change from one observation to another).

Statistics is the scientific discipline that provides methods to help us make sense of data.

It is important to be able to:

1 Extract information from tables, charts, and graphs.

2 Follow numerical arguments.

3 Understand the basics of how data should be gathered, summarized, and analysed to draw statistical conclusions.

The Data Analysis Process

1 Understanding the nature of the research problem or goals.

2 Deciding what to measure and how.

3 Collecting data.

4 Data summarization and preliminary analysis.

5 Formal Data Analysis (Statistical Methods).

6 Interpretation of the results.

populations and samples

population: The entire collection of individuals or objects about which information is desired

sample: A sample is a subset of the population, selected for study.

then select the sample

Then we can summarize it using the two branches of statistics: descriptive statistics (methods for organizing and summarizing data) or inferential statistics (generalizing from a sample, i.e. incomplete information, to the population from which the sample was selected, and assessing the reliability of such generalizations; so we run a risk, and an important aspect of statistical inference is quantifying the chance of reaching an incorrect conclusion).

descriptive stat

inferential stat

sample

Types of data

univariate, bivariate, and multivariate data sets

categorical and numerical (discrete and continuous) data, with plots made in Excel (Data Analysis) or RStudio (ggplot2)

For categorical data we can use a bar chart, which is a graph of a frequency distribution for categorical data.

For a small numerical data set we can use a dotplot.

  • discrete
library(ggplot2)

# create data: WeChat counts
discrete_data <- data.frame(value = c(30, 15, 20,30,60))

# plot
ggplot(discrete_data, aes(x = value)) +
  geom_dotplot(binwidth = 1, dotsize = 1) +
  ggtitle("Dot Plot of Discrete Data (Number of Wechats)") +
  theme_minimal()

  • continuous
all_athletes <- c(79, 79, 86, 85, 95, 78, 89, 84, 81, 85, 89, 89, 85, 85, 81, 80, 98, 84, 
                  80, 82, 81, 70, 85, 87, 83, 86, 92, 85, 93, 94, 76, 69, 82, 80, 94, 98)
basketball <- c(55, 36, 83, 20, 100, 62, 100, 100, 90, 91, 93, 89, 90, 80, 46, 75, 100, 71, 
                50, 62, 82, 50, 100, 83, 90, 64, 91, 67, 83, 100, 83, 100, 83, 63, 91, 95)

# Set up the plotting canvas so that both dotplots fit on the same page
plot.new()
plot.window(xlim = c(0, 100), ylim = c(0.5, 2.5))

# Draw the dotplot for the basketball data
stripchart(basketball, method = "stack", at = 2, pch = 16, col = "orange", 
           add = TRUE, offset = 0.5, cex = 1.2)

# Draw the dotplot for all athletes
stripchart(all_athletes, method = "stack", at = 1, pch = 16, col = "orange", 
           add = TRUE, offset = 0.5, cex = 1.2)

# Add the x axis
axis(1, at = seq(10, 100, by = 10), labels = seq(10, 100, by = 10))

# Add labels
text(-5, 2, "Basketball", xpd = TRUE, adj = 1)
text(-5, 1, "All Athletes", xpd = TRUE, adj = 1)

# Add a horizontal separator line
abline(h = 1.5, col = "black", lwd = 2)

# Add the x-axis label
title(xlab = "Graduation rates (%)")

# create data: time spent in minutes
continuous_data <- data.frame(value = c(6, 5.25, 3.62,1,2,3.1,3.2,4,5,6,7,4,10))

# dotplot
ggplot(continuous_data, aes(x = value)) +
  geom_dotplot(binwidth = 0.1, dotsize = 1) +
  ggtitle("Dot Plot of Continuous Data (Time Spent in Minutes)") +
  theme_minimal()

library(ggplot2)

# Graduation-rate data
school <- 33:68
all_athletes <- c(79, 79, 86, 85, 95, 78, 89, 84, 81, 85, 89, 89, 85, 85, 81, 80, 98, 84, 
                  80, 82, 81, 70, 85, 87, 83, 86, 92, 85, 93, 94, 76, 69, 82, 80, 94, 98)
basketball <- c(55, 36, 83, 20, 100, 62, 100, 100, 90, 91, 93, 89, 90, 80, 46, 75, 100, 71, 
                50, 62, 82, 50, 100, 83, 90, 64, 91, 67, 83, 100, 83, 100, 83, 63, 91, 95)

# Create a data frame
data <- data.frame(school, all_athletes, basketball)

# Plot
ggplot() +
  geom_dotplot(data = data, aes(x = all_athletes, y = "All Athletes"), binaxis = 'x', stackdir = 'up', dotsize = 0.5) +
  geom_dotplot(data = data, aes(x = basketball, y = "Basketball"), binaxis = 'x', stackdir = 'up', dotsize = 0.5, color = "red") +
  xlab("Graduation rates (%)") +
  ylab("") +
  theme_minimal() +
  ggtitle("Dotplot of Graduation Rates for All Athletes and Basketball Players")
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

histogram excel plot

collect data sensibly

chapter 2

Two types of studies: Observational studies and Experiments.

Observational

A study in which the investigator observes characteristics of a sample selected from one or more existing populations. The goal is to draw conclusions about the corresponding population or about differences between two or more populations.

In an observational study, it is impossible to draw clear cause-and-effect conclusions

Experiments

A study in which the investigator observes how a response variable behaves when one or more explanatory variables, also called factors, are manipulated.

A well-designed experiment can result in data that provide evidence for a cause-and-effect relationship.

  • Experimental conditions: Any particular combination of values for the explanatory variables, which are also called treatments.

comparison

  • Both observational studies and experiments can be used to compare groups, but in an experiment the researcher controls who is in which group, whereas this is not the case in an observational study.

  • In an observational study, it is impossible to draw clear cause-and-effect conclusions.

confounding vars

A variable that is related to both how the experimental groups were formed and the response variable of interest.

  • Two methods for data collection: Sampling and Experimentation.

  • distinguish between selection bias, measurement or response bias, and non-response bias.

  • select a simple random sample from a given population.

  • distinguish between simple random sampling, stratified random sampling, cluster sampling, systematic sampling, and convenience sampling

variable

response variable (y)

The response variable is the focus of a question in a study or experiment.

explanatory variable (x)

An explanatory variable is one that explains changes in the response variable.

experiments and obeservational study

bias

selection bias:When the way the sample is selected systematically excludes some part of the population of interest.

measurement or response bias

eg: survey question/scale(The scale or a machine used for measurements is not calibrated properly)

Non-response Bias:When responses are not obtained from all individuals selected for inclusion in the sample.

Non-response bias can distort results if those who respond differ in important ways from those who do not respond (e.g. laziness acting as a confounding variable).

random sampling

def: A sample that is selected from a population in a way that ensures that every different possible sample of size n has the same chance of being selected.

the same chance to be selected

counter eg:

Consider 100 students in a classroom, 60 females and 40 males. If we randomly sample 6 females, and 4 males, then each female has a 6/60 = 0.1 chance of being selected. Same for males, 4/40=0.1. However, not every group of 10 students is equally likely to be selected. This is not simple random sampling

The random selection process allows us to be confident that the sample adequately reflects the population, even when the sample consists of only a small fraction of the population.

eg.Voting Sample Size in a country

stratified and cluster

  • stratified random sampling: In stratified random sampling, separate simple random samples are independently selected from each subgroup. Each subgroup is called a stratum.

In general, it is much easier to produce relatively accurate estimates of characteristics of a homogeneous group than of a heterogeneous group.

stratified: according to certain characteristic

eg.Even with a small sample, it is possible to obtain an accurate estimate of the average grade point average (GPA) of students graduating with high honours from a university (Similar high grades, homogenous, thus only sample a few students). On the other hand, producing a reasonably accurate estimate of the average GPA of all seniors at the university, a much more diverse group of GPAs, is a more difficult task. Not only does this ensure that students at each GPA level are represented, it also allows for a more accurate estimate of the overall average GPA.

  • Each cluster should reflect the general characteristics of the entire population.

cluster: randomly groups

Cluster sampling involves dividing the population of interest into non-overlapping subgroups, called clusters. Clusters are then selected at random, and then all individuals in the selected clusters are included in the sample.

systematic sampling

A value k is specified (e.g. k = 50 or k = 200). Then one of the first k individuals is selected at random, after which every k-th individual in the sequence is included in the sample. A sample selected in this way is called a 1 in k systematic sample.

In the case of large samples, it can ensure that the sample is evenly distributed in the population.
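A minimal sketch of drawing a 1 in k systematic sample in R (the population size and k below are arbitrary choices):

set.seed(3)
N <- 1000                      # assumed population size
k <- 50                        # take every 50th individual
start <- sample(1:k, 1)        # random starting point among the first k individuals
systematic_sample <- seq(start, N, by = k)
length(systematic_sample)      # 20 individuals are selected
head(systematic_sample)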

Random Variable

Random Variable (R.V.)

A numerical variable whose value depends on the outcome of a chance experiment. A random variable associates a numerical value with each outcome of a chance experiment. (Think of it as a rule that translates each result of a chance event into a number.)

In short: random variables convert random events into numbers.

A real-valued random variable X is a function: X : S → \(\mathbb R\), where S is the sample space of a chance experiment.

A random variable X: S–> R is continuous if its set of possible values includes an entire interval on the number line (a measurement), so its values cannot be counted.

A random variable X: S–> R is discrete if its set of possible values is a collection of isolated points along the number line (a count).

eg

Examples: Coin Tossing (Discrete Random Variable) If we flip a coin 5 times, let X be the number of heads we get. Possible values of X {0,1,2,3,4,5} (where 0 means no heads, and 5 means all heads). Here, X turns each outcome of multiple coin tosses into a count of heads.

Departure Time (Continuous Random Variable) Imagine tracking when people leave a subway station between 10 PM and 11 PM. Let Y represent the time (in hours) someone leaves, so Y can be any number from 10 to 11. Here, Y assigns each departure time to a point in the range [10,11].

sample space of multivariables

G: gender; Y: year of study, each corresponding to a certain student in the sample space S.

(G, Y) : S → \(\mathbb R^2\), with the year taking values in {1, 2, 3, 4}.

Probability Mass Function and Cumulative Distribution Function for Discrete Random Variables

Probability Mass Function (PMF): \(p_X (x) := P(X = x), \forall x\)

Cumulative Distribution Function (CDF): \(F_X (x) := P(X \leq x), \forall x\)

# Define the probability of each outcome and the corresponding value of X
outcomes <- c("GGGG", "EGGG", "GEGG", "GGEG", "GGGE", 
              "EEGG", "EGEG", "EGGE", "GEEG", "GEGE", 
              "GGEE", "GEEE", "EEEG", "EEGE", "EEEE")
probabilities <- c(0.1296, 0.0864, 0.0864, 0.0864, 0.0864, 
                   0.0576, 0.0576, 0.0576, 0.0576, 0.0576, 
                   0.0384, 0.0384, 0.0384, 0.0384, 0.0256)
X_values <- c(0, 1, 1, 1, 1, 
              2, 2, 2, 2, 2,
              3, 3, 3, 3, 4)

# Compute the PMF for each value of X by grouping and summing
pmf <- tapply(probabilities, X_values, sum)

# The distinct possible values of X
X_values_unique <- sort(unique(X_values))

# Compute the CDF
cdf <- cumsum(pmf)

# Make sure the CDF equals 1 for x > 4
cdf <- c(cdf, 1)

# Extend the X values to include the case x > 4
X_values_unique <- c(X_values_unique, ">4")

# Create a data frame showing the PMF and CDF
table <- data.frame(
  X = X_values_unique,
  `PMF P(X=x)` = c(pmf, 1-0.1296-0.3456-0.2880-0.1536-0.0256),  # the remaining probability mass is assigned to x > 4
  `CDF F(X<=x)` = cdf
)

# Print the table
print(table)
   X PMF.P.X.x. CDF.F.X..x.
0  0     0.1296      0.1296
1  1     0.3456      0.4752
2  2     0.2880      0.7632
3  3     0.1536      0.9168
4  4     0.0256      0.9424
  >4     0.0576      1.0000
  • Note that the domain of a cdf is (−∞, ∞)

\(\mathrm{pmf} \Longrightarrow \mathrm{cdf}\) \[ F_x(x)=\sum_{y \leq x} p_x(y) \] cdf \(\Longrightarrow\) pmf Suppose \(X\) takes ordered values \(x_1, x_2, x_3, \cdots\), then \[ \begin{aligned} p_X\left(x_i\right) & =P\left(X=x_i\right)=P\left(x_{i-1}<X \leq x_i\right) \\ & =P\left(X \leq x_i\right)-P\left(X \leq x_{i-1}\right) \\ & =F\left(x_i\right)-F\left(x_{i-1}\right) \end{aligned} \]

  • remark: The probability of a discrete distribution varies depending on the inclusion and exclusion of the boundary values.

Expectation and Variance for Discrete Random Variables

\[ \frac{0 \cdot f_0+1 \cdot f_1+2 \cdot f_2+\cdots+n \cdot f_n}{N}=\frac{1}{N} \sum_{i=0}^n i \cdot f_i \]

Note that in \(\frac{1}{N} \sum_{i=0}^n i \cdot f_i\), \[ \lim _{N \rightarrow \infty} \frac{f_i}{N}=P(X=i) \]

So the average number will be \[ \sum_{i=0}^n i \cdot P(X=i) \]

  • Definition: Expectation Given a discrete random variable \(X\), the expectation of \(X\) is \[ E[X]=\sum_x x \cdot p_X(x) \]

  • Properties of Expectation

  • If \(c\) is a constant, then \(E[c]=c\).

  • If \(X \geq 0\) then \(E[X] \geq 0\).

  • If \(a \leq X \leq b\) then \(a \leq E[X] \leq b\).

  • Proof of 3: First show \(E[X] \geq a\), then show \(E[X] \leq b\), \[ \begin{aligned} E[X] & =\sum_x x p_X(x) \geq \sum_x a p_X(x), \\ & =a \sum_x p_x(x)=a . \end{aligned} \]

Similarly, \(E[X] \leq b\).

  • Suppose \(X\) is a discrete random variable and \(Y=g(X)\), then \[ \begin{aligned} E[Y] & =\sum_y y p_Y(y)=\sum_y y P(Y=y) \\ & =\sum_y y \sum_{\{x: g(x)=y\}} P(X=x) \\ & =\sum_y \sum_{\{x: g(x)=y\}} y P(X=x) \\ & =\sum_y \sum_{\{x: g(x)=y\}} g(x) P(X=x) \\ & =\sum_x g(x) P(X=x) \end{aligned} \]

  • First moment of \(X\) (mean): \[ E[X]=\sum_x x p_X(x) . \]

  • Second moment of \(X\) : \[ E\left[X^2\right]=\sum_x x^2 p_x(x) . \]

  • In general, \(E[g(X)] \neq g(E[X])\). For example, let \(g(x)=x^2\), and consider \(X\) such that \[ p_X(x)= \begin{cases}0.5, & \text { for } x=-1 \\ 0.5, & \text { for } x=1\end{cases} \]

Then clearly \(E\left[X^2\right]=1 \neq 0=(E[X])^2\).

  • There are exceptions (e.g. when g is linear)!

  • Linearity of Expectation

E[aX + b] = aE[X] + b

proof:

Suppose \(g(x)=a x+b\). Then \[ \begin{aligned} E[g(X)] & =\sum_x g(x) p_X(x), \\ & =\sum_x(a x+b) p_x(x), \\ & =\sum_x a x p_x(x)+\sum_x b p_x(x), \\ & =a \sum_x x p_x(x)+b \sum_x p_x(x), \\ & =a E[X]+b=g(E[X]), \end{aligned} \] this implies \[ E[a X+b]=a E[X]+b \]

  • Remark: Apart from this case, always assume \(E[g(X)] \neq g(E[X])\).

joint distribution of X and Y

The joint probability mass function (i.e., joint pmf) of X and Y for discrete random variables is defined as \(p_{X,Y} (x, y) := P(X = x \text{ and } Y = y)\)

P[(X,Y) ∈ A] =\(\sum_{(x,y)\in A}p_{X,Y}(x,y)\),where A belongs to a subset of the \(\mathbb R^2\) where X and Y taking values.

Marginal Probability Mass Function

Let \(X\) and \(Y\) have the joint probability mass function \(p_{X, Y}(x, y)\) with space \(\mathcal{S}\). The probability mass function of \(X\) (or \(Y\) ) alone, is called the marginal probability mass function of \(X\) (or \(Y\) ) and defined by: \[ \begin{aligned} & p_X(x)=P(X=x)=\sum_y p_{X, Y}(x, y) \\ & p_Y(y)=P(Y=y)=\sum_x p_{X, Y}(x, y) \end{aligned} \]

For the marginal of X, we sum over all values of y.

For the marginal of Y, we sum over all values of x.

eg.–helping understand both sample space, event and Marginal probability mass function

Functions of multiple random variables

it has the same philosophy as one random variable things, such as expectation:

\[E[g(X,Y)]=\sum_x \sum_y g(x,y)p_{X,Y}(x,y)\], also the linearity of expectation

eg. Z = X + 2Y.

independence

  • P(\(A\cap B\))=P(A)P(B) or P(A|B)=P(A)

  • Two discrete random variables X and Y are independent if P(X = x and Y = y) = P(X = x)P(Y = y):

\(p_{X,Y}(x,y)=p_X(x)p_Y(y), \forall x,y\)

\(p_{X|Y}(x|y)=p_X(x), \forall x,y\)

it is easy to detect the dependent as long as we find one example, eg: \(P_{X|Y}(1|1)\) is not equal to \(P(X=1)\)

independence, expectations(mean) and variance

Independence and Expectations If \(X\) and \(Y\) are independent, then \[ E[X Y]=E[X] E[Y] . \]

Proof: We use \(E[g(X, Y)]\) where \(g(x, y)=x y\). \[ \begin{aligned} E[X Y] & =\sum_x \sum_y x y p_{X, Y}(x, y) \\ & =\sum_x \sum_y x y p_X(x) p_Y(y), \quad \text{(by independence)} \\ & =\left(\sum_x x p_X(x)\right)\left(\sum_y y p_Y(y)\right)=E[X] E[Y] . \end{aligned} \]

Similarly, if \(X\) and \(Y\) are independent, then \[ E[g(X) h(Y)]=E[g(X)] E[h(Y)] \]

  • Independence and Variances

It is always true that \[ \operatorname{Var}(a X)=a^2 \operatorname{Var}(X), \quad \text { and } \quad \operatorname{Var}(X+a)=\operatorname{Var}(X) \]

In general, when we have a sum of random variables \(X\) and \(Y\) \[ \operatorname{Var}(X+Y) \neq \operatorname{Var}(X)+\operatorname{Var}(Y) . \]

It is only true if \(X\) and \(Y\) are independent

  • Sum of Variance for independent R.V.

If two random variables \(X\) and \(Y\) are independent then \[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y) . \]

Proof: Independence implies \(E[X Y]=E[X] E[Y]\). Thus \[ \begin{aligned} & \operatorname{Var}(X+Y)=E\left[(X+Y-(E[X]+E[Y]))^2\right], \\ & =E\left[(X-E[X])^2+2(X-E[X])(Y-E[Y])+(Y-E[Y])^2\right], \\ & =\operatorname{Var}(X)+\operatorname{Var}(Y)+2 E[(X-E[X])(Y-E[Y])] \end{aligned} \]

As \(X\) is indep to \(Y\), then \(X-\mu_X\) is indep to \(Y-\mu_Y\) so \[ E[(X-E[X])(Y-E[Y])]=E[X-E[X]] E[Y-E[Y]]=0 \]

Example: Assume independence, \(\operatorname{Var}(3 X-5 Y)=\) \[ \operatorname{Var}(3 X)+\operatorname{Var}(-5 Y)=9 \operatorname{Var}(X)+25 \operatorname{Var}(Y) \]

  • Example. Independence, mean and variance

Let \(Y\) be the random variable denoting the total number of heads by tossing a coin \(n\) times. Find the mean and variance of \(Y\).

Let \[ Y_i= \begin{cases}1, & \text { if the } i^{\text {th }} \text { toss gets a head } \\ 0, & \text { otherwise. }\end{cases} \]

Then \(Y=Y_1+Y_2+\cdots+Y_n\) where \(Y_1, Y_2, \cdots, Y_n\) are independent. For any \(i=1,2, \cdots, n\), we have \[ \begin{array}{ccc} y & 0 & 1 \\ P_{Y_i}(y) & 1 / 2 & 1 / 2 \end{array} \]

As \(E\left[Y_i\right]=\frac{1}{2}\) and \(\operatorname{Var}\left(Y_i\right)=\frac{1}{4}\) for \(i=1,2, \cdots, n\) \[ \begin{gathered} E[Y]=E\left[Y_1\right]+E\left[Y_2\right]+\cdots+E\left[Y_n\right]=\frac{n}{2} \\ \operatorname{Var}(Y)=\operatorname{Var}\left(Y_1\right)+\operatorname{Var}\left(Y_2\right)+\cdots+\operatorname{Var}\left(Y_n\right)=\frac{n}{4} \end{gathered} \] (For dependent \(Y_i\) the result for the expectation still holds, but the variance formula does not, because variances add only under independence.)
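A quick simulation check of this example (the number of tosses n and the number of repetitions are arbitrary choices):

set.seed(4)
n <- 20
Y <- replicate(100000, sum(rbinom(n, size = 1, prob = 0.5)))   # total heads in n fair tosses
mean(Y)   # close to n/2 = 10
var(Y)    # close to n/4 = 5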

Continuous Random Variables

For continuous random variables we have to consider intervals rather than single points, because the probability of any single value equals 0 and carries no information.

eg. \(P(a\leq Y\leq b)=P(a<Y<b)\) if Y is a continuous variable.

Definition. Probability Density Function (pdf)

For a continuous random variable X, the probability density function (pdf) of X is a function f(x) such that P(X = x) = 0 for all x and, for any a ≤ b, \(P(a \leq X \leq b) = \int_a^b f(x)\,\mathrm{d}x\).

discrete and continuous

A density function's value can be greater than 1, because the probability is the integral of the density over an interval, and over a very tiny interval that integral remains small, so probabilities never exceed 1. (This is a reminder that the value of the pdf is not itself a probability; it is the integral of the pdf, whose total is 1, that gives probabilities.)
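A concrete illustration of this point, using a normal density with a small standard deviation (the choice sd = 0.1 is arbitrary):

dnorm(0, mean = 0, sd = 0.1)    # about 3.99: a density value well above 1
integrate(dnorm, -Inf, Inf, mean = 0, sd = 0.1)$value   # yet the total area under the density is still 1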

Covariance

wait for reviewing…..

Discrete distributions

A probability distribution is a graph, table, or formula that gives the probability for each value of the random variable.

A binomial random variable is composed of (a sum of) independent Bernoulli trials.

The hypergeometric distribution can be approximated by the binomial when the population is large relative to the sample.

The Poisson distribution is the limit of the binomial distribution when the number of trials is large and the probability of success of each trial is inversely proportional to the number of trials.

  • The Poisson distribution is a discrete probability distribution that applies to occurrences of some event over a specified interval. The random variable x is the number of occurrences of the event in an interval. The probability of the event occurring x times over an interval is given by \[P(x)=\frac{\mu^x\cdot e^{-\mu}}{x!}\] where the random variable x is the number of occurrences of an event over some interval, and the occurrences must be random and independent of each other.

Central limit theorem: when the Poisson parameter \(\lambda\) is large, the shape of the Poisson distribution approaches a normal distribution. This is because the central limit theorem says that the sum of a large number of independent random variables tends toward a normal distribution, and a Poisson random variable can be viewed as the number of successes in a large number of Bernoulli trials; when the number of trials is large enough, that sum can be approximated by a normal distribution.

???The occurrences must be uniformly distributed over the interval being used

The mean is \(\mu\)

The standard deviation is \(\sigma= \sqrt \mu\)

  • Examples of the Poisson distribution (describing the behavior of rare events, i.e. events with small probabilities): radioactive decay, arrivals of people in a line, eagles nesting in a region, patients arriving at an emergency room (if the local hospital is known to experience a mean of 2.3 patients arriving at the emergency room during 10-11 P.M. on Fridays, we can find the probability that, for a randomly selected Friday between 10 and 11 P.M., exactly four patients arrive; see the short computation below), Internet users logging onto a Web site.
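For the emergency-room example just mentioned, the probability of exactly four arrivals follows directly from the Poisson formula:

dpois(4, lambda = 2.3)             # P(X = 4) when the mean is 2.3, about 0.117
2.3^4 * exp(-2.3) / factorial(4)   # the same value from the formula above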

  • Comparison between Binomial:

The binomial distribution is affected by the sample size n and the probability p, whereas the Poisson distribution is affected only by the mean \(\mu\).

A binomial distribution has a finite set of possible values, but a Poisson distribution has possible values x with no upper bound.

Suppose \(X_n, n\geq1\) is a sequence of random variables such that \(X_n\)~Bin(\(n,p_n\)), where \(p_n \sim \lambda/n\) as \(n\to\infty\), i.e. \(\lim_{n\to\infty}(np_n)=\lambda\); intuitively, \(\lambda\) is the limiting mean of \(X_n\).

given the well-known limit \(\lim _{n \rightarrow \infty}\left(1-\frac{\lambda}{n}\right)^n=e^{-\lambda}\), and \[ \begin{aligned} & \frac{n}{n} \frac{n-1}{n} \cdots \frac{n-k+1}{n}=\prod_{i=0}^{k-1}\left(1-\frac{i}{n}\right), \qquad \lim _{n \rightarrow \infty} \prod_{i=0}^{k-1}\left(1-\frac{i}{n}\right)=1, \\ & P\left(X_n=k\right)=\binom{n}{k} p_n^k\left(1-p_n\right)^{n-k} \\ &= \frac{n \cdots(n-k+1)}{k!} p_n^k\left(1-p_n\right)^n\left(1-p_n\right)^{-k} \\ &= \frac{1}{k!}(\underbrace{n p_n}_{\rightarrow \lambda})^k \underbrace{\frac{n}{n} \frac{n-1}{n} \cdots \frac{n-k+1}{n}}_{\rightarrow 1} \underbrace{\left(1-p_n\right)^n}_{\rightarrow \mathrm{e}^{-\lambda}} \underbrace{\left(1-p_n\right)^{-k}}_{\rightarrow 1} \\ & \rightarrow \frac{\lambda^k}{k!} \mathrm{e}^{-\lambda} \quad \text{as } n \rightarrow \infty . \end{aligned} \]
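A small numerical sketch of this limit (the values of \(\lambda\), n, and k below are arbitrary choices):

lambda <- 3; n <- 10000; k <- 0:10
round(dbinom(k, size = n, prob = lambda / n), 5)   # Bin(n, lambda/n) probabilities
round(dpois(k, lambda = lambda), 5)                # Poisson(lambda) probabilities: nearly identical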

Check that \(p_X(k), k=0,1,2, \cdots\) defines a probability mass function: given the Taylor Series of \(e^\lambda\) around \(\lambda=0\) is given by \(f(x)=\) \(\sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} x^n\) and \(e^\lambda=\sum_{n=0}^{\infty} \frac{\lambda^n}{n!}\). Hence, \[ \begin{aligned} \sum_{k=0}^{\infty} p_X(k) & =\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \mathrm{e}^{-\lambda} \\ & =\mathrm{e}^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \\ & =\mathrm{e}^{-\lambda} \cdot \mathrm{e}^\lambda \\ & =1 . \end{aligned} \]

Geometric distribution (and the geometric series, a powerful and fantastic series)

  • The distribution of the number of trials until the first success

  • well-defined \(\begin{aligned} \sum_{k=1}^{\infty} p_X(k) & =\sum_{k=1}^{\infty}(1-p)^{k-1} p \\ & =p \sum_{k=0}^{\infty}(1-p)^k \\ & =p \cdot \frac{1}{1-(1-p)} \\ & =1 .\end{aligned}\)

  • Tail probability of the Geometric distribution

Let \(X \sim \operatorname{Geom}(p)\) then \[ \begin{aligned} P(X>n) & =P(X=n+1)+P(X=n+2)+P(X=n+3)+\cdots \\ & =(1-p)^n p+(1-p)^{n+1} p+(1-p)^{n+2} p+\cdots \\ & =(1-p)^n p\left(1+(1-p)+(1-p)^2+\cdots\right) \\ & =(1-p)^n p \frac{1}{1-(1-p)} \\ & =(1-p)^n \end{aligned} \] for any \(n=0,1,2, \ldots\)

  • Memoryless property of Geometric distribution

Suppose that \(X \sim \operatorname{Geom}(p)\) and \(n \in\{1,2,3, \cdots\}\). Then \[ P(X-n=k \mid X>n)=P(X=k) \quad, k=1,2,3, \cdots \]

That is, the distribution of \(X-n\) under the probability function \(P(\cdot \mid X>n)\) is the same as the distribution of \(X\)

the memoryless property is saying that given the first n trials are unsuccessful, the number of trials until success after the first n trials has the same distribution as the unconditional number of trials until success (independent)

  • Expectation

Suppose that \(X \sim \operatorname{Geom}(p)\). Then \[ \begin{aligned} E(X)=\sum_{k=1}^{\infty} k P(X=k) & =\sum_{k=1}^{\infty} k(1-p)^{k-1} p \\ & =p \sum_{k=1}^{\infty}\left[-\frac{\mathrm{d}}{\mathrm{~d} p}(1-p)^k\right] \\ & =-p \frac{\mathrm{~d}}{\mathrm{~d} p}\left[\sum_{k=0}^{\infty}(1-p)^k\right] \\ & =-p \frac{\mathrm{~d}}{\mathrm{~d} p}\left[\frac{1}{1-(1-p)}\right] \\ & =-p \frac{\mathrm{~d}}{\mathrm{~d} p}\left[\frac{1}{p}\right] \\ & =\frac{1}{p} \end{aligned} \]

Variance

Likewise, \[ \begin{aligned} E(X(X-1)) & =\sum_{k=1}^{\infty} k(k-1) P(X=k) \\ & =\sum_{k=2}^{\infty} k(k-1)(1-p)^{k-1} p \\ & =p(1-p) \sum_{k=2}^{\infty} k(k-1)(1-p)^{k-2} \\ & =p(1-p) \frac{\mathrm{d}^2}{\mathrm{~d} p^2}\left[\sum_{k=0}^{\infty}(1-p)^k\right] \\ & =p(1-p) \frac{\mathrm{d}^2}{\mathrm{~d} p^2} \frac{1}{p} \\ & =p(1-p) \cdot \frac{2}{p^3} \\ & =\frac{2(1-p)}{p^2} . \end{aligned} \] \[\begin{aligned} \operatorname{Var}(X) & =E(X(X-1))+E(X)-(E(X))^2 \\ & =\frac{2(1-p)}{p^2}+\frac{1}{p}-\frac{1}{p^2} \\ & =\frac{1-p}{p^2}\end{aligned}\]
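A simulation check of these two formulas (p is an arbitrary choice). Note that rgeom() in R counts the number of failures before the first success, so 1 is added to match the "number of trials until success" convention used here:

set.seed(5)
p <- 0.3
x <- rgeom(100000, prob = p) + 1   # number of trials until the first success
mean(x)   # close to 1/p = 3.33
var(x)    # close to (1-p)/p^2 = 7.78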

Geometric distribution

The irrelevance of past events to the probability of future independent events

Given the first n trials are unsuccessful, the number of trials until success after the first n trials has the same distribution as the unconditional number of trials until success.

X~Geom(p) and n \(\in\) {1, 2, 3, …}

P(X-n=k|X>n)=P(X=k), k=1,2,3,…

The tail probability is used when calculating probabilities of the form "more than" or "at least"….

Addition

Independent random variable with the same distribution allows the addition law

Normal distribution

interpretation:

  • Standard normally distributed sth. z=1.58(corresponding to 0.9429):

    the probability of randomly selecting sth. with a value less than 1.58(unit) is equal to the area(probability) of 0.9429.

    (Or: 94.29% of sth, will have a value below 1.58(unit))

Sampling distributions and Estimators

We are beginning to embark on a journey that allows us to learn about populations by obtaining data from samples, since it is rare that we know all the values in an entire population.

The sampling distribution of a statistic is the probability distribution of a sample statistic (such as the mean or proportion, which tend to target the population mean or proportion), with all samples having the same sample size. This concept is important to understand: the behavior of a statistic can be known by understanding its distribution (the random variable in this case is the value of that sample statistic). Under certain conditions, the distribution of the sample mean or proportion approximates a normal distribution.

Although a statistic itself does not depend on unknown parameters, its sampling distribution does (e.g. the normal distribution of sample means depends on the population mean and standard deviation, which are unknown parameters).

(ps: the advantage of sampling with replacement:

when selecting a relatively small sample from a large population, it makes no significant difference whether we sample with or without replacement.

Sampling with replacement results in independent events that are unaffected by previous outcomes, and independent events are easier to analyze and they result in simpler formulas.)

For a fixed sample size, the mean of all possible sample means is equal to the population mean, even though the sample means themselves vary (sampling variability).

Unbiased estimators and biased estimators

Statistics that target population parameters: Mean, variance, proportion

Statistics that do not target population parameters (biased estimators): median, range, standard deviation

(The bias of the sample standard deviation is relatively small in large samples, so s is often used to estimate \(\sigma\).)

The Central Limit Thm

It is the foundation for estimating population parameters and for hypothesis testing.

(Recall: A random variable is a variable that has a single numerical value (e.g. x=1, x=2), determined by chance, for each outcome of a procedure; that is, for each outcome of the procedure the random variable takes one chance-determined numerical value (you can think of the chance phenomenon as acting on it).

A probability distribution is a graph, table or formula that gives the probability for each value of a random variable

The sampling distribution of the mean is the probability distribution of sample means, with all samples having the same sample size n.)

As the sample size increases, the corresponding sample means tend to vary less. The central limit theorem tells us that if the sample size is large enough, the distribution of sample means can be approximated by a normal distribution.

Conclusion of CLT: (it is so important since it allows us to use the basic normal distribution methods in a wide variety of different circumstances)

The distribution of sample means will, as the sample size increases, approach a normal distribution

The mean of all sample means is the population mean \(\mu.\). \[\mu_{\bar x}=\mu\]

The standard deviation of all sample means is \(\sigma/ \sqrt n\) (i.e. the normal distribution from the conclusion "The distribution of sample means will, as the sample size increases, approach a normal distribution" has standard deviation \(\sigma/ \sqrt n\)). \[\sigma_{\bar x}=\frac{\sigma}{\sqrt n}\] \(\sigma_{\bar x}\) is often called the standard error of the mean.

A common rule of thumb: the normal approximation for the sample mean is considered good when n \(\geq\) 30 (or when the original population is itself normally distributed).

Notice: Be careful to check whether the value comes from a normally distributed population or is a mean for some sample/group.
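A simulation sketch of the central limit theorem using a clearly non-normal population (an exponential distribution; the parameter choices are arbitrary):

set.seed(6)
n <- 40                                             # sample size
xbar <- replicate(5000, mean(rexp(n, rate = 1)))    # sample means from an exponential population
mean(xbar)   # close to the population mean, 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158
hist(xbar, breaks = 40, main = "Sampling distribution of the mean", xlab = "sample mean")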

Suppose the claimed population mean is a, and a sample of c subjects has mean b, which has a very small probability under the sampling distribution based on that population mean. If the mean really is a, then there is an extremely small probability of getting a sample mean of b or lower when c subjects are randomly selected. We can interpret this in two ways:

  1. the population mean really is a, and this sample represents a chance event that is extremely rare;

  2. the population mean is not a, and the sample is typical.

Because the probability is so low, it seems more reasonable to conclude that the population mean is lower than a. This is the thinking behind hypothesis testing.

Optional: Correction for a finite population

When sampling without replacement and the sample size n is greater than 5% of the finite population size N (i.e. n > 0.05N), the standard error of the mean is adjusted by the finite population correction factor \(\sqrt{\frac{N-n}{N-1}}\).

Using the Normal distribution as an approximation to the binomial distribution

requirement: np, n(1-p) \(\geq\) 5

Be careful: adjust x for continuity by + or - 0.5(eg. at least 99, choose 98.5)

to hypothesis testing

a question (eg):

In a test of a gender-selection technique assume that 100 couples using a particular treatment give birth to 52 girls (and 48 boys). If the technique has no effect, then the probability of a girl is approximately 0.5. If the probability of a girl is 0.5, find the probability that among 100 newborn babies, exactly 52 are girls. Based on the result, is there strong evidence supporting a claim that the gender-selection technique increases the likelihood that a baby is a girl?

it is a binomial distribution with np=nq=100*0.5=50\(\geq 5\)

so we use the normal distribution with mean 50 and \(\sigma = \sqrt {npq} = 5\) as an approximation to the binomial distribution

To answer "is there strong evidence supporting a claim that the gender-selection technique increases the likelihood that a baby is a girl?", we need to calculate the probability of 52 or more girls (x successes among n trials is an unusually high number of successes if P(\(x\geq a\)) is very small).

The reason is that if we looked only at exactly 52, the probability would certainly be small, because the probability of any single value occurring is small.

So if P(\(x\geq 52\)) is small, we can conclude that the gender-selection technique is effective.

(Summary: if the probability of a girl is 0.5, getting 52 or more girls is not at all unusual, as indicated by the large value of P(\(x\geq 52\)); so we do not have sufficient evidence to reject the assumption that the technique is not effective.)
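The probability discussed above can be computed both exactly and with the normal approximation (with the continuity correction, "52 or more girls" corresponds to the area above 51.5):

1 - pbinom(51, size = 100, prob = 0.5)   # exact binomial P(X >= 52), about 0.38
1 - pnorm(51.5, mean = 50, sd = 5)       # normal approximation with continuity correction, also about 0.38
# Both probabilities are large, so 52 girls among 100 births is not unusual when p = 0.5.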

Hypothesis testing has not been introduced yet, so here we reason entirely with the idea of using probability to determine when results are unusual:

  • Unusually low: x successes among n trials is an unusually low number of successes if P(x or fewer) is very small

Interpretation for another example of gender selection(using only unusual results to explain): Because the probability of more than 13 girls, which is 0.001 is so low, we conclude that it is unusual to get 13 girls among 14 babies (using binomial to calculate it). This suggests that the technique of gender selection appears to be effective since it is highly unlikely that the result of 13 girls among 14 births happened by chance.

Normal distribution

If a variable is the superposition of a large number of small, independent random factors, then that variable is approximately normally distributed.

e.g. measure error

Assessing Normality

In general, quantile plots can be used to assess any probability distribution.

For a normal quantile plot (or normal probability plot), it is a graph of points (x, y) where each x value is from the original set of sample data and each y value is the corresponding z score, i.e. the quantile value expected from the standard normal distribution.

Procedures

  • Histogram(not helpful for small data set)

  • outliers: reject normality if there is more than 1 outlier present(not helpful for small data set)

  • normal quantile plot:

sort data from lowest to highest
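A minimal sketch of producing a normal quantile plot in R (the data here are arbitrary simulated values used only for illustration):

set.seed(7)
x <- rexp(50)   # deliberately non-normal data
qqnorm(x)       # sample values against the expected standard normal quantiles
qqline(x)       # points far from this line suggest departure from normality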

  • Sampling distribution of \(\hat p\) when the population is infinite

Inference from two samples introduces the difference between two population means using matched pairs, but correlation and regression analyze the association between the two variables; if such an association exists, we want to describe it with an equation that can be used for predictions.

paired sampled data(or called bivariate data)

  • a correlation exists between two variables when one of them is related to the other in some way.

  • the linear correlation coefficient r measures the strength of the linear association between the paired x- and y-quantitative values in a sample. Its value is computed by using the formula(Pearson(1857-1937) product moment correlation coefficient)

……otherwise there is not sufficient evidence to support the conclusion of a significant linear equation

!!!: interpreting r: explained variation: the value of \(r^2\) is the proportion of the variation in y that is explained by the linear association between x and y (the remaining proportion is explained by factors other than x, such as characteristics not included in the study).

  • correlation does not imply causality,just the association

  • Averages suppress individual variation and may inflate the correlation coefficient (the linear correlation coefficient became higher when regional averages were used).

point estimation and confidence intervals

  • Tomorrow is the exam Good luck! 12.29

  • 12.30 Everything on MTH113 exam was fine

All in all, the exam did not decide anything. I think there is a long way to go: MTH113 does not include much statistics, and more than half of it was already taught in high school…… I will keep exploring the charismatic world of statistics, since we can not only understand it with basic common sense from everyday life but also prove its results in a rigorous way; and during the proofs, such as for regression, we can also enjoy the explanation through linear algebra, which I also enjoy…..

  • I love you MTH113!!!