Statistical Learning (Machine Learning)

Reference books

  • Efron, B., & Hastie, T. (2021). Computer age statistical inference, student edition: algorithms, evidence, and data science (Vol. 6). Cambridge University Press.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. Springer.

  • Casella, G., & Berger, R. (2024). Statistical inference. Chapman and Hall/CRC.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning.

Linear Regression

OLS

\((y-Xw)^T \Sigma ^{-1 }(y-Xw)\)

Homoscedasticity assumption: \(\Sigma = \sigma^2 I\) (contours are circles), but this can also be generalized to \(\Sigma = \sigma^2 D\) where \(D\) is a diagonal matrix (contours are ellipses).

\(X(X^TX)^{-1}X^T\) could be seen as a projection matrix, so if \(w=(X^TX)^{-1}X^Ty\), then \(Xw\) is the projection of \(y\) onto the column space of \(X\). From a similar perspective, \(y-Xw\) is the residual vector, which is orthogonal to the column space of \(X\).

If \(X^TX\) is singular, use the pseudo-inverse: \(w = X^+y\), where \(X^+\) is the Moore-Penrose pseudo-inverse of \(X\). This links to the SVD: if \(X = U \Sigma V^T\), then \(X^+ = V \Sigma^+ U^T\), where \(\Sigma^+\) inverts only the nonzero singular values. In this case the least-squares problem has infinitely many solutions, and \(X^+y\) is the minimum-norm one.

Remark: often we have \(n \gg d+1\), so the matrix \(X^TX\) is invertible; but if \(n < d+1\), then \(X^TX\) is necessarily singular and we need the pseudo-inverse. (\(X^TX\) is invertible if and only if the columns of \(X\) are linearly independent, which is usually the case in practice when \(n \gg d+1\), but not always.)
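
As a quick numerical check, here is a small R sketch on simulated data (my own toy example, not the course data): the normal-equation solution and the SVD-based pseudo-inverse solution coincide when \(X^TX\) is invertible.

# Simulated data with n >> d, so X^T X is invertible
set.seed(1)
n <- 100; d <- 3
X <- cbind(1, matrix(rnorm(n * d), n, d))          # design matrix with intercept
y <- X %*% c(1, 2, -1, 0.5) + rnorm(n)

w_normal <- solve(t(X) %*% X, t(X) %*% y)          # (X^T X)^{-1} X^T y
s <- svd(X)
w_pinv <- s$v %*% diag(1 / s$d) %*% t(s$u) %*% y   # X^+ y (in general invert only nonzero singular values)
max(abs(w_normal - w_pinv))                        # essentially zero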

E.g.: the simplified Ceres orbit problem

SEE PPT

Gauss’s secret weapon — observing 3 points

About 200 years ago there were arguments about the relationship between linear regression and the Normal distribution — think of MLE

regularization

It is used to handle the situation where \(n < d+1\) (more parameters than observations).

loss function

Why use the square loss? First, the geometric meaning — projection.

MLE

Likelihood function

p.s.: parametric statistics

Suppose the data follow a normal distribution — in that case, the MLE of the mean is the sample mean \(\hat\mu = \frac{1}{n} \sum_{i=1}^n x_i\).

skew

mean > median > mode — right skew

For skewed data, the median or the mode is a better summary of the central tendency than the mean.

Gauss did not publish until nine years later; he had truly understood the relationship between linear regression and the normal distribution.

WLS

Standard OLS assumes \(\mathrm{Var}(\varepsilon_i) = \sigma^2\) (constant variance).

Real-world data often violate this assumption, leading to heteroscedasticity (non-constant variance of the errors), say \(\mathrm{Var}(\varepsilon) = \sigma^2 D\) with \(D\) a diagonal matrix whose entries vary across observations, or equivalently \(\mathrm{Var}(\varepsilon_i) = \sigma_i^2\).

OLS estimates remain unbiased but are not efficient (they have higher variance than necessary) when the errors are heteroscedastic (i.e., the variance of the errors is not constant across observations).

(The errors are assumed to follow a normal distribution: fixing \(x\), the corresponding responses follow a normal distribution.)

Consequences of ignoring heteroscedasticity: inefficient estimates and invalid standard errors. The remedy (weighted / generalized least squares) minimizes:

\((y-Xw)^T \Sigma ^{-1 }(y-Xw)\)

\(\Sigma ^{-1 }\) is the weight matrix which is diagonal for uncorrelated errors and could be a full matrix for correlated errors.

Remark: standard OLS assumes \(\Sigma = \sigma^2 I\), which is a special case of the weighted least squares where all weights are equal.

Problem: \(D\) is not known in advance, so we estimate it and apply weighted least squares iteratively (feasible / iteratively reweighted least squares).

Now, the weighted residual sum of squares is:

\[ (y-Xw)^T \Sigma^{-1} (y-Xw) \]

Let \(D = \Sigma^{-1}\); then the weighted residual sum of squares is:

\[ y^TDy - 2w^TX^TDy + w^TX^TDXw \]

\(\hat w_{WLS} = (X^TDX)^{-1}X^TDy\)

import statsmodels.api as sm
# y, X and the estimated per-observation variances var_est are assumed
# to be defined already; the WLS weights are the inverse variances.
mod_wls = sm.WLS(y, X, weights=1/var_est)
res_wls = mod_wls.fit()
print(res_wls.summary())

Concavity and Hessian Matrices

(Semi-)Definite Matrices

For a vector x that is not an eigenvector, the action of matrix A can be regarded as projecting x onto the eigenvector basis and then stretching it in each basis direction.

Since the eigenvalues are non-negative, this stretching never reverses the direction along any basis direction. From the perspective that any vector can be written as a linear combination of the eigenvectors (and generalized eigenvectors) of A, here is why the action of A can be regarded as projecting x onto the eigenvector basis:

Under the eigenvector basis, applying A to a vector x can be thought of as first projecting x onto the subspaces spanned by the eigenvectors (writing x as a linear combination of eigenvectors) and then stretching each component by the eigenvalue associated with that eigenvector (\(Av = \lambda v\)).

  • Remark: We do not need to consider generalized eigenvectors here. Whether the matrix is positive definite, positive semi-definite, or negative definite, it is symmetric, and symmetric matrices can always be diagonalized: their eigenvectors span the whole space and form a basis, with no need for generalized eigenvectors, i.e.

Since \[ A = A ^T \] and \[ Av_1 = \lambda_1 v_1 \] and \[ Av_2 = \lambda_2 v_2 \]

then we have

\[ v_2^TAv_1=\lambda_1 v_2^Tv_1 \] Since \(A=A^T\),

\[ v_2^TAv_1 = (Av_2)^Tv_1 = \lambda_2 v_2^Tv_1 \]

So

\[ \lambda_1 v_2^Tv_1 = \lambda_2 v_2^Tv_1 \]

So

if \(\lambda_1 \neq \lambda_2\), then \(v_2^Tv_1 = 0\), which means \(v_1\) and \(v_2\) are orthogonal and therefore can form a basis of the space.

??? Use the understanding of “left multiplication” to explain why the role of matrix A can be regarded as projecting x onto the basis of the eigenvectors:

If \(X\) has full column rank, then \(X^TX\) is positive definite.

  • Singular Value Decomposition (SVD):

Every matrix can be decomposed into the product of three matrices: \(X = U \Sigma V^T\), where \(U\) and \(V\) are orthogonal matrices and \(\Sigma\) is a diagonal matrix with non-negative entries (the singular values). The columns of \(U\) and \(V\) are the left and right singular vectors, respectively.

We now have:

\[ X^TX= V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T \]

This means that the eigenvalues of \(X^TX\) are the squares of the singular values of \(X\), which are non-negative. Therefore, \(X^TX\) is positive semi-definite.

When X has full rank, all singular values are positive, and thus \(X^TX\) is positive definite.
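
A small numerical check in R (simulated data) that the eigenvalues of \(X^TX\) are the squared singular values of \(X\), hence non-negative:

set.seed(1)
X <- matrix(rnorm(20), 5, 4)                 # 5 x 4 matrix, full column rank
eig <- eigen(t(X) %*% X)$values              # eigenvalues of X^T X
sv <- svd(X)$d                               # singular values of X
cbind(eigenvalue = eig, singular_sq = sv^2)  # equal, and all non-negative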

\(X^TAX = 1\) is an ellipse (ellipsoid)

\(A=Q\Lambda Q^T\) is the eigendecomposition of \(A\), where \(Q\) is an orthogonal matrix whose columns are the eigenvectors of \(A\), and \(\Lambda\) is a diagonal matrix whose diagonal entries are the eigenvalues of \(A\).

Then we have:

\[ X^TAX = X^TQ\Lambda Q^TX = (Q^TX)^T\Lambda (Q^TX) = Y^T\Lambda Y=1 \]

where \(Y = Q^TX\).

This says \(\lambda_1 y_1^2 + \lambda_2 y_2^2 + ... + \lambda_n y_n^2 = 1\), where \(\lambda_i\) are the eigenvalues of \(A\) and \(y_i\) are the components of the vector \(Y\).

If \(\lambda_i > 0\) for all \(i\), then this is an equation of an ellipsoid in the space spanned by the eigenvectors of \(A\). A special case is when all \(\lambda_i\) are equal, which corresponds to a sphere or a circle in 2D.

See the 3Blue1Brown video.

Mahalanobis distance and Euclidean distance

The difference between the Euclidean distance and the Mahalanobis distance is a linear transformation: the Mahalanobis distance \(\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}\) is the Euclidean distance computed after whitening the data with \(\Sigma^{-1/2}\).
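
A small R illustration (my own simulated numbers): the Mahalanobis distance equals the Euclidean distance computed after the whitening transformation \(\Sigma^{-1/2}(x - \mu)\).

Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)                   # covariance matrix
mu <- c(0, 0)
x <- c(1, 2)

d_mahal <- sqrt(mahalanobis(x, center = mu, cov = Sigma))  # mahalanobis() returns the squared distance
R <- chol(Sigma)                                           # Sigma = R^T R, R upper triangular
z <- solve(t(R), x - mu)                                   # whitened coordinates
d_eucl <- sqrt(sum(z^2))                                   # Euclidean distance after whitening
c(d_mahal, d_eucl)                                         # the two distances agree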

Categorical variables in linear / logistic regression – reference (baseline) level

glm

  • random component: \(p(y \mid \theta, \phi) = \exp\left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)\)

  • systematic component: \(Xw\)

  • link function: \(g(\mu) = Xw\), where \(\mu = E(y \mid X)\)

special case : logistic regression

Soft classification: output a probability rather than a hard class assignment

Supervised learning: the labels are established facts — 0/1

Unsupervised learning: no established facts (no labels)

Machine learning: the perceptron (a single neuron)

Non-linearity: a non-linear function wrapped around a linear combination

Why the logit?

Its inverse maps the real line to the interval (0, 1);

KL divergence — measures whether two distributions differ significantly

Shannon entropy (香农熵)

No analytic solution – Newton-Raphson method (IRLS) / gradient ascent – only suitable for concave objectives (the logistic log-likelihood is concave)

\(w^{(t+1)} = w^{(t)} + \lambda \nabla \ell(w^{(t)})\), where \(\lambda\) is called the learning rate – too small and convergence is slow, too large and the iterates may cycle and fail to converge
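
As a small illustration of this update rule, here is a hedged R sketch (simulated data, not the course code) that fits logistic regression by gradient ascent on the log-likelihood; glm(..., family = binomial) gives the IRLS solution for comparison.

set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                        # intercept plus one covariate
beta_true <- c(-0.5, 1.5)
p <- 1 / (1 + exp(-X %*% beta_true))
y <- rbinom(n, 1, p)

w <- c(0, 0)                                   # initial weights
lambda <- 0.01                                 # learning rate
for (t in 1:5000) {
  mu <- 1 / (1 + exp(-X %*% w))                # current predicted probabilities
  grad <- t(X) %*% (y - mu)                    # gradient of the log-likelihood
  w <- w + lambda * grad                       # gradient ascent step
}
cbind(gradient_ascent = as.numeric(w),
      irls = coef(glm(y ~ X[, 2], family = binomial)))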

training data and test data

80/20 train/test split — cross-validation

  • Extension of logistic regression : Softmax regression (multinomial logistic regression) – multiple classes

Michael I. Jordan

Cross-entropy loss

neural networks use it

batch effect

Interpretable neural networks: give a reason for the choice (e.g., why this \(\lambda\))

AlphaGo: someone who does not understand Go can still win with a sufficiently powerful algorithm

\(x \xrightarrow{f} y\) — a score is used to judge the goodness of \(f\)

Fisher’s discriminant analysis (FDA) – linear discriminant analysis (LDA) – linear regression

Contribution:

Perceptron algorithm

LDA has the analytic solution

Activation functions

logistic

ReLU

Perceptron

Logistic regression

SURF poster – gradient ascent

variable transformation formula–the first conversation with Dr. Zhu

  • idea: the derivative of the CDF is the PDF, so we can derive the density of a transformed variable by way of its CDF

Assume we have two random variables \(X\) and \(Y\), where \(Y = g(X)\) is a monotonic and differentiable function of \(X\). Our goal is to find the probability density function \(f_Y(y)\) of \(Y\), given the probability density function \(f_X(x)\) of \(X\).

Variable Transformation Formula

The variable transformation formula is stated as follows: \[f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|\]

Proof Process

  1. Transformation of the Cumulative Distribution Function (CDF):

    First, consider the cumulative distribution function \(F_Y(y)\) of \(Y\), which is defined as the probability that \(Y\) is less than or equal to \(y\):

    \[F_Y(y) = P(Y \leq y)\] Since \(Y = g(X)\), we can express this probability in terms of \(X\):

    \[F_Y(y) = P(g(X) \leq y)\]

  2. Using the Inverse Function Transformation:

    If \(g(x)\) is strictly increasing (or strictly decreasing), then \(g(X) \leq y\) is equivalent to \(X \leq g^{-1}(y)\) (or \(X \geq g^{-1}(y)\)). Therefore:

    \[F_Y(y) = P(X \leq g^{-1}(y)) = F_X(g^{-1}(y))\]

    where \(F_X(x)\) is the cumulative distribution function of \(X\).

  3. Deriving the Probability Density Function (PDF):

    To find the probability density function \(f_Y(y)\) of \(Y\), we need to differentiate \(F_Y(y)\) with respect to \(y\):

    \[f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(g^{-1}(y))\]

    Using the chain rule, we get:

    \[f_Y(y) = f_X(g^{-1}(y)) \frac{d}{dy} g^{-1}(y)\]

  4. Considering the Absolute Value of the Derivative:

    Since the probability density function must be non-negative, we take the absolute value of the derivative: \[f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right|\]

This completes the proof of the variable transformation formula. This formula is very useful when dealing with transformations of random variables, especially in statistical analysis and probability modeling. With this formula, we can derive the probability density function of one related random variable from the known probability density function of another.
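
A quick numerical check in R (my own example): take \(X \sim N(0,1)\) and \(Y = e^X\), so \(g^{-1}(y) = \log y\) and \(\left|\frac{d}{dy} g^{-1}(y)\right| = 1/y\); the formula reproduces the log-normal density.

y_grid <- seq(0.1, 5, by = 0.1)
f_Y_formula <- dnorm(log(y_grid)) / y_grid            # f_X(g^{-1}(y)) * |d g^{-1}(y) / dy|
f_Y_known <- dlnorm(y_grid, meanlog = 0, sdlog = 1)   # known log-normal density
max(abs(f_Y_formula - f_Y_known))                     # essentially zero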

Sampling Methods

Bootstrap

Example: calculating the p-value of a KS statistic (goodness of fit) when the distribution is estimated from censored data

Research question: is there a difference between the empirical distribution (Kaplan-Meier estimate) and the estimated theoretical distribution? In other words, does my interval-censored data come from the estimated (theoretical) distribution?

Null Hypothesis: The data indeed come from the estimated distribution.

Steps of the Parametric Bootstrap Method:

  1. Simulate a World Where the Null Hypothesis Holds:

    • Generate a large number (B = 1000) of new samples from the estimated distribution.

    • Apply the exact same interval-censoring structure to each new sample (using the original left_times and right_times).

    Assume that the observed time points do not change

  2. Calculate Bootstrap KS Values:

    • For each bootstrap sample, compute the KS-statistic.

    • Obtain 1000 bootstrap KS values (ks_bootstrap).

  3. Calculate the p-value:

    • p = (number of ks_bootstrap \(≥\) ks_stat) / B (a larger ks-stat indicates a greater difference between the data and the distribution).

Interpretation: a large p-value means that, if H₀ were true, it would be quite probable to observe a KS statistic at least as large as the one computed from the original data. The observed difference is therefore not unusual and falls within the normal range of random variation, so we do not have sufficient evidence to reject H₀ and we conclude that the data likely come from the estimated distribution.
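
Below is a simplified, runnable R sketch of the parametric bootstrap p-value. For clarity it uses uncensored data and an exponential fit as a stand-in for the estimated distribution; in the real analysis each bootstrap sample would additionally be interval-censored with the original left_times and right_times, and the KS statistic computed against the KM/NPMLE estimate.

set.seed(1)
obs <- rexp(80, rate = 0.5)                      # observed data (stand-in)
rate_hat <- 1 / mean(obs)                        # MLE of the exponential rate
ks_stat <- ks.test(obs, "pexp", rate = rate_hat)$statistic

B <- 1000
ks_boot <- replicate(B, {
  sim <- rexp(length(obs), rate = rate_hat)      # simulate a world where H0 holds
  rate_b <- 1 / mean(sim)                        # re-estimate on the bootstrap sample
  ks.test(sim, "pexp", rate = rate_b)$statistic
})
p_value <- mean(ks_boot >= ks_stat)              # larger KS = larger discrepancy
p_value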

Some Perspectives on Penalized Regression Models

OLS / MLE Solution to Linear Regression and Its Issues

We have learned the equivalence of Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE) in the context of linear regression, assuming the error term follows a Gaussian distribution, but this approach has the limitations below:

  • Issues:

Even if the OLS or MLE solution is optimal in terms of minimizing the error, it may lead to overfitting, especially when the number of features is large relative to the number of observations. This can result in high variance and poor generalization to new data.

How to handle the issues?

\[\begin{align*} \text{LASSO:} & \quad \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \quad \text{where } \lambda > 0 \\ \text{Ridge:} & \quad \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \quad \text{where } \lambda > 0 \end{align*}\]
  • OLS perspective: this shrinkage method not only pursues the OLS objective of minimizing the error, but also prevents overfitting by penalizing large coefficients.

  • MLE perspective: compared with MLE, this method can be regarded as Bayesian (MAP) estimation that avoids overfitting by introducing a prior distribution \(P(\beta)\); especially when the data volume is small, the prior information plays a regularization role.

Why do the regularization terms look like this?

Just as the relationship between OLS and MLE explains why the error term is squared, we now use the Bayesian perspective to understand the penalty terms.

Choose Prior Distribution \(P(\beta)\)

Idea:

A priori, we believe \(\beta\) is more likely to be small (very near 0) and to have small variance. This means we should pick a prior distribution that has a peak around zero and decays quickly as we move away from zero.

  • Prior Distribution: In the context of penalized regression, we choose a prior distribution for the coefficients \(\beta\) that reflects our beliefs about their values. Common choices include:
    • Laplace Prior: Assumes coefficients follow a Laplace distribution, leading to \(L^1\) regularization (Lasso Regression).
    • Gaussian Prior: Assumes coefficients are normally distributed around zero, which leads to \(L^2\) regularization (Ridge Regression).

This also means it would take extreme evidence from the observed data to accept a very large \(\beta\) (or one with very high variance), because of the prior.

Maximize the posterior probability to get the point estimation of \(\beta\) – Ridge

\[ P(\beta|y) \propto P(y|\beta) \cdot P(\beta) \]

\[ P(y|\beta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left\{-\frac{(y - \beta^T x)^2}{2\sigma^2}\right\} \]

And for the prior part, we assume \(\beta\) follows a Gaussian distribution, \(\beta \sim \mathcal{N}(0, \sigma'^2)\),

\[ P(\beta) = \frac{1}{\sqrt{2\pi}\sigma'} \exp\left\{-\frac{\beta^2}{2\sigma'^2}\right\} \]

Thus,

\[ \hat{\beta} = \arg\max_\beta \log P(\beta|Y) \]

\[ = \arg\max_{\beta} \log P(Y|\beta)P(\beta) \]

\[ = \arg\max_{\beta} \log \prod_{i=1}^N P(y_i|\beta)P(\beta) \]

\[ = \arg\max_{\beta} \sum_{i=1}^N \log[P(y_i|\beta)P(\beta)] \]

\[ = \arg\max_{\beta} \sum_{i=1}^N \left[ \log\left(\frac{1}{\sqrt{2\pi}\sigma } \exp\left\{-\frac{(y - \beta^T x)^2}{2\sigma^2}\right\}\right) + \log(\frac{1}{\sqrt{2\pi}\sigma'})- \frac{\beta^2}{2\sigma'^2} \right] \]

\[ = \arg\max_{\beta} \sum_{i=1}^N \left[ \log\left(\frac{1}{2\pi \sigma' \sigma}\right) - \frac{(y_i - \beta^T x_i)^2}{2\sigma^2} - \frac{\beta^2}{2\sigma'^2} \right] \]

\[ = \arg\min_\beta \sum_{i=1}^N \left[ \frac{(y_i - \beta^T x_i)^2}{2\sigma^2} + \frac{\beta^2}{2\sigma'^2} \right] \]

\[ = \arg\min_\beta (Y-X\beta)^T (\frac{1}{2\sigma^2}I_N) (Y-X\beta) + \frac{1}{\sigma'^2} \beta^2 \]

(This is weighted least squares with weight matrix \(\frac{1}{2\sigma^2}I_N\) and penalty term \(\frac{1}{\sigma'^2} \beta^T\beta\).)

If we let \(\lambda = \frac{1}{\sigma'^2}\), then the loss function is \(L(\beta) = \sum_{i=1}^N (\beta^T x_i - y_i)^2 + \lambda \beta^T \beta\), which is equivalent to the minimization of the regularized least squares in \(L^2\).

  • Remark on Ridge Regression

The ridge regression solution is also the posterior mean: the likelihood of the data given the parameters is Gaussian and the prior is Gaussian, so by conjugacy the posterior is also Gaussian, and for a Gaussian distribution the mode (the MAP estimate) coincides with the mean, i.e.

for \(y = X\beta + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2 I)\) and \(\beta \sim \mathcal{N}(0, \sigma'^2 I)\), the posterior is \(\beta \mid y \sim \mathcal{N}\left(\Sigma_n X^T y / \sigma^2, \ \Sigma_n\right)\) with \(\Sigma_n = \left(X^TX/\sigma^2 + I/\sigma'^2\right)^{-1}\), whose mean \(\left(X^TX + \tfrac{\sigma^2}{\sigma'^2} I\right)^{-1} X^T y\) is exactly the ridge solution.

Maximize the posterior probability to get the point estimation of \(\beta\) – Lasso

Now the prior part is assumed to be a Laplace distribution, \(\beta \sim \text{Laplace}(0, b), i.e., P(\beta) = \frac{1}{2b} \exp\left(-\frac{|\beta|}{b}\right)\).

\[ \arg\max_{\beta} \sum_{i=1}^N \left[ \log\left(\frac{1}{\sqrt{2\pi}\sigma } \exp\left\{-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right\}\right) - \log 2b - \frac{|\beta|}{b} \right] \]

\[ = \arg\max_{\beta} \sum_{i=1}^N \left[ \log\left(\frac{1}{2\sqrt{2\pi}\sigma b}\right) - \frac{(y_i - \beta^T x_i)^2}{2\sigma^2} - \frac{|\beta|}{b} \right] \]

\[ = \arg\min_\beta \sum_{i=1}^N \left[ \frac{(y_i - \beta^T x_i)^2}{2\sigma^2} + \frac{|\beta|}{b} \right] \]

\[ = \arg\min_\beta (Y-X\beta)^T (\frac{1}{2\sigma^2}I_N) (Y-X\beta) + \frac{1}{b} |\beta| \]

If we let \(\lambda = \frac{1}{b}\), then the loss function is \(L(\beta) = \sum_{i=1}^N (\beta^T x_i - y_i)^2 + \lambda |\beta|\), which is equivalent to the minimization of the regularized least squares in \(L^1\).

Further check \(\lambda\) and corresponding term in the posterior likelihood–How the penalized terms are derived from the Bayesian perspective

Gaussian:

\(\lambda = \frac{1}{\sigma'^2}\), i.e. a larger \(\lambda\) (smaller \(\sigma'^2\)) means more regularization, corresponding to a smaller variance of the (normal) prior distribution

Laplace:

\(\lambda = \frac{1}{b}\), i.e. a larger \(\lambda\) (smaller \(b\)) means more regularization, corresponding to a smaller variance of the (Laplace) prior distribution

How to choose the \(\lambda\)?

  • Cross-Validation: A common method to choose the regularization parameter \(\lambda\) is through cross-validation. This involves splitting the data into training and validation sets, fitting the model with different values of \(\lambda\), and selecting the one that minimizes the prediction error on the validation set.

Lagrange Multipliers Perspective of Penalized Models

Lagrange multipliers provide a way to incorporate constraints into optimization problems. In the context of penalized regression, we can view the regularization term as a constraint on the size of the coefficients.

Taking Lasso as an example, the Lagrange multiplier perspective can be expressed as:

\[ \left( \hat{\alpha}, \hat{\beta} \right) = \arg \min \left\{ \sum_{i=1}^{N} \left( y_{i} - \alpha - \sum_{j} \beta_{j} x_{ij} \right)^{2} \right\} \quad \text{subject to} \quad \sum_{j} \left| \beta_{j} \right| \leq t. \]

i.e.

\[ g(\beta)= \sum_{j} \left| \beta_{j} \right| - t \leq 0 \]

\[ L(\alpha,\beta, \lambda) = \sum_{i=1}^{N} \left( y_{i} - \alpha - \sum_{j} \beta_{j} x_{ij} \right)^{2} + \lambda \left( \sum_{j} \left| \beta_{j} \right| - t \right) \]

\[ \partial L / \partial \alpha = 0 \]

\[ \hat{\alpha} = \bar{y} - \sum_{j} \hat{\beta}_{j} \bar{x}_{j} \]

\[ \partial L / \partial \beta_{j} = 0 \]

\[ -2 \sum_{i=1}^{N} x_{ij} \left( y_{i} - \hat{\alpha} - \sum_{k} \hat{\beta}_{k} x_{ik} \right) + \lambda \cdot sign(\hat{\beta}_{j}) = 0 \]

This is typically solved with an iterative method (e.g., coordinate descent).

Generally, the formula could be expressed as:

\[ \min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_p^p \right\} \]

where \(\lVert \beta \rVert_p^p\) is the \(L^p\) norm of the coefficients, which serves as a penalty term.

The gradient of the penalized objective function with respect to \(\beta\) is given by:

\[ \nabla J(\beta) = -2X^T(y - X\beta) + \lambda \nabla \lVert \beta \rVert_p^p \]

(see matrix form (Go to Matrix Form) below for more details)

This should be set to zero to find the optimal \(\beta\): at the optimum, the gradient of the loss function (the first term) is balanced by the gradient of the penalty term (the second term) scaled by \(\lambda\), equal in magnitude and opposite in direction.

Geometric Interpretation of Penalized Models in Lagrange Multipliers Perspective

Geometric Interpretation of LASSO in Lagrange Multipliers Perspective and Bayesian Perspective

Story

  • Humorous story

Ryan Tibshirani interviews Rob Tibshirani about his work on the Lasso. Ryan said that when he was young he liked drawing diamonds and circles, and he happened to draw a circle that just touched the corner of a diamond, which led to a big idea Rob did not credit him for. “I did not cite you,” Rob answered humorously.

  • Idea

Yes, this is actually an interesting geometric meaning of LASSO.

  • To me(a biostat student) —a big meaning!!

Rob Tibshirani had gone back to his hometown of Toronto as a professor of preventive medicine and biostatistics, and he was working on the LASSO there.

Story—Rob Tibshirani’s motivation

  • Paper about non-negative garrote by Leo Breiman

    Leo started with the least squares estimates and then multiplied them by non-negative constants whose sum was bounded by some bound T; this has the effect of producing a sparse solution, because the non-negativity constraint together with the bound forces some of the constants to be exactly zero.

  • why choose “Lasso” name

    A gentler, softer term than “garrote”.

  • Many similar things that time

    You do not need to do research on unique things, but on promising things.

  • From LM to GLM to Cox, this generalization is cool

Understanding of \(\lambda\)

Lagrange multipliers help explain how \(\lambda\) balances fitting the data well against keeping the coefficients small:

A bigger \(\lambda\) corresponds to a smaller constraint region (equivalently, a smaller bound \(t\)), which the elliptical contours of the (weighted) least squares loss must meet (in WLS the contours are ellipses; under homoscedasticity they are circles). With a larger \(\lambda\) the penalty gradient is steeper, so the optimization keeps \(\beta\) within the shrinking feasible region. This helps reduce variance and prevent overfitting, but an overly large \(\lambda\) can cause underfitting.

Matrix Form of Penalized Regression Models

Ridge Regression

Objective Function

\[ \min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right\} \]

Matrix Form Derivation:

  • Expand the loss term: \[ \lVert y - X\beta \rVert_2^2 = (y - X\beta)^T (y - X\beta) = y^T y - 2y^T X\beta + \beta^T X^T X\beta. \]
  • Add the \(L^2\) penalty: \[ J(\beta) = y^T y - 2y^T X\beta + \beta^T X^T X\beta + \lambda \beta^T \beta. \]
  • Compute the gradient: \[ \frac{\partial J}{\partial \beta} = \nabla J(\beta) = -2X^T(y - X\beta) + 2\lambda \beta = -2X^T y + 2X^T X\beta + 2\lambda \beta. \]
  • Set gradient to zero: \[ -X^T y + X^T X\beta + \lambda \beta = 0 \implies (X^T X + \lambda I)\beta = X^T y. \]
  • Solve for \(\beta\): \[ \hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y \]

Why Regularization Helps:

The term \(\lambda I\) ensures \(X^T X + \lambda I\) is always invertible (since \(\lambda > 0\) adds positive values to the diagonal, guaranteeing full rank). This avoids singularity issues in \(X^T X\) when features are collinear.
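
A tiny R sketch (simulated data, with an arbitrary \(\lambda\)) of the closed-form solution \(\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y\), showing the shrinkage relative to OLS:

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, 0, -1, 0, 0.5)
y <- X %*% beta + rnorm(n)

lambda <- 10
beta_ols <- solve(t(X) %*% X, t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(ols = as.numeric(beta_ols), ridge = as.numeric(beta_ridge))   # ridge coefficients are shrunk toward zero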

Matrix Form of Penalized Regression Models

Lasso Regression

Formula:

\[ \hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\} \]

Want

\[ \min_{\beta} \left\{ \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \right\} \]

Matrix Form Derivation:

Lasso lacks a closed-form solution due to the non-differentiable L1 norm. Instead, we use the subgradient optimality condition:

  • Subgradient equation: \[ -2X^T(y - X\beta) + \lambda \cdot \text{sign}(\beta) = 0, \] where \(\text{sign}(\beta)\) is defined component-wise: \(\text{sign}(\beta_j) = 1\) if \(\beta_j > 0\), \(-1\) if \(\beta_j < 0\), and any value in \([-1, 1]\) if \(\beta_j = 0\) (the subgradient of \(|\cdot|\) at zero).

Key Insight

  • For \(\beta_j \neq 0\), the solution balances data fit and shrinkage.
  • For \(\beta_j = 0\), the condition requires: \[ \left| 2 \mathbf{x}_j^T (y - X\beta) \right| \leq \lambda. \] This induces sparsity (exact zeros in \(\beta\)), which Ridge cannot achieve.
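
A concrete way to see this sparsity: under an orthonormal design (\(X^TX = I\)) the lasso solution is the soft-thresholded OLS estimate, while ridge only rescales it. A minimal R sketch with hypothetical coefficient values:

soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

b_ols <- c(3, 0.4, -1.2, 0.05)       # hypothetical OLS estimates
lambda <- 1
soft_threshold(b_ols, lambda / 2)    # lasso: small coefficients become exactly 0
b_ols / (1 + lambda)                 # ridge: everything shrinks, nothing is exactly 0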

Differences between Lasso and Ridge

  • Lasso could realize variable selection by shrinking some coefficients to exactly zero, while Ridge regression shrinks all coefficients but does not set any to zero.

R code Example of Penalized Linear Regression Models

# Install and load necessary packages
# install.packages("glmnet")
# install.packages("ISLR2")
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
library(ISLR2)

# Load the Hitters dataset
data(Hitters)

# Remove missing values
Hitters <- na.omit(Hitters)

# Define the predictor matrix and response variable
# exclude the intercept term
x <- model.matrix(Salary ~ ., Hitters)[, -1] 
y <- Hitters$Salary
# Split the data into training and test sets
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]

# Fit the Lasso model on the training data
grid <- 10^seq(10, -2, length = 100)
# LASSO-LR
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1,
                    lambda = grid, family = 'gaussian')
# Plot the Lasso model to visualize coefficient paths
plot(lasso.mod)
Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
collapsing to unique 'x' values

# Perform cross-validation to find the optimal lambda
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
# Make predictions on the test set using the optimal lambda
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
test.mse <- mean((lasso.pred - y.test)^2)
print(paste("Test MSE with Lasso and optimal lambda:", test.mse))
[1] "Test MSE with Lasso and optimal lambda: 143673.618543046"
# Fit the Lasso model on the full dataset using the optimal lambda
lasso.full <- glmnet(x, y, alpha = 1,lambda = bestlam)
lasso.coef <- coef(lasso.full)
print("Lasso coefficients on the full dataset:")
[1] "Lasso coefficients on the full dataset:"
print(lasso.coef)
20 x 1 sparse Matrix of class "dgCMatrix"
                       s0
(Intercept)   -3.42073206
AtBat          .         
Hits           2.02965136
HmRun          .         
Runs           .         
RBI            .         
Walks          2.24850782
Years          .         
CAtBat         .         
CHits          .         
CHmRun         0.04994886
CRuns          0.22212444
CRBI           0.40183027
CWalks         .         
LeagueN       20.83775664
DivisionW   -116.39019204
PutOuts        0.23768309
Assists        .         
Errors        -0.93567863
NewLeagueN     .         
# Collect the non-zero coefficients (including the intercept) into a table
nz <- which(lasso.coef[, 1] != 0)
result_df <- data.frame(Variable = rownames(lasso.coef)[nz],
                        Coefficient = lasso.coef[nz, 1])
print(result_df)
# Extract the non-zero coefficients (excluding the intercept)
non_zero_coef <- lasso.coef[lasso.coef[,1] != 0, ][-1]  # drop the intercept term
coef_df <- data.frame(
  Variable = names(non_zero_coef),
  Coefficient = as.numeric(non_zero_coef)
)

# Sort by the absolute value of the coefficients
coef_df <- coef_df[order(abs(coef_df$Coefficient), decreasing = TRUE), ]

# Draw a bar chart of the coefficients
library(ggplot2)

ggplot(coef_df, aes(x = reorder(Variable, Coefficient), 
                   y = Coefficient, 
                   fill = ifelse(Coefficient > 0, "Positive", "Negative"))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Positive" = "dodgerblue", "Negative" = "firebrick")) +
  labs(title = "Lasso Regression Coefficients",
       subtitle = paste("Optimal lambda =", round(bestlam, 4)),
       x = "Predictor Variables",
       y = "Coefficient Value",
       fill = "Effect Direction") +
  coord_flip() +  # horizontal bars make the variable names easier to read
  theme_minimal() +
  theme(legend.position = "top",
        plot.title = element_text(face = "bold", size = 14),
        axis.text.y = element_text(size = 10))

Remark

Actually, the penalized approach can be applied to many regression models: not only linear regression, but also logistic regression, Poisson regression, Cox regression, etc.

The penalty terms can be added to the loss functions of these models in a similar way.

R Code of Penalized Cox Regression - Lasso using glmnet package

Want to minimize \[ -\log \left( \prod_{i:\delta_i=1} \frac{\exp\left(\sum_{j=1}^p x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}\exp\left(\sum_{j=1}^p x_{i'j}\beta_j\right)} \right) + \lambda P(\beta), \] where \(\delta_i\) is the event indicator (\(\delta_i = 1\) if the event is observed, \(0\) if censored), and \(P(\beta) = \sum_{j=1}^p \beta_j^2\) corresponds to a ridge penalty, or \(P(\beta) = \sum_{j=1}^p |\beta_j|\) corresponds to a lasso penalty.
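
A minimal sketch (simulated data, not the SURF project code) of lasso-penalized Cox regression with glmnet; alpha = 1 selects the lasso penalty and cross-validation picks \(\lambda\).

library(glmnet)
library(survival)

set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
time <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))  # only x1 and x2 matter
status <- rbinom(n, 1, 0.8)                               # 1 = event observed, 0 = censored
y <- Surv(time, status)

cvfit <- cv.glmnet(x, y, family = "cox", alpha = 1)       # lasso-penalized Cox model
coef(cvfit, s = "lambda.min")                             # sparse coefficient vector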

R Code example for LASSO in Cox Regression applied on selecting important features of pesticide poisoning

Click here to see LASSO code provided by SURF2024 instructed by Dr.He

Summary

We have seen many perspectives to understand the penalized regression models, including:

  • Bayes

  • Lagrange multipliers

  • matrix form

The greatest truths are the simplest.

Penalized regression models balance the trade-off between fitting the data well and keeping the model simple.

Everything is connected to everything else. Good theory interprets this connection in an interesting way.

Linear Regression: OLS and MLE

Penalized regression models: Lagrange multiplier (restriction in the geometric meaning and also algebra interpretation) and Bayesian perspective (prior: history knowledge restriction)

Background

Proof of Chebyshev’s Inequality

  • Proof of the Markov’s Inequality

If \(X\) is a non-negative random variable and \(a>0\), then \[P(X\ge a) \le \frac{E(X)}{a}\] \[ E(X) = \int_{0}^{\infty} x f(x) \, dx \]

\[ E(X) = \int_{0}^{a} x f(x) \, dx + \int_{a}^{\infty} x f(x) \, dx \] \[ \int_{a}^{\infty} x f(x) \, dx \geq \int_{a}^{\infty} a f(x) \, dx = a \int_{a}^{\infty} f(x) \, dx \]

\[ E(X) \geq a \cdot P(X \geq a) \]

\[ P(X \geq a) \leq \frac{E(X)}{a} \]

  • Proof of the Chebyshev’s Inequality

\[ P((X - \mu)^2 \geq k^2\sigma^2) \leq \frac{E[(X-\mu)^2]}{k^2\sigma^2}=\frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2} \]

\[ P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} \]

Law of Large Numbers (LLN)

  • Weak Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be a sequence of i.i.d. random variables with finite mean \(\mu\) and finite variance \(\sigma^2\). Then for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \to 0 \text{ as } n \to \infty \]

  • Strong Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be a sequence of i.i.d. random variables with finite mean \(\mu\). Then,

\[ P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1 \]

  • Proof of the Weak Law of Large Numbers

By Chebyshev’s inequality, for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)}{\epsilon^2} = \frac{\frac{1}{n^2}\cdot n\sigma^2}{\epsilon^2}=\frac{\sigma^2/n}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]

As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). Therefore, \[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \to 0 \text{ as } n \to \infty \]
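
A quick Monte Carlo illustration in R (my own example): sample means of Exp(1) draws concentrate around the true mean 1 as \(n\) grows, as the weak LLN predicts.

set.seed(1)
for (n in c(10, 100, 1000, 10000)) {
  xbar <- replicate(500, mean(rexp(n, rate = 1)))                          # 500 sample means of size n
  cat("n =", n, " P(|mean - 1| >= 0.1) is about", mean(abs(xbar - 1) >= 0.1), "\n")
}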

  • Proof of the Strong Law of Large Numbers (using Borel-Cantelli Lemma)

By Chebyshev’s inequality, for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{n\epsilon^2} \]

Let \(A_n = \left\{\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right\}\). Then,

\[ \sum_{n=1}^{\infty} P(A_n) \leq \sum_{n=1}^{\infty} \frac{\sigma^2}{n\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \sum_{n=1}^{\infty} \frac{1}{n} \]

The series \(\sum_{n=1}^{\infty} \frac{1}{n}\) diverges, so we cannot directly apply the Borel-Cantelli lemma here. However, we can use a modified approach. Consider the events \(B_k = \left\{\left|\frac{1}{2^k}\sum_{i=1}^{2^k} X_i - \mu\right| \geq \epsilon\right\}\). Then,

\[ \sum_{k=1}^{\infty} P(B_k) \leq \sum_{k=1}^{\infty} \frac{\sigma^2}{2^k\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \sum_{k=1}^{\infty} \frac{1}{2^k} = \frac{\sigma^2}{\epsilon^2} \]

Since \(\sum_{k=1}^{\infty} P(B_k)\) converges, by the Borel-Cantelli lemma (if the sum of the probabilities of a sequence of events is finite, then the probability that infinitely many of them occur is zero), we have

\[ P(B_k \text{ i.o.}) = 0 \]

(i.o. means infinitely often)

(p.s.: if the sum of the probabilities of a sequence of events is infinite and the events are independent, then the probability that infinitely many of them occur is one; this is the second Borel-Cantelli lemma)

This implies that \[ P\left(\lim_{k \to \infty} \frac{1}{2^k}\sum_{i=1}^{2^k} X_i = \mu\right) = 1 \]

Now, for any \(n\), there exists a \(k\) such that \(2^k \leq n < 2^{k+1}\). We can write \[ \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\left(\sum_{i=1}^{2^k} X_i + \sum_{i=2^k+1}^{n} X_i\right) \]

As \(k \to \infty\), the second term \(\frac{1}{n}\sum_{i=2^k+1}^{n} X_i\) becomes negligible, and we have \[ \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu \]

with probability 1. Therefore,

\[ P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1 \]

Central Limit Theorem (CLT)