APH203

Lecture 1

6 examples of Time Series Data

library(astsa)
plot(jj,type="o",ylab="Quarterly Earnings per Share")

plot(gtemp_both,type="o",ylab="Global Temperature Deviations")

library(astsa)
library(xts)
djiar = diff(log(djia$Close))[-1] # daily log returns of the DJIA closing price
plot(djiar, ylab="DJIA Returns", type="n")
lines(djiar)

return: rate of return on investment

Lesson: leave quickly once you have made some profit

par(mfrow =c(2,1))
plot(soi, ylab="", xlab="", main="Southern Oscillation Index")
plot(rec, ylab="", xlab="", main="Recruitment")

par(mfrow=c(2,1))
ts.plot(fmri1[,2:5], col=1:4, ylab="BOLD", main="Cortex")
ts.plot(fmri1[,6:9], col=1:4, ylab="BOLD", main="Thalamus & Cerebellum")

Note the periodic behavior of the fMRI series.

# Earthquakes and Explosions
par(mfrow=c(2,1))
plot(EQ5, main="Earthquake")
plot(EXP6, main="Explosion")

Classification of Time Series Data

  • Univariate vs Multivariate

  • Continuous vs Discrete

  • Stationary vs Non-Stationary

  • Linear vs Non-Linear

Methods for Time Series Analysis

  • Descriptive Methods

  • Statistical Time Series Analysis

  • White noise

w = rnorm(500,0,1) # 500 N(0,1) variates
plot.ts(w, main="white noise")

  • Moving Average and Filtering

\(v_t = \frac{1}{3}(w_{t-1}+w_t+w_{t+1})\)

v = filter(w, sides=2, filter=rep(1/3,3)) # moving average 
plot.ts(v, ylim=c(-3,3), main="moving average")

New models after the 1970s: ARIMA

Since the Gauss-Markov assumptions (\(E(\epsilon) = 0\) and \(\epsilon \sim N\), i.i.d.) are so strict, and heteroscedasticity is often observed in real data, new models were developed after the 1970s.

Question: if the variance ….. why not use WLS?

Basic background knowledge (review of the past)

proof of the Chebyshev’s Inequality

  • Proof of the Markov’s Inequality

If \(X\) is a non-negative random variable and \(a>0\), then \[P(X\ge a) \le \frac{E(X)}{a}\]

Proof (continuous case): if \(X\) has density \(f\), then \[ E(X) = \int_{0}^{\infty} x f(x) \, dx = \int_{0}^{a} x f(x) \, dx + \int_{a}^{\infty} x f(x) \, dx \geq \int_{a}^{\infty} x f(x) \, dx \]

Since \(x \geq a\) over the region of integration, \[ \int_{a}^{\infty} x f(x) \, dx \geq \int_{a}^{\infty} a f(x) \, dx = a \int_{a}^{\infty} f(x) \, dx = a \cdot P(X \geq a) \]

Therefore \[ E(X) \geq a \cdot P(X \geq a) \quad \Longrightarrow \quad P(X \geq a) \leq \frac{E(X)}{a} \]
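
A quick empirical check of the bound (a minimal R sketch; the Exp(1) distribution is an arbitrary choice):

# empirical check of Markov's inequality for X ~ Exp(1), so E(X) = 1
set.seed(1)
x = rexp(10000, rate=1)
a = 3
mean(x >= a)   # empirical P(X >= a), about exp(-3) = 0.05
mean(x)/a      # Markov bound E(X)/a, about 0.33 -- loose but valid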

  • Proof of the Chebyshev’s Inequality

\[ P((X - \mu)^2 \geq k^2\sigma^2) \leq \frac{E[(X-\mu)^2]}{k^2\sigma^2}=\frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2} \]

\[ P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2} \]
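
The same kind of empirical check for Chebyshev's bound (a sketch; \(X \sim N(0,1)\) and \(k=2\) are arbitrary choices):

# empirical check of Chebyshev's inequality for X ~ N(0,1), k = 2
set.seed(1)
x = rnorm(10000)
k = 2
mean(abs(x) >= k)   # empirical P(|X - mu| >= k*sigma), about 0.05
1/k^2               # Chebyshev bound, 0.25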

Law of Large Numbers (LLN)

  • Weak Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be a sequence of i.i.d. random variables with finite mean \(\mu\) and finite variance \(\sigma^2\). Then for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \to 0 \text{ as } n \to \infty \]

  • Strong Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be a sequence of i.i.d. random variables with finite mean \(\mu\). Then,

\[ P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1 \]

  • Proof of the Weak Law of Large Numbers

By Chebyshev’s inequality, for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)}{\epsilon^2} = \frac{\frac{1}{n^2}\cdot n\sigma^2}{\epsilon^2}=\frac{\sigma^2/n}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \]

As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). Therefore, \[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \to 0 \text{ as } n \to \infty \]
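
The shrinking probability can be seen in a small simulation (a sketch; the \(N(0,1)\) distribution and \(\epsilon = 0.1\) are arbitrary choices):

# P(|xbar - mu| >= eps) shrinks as the sample size n grows (here mu = 0)
set.seed(1)
eps = 0.1
for (n in c(10, 100, 1000, 10000)) {
  xbar = replicate(2000, mean(rnorm(n)))   # 2000 sample means, each from n N(0,1) variates
  cat("n =", n, " P(|xbar - mu| >= eps) ~", mean(abs(xbar) >= eps), "\n")
}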

  • Proof of the Strong Law of Large Numbers (using Borel-Cantelli Lemma)

By Chebyshev’s inequality, for any \(\epsilon > 0\),

\[ P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{n\epsilon^2} \]

Let \(A_n = \left\{\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right\}\). Then,

\[ \sum_{n=1}^{\infty} P(A_n) \leq \sum_{n=1}^{\infty} \frac{\sigma^2}{n\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \sum_{n=1}^{\infty} \frac{1}{n} \]

The series \(\sum_{n=1}^{\infty} \frac{1}{n}\) diverges, so we cannot directly apply the Borel-Cantelli lemma here. However, we can use a modified approach. Consider the events \(B_k = \left\{\left|\frac{1}{2^k}\sum_{i=1}^{2^k} X_i - \mu\right| \geq \epsilon\right\}\). Then,

\[ \sum_{k=1}^{\infty} P(B_k) \leq \sum_{k=1}^{\infty} \frac{\sigma^2}{2^k\epsilon^2} = \frac{\sigma^2}{\epsilon^2} \sum_{k=1}^{\infty} \frac{1}{2^k} = \frac{\sigma^2}{\epsilon^2} \]

Since \(\sum_{k=1}^{\infty} P(B_k)\) converges, by the Borel-Cantelli lemma (if the sum of the probabilities of a sequence of events is finite, then the probability that infinitely many of them occur is zero), we have

\[ P(B_k \text{ i.o.}) = 0 \]

(i.o. means infinitely often)

(p.s.: if the sum of the probabilities of a sequence of events is infinite and the events are independent, then the probability that infinitely many of them occur is one)

This implies that \[ P\left(\lim_{k \to \infty} \frac{1}{2^k}\sum_{i=1}^{2^k} X_i = \mu\right) = 1 \]

Now, for any \(n\), there exists a \(k\) such that \(2^k \leq n < 2^{k+1}\). We can write \[ \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\left(\sum_{i=1}^{2^k} X_i + \sum_{i=2^k+1}^{n} X_i\right) \]

As \(k \to \infty\), the second term \(\frac{1}{n}\sum_{i=2^k+1}^{n} X_i\) becomes negligible (making this precise requires controlling the fluctuation of the partial sums between \(2^k\) and \(2^{k+1}\), e.g., via Kolmogorov's maximal inequality), and we have \[ \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu \]

with probability 1. Therefore,

\[ P\left(\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right) = 1 \]
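
A path-wise illustration of the statement (a sketch, not part of the proof): the running mean of a single simulated sequence settles down at \(\mu\).

# running mean of one realization of 10000 iid N(2,1) variates, so mu = 2
set.seed(1)
x = rnorm(10000, mean=2, sd=1)
running_mean = cumsum(x)/seq_along(x)
plot.ts(running_mean, ylab="running mean")
abline(h=2, col=2)   # the true mean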

Central Limit Theorem (CLT)

Lecture 3

AIC and BIC

KL (Kullback-Leibler) divergence

  • Definition of entropy

The entropy of a discrete random variable \(X\) with probability mass function \(P(X)\) is defined as:

\[ H(X) = -\sum_{x} P(x) \log(P(x)) \]
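
For instance, a short R sketch computing the entropy of a discrete distribution (natural log is used; the base only changes the units):

# entropy of a discrete distribution: H(X) = -sum p*log(p)
entropy = function(p) -sum(p*log(p))
entropy(c(0.5, 0.5))   # log(2) ~ 0.693, the maximum for two outcomes
entropy(c(0.9, 0.1))   # ~ 0.325, less uncertain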

  • Definition of KL divergence

The Kullback-Leibler (KL) divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. It is defined as follows:

For discrete probability distributions \(P\) and \(Q\) defined on the same probability space, the KL divergence from \(Q\) to \(P\) is given by:

\[ D_{KL}(P || Q) = \sum_{x} P(x) \log\left(\frac{P(x)}{Q(x)}\right) = \sum_x P(x) \log\frac{1}{Q(x)} - \sum_x P(x) \log\frac{1}{P(x)} = H(P,Q) - H(P) \] where \(H(P,Q) = -\sum_x P(x)\log Q(x)\) is the cross-entropy; \(P(x)\) is the true distribution and \(Q(x)\) is the approximating distribution (the proposed p.d.f.).

For continuous probability distributions, the KL divergence is defined as:

\[ D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx \]

where \(p(x)\) and \(q(x)\) are the probability density functions of \(P\) and \(Q\), respectively.

  • example: see the R sketch below

  • disadvantage: we do not know the true distribution \(P(x)\)
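
A minimal R sketch of the discrete case (the two distributions are arbitrary examples):

# KL divergence between two discrete distributions on the same support
kl_div = function(p, q) sum(p*log(p/q))
p = c(0.5, 0.3, 0.2)   # "true" distribution P
q = c(0.4, 0.4, 0.2)   # approximating distribution Q
kl_div(p, q)           # ~ 0.025
kl_div(q, p)           # ~ 0.026; note KL divergence is not symmetric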

AIC (Akaike Information Criterion)

Following Akaike (1974), although the true distribution \(P(x)\) is unknown, we can use the maximum likelihood estimate (MLE) to approximate it. The AIC is derived from an estimate of the KL divergence between the true model and a candidate model.

AIC is defined as:

\[ AIC = -2 \log(L) + 2k \] where \(L\) is the maximized likelihood and \(k\) is the number of estimated parameters.
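
As a sanity check, AIC can be computed by hand and compared with R's AIC() (a sketch fitting an AR(1) to the Recruitment series; the dataset and order are arbitrary choices):

# AIC = -2*logLik + 2k for an AR(1) fit to the Recruitment series
library(astsa)
fit = arima(rec, order=c(1,0,0))
k = length(fit$coef) + 1    # AR coefficient and intercept, plus sigma^2
-2*fit$loglik + 2*k         # AIC by hand
AIC(fit)                    # matches the built-in value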

AICc

AICc is a corrected version of AIC that adjusts for small sample sizes. It is defined as:

\[ AICc = AIC + \frac{2k(k+1)}{n-k-1} \] where \(n\) is the sample size. As \(n\) goes to infinity, AICc converges to AIC.
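
Continuing the previous sketch, the correction can be computed directly from AIC (again just an illustrative AR(1) fit):

# AICc = AIC + 2k(k+1)/(n-k-1) for the same AR(1) fit
library(astsa)
fit = arima(rec, order=c(1,0,0))
k = length(fit$coef) + 1    # number of estimated parameters, incl. sigma^2
n = length(rec)             # sample size
AIC(fit) + 2*k*(k+1)/(n-k-1)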

BIC (Bayesian Information Criterion)

  • Bayes' Theorem

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

Let \(A\) correspond to the parameter \(\theta\) and \(B\) to the data \(x\):

\[ \hat \theta = \arg\max_{\theta} P(\theta|x) = \arg\max_{\theta} \frac{P(x|\theta)P(\theta)}{P(x)} = \arg\max_{\theta} P(x|\theta)P(\theta) \]
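
As an illustration of \(\arg\max_{\theta} P(x|\theta)P(\theta)\), a small sketch of MAP estimation for a Bernoulli probability with a Beta(2,2) prior (the data and prior are made up):

# MAP estimate of a Bernoulli probability theta with a Beta(2,2) prior
x = c(1, 1, 0, 1, 0, 1, 1, 1)              # hypothetical 0/1 data
theta = seq(0.001, 0.999, by=0.001)        # grid over the parameter
posterior = dbeta(theta, 2, 2) * theta^sum(x) * (1-theta)^sum(1-x)
theta[which.max(posterior)]                # 0.7, the mode of the Beta(8,4) posterior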

  • BIC formula

\[ BIC = -2 \log(L) + k \log(n) \]

where \(n\) is the sample size. The log-likelihood is penalized by the number of parameters and, through \(\log(n)\), by the sample size.

The log-likelihood term can also be replaced by a loss function.

BIC is more commonly used in high-dimensional data analysis; otherwise, AIC and BIC values are usually similar.
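
For instance, AIC and BIC can be compared across candidate AR orders (a sketch on the Recruitment series; the orders are arbitrary):

# compare AIC and BIC across AR(p) fits to the Recruitment series
library(astsa)
for (p in 1:4) {
  fit = arima(rec, order=c(p,0,0))
  cat("AR(", p, "):  AIC =", round(AIC(fit),1), "  BIC =", round(BIC(fit),1), "\n")
}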

Reference

https://harvard-iacs.github.io/2018-CS109A/a-sections/a-section-2/presentation/a-sec2-MLEtoAIC.pdf

Emphasis

stationary

non-parametric

e.g. polynomial approximation: \(e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}\)

B-spline

Kernel

Non-parametric methods also have drawbacks.

Given the data, any kernel function can be chosen for the estimation; the choice only affects estimation efficiency, and the results are similar, especially for small samples.
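
A small R sketch of this last point (simulated data; only base-R kernels are used): two different kernels give nearly the same smooth, so the kernel choice mainly affects efficiency.

# kernel smoothing of a noisy signal with two different kernels
set.seed(1)
x = seq(0, 10, length.out=200)
y = sin(x) + rnorm(200, sd=0.5)
fit_box    = ksmooth(x, y, kernel="box",    bandwidth=1)
fit_normal = ksmooth(x, y, kernel="normal", bandwidth=1)
plot(x, y, col="grey")
lines(fit_box, col=2)      # box kernel
lines(fit_normal, col=4)   # Gaussian kernel -- nearly the same curve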