Preptide

Statistics & Probability

Core statistical concepts: Z-score, Correlation, MLE, MAP, and Information Theory

Fundamental Statistics

1. Z-score (Standardization)

Measure of how many standard deviations a data point is from the mean.

$$z = \frac{x - \mu}{\sigma}$$

  • Mean of z-scores is 0.
  • Std Dev of z-scores is 1.
  • Use: Compare values across different scales; detect outliers (typically $|z| > 3$).
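The standardization above can be sketched in a few lines. This is a minimal illustration that assumes the population standard deviation (dividing by $n$); the helper name `z_scores` and the sample data are made up for the example.

```python
import statistics

def z_scores(values):
    """Standardize values using the population mean and standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std dev (divides by n)
    return [(x - mu) / sigma for x in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std dev 2
z = z_scores(data)
print(z)

# The standardized values have mean 0 and standard deviation 1.
print(statistics.fmean(z), statistics.pstdev(z))
```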

2. Correlation (Pearson)

Measure of the linear relationship between two variables $X$ and $Y$.

$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

Range: $[-1, 1]$. $\rho = 0$ means no linear relationship (a nonlinear relationship may still exist).

Sample Correlation: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}$

3. Spearman's Rank Correlation

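Non-parametric measure based on ranked values. Captures monotonic (not just linear) relationships.

$$\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$. This shortcut formula is exact only when there are no tied ranks; with ties, compute the Pearson correlation of the ranks instead.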
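To see the difference between the two correlation measures, the sketch below computes both on a monotonic but nonlinear relationship. It assumes no tied values (so the rank-difference formula applies); the function names and the data are illustrative only.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]  # monotonic in x, but not linear

# Spearman is exactly 1.0 because the ranks agree; Pearson is below 1.
print(pearson(xs, ys), spearman(xs, ys))
```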

4. Cosine Similarity

Measure of similarity between two non-zero vectors of an inner product space.

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}}$$

Range: $[-1, 1]$. Often used in NLP and other high-dimensional settings.
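A direct translation of the formula, as a minimal sketch for plain Python lists. It assumes both vectors are non-zero and of equal length; `cosine_similarity` is a name chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, close to 1
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```

Because only the direction matters, scaling either vector by a positive constant leaves the result unchanged.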

5. Covariance

Measure of joint variability of two variables.

$$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$

Properties:

  • $\text{Cov}(X, X) = \text{Var}(X)$
  • $\text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y)$
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$

6. Variance and Standard Deviation

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

$$\sigma = \sqrt{\text{Var}(X)}$$

Sample Variance (unbiased): $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$

7. Skewness

Measure of asymmetry of the distribution.

$$\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{E[(X - \mu)^3]}{\sigma^3}$$

  • $\gamma_1 = 0$: Symmetric (e.g., Normal).
  • $\gamma_1 > 0$: Right-skewed (long tail to the right).
  • $\gamma_1 < 0$: Left-skewed (long tail to the left).

8. Kurtosis

Measure of "tailedness" of the distribution.

$$\gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3$$

This is the excess kurtosis: subtracting 3 makes the Normal distribution the zero reference.

  • $\gamma_2 = 0$: Mesokurtic (Normal).
  • $\gamma_2 > 0$: Leptokurtic (heavier tails than Normal).
  • $\gamma_2 < 0$: Platykurtic (lighter tails than Normal).

Statistical Inference

1. Maximum Likelihood Estimation (MLE)

Method to estimate the parameters $\theta$ of a probability distribution by maximizing the likelihood function $L(\theta)$.

$$L(\theta) = P(X | \theta) = \prod_{i=1}^n P(x_i | \theta)$$

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{i=1}^n \log P(x_i | \theta)$$

Example (Gaussian): For $x_i \sim \mathcal{N}(\mu, \sigma^2)$:

$$\hat{\mu} = \frac{1}{n} \sum x_i, \quad \hat{\sigma}^2 = \frac{1}{n} \sum (x_i - \hat{\mu})^2$$
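The closed-form Gaussian estimates can be checked numerically. The sketch below computes them on a small made-up sample and confirms that nearby parameter values do not achieve a higher log-likelihood; the function names are illustrative.

```python
import math

def gaussian_mle(xs):
    """Closed-form MLE for a univariate Gaussian: sample mean and 1/n variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, var_hat

def gaussian_log_likelihood(xs, mu, var):
    """Sum of log N(x_i | mu, var) over the sample."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - sq / (2 * var)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)  # 5.0 4.0

best = gaussian_log_likelihood(data, mu_hat, var_hat)
# Perturbed parameters should never score higher than the MLE.
print(all(gaussian_log_likelihood(data, mu_hat + dm, var_hat + dv) <= best
          for dm in (-0.5, 0.0, 0.5) for dv in (-1.0, 0.0, 1.0)))
```

Note that the MLE variance divides by $n$, so it is biased downward relative to the $n-1$ sample variance.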

Properties:

  • Consistency: $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$.
  • Asymptotic Normality: $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$ where $I(\theta)$ is the Fisher Information.
  • Efficiency: Achieves Cramér-Rao Lower Bound asymptotically.

2. Maximum A Posteriori (MAP)

Estimate $\theta$ by maximizing the posterior distribution (incorporating a prior $P(\theta)$).

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta | X) = \arg\max_{\theta} \frac{P(X | \theta) P(\theta)}{P(X)}$$

Since $P(X)$ does not depend on $\theta$, taking logs gives:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left( \sum_{i=1}^n \log P(x_i | \theta) + \log P(\theta) \right)$$

  • If the prior $P(\theta)$ is uniform, MAP $\equiv$ MLE.
  • L2 Regularization (Ridge) corresponds to Gaussian Prior.
  • L1 Regularization (Lasso) corresponds to Laplace Prior.

3. Conjugacy in Bayesian Inference

A prior distribution $P(\theta)$ is conjugate to a likelihood function $P(X|\theta)$ if the posterior distribution $P(\theta|X)$ belongs to the same family as the prior.

Benefits:

  • Closed-form solution: Avoids expensive numerical integration (MCMC).
  • Interpretability: Posterior parameters update intuitively.
  • Sequential updating: Posterior from one step becomes prior for next.

Common Conjugate Pairs:

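| Likelihood | Prior | Posterior |
| --- | --- | --- |
| Bernoulli ($p$) | Beta($\alpha, \beta$) | Beta($\alpha + x, \beta + n - x$) |
| Binomial ($p$) | Beta($\alpha, \beta$) | Beta($\alpha + x, \beta + n - x$) |
| Poisson ($\lambda$) | Gamma($k, \theta$) | Gamma($k + \sum x_i, \frac{\theta}{1 + n\theta}$) |
| Normal ($\mu$, known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_{new}, \sigma_{new}^2$) |
| Multinomial ($p$) | Dirichlet($\alpha$) | Dirichlet($\alpha + x$) |

Here $x$ is the number of successes in $n$ trials (Beta rows) or the vector of category counts (Dirichlet row), and the Gamma distribution uses the shape-scale parameterization.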

Example: Beta-Bernoulli: Prior $p \sim \text{Beta}(\alpha, \beta)$. Observe $x$ successes in $n$ trials. Posterior:

$$P(p|X) \propto P(X|p) P(p) \propto p^x (1-p)^{n-x} \cdot p^{\alpha-1} (1-p)^{\beta-1}$$

$$P(p|X) \propto p^{x+\alpha-1} (1-p)^{n-x+\beta-1} \implies \text{Beta}(\alpha+x, \beta+n-x)$$

Interpretation: $\alpha$ and $\beta$ are "pseudo-counts" representing prior successes and failures.
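The Beta-Bernoulli update is just count bookkeeping, which the following sketch makes explicit. It also illustrates sequential updating: processing observations one at a time gives the same posterior as processing them in a batch. The function names and the observation list are hypothetical.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2, 2)            # two pseudo-successes, two pseudo-failures
obs = [1, 1, 0, 1]        # three successes, one failure

batch = beta_bernoulli_update(*prior, obs)
print(batch, beta_mean(*batch))  # (5, 3) 0.625

# Sequential updating: the posterior after each step is the next prior.
state = prior
for o in obs:
    state = beta_bernoulli_update(*state, [o])
print(state == batch)  # True
```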

4. Method of Moments (MoM)

Estimate parameters by equating sample moments to population moments.

$$\hat{\theta}: \quad \frac{1}{n}\sum x_i^k = E[X^k] \text{ for } k=1,2,\ldots$$

Example (Normal): $\hat{\mu} = \bar{x}$, $\hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2$.

5. Expectation-Maximization (EM)

Iterative method for MLE when data has latent variables.

E-Step: Compute $Q(\theta | \theta^{(t)}) = E_{Z|X,\theta^{(t)}}[\log P(X, Z | \theta)]$.

M-Step: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)})$.

Guarantee: $\log P(X | \theta^{(t+1)}) \geq \log P(X | \theta^{(t)})$.

Applications: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Factor Analysis.

Hypothesis Testing

1. Null Hypothesis Significance Testing (NHST)

Framework for statistical hypothesis testing.

  • Null Hypothesis $H_0$: Default assumption (e.g., no effect).
  • Alternative Hypothesis $H_1$: What we want to detect.
  • p-value: Probability of observing data at least as extreme as what was observed, assuming $H_0$ is true.
  • Significance Level $\alpha$: Threshold (typically 0.05). Reject $H_0$ if $p < \alpha$.

Type I Error: False Positive (reject $H_0$ when it is true). $P(\text{Type I}) = \alpha$.

Type II Error: False Negative (fail to reject $H_0$ when it is false). $P(\text{Type II}) = \beta$.

Power: $1 - \beta$ (probability of correctly rejecting $H_0$).

2. t-test

Test if mean of population differs from a value (one-sample), or if two populations have different means (two-sample).

One-Sample t-statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$ where $s$ is the sample standard deviation. Follows a $t_{n-1}$ distribution under $H_0: \mu = \mu_0$.

Two-Sample t-test (equal variance): $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ where $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$ is the pooled variance.
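Both statistics are easy to compute by hand, as in this sketch using only the standard library. It stops at the test statistic: converting it to a p-value needs the t-distribution CDF, which is not in the standard library and is left out here. The function names and sample data are illustrative.

```python
import math
import statistics

def one_sample_t(xs, mu0):
    """t-statistic for H0: population mean equals mu0."""
    n = len(xs)
    s = statistics.stdev(xs)  # sample std dev (divides by n - 1)
    return (statistics.fmean(xs) - mu0) / (s / math.sqrt(n))

def pooled_two_sample_t(xs, ys):
    """Two-sample t-statistic under the equal-variance assumption."""
    n1, n2 = len(xs), len(ys)
    sp2 = ((n1 - 1) * statistics.variance(xs)
           + (n2 - 1) * statistics.variance(ys)) / (n1 + n2 - 2)
    diff = statistics.fmean(xs) - statistics.fmean(ys)
    return diff / math.sqrt(sp2 * (1 / n1 + 1 / n2))

a = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, s^2 = 32/7
b = [x + 1.0 for x in a]                        # same spread, mean 6

print(one_sample_t(a, 4.0))        # compare against t with n - 1 = 7 df
print(pooled_two_sample_t(a, b))   # compare against t with n1 + n2 - 2 = 14 df
```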

3. Chi-Squared Test

Test for independence in contingency tables or goodness-of-fit.

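Test Statistic: $\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$ where $O_i$ is the observed frequency and $E_i$ is the expected frequency under $H_0$.

Under $H_0$, the statistic approximately follows a $\chi^2_{k-1}$ distribution for a goodness-of-fit test with $k$ categories and no estimated parameters. For independence in an $r \times c$ contingency table, the degrees of freedom are $(r-1)(c-1)$.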

4. F-test

Test equality of variances between two populations.

$$F = \frac{s_1^2}{s_2^2}$$

Follows an $F_{n_1-1, n_2-1}$ distribution under $H_0: \sigma_1^2 = \sigma_2^2$.

Use in ANOVA: The F-statistic is the ratio of between-group to within-group mean squares and tests whether all group means are equal.

5. Kolmogorov-Smirnov (KS) Test

Non-parametric test to determine if a sample comes from a reference probability distribution (One-Sample) or if two samples come from the same distribution (Two-Sample).

Statistic: The maximum absolute difference between the empirical CDF and the reference CDF (one-sample), or between the two empirical CDFs (two-sample).

$$D_n = \sup_x |F_n(x) - F(x)|$$

where $F_n(x)$ is the empirical CDF and $F(x)$ is the reference CDF.

6. Mann-Whitney U Test (Wilcoxon Rank-Sum)

Non-parametric alternative to two-sample t-test. Tests if two samples have the same distribution.

Statistic: Count the pairs $(x_i, y_j)$ where $x_i > y_j$.

7. Multiple Testing Correction

Bonferroni: For $m$ tests, use significance level $\alpha/m$.

Benjamini-Hochberg (FDR): Controls the False Discovery Rate. Order the p-values: $p_{(1)} \leq \ldots \leq p_{(m)}$. Reject $H_{(i)}$ for all $i \leq k$ where $k = \max\{i: p_{(i)} \leq \frac{i}{m}\alpha\}$.
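The two corrections can be compared side by side. The sketch below follows the rules as stated above; the p-values are invented to show the step-up behavior of Benjamini-Hochberg, where a p-value above its own threshold is still rejected because a larger one passes.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is below alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = max{i : p_(i) <= (i / m) * alpha}."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

p = [0.001, 0.03, 0.031, 0.032, 0.9]
print(bonferroni(p))           # [True, False, False, False, False]
print(benjamini_hochberg(p))   # [True, True, True, True, False]
```

Here 0.03 and 0.031 exceed their own thresholds (0.02 and 0.03), yet they are rejected because the fourth p-value, 0.032, is below its threshold of 0.04.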

Information Theory

1. Entropy (Shannon Entropy)

Measure of uncertainty or average information content.

$$H(X) = - \sum_{x} P(x) \log P(x) = E[-\log P(X)]$$

For binary classification ($p$ vs $1-p$): $H(p) = -p \log p - (1-p) \log (1-p)$

Properties:

  • $H(X) \geq 0$ with equality iff $X$ is deterministic.
  • $H(X)$ is maximized when $X$ is uniform.
  • Joint Entropy: $H(X, Y) = -\sum_{x,y} P(x,y) \log P(x,y)$.
  • Conditional Entropy: $H(Y|X) = \sum_x P(x) H(Y|X=x)$.
  • Chain Rule: $H(X, Y) = H(X) + H(Y|X)$.

2. KL Divergence (Kullback-Leibler)

Measure of how a probability distribution $Q$ diverges from a reference probability distribution $P$ (relative entropy).

$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = E_P \left[ \log P(X) - \log Q(X) \right]$$

$$D_{KL}(P \| Q) = H(P, Q) - H(P)$$

Properties:

  • Non-negative: $D_{KL}(P \| Q) \geq 0$ with equality iff $P = Q$ (Gibbs' Inequality).
  • Asymmetric: in general, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.
  • Not a metric (it is asymmetric and does not satisfy the triangle inequality).

3. Cross-Entropy

Loss function often used in classification.

$$H(P, Q) = - \sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P \| Q)$$

For binary classification (true label $y \in \{0,1\}$, predicted probability $\hat{y}$): $L = -[y \log \hat{y} + (1-y) \log (1-\hat{y})]$

Minimizing Cross-Entropy w.r.t. $Q$ is equivalent to minimizing the KL Divergence from $P$ to $Q$, because $H(P)$ does not depend on $Q$.
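The identity $H(P, Q) = H(P) + D_{KL}(P \| Q)$ can be verified numerically for discrete distributions. This sketch uses natural logarithms and assumes $Q(x) > 0$ wherever $P(x) > 0$; the two distributions are arbitrary examples.

```python
import math

def entropy(p):
    """Shannon entropy in nats; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]
Q = [0.4, 0.4, 0.2]

# Cross-entropy decomposes into entropy plus KL divergence.
print(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))

# KL divergence is asymmetric and vanishes only when the distributions match.
print(kl_divergence(P, Q), kl_divergence(Q, P), kl_divergence(P, P))
```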

4. Mutual Information

Measures the reduction in uncertainty of $X$ given knowledge of $Y$.

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

$$I(X; Y) = D_{KL}(P(X,Y) \| P(X)P(Y))$$

Interpretation: Amount of information shared between $X$ and $Y$. $I(X;Y) = 0$ iff $X$ and $Y$ are independent.

5. Jensen-Shannon Divergence

Symmetric version of KL divergence.

$JSD(P \| Q) = \frac{1}{2}D_{KL}(P \| M) + \frac{1}{2}D_{KL}(Q \| M)$ where $M = \frac{1}{2}(P + Q)$.

Properties: Symmetric, bounded in $[0, 1]$ (with $\log_2$).
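A short sketch of the definition with base-2 logarithms, so that the upper bound of 1 is visible for distributions with disjoint support. The helper names are chosen for this example.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """JSD in bits: average KL divergence of P and Q to their midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))   # 1.0 (disjoint support)
print(jensen_shannon([0.5, 0.5], [0.5, 0.5]))   # 0.0 (identical)
print(jensen_shannon([0.9, 0.1], [0.2, 0.8]),
      jensen_shannon([0.2, 0.8], [0.9, 0.1]))   # equal: JSD is symmetric
```

Unlike plain KL divergence, the midpoint $M$ is positive wherever either input is, so the divergence stays finite even when the supports differ.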

Probabilistic Models

1. Naive Bayes Classifier

Probabilistic classifier based on Bayes' Theorem with "naive" independence assumptions between features.

$$P(y | x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n | y)}{P(x_1, \dots, x_n)}$$

Assumption: Features $x_i$ are conditionally independent given $y$.

$$P(x_i | y, x_1, \dots, x_{i-1}, \dots) = P(x_i | y)$$

Decision Rule: $\hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y)$

Log-Space (Numerical Stability): $\hat{y} = \arg\max_y \left( \log P(y) + \sum_{i=1}^n \log P(x_i | y) \right)$

Variants:

  • Gaussian Naive Bayes: $P(x_i | y) \sim \mathcal{N}(\mu_{y,i}, \sigma_{y,i}^2)$.
  • Multinomial Naive Bayes: For count data (text classification).
  • Bernoulli Naive Bayes: For binary features.

2. Exponential Family

Family of distributions that can be written in the form:

$$P(x|\theta) = h(x) \exp\left(\eta(\theta)^T T(x) - A(\theta)\right)$$

  • $\eta(\theta)$: Natural parameter.
  • $T(x)$: Sufficient statistic.
  • $A(\theta)$: Log-partition function (normalizer).
  • $h(x)$: Base measure.

Members: Gaussian, Exponential, Gamma, Beta, Bernoulli, Poisson, Multinomial, Dirichlet.

Properties:

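  • Sufficient statistics exist.
  • Conjugate priors exist.
  • $E[T(X)] = \nabla_{\eta} A$: the gradient of the log-partition function with respect to the natural parameter $\eta$.
  • $\text{Var}(T(X)) = \nabla_{\eta}^2 A$: the Hessian with respect to $\eta$.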

3. Gaussian Mixture Model (GMM)

Mixture of $K$ Gaussian distributions.

$$P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)$$

where $\sum \pi_k = 1$ (mixing coefficients).

EM for GMM:

  • E-Step: Compute responsibilities $\gamma_{ik} = P(z_i = k | x_i)$.
  • M-Step: Update $\pi_k, \mu_k, \Sigma_k$ using weighted MLE.
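The two steps above can be sketched for the one-dimensional case, where each covariance reduces to a scalar variance. This is a toy illustration under stated assumptions: the data, the initial parameters, and the fixed iteration count are all made up, and there is no safeguard against a component's variance collapsing to zero.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, n_iter=30):
    """EM for a 1D Gaussian mixture; returns parameters and the log-likelihood trace."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    K, n = len(mus), len(xs)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik = P(z_i = k | x_i) under current parameters.
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        log_liks.append(ll)
        # M-step: responsibility-weighted maximum likelihood updates.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
    return mus, variances, weights, log_liks

xs = [-2.1, -1.9, -2.3, -1.7, -2.0, 2.2, 1.8, 2.1, 1.9, 2.0]
mus, variances, weights, log_liks = em_gmm_1d(xs, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, weights)

# The recorded log-likelihood should never decrease between iterations.
print(all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:])))
```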

4. Hidden Markov Model (HMM)

Model with latent states $z_t$ and observations $x_t$.

Assumptions:

  • Markov: $P(z_t | z_{1:t-1}) = P(z_t | z_{t-1})$ (Transition).
  • Conditional Independence: $P(x_t | x_{1:t-1}, z_{1:t}) = P(x_t | z_t)$ (Emission).

Algorithms:

  • Forward Algorithm: Compute $P(x_{1:T})$ in $O(T K^2)$.
  • Viterbi Algorithm: Find the most likely state sequence $\arg\max_{z_{1:T}} P(z_{1:T} | x_{1:T})$.
  • Baum-Welch (EM for HMM): Learn parameters.
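As an illustration of the forward algorithm listed above, the sketch below computes $P(x_{1:T})$ for a small discrete HMM and checks it against a brute-force sum over all state paths. The two-state model and its probabilities are invented for the example, and no rescaling is applied, so this version would underflow on long sequences.

```python
from itertools import product

def forward_likelihood(pi, A, B, obs):
    """P(x_{1:T}) via the forward recursion, O(T K^2).

    pi[k]: initial state probabilities; A[i][j]: P(z_t = j | z_{t-1} = i);
    B[k][x]: P(x_t = x | z_t = k); obs: list of observation indices.
    """
    K = len(pi)
    alpha = [pi[k] * B[k][obs[0]] for k in range(K)]
    for x in obs[1:]:
        alpha = [B[j][x] * sum(alpha[i] * A[i][j] for i in range(K))
                 for j in range(K)]
    return sum(alpha)

def brute_force_likelihood(pi, A, B, obs):
    """Same quantity by summing over all K^T state paths (for checking only)."""
    K, total = len(pi), 0.0
    for path in product(range(K), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2]
print(forward_likelihood(pi, A, B, obs), brute_force_likelihood(pi, A, B, obs))
```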

Sampling Methods

1. Rejection Sampling

Sample from target distribution $p(x)$ using proposal distribution $q(x)$ where $p(x) \leq M q(x)$.

Algorithm:

  1. Sample $x \sim q(x)$.
  2. Sample $u \sim \text{Uniform}(0, Mq(x))$.
  3. If $u \leq p(x)$, accept $x$. Else, reject and repeat.

Acceptance Rate: $1/M$. Lower $M$ (tighter bound) is better.
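The three steps can be sketched for a concrete case. The choices here are assumptions made for illustration: the target is the Beta(2, 2) density $p(x) = 6x(1-x)$ on $[0, 1]$, the proposal is Uniform(0, 1), and $M = 1.5$ is the maximum of $p$, so the expected acceptance rate is $1/M = 2/3$.

```python
import random

def target_pdf(x):
    """Beta(2, 2) density on [0, 1]; its maximum is 1.5 at x = 0.5."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, M=1.5, seed=0):
    """Draw n samples from target_pdf using a Uniform(0, 1) proposal."""
    rng = random.Random(seed)
    samples, proposals = [], 0
    while len(samples) < n:
        x = rng.random()               # step 1: x ~ q, with q(x) = 1 on [0, 1]
        u = rng.uniform(0.0, M * 1.0)  # step 2: u ~ Uniform(0, M q(x))
        proposals += 1
        if u <= target_pdf(x):         # step 3: accept or reject
            samples.append(x)
    return samples, proposals

samples, proposals = rejection_sample(20000)
print(sum(samples) / len(samples))   # should be near the Beta(2, 2) mean of 0.5
print(len(samples) / proposals)      # should be near 1 / M = 0.667
```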

2. Importance Sampling

Estimate $E_p[f(X)]$ by sampling from proposal $q$.

$$E_p[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx \approx \frac{1}{n} \sum_{i=1}^n f(x_i) w_i$$

where $x_i \sim q$, $w_i = \frac{p(x_i)}{q(x_i)}$ (importance weights).
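A small numerical check of the estimator, under assumptions chosen for the example: the target $p$ is the standard normal, the proposal $q$ is a wider normal with standard deviation 2, and $f(x) = x^2$, so the true value is $E_p[X^2] = 1$.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using samples from q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                                # x_i ~ q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # w_i = p(x_i) / q(x_i)
        total += f(x) * w
    return total / n

print(importance_estimate(lambda x: x * x, 50000))  # should be close to 1
print(importance_estimate(lambda x: 1.0, 50000))    # weights average to about 1
```

A proposal with heavier tails than the target keeps the weights bounded; a proposal that is too narrow can give a high-variance estimate.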

3. Markov Chain Monte Carlo (MCMC)

Construct a Markov chain whose stationary distribution $\pi(x)$ equals the target distribution.

Metropolis-Hastings:

  1. Propose $x' \sim q(x' | x_t)$.
  2. Accept with probability $\alpha = \min\left(1, \frac{\pi(x') q(x_t | x')}{\pi(x_t) q(x' | x_t)}\right)$.
  3. If accepted, $x_{t+1} = x'$. Else, $x_{t+1} = x_t$.
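The steps above can be sketched with a random-walk proposal. The assumptions here are illustrative: the target is an unnormalized standard normal, and the Gaussian proposal is symmetric, so the $q$ terms cancel in the acceptance ratio. Densities are handled in log space to avoid underflow.

```python
import math
import random

def log_target(x):
    """Log of an unnormalized N(0, 1) density; the constant is not needed."""
    return -0.5 * x * x

def metropolis_hastings(n_steps, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)                 # step 1: propose x'
        # Step 2: symmetric proposal, so alpha = min(1, pi(x') / pi(x_t)).
        log_alpha = min(0.0, log_target(proposal) - log_target(x))
        if rng.random() < math.exp(log_alpha):              # step 3: accept or stay
            x = proposal
        chain.append(x)
    return chain

chain = metropolis_hastings(50000)[1000:]  # drop an initial burn-in segment
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(mean, var)  # should be near 0 and 1
```

Only ratios of $\pi$ appear, which is why the normalizing constant of the target never has to be computed.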

Gibbs Sampling: Sample each variable conditional on the others. Special case of Metropolis-Hastings with acceptance probability 1.

$$x_i^{(t+1)} \sim P(x_i | x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)})$$

4. Hamiltonian Monte Carlo (HMC)

Uses gradient information and Hamiltonian dynamics to propose distant moves with high acceptance rate.

Hamiltonian: $H(x, p) = U(x) + K(p)$ where $U(x) = -\log \pi(x)$ (potential energy), $K(p) = \frac{1}{2}p^T M^{-1} p$ (kinetic energy).

Leapfrog Integration: Simulate Hamiltonian dynamics to propose new state.

Advanced Topics

1. Sufficient Statistics

A statistic $T(X)$ is sufficient for $\theta$ if $P(X | T(X), \theta) = P(X | T(X))$ (the data provide no additional information about $\theta$ beyond $T(X)$).

Factorization Theorem: $T(X)$ is sufficient iff $p(x|\theta) = g(T(x), \theta) h(x)$.

2. Fisher Information

Measures the amount of information that an observable $X$ carries about a parameter $\theta$.

$$I(\theta) = E\left[\left(\frac{\partial \log p(X|\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log p(X|\theta)}{\partial \theta^2}\right]$$

Cramér-Rao Lower Bound: For an unbiased estimator $\hat{\theta}$: $\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$

3. Bias-Variance Tradeoff

For an estimator $\hat{\theta}$ of a parameter $\theta$:

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

where $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$.

4. Bootstrap

Resampling method to estimate sampling distribution of a statistic.

Algorithm:

  1. Sample $n$ observations with replacement from the data $\{x_1, \ldots, x_n\}$.
  2. Compute the statistic $\hat{\theta}^*$ on the bootstrap sample.
  3. Repeat $B$ times.
  4. Use the distribution of $\{\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}\}$ to estimate the sampling distribution.
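The four steps can be sketched as follows. The example bootstraps the sample mean, where the answer can be checked against the analytic standard error; the percentile interval shown is one common choice among several, and the data and function names are made up for illustration.

```python
import random
import statistics

def bootstrap(data, statistic, n_boot=2000, seed=0):
    """Return n_boot values of the statistic computed on resampled data."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1-3: resample n points with replacement and recompute, n_boot times.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]

def percentile_interval(estimates, level=0.95):
    """Simple percentile interval from the bootstrap distribution (step 4)."""
    s = sorted(estimates)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
estimates = bootstrap(data, statistics.fmean)

# Bootstrap standard error of the mean; the analytic value is 2 / sqrt(8) = 0.707.
print(statistics.stdev(estimates))
print(percentile_interval(estimates))
```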

5. Central Limit Theorem (CLT)

For i.i.d. $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Delta Method: For a smooth function $g$: $\sqrt{n}(g(\bar{X}_n) - g(\mu)) \xrightarrow{d} \mathcal{N}(0, (g'(\mu))^2 \sigma^2)$