Core statistical concepts: Z-score, Correlation, MLE, MAP, and Information Theory
Measure of how many standard deviations a data point is from the mean.
z = x − μ σ z = \frac{x - \mu}{\sigma} z = σ x − μ
Mean of z-scores is 0.
Std Dev of z-scores is 1.
Use : Compare across different scales, detect outliers (typically ∣ z ∣ > 3 |z| > 3 ∣ z ∣ > 3 ).
Measure of linear relationship between two variables X X X and Y Y Y .
ρ X , Y = Cov ( X , Y ) σ X σ Y = E [ ( X − μ X ) ( Y − μ Y ) ] σ X σ Y \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} ρ X , Y = σ X σ Y Cov ( X , Y ) = σ X σ Y E [( X − μ X ) ( Y − μ Y )]
Range: [ − 1 , 1 ] [-1, 1] [ − 1 , 1 ] . ρ = 0 \rho = 0 ρ = 0 means no linear relationship (but may have nonlinear).
Sample Correlation :
r = ∑ ( x i − x ˉ ) ( y i − y ˉ ) ∑ ( x i − x ˉ ) 2 ∑ ( y i − y ˉ ) 2 r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} r = ∑ ( x i − x ˉ ) 2 ∑ ( y i − y ˉ ) 2 ∑ ( x i − x ˉ ) ( y i − y ˉ )
Non-parametric measure based on ranked values. Captures monotonic (not just linear) relationships.
ρ s = 1 − 6 ∑ d i 2 n ( n 2 − 1 ) \rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} ρ s = 1 − n ( n 2 − 1 ) 6 ∑ d i 2
where d i d_i d i is the difference between ranks of x i x_i x i and y i y_i y i .
Measure of similarity between two non-zero vectors of an inner product space.
similarity = cos ( θ ) = A ⋅ B ∥ A ∥ ∥ B ∥ = ∑ i = 1 n A i B i ∑ i = 1 n A i 2 ∑ i = 1 n B i 2 \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}} similarity = cos ( θ ) = ∥ A ∥∥ B ∥ A ⋅ B = ∑ i = 1 n A i 2 ∑ i = 1 n B i 2 ∑ i = 1 n A i B i
Range: [ − 1 , 1 ] [-1, 1] [ − 1 , 1 ] . Used often in NLP and high-dimensional spaces.
Measure of joint variability of two variables.
Cov ( X , Y ) = E [ ( X − E [ X ] ) ( Y − E [ Y ] ) ] = E [ X Y ] − E [ X ] E [ Y ] \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] Cov ( X , Y ) = E [( X − E [ X ]) ( Y − E [ Y ])] = E [ X Y ] − E [ X ] E [ Y ]
Properties :
Cov ( X , X ) = Var ( X ) \text{Cov}(X, X) = \text{Var}(X) Cov ( X , X ) = Var ( X )
Cov ( a X , b Y ) = a b ⋅ Cov ( X , Y ) \text{Cov}(aX, bY) = ab \cdot \text{Cov}(X, Y) Cov ( a X , bY ) = ab ⋅ Cov ( X , Y )
Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y ) \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) Var ( X + Y ) = Var ( X ) + Var ( Y ) + 2 Cov ( X , Y )
Var ( X ) = E [ ( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2 \text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 Var ( X ) = E [( X − E [ X ] ) 2 ] = E [ X 2 ] − ( E [ X ] ) 2
σ = Var ( X ) \sigma = \sqrt{\text{Var}(X)} σ = Var ( X )
Sample Variance (unbiased):
s 2 = 1 n − 1 ∑ i = 1 n ( x i − x ˉ ) 2 s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 s 2 = n − 1 1 ∑ i = 1 n ( x i − x ˉ ) 2
Measure of asymmetry of the distribution.
γ 1 = E [ ( X − μ σ ) 3 ] = E [ ( X − μ ) 3 ] σ 3 \gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{E[(X - \mu)^3]}{\sigma^3} γ 1 = E [ ( σ X − μ ) 3 ] = σ 3 E [( X − μ ) 3 ]
γ 1 = 0 \gamma_1 = 0 γ 1 = 0 : Symmetric (e.g., Normal).
γ 1 > 0 \gamma_1 > 0 γ 1 > 0 : Right-skewed (long tail to right).
γ 1 < 0 \gamma_1 < 0 γ 1 < 0 : Left-skewed (long tail to left).
Measure of "tailedness" of the distribution.
γ 2 = E [ ( X − μ σ ) 4 ] − 3 \gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3 γ 2 = E [ ( σ X − μ ) 4 ] − 3
γ 2 = 0 \gamma_2 = 0 γ 2 = 0 : Mesokurtic (Normal).
γ 2 > 0 \gamma_2 > 0 γ 2 > 0 : Leptokurtic (heavy tails, sharp peak).
γ 2 < 0 \gamma_2 < 0 γ 2 < 0 : Platykurtic (light tails, flat peak).
Method to estimate parameters θ \theta θ of a probability distribution by maximizing the likelihood function L ( θ ) L(\theta) L ( θ ) .
L ( θ ) = P ( X ∣ θ ) = ∏ i = 1 n P ( x i ∣ θ ) L(\theta) = P(X | \theta) = \prod_{i=1}^n P(x_i | \theta) L ( θ ) = P ( X ∣ θ ) = ∏ i = 1 n P ( x i ∣ θ )
θ ^ MLE = arg max θ L ( θ ) = arg max θ ∑ i = 1 n log P ( x i ∣ θ ) \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \sum_{i=1}^n \log P(x_i | \theta) θ ^ MLE = arg max θ L ( θ ) = arg max θ ∑ i = 1 n log P ( x i ∣ θ )
Example (Gaussian) :
For x i ∼ N ( μ , σ 2 ) x_i \sim \mathcal{N}(\mu, \sigma^2) x i ∼ N ( μ , σ 2 ) :
μ ^ = 1 n ∑ x i , σ ^ 2 = 1 n ∑ ( x i − μ ^ ) 2 \hat{\mu} = \frac{1}{n} \sum x_i, \quad \hat{\sigma}^2 = \frac{1}{n} \sum (x_i - \hat{\mu})^2 μ ^ = n 1 ∑ x i , σ ^ 2 = n 1 ∑ ( x i − μ ^ ) 2
Properties :
Consistency : θ ^ n → p θ \hat{\theta}_n \xrightarrow{p} \theta θ ^ n p θ as n → ∞ n \to \infty n → ∞ .
Asymptotic Normality : n ( θ ^ n − θ ) → d N ( 0 , I ( θ ) − 1 ) \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}) n ( θ ^ n − θ ) d N ( 0 , I ( θ ) − 1 ) where I ( θ ) I(\theta) I ( θ ) is Fisher Information.
Efficiency : Achieves Cramér-Rao Lower Bound asymptotically.
Estimate θ \theta θ by maximizing the posterior distribution (incorporating a prior P ( θ ) P(\theta) P ( θ ) ).
θ ^ MAP = arg max θ P ( θ ∣ X ) = arg max θ P ( X ∣ θ ) P ( θ ) P ( X ) \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta | X) = \arg\max_{\theta} \frac{P(X | \theta) P(\theta)}{P(X)} θ ^ MAP = arg max θ P ( θ ∣ X ) = arg max θ P ( X ) P ( X ∣ θ ) P ( θ )
θ ^ MAP = arg max θ ( ∑ i = 1 n log P ( x i ∣ θ ) + log P ( θ ) ) \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left( \sum_{i=1}^n \log P(x_i | \theta) + \log P(\theta) \right) θ ^ MAP = arg max θ ( ∑ i = 1 n log P ( x i ∣ θ ) + log P ( θ ) )
If prior P ( θ ) P(\theta) P ( θ ) is uniform, MAP ≡ \equiv ≡ MLE.
L2 Regularization (Ridge) corresponds to Gaussian Prior.
L1 Regularization (Lasso) corresponds to Laplace Prior.
A prior distribution P ( θ ) P(\theta) P ( θ ) is conjugate to a likelihood function P ( X ∣ θ ) P(X|\theta) P ( X ∣ θ ) if the posterior distribution P ( θ ∣ X ) P(\theta|X) P ( θ ∣ X ) belongs to the same family as the prior.
Benefits :
Closed-form solution : Avoids expensive numerical integration (MCMC).
Interpretability : Posterior parameters update intuitively.
Sequential updating : Posterior from one step becomes prior for next.
Common Conjugate Pairs :
Likelihood Prior Posterior Bernoulli (p p p )Beta (α , β \alpha, \beta α , β )Beta (α + x , β + n − x \alpha + x, \beta + n - x α + x , β + n − x )Binomial (p p p )Beta (α , β \alpha, \beta α , β )Beta (α + x , β + n − x \alpha + x, \beta + n - x α + x , β + n − x )Poisson (λ \lambda λ )Gamma (k , θ k, \theta k , θ )Gamma (k + ∑ x i , θ 1 + n θ k + \sum x_i, \frac{\theta}{1 + n\theta} k + ∑ x i , 1 + n θ θ )Normal (μ \mu μ , known σ 2 \sigma^2 σ 2 )Normal (μ 0 , σ 0 2 \mu_0, \sigma_0^2 μ 0 , σ 0 2 )Normal (μ n e w , σ n e w 2 \mu_{new}, \sigma_{new}^2 μ n e w , σ n e w 2 )Multinomial (p p p )Dirichlet (α \alpha α )Dirichlet (α + x \alpha + x α + x )
Example: Beta-Bernoulli :
Prior p ∼ Beta ( α , β ) p \sim \text{Beta}(\alpha, \beta) p ∼ Beta ( α , β ) . Observe x x x successes in n n n trials.
Posterior:
P ( p ∣ X ) ∝ P ( X ∣ p ) P ( p ) ∝ p x ( 1 − p ) n − x ⋅ p α − 1 ( 1 − p ) β − 1 P(p|X) \propto P(X|p) P(p) \propto p^x (1-p)^{n-x} \cdot p^{\alpha-1} (1-p)^{\beta-1} P ( p ∣ X ) ∝ P ( X ∣ p ) P ( p ) ∝ p x ( 1 − p ) n − x ⋅ p α − 1 ( 1 − p ) β − 1
P ( p ∣ X ) ∝ p x + α − 1 ( 1 − p ) n − x + β − 1 ⟹ Beta ( α + x , β + n − x ) P(p|X) \propto p^{x+\alpha-1} (1-p)^{n-x+\beta-1} \implies \text{Beta}(\alpha+x, \beta+n-x) P ( p ∣ X ) ∝ p x + α − 1 ( 1 − p ) n − x + β − 1 ⟹ Beta ( α + x , β + n − x )
Interpretation : α \alpha α and β \beta β are "pseudo-counts" representing prior successes and failures.
Estimate parameters by equating sample moments to population moments.
θ ^ : 1 n ∑ x i k = E [ X k ] for k = 1 , 2 , … \hat{\theta}: \quad \frac{1}{n}\sum x_i^k = E[X^k] \text{ for } k=1,2,\ldots θ ^ : n 1 ∑ x i k = E [ X k ] for k = 1 , 2 , …
Example (Normal) : μ ^ = x ˉ \hat{\mu} = \bar{x} μ ^ = x ˉ , σ ^ 2 = 1 n ∑ ( x i − x ˉ ) 2 \hat{\sigma}^2 = \frac{1}{n}\sum (x_i - \bar{x})^2 σ ^ 2 = n 1 ∑ ( x i − x ˉ ) 2 .
Iterative method for MLE when data has latent variables.
E-Step : Compute Q ( θ ∣ θ ( t ) ) = E Z ∣ X , θ ( t ) [ log P ( X , Z ∣ θ ) ] Q(\theta | \theta^{(t)}) = E_{Z|X,\theta^{(t)}}[\log P(X, Z | \theta)] Q ( θ ∣ θ ( t ) ) = E Z ∣ X , θ ( t ) [ log P ( X , Z ∣ θ )] .
M-Step : θ ( t + 1 ) = arg max θ Q ( θ ∣ θ ( t ) ) \theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)}) θ ( t + 1 ) = arg max θ Q ( θ ∣ θ ( t ) ) .
Guarantee : log P ( X ∣ θ ( t + 1 ) ) ≥ log P ( X ∣ θ ( t ) ) \log P(X | \theta^{(t+1)}) \geq \log P(X | \theta^{(t)}) log P ( X ∣ θ ( t + 1 ) ) ≥ log P ( X ∣ θ ( t ) ) .
Applications : Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), Factor Analysis.
Framework for statistical hypothesis testing.
Null Hypothesis H 0 H_0 H 0 : Default assumption (e.g., no effect).
Alternative Hypothesis H 1 H_1 H 1 : What we want to detect.
p-value : Probability of observing data as extreme as observed, assuming H 0 H_0 H 0 is true.
Significance Level α \alpha α : Threshold (typically 0.05). Reject H 0 H_0 H 0 if p < α p < \alpha p < α .
Type I Error : False Positive (Reject H 0 H_0 H 0 when true). P ( Type I ) = α P(\text{Type I}) = \alpha P ( Type I ) = α .
Type II Error : False Negative (Fail to reject H 0 H_0 H 0 when false). P ( Type II ) = β P(\text{Type II}) = \beta P ( Type II ) = β .
Power : 1 − β 1 - \beta 1 − β (Probability of correctly rejecting H 0 H_0 H 0 ).
Test if mean of population differs from a value (one-sample), or if two populations have different means (two-sample).
One-Sample t-statistic :
t = x ˉ − μ 0 s / n t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} t = s / n x ˉ − μ 0
where s s s is sample standard deviation. Follows t n − 1 t_{n-1} t n − 1 distribution under H 0 : μ = μ 0 H_0: \mu = \mu_0 H 0 : μ = μ 0 .
Two-Sample t-test (equal variance) :
t = x ˉ 1 − x ˉ 2 s p 1 n 1 + 1 n 2 t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} t = s p n 1 1 + n 2 1 x ˉ 1 − x ˉ 2
where s p 2 = ( n 1 − 1 ) s 1 2 + ( n 2 − 1 ) s 2 2 n 1 + n 2 − 2 s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2} s p 2 = n 1 + n 2 − 2 ( n 1 − 1 ) s 1 2 + ( n 2 − 1 ) s 2 2 is pooled variance.
Test for independence in contingency tables or goodness-of-fit.
Test Statistic :
χ 2 = ∑ i = 1 k ( O i − E i ) 2 E i \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} χ 2 = ∑ i = 1 k E i ( O i − E i ) 2
where O i O_i O i is observed frequency, E i E_i E i is expected frequency under H 0 H_0 H 0 .
Follows χ k − 1 2 \chi^2_{k-1} χ k − 1 2 distribution.
Test equality of variances between two populations.
F = s 1 2 s 2 2 F = \frac{s_1^2}{s_2^2} F = s 2 2 s 1 2
Follows F n 1 − 1 , n 2 − 1 F_{n_1-1, n_2-1} F n 1 − 1 , n 2 − 1 distribution under H 0 : σ 1 2 = σ 2 2 H_0: \sigma_1^2 = \sigma_2^2 H 0 : σ 1 2 = σ 2 2 .
Use in ANOVA : Test if group means are equal.
Non-parametric test to determine if a sample comes from a reference probability distribution (One-Sample) or if two samples come from the same distribution (Two-Sample).
Statistic : The maximum absolute difference between the empirical CDFs.
D n = sup x ∣ F n ( x ) − F ( x ) ∣ D_n = \sup_x |F_n(x) - F(x)| D n = sup x ∣ F n ( x ) − F ( x ) ∣
where F n ( x ) F_n(x) F n ( x ) is the empirical CDF and F ( x ) F(x) F ( x ) is the reference CDF.
Non-parametric alternative to two-sample t-test. Tests if two samples have the same distribution.
Statistic : Count pairs ( x i , y j ) (x_i, y_j) ( x i , y j ) where x i > y j x_i > y_j x i > y j .
Bonferroni : For m m m tests, use significance level α / m \alpha/m α / m .
Benjamini-Hochberg (FDR) : Controls False Discovery Rate.
Order p-values: p ( 1 ) ≤ … ≤ p ( m ) p_{(1)} \leq \ldots \leq p_{(m)} p ( 1 ) ≤ … ≤ p ( m ) .
Reject H ( i ) H_{(i)} H ( i ) for all i ≤ k i \leq k i ≤ k where k = max { i : p ( i ) ≤ i m α } k = \max\{i: p_{(i)} \leq \frac{i}{m}\alpha\} k = max { i : p ( i ) ≤ m i α } .
Measure of uncertainty or average information content.
H ( X ) = − ∑ x P ( x ) log P ( x ) = E [ − log P ( X ) ] H(X) = - \sum_{x} P(x) \log P(x) = E[-\log P(X)] H ( X ) = − ∑ x P ( x ) log P ( x ) = E [ − log P ( X )]
For binary classification (p p p vs 1 − p 1-p 1 − p ):
H ( p ) = − p log p − ( 1 − p ) log ( 1 − p ) H(p) = -p \log p - (1-p) \log (1-p) H ( p ) = − p log p − ( 1 − p ) log ( 1 − p )
Properties :
H ( X ) ≥ 0 H(X) \geq 0 H ( X ) ≥ 0 with equality iff X X X is deterministic.
H ( X ) H(X) H ( X ) is maximized when X X X is uniform.
Joint Entropy : H ( X , Y ) = − ∑ x , y P ( x , y ) log P ( x , y ) H(X, Y) = -\sum_{x,y} P(x,y) \log P(x,y) H ( X , Y ) = − ∑ x , y P ( x , y ) log P ( x , y ) .
Conditional Entropy : H ( Y ∣ X ) = ∑ x P ( x ) H ( Y ∣ X = x ) H(Y|X) = \sum_x P(x) H(Y|X=x) H ( Y ∣ X ) = ∑ x P ( x ) H ( Y ∣ X = x ) .
Chain Rule : H ( X , Y ) = H ( X ) + H ( Y ∣ X ) H(X, Y) = H(X) + H(Y|X) H ( X , Y ) = H ( X ) + H ( Y ∣ X ) .
Measure of how one probability distribution Q Q Q diverges from a second, expected probability distribution P P P . (Relative Entropy).
D K L ( P ∥ Q ) = ∑ x P ( x ) log P ( x ) Q ( x ) = E P [ log P ( X ) − log Q ( X ) ] D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = E_P \left[ \log P(X) - \log Q(X) \right] D K L ( P ∥ Q ) = ∑ x P ( x ) log Q ( x ) P ( x ) = E P [ log P ( X ) − log Q ( X ) ]
D K L ( P ∥ Q ) = H ( P , Q ) − H ( P ) D_{KL}(P \| Q) = H(P, Q) - H(P) D K L ( P ∥ Q ) = H ( P , Q ) − H ( P )
Properties:
Non-negative : D K L ( P ∥ Q ) ≥ 0 D_{KL}(P \| Q) \geq 0 D K L ( P ∥ Q ) ≥ 0 with equality iff P = Q P = Q P = Q (Gibbs' Inequality).
Asymmetric : D K L ( P ∥ Q ) ≠ D K L ( Q ∥ P ) D_{KL}(P \| Q) \neq D_{KL}(Q \| P) D K L ( P ∥ Q ) = D K L ( Q ∥ P ) .
Not a metric (doesn't satisfy triangle inequality).
Loss function often used in classification.
H ( P , Q ) = − ∑ x P ( x ) log Q ( x ) = H ( P ) + D K L ( P ∥ Q ) H(P, Q) = - \sum_{x} P(x) \log Q(x) = H(P) + D_{KL}(P \| Q) H ( P , Q ) = − ∑ x P ( x ) log Q ( x ) = H ( P ) + D K L ( P ∥ Q )
For binary classification (True label y ∈ { 0 , 1 } y \in \{0,1\} y ∈ { 0 , 1 } , predicted prob y ^ \hat{y} y ^ ):
L = − [ y log y ^ + ( 1 − y ) log ( 1 − y ^ ) ] L = -[y \log \hat{y} + (1-y) \log (1-\hat{y})] L = − [ y log y ^ + ( 1 − y ) log ( 1 − y ^ )]
Minimizing Cross-Entropy w.r.t. Q Q Q is equivalent to minimizing KL Divergence from P P P to Q Q Q .
Measures reduction in uncertainty of X X X given knowledge of Y Y Y .
I ( X ; Y ) = H ( X ) − H ( X ∣ Y ) = H ( Y ) − H ( Y ∣ X ) I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) I ( X ; Y ) = H ( X ) − H ( X ∣ Y ) = H ( Y ) − H ( Y ∣ X )
I ( X ; Y ) = D K L ( P ( X , Y ) ∥ P ( X ) P ( Y ) ) I(X; Y) = D_{KL}(P(X,Y) \| P(X)P(Y)) I ( X ; Y ) = D K L ( P ( X , Y ) ∥ P ( X ) P ( Y ))
Interpretation : Amount of information shared between X X X and Y Y Y . I ( X ; Y ) = 0 I(X;Y) = 0 I ( X ; Y ) = 0 iff X X X and Y Y Y are independent.
Symmetric version of KL divergence.
J S D ( P ∥ Q ) = 1 2 D K L ( P ∥ M ) + 1 2 D K L ( Q ∥ M ) JSD(P \| Q) = \frac{1}{2}D_{KL}(P \| M) + \frac{1}{2}D_{KL}(Q \| M) J S D ( P ∥ Q ) = 2 1 D K L ( P ∥ M ) + 2 1 D K L ( Q ∥ M )
where M = 1 2 ( P + Q ) M = \frac{1}{2}(P + Q) M = 2 1 ( P + Q ) .
Properties: Symmetric, bounded [ 0 , 1 ] [0, 1] [ 0 , 1 ] (with log 2 \log_2 log 2 ).
Probabilistic classifier based on Bayes' Theorem with "naive" independence assumptions between features.
P ( y ∣ x 1 , … , x n ) = P ( y ) P ( x 1 , … , x n ∣ y ) P ( x 1 , … , x n ) P(y | x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n | y)}{P(x_1, \dots, x_n)} P ( y ∣ x 1 , … , x n ) = P ( x 1 , … , x n ) P ( y ) P ( x 1 , … , x n ∣ y )
Assumption : Features x i x_i x i are conditionally independent given y y y .
P ( x i ∣ y , x 1 , … , x i − 1 , … ) = P ( x i ∣ y ) P(x_i | y, x_1, \dots, x_{i-1}, \dots) = P(x_i | y) P ( x i ∣ y , x 1 , … , x i − 1 , … ) = P ( x i ∣ y )
Decision Rule :
y ^ = arg max y P ( y ) ∏ i = 1 n P ( x i ∣ y ) \hat{y} = \arg\max_y P(y) \prod_{i=1}^n P(x_i | y) y ^ = arg max y P ( y ) ∏ i = 1 n P ( x i ∣ y )
Log-Space (Numerical Stability) :
y ^ = arg max y ( log P ( y ) + ∑ i = 1 n log P ( x i ∣ y ) ) \hat{y} = \arg\max_y \left( \log P(y) + \sum_{i=1}^n \log P(x_i | y) \right) y ^ = arg max y ( log P ( y ) + ∑ i = 1 n log P ( x i ∣ y ) )
Variants :
Gaussian Naive Bayes : P ( x i ∣ y ) ∼ N ( μ y , i , σ y , i 2 ) P(x_i | y) \sim \mathcal{N}(\mu_{y,i}, \sigma_{y,i}^2) P ( x i ∣ y ) ∼ N ( μ y , i , σ y , i 2 ) .
Multinomial Naive Bayes : For count data (text classification).
Bernoulli Naive Bayes : For binary features.
Family of distributions that can be written in the form:
P ( x ∣ θ ) = h ( x ) exp ( η ( θ ) T T ( x ) − A ( θ ) ) P(x|\theta) = h(x) \exp\left(\eta(\theta)^T T(x) - A(\theta)\right) P ( x ∣ θ ) = h ( x ) exp ( η ( θ ) T T ( x ) − A ( θ ) )
η ( θ ) \eta(\theta) η ( θ ) : Natural parameter.
T ( x ) T(x) T ( x ) : Sufficient statistic.
A ( θ ) A(\theta) A ( θ ) : Log-partition function (normalizer).
h ( x ) h(x) h ( x ) : Base measure.
Members : Gaussian, Exponential, Gamma, Beta, Bernoulli, Poisson, Multinomial, Dirichlet.
Properties :
Sufficient statistics exist.
Conjugate priors exist.
E [ T ( X ) ] = ∇ A ( θ ) E[T(X)] = \nabla A(\theta) E [ T ( X )] = ∇ A ( θ ) .
Var ( T ( X ) ) = ∇ 2 A ( θ ) \text{Var}(T(X)) = \nabla^2 A(\theta) Var ( T ( X )) = ∇ 2 A ( θ ) .
Mixture of K K K Gaussian distributions.
P ( x ) = ∑ k = 1 K π k N ( x ∣ μ k , Σ k ) P(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x | \mu_k, \Sigma_k) P ( x ) = ∑ k = 1 K π k N ( x ∣ μ k , Σ k )
where ∑ π k = 1 \sum \pi_k = 1 ∑ π k = 1 (mixing coefficients).
EM for GMM :
E-Step : Compute responsibilities γ i k = P ( z i = k ∣ x i ) \gamma_{ik} = P(z_i = k | x_i) γ ik = P ( z i = k ∣ x i ) .
M-Step : Update π k , μ k , Σ k \pi_k, \mu_k, \Sigma_k π k , μ k , Σ k using weighted MLE.
Model with latent states z t z_t z t and observations x t x_t x t .
Assumptions :
Markov : P ( z t ∣ z 1 : t − 1 ) = P ( z t ∣ z t − 1 ) P(z_t | z_{1:t-1}) = P(z_t | z_{t-1}) P ( z t ∣ z 1 : t − 1 ) = P ( z t ∣ z t − 1 ) (Transition).
Conditional Independence : P ( x t ∣ x 1 : t − 1 , z 1 : t ) = P ( x t ∣ z t ) P(x_t | x_{1:t-1}, z_{1:t}) = P(x_t | z_t) P ( x t ∣ x 1 : t − 1 , z 1 : t ) = P ( x t ∣ z t ) (Emission).
Algorithms :
Forward Algorithm : Compute P ( x 1 : T ) P(x_{1:T}) P ( x 1 : T ) in O ( T K 2 ) O(T K^2) O ( T K 2 ) .
Viterbi Algorithm : Find most likely state sequence arg max z 1 : T P ( z 1 : T ∣ x 1 : T ) \arg\max_{z_{1:T}} P(z_{1:T} | x_{1:T}) arg max z 1 : T P ( z 1 : T ∣ x 1 : T ) .
Baum-Welch (EM for HMM) : Learn parameters.
Sample from target distribution p ( x ) p(x) p ( x ) using proposal distribution q ( x ) q(x) q ( x ) where p ( x ) ≤ M q ( x ) p(x) \leq M q(x) p ( x ) ≤ Mq ( x ) .
Algorithm :
Sample x ∼ q ( x ) x \sim q(x) x ∼ q ( x ) .
Sample u ∼ Uniform ( 0 , M q ( x ) ) u \sim \text{Uniform}(0, Mq(x)) u ∼ Uniform ( 0 , Mq ( x )) .
If u ≤ p ( x ) u \leq p(x) u ≤ p ( x ) , accept x x x . Else, reject and repeat.
Acceptance Rate : 1 / M 1/M 1/ M . Lower M M M (tighter bound) is better.
Estimate E p [ f ( X ) ] E_p[f(X)] E p [ f ( X )] by sampling from proposal q q q .
E p [ f ( X ) ] = ∫ f ( x ) p ( x ) d x = ∫ f ( x ) p ( x ) q ( x ) q ( x ) d x ≈ 1 n ∑ i = 1 n f ( x i ) w i E_p[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx \approx \frac{1}{n} \sum_{i=1}^n f(x_i) w_i E p [ f ( X )] = ∫ f ( x ) p ( x ) d x = ∫ f ( x ) q ( x ) p ( x ) q ( x ) d x ≈ n 1 ∑ i = 1 n f ( x i ) w i
where x i ∼ q x_i \sim q x i ∼ q , w i = p ( x i ) q ( x i ) w_i = \frac{p(x_i)}{q(x_i)} w i = q ( x i ) p ( x i ) (importance weights).
Construct a Markov chain with stationary distribution π ( x ) \pi(x) π ( x ) equal to target distribution.
Metropolis-Hastings :
Propose x ′ ∼ q ( x ′ ∣ x t ) x' \sim q(x' | x_t) x ′ ∼ q ( x ′ ∣ x t ) .
Accept with probability α = min ( 1 , π ( x ′ ) q ( x t ∣ x ′ ) π ( x t ) q ( x ′ ∣ x t ) ) \alpha = \min\left(1, \frac{\pi(x') q(x_t | x')}{\pi(x_t) q(x' | x_t)}\right) α = min ( 1 , π ( x t ) q ( x ′ ∣ x t ) π ( x ′ ) q ( x t ∣ x ′ ) ) .
If accept, x t + 1 = x ′ x_{t+1} = x' x t + 1 = x ′ . Else, x t + 1 = x t x_{t+1} = x_t x t + 1 = x t .
Gibbs Sampling :
Sample each variable conditional on others. Special case of Metropolis-Hastings with acceptance probability 1.
x i ( t + 1 ) ∼ P ( x i ∣ x 1 ( t + 1 ) , … , x i − 1 ( t + 1 ) , x i + 1 ( t ) , … , x n ( t ) ) x_i^{(t+1)} \sim P(x_i | x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}) x i ( t + 1 ) ∼ P ( x i ∣ x 1 ( t + 1 ) , … , x i − 1 ( t + 1 ) , x i + 1 ( t ) , … , x n ( t ) )
Uses gradient information and Hamiltonian dynamics to propose distant moves with high acceptance rate.
Hamiltonian : H ( x , p ) = U ( x ) + K ( p ) H(x, p) = U(x) + K(p) H ( x , p ) = U ( x ) + K ( p ) where U ( x ) = − log π ( x ) U(x) = -\log \pi(x) U ( x ) = − log π ( x ) (potential energy), K ( p ) = 1 2 p T M − 1 p K(p) = \frac{1}{2}p^T M^{-1} p K ( p ) = 2 1 p T M − 1 p (kinetic energy).
Leapfrog Integration : Simulate Hamiltonian dynamics to propose new state.
A statistic T ( X ) T(X) T ( X ) is sufficient for θ \theta θ if P ( X ∣ T ( X ) , θ ) = P ( X ∣ T ( X ) ) P(X | T(X), \theta) = P(X | T(X)) P ( X ∣ T ( X ) , θ ) = P ( X ∣ T ( X )) (data provides no additional information about θ \theta θ beyond T ( X ) T(X) T ( X ) ).
Factorization Theorem : T ( X ) T(X) T ( X ) is sufficient iff p ( x ∣ θ ) = g ( T ( x ) , θ ) h ( x ) p(x|\theta) = g(T(x), \theta) h(x) p ( x ∣ θ ) = g ( T ( x ) , θ ) h ( x ) .
Measures amount of information that observable X X X carries about parameter θ \theta θ .
I ( θ ) = E [ ( ∂ log p ( X ∣ θ ) ∂ θ ) 2 ] = − E [ ∂ 2 log p ( X ∣ θ ) ∂ θ 2 ] I(\theta) = E\left[\left(\frac{\partial \log p(X|\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log p(X|\theta)}{\partial \theta^2}\right] I ( θ ) = E [ ( ∂ θ ∂ l o g p ( X ∣ θ ) ) 2 ] = − E [ ∂ θ 2 ∂ 2 l o g p ( X ∣ θ ) ]
Cramér-Rao Lower Bound : For unbiased estimator θ ^ \hat{\theta} θ ^ :
Var ( θ ^ ) ≥ 1 I ( θ ) \text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)} Var ( θ ^ ) ≥ I ( θ ) 1
For estimator θ ^ \hat{\theta} θ ^ of parameter θ \theta θ :
MSE ( θ ^ ) = E [ ( θ ^ − θ ) 2 ] = Bias ( θ ^ ) 2 + Var ( θ ^ ) \text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}) MSE ( θ ^ ) = E [( θ ^ − θ ) 2 ] = Bias ( θ ^ ) 2 + Var ( θ ^ )
where Bias ( θ ^ ) = E [ θ ^ ] − θ \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta Bias ( θ ^ ) = E [ θ ^ ] − θ .
Resampling method to estimate sampling distribution of a statistic.
Algorithm :
Sample n n n observations with replacement from data { x 1 , … , x n } \{x_1, \ldots, x_n\} { x 1 , … , x n } .
Compute statistic θ ^ ∗ \hat{\theta}^* θ ^ ∗ on bootstrap sample.
Repeat B B B times.
Use distribution of { θ ^ ∗ 1 , … , θ ^ ∗ B } \{\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}\} { θ ^ ∗ 1 , … , θ ^ ∗ B } to estimate sampling distribution.
For i.i.d. X 1 , … , X n X_1, \ldots, X_n X 1 , … , X n with mean μ \mu μ and variance σ 2 \sigma^2 σ 2 :
X ˉ n − μ σ / n → d N ( 0 , 1 ) \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) σ / n X ˉ n − μ d N ( 0 , 1 )
Delta Method : For smooth function g g g :
n ( g ( X ˉ n ) − g ( μ ) ) → d N ( 0 , ( g ′ ( μ ) ) 2 σ 2 ) \sqrt{n}(g(\bar{X}_n) - g(\mu)) \xrightarrow{d} \mathcal{N}(0, (g'(\mu))^2 \sigma^2) n ( g ( X ˉ n ) − g ( μ )) d N ( 0 , ( g ′ ( μ ) ) 2 σ 2 )