Probability Theory
Basic Concepts
Sample Space (\(S\)): The set of all possible outcomes of a random experiment. \[ S = \{s_1, s_2, \ldots, s_n\} \]
Event (\(E\)): A subset of the sample space \(S\). \[ E \subseteq S \]
Probability of an Event (\(P(E)\)): For a sample space of equally likely outcomes, \[ P(E) = \frac{|E|}{|S|} \] where \(|E|\) is the number of outcomes in \(E\) and \(|S|\) is the number of outcomes in \(S\).
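A minimal Python sketch (standard library only) of the counting definition \( P(E) = |E|/|S| \); the fair-die example is chosen purely for illustration:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (equally likely outcomes).
S = {1, 2, 3, 4, 5, 6}

# Event E: the roll is even.
E = {s for s in S if s % 2 == 0}

# Classical probability: P(E) = |E| / |S|.
P_E = Fraction(len(E), len(S))
print(P_E)  # 1/2
```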
Axioms of Probability
- Non-negativity: \[ P(E) \ge 0 \]
- Normalization: \[ P(S) = 1 \]
- Additivity: For any two mutually exclusive events \(A\) and \(B\), \[ P(A \cup B) = P(A) + P(B) \]
Conditional Probability
Conditional Probability (\(P(A|B)\)): The probability of event \(A\) given that event \(B\) has occurred. \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \] provided \(P(B) > 0\).
Bayes' Theorem
Bayes' Theorem: \[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \]
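A small numerical illustration of Bayes' theorem together with the conditional-probability definition above; the prevalence and test-accuracy numbers are hypothetical, chosen only to show the computation:

```python
# Hypothetical numbers for illustration: disease prevalence and test accuracy.
P_D = 0.01               # P(A): prior probability of disease
P_pos_given_D = 0.95     # P(B|A): sensitivity
P_pos_given_not_D = 0.05 # P(B|not A): false-positive rate

# Law of total probability gives P(B), the overall probability of a positive test.
P_pos = P_pos_given_D * P_D + P_pos_given_not_D * (1 - P_D)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(round(P_D_given_pos, 4))  # ~0.1610: a positive test still leaves a low posterior
```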
Independent Events
Independence: Two events \(A\) and \(B\) are independent if \[ P(A \cap B) = P(A) P(B) \]
Random Variables
Random Variable (\(X\)): A function that assigns a numerical value to each outcome in the sample space. \[ X: S \to \mathbb{R} \]
Discrete Random Variable: A random variable that can take on a countable number of values.
Continuous Random Variable: A random variable that can take on an uncountable number of values.
Probability Distributions
Probability Mass Function (PMF) for Discrete Random Variables: \[ p_X(x) = P(X = x) \]
Probability Density Function (PDF) for Continuous Random Variables: \[ f_X(x) \] such that \[ P(a \le X \le b) = \int_a^b f_X(x) \, dx \]
Cumulative Distribution Function (CDF): \[ F_X(x) = P(X \le x) \]
Expectation and Variance
Expectation (Mean) of a Discrete Random Variable: \[ \mathbb{E}[X] = \sum_x x p_X(x) \]
Expectation (Mean) of a Continuous Random Variable: \[ \mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx \]
Variance: \[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \] \[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
Standard Deviation: \[ \sigma_X = \sqrt{\text{Var}(X)} \]
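A short Python sketch (standard library only) that evaluates \( \mathbb{E}[X] \), \( \text{Var}(X) \), and \( \sigma_X \) directly from a discrete PMF; the fair die is an assumed example:

```python
import math

# PMF of a discrete random variable X (here: one roll of a fair die).
pmf = {x: 1 / 6 for x in range(1, 7)}

# E[X] = sum_x x * p_X(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in pmf.items())
var = second_moment - mean**2

std = math.sqrt(var)
print(mean, var, std)  # 3.5, ~2.9167, ~1.7078
```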
Common Discrete Distributions
Bernoulli Distribution: \[ P(X = 1) = p, \quad P(X = 0) = 1 - p \] \[ \mathbb{E}[X] = p \] \[ \text{Var}(X) = p(1 - p) \]
Binomial Distribution: \[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \] \[ \mathbb{E}[X] = np \] \[ \text{Var}(X) = np(1 - p) \]
Geometric Distribution: \[ P(X = k) = (1 - p)^{k - 1} p \] \[ \mathbb{E}[X] = \frac{1}{p} \] \[ \text{Var}(X) = \frac{1 - p}{p^2} \]
Poisson Distribution: \[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \] \[ \mathbb{E}[X] = \lambda \] \[ \text{Var}(X) = \lambda \]
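A hedged sanity check of the binomial formulas above: the sketch evaluates the PMF directly and compares simulated moments with \( np \) and \( np(1-p) \); the parameters \( n = 10 \), \( p = 0.3 \) and the sample size are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)
n, p = 10, 0.3

# Binomial PMF straight from the formula P(X = k) = C(n, k) p^k (1-p)^(n-k).
def binom_pmf(k: int) -> float:
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Simulate X as a sum of n Bernoulli(p) trials and compare moments with np and np(1-p).
samples = [sum(random.random() < p for _ in range(n)) for _ in range(100_000)]
print(sum(binom_pmf(k) for k in range(n + 1)))        # ~1.0 (the PMF sums to 1)
print(statistics.mean(samples), n * p)                # ~3.0 vs 3.0
print(statistics.pvariance(samples), n * p * (1 - p)) # ~2.1 vs 2.1
```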
Common Continuous Distributions
Uniform Distribution: \[ f_X(x) = \begin{cases} \frac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \] \[ \mathbb{E}[X] = \frac{a + b}{2} \] \[ \text{Var}(X) = \frac{(b - a)^2}{12} \]
Exponential Distribution: \[ f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases} \] \[ \mathbb{E}[X] = \frac{1}{\lambda} \] \[ \text{Var}(X) = \frac{1}{\lambda^2} \]
Normal (Gaussian) Distribution: \[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \] \[ \mathbb{E}[X] = \mu \] \[ \text{Var}(X) = \sigma^2 \]
Joint Distributions
Joint Probability Mass Function (Discrete): \[ p_{X,Y}(x, y) = P(X = x, Y = y) \]
Joint Probability Density Function (Continuous): \[ f_{X,Y}(x, y) \]
Marginal Distribution: \[ p_X(x) = \sum_y p_{X,Y}(x, y) \] \[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy \]
Conditional Distributions
Conditional PMF: \[ p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} \]
Conditional PDF: \[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \]
Covariance and Correlation
Covariance: \[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \] \[ \text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] \]
Correlation Coefficient: \[ \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
Properties:
- \(\text{Cov}(X, Y) = 0\) implies \(X\) and \(Y\) are uncorrelated.
- \(\rho_{X,Y}\) ranges from \(-1\) to \(1\).
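A minimal sketch of the sample analogues of covariance and the correlation coefficient defined above; the paired data are made up purely for illustration:

```python
import statistics

# Empirical covariance and correlation for paired samples (hypothetical data).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.5, 3.5, 3.0, 5.0]

mx, my = statistics.mean(X), statistics.mean(Y)
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / len(X)   # population covariance
rho = cov / (statistics.pstdev(X) * statistics.pstdev(Y))       # always in [-1, 1]
print(cov, rho)
```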
Moment Generating Functions
Moment Generating Function (MGF): \[ M_X(t) = \mathbb{E}[e^{tX}] \]
Properties:
- The \(n\)-th moment of \(X\) is \(\mathbb{E}[X^n] = M_X^{(n)}(0)\), the \(n\)-th derivative of the MGF evaluated at \(t = 0\).
- If \(M_X(t) = M_Y(t)\) for all \(t\) in a neighborhood of 0, then \(X\) and \(Y\) have the same distribution.
Law of Large Numbers
Weak Law of Large Numbers: For i.i.d. random variables \(X_1, X_2, \ldots, X_n\) with mean \(\mathbb{E}[X]\), \[ \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mathbb{E}[X] \quad \text{in probability as} \quad n \to \infty \]
Strong Law of Large Numbers: For i.i.d. random variables \(X_1, X_2, \ldots, X_n\) with mean \(\mathbb{E}[X]\), \[ \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mathbb{E}[X] \quad \text{almost surely as} \quad n \to \infty \]
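A simulation sketch of the law of large numbers, assuming i.i.d. Uniform(0, 1) draws (so \( \mathbb{E}[X] = 0.5 \)); the sample sizes are arbitrary:

```python
import random

random.seed(1)

# LLN illustration: the sample mean of i.i.d. Uniform(0, 1) draws should
# approach E[X] = 0.5 as n grows.
for n in (10, 1_000, 100_000):
    xs = [random.random() for _ in range(n)]
    print(n, sum(xs) / n)
```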
Central Limit Theorem
Central Limit Theorem: If \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\), then \[ \frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}} \to N(0, 1) \quad \text{in distribution as} \quad n \to \infty \]
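A Monte Carlo sketch of the CLT: standardized sample means of Uniform(0, 1) draws should land inside \( \pm 1.96 \) about 95% of the time; the choices \( n = 50 \) and 20,000 trials are arbitrary:

```python
import math
import random

random.seed(2)
n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and standard deviation of Uniform(0, 1)

# For each trial, standardize the sample mean; by the CLT the result is
# approximately N(0, 1), so about 95% should fall inside +/- 1.96.
inside = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z = (xbar - mu) / (sigma / math.sqrt(n))
    inside += abs(z) <= 1.96
print(inside / trials)  # ~0.95
```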
Markov and Chebyshev Inequalities
Markov's Inequality: For a non-negative random variable \(X\) and \(a > 0\), \[ P(X \ge a) \le \frac{\mathbb{E}[X]}{a} \]
Chebyshev's Inequality: For any random variable \(X\) with mean \(\mu\), variance \(\sigma^2\), and any \(k > 0\), \[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2} \]
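An empirical check of Chebyshev's inequality on simulated Exponential(1) data (an arbitrary choice); the observed tail frequency should stay below \( 1/k^2 \):

```python
import random
import statistics

random.seed(3)

# Empirical check of Chebyshev's inequality: the frequency of |X - mu| >= k*sigma
# should not exceed 1/k^2.
xs = [random.expovariate(1.0) for _ in range(100_000)]
mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
for k in (1.5, 2, 3):
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
    print(k, round(freq, 4), "<=", round(1 / k**2, 4))
```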
Information Theory
Entropy (Discrete): \[ H(X) = -\sum_{x} p(x) \log p(x) \]
Joint Entropy: \[ H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y) \]
Conditional Entropy: \[ H(X|Y) = -\sum_{x,y} p(x, y) \log p(x|y) \]
Mutual Information: \[ I(X;Y) = H(X) + H(Y) - H(X, Y) \] \[ I(X;Y) = \sum_{x,y} p(x, y) \log \left( \frac{p(x, y)}{p(x) p(y)} \right) \]
Kullback-Leibler Divergence: \[ D_{KL}(P || Q) = \sum_{x} p(x) \log \left( \frac{p(x)}{q(x)} \right) \]
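A small sketch of entropy and KL divergence for discrete distributions; it uses natural logarithms (nats), whereas base-2 logarithms would give bits, and the distributions \( p \) and \( q \) are made up:

```python
import math

# Entropy H(p) = -sum p(x) log p(x), in nats.
def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# KL divergence D_KL(p || q); assumes q(x) > 0 wherever p(x) > 0.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(p))           # ~1.0397 nats
print(kl_divergence(p, q))  # >= 0
print(kl_divergence(p, p))  # 0.0, since the divergence vanishes only when p == q
```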
Common Transformations
Linear Transformation: If \(Y = aX + b\), then \[ \mathbb{E}[Y] = a\mathbb{E}[X] + b \] \[ \text{Var}(Y) = a^2 \text{Var}(X) \]
Sum of Independent Random Variables: If \(X\) and \(Y\) are independent, then \[ \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \] \[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \]
Convolution of Two Independent Random Variables: The PDF of the sum \(Z = X + Y\) is given by \[ f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x) \, dx \]
Important Results
Law of Total Probability: \[ P(A) = \sum_{i} P(A|B_i) P(B_i) \] where \(\{B_i\}\) is a partition of the sample space.
Bayes' Theorem (General Form): \[ P(B_i|A) = \frac{P(A|B_i) P(B_i)}{\sum_{j} P(A|B_j) P(B_j)} \]
Poisson Process: A counting process with rate \(\lambda\) has (see the simulation sketch after this list):
- Interarrival times \(T_i\) that are i.i.d. exponential with parameter \(\lambda\).
- A number of events in any interval of length \(t\) that is Poisson distributed with parameter \(\lambda t\).
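A simulation sketch of the two bullet points above: arrivals are generated from i.i.d. exponential interarrival times, and the count in \([0, t]\) is compared with the Poisson mean and variance \(\lambda t\); the rate, horizon, and number of replications are arbitrary:

```python
import random
import statistics

random.seed(4)
lam, t = 2.0, 5.0

# Simulate a Poisson process by accumulating i.i.d. Exponential(lam)
# interarrival times and counting arrivals in [0, t]; the count should be
# approximately Poisson with mean (and variance) lam * t.
def count_events():
    clock, n = 0.0, 0
    while True:
        clock += random.expovariate(lam)
        if clock > t:
            return n
        n += 1

counts = [count_events() for _ in range(20_000)]
print(statistics.mean(counts), lam * t)       # ~10 vs 10
print(statistics.pvariance(counts), lam * t)  # ~10 vs 10
```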
Moment Generating Function (MGF): If \(X\) has an MGF \(M_X(t)\), then the \(n\)-th moment is \[ \mathbb{E}[X^n] = M_X^{(n)}(0) \]
Characteristic Function: The characteristic function of \(X\) is \[ \phi_X(t) = \mathbb{E}[e^{itX}] \] and it uniquely determines the distribution of \(X\).
Stochastic Processes
Definition: A stochastic process is a collection of random variables \(\{X(t) : t \in T\}\) indexed by a set \(T\).
Markov Process: A stochastic process where the future is independent of the past given the present.
Martingale: A stochastic process \(\{X_t\}\) is a martingale if \[ \mathbb{E}[X_{t+1} | \mathcal{F}_t] = X_t \] where \(\mathcal{F}_t\) denotes the information available up to time \(t\).
Brownian Motion: A continuous-time stochastic process \(\{B(t) : t \ge 0\}\) with the following properties (see the simulation sketch after this list):
- \(B(0) = 0\).
- Independent increments.
- \(B(t) - B(s) \sim N(0, t - s)\) for \(0 \le s < t\).
- Continuous paths.
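A discretized simulation sketch of the four properties above, approximating the increments by \( B(t + \Delta t) - B(t) \sim N(0, \Delta t) \); the step count and horizon are arbitrary:

```python
import math
import random

random.seed(5)

# Discretized Brownian motion on [0, 1]: B(0) = 0 and independent Gaussian
# increments B(t + dt) - B(t) ~ N(0, dt).
n_steps, T = 1_000, 1.0
dt = T / n_steps

path = [0.0]
for _ in range(n_steps):
    path.append(path[-1] + random.gauss(0.0, math.sqrt(dt)))

print(path[-1])  # one realization of B(1), which is distributed N(0, 1)
```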
Multivariate Central Limit Theorem
Multivariate Central Limit Theorem: If \(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n\) are i.i.d. random vectors with mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\), then \[ \sqrt{n}(\overline{\mathbf{X}}_n - \boldsymbol{\mu}) \to N(\mathbf{0}, \Sigma) \quad \text{as} \quad n \to \infty \]
Convergence in Probability and Distribution
Convergence in Probability: A sequence of random variables \(X_n\) converges in probability to \(X\) if for all \(\epsilon > 0\), \[ \lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0 \]
Convergence in Distribution: A sequence of random variables \(X_n\) converges in distribution to \(X\) if for all continuity points \(x\) of \(F_X(x)\), \[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x) \]
Moment Generating Functions (MGF) and Characteristic Functions
MGF of Sum of Independent Random Variables: If \(X\) and \(Y\) are independent, \[ M_{X+Y}(t) = M_X(t) M_Y(t) \]
Characteristic Function of Sum of Independent Random Variables: If \(X\) and \(Y\) are independent, \[ \phi_{X+Y}(t) = \phi_X(t) \phi_Y(t) \]
Inequalities and Bounds
Jensen's Inequality: For a convex function \(\phi\) and random variable \(X\), \[ \phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)] \]
Cauchy-Schwarz Inequality: For any random variables \(X\) and \(Y\), \[ (\mathbb{E}[XY])^2 \le \mathbb{E}[X^2] \mathbb{E}[Y^2] \]
Bayesian Inference
Posterior Distribution: \[ P(\theta|X) = \frac{P(X|\theta) P(\theta)}{P(X)} \]
Maximum A Posteriori (MAP) Estimate: \[ \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta|X) \]
Bayesian Updating: Given a prior \(P(\theta)\) and likelihood \(P(X|\theta)\), the posterior is updated as \[ P(\theta|X) \propto P(X|\theta) P(\theta) \]
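A conjugate-updating sketch of \( P(\theta|X) \propto P(X|\theta) P(\theta) \), assuming a Beta prior on a Bernoulli success probability (a standard textbook pairing, not one specified above); the prior parameters and data are hypothetical:

```python
# Conjugate Bayesian updating: with a Beta(a, b) prior on a Bernoulli success
# probability theta, observing k successes in n trials gives the posterior
# Beta(a + k, b + n - k), i.e. P(theta|X) is proportional to P(X|theta) P(theta).
a, b = 1.0, 1.0          # uniform prior (hypothetical choice)
k, n = 7, 10             # hypothetical data: 7 successes in 10 trials

a_post, b_post = a + k, b + n - k
posterior_mean = a_post / (a_post + b_post)
# MAP estimate: mode of the Beta posterior (valid when both parameters exceed 1).
map_estimate = (a_post - 1) / (a_post + b_post - 2)
print(posterior_mean, map_estimate)  # ~0.667 and 0.7
```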
Hypothesis Testing
Null Hypothesis (\(H_0\)): The hypothesis that there is no effect or no difference.
Alternative Hypothesis (\(H_1\)): The hypothesis that there is an effect or a difference.
Test Statistic: A function of the sample data used to decide whether to reject \(H_0\).
p-Value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming \(H_0\) is true.
Type I Error: Rejecting \(H_0\) when it is true (false positive).
Type II Error: Failing to reject \(H_0\) when it is false (false negative).
Significance Level (\(\alpha\)): The probability of making a Type I error.
Power: The probability of correctly rejecting \(H_0\) (1 - probability of Type II error).
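A minimal one-sample z-test sketch tying the definitions above together (test statistic, p-value, significance level); it assumes a known \( \sigma \), and the simulated data and threshold \( \alpha = 0.05 \) are illustrative choices:

```python
import math
import random
import statistics

random.seed(6)

# One-sample z-test sketch (known sigma assumed): H0: mu = mu0 vs H1: mu != mu0.
sigma, mu0 = 1.0, 0.0
xs = [random.gauss(0.2, sigma) for _ in range(100)]   # hypothetical data

# Test statistic: standardized sample mean under H0.
z = (statistics.mean(xs) - mu0) / (sigma / math.sqrt(len(xs)))

# Two-sided p-value under the standard normal null distribution:
# P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))
print(z, p_value, "reject H0" if p_value < 0.05 else "fail to reject H0")
```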
Confidence Intervals
Confidence Interval for Mean (\(\mu\)): For a normal distribution with known variance \(\sigma^2\), \[ \left( \overline{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \overline{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) \] where \(z_{\alpha/2}\) is the critical value from the standard normal distribution.
Confidence Interval for Proportion (\(p\)): \[ \left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right) \] where \(\hat{p}\) is the sample proportion.
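A sketch of the known-variance confidence interval for the mean, using \( z_{\alpha/2} \approx 1.96 \) for a 95% interval; the simulated sample is hypothetical:

```python
import math
import random
import statistics

random.seed(7)

# 95% confidence interval for the mean with known sigma, using z_{alpha/2} ~ 1.96.
sigma, z = 2.0, 1.959964
xs = [random.gauss(10.0, sigma) for _ in range(50)]   # hypothetical sample

xbar = statistics.mean(xs)
half_width = z * sigma / math.sqrt(len(xs))
print((xbar - half_width, xbar + half_width))  # covers the true mean ~95% of the time
```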
Regression Analysis
Simple Linear Regression: \[ Y = \beta_0 + \beta_1 X + \epsilon \]
Least Squares Estimates: \[ \hat{\beta}_1 = \frac{\sum (X_i - \overline{X})(Y_i - \overline{Y})}{\sum (X_i - \overline{X})^2} \] \[ \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \]
Coefficient of Determination (\(R^2\)): \[ R^2 = \frac{\sum (\hat{Y}_i - \overline{Y})^2}{\sum (Y_i - \overline{Y})^2} \]
Multiple Linear Regression: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]
Ordinary Least Squares Estimates (Matrix Form): \[ \hat{\beta} = (X^T X)^{-1} X^T Y \]
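A sketch of the least-squares estimates \( \hat{\beta}_0 \), \( \hat{\beta}_1 \) and the \( R^2 \) formula given above for simple linear regression, on a made-up data set:

```python
import statistics

# Least-squares estimates for simple linear regression, computed directly from
# the formulas above, on a small hypothetical data set.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

xbar, ybar = statistics.mean(X), statistics.mean(Y)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar)**2 for x in X)
b0 = ybar - b1 * xbar

# R^2 = explained sum of squares / total sum of squares.
Y_hat = [b0 + b1 * x for x in X]
r2 = sum((yh - ybar)**2 for yh in Y_hat) / sum((y - ybar)**2 for y in Y)
print(b0, b1, r2)
```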
Additional Probability Distributions
Gamma Distribution: \[ f_X(x) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{\Gamma(k)}, \quad x \ge 0 \] \[ \mathbb{E}[X] = \frac{k}{\lambda} \] \[ \text{Var}(X) = \frac{k}{\lambda^2} \]
Beta Distribution: \[ f_X(x) = \frac{x^{\alpha-1} (1 - x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \le x \le 1 \] \[ \mathbb{E}[X] = \frac{\alpha}{\alpha + \beta} \] \[ \text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \]
Chi-Square Distribution: \[ f_X(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}, \quad x \ge 0 \] \[ \mathbb{E}[X] = k \] \[ \text{Var}(X) = 2k \]
Student's t-Distribution: \[ f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu \pi} \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2} \] \[ \mathbb{E}[X] = 0 \quad \text{for} \quad \nu > 1 \] \[ \text{Var}(X) = \frac{\nu}{\nu-2} \quad \text{for} \quad \nu > 2 \]
Multivariate Normal Distribution: \[ f_X(\mathbf{x}) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \] where \(\boldsymbol{\mu}\) is the mean vector and \(\Sigma\) is the covariance matrix.
Multinomial Distribution: \[ P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \frac{n!}{x_1! x_2! \cdots x_k!} p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k} \] where \(X_i\) is the count of the \(i\)-th outcome in \(n\) trials, and \(p_i\) is the probability of the \(i\)-th outcome.
Negative Binomial Distribution: \[ P(X = k) = \binom{k+r-1}{r-1} p^r (1 - p)^k \] \[ \mathbb{E}[X] = \frac{r(1 - p)}{p} \] \[ \text{Var}(X) = \frac{r(1 - p)}{p^2} \]
Hypergeometric Distribution: \[ P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \] where \(X\) is the number of successes in \(n\) draws without replacement from a population of size \(N\) containing \(K\) successes.