// link established ▸ offline · zero-cloud · one file _

QUANT PREP TERMINAL

Your offline cockpit for quant interviews: cheatsheets + simulation lab + flashcards for learning + experimenting + drilling. Customize your subject tiles, pin your weak points, and add your own flashcards and cheatsheets.

Pinned

clear all

Your collection. Hit PIN on any cheatsheet card or flashcard below to pin a copy here; unpin to remove. Pins persist in this browser.

Browse

Jump straight to any cheatsheet topic (read) or flashcard domain (drill). Counts update as the library grows.

Must-Know Cheatsheets

Didactic, example-filled references spanning every quant subject. Pick a topic to switch.

⌕

All Algorithms Brainteasers C++ Calculus Combinatorics Distributions Geometry Inference Instruments Linear Algebra Mental Math Microstructure Options & Vol Probability Python Regression Stochastic Processes Time Series

Conditional probability & the multiplication rule

P(A|B)=P(A∩B)/P(B), defined for P(B)>0. Read it as: renormalize the sample space to B.
Chain rule: P(A∩B∩C)=P(A)·P(B|A)·P(C|A∩B) — the workhorse for sequential draws.
Conditioning is just slicing then renormalizing; the denominator is the new universe.
Independence ⇔ P(A|B)=P(A) ⇔ P(A∩B)=P(A)P(B). Pairwise independence does NOT imply mutual independence.
Watch the trap: P(A|B)≠P(B|A) in general (prosecutor's fallacy).

Example. Two cards drawn without replacement. P(both aces)=(4/52)·(3/51)=1/221 via the chain rule.

Law of total probability & Bayes

Partition {Bᵢ} of the sample space: P(A)=Σᵢ P(A|Bᵢ)P(Bᵢ). Average the conditionals, weighted by prior.
Bayes: P(Bⱼ|A)=P(A|Bⱼ)P(Bⱼ)/Σᵢ P(A|Bᵢ)P(Bᵢ).
Posterior ∝ likelihood × prior; the denominator is just the normalizer P(A).
Odds form is fastest for two hypotheses: posterior odds = prior odds × likelihood ratio.
Rare-event trap: even a good test produces mostly false positives when the base rate is tiny — low prior dominates.

Example. Two coins: fair and double-headed, pick one at random, flip H. P(double-headed|H)=(1·½)/(½·½+1·½)=2/3 by odds: prior 1:1 × LR (1/½)=2:1.

Expectation: linearity & the tail / LOTUS tools

E[aX+bY]=aE[X]+bE[Y] ALWAYS — no independence needed. This breaks most hard problems.
LOTUS: E[g(X)]=Σ g(x)p(x) or ∫ g(x)f(x)dx — don't derive the dist of g(X).
Tail sum for nonneg integers: E[X]=Σₖ₌₁ P(X≥k); continuous: E[X]=∫₀^∞ P(X>x)dx.
Decompose into indicators and sum expectations — independence is irrelevant for the mean.
Law of total expectation: E[X]=E[E[X|Y]].

Example. Expected couples sitting together: n couples at a round table of 2n seats. Indicator per couple, P(together)=2/(2n−1), so E=2n/(2n−1).

Indicator-variable method

Write the count as X=Σ Iₐ where Iₐ∈{0,1}; then E[X]=Σ P(Aₐ) by linearity.
Key identity: E[Iₐ]=P(Aₐ) and Iₐ²=Iₐ.
For variance of a sum of indicators: Var(X)=Σ Var(Iₐ)+ΣΣ_{a≠b} Cov(Iₐ,I_b); Cov(Iₐ,I_b)=P(Aₐ∩A_b)−P(Aₐ)P(A_b).
Never enumerate — assign one indicator per object and sum. Handles dependence automatically for the mean.

Example. Records in a random permutation: position i is a record w.p. 1/i. Expected records =Σᵢ₌₁ⁿ 1/i=Hₙ≈ln n+γ.

Variance, covariance & correlation

Var(X)=E[X²]−(E[X])²; Cov(X,Y)=E[XY]−E[X]E[Y].
Var(aX+bY)=a²Var(X)+b²Var(Y)+2ab·Cov(X,Y). Cross term vanishes only if uncorrelated.
ρ=Cov(X,Y)/(σ_Xσ_Y)∈[−1,1]; measures LINEAR dependence only.
Independent ⇒ uncorrelated, but uncorrelated does NOT ⇒ independent.
Variance of a sum of n iid: nσ²; of the mean: σ²/n.

Example. X uniform on {−1,0,1}, Y=X². Cov=E[XY]−E[X]E[Y]=E[X³]−0=0, so uncorrelated, yet Y is a function of X — fully dependent.

Conditional expectation & the tower / variance decomposition

Tower rule: E[X]=E[E[X|Y]] — condition on whatever simplifies the problem.
Eve's law: Var(X)=E[Var(X|Y)]+Var(E[X|Y]) = within-group + between-group variance.
E[X|Y] is a random variable (a function of Y); plug in a value to get a number.
First-step analysis: condition on the first move, write a recursion, solve.

Example. E[# flips to get first H], p=½. Condition on flip 1: E=1+½·0+½·E ⇒ E=2. (Mean of Geometric(p) is 1/p.)

Markov, Chebyshev & concentration

Markov (X≥0): P(X≥a)≤E[X]/a. Only the mean; loose but assumption-free.
Chebyshev: P(|X−μ|≥kσ)≤1/k² — needs finite variance, distribution-free.
Markov bounds the upper tail from the mean; Chebyshev bounds two-sided spread from the variance.
Chebyshev proves the weak law of large numbers: sample mean variance σ²/n→0.
Tighter tails (Chernoff/Hoeffding) need the MGF or boundedness — mention if asked, but those live beyond core.

Example. Mean income $50k. P(income≥$1M)≤50k/1M=5% by Markov — no other info needed.

Jensen's inequality

Convex φ: E[φ(X)]≥φ(E[X]). Concave: inequality flips.
Equality iff X is constant a.s. or φ is linear on X's support.
Consequences: E[X²]≥(E[X])² (so Var≥0), E[1/X]≥1/E[X] for X>0, E[ln X]≤ln E[X].
A function of the average ≤ the average of a convex function — convexity adds value to dispersion.
Finance hook: option convexity (gamma) means E[payoff(S)]>payoff(E[S]); volatility is valuable.

Example. X=1 or 4 each w.p. ½. E[√X]=(1+2)/2=1.5 < √E[X]=√2.5≈1.58 — concave √ gives the reverse Jensen gap.

Symmetry, conditioning & classic interview tactics

Symmetry: if outcomes are exchangeable, P(A before B)=½; the k-th of n shuffled items is uniform over positions.
States & recursion: define E from each state, condition on the first step, solve the linear system.
Poisson-process / memoryless framing: 'who arrives first' among independent exponentials is proportional to rates.
Before computing, ask: can symmetry, linearity, or a one-step recursion collapse this?
Reflection/complement: count the easy complement when the event itself is messy.

Example. Shuffle a deck; P(ace of spades above king of hearts)? By symmetry between two distinguished cards, =½ — no counting.

Covariance, Correlation, Conditioning

Cov(X,Y)=E[XY]−E[X]E[Y]; ρ = Cov/(σ_X σ_Y) ∈ [−1,1].

Var(aX+bY) = a²Var(X)+b²Var(Y)+2ab·Cov(X,Y).
Law of total variance: Var(X)=E[Var(X|Y)]+Var(E[X|Y]).
Tower: E[X]=E[E[X|Y]].
Bivariate normal: E[Y|X=x]=μ_Y+ρ(σ_Y/σ_X)(x−μ_X).

Example. Two assets σ=20% each, ρ=0.5, equal weights. Portfolio var = 0.25·0.04+0.25·0.04+2·0.25·(0.5·0.04) = 0.02+0.01 = 0.03, so σ_p ≈ 17.3% — diversification benefit vs 20%.

Estimators: bias, variance, MSE, consistency

An estimator θ̂ is a function of the sample; it has its own sampling distribution.
MSE = bias² + variance — MSE(θ̂)=E[(θ̂−θ)²]=(E[θ̂]−θ)²+Var(θ̂). The whole tradeoff lives here.
Unbiased: E[θ̂]=θ for all n. Consistent: θ̂→θ in probability as n→∞. Different properties — neither implies the other.
Efficient: smallest variance among a class; Cramér–Rao gives the floor Var(θ̂)≥1/I(θ) for unbiased estimators (I=Fisher info).
In quant work a slightly biased, lower-variance estimator (shrinkage) often beats unbiased — minimize MSE, not bias.

Example. θ̂≡5 (ignore data) is super low-variance but biased and inconsistent. Sample mean X̄ is unbiased and consistent: Var(X̄)=σ²/n→0.

Why n−1: Bessel's correction & sample variance

Sample variance s²=Σ(xᵢ−x̄)²/(n−1) is the unbiased estimator of σ²; dividing by n underestimates.
x̄ is fit from the data, so deviations from x̄ are systematically smaller than deviations from the true μ.
Mechanically: only n−1 deviations are free — they must sum to 0 (one degree of freedom is spent estimating the mean).
Caveat: s (the std, not s²) is still slightly biased by Jensen — E[s]<σ — because √ is concave. Unbiasedness doesn't survive nonlinear transforms.
Bias ~1/n, so it's negligible for large samples and matters mainly for small n.

Example. Data {2,4}: x̄=3, Σ(xᵢ−x̄)²=2. Biased: 2/2=1. Unbiased: 2/1=2 — the true spread is better captured by 2.

Maximum Likelihood Estimation (MLE)

Pick the parameter that makes observed data most probable: θ̂=argmax L(θ)=argmax Πf(xᵢ;θ). Maximize the log-likelihood — sums beat products.
Solve ∂ logL/∂θ = 0 (the score equation); check second order.
Invariance: MLE of g(θ) is g(θ̂) — e.g. MLE of σ is √(MLE of σ²).
Asymptotics: under regularity, θ̂ is consistent and √n(θ̂−θ)→N(0, 1/I(θ)) — attains the Cramér–Rao bound asymptotically (efficient).
MLE can be biased in finite samples: Normal MLE of variance divides by n, not n−1.

Example. n Bernoulli with k successes: logL=k·ln p+(n−k)ln(1−p), set derivative 0 → p̂=k/n. The sample proportion is the MLE.

Method of moments & MAP

Method of moments (MoM): set sample moments = theoretical moments, solve for parameters. Simple, no optimization, but can be inefficient or give out-of-range estimates.
MAP: argmax p(θ|data) ∝ argmax L(θ)·p(θ) — MLE weighted by a prior; maximizes the log-posterior logL+log prior.
MAP = MLE + a regularizing prior; a flat (uniform) prior makes MAP = MLE.
Gaussian prior on θ ⇒ MAP is ridge-style shrinkage toward the prior mean; Laplace prior ⇒ sparsity. Priors fight overfitting on thin data.
MAP is a point estimate, not the full posterior — it ignores posterior spread/skew (the mode ≠ mean in general).

Example. Normal, known σ: x̄=2 from n=4, prior N(0,σ²). MAP pulls the estimate from 2 toward 0; as n grows the data dominates and MAP→MLE.

CLT, sampling distribution & standard error

For iid Xᵢ with finite mean μ and variance σ²: √n(X̄−μ)/σ → N(0,1), regardless of the underlying shape.
SE shrinks like 1/√n: SE(X̄)=σ/√n. Quadruple the data to halve the error.
SE = std of an estimator's sampling distribution; population std describes spread of one draw. Don't confuse them.
What breaks CLT: infinite/undefined variance (heavy tails, e.g. Cauchy), strong dependence/autocorrelation, non-identical distributions. Markets have all three — convergence is slow and SE understated.
Convergence rate depends on skew/kurtosis (Berry–Esseen ~1/√n); fat tails need much larger n before Normal is safe.

Example. Daily returns σ=1%, n=252: SE of mean ≈ 1%/√252 ≈ 0.063% — tiny, but only if returns are roughly iid with finite variance.

Hypothesis testing: p-values, errors, power

Set up H₀ (null) vs H₁; pick a statistic; reject if it's extreme under H₀.
p-value = P(data this extreme or more | H₀ true) — NOT P(H₀ true).
Type I (α, false positive): reject a true H₀. Type II (β): fail to reject a false H₀. Power = 1−β: detect a real effect.
Power rises with effect size, n, and α; it falls with noise. Lowering α to cut false positives raises β — a direct tradeoff.
p < α is significance, not importance: huge n makes trivial effects 'significant'. Report effect size + CI, not just the p.

Example. Backtest Sharpe 'significant' at p=0.04: this means only ~4% of pure-noise strategies would look this good — over hundreds of trials you'd expect several. Significance ≠ profit.

Confidence vs credible intervals

Confidence (frequentist): the procedure covers the true fixed θ 95% of the time over repeated samples. θ is fixed; the interval is random.
A 95% CI does NOT mean 95% probability θ is in this interval — θ is not random in the frequentist view.
Credible (Bayesian): given a prior + this data, P(θ∈[a,b]|data)=0.95. Here θ is treated as random — the intuitive statement, but it needs a prior.
They coincide with flat priors in simple models, but diverge when the prior is informative or the likelihood is skewed.
CI from CLT: x̄ ± z·SE (z≈1.96 for 95%). Width ~1/√n.

Example. Estimate mean return, CI = [−1%, +3%]. Frequentist: 'this method brackets the truth 95% of the time.' Bayesian: 'given my prior, 95% chance the true mean is in here.' Different claims.

Multiple testing: FWER vs FDR

Test m hypotheses at α each and the chance of ≥1 false positive explodes: P(≥1)=1−(1−α)^m.
FWER (family-wise error): P(any false positive). Bonferroni controls it — use threshold α/m. Simple, but conservative → low power.
FDR controls the expected fraction of discoveries that are false — far more power when you test thousands of signals.
Benjamini–Hochberg: sort p-values p₍₁₎≤…≤p₍ₘ₎, find largest k with p₍ₖ₎ ≤ (k/m)·q, reject all up to k. Controls FDR at q.
Quant relevance: data-mining 1000s of factors guarantees spurious 'alpha'. Correction (or hold-out / lower threshold) is mandatory.

Example. m=100, α=0.05: P(≥1 false positive)=1−0.95¹⁰⁰≈99.4%. Bonferroni cutoff drops to 0.0005 per test.

Bootstrap & resampling

Estimate an estimator's sampling distribution by resampling the data with replacement B times, recomputing the statistic each time.
The empirical distribution stands in for the unknown population — get SE/CI/bias with no closed-form formula.
Uses: SE and percentile CIs for messy statistics (median, Sharpe, max drawdown), bias estimation, when CLT/normality is doubtful.
Fails / needs care: dependent data (use block bootstrap to preserve autocorrelation), extreme order statistics (max/min), tiny samples, and heavy tails where moments don't exist.
Percentile CI: take the 2.5% and 97.5% quantiles of the B bootstrap statistics.

Example. 250 daily returns, want SE of the Sharpe: resample 250 days with replacement 10,000×, compute Sharpe each time, take the std of those 10,000 values — that's the bootstrap SE.

Stationarity (weak vs strict)

Weak (covariance) stationarity: constant mean E[Xₜ]=μ, constant variance, and autocovariance γ(s,t) depends only on lag |t−s|, not on t.
Strict stationarity: the full joint distribution is shift-invariant. Strict + finite 2nd moment ⇒ weak; weak does NOT imply strict (except Gaussian, where they coincide).
Most time-series models (AR/MA/ARMA) assume weak stationarity — it's what makes the ACF and forecasts well-defined.
Prices are typically non-stationary (random walk); log-returns are usually treated as (approximately) stationary.
Stationarity = statistical properties don't drift with time; it's the precondition for estimating a stable autocovariance structure.

Example. Xₜ=t+εₜ has trending mean ⇒ not stationary. Detrend: Xₜ−t=εₜ is stationary. White noise εₜ is stationary; a random walk Sₜ=Σεᵢ is not (variance grows like tσ²).

ACF & PACF — reading the plots

ACF ρ(k)=γ(k)/γ(0): correlation between Xₜ and Xₜ₋ₖ including intermediate lags.
PACF: correlation at lag k after regressing out lags 1..k−1 — the 'direct' effect.
MA(q): ACF cuts off after lag q; PACF tails off (decays).
AR(p): PACF cuts off after lag p; ACF tails off.
ARMA(p,q): both tail off — order not readable directly; use AIC/BIC.
Significance band ≈ ±1.96/√n under the white-noise null.
Cut-off in ACF ⇒ MA order; cut-off in PACF ⇒ AR order. This duality is the classic identification trick.

AR, MA, ARMA models

AR(p): Xₜ=c+Σφᵢ Xₜ₋ᵢ+εₜ — regress on own past. Stationary iff roots of 1−Σφᵢzⁱ=0 lie outside the unit circle.
MA(q): Xₜ=μ+εₜ+Σθⱼ εₜ₋ⱼ — finite memory, always stationary; invertible iff MA roots outside unit circle.
ARMA(p,q): combines both; parsimonious vs a long pure AR or MA.
An invertible MA(∞) ⇔ AR(∞): an AR(1) Xₜ=φXₜ₋₁+εₜ with |φ|<1 equals Σφʲεₜ₋ⱼ.
AR = momentum on own past; MA = shocks with finite echo. Stationarity ↔ AR roots; invertibility ↔ MA roots.

Example. AR(1) Xₜ=φXₜ₋₁+εₜ: Var=σ²/(1−φ²), ρ(k)=φ^k. So φ=0.9 ⇒ slow geometric ACF decay; φ=0.1 ⇒ fast decay, near-white.

Random walk vs mean reversion

Random walk: Xₜ=Xₜ₋₁+εₜ (φ=1) — shocks permanent, variance ∝t, best forecast is last value, not stationary, not predictable.
Mean reverting (AR(1), |φ|<1): shocks decay, pulled back to mean c/(1−φ); half-life of a shock = ln(0.5)/ln(φ).
The whole question 'is this tradable?' often reduces to: is φ meaningfully below 1? That's a unit-root test.
Ornstein–Uhlenbeck is the continuous-time analog: dXₜ=θ(μ−Xₜ)dt+σdWₜ; θ>0 ⇒ reversion.
RW = unpredictable, trade nothing; mean reversion = a fade signal with a measurable half-life. The line between them is φ=1.

Example. φ=0.97 daily ⇒ half-life =ln0.5/ln0.97≈22.8 days — a slow reverter; size and horizon must match that half-life or you bleed carry.

Unit roots & Dickey–Fuller

Test H₀: φ=1 (unit root, non-stationary) vs H₁: φ<1 (stationary). Run ΔXₜ=γXₜ₋₁+εₜ, test γ=0 where γ=φ−1.
ADF adds lagged ΔX terms to absorb serial correlation; variants include drift/trend.
Distribution under the null is NOT standard t — use Dickey–Fuller critical values (more negative than normal). Reject (very negative stat) ⇒ stationary.
KPSS flips the null (stationary is H₀); pairing ADF + KPSS guards against low-power conclusions.
Unit-root tests have low power: failing to reject H₀ is weak evidence, not proof of a unit root. Don't over-trust 'looks non-stationary'.

ARIMA & differencing

ARIMA(p,d,q): difference d times to reach stationarity, then fit ARMA(p,q) to ΔᵈX.
d=1 handles a stochastic trend (unit root); a deterministic linear trend is better removed by regression, not differencing.
Choose d from unit-root tests + ACF (slowly decaying ACF ⇒ difference); over-differencing injects spurious negative MA structure.
Order selection: minimize AIC/BIC, check residuals are white (Ljung–Box).
d is for unit roots, not for everything — over-differencing manufactures a unit MA root and inflates variance.

Example. Log price is I(1) (unit root); Δlog price = log return is I(0). So modeling returns ≈ ARIMA with d=1 on log prices.

GARCH & volatility clustering

Stylized fact: returns are ~uncorrelated but |returns| / returns² are strongly autocorrelated — big moves cluster. Mean is unpredictable; variance is.
ARCH(1): σ²ₜ=ω+α ε²ₜ₋₁. GARCH(1,1): σ²ₜ=ω+α ε²ₜ₋₁+β σ²ₜ₋₁ — today's variance from yesterday's shock and yesterday's variance.
Persistence =α+β; <1 for stationarity. Unconditional variance =ω/(1−α−β). Equity α+β is often near 1 (long memory in vol).
Captures fat tails and clustering; εₜ=σₜ zₜ with zₜ iid. EGARCH/GJR add the leverage (asymmetric) effect.
GARCH models the conditional variance, not the mean — it's a vol forecaster, not an alpha signal. α+β is the vol-persistence dial.

Cointegration & pairs trading

Two I(1) series are cointegrated if a linear combo Yₜ−βXₜ is I(0) (stationary) — they share a stochastic trend and a stable long-run spread.
Engle–Granger: regress Y on X, ADF-test the residual (spread) for stationarity. Johansen: multivariate, tests rank / multiple cointegrating vectors.
Trade: the stationary spread mean-reverts — short the spread when high, long when low; size by z-score, exit near mean.
Correlation ≠ cointegration: two series can be highly correlated short-term yet drift apart (no shared trend).
Cointegration is a long-run anchor between non-stationary series; the tradable object is the stationary spread, not either leg.

Example. spread=log P_A−β log P_B; β the hedge ratio from regression. Trade z=(spread−μ)/σ: enter at |z|≈2, exit at z≈0. Risk: β drifts / the relationship breaks (cointegration is not permanent).

Lookahead bias & leakage in backtests

Lookahead: using data not available at decision time — using a bar's close to trade within that bar, or a same-day fundamental released after the close.
Survivorship bias: testing only names that survived; delisted/bankrupt tickers dropped ⇒ inflated returns.
Leakage via fitting: normalizing/standardizing or selecting features on the full sample (incl. future) before splitting — z-scores must use only trailing data.
Respect point-in-time data and realistic lag: signal at t trades at t+1 open; restatements and index recompositions are forward-only.
Every transform must use only information available at the bar it acts on. The classic failure: a trailing window that silently peeks at the future.

Example. Centering returns with the full-sample mean leaks the future mean into every point. Correct: an expanding/rolling estimate using data up to t−1 only — and shift the signal one bar before execution.

Quoted, effective & realized spread — the measurement ladder

quoted spread = ask − bid; half-spread = (ask−bid)/2. Often quoted relative: (ask−bid)/mid.
effective spread = 2·D·(P_trade − mid), D=+1 buy, −1 sell. Measures what you ACTUALLY paid vs mid at trade time — captures price improvement and crossing.
realized spread = 2·D·(P_trade − mid_{t+Δ}) uses the mid a few minutes later — what the maker KEEPS after the price moves against them.
effective = realized + price impact: the part of the effective spread that isn't realized is adverse-selection cost to the maker.

Example. Bid 99.98 / ask 100.02, mid 100.00. You buy at 100.02. Effective half-spread = 0.02. If mid drifts to 100.015 in 5 min, realized half-spread = 100.02−100.015 = 0.005; impact = 0.015. Maker only nets 0.005.

Order types, queue priority & the maker–taker model

Limit order = provides liquidity, rests in book, earns spread but bears adverse selection + non-execution risk. Market order = takes liquidity, pays spread, certain fill.
Most venues use price–time priority: best price first, then FIFO by arrival. Queue position is an asset — being early in the queue at the best bid is valuable optionality.
Maker–taker fees: passive fills earn a rebate, aggressive fills pay a fee. This distorts the true spread: effective cost to a taker = spread + taker fee.
A resting limit order is a free option written to the market — informed flow picks it off (adverse selection).
Hidden/iceberg orders sacrifice time priority for non-display; pro-rata matching (common in some futures) changes optimal order sizing vs FIFO.

Glosten–Milgrom: spread as pure adverse selection

Market maker faces informed traders (know true value V) and noise traders (random). MM is competitive & risk-neutral → zero expected profit → quotes are conditional expectations.
ask = E[V | buy order], bid = E[V | sell order]. The spread exists EVEN with zero inventory cost and zero fees — purely from information asymmetry.
Each trade is a signal: a buy revises beliefs up, a sell down. Quotes are a martingale updated via Bayes; prices converge to V as trades arrive.
Spread = compensation for trading against possibly-informed counterparties; more informed flow (higher α) ⇒ wider spread.
Contrast with inventory models: G–M spread is informational, not risk-bearing.

Example. V∈{0,100} equally likely, prior mid 50. Fraction α informed. A buy could be informed (V=100) or noise. ask=E[V|buy]>50, bid=E[V|sell]<50; the gap is the adverse-selection spread.

Kyle (1985): λ, market depth & how info gets impounded

Single informed trader, noise traders submit order flow u, one market maker sets price from total flow x+u. Linear equilibrium: price = p₀ + λ·(order flow).
λ is Kyle's lambda — price impact per unit of net order flow; 1/λ is market DEPTH (Kyle calls it the inverse of liquidity).
Single-auction closed form: λ = (1/2)·√(Σ₀/σ_u²) where Σ₀ = variance of value, σ_u² = noise-trade variance. More noise traders ⇒ smaller λ ⇒ deeper market.
In that one-shot model the informed trader sizes to avoid full revelation: in equilibrium exactly half the private-info variance is impounded into price (Σ₁ = Σ₀/2).
Empirically λ is estimated by regressing price changes on signed order flow: ΔP = λ·(signed volume) + ε.

Order-book imbalance as a short-horizon price predictor

Define volume imbalance at top of book: I = (Q_bid − Q_ask)/(Q_bid + Q_ask) ∈ [−1,1].
Strong positive contemporaneous & short-lead relation: high bid size ⇒ next mid tick more likely UP. The size-weighted mid formalizes this.
Size-weighted mid (Stoikov's microprice proxy): P_micro = (Q_ask·bid + Q_bid·ask)/(Q_bid+Q_ask) — weights the side with LESS size, i.e. tilts toward where the book is thin (the likely move).
Imbalance predicts the next move because thin side fills first; signal decays in seconds and is mostly arbitraged at liquid names.
Caveat: spoofing/flickering quotes contaminate raw imbalance; predictive power is regime- and tick-size-dependent.

Example. Bid 99.99 (5000 shares), ask 100.01 (1000). I=(5000−1000)/6000≈0.67. Microprice = (1000·99.99+5000·100.01)/6000 ≈ 100.007 — pulled toward the ask, predicting an up-move.

Avellaneda–Stoikov: reservation price & optimal quoting

MM maximizes exponential (CARA, risk aversion γ) utility of terminal wealth while holding inventory q over horizon T−t.
Reservation price r = s − q·γ·σ²·(T−t) — the mid SHIFTED away from current inventory. Long inventory (q>0) pulls quotes DOWN to encourage selling.
Optimal total spread: δ_total = γσ²(T−t) + (2/γ)·ln(1 + γ/k), where k governs order-arrival intensity λ=A·e^{−kδ}.
Quotes set symmetrically around r, NOT the mid: bid = r − δ/2, ask = r + δ/2. This skews fills to mean-revert inventory toward zero.
As t→T the inventory term vanishes; quotes recenter on the mid.

Inventory-risk view of the spread (Stoll / Ho–Stoll / Garman)

Distinct from adverse selection: a MM with finite risk capacity charges a spread to be compensated for holding unwanted, undiversified inventory exposure.
Quotes are shaded around the true value by inventory: after a buy (MM now SHORT) it raises quotes to attract a seller; after a sell it lowers them. Mid is pushed AWAY from the true value temporarily.
Inventory effects induce transient (mean-reverting) price pressure; adverse-selection effects are permanent. This is the key to decomposing spread components.
Larger position, higher volatility, shorter unwind horizon, higher funding cost ⇒ wider inventory spread.
Three classic components: order-processing (fixed cost), inventory-holding, adverse-selection.

Market impact: temporary vs permanent & the square-root law

Permanent impact = information content of the trade, stays in the price (≈ linear in size, Kyle's λ). Temporary impact = liquidity-consumption cost that decays after you stop trading.
Empirical square-root law: cost of executing a metaorder ≈ Y·σ·√(Q/V) — impact grows with the SQUARE ROOT of participation (Q = order size, V = daily volume, σ = vol).
Almgren–Chriss execution: temporary impact often modeled ∝ (trade rate), permanent ∝ (size); optimal schedule trades off impact vs timing (volatility) risk.
Concavity matters: doubling order size does NOT double cost — splitting and stretching execution reduces impact but adds price risk.

Example. Buying 1% of ADV at σ=2%/day with Y≈1: impact ≈ 1·0.02·√0.01 = 0.02·0.1 = 0.002 = 20 bps. Buying 4% ⇒ √4=2× the cost, not 4×.

Execution benchmarks: TWAP, VWAP & Implementation Shortfall

TWAP: slice evenly over time — simple, predictable, ignores volume profile (vulnerable to gaming, bad when volume is U-shaped).
VWAP: match the day's volume distribution (front-load opening/closing humps). Benchmark = Σ pᵢvᵢ / Σ vᵢ; beating VWAP means trading better than the average participant.
Implementation Shortfall (Perold): IS = (paid − decision-price) on executed + opportunity cost on unexecuted. The honest all-in benchmark: it charges you for delay, impact, fees, AND missed fills.
VWAP/TWAP measure execution quality vs the market; IS measures the total cost of the DECISION — the only one that penalizes not-trading.
Aggressive (front-loaded) schedules cut timing risk but raise impact; IS-optimal schedules solve this trade-off explicitly (Almgren–Chriss).

Bitwise Operators & Bit Tricks

Six operators acting on the binary representation. & AND · | OR · ^ XOR · ~ NOT · << left shift · >> right shift. Shifts are fast multiply/divide by powers of two.

Op	Example	Result	Meaning
& AND	`5 & 3 = 0101 & 0011`	`0001 = 1`	1 only where both bits are 1
\| OR	`5 \| 3 = 0101 \| 0011`	`0111 = 7`	1 where either bit is 1
^ XOR	`5 ^ 3 = 0101 ^ 0011`	`0110 = 6`	1 where the bits differ
~ NOT	`~5`	`−6`	flip every bit (~x = −x−1)
<< shift	`1 << 4`	`16`	x << n = x·2ⁿ
>> shift	`20 >> 2`	`5`	x >> n = ⌊x / 2ⁿ⌋

Tricks worth memorizing:

x & 1 tests odd/even; x & (x−1) clears the lowest set bit, so x & (x−1) == 0 tests a power of two.
x & (−x) isolates the lowest set bit; popcount via Brian Kernighan loops x &= x−1 and counts iterations.
Set / clear / toggle / read bit k: x | (1<<k) · x & ~(1<<k) · x ^ (1<<k) · (x >> k) & 1.
XOR identities x^x=0, x^0=x: find the lone unpaired element by XOR-ing the whole array; swap two ints with no temp via three XORs.
Enumerate every subset of an n-element set as bitmasks 0 … (1<<n)−1 — the backbone of bitmask DP (e.g. held-Karp TSP over subsets).

Example. Array [4,1,2,1,2], every value paired but one. 4^1^2^1^2 = 4 — the pairs cancel, leaving the unique element, in O(n) time and O(1) space.

Big-O: the rules that actually get tested

Big-O is an upper bound on growth as n→∞; drop constants and lower-order terms.
Sum rule: sequential blocks → take the max, O(n)+O(n²)=O(n²). Product rule: nested loops multiply, O(n)·O(m)=O(nm).
Recurrences via Master theorem: T(n)=2T(n/2)+O(n)→O(n log n) (mergesort); T(n)=2T(n/2)+O(1)→O(n) (heapify-style).
Logs from halving (O(log n) binary search) or from balanced-tree depth; base of the log is a constant, so omit it.
Watch hidden costs: string concat, slicing, and in on a list are each O(n) inside a loop → quietly O(n²).

Example. Building a result by s += c over n chars is O(n²) (each concat copies); join a list once for O(n).

Hashing: how it works and when it breaks

Average O(1) insert/lookup; degrades to O(n) when keys collide into one bucket or the table never resizes.
Collision handling: chaining (linked list/bucket per slot) vs open addressing (probe to next slot). Load factor α = items/buckets drives performance.
Resize (rehash) when α exceeds a threshold; this is what makes growth amortized O(1), not per-op O(1).
Degradation triggers: adversarial keys all hashing to one bucket, a weak hash function, or hashing mutable objects whose value changes after insertion.
Sets/dicts give O(1) membership; ordered structures (balanced BST) give O(log n) but keep sorted order and range queries.

Example. Two-sum: one pass storing target−x in a hash set is O(n); the brute nested loop is O(n²).

Quicksort vs mergesort vs heapsort

All average O(n log n); they differ on worst case, stability, memory, and constant factors.
avg worst space stable
quicksort n log n n² O(log n) no
mergesort n log n n log n O(n) yes
heapsort n log n n log n O(1) no
Quicksort worst case n² hits on bad pivots (already-sorted with naive first-element pivot); randomized or median-of-three pivots fix it in practice.
Mergesort guarantees n log n and is stable → preferred for linked lists and external/disk sorts; cost is the O(n) auxiliary buffer.
Quicksort wins on arrays despite equal asymptotics due to cache locality and in-place partitioning (small constants).

	avg	worst	space	stable
quicksort	n log n	n²	O(log n)	no
mergesort	n log n	n log n	O(n)	yes
heapsort	n log n	n log n	O(1)	no

Heaps and the top-k pattern

A binary heap gives O(1) peek-min, O(log n) push/pop; build-heap from n items is O(n), not O(n log n).
Array-backed: parent of i is (i−1)/2, children 2i+1 and 2i+2; no pointers needed.
Top-k largest of a stream: keep a min-heap of size k; push, then pop the smallest if size>k. Cost O(n log k), space O(k).
Use a max-heap to repeatedly extract the largest; use two heaps (max-heap of low half + min-heap of high half) to track a running median.
Heapsort = build-heap then repeatedly extract; in-place but not stable and worse cache behavior than quicksort.

Example. k-th largest in n items: size-k min-heap → O(n log k), beats sorting's O(n log n) when k≪n.

Two-pointer and sliding window

Replace an O(n²) nested scan with one linear pass by moving two indices that never backtrack.
Opposite-ends two-pointer needs a sorted array or a monotonic property (e.g. pair-sum: move left up if sum too small, right down if too big).
Sliding window fits contiguous subarray/substring problems: expand the right edge, shrink the left while a constraint is violated. Amortized O(n) since each index enters and leaves once.
Fast/slow pointers detect cycles and find midpoints in a single pass with O(1) extra space (Floyd).
Pitfall: the window technique assumes monotonic feasibility (e.g. all-positive values); negatives can break the shrink invariant — switch to prefix-sum + hash.

Example. Longest substring without repeats: window + last-seen hash map, advance left past the duplicate → O(n).

Dedup: choosing the right technique

Hash set = O(n) time / O(n) space; sort-then-scan = O(n log n) time / O(1) extra; pick by whether order, memory, or stability matters.
A single pass with a seen-set preserves first-seen order and is fastest, but needs space proportional to distinct items — bad for huge/streaming data.
Sort then drop adjacent equals uses no extra structure but destroys original order and is slower.
Find duplicates without extra space when values are bounded in [1,n]: index-as-hash (negate arr[abs(x)−1]) or cycle detection on the value-as-pointer graph.
At scale (data won't fit in memory): external sort, or approximate membership with a Bloom filter (false positives, no false negatives).

Example. First duplicate in a stream: one pass, add to set, return on first re-seen element → O(n)/O(n).

Reservoir sampling

Sample k items uniformly from a stream of unknown length in one pass, O(k) memory.
Algorithm R: keep the first k; for the i-th item (i>k) keep it with probability k/i, evicting a uniformly random current holder.
Correctness: induction shows every seen item retains probability k/n after n items — the early-vs-late asymmetry exactly cancels.
k=1 case: replace the held item with probability 1/i; trivial to state and a common warm-up.
Use it when you can't store or count the data first (logs, ticks, network streams); weighted variants use the exponential-jump (A-ExpJ) trick.

Example. P(item 1 survives n draws, k=1) = 1·½·⅔···(n−1)/n = 1/n — uniform, as required.

Dynamic programming: state, transition, order

DP = overlapping subproblems + optimal substructure; define the state, the transition, the base case, and a valid evaluation order.
Top-down memoization: recursion + cache; only computes reachable states, easy to write. Bottom-up tabulation: iterative fill, no recursion overhead, enables space rolling.
Rolling array: if a transition only reads the previous row, keep O(width) instead of O(n·width) memory.
Classic templates: knapsack (weight-capacity state), longest common subsequence (2-D grid), edit distance, coin change (min-count), interval DP.
Pitfall: greedy fails when a locally optimal choice blocks a global optimum — DP enumerates all consistent choices; prove substructure before trusting greedy.

Example. Coin change min coins: dp[a]=min(dp[a−c]+1) over coins c; O(amount·#coins), 1-D array.

Pattern map: problem signature → technique

Most interview problems are a known pattern in disguise; match the signature first, then code.
"Sorted array / find a pair or bound" → two pointers or binary search.
"Contiguous subarray/substring with a constraint" → sliding window or prefix-sum + hash.
"k largest/smallest, or running order statistic" → heap (size-k or two-heap median).
"Count/optimize over choices with reuse" → DP; "min answer satisfying a monotonic predicate" → binary search on the answer.
"Membership / first duplicate / group-by-key" → hash set/map; "cycle or midpoint in a list" → fast-slow pointers.

Example. "Min capacity to ship in D days" reads as optimization but the feasibility is monotonic → binary-search the capacity, O(n log range).

The Brainteaser Attack Plan

Before computing, restate the rules out loud and ask what's being optimized — half of teasers are won by clarifying constraints.
Try the smallest / most extreme case first: n=1, n=2 often reveals the pattern or invariant.
Hunt for an invariant (something that never changes) or a monovariant (something that only moves one way) — these prove impossibility or bound the answer.
Work backward from the end state when the last move is forced (induction, endgames, pirate splits).
Think out loud and narrate your search — the interviewer scores your process, not just the final number.

Example. "Can a knight visit every square of a 5×5 board and return to start?" Color the board: a closed knight's tour alternates colors, needing equal black/white, but 25 squares can't split evenly — invariant kills it instantly.

Parity & Invariants

If every legal move flips a quantity by a fixed parity, the reachable states are exactly those matching that parity — use it to prove a goal is impossible.
Checkerboard coloring: a domino always covers one black + one white square, so a board with unequal colors can't be tiled.
Mutilated chessboard: remove two opposite (same-color) corners → 32 vs 30 squares → no domino tiling.
Light-switch / coin-flip puzzles: track the parity of "heads" or "on" switches; an odd invariant blocks the all-off state.
Splitting a coin into two known-count groups: separate N coins from the dark, flip those N — both groups end with equal heads regardless of which were heads.

Example. 100 coins, 10 heads, you're blindfolded. Split into a pile of 10 and a pile of 90, then flip every coin in the pile of 10. If the small pile took k heads, it now has 10−k heads and the big pile has 10−k heads — equal, guaranteed.

Hat-Color Logic via Parity

One announced parity bit lets everyone behind deduce their own color — the first speaker sacrifices himself to broadcast the XOR of all hats he sees.
N prisoners in a line, each sees only those ahead; back-most counts (say) red hats mod 2 and shouts that parity.
Each subsequent prisoner counts reds still ahead, compares to the running parity announced, and infers his own hat.
Guarantees N−1 saved; the first has a 50% guess. No probability needed — it's pure parity bookkeeping.
Generalizes to colors mod k: announce the sum of all visible hats mod k.

Example. Back prisoner sees 7 red hats → odd → shouts "red." Next sees 6 red → even → the parity dropped by his own, so his hat is red. He says red; everyone updates the parity and continues.

Weighing Puzzles: the 3ᵏ Bound

A balance has 3 outcomes per weighing (left<right, =, >), so k weighings distinguish at most 3ᵏ states — set up that information budget first.
12 coins, one fake of unknown direction: 24 possibilities < 3³=27, so 3 weighings suffice; design each weighing to split outcomes as evenly as possible.
Classic 3-weighing 12-coin split: weigh 4 vs 4, then branch on the result, keeping every outcome equally informative.
A one-pan scale (reads exact weight) gives far more info — that's the regime for the bottle-counterfeit-coin puzzle (take 1,2,3,… coins).
Always count outcomes vs states before constructing the scheme; if states > 3ᵏ it's provably impossible.

Example. 10 stacks, each coin 10g except one stack of 9g coins. One weighing on a scale: take 1 coin from stack 1, 2 from stack 2, …; the deficit below the expected total (in grams) names the bad stack.

Binary Encoding: Bottles & Testers

To identify 1 poisoned item among N with parallel yes/no testers, each item gets a binary ID and tester i drinks every item whose bit i is 1 — needs ⌈log₂N⌉ testers.
1000 bottles, 1 poison, one test round: 2¹⁰=1024 ≥ 1000, so 10 testers; the pattern of deaths reads off the poisoned bottle's 10-bit index.
Key requirements: tests run in parallel (one round) and the result is per-tester binary (sick / not).
If you have multiple sequential rounds instead, the budget grows multiplicatively — re-derive the base accordingly.
Same trick labels prisoners or switches: assign IDs, let each "channel" carry one bit.

Example. Bottle 731 = 1011011011₂. Testers at bit-positions {0,1,3,4,6,7,9} drink it; if exactly those testers die, the binary pattern of deaths reconstructs 731.

Backward Induction: Pirates & Endgames

When the last decision is forced, solve the final subgame first and induct backward — each player anticipates that everyone after them plays optimally.
Pirate split (5 pirates, 100 coins, ≥50% incl. proposer's own vote needed, else proposer dies): solve the 2-pirate case, then 3, 4, 5.
With 2 left, the senior keeps all (his own vote = 50%); knowing that, with 3 the proposer buys the cheapest vote with 1 coin, etc.
Result for 5: proposer keeps 98, gives 1 each to the two pirates who'd get nothing in the next subgame.
Pirates are rational, greedy, and prefer killing only when indifferent — clarify the tie-break rule before answering.

Example. 5 pirates (A senior…E): A proposes 98-0-1-0-1. C and E accept because if A dies, B's plan gives them 0; A's own vote + C + E = 3/5 ≥ majority. A survives with 98.

Nim & Combinatorial Games

A Nim position is losing for the player to move iff the XOR (nim-sum) of all pile sizes is 0; otherwise move to make it 0.
Identify P-positions (previous player wins / mover loses) — these are the targets you hand your opponent.
Winning move: pick a pile, reduce it so the total XOR becomes 0; such a move always exists when XOR ≠ 0.
For non-Nim games, compute Grundy numbers (mex of reachable positions' Grundy values); a sum of games XORs their Grundy values (Sprague–Grundy).
Misère Nim differs only near the end: play normal Nim until all heaps are size 1, then flip parity.

Example. Piles 3,4,5: 3⊕4⊕5 = 011⊕100⊕101 = 010 = 2 ≠ 0, so the mover wins. Reduce the pile of 3 to 1 (since 1⊕4⊕5=0), handing a losing position to the opponent.

Optimal Stopping: the Secretary Rule

To pick the single best of N candidates seen one-at-a-time with no recall, reject the first ~N/e (37%) automatically, then take the first one better than everyone so far.
The first phase only calibrates a threshold (the best-so-far); the second phase commits on the first record-breaker.
Probability of landing the very best converges to 1/e ≈ 37% as N grows — cite the result; the integral derivation lives in Probability.
Variants change the cutoff: maximizing expected rank, or allowing a few picks, shifts the threshold — state which objective you're optimizing.
Watch the framing trap: this maximizes P(get the absolute best), not expected value of the chosen candidate.

Example. 100 applicants: interview and reject the first 37 (remember the best of them), then hire the first applicant who beats that bar. ~37% chance you hired the true #1.

100 Prisoners & the Boxes

Each prisoner follows the cycle: open the box labeled with his number, then the box numbered by the slip he finds, repeating — survival hinges on no permutation cycle exceeding 50.
100 prisoners, boxes hold a random permutation of 1–100; each may open 50 boxes and must find his own number; all must succeed.
Naive random search → survival ≈ (1/2)¹⁰⁰, essentially zero. Cycle-following ties everyone's fate to one event: "longest cycle ≤ 50."
That event's probability is about 31% (≈ 1 − ln 2 in the limit) — cite it; the harmonic-sum derivation belongs to Probability.
The trick is that following slips means a prisoner traverses exactly the cycle containing his own number, and he finds it iff that cycle has length ≤ 50.

Example. Prisoner 7 opens box 7 → finds 23 → opens box 23 → finds 7: a 2-cycle, found in 2 opens. He's safe as long as his cycle's length is ≤ 50.

Squares & Cubes, n = 1 … 20

Memorize cold — they recur in estimation, variance, and "what's the root" problems, and let you recognize roots instantly. Squares (n^2):

1^2=1 · 2^2=4 · 3^2=9 · 4^2=16 · 5^2=25 · 6^2=36 · 7^2=49 · 8^2=64 · 9^2=81 · 10^2=100 · 11^2=121 · 12^2=144 · 13^2=169 · 14^2=196 · 15^2=225 · 16^2=256 · 17^2=289 · 18^2=324 · 19^2=361 · 20^2=400

Cubes (n^3):

1^3=1 · 2^3=8 · 3^3=27 · 4^3=64 · 5^3=125 · 6^3=216 · 7^3=343 · 8^3=512 · 9^3=729 · 10^3=1000 · 11^3=1331 · 12^3=1728 · 13^3=2197 · 14^3=2744 · 15^3=3375 · 16^3=4096 · 17^3=4913 · 18^3=5832 · 19^3=6859 · 20^3=8000

Use. See 2197 → recognize 13^3, so ∛2197 = 13; spotting 169 = 13^2 or 361 = 19^2 speeds square-root and factoring guesses.

Percent of: x% of y = y% of x

x% of y equals y% of x — swap to whichever is easier.
Hard: 16% of 25. Swap: 25% of 16 = 4.
Build any percent from 10% (move decimal), 5% (half of 10%), 1% (move twice).
Basis points: 1bp = 0.01% = 0.0001; 100bp = 1%.
Tip-style: 15% = 10% + 5%; 17.5% = 10%+5%+2.5%.

Example. 8% of 50 = 50% of 8 = 4. And 35bp of $20M = 0.0035×20M = $70k.

Fraction → decimal cheat table

Memorize eighths, sixteenths, and 1/7 — they recur in odds, ticks, and yields.
Halving chain: 1/2=.5, 1/4=.25, 1/8=.125, 1/16=.0625.
Sixteenths: 3/16=.1875, 5/16=.3125, 7/16=.4375 (odd k: k/16).
Sevenths cycle 142857: 1/7=.142857, 2/7=.285714, … (rotate the block).
Useful: 1/3=.3̄, 1/6=.16̄, 1/9=.1̄, 1/11=.0909, 1/12=.0833.

Example. 3/7 = .428571 — same six digits as 1/7, started at the right spot (3×.142857).

Squaring tricks: ends-in-5, near a base

Ends-in-5 shortcut: n5² = n(n+1) followed by 25. So 35² = 3·4|25 = 1225.
Near a base b: n² = (n+d)(n−d) + d² where d = n−b.
Near 50: 53² = (50+3)² → 2500 + 100·3 + 9 = 2809.
Near 100: 97² = (100−3)² = 10000 − 600 + 9 = 9409.
General: (a±b)² = a² ± 2ab + b² — pick a as a round anchor.

Example. 48² = (50−2)² = 2500 − 200 + 4 = 2304.

Fast multiplication: difference of squares & ×11

For symmetric pairs, ab = (mid)² − (gap)².
17×19 = 18² − 1² = 324 − 1 = 323; 38×42 = 40² − 2² = 1600 − 4 = 1596.
×11: write the digits with their sum between, carrying: 11×34 = 3|3+4|4 = 374.
×5 = ×10 ÷2; ×25 = ÷4 ×100; ×9 = ×10 − itself.
Break-apart: 23×14 = 23×10 + 23×4 = 230 + 92 = 322.

Example. 54×46 = 50² − 4² = 2500 − 16 = 2484 (pair averages 50, gap 4).

Square roots by digit-pairing

Pair digits from the decimal point; each pair → one result digit, so count pairs to fix the magnitude.
Anchor squares: √2≈1.414, √3≈1.732, √5≈2.236, √7≈2.646.
√10≈3.162 (= √2·√5); √12 = 2√3 ≈ 3.464.
Refine with one Newton step: √S ≈ (g + S/g)/2 for guess g.
Scaling: √(100x) = 10√x; move the decimal in pairs, not singly.

Example. √50: guess 7, (7 + 50/7)/2 = (7 + 7.14)/2 ≈ 7.07 (true 7.071).

Rule of 72 (and when it's exact)

Rule of 72: doubling time ≈ 72 / rate(%); the pure log answer is ln2×100 ≈ 69.3, but 72 corrects upward for discrete compounding and has more clean divisors.
Near-exact at 8%: 72/8 = 9.0 vs true 9.01 yr.
Slightly long at 7%: 72/7 = 10.29 vs true 10.24 yr.
Use 69–70 for continuous compounding (ln2 = .693); 72 has more clean divisors.
Tripling: from ln3 ≈ 1.10, time ≈ 110/rate (e.g. 6% → ~18 yr).

Example. At 6%, money doubles in 72/6 = 12 yr; true value 11.9 yr.

Compounding & chained returns

Binomial approximation: (1+x)ⁿ ≈ 1 + nx for small x — but the dropped term is +n(n−1)x²/2, always positive, so you under-count.
+x then −x = (1+x)(1−x) = 1 − x² — always a net loss of x².
Compounding beats simple: 0.5%/mo → 1.005¹² = 1.0617 (6.17%, not 6%).
Returns chain multiplicatively: total = ∏(1+rᵢ) − 1, never add unless tiny.
Small-return shortcut: sum the logs, ln(1+r) ≈ r.

Example. +10% then −10% = 1 − .01 = 0.99 → down 1%, not flat.

Log & exp approximations

For small x: eˣ ≈ 1 + x + x²/2 and ln(1+x) ≈ x − x²/2.
Key logs: ln2 ≈ .693, ln10 ≈ 2.303; log₁₀2 = .301, log₁₀3 = .477, log₁₀7 = .845.
Get the rest by addition: log₁₀5 = 1 − .301 = .699; log₁₀6 = .301+.477 = .778.
e ≈ 2.718; e² ≈ 7.39; doubling-decibel sense: +3dB ≈ ×2 since 10^.3 ≈ 2.
Convert bases: ln x = 2.303 · log₁₀ x.

Example. ln1.1 ≈ .1 − .005 = .095 (true .0953); so 10%/yr doubles in .693/.095 ≈ 7.3 yr.

Annualization & quick P&L arithmetic

Daily vol → annual: ×√252 ≈ 16, so 1% daily vol ≈ 16% annual (because 16²=256≈252).
Monthly → annual vol: ×√12 ≈ 3.46; weekly: ×√52 ≈ 7.2.
Sharpe annualizes the same way: daily Sharpe ×√252, monthly ×√12.
Volatility scales with √time, return scales with time — that gap is why Sharpe grows.
P&L: notional × move; in bps, $X × bp/10000.

Example. +35bp on $20M = 20M × .0035 = $70k. A 0.8% daily-vol book ≈ 0.8×16 = 12.8% annual vol.

Limits & Continuity

A limit is the value a function approaches; the function need not be defined there. f is continuous at a if lim(x→a) f(x) = f(a) — no jumps, holes, or blow-ups.

One-sided limits must agree: lim(x→a⁻) = lim(x→a⁺).
Rational-function trick: factor and cancel the removable term.
Squeeze theorem: if g ≤ f ≤ h and both ends → L, then f → L. Gives lim(x→0) sin(x)/x = 1.

Example. lim(x→2) (x²−4)/(x−2) is 0/0 raw. Factor: (x−2)(x+2)/(x−2) = x+2 → 4.

Derivative Rules

The derivative is the instantaneous rate of change / slope of the tangent.

Rule	Formula
Power	`d/dx xⁿ = n·xⁿ⁻¹`
Product	`(fg)′ = f′g + fg′`
Quotient	`(f/g)′ = (f′g − fg′)/g²`
Chain	`(f(g(x)))′ = f′(g(x))·g′(x)`

Quotient mnemonic: "low·d-high minus high·d-low, over low²."

Example. d/dx [x²·sin x] = 2x·sin x + x²·cos x (product). And d/dx (3x+1)⁵ = 5(3x+1)⁴·3 = 15(3x+1)⁴ (chain).

Key Derivatives

Memorize these cold — they appear in almost every problem.

f(x)	f′(x)
`eˣ`	`eˣ`
`aˣ`	`aˣ·ln a`
`ln x`	`1/x`
`sin x`	`cos x`
`cos x`	`−sin x`
`tan x`	`sec²x = 1/cos²x`

eˣ is its own derivative — the only function (up to scale) with that property.

Example. d/dx ln(x²+1) = 1/(x²+1)·2x = 2x/(x²+1) (chain with ln).

Integration Techniques

Integration is anti-differentiation (+C) and signed area. Two workhorses:

Substitution (reverses chain rule): set u = g(x), du = g′(x)dx.
By parts (reverses product rule): ∫u dv = uv − ∫v du. Pick u by LIATE (Log, Inverse-trig, Algebraic, Trig, Exp).

Common: ∫xⁿdx = xⁿ⁺¹/(n+1), ∫1/x dx = ln|x|, ∫eˣdx = eˣ, ∫sin x dx = −cos x, ∫cos x dx = sin x.

Example (parts). ∫x·eˣ dx: let u=x, dv=eˣdx → uv−∫v du = x·eˣ − ∫eˣdx = eˣ(x−1)+C.

The Gaussian Integral

The bell-curve integral — central to probability and the normal distribution. ∫₋∞^∞ e^(−x²) dx = √π

Scaled: ∫₋∞^∞ e^(−a·x²) dx = √(π/a).
Normal density: ∫₋∞^∞ (1/√(2πσ²))·e^(−x²/2σ²) dx = 1.

Proof trick: square it to a 2D integral, switch to polar — (∫e^(−x²)dx)² = ∫∫e^(−r²)·r dr dθ = π, so the integral is √π ≈ 1.7725.

Example. ∫₋∞^∞ e^(−2x²) dx = √(π/2) ≈ 1.2533.

Taylor Series & Small-x Approximations

Near x=0 any smooth function ≈ a polynomial: f(x) ≈ f(0) + f′(0)x + f″(0)x²/2 + …. Memorize these:

Function	Series
`eˣ`	`1 + x + x²/2 + x³/6 + …`
`ln(1+x)`	`x − x²/2 + x³/3 − …`
`(1+x)ⁿ`	`1 + nx + n(n−1)x²/2 + …`
`sin x`	`x − x³/6 + x⁵/120 − …`
`cos x`	`1 − x²/2 + x⁴/24 − …`

For small x: eˣ ≈ 1+x, ln(1+x) ≈ x, (1+x)ⁿ ≈ 1+nx, sin x ≈ x, cos x ≈ 1−x²/2.

Example. Estimate e^0.1: ≈ 1+0.1+0.005 = 1.105 (actual 1.10517). And ln(1.1) ≈ 0.1−0.005 = 0.095 (actual 0.09531).

Partial Derivatives & Gradient

For f(x,y,…), the partial ∂f/∂x differentiates in x treating other vars as constants. The gradient stacks them: ∇f = (∂f/∂x, ∂f/∂y, …).

∇f points in the direction of steepest ascent; its magnitude is the max slope.
∇f is perpendicular to level curves/surfaces of f.
Directional derivative along unit vector u: ∇f · u.

Example. f = x²y + y³. ∂f/∂x = 2xy, ∂f/∂y = x² + 3y². At (1,2): ∇f = (4, 13).

Unconstrained Optimization (FOC / SOC)

Find extrema by setting the derivative to zero, then classify.

FOC (first-order): f′(x)=0 (or ∇f=0) → critical points.
SOC (second-order), 1D: f″>0 → min, f″<0 → max, f″=0 → inconclusive.
Multivariate: use the Hessian H. H positive-definite → local min; negative-definite → local max; indefinite → saddle.

Example. f = x³ − 3x. f′ = 3x²−3 = 0 → x = ±1. f″ = 6x: at x=1, f″=6>0 (min); at x=−1, f″=−6<0 (max).

Lagrange Multipliers

Optimize f subject to constraint g = c. At the optimum the gradients align: ∇f = λ·∇g, solved together with g = c. Intuition: you can't improve f without violating g once the level set of f is tangent to the constraint.

Example. Maximize f = xy on x + y = 10. ∇f=(y,x), ∇g=(1,1) → y=λ, x=λ so x=y. With x+y=10: x=y=5, max product 25. (Confirms AM–GM.)

L'Hôpital's Rule

For indeterminate 0/0 or ∞/∞, differentiate top and bottom separately: lim f/g = lim f′/g′ (repeat if still indeterminate).

Other forms (0·∞, ∞−∞, 1^∞) must first be rewritten as a quotient.
For 1^∞ etc., take logs first.

Example. lim(x→0) (1−cos x)/x² = 0/0 → sin x / 2x = 0/0 → cos x / 2 → 1/2. (Matches cos x ≈ 1−x²/2.)

Convexity & Jensen's Inequality

A twice-differentiable f is convex on an interval iff f″(x) ≥ 0 there; concave iff f″ ≤ 0. Convex = holds water, lies below its chords, above its tangents.
Jensen: for convex f, E[f(X)] ≥ f(E[X]); flip the sign for concave. Equality iff X is constant or f linear.
The convexity gap E[f(X)] − f(E[X]) grows with Var(X) and curvature — this is exactly option vega/gamma in disguise.
Tangent-line bound (convex): f(x) ≥ f(a) + f′(a)(x−a) for all a — the basis of the supporting-hyperplane argument.
Common: x², eˣ, −ln x, 1/x (x>0) are convex; ln x, √x are concave.

Example. f(x)=x² convex, so E[X²] ≥ (E[X])² — which restates Var(X) = E[X²]−(E[X])² ≥ 0. Jensen is the structural reason variance is nonnegative.

Feynman's Trick (Differentiation Under the Integral)

Introduce a parameter t into an integral I(t)=∫ f(x,t) dx, then swap derivative and integral: I′(t)=∫ ∂f/∂t dx (Leibniz rule), integrate the easier result, and fix the constant from a known value.
Workhorse for Gaussian moments: differentiate ∫₋∞^∞ e^(−tx²) dx = √(π/t) in t to pull down powers of x².
E[X²ⁿ] for standard normal: (2n−1)!! = 1·3·5···(2n−1) (e.g. E[X²]=1, E[X⁴]=3, E[X⁶]=15).
Justified when ∂f/∂t is dominated by an integrable bound (Leibniz integral rule / dominated convergence) — fine for the smooth, fast-decaying integrands quants see.

Example. From ∫₋∞^∞ e^(−tx²) dx = √π·t^(−1/2), differentiate both sides in t: ∫ −x²e^(−tx²) dx = −½√π·t^(−3/2). At t=1: ∫₋∞^∞ x²e^(−x²) dx = √π/2 ≈ 0.8862.

Second-Order Multivariable Taylor & Newton Step

Quadratic expansion about x: f(x+h) ≈ f(x) + ∇f(x)ᵀh + ½ hᵀH(x)h, where H is the Hessian of second partials Hᵢⱼ=∂²f/∂xᵢ∂xⱼ.
H is symmetric when second partials are continuous (Clairaut/Schwarz): ∂²f/∂x∂y = ∂²f/∂y∂x.
Setting the gradient of the quadratic model to zero gives the Newton step: h = −H⁻¹∇f — the multivariate analogue of x − f′/f″.
The curvature seen along a direction v is vᵀHv/(vᵀv); bounded by the smallest/largest eigenvalues of H.
This expansion underlies risk: a portfolio P&L in factors is ≈ ΔᵀΔx + ½Δxᵀ Γ Δx (delta–gamma), with Γ the Hessian of value.

Example. f=x²+xy+y² at origin: ∇f=(0,0), H=[[2,1],[1,2]]. So f(h)≈½hᵀHh = h₁²+h₁h₂+h₂² — exact here since f is already quadratic.

Vectors, Dot Product & Norm

A vector is a list of numbers (a point/arrow). Length: ‖v‖ = √(v·v).

Dot product: u·v = Σ uᵢvᵢ = ‖u‖‖v‖cos θ.
u·v = 0 ⟺ vectors are orthogonal (⊥).
Cauchy–Schwarz: |u·v| ≤ ‖u‖‖v‖.

Example. u=(1,2), v=(3,4): u·v = 3+8 = 11, ‖u‖=√5, ‖v‖=5, so cos θ = 11/(5√5) ≈ 0.984 → θ ≈ 10.3°.

Matrix Multiplication & Properties

(AB)ᵢⱼ = Σ Aᵢₖ Bₖⱼ — row of A dotted with column of B. Inner dimensions must match: (m×n)(n×p) = (m×p).

Not commutative: AB ≠ BA in general.
Associative & distributive: (AB)C = A(BC), A(B+C)=AB+AC.
(AB)ᵀ = BᵀAᵀ and (AB)⁻¹ = B⁻¹A⁻¹ (order flips).

Example. [[1,2],[3,4]]·[[0,1],[1,0]] = [[2,1],[4,3]] (swaps columns), but the reverse product is [[3,4],[1,2]] (swaps rows) — different.

Transpose & Inverse

Transpose flips across the diagonal: (Aᵀ)ᵢⱼ = Aⱼᵢ. Inverse A⁻¹ satisfies A·A⁻¹ = I and exists only when det A ≠ 0. For 2×2 [[a,b],[c,d]]: A⁻¹ = 1/(ad−bc)·[[d,−b],[−c,a]].

Orthogonal matrix: QᵀQ = I, so Q⁻¹ = Qᵀ (cheap, stable).

Example. A=[[1,2],[3,4]], det = 1·4−2·3 = −2. A⁻¹ = (−1/2)·[[4,−2],[−3,1]] = [[−2,1],[1.5,−0.5]].

Determinant — Meaning

det A is the signed volume-scaling factor of the linear map; sign flags orientation flip.

det A = 0 ⟺ singular (columns dependent, map collapses a dimension, no inverse).
det(AB) = det A · det B; det(Aᵀ)=det A; det A = ∏ eigenvalues.
2×2: ad−bc. 3×3: cofactor expansion.

Example. [[2,0],[0,3]] scales area by det = 6: the unit square maps to a 2×3 rectangle.

Rank & Linear Independence

Vectors are independent if none is a combination of the others. Rank = number of independent columns (= independent rows) = dimension of the column space.

Full rank n (square) ⟺ invertible ⟺ det ≠ 0.
Rank-nullity: rank(A) + nullity(A) = #columns.

Example. [[1,2],[2,4]]: row 2 = 2×row 1, so rank 1 (not 2). Determinant is 0 — singular.

Eigenvalues & Eigenvectors

An eigenvector keeps its direction under A, only scaling: Av = λv. Find λ from the characteristic polynomial det(A − λI) = 0.

Σλ = trace(A); ∏λ = det(A) — fast sanity checks.
Then solve (A − λI)v = 0 for each eigenvector.

Example. A=[[2,1],[1,2]]: det([[2−λ,1],[1,2−λ]]) = (2−λ)²−1 = 0 → λ = 3, 1. Check: trace 4 = 3+1, det 3 = 3·1. Eigenvectors: (1,1) for 3, (1,−1) for 1.

Symmetric & PSD Matrices, Quadratic Forms

Symmetric A = Aᵀ → real eigenvalues and orthogonal eigenvectors (spectral theorem). A quadratic form is xᵀAx.

Class	Condition
Positive definite (PD)	`xᵀAx > 0` ∀x≠0; all `λ > 0`
Positive semidefinite (PSD)	`xᵀAx ≥ 0`; all `λ ≥ 0`

Covariance matrices and XᵀX are always PSD.

Example. A=[[2,0],[0,3]]: xᵀAx = 2x₁²+3x₂² > 0 → PD (eigenvalues 2, 3).

SVD & Low-Rank

Every matrix factors as A = UΣVᵀ: U, V orthogonal, Σ diagonal of singular values σ₁ ≥ σ₂ ≥ … ≥ 0.

Singular values = √ eigenvalues of AᵀA; rank = #nonzero σ. Basis for PCA and compression.
Eckart–Young: the best rank-k approximation is the top-k truncation A_k = Σᵢ₌₁ᵏ σᵢuᵢvᵢᵀ — optimal in both spectral and Frobenius norm.
Error ‖A−A_k‖₂ = σ_{k+1}, ‖A−A_k‖_F = √(Σ_{i>k} σᵢ²). Energy captured = Σᵢ₌₁ᵏσᵢ² / Σσᵢ² (scree / variance-explained).
Storage drops from mn to k(m+n+1) — the win when σ decays fast.

Example. A 1000×1000 image with fast-decaying σ: top-50 triples store ≈ 50·2000 = 100K vs 1M numbers (~10× compression). A rank-1 signal in tiny noise has σ₁≫σ₂; truncating to k=1 denoises, residual = σ₂.

Solving Ax = b (and why not invert)

Solve directly via factorization — never form A⁻¹ (slower, less stable, destroys sparsity).

Method	Use when
LU	general square A
Cholesky (`A=LLᵀ`)	A symmetric PD (≈2× faster)
QR	least squares / tall A, stable

Computing A⁻¹ then A⁻¹b wastes work and amplifies rounding error.

Example. For 1M sparse equations, LU/Cholesky exploit sparsity; an explicit inverse would be dense and infeasible to even store.

Projection & Least Squares

To fit Xβ ≈ y when no exact solution exists, minimize ‖Xβ − y‖². Setting the gradient to zero gives the normal equations: XᵀX β = Xᵀy → β = (XᵀX)⁻¹Xᵀy (in practice solved by QR). Geometrically, Xβ̂ is the orthogonal projection of y onto the column space of X; the residual is ⊥ that space.

Example. Fit line through (1,1),(2,2),(3,2). X=[[1,1],[1,2],[1,3]]. Solving normal equations gives intercept 2/3, slope 1/2 → ŷ = 0.667 + 0.5x.

Trace & Orthogonality / Gram–Schmidt

Trace = sum of diagonal = sum of eigenvalues. Cyclic: tr(ABC) = tr(BCA) = tr(CAB). Gram–Schmidt turns independent vectors into an orthonormal basis by subtracting projections: uₖ = vₖ − Σ proj(vₖ onto earlier uᵢ), then normalize. This is the Q in QR.

Example. v₁=(1,0), v₂=(1,1): keep u₁=(1,0). Subtract projection: u₂ = (1,1) − (v₂·u₁)u₁ = (1,1)−(1,0) = (0,1). Now orthonormal.

Eigendecomposition vs SVD — When Each Applies

Eigendecomposition A=VΛV⁻¹ needs A square; V orthogonal only if A is symmetric (spectral theorem).
SVD A=UΣVᵀ exists for ANY m×n matrix; U,V always orthonormal, σᵢ≥0.
For symmetric PSD A: the two coincide — σᵢ=λᵢ, left/right singular vectors equal eigenvectors.
General A: singular values are σᵢ=√(λᵢ(AᵀA)); eigenvalues of A can be complex/negative, σ never; σᵢ=|λᵢ| for any normal A (e.g. rotations), not only symmetric.
Use eigen for square symmetric (covariance, quadratic forms); use SVD when rectangular, rank-deficient, or numerically nasty.

Example. A non-symmetric 2×2 can have complex eigenvalues (a rotation) yet two real positive σ. So |det A|=σ₁σ₂=|λ₁λ₂| still holds, but eigenvectors need not be orthogonal.

Condition Number — Definition & Consequences

κ(A)=σ_max/σ_min (2-norm); for symmetric, =|λ|_max/|λ|_min.
Measures sensitivity: relative error in x solving Ax=b can be up to κ(A) × relative error in b or A.
Rule of thumb: lose about log₁₀κ digits of accuracy; κ≈10¹⁶ in double precision means the answer is garbage.
Normal equations square it: κ(AᵀA)=κ(A)² — why QR/SVD beats forming AᵀA for least squares.
High κ = nearly singular = collinear features; fix with regularization or drop/orthogonalize columns.

Example. Ridge adds λI: solving (AᵀA+λI)x shifts every eigenvalue up by λ, capping κ at (σ_max²+λ)/(σ_min²+λ) — strictly better-conditioned.

Distance, Dot/Cross Product & Angle

Distance in n-D: d = √(Σ(aᵢ−bᵢ)²).

Angle between vectors: cos θ = (u·v)/(‖u‖‖v‖).
Cross product (3D): u×v is ⊥ to both, with ‖u×v‖ = ‖u‖‖v‖sin θ = area of the parallelogram they span.
u·v measures alignment (scalar); u×v measures perpendicular spread (vector).

Example. u=(1,0,0), v=(0,1,0): u·v=0 (⊥), u×v=(0,0,1), area = 1.

Lines & Planes

Line (2D): y = mx + b, or ax + by = c.
Plane (3D): ax + by + cz = d, with normal vector n = (a,b,c).
Point-to-line distance (2D): |ax₀+by₀−c| / √(a²+b²). Plane version uses √(a²+b²+c²).
Perpendicular lines: slopes multiply to −1.

Example. Distance from origin to 3x + 4y = 10: |−10|/√(9+16) = 10/5 = 2.

Circles & Spheres

Circle: (x−h)² + (y−k)² = r²; circumference 2πr, area πr².
Sphere: (x−h)²+(y−k)²+(z−l)² = r²; surface 4πr², volume (4/3)πr³.
Inscribed-angle theorem: an angle subtended by a diameter is 90°.

Example. x²+y²−6x+8 = 0 → complete the square: (x−3)²+y² = 1, a circle centered (3,0) radius 1.

Areas & Volumes (Lookup)

Shape	Formula
Triangle	`½·base·height`
Triangle (Heron)	`√(s(s−a)(s−b)(s−c))`, `s=(a+b+c)/2`
Circle	area `πr²`, circ. `2πr`
Sphere	vol `(4/3)πr³`, surf. `4πr²`
Cone	`(1/3)πr²h`
Regular tetrahedron (edge a)	`a³/(6√2)`

Triangle area from vertices: ½|x₁(y₂−y₃)+x₂(y₃−y₁)+x₃(y₁−y₂)|.

Example. Triangle (0,0),(4,0),(0,3): area ½·4·3 = 6. Regular tetrahedron edge 1: vol 1/(6√2) ≈ 0.1179.

Similar Triangles

Triangles with equal angles have proportional sides — a workhorse for "find the missing length."

Side ratios equal: a/a′ = b/b′ = c/c′.
Areas scale as the square of the linear ratio; volumes as the cube.

Example. A 6-ft person casts a 4-ft shadow; a tree casts 20 ft. By similarity h/20 = 6/4 → tree height = 30 ft.

The n-Sphere & "Points in a Hemisphere" Puzzle

Classic: n points are placed uniformly at random on a sphere. Probability all lie in some common hemisphere? Key insight: each point defines a hemisphere via its diametrically opposite half; "all in one hemisphere" ⟺ the n points do not "surround" the center. In d dimensions: P = 2⁻ⁿ⁺¹ · Σ_{k=0}^{d−1} C(n−1, k).

Circle (d=2): P = n/2ⁿ⁻¹.
Sphere (d=3): P = (n² − n + 2)/2ⁿ.

Example. 4 points on a circle: P = 4/2³ = 1/2. 4 points on a sphere: P = (16−4+2)/16 = 14/16 = 7/8.

Coordinate Geometry Toolkit

Midpoint: ((x₁+x₂)/2, (y₁+y₂)/2).
Slope: (y₂−y₁)/(x₂−x₁).
Section formula (ratio m:n): ((mx₂+nx₁)/(m+n), (my₂+ny₁)/(m+n)).
Centroid of a triangle: average the three vertices.

Example. Centroid of (0,0),(6,0),(0,9): ((0+6+0)/3, (0+0+9)/3) = (2, 3).

Pythagoras, Triples & 3D Diagonals

Right triangle: a² + b² = c². Memorize triples (3,4,5), (5,12,13), (8,15,17) and scalings.

Distance = Pythagoras generalized: 3D box diagonal = √(l²+w²+h²).
Equilateral triangle side a: height = (√3/2)a, area = (√3/4)a².

Example. Diagonal of a 1×2×2 box: √(1+4+4) = √9 = 3. Area of equilateral triangle side 2: (√3/4)·4 = √3 ≈ 1.732.

Triangle Area: Shoelace, Heron & Cross-Product

Shoelace is the fastest area formula given coordinates — memorize it.
Coordinates: Area = ½|x_A(y_B−y_C)+x_B(y_C−y_A)+x_C(y_A−y_B)|. Equivalent to ½|(B−A)×(C−A)| in 2D.
Heron (sides only): s=(a+b+c)/2, Area=√(s(s−a)(s−b)(s−c)).
Two sides + included angle: Area=½·a·b·sinθ.
3D triangle: Area=½‖(B−A)×(C−A)‖ — the cross-product magnitude is twice the area.
Sign of the shoelace (before abs) gives orientation: + counterclockwise, − clockwise. Collinear ⟺ area 0.

Example. A=(0,0), B=(4,0), C=(1,3). Shoelace: ½|0(0−3)+4(3−0)+1(0−0)| = ½·12 = 6.

Geometric Probability: the Sample-Space Method

Turn "random" into a uniform point in a region; the probability is a ratio of measures (length/area/volume).
Pick coordinates: independent uniform picks → a square/cube; an angle → arc length on a circle; a stick break → a simplex.
Shade the event region, integrate or use geometry, divide by total measure: P = measure(event)/measure(space).
Symmetry first — many answers fall out without integrating. Conditioning restricts the sample space, so recompute the denominator.
Classics: broken-stick triangle 1/4 (each of x, y−x, 1−y must be < ½); expected distance between two uniform points on [0,1] = 1/3; Bertrand chord — pin down the sampling rule first, "random" is ambiguous.

Example. x,y ~ U(0,1): P(|x−y| < ½)? The two excluded corner triangles (x−y>½, y−x>½) each have area ½·(½)² = 1/8, so P = 1 − 2·(1/8) = 3/4.

Vector Projections & Point-to-Line/Plane Distance

Projection of a onto b: (a·b/‖b‖²)·b; its length is |a·b|/‖b‖.
Point P to line through A with direction d (any dim): dist = ‖(P−A) − proj_d(P−A)‖. In 2D/3D: ‖(P−A)×d‖/‖d‖.
Point P to plane n·x=c: dist = |n·P − c|/‖n‖ (signed if you drop the abs — sign tells which side).
Closest point on the line/plane = P minus the perpendicular component; reflect P by subtracting twice that component.
Distance between two skew lines: |(A₂−A₁)·(d₁×d₂)|/‖d₁×d₂‖.

Example. Distance from P=(1,2,3) to plane 2x−y+2z=3: |2−2+6−3|/√(4+1+4) = 3/3 = 1.

Two Dice — Sum & Product Tables

Roll two fair dice: 36 equally likely ordered outcomes (each 1/36). Rows = first die, columns = second; the corner shows the operator.

Sum (d₁ + d₂)

+	1	2	3	4	5	6
1	2	3	4	5	6	7
2	3	4	5	6	7	8
3	4	5	6	7	8	9
4	5	6	7	8	9	10
5	6	7	8	9	10	11
6	7	8	9	10	11	12

Product (d₁ × d₂)

×	1	2	3	4	5	6
1	1	2	3	4	5	6
2	2	4	6	8	10	12
3	3	6	9	12	15	18
4	4	8	12	16	20	24
5	5	10	15	20	25	30
6	6	12	18	24	30	36

Sum: most likely is 7 (6/36 = 1/6); the pmf is triangular over 2…12, symmetric about 7.
Product: ranges 1…36 and is far from uniform — odd only when both dice are odd, so P(product odd) = 9/36 = 1/4.

Example. P(sum = 5) = 4/36 (1-4, 2-3, 3-2, 4-1); P(product = 12) = 4/36 (2-6, 6-2, 3-4, 4-3).

Master Table — Discrete Distributions

Conventions stated explicitly (the usual trap is geometric/neg-binomial counting).

Dist	pmf	Mean	Variance	Use
DiscreteUniform{1..N}	`1/N, k=1..N`	`(N+1)/2`	`(N²−1)/12`	fair die (N=6), random index
Bernoulli(p)	`P(1)=p, P(0)=1−p`	`p`	`p(1−p)`	single yes/no trial
Binomial(n,p)	`C(n,k) pᵏ(1−p)ⁿ⁻ᵏ`	`np`	`np(1−p)`	# successes in n trials
Geometric(p) — # trials	`(1−p)ᵏ⁻¹ p, k≥1`	`1/p`	`(1−p)/p²`	trial of 1st success
Geometric(p) — # failures	`(1−p)ᵏ p, k≥0`	`(1−p)/p`	`(1−p)/p²`	failures before 1st success
NegBinom(r,p) — # failures	`C(k+r−1,k) pʳ(1−p)ᵏ`	`r(1−p)/p`	`r(1−p)/p²`	failures before r-th success
Poisson(λ)	`e⁻λ λᵏ/k!`	`λ`	`λ`	rare events, fixed interval

Geometric variance is (1−p)/p² in BOTH conventions; only the mean shifts by 1.

Master Table — Continuous Distributions

For exponential, λ is the RATE (scale = 1/λ); for Gamma, θ is the SCALE. (β below is the Beta shape parameter, not a rate.)

Dist	pdf	Mean	Variance
Uniform(a,b)	`1/(b−a), a≤x≤b`	`(a+b)/2`	`(b−a)²/12`
Exponential(λ)	`λe⁻λˣ, x≥0`	`1/λ`	`1/λ²`
Normal(μ,σ²)	`(1/√(2πσ²)) e^(−(x−μ)²/2σ²)`	`μ`	`σ²`
Lognormal(μ,σ²)	`(1/(xσ√(2π))) e^(−(ln x−μ)²/2σ²)`	`e^(μ+σ²/2)`	`(e^(σ²)−1)e^(2μ+σ²)`
Gamma(k,θ) shape/scale	`xᵏ⁻¹e⁻ˣ/θ/(Γ(k)θᵏ)`	`kθ`	`kθ²`
Beta(α,β)	`xᵅ⁻¹(1−x)ᵝ⁻¹/B(α,β)`	`α/(α+β)`	`αβ/[(α+β)²(α+β+1)]`
Chi-square(k)	`Gamma(k/2, 2)`	`k`	`2k`
Student-t(ν)	`∝(1+t²/ν)^(−(ν+1)/2)`	`0 (ν>1)`	`ν/(ν−2) (ν>2)`

Example. Beta(2,3): mean = 2/5 = 0.4; var = (2·3)/[(5)²·6] = 6/150 = 0.04, so σ ≈ 0.2.

Standard Normal — Φ, the 68-95-99.7 Rule, Tails

Z = (X−μ)/σ. Φ(z) = P(Z ≤ z), symmetric: Φ(−z) = 1−Φ(z).

Within ±1σ: 68.27%; ±2σ: 95.45%; ±3σ: 99.73%.
Two-sided 95% ⇒ z = 1.96; 90% ⇒ z = 1.645; one-sided 95% ⇒ z = 1.645.
Gaussian tail bound: P(Z > z) ≤ (1/(z√(2π))) e^(−z²/2) for z > 0.
φ(0) = 1/√(2π) ≈ 0.3989; E|Z| = √(2/π) ≈ 0.798.

Example. Returns ~ N(8%, (15%)²) annual. P(loss) = P(X < 0) = Φ(−8/15) = Φ(−0.533) ≈ 0.297. A 2σ down-move = 8% − 30% = −22%, beyond ~2.3% of the left tail.

MGFs and Their Uses

M_X(t) = E[e^(tX)]. Moments: E[Xⁿ] = M^(n)(0). Independent sums multiply MGFs.

Dist	MGF
Bernoulli(p)	`1−p+peᵗ`
Binomial(n,p)	`(1−p+peᵗ)ⁿ`
Poisson(λ)	`e^(λ(eᵗ−1))`
Exponential(λ)	`λ/(λ−t), t < λ`
Normal(μ,σ²)	`e^(μt+½σ²t²)`
Gamma(k,θ)	`(1−θt)⁻ᵏ, t < 1/θ`

Example. Sum of n iid Exp(λ): each MGF λ/(λ−t), product = (λ/(λ−t))ⁿ = Gamma(n, 1/λ). So waiting time for the n-th Poisson arrival is Gamma (Erlang).

Key Relationships and Limits

Binomial → Poisson: n→∞, p→0, np→λ fixed ⇒ Binomial(n,p) → Poisson(λ). Rule of thumb: n ≥ 100, np ≤ 10.
Sum of independent normals: N(μ₁,σ₁²)+N(μ₂,σ₂²) = N(μ₁+μ₂, σ₁²+σ₂²). Variances add; for differences too.
Poisson ↔ Exponential: Poisson(λt) counts; inter-arrival gaps are Exponential(λ), iid.
Lognormal = exp(Normal): if ln X ~ N(μ,σ²) then X is lognormal. Used for prices (always positive).
CLT: (X̄ − μ)/(σ/√n) → N(0,1). Sums of iid finite-variance terms become Gaussian.
Chi-square: Σ of k iid N(0,1)² = χ²(k). t(ν) = Z/√(χ²(ν)/ν).

Example. 1000 trades, each loses $1 w.p. 0.005 (rare). Defaults ≈ Poisson(λ=5). P(0 defaults) = e⁻⁵ ≈ 0.0067; P(≥1) ≈ 0.993.

Exponential Memorylessness

The exponential is the ONLY continuous memoryless distribution (geometric is its discrete analog). P(X > s+t | X > s) = P(X > t).

Survival: P(X > t) = e^(−λt); hazard rate is constant = λ.
Min of independent exponentials: min(Exp(λ₁),…,Exp(λₙ)) ~ Exp(λ₁+…+λₙ).
Competing risks: P(source i fires first) = λᵢ / Σλⱼ.

Example. Two market-makers quote, fill times Exp(2/min) and Exp(3/min). First fill ~ Exp(5), expected 1/5 min = 12s. P(maker A is first) = 2/5 = 40%.

Order Statistics — General Formula, Uniform & Discrete Uniform

General pdf. For n iid draws with cdf F and pdf f, the k-th smallest X₍ₖ₎ has density f₍ₖ₎(x) = [n! / ((k−1)!(n−k)!)] · F(x)ᵏ⁻¹ · (1−F(x))ⁿ⁻ᵏ · f(x) — pick which sample is the k-th (the count factor), k−1 fall below it, n−k above, one sits at x.

Continuous Uniform(0,1): U₍ₖ₎ ~ Beta(k, n−k+1), mean k/(n+1); max mean n/(n+1), min mean 1/(n+1).
Probability integral transform: F(X) ~ Uniform(0,1); inverse-cdf F⁻¹(U) generates any distribution.

Discrete Uniform{1,…,N} — m DISTINCT draws (without replacement). The k-th smallest satisfies E[X₍ₖ₎] = k(N+1)/(m+1) and Var(X₍ₖ₎) = k(m+1−k)(N+1)(N−m) / [(m+1)²(m+2)] (the discrete analog of the Beta result; note m is the sample size, N the population top).

Max (k=m): E = m(N+1)/(m+1) — the German tank problem; invert for N̂ = max·(m+1)/m − 1.
Sanity: m=1 → mean (N+1)/2, var (N²−1)/12 (one uniform draw); m=N → X₍ₖ₎=k exactly (the N−m factor kills the variance).
With replacement (iid) there is no clean closed form — E[max] = N − N⁻ᵐ·Σⱼ₌₀^(N−1) jᵐ.

Example. 9 iid Uniform(0,1): expected median (k=5) = 5/10 = 0.5, expected max = 9/10 = 0.9. Discrete, 5 distinct draws from {1,…,100}: expected max = 5·101/6 ≈ 84.2, so German-tank N̂ ≈ max·6/5 − 1.

Sums of Independent Random Variables — Additivity & Convolution

Independent sums: MGFs multiply, so means and variances always add: E[ΣXᵢ]=ΣE[Xᵢ], Var(ΣXᵢ)=ΣVar(Xᵢ) (independence needed only for variance).
Many families are closed under addition with FIXED parameters; memorize which.
Beware: ratios/products are NOT additive, and the sum of two iid uniforms is triangular, not uniform.

Sum of independent	Result
Normal(μᵢ,σᵢ²)	`Normal(Σμᵢ, Σσᵢ²)`
Poisson(λᵢ)	`Poisson(Σλᵢ)`
n × Bernoulli(p)	`Binomial(n,p)`
Binomial(nᵢ,p) (same p)	`Binomial(Σnᵢ, p)`
n × Exp(λ)	`Gamma(n, 1/λ)` = Erlang
Gamma(kᵢ,θ) (same θ)	`Gamma(Σkᵢ, θ)`
χ²(kᵢ)	`χ²(Σkᵢ)`

Example. X−Y for independent N(μ₁,σ₁²), N(μ₂,σ₂²): write as X+(−Y); means subtract but VARIANCES STILL ADD → N(μ₁−μ₂, σ₁²+σ₂²). Classic sign-trap.

Simulating Any Distribution — Inverse-CDF & Transforms

Probability integral transform: if U~Uniform(0,1) then X=F⁻¹(U) has CDF F. Conversely F(X)~Uniform(0,1).
Inverse-CDF sampling needs a closed-form, invertible F.
Exponential: solve U=1−e^(−λx) ⇒ X=−ln(U)/λ (use ln U; U and 1−U are both uniform).
Normal (no closed-form F⁻¹): Box–Muller — Z₁=√(−2 ln U₁)·cos(2πU₂), Z₂=√(−2 ln U₁)·sin(2πU₂) are iid N(0,1).
When F⁻¹ is intractable: acceptance–rejection or other transforms.

Example. Sample a fair coin from U~Uniform(0,1): return Heads if U<0.5. Sample Geometric(p) # trials: X=⌈ln(U)/ln(1−p)⌉ — the discrete inverse-CDF.

Poker Hand Probabilities — 5 cards from 52

Total hands = C(52,5) = 2,598,960. Each probability is count / 2,598,960; the counts sum to exactly that total.

Hand	Count	Combos — C(n,k) form	How to count it	Prob
Royal flush	4	`C(4,1)`	the one A-K-Q-J-10 run, one per suit	0.000154%
Straight flush (excl. royal)	36	`10·C(4,1) − 4`	10 runs × 4 suits = 40, minus 4 royals	0.00139%
Four of a kind	624	`C(13,1) C(4,4) C(48,1)`	rank × all 4 suits × any 5th card	0.0240%
Full house	3,744	`C(13,1) C(4,3) C(12,1) C(4,2)`	trip rank+suits × different pair rank+suits	0.1441%
Flush (excl. str. flush)	5,108	`C(4,1) C(13,5) − 40`	suit × any 5 ranks, minus straight flushes	0.1965%
Straight (excl. str. flush)	10,200	`10·C(4,1)⁵ − 40`	10 runs × any suit per card, minus straight flushes	0.3925%
Three of a kind	54,912	`C(13,1) C(4,3) C(12,2) C(4,1)²`	trip × two distinct off-ranks, one suit each	2.1128%
Two pair	123,552	`C(13,2) C(4,2)² C(11,1) C(4,1)`	two pair ranks+suits × kicker rank+suit	4.7539%
One pair	1,098,240	`C(13,1) C(4,2) C(12,3) C(4,1)³`	pair rank+suits × three distinct kickers	42.257%
High card	1,302,540	`[C(13,5)−10][C(4,1)⁵−4]`	any 5 ranks (not a run) × any suits (not all one)	50.118%

"x of a kind" = choose the rank (13), choose which x of its 4 suits (C(4,x)), then fill the other 5−x cards with distinct off-ranks so the hand isn't accidentally upgraded.
"excl." rows subtract the rarer hand they contain: a straight flush is also a straight and a flush, so it's removed from both to avoid double counting.
Straights let the ace be high or low — A-2-3-4-5 up to 10-J-Q-K-A is 10 runs; a wrap-around like Q-K-A-2-3 does not count.

Example — full house. Pick the trip rank (13) and 3 of its 4 suits (C(4,3)=4); pick a different pair rank (12) and 2 of its 4 suits (C(4,2)=6): 13·4·12·6 = 3,744, so P ≈ 0.144%.

Permutations vs Combinations

Order matters ⇒ permutation; order ignored ⇒ combination.

Permutations: P(n,k) = n!/(n−k)!.
Combinations: C(n,k) = n!/(k!(n−k)!), and C(n,k)=C(n,n−k).
Multiset permutations (anagrams): n!/(n₁! n₂! … nₘ!).
Circular arrangements of n distinct: (n−1)!.

Example. "MISSISSIPPI" (11 letters: M1 I4 S4 P2) arrangements = 11!/(1!·4!·4!·2!) = 39916800/1152 = 34650.

Binomial Theorem and Pascal's Triangle

(x+y)ⁿ = Σ_{k=0..n} C(n,k) xᵏ yⁿ⁻ᵏ.

Pascal's rule: C(n,k)=C(n−1,k−1)+C(n−1,k).
Row sum: Σ_k C(n,k) = 2ⁿ (subsets of an n-set).
Alternating sum: Σ_k (−1)ᵏ C(n,k) = 0 for n ≥ 1.
Vandermonde: Σ_j C(m,j)C(n,k−j) = C(m+n,k).
Hockey stick: Σ_{i=r..n} C(i,r) = C(n+1,r+1).

Example. Coefficient of x²y³ in (x+y)⁵ = C(5,2) = 10. Check row 5 sum: 1+5+10+10+5+1 = 32 = 2⁵.

Stars and Bars

Distribute n identical items into k labeled bins.

Nonnegative solutions to x₁+…+x_k = n: C(n+k−1, k−1).
Strictly positive (each bin ≥ 1): C(n−1, k−1).
With lower bounds, substitute yᵢ = xᵢ − Lᵢ and reduce n.

Example. Ways to split $10 (whole dollars) among 4 funds, any fund may get $0: C(10+4−1, 4−1) = C(13,3) = 286. If each fund must get ≥ $1: C(9,3) = 84.

Inclusion–Exclusion

|A∪B∪C| = Σ|Aᵢ| − Σ|Aᵢ∩Aⱼ| + |A₁∩A₂∩A₃|; alternate signs in general.

Example. Integers 1–100 divisible by 2, 3, or 5. ⌊100/2⌋+⌊100/3⌋+⌊100/5⌋ = 50+33+20 = 103. Subtract pairs: ⌊100/6⌋+⌊100/10⌋+⌊100/15⌋ = 16+10+6 = 32. Add triple ⌊100/30⌋ = 3. Total = 103 − 32 + 3 = 74.

Derangements

Permutations with no fixed point. Dₙ = n! Σ_{k=0..n} (−1)ᵏ/k!, recursion Dₙ = (n−1)(Dₙ₋₁ + Dₙ₋₂).

Values: D₁=0, D₂=1, D₃=2, D₄=9, D₅=44.
Dₙ/n! → 1/e ≈ 0.3679 fast — the "hat-check" limit.

Example. 4 people swap hats randomly. P(nobody gets own hat) = D₄/4! = 9/24 = 0.375 ≈ 1/e.

Catalan Numbers

Cₙ = C(2n,n)/(n+1) = C(2n,n) − C(2n,n+1); sequence 1,1,2,5,14,42,132,…; recurrence Cₙ₊₁ = Σ Cᵢ Cₙ₋ᵢ.

Counts: balanced parenthesizations of n pairs; lattice/Dyck paths (0,0)→(n,n) staying weakly below the diagonal; binary trees with n nodes; triangulations of an (n+2)-gon; non-crossing matchings of 2n points.
Trigger: counting objects with a never-go-negative / never-cross constraint built from n up-steps and n down-steps.
Derive by reflection (André): total paths C(2n,n) minus bad paths (touch above the diagonal), which biject to C(2n,n+1).

Example. Parenthesizing a product of 4 factors = C₃ = 5; lattice paths (0,0)→(3,3) never crossing above the diagonal = C₃ = 5.

Indicators and Linearity of Expectation

E[ΣXᵢ] = ΣE[Xᵢ] always — even when Xᵢ are dependent. Write counts as sums of 0/1 indicators.

Example. Expected fixed points of a random permutation of n: indicator Iᵢ = 1 if position i fixed, E[Iᵢ] = 1/n. Sum ⇒ E = n·(1/n) = 1, for every n. (Matches derangement intuition: avg one match.)

Example 2. Expected number of distinct birthdays among 30 people (365 days): E = 365·(1 − (364/365)³⁰) ≈ 365·0.0792 ≈ 28.9.

Pigeonhole Principle

Put n items in m boxes ⇒ some box has ≥ ⌈n/m⌉ items. Powerful for existence arguments.

Example. Among any 13 people, two share a birth month (13 > 12). Among 5 points in a unit square, two are within √2/2 distance (split into four ¼-squares; ⌈5/4⌉ = 2 in one cell, diagonal √2/2).

Generating Functions

Encode a sequence aₙ as coefficients of G(x) = Σ aₙ xⁿ; products = convolutions (great for sums of independent counts).

One die: x+x²+…+x⁶. Two dice: square it; coefficient of xˢ = ways to roll sum s.
1/(1−x) = Σ xⁿ; 1/(1−x)ᵏ = Σ C(n+k−1, k−1) xⁿ (recovers stars-and-bars).
Probability GF: E[zˣ]; for Poisson it is e^(λ(z−1)).

Example. Ways two dice sum to 7: coefficient of x⁷ in (x+…+x⁶)² = 6. Sum to 4: coefficient of x⁴ = 3 (1+3, 2+2, 3+1).

Counting Decision Tree: Which Formula When

First ask two questions: does order matter, and is repetition allowed?
Order matters, no repeat: P(n,k)=n!/(n−k)!. Order matters, repeat: nᵏ.
Order doesn't matter, no repeat: C(n,k)=n!/(k!(n−k)!). Order doesn't matter, repeat (multiset): C(n+k−1,k) — this is stars & bars.
Distinct objects into groups of sizes k₁…kₘ (multinomial): n!/(k₁!k₂!…kₘ!).
Pick the cell by (order?, repeat?) before reaching for any formula — most errors are choosing the wrong cell, not the wrong arithmetic.

Example. Arrangements of MISSISSIPPI (11 letters: I×4, S×4, P×2, M×1): order matters with identical items, so 11!/(4!·4!·2!·1!)=34650.

Circular and Symmetric Arrangements

Seating n distinct people around a round table: fix one person to kill rotational symmetry, giving (n−1)!.
If reflections also count as the same (a bracelet): divide by 2 more, (n−1)!/2 for n≥3.
The principle: when k arrangements are all equivalent under a symmetry, divide the linear count by k — valid only when every orbit has the SAME size.
Divide-by-symmetry only works cleanly when no arrangement is fixed by the symmetry; when some are (e.g. symmetric necklaces), use Burnside instead.
Burnside (orbit-counting): distinct configs = average number of arrangements fixed by each symmetry, (1/|G|)·Σ|Fix(g)|.

Example. 5 people at a round table: (5−1)!=24. As a flippable bracelet of 5 distinct beads: 4!/2=12.

Markov Chains — Transition Matrix and Stationarity

Memoryless in state: P(Xₙ₊₁=j | Xₙ=i, past) = Pᵢⱼ. Rows of P sum to 1. n-step probabilities = Pⁿ.

Stationary distribution π solves πP = π, Σπᵢ = 1.
Ergodic (irreducible + aperiodic, finite) ⇒ unique π and Pⁿ → 1π regardless of start.
Reversible if πᵢPᵢⱼ = πⱼPⱼᵢ (detailed balance) ⇒ such π is stationary.

Example. Two states, P = [[0.9, 0.1],[0.2, 0.8]]. Solve 0.9π₁+0.2π₂=π₁ ⇒ 0.2π₂=0.1π₁ ⇒ π₁=2π₂. With π₁+π₂=1: π = (2/3, 1/3).

Simple Random Walk — Recurrence and Hitting Times

Symmetric ±1 steps. E[Sₙ]=0, Var(Sₙ)=n.

Pólya recurrence: recurrent in 1D and 2D, transient in 3D and higher.
Gambler's ruin on {0,…,N}, fair: P(hit N before 0 | start i) = i/N; expected duration = i(N−i).
Biased (up p, down q=1−p): ruin prob uses ratio r=q/p; P(reach N)=(1−rⁱ)/(1−rᴺ).

Example. Start with $30, fair bets, target $100 or bust. P(reach 100) = 30/100 = 0.30; expected number of $1 bets = 30·70 = 2100.

Martingales and Optional Stopping

E[Xₙ₊₁ | past] = Xₙ (fair game). Optional Stopping Theorem: E[X_τ] = E[X₀] for stopping time τ when (e.g.) τ is bounded, or bounded increments + finite E[τ].

For SRW, both Sₙ and Sₙ² − n are martingales.
Using Sₙ²−n at the ruin stopping time recovers E[duration].

Example. Fair gambler's ruin {0,N}, start i. Sₙ martingale ⇒ E[S_τ] = i = N·P(hit N) ⇒ P = i/N. Then E[S_τ²−τ]=i² ⇒ E[τ] = E[S_τ²]−i² = N²(i/N) − i² = i(N−i).

Brownian Motion — Properties

Standard BM Wₜ:

W₀ = 0; independent increments; Wₜ − Wₛ ~ N(0, t−s).
Continuous paths but nowhere differentiable.
Cov(Wₛ, Wₜ) = min(s,t).
Self-similar: W_{ct} =ᵈ √c · Wₜ. Scales as √t (the "square-root-of-time" rule).
Quadratic variation over [0,t] = t (deterministic).

Example. W₁ = 0.5 observed. E[W₃ | W₁=0.5] = 0.5 (martingale); Var(W₃ | W₁) = 3−1 = 2, so W₃ | W₁ ~ N(0.5, 2).

Poisson Process

Counting process N(t), rate λ:

N(t) ~ Poisson(λt); independent increments over disjoint intervals.
Inter-arrival times iid Exponential(λ); n-th arrival time ~ Gamma(n, 1/λ).
Superposition: independent rates λ₁+λ₂ ⇒ Poisson(λ₁+λ₂). Thinning by p ⇒ Poisson(λp).
Given N(t)=n, the n arrival times are distributed as n iid Uniform(0,t) order statistics.

Example. Trades arrive at λ=4/min. P(no trade in 30s) = e^(−4·0.5) = e⁻² ≈ 0.135. Expected gap = 1/4 min = 15s.

Itô's Lemma and Quadratic Variation

Key rule: (dW)² = dt, dt·dW = 0, (dt)² = 0. For dX = a dt + b dW and f(t,X): df = (∂f/∂t + a ∂f/∂x + ½b² ∂²f/∂x²) dt + b ∂f/∂x dW. The ½b²f_xx term is the Itô correction — absent in ordinary calculus.

Example. f(W)=W². df = 0 + 0 + ½·1·2 dt + 1·2W dW = dt + 2W dW. Integrating: Wₜ² = t + 2∫W dW, confirming E[Wₜ²]=t since the Itô integral has mean 0.

Geometric Brownian Motion (GBM)

Model for prices: dS = μS dt + σS dW. Solution (via Itô on ln S): Sₜ = S₀ exp((μ − ½σ²)t + σWₜ).

Drift correction: log-return has mean (μ−½σ²)t, NOT μt.
E[Sₜ] = S₀ e^(μt); Sₜ is lognormal.
Var(Sₜ) = S₀² e^(2μt)(e^(σ²t) − 1).

Example. S₀=100, μ=10%, σ=20%, t=1. E[S₁]=100e^0.10≈110.5. Median = 100·exp(0.10−0.02)=100e^0.08≈108.3 — below the mean (right-skew of lognormal).

Ornstein–Uhlenbeck (Mean Reversion)

dX = θ(μ − X) dt + σ dW, θ > 0 = reversion speed.

Mean: E[Xₜ] = μ + (X₀ − μ)e^(−θt) → μ.
Stationary variance: σ²/(2θ); stationary law N(μ, σ²/(2θ)).
Half-life of a shock: ln 2 / θ.

Example. θ=2/yr, σ=0.3, μ=0. Long-run variance = 0.09/(2·2) = 0.0225, σ_∞ = 0.15. Shock half-life = ln2/2 ≈ 0.347 yr ≈ 4.2 months.

Itô Integral and SDE Toolkit

∫₀ᵗ f dW is defined for adapted, square-integrable f; it is a martingale with E[∫f dW]=0.
Itô isometry: E[(∫₀ᵗ f dW)²]=E[∫₀ᵗ f² ds] — converts a stochastic integral's variance into an ordinary time integral.
Multiplication rules: dW·dW=dt, dt·dW=0, dt·dt=0. This is why second-order terms survive in Itô's lemma.
An Itô process dX=μ(X,t)dt+σ(X,t)dW with zero drift μ≡0 is a local martingale — a true martingale when σ is square-integrable (E[∫σ²ds]<∞).
Drift = predictable trend; diffusion = the only term carrying randomness. Kill the drift to get a martingale.
Integration by parts (no cross term unless both are Itô): d(XY)=X dY+Y dX+dX·dY.

Example. For Xₜ=∫₀ᵗ W_s dW_s, apply Itô to Wₜ²: d(Wₜ²)=2Wₜ dWₜ+dt, so ∫₀ᵗ W dW=½(Wₜ²−t). The −t/2 is the Itô correction a Riemann integral would miss.

Risk-Neutral Measure, Girsanov and Feynman–Kac

Under risk-neutral ℚ, discounted assets are martingales: S₀=E^ℚ[e^{−rT}S_T]; every tradable's drift becomes r.
Girsanov: changing measure shifts the drift but leaves volatility and quadratic variation unchanged. W̃ₜ=Wₜ+∫₀ᵗ θ_s ds is a ℚ-Brownian motion.
Radon–Nikodym density is an exponential martingale: dℚ/dℙ=exp(−∫θ dW−½∫θ² ds).
Market price of risk θ=(μ−r)/σ is the drift adjustment that turns the real-world drift μ into r.
Feynman–Kac: the PDE ∂_t V+μ ∂_x V+½σ² ∂_xx V−rV=0 with terminal data V(T,x)=payoff has solution V=E^ℚ[e^{−r(T−t)}payoff].
Girsanov re-drifts; Feynman–Kac is the bridge between the pricing PDE and the discounted expectation.

Example. Black–Scholes: under ℚ, dS=rS dt+σS dW̃. Feynman–Kac says the BS PDE's solution equals e^{−rT}E^ℚ[(S_T−K)⁺] — same answer two ways.

First-Passage Times for Brownian Motion

Reflection principle (running max): P(max_{s≤t} Wₛ ≥ a) = 2·P(Wₜ ≥ a) = 2Φ(−a/√t) for a>0.
Hitting time τ_a = inf{t : Wₜ = a} has the Lévy (stable-½) density (a/√(2πt³)) e^(−a²/2t); heavy-tailed, so E[τ_a] = ∞ for driftless BM.
τ_a is a.s. finite (1D BM is recurrent) but has infinite mean — it hits, but you cannot bound when.
With drift Xₜ = μt + σWₜ: hitting a>0 is certain only if μ ≥ 0; if μ < 0, P(ever hit a) = exp(2μa/σ²) < 1.
Two-sided exit (levels −b, a), no drift: P(hit a first) = b/(a+b).

Example. P(BM exceeds 1 sometime in [0,1]) = 2·P(W₁ ≥ 1) = 2(1 − Φ(1)) = 2·0.1587 ≈ 0.317.

Simple Linear Regression

Fit a line ŷ = β₀ + β₁x through a cloud of points. The slope and intercept that minimize squared error have closed forms:

β₁ = Cov(x,y) / Var(x) (slope = how much y moves per unit x)
β₀ = ȳ − β₁x̄ (line passes through the mean point (x̄, ȳ))

Interpretation: β₁ is the expected change in y for a one-unit increase in x, holding the model fixed. Note β₁ = r · (sy/sx), so the slope is the correlation rescaled by the ratio of standard deviations.

Example. x = [1,2,3,4,5], y = [2,4,5,4,6]. Then x̄=3, ȳ=4.2, Cov(x,y)=2.0, Var(x)=2.5. So β₁ = 2.0/2.5 = 0.8 and β₀ = 4.2 − 0.8·3 = 1.8 → ŷ = 1.8 + 0.8x. At x=3, ŷ = 4.2.

OLS: Normal Equations & the Geometry

OLS chooses β to minimize the sum of squared residuals SSR = Σ(yᵢ − ŷᵢ)². Setting the gradient to zero gives the normal equations:

XᵀX β = Xᵀy → β = (XᵀX)⁻¹Xᵀy (when XᵀX is invertible)
Residuals are orthogonal to every column of X: Xᵀε = 0

Geometric view: ŷ is the orthogonal projection of y onto the column space of X. The hat matrix H = X(XᵀX)⁻¹Xᵀ does the projection (ŷ = Hy); the residual e = (I−H)y is the part of y orthogonal to that space. Minimizing distance = dropping a perpendicular.

Example. With an intercept-only model (X = column of 1s), (XᵀX)⁻¹Xᵀy = (1/n)Σyᵢ = ȳ. OLS reduces to the sample mean — the projection of y onto the constant vector is its average. Residuals then sum to zero, the defining orthogonality condition.

Gauss–Markov: When OLS is BLUE

Under these assumptions, OLS is the Best Linear Unbiased Estimator — minimum variance among all linear unbiased estimators:

Assumption	What it buys
Linearity in parameters	Model is correctly specified
`E[ε\|X] = 0` (exogeneity)	Unbiasedness: `E[β̂] = β`
Homoskedasticity `Var(εᵢ) = σ²`	Efficiency / correct standard errors
No autocorrelation `Cov(εᵢ,εⱼ)=0`	Efficiency / correct SEs
No perfect multicollinearity	XᵀX invertible → β identified

Normality of errors is NOT required for BLUE. It is only needed for exact finite-sample t- and F-inference. In large samples the CLT delivers approximately normal β̂ regardless, so normality is an inference convenience, not a requirement for unbiasedness or efficiency.

R² and Adjusted R²

R² = 1 − SSR/SST = ESS/SST, the fraction of variance in y explained by the model. Here SST = Σ(yᵢ−ȳ)², SSR = Σ(yᵢ−ŷᵢ)².

R² never decreases when you add a regressor, even a useless one → it cannot be used for model selection.
Adjusted R² = 1 − (1−R²)(n−1)/(n−k−1) penalizes extra predictors (k = number of predictors excluding the intercept); it can fall when a variable adds noise.
R² says nothing about whether the model is correct, whether coefficients are causal, or whether predictions will generalize. High R² with bad residual patterns is still a bad model.

Example. From the SLR fit ŷ=1.8+0.8x: SST=8.8, SSR=2.4, giving R² = 1 − 2.4/8.8 ≈ 0.727. With n=5, k=1: adjR² = 1 − (1−0.727)(4/3) ≈ 0.636. The penalty is large here because n is tiny relative to parameters.

Standard Errors, t-tests, F-test

Coefficient covariance: Var(β̂) = σ²(XᵀX)⁻¹, with σ̂² = SSR/(n−k−1). SE(β̂ⱼ) is the square root of the j-th diagonal element.

t-test for a single coefficient: t = β̂ⱼ / SE(β̂ⱼ), df = n−k−1. Tests H₀: βⱼ = 0.
F-test for joint significance: F = (R²/k) / ((1−R²)/(n−k−1)). Tests H₀: all slopes = 0. Use the nested form F = [(SSRᵣ−SSRᵤ)/q] / [SSRᵤ/(n−k−1)] to compare a restricted vs unrestricted model.
Rule of thumb: |t| > 2 ≈ significant at 5% for moderate df.

Example. Suppose β̂₁ = 0.8 with SE = 0.30. Then t = 0.8/0.30 ≈ 2.67. With df large, the two-sided p-value ≈ 0.008 < 0.05, so reject H₀ — the slope is statistically distinguishable from zero.

Confidence vs Prediction Intervals

Both are centered at ŷ₀ at a new point x₀, but they answer different questions:

Confidence interval for the mean response E[y|x₀]: width ∝ √(Var(ŷ₀)).
Prediction interval for a single new observation y₀: adds the irreducible error variance σ²: width ∝ √(σ² + Var(ŷ₀)).

The prediction interval is always wider than the confidence interval because it must account for both estimation uncertainty and the new observation's own noise. Both fan out as x₀ moves away from x̄.

Example. If σ̂ = 1.0 and Var(ŷ₀) = 0.25 at x₀, the CI half-width uses √0.25 = 0.5, while the PI half-width uses √(1.0 + 0.25) = √1.25 ≈ 1.118 — more than twice as wide (before the t-multiplier). Forecasting one bond's return is far less certain than estimating the average return.

Multicollinearity & Omitted-Variable Bias

Multicollinearity: regressors are highly correlated. Symptoms: huge standard errors, unstable/sign-flipping coefficients, high R² but few individually significant predictors.

Detect with VIF: VIFⱼ = 1/(1−Rⱼ²), where Rⱼ² is from regressing xⱼ on the other predictors. VIF > 10 (Rⱼ² > 0.9) is a common red flag.
Fixes: drop a redundant variable, combine into an index/PCA, or use ridge regression.
It inflates variance but does NOT bias coefficients — predictions can still be fine.

Omitted-variable bias (OVB): leaving out a relevant correlated regressor biases the included ones. If true model is y = β₁x₁ + β₂x₂ + ε and you omit x₂, the bias on β̂₁ is β₂ · δ, where δ is the slope of x₂ regressed on x₁.

Example (OVB). True effect β₁ = 2, with x₂ having effect β₂ = 3 and δ = 0.5 (x₂ rises 0.5 per unit x₁). Omitting x₂ gives estimated slope ≈ 2 + 3·0.5 = 3.5 — a +1.5 bias. Sign of bias = sign(β₂)·sign(δ). Example (VIF). If Rⱼ² = 0.9 then VIF = 1/0.1 = 10: the SE of that coefficient is √10 ≈ 3.2× larger than with no collinearity.

Heteroskedasticity & Autocorrelation

Violating constant/uncorrelated errors keeps OLS unbiased but makes the default standard errors wrong, so t/F tests mislead.

Heteroskedasticity (Var(εᵢ) changes with x): detect via residual-vs-fitted "funnel/fan" plot, Breusch–Pagan or White test. Fix: White/Huber heteroskedasticity-robust SEs (HC).
Autocorrelation (errors correlated over time, common in finance): detect via Durbin–Watson (≈2 means none; near 0 = positive autocorrelation) or residual ACF plot. Fix: Newey–West (HAC) SEs, which correct for both heteroskedasticity and autocorrelation up to a chosen lag.

Robust SEs change inference only — the point estimates β̂ are unchanged.

Example. Regressing daily returns on a factor, residuals show volatility clustering (heteroskedastic) and serial correlation. A naive t-stat of 3.0 might drop to 1.6 under Newey–West with 5 lags — the apparent significance was an artifact of understated SEs. Always report HAC SEs for time-series return regressions.

Regularization: Ridge (L2) vs Lasso (L1)

Add a penalty to shrink coefficients, trading a little bias for a large variance reduction — helpful when predictors are many or collinear.

	Ridge (L2)	Lasso (L1)
Penalty	`SSR + λΣβⱼ²`	`SSR + λΣ\|βⱼ\|`
Closed form	`β = (XᵀX + λI)⁻¹Xᵀy`	No closed form (LP/coordinate descent)
Effect	Shrinks toward 0, rarely exactly 0	Drives some β exactly to 0 → sparsity / feature selection
Best when	Many correlated predictors all matter	Few predictors truly matter

Lasso does variable selection; ridge does not. Elastic Net mixes both. Larger λ → more shrinkage → more bias, less variance. Standardize predictors first, and never penalize the intercept. As λ→0 both → OLS; as λ→∞ both → 0.

Example. Ridge with one standardized predictor: closed form gives β = Xᵀy/(XᵀX+λ). If OLS β=1.0 and XᵀX=10, then λ=10 halves it to β = 10/20 = 0.5. Lasso instead would zero it out entirely once λ exceeds the soft-threshold cutoff.

Logistic Regression

For a binary outcome y ∈ {0,1}, model the log-odds linearly:

log(p/(1−p)) = β₀ + β₁x, so p = 1/(1 + e^−(β₀+β₁x)) (sigmoid, always in (0,1)).
e^βⱼ is the odds ratio — the multiplicative change in odds per unit increase in xⱼ. Fit by maximum likelihood, not least squares.

Why not OLS on a 0/1 target? OLS can predict probabilities <0 or >1, the errors are inherently heteroskedastic (variance p(1−p) depends on x), and the linear-probability fit is biased near the bounds. The logit link fixes all three.

Example. β₁ = 0.7 → odds ratio e^0.7 ≈ 2.01: each unit of x roughly doubles the odds of y=1. If at some x the log-odds = 0, then p = 1/(1+e⁰) = 0.5; at log-odds = −1, p = 1/(1+e¹) ≈ 0.269.

Bias–Variance, Overfitting & Cross-Validation

Expected test error decomposes as Bias² + Variance + Irreducible σ². Simple models underfit (high bias); flexible models overfit (high variance — they memorize noise). The sweet spot minimizes their sum.

Overfitting sign: train error ≪ test error. Cure with more data, fewer features, or regularization.
k-fold CV: split into k folds, train on k−1, validate on the rest, average. Estimates out-of-sample error and tunes λ.
Time-series: NEVER use random k-fold — it leaks the future into the past. Use walk-forward / expanding-window validation: always train on data strictly before the validation period. Also embargo/purge around the split to avoid overlap leakage.

Example. A degree-10 polynomial fits 11 training points with R²=1.0 (zero train error) but swings wildly between points → test R² may be negative. A degree-1 or -2 fit has higher train error but far lower test error. In a return model, evaluating with shuffled CV instead of walk-forward can turn a truly worthless signal into a falsely "profitable" backtest.

Residual Diagnostics

Residuals should look like featureless noise. Key plots and what a pattern means:

Plot	Healthy	Pattern → problem
Residuals vs fitted	Random band around 0	Curve → nonlinearity; funnel → heteroskedasticity
Q–Q plot	Points on the line	Heavy tails / skew → non-normal errors (matters for inference)
Residual ACF / vs order	No autocorrelation	Waves → serial correlation (time-series)
Leverage / Cook's distance	All small	Outlier or high-leverage point distorting the fit

Example. Fitting a line to data that curves upward leaves a U-shaped residual-vs-fitted plot (negative, then positive, then negative residuals). The fix is to add a quadratic term x² or transform (log y), not to keep the linear model. A single high-leverage point with large Cook's distance (> ~4/n) can flip a slope's sign — investigate before trusting the fit.

Finance Example: CAPM Beta as a Regression

CAPM/factor models are regressions. Regress excess asset returns on excess market (or factor) returns:

Rᵢ − R_f = α + β(R_m − R_f) + ε
β = Cov(Rᵢ, R_m) / Var(R_m) — the slope; measures systematic (non-diversifiable) risk.
α (intercept) = risk-adjusted excess return; a significantly positive α is "alpha." Extend to Fama–French: add SMB, HML, etc., as extra regressors → multivariate OLS with one β per factor.

Example. Market excess returns R_m = [−2%, −1%, 0%, 2%, 3%] and asset excess returns Rᵢ = [−2.5%, −1%, 0.5%, 3.5%, 5%]. Then Cov/Var → β = 1.5 and α = R̄ᵢ − β·R̄_m = 0.5%. β=1.5 means the stock moves 1.5× the market (aggressive); the 0.5% α is its average excess return unexplained by market exposure. Always report Newey–West SEs on β since return residuals are autocorrelated.

WLS, GLS & Robust Standard Errors

Non-spherical errors Var(ε)=σ²Ω (Ω≠I) leave OLS unbiased but no longer BLUE, and the textbook SE formula is wrong.
GLS reweights: β̂_GLS=(XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y — equivalent to OLS on data premultiplied by Ω^(−1/2) (whitening). It is BLUE when Ω is known.
WLS = GLS for the diagonal (pure heteroskedasticity) case: weight each obs by wᵢ=1/Var(εᵢ), downweighting noisy points. Feasible GLS estimates Ω first, then plugs in.
If you don't trust your model for Ω, keep OLS point estimates but fix the SEs: White/HC (heteroskedasticity) or Newey–West/HAC (also autocorrelation).
Robust SEs change inference only, not β̂; WLS/GLS change both. Robust SEs are the default in practice because misspecifying Ω can make FGLS worse than OLS.

Example. Errors with Var(εᵢ)=σ²xᵢ²: dividing the whole equation by xᵢ gives homoskedastic errors — that transformed OLS is WLS with wᵢ=1/xᵢ². On finance return panels, the safe move is OLS + HAC SEs rather than guessing Ω.

Frisch–Waugh–Lovell & Partialling-Out

FWL: in y=X₁β₁+X₂β₂+ε, the OLS estimate β̂₂ is unchanged if you instead (1) regress y on X₁ and take residuals ỹ, (2) regress X₂ on X₁ and take residuals X̃₂, then (3) regress ỹ on X̃₂.
A multiple-regression coefficient is the bivariate slope on the part of that regressor orthogonal to all the others — "holding the others fixed" made precise.
Explains why adding a regressor changes existing coefficients: it strips shared variation out (this is exactly OVB unwound).
Justifies demeaning for an intercept (X₁=1 ⇒ partial out the mean) and within-transform for fixed effects — you needn't carry dummies in the design matrix.
If X̃₂ has little variation left (X₂ well-explained by X₁), that coefficient's SE blows up — the same mechanism as multicollinearity/VIF.

Example. Want the effect of a factor on returns controlling for market β: regress the factor on market return, take residuals (the factor's market-neutral component), then regress returns on those residuals. The slope equals the multivariate coefficient — and is why a factor's marginal alpha can shrink once you control for market exposure.

OLS as MLE & Spurious Regression in Time Series

Under ε~N(0,σ²I), minimizing SSR is identical to maximizing the Gaussian likelihood — OLS = MLE when errors are normal, which is why OLS inherits MLE efficiency and the exact t/F distributions.
MLE for σ² divides SSR by n (biased); OLS uses n−k−1 (k = predictors excluding the intercept) for unbiasedness. Normality is needed for exact small-sample t/F; large samples lean on the CLT instead.
Spurious regression: regressing one independent random walk on another yields high R² and large t-stats by pure chance — the t-distribution is invalid because residuals are non-stationary (a unit root).
Tell-tale sign: very low Durbin–Watson (≈0) alongside high R² ⇒ suspect spuriousness, not a real relationship.
Fix: regress in differences (returns, not prices) or use cointegration/ECM when levels share a common stochastic trend.

Example. Regressing the price level of stock A on stock B (both I(1), unrelated) can give R²>0.9 and "t>10" — meaningless. Switching to daily returns (both stationary) collapses R² toward its honest near-zero value. This is why quant signal research works in returns, not raw prices.

Payoffs & Moneyness

A call gives the right to buy at strike K; a put the right to sell at K. At expiry: call = max(S−K, 0), put = max(K−S, 0). Long option = limited loss (premium), unlimited (call) / large (put) upside; short option = capped gain (premium), large loss.

ITM: call S>K, put S<K — has intrinsic value.
ATM: S≈K — maximum time value, maximum Γ and θ.
OTM: call S<K, put S>K — all time value, expires worthless if unmoved.

Option value = intrinsic + time (extrinsic) value. Premium is always ≥ intrinsic (no-arbitrage).

Example. Stock S=102, you hold a K=100 call bought for 3.50. Intrinsic = 102−100 = 2; time value = 3.50−2 = 1.50. At expiry with S=105 payoff = 5, profit = 5−3.50 = 1.50; with S=99 payoff = 0, loss = full 3.50 premium.

Put–Call Parity & Arbitrage

For European options on a non-dividend stock: C − P = S − Ke^(−rT). Intuition: long call + short put = synthetic forward (you end up owning the stock at K regardless of S). With dividend yield q: C − P = Se^(−qT) − Ke^(−rT).

If LHS > RHS: sell the call, buy the put, buy stock, borrow Ke^(−rT) → lock riskless profit (conversion).
Reverse mispricing → reversal.
Parity is model-free: it holds by replication, independent of vol or distribution.

Example. S=100, K=100, r=5%, T=1, q=0. Ke^(−rT) = 100·e^(−0.05) = 95.12. So C−P = 100−95.12 = 4.88. If a 1-yr call trades 7.00 and the put trades 1.50, then C−P = 5.50 > 4.88 → call rich by 0.62. Sell call, buy put, buy stock, borrow 95.12: net cash today = 7.00−1.50−100+95.12 = +0.62 locked, position self-liquidates to 0 at expiry.

The Greeks (table)

Sensitivities of option price V to inputs. Signs are for a long position.

Greek	Measures ∂V/∂	Call	Put	Scaling
Δ (delta)	spot S	0→+1	−1→0	—
Γ (gamma)	delta vs S (∂²V/∂S²)	+	+	peaks ATM, ∝ 1/(σ√T)
ν (vega)	vol σ	+	+	∝ √T
θ (theta)	time (decay)	−*	−*	∝ 1/√T (ATM)
ρ (rho)	rate r	+	−	∝ T

*Long options usually have θ<0 (decay); deep-ITM European puts can have θ>0. Γ and θ have opposite signs for a long position — you pay theta to own gamma. Vega and gamma both peak ATM and rise with maturity / falling vol respectively.

Example. ATM 1-month call, Δ≈0.50, Γ=0.05/$. Stock jumps +$2 → new Δ ≈ 0.50 + 0.05·2 = 0.60, and the call gains ≈ Δ·ΔS + ½Γ·ΔS² = 0.50·2 + 0.5·0.05·4 = 1.00 + 0.10 = 1.10.

Black–Scholes: Formula, Assumptions, Failures

C = S·N(d₁) − Ke^(−rT)·N(d₂), with d₁ = [ln(S/K) + (r + σ²/2)T] / (σ√T) and d₂ = d₁ − σ√T. Here N(d₁) = Δ of the call; N(d₂) = risk-neutral probability of finishing ITM.

Assumptions: GBM (lognormal S, constant σ), continuous frictionless trading, constant r, no jumps, no transaction costs.
Failures: real returns have fat tails & jumps; vol is stochastic and clusters; this produces the vol smile/skew that flat-σ BS cannot fit. Traders invert BS to quote implied vol rather than price.

Example. S=K=100, r=0, σ=20%, T=1. d₁ = (0 + 0.02)/0.20 = 0.10, d₂ = −0.10. N(0.10)≈0.5398, N(−0.10)≈0.4602. C = 100·0.5398 − 100·0.4602 = 7.97. Cross-check the ATM rule: 0.4·S·σ·√T = 0.4·100·0.20·1 = 8.0 ✓.

Implied vs Realized Vol & the Variance Risk Premium

Realized vol (RV): the actual annualized stdev of returns that occurred. Implied vol (IV): the σ that makes BS match the market option price — the market's forward-looking forecast plus a risk premium.

Variance Risk Premium (VRP) = implied variance − expected realized variance, on average > 0.
IV typically exceeds subsequent RV (option sellers are compensated for bearing crash/vol risk) → systematic short-vol carry, but with large negative skew (sell-vol blows up in crashes).
Trading the spread: long straddle if you think RV will exceed IV; short straddle / variance if IV is rich.

Example. 30-day SPX IV = 18% (implied variance 0.0324). The 30 days actually realize 14% vol (variance 0.0196). A delta-hedged short straddle captures ≈ ½·(vega/σ)·(IV²−RV²) edge; VRP = 0.0324 − 0.0196 = 0.0128 of annualized variance earned by the seller.

Vol Smile, Skew & the Surface

Plotting IV vs strike at fixed maturity gives a smile (FX, balanced) or a skew/smirk (equity indices: OTM puts > ATM > OTM calls). Adding the maturity axis gives the vol surface IV(K,T).

Equity skew drivers: crash-fear (demand for downside puts), leverage effect (price down → leverage & vol up), and fat left tail.
Term structure: normally upward-sloping (more uncertainty far out); inverts in a panic (front-month IV spikes above back-month).
Skew steepness ≈ market's pricing of jump/crash risk; flattens in calm regimes.

Example. SPX 1-month: 90% strike put IV = 24%, 100% ATM = 18%, 110% call IV = 15%. The 9-point left-vs-right gap is the equity skew — the market charges far more for downside protection than for upside calls of equal moneyness.

Delta-Hedging & the Gamma–Theta Tradeoff

A delta-hedged long option is locally market-neutral but long Γ: every move (up or down) lets you rebalance profitably — buy low, sell high — earning realized variance. You pay for this with time decay θ. The P&L of a delta-hedged book over dt is approximately P&L ≈ ½·Γ·S²·(ΔS/S)² + θ·dt, and BS sets θ ≈ −½·Γ·σ²·S² so that breakeven occurs exactly when realized vol = implied vol.

Long gamma profits when RV > IV; short gamma profits when RV < IV.
Hedging gaps/jumps hurt long-gamma less than the theta bleed in quiet markets hurts.

Example. Γ=0.10, S=100, σ=20%/yr. Daily θ from the rule: −½·0.10·(0.20²/252)·100² = −½·0.10·(0.0001587)·10000 ≈ −0.079/day. If the stock actually moves 1.5% (=1.5) that day, gamma P&L = ½·0.10·1.5² = 0.1125; net ≈ 0.1125 − 0.079 = +0.033 because realized (1.5%/day ≈ 24%/yr) > implied (20%).

Spreads & Combos — the View Each Expresses

Structure	Construction	View
Straddle	long call + long put, same K	big move, direction unknown (long vol)
Strangle	long OTM call + OTM put	cheaper long-vol, needs bigger move
Bull call (vertical)	long low-K call, short high-K call	moderately bullish, capped, cheaper
Butterfly	+1 low / −2 mid / +1 high call	pin near mid K; short vol, defined risk
Calendar	short near-dated, long far-dated, same K	long term-structure / forward vol; wants stillness now
Risk reversal	long OTM call, short OTM put	bullish + short skew (sells expensive put)

Example. Bull call: buy 100-call for 4, sell 110-call for 1.50 → net debit 2.50. Max loss 2.50 (S≤100); max gain = 10−2.50 = 7.50 (S≥110); breakeven S = 102.50. Defined-risk bullish bet far cheaper than the outright 100-call.

Early Exercise — American Call/Put Asymmetry

Never optimal to early-exercise an American call on a non-dividend stock — you'd throw away remaining time value and the interest on K; better to sell the option. So American call = European call here.

Calls: only exercise early to capture a dividend (just before ex-date if the dividend exceeds the remaining time value + carry).
Puts: can be optimal to exercise early even with no dividends — exercising a deep-ITM put frees up cash K to earn interest now; the deeper ITM and higher r, the stronger the case. So American put > European put strictly.

Example. Deep-ITM put: K=100, S=2, r=8%, 6 months left. Exercise now → receive 98 cash, earning ≈ 98·0.08·0.5 ≈ 3.9 in interest. The option's remaining time value is tiny (almost nothing left to gain on the downside since S can only fall to 0), so early exercise dominates.

Variance Swaps & VIX

A variance swap pays N_var·(RV² − K_var²) — pure exposure to realized variance, no path-dependent delta-hedging needed. It is replicated by a static, strike-weighted portfolio of OTM options (weight ∝ 1/K²) plus a dynamic futures hedge.

VIX = √(model-free implied variance of 30-day SPX options) × 100; computed from the same 1/K² strip, not from a single ATM option.
VIX is a volatility (annualized %), so a "20 VIX" ≈ 20% expected annualized SPX vol ≈ 20/√252 ≈ 1.26% expected daily move.
Variance (not vol) is the cleanly tradeable/replicable quantity → quotes are often in variance then square-rooted.

Example. Var swap struck at K_var = 20 (variance 400), notional $50k per vol-point². If 30-day RV comes in at 25, payoff = 50,000·(25² − 20²) = 50,000·(625−400) = 50,000·225 = $11.25M (dollars). The payoff is linear in variance but convex in volatility, so the long gains disproportionately as realized vol rises above the strike.

Rule of Thumb: ATM Option Price

For a near-ATM forward option with low rates, the price is approximately C_ATM ≈ 0.4·S·σ·√T (from C ≈ S·σ·√T/√(2π), and 1/√(2π) ≈ 0.399). Hugely useful for fast mental pricing and for sanity-checking quotes.

Vega of an ATM option ≈ 0.4·S·√T (per 100% vol; divide by 100 for per-1%-vol).
A straddle (call+put) ATM ≈ 0.8·S·σ·√T.
Scaling: halving T shrinks the price by √2 ≈ 1.41×, not 2×.

Example. S=50, σ=30%, T=3 months=0.25 → √T=0.5. ATM call ≈ 0.4·50·0.30·0.5 = 3.0. The ATM straddle ≈ 6.0, i.e. the market implies a ±6 (±12%) move over the quarter is roughly a coin-flip breakeven.

No-Arbitrage Price Bounds & Comparative Statics

European call lower bound: C ≥ max(0, S − Ke^(−rT)) — below this, buy call + short stock + lend Ke^(−rT) for a riskless profit.
Upper bounds: C ≤ S (call never worth more than the stock); P ≤ Ke^(−rT) (European put ≤ PV of strike). American put ≤ K.
Comparative statics (which way price moves as one input rises):
Input ↑ Call Put
S ↑ ↓
K ↓ ↑
σ ↑ ↑
T ↑* ↑/↓*
r ↑ ↓
*Longer T raises an American option monotonically; for European a long-dated put can be worth less than a short-dated one (discounting of K dominates).
Option value is convex in S (gamma ≥ 0) and monotone — these shape constraints are themselves arbitrages (a call spread can't cost more than its width, a butterfly can't be negative).

Input ↑	Call	Put
S	↑	↓
K	↓	↑
σ	↑	↑
T	↑*	↑/↓*
r	↑	↓

Example. S=100, K=95, r=5%, T=1. Lower bound on the call: 100 − 95·e^(−0.05) ≈ 100 − 90.37 = 9.63. Any market call priced below 9.63 is a locked-in arb regardless of σ.

Vol Cone & Forward Volatility (Term-Structure Trades)

Vol cone: plot the historical min/median/max of realized vol per horizon (e.g. 10d, 30d, 90d, 1y). Overlay today's implied curve — IV near the top of the cone for a tenor = rich; near the bottom = cheap. A richness gauge across tenors, not strikes.
Variance is additive in time, so forward (between-date) variance is extractable from two spot vols:
σ_fwd = √[(σ₂²·T₂ − σ₁²·T₁)/(T₂ − T₁)].
A calendar spread (short near / long far, same K) is a long forward-vol / long calendar-vega position — it wants stillness now and a higher implied later.
Front-month IV can spike above back-month (inverted / backwardated term structure) in a panic; normally it is upward-sloping (contango).

Example. 1-month IV = 20%, 2-month IV = 22%. Implied 1-month-forward-starting vol for month 2: √[(0.22²·2 − 0.20²·1)/(2−1)] = √[(0.0968 − 0.0400)/1] = √0.0568 ≈ 23.8% — higher than either spot vol, the cost embedded in the calendar.

Dispersion: Index vs Single-Name Vol & Implied Correlation

Index variance is a correlation-weighted sum of constituent variances: for an equal-weight basket, σ_idx² ≈ ρ̄·(Σwᵢσᵢ)² + (1−ρ̄)·Σwᵢ²σᵢ²; with many names the second term vanishes, so σ_idx ≈ √ρ̄ · Σwᵢσᵢ.
Implied correlation backs ρ̄ out of index IV and single-name IVs — the market's price of co-movement. It tends to be high (index puts are bid for crash protection), so index vol is often rich vs the basket.
Dispersion trade: sell index vol, buy single-name vol (via straddles/var swaps). Short implied correlation — profits if names move idiosyncratically (low realized ρ); the classic risk is a correlated crash where ρ→1.
Single-name skew is typically flatter than index skew precisely because idiosyncratic crashes aren't correlated — index skew embeds the correlation-spike fear.

Example. Basket avg single-name vol Σwᵢσᵢ = 30%. Index IV trades at 21%. Implied correlation: ρ̄ ≈ (σ_idx/Σwᵢσᵢ)² = (0.21/0.30)² = 0.49. A dispersion book is short this 0.49; if realized correlation comes in at 0.30 the index leg was overpriced and the trade wins.

Stocks & Dividends

Common stock = residual ownership claim (votes + dividends, last in liquidation). A dividend is value transferred out of the firm: on the ex-dividend date the share price drops by ≈ the dividend amount (no arbitrage — buying just before vs after ex-date is a wash).

Key dates: declaration → ex-date (price drops; buyer no longer entitled) → record → payment.
Total return = price return + dividend yield; option pricing uses dividend yield q to discount the forward.
Buybacks return cash without the forced taxable event — often preferred to dividends.

Example. Stock closes 50.00, pays a $1.00 dividend. On the ex-date it opens ≈ 49.00 absent other news. A call holder is not entitled to the dividend, which is exactly why a juicy dividend can make early exercise of an American call worthwhile.

Bonds: Price–Yield, Duration, Convexity, DV01

Price = PV of coupons + principal: P = Σ C/(1+y)ᵗ + F/(1+y)ⁿ. Price and yield move inversely and convexly.

Macaulay duration = PV-weighted average time to cash flows (years).
Modified duration D_mod = D_mac/(1+y) → %ΔP ≈ −D_mod·Δy.
DV01 = dollar value of 1bp = D_mod·P·0.0001.
Convexity corrects the linear estimate: %ΔP ≈ −D_mod·Δy + ½·Cvx·(Δy)²; convexity is always +ve for vanilla bonds (helps you both ways).

Example. A bond with modified duration 7.0, price 98.00, yields rise +25bp. %ΔP ≈ −7.0·0.0025 = −1.75% → price ≈ 98.00·(1−0.0175) = 96.28. DV01 = 7.0·98·0.0001 = $0.0686 per $100 face per bp. With convexity 60: correction = ½·60·0.0025² = +0.0188% → true ΔP ≈ −1.73%, slightly less than the linear estimate (convexity always cushions).

Forwards & Futures: Cost of Carry, Basis, Margin

No-arbitrage forward price: F = S·e^((r−q)T) (q = dividend/convenience yield; for commodities add storage cost u: F = S·e^((r+u)T)). Basis = S − F (or F − S by convention); converges to 0 at expiry.

Contango: F > S (carry markets). Backwardation: F < S (high convenience yield / tight supply).
Futures vs forwards: futures are exchange-traded, daily marked-to-market with margin (cash flows daily); forwards are OTC, settled once. MTM creates small correlation-with-rates convexity differences.
Margin: post initial margin, top up to maintenance via variation margin as MTM moves.

Example. S=100, r=5%, q=2%, T=0.5 → F = 100·e^((0.05−0.02)·0.5) = 100·e^0.015 = 101.51. If the futures trades 102.00 it's rich by 0.49 → sell future, buy stock financed at r (cash-and-carry arb). A long futures holder whose contract falls from 102 to 100 pays $2/contract-unit variation margin that same day.

Swaps: Interest-Rate Swap Mechanics

A plain-vanilla IRS exchanges fixed for floating (e.g. SOFR) on a notional that is never exchanged. The fixed rate is set so the swap has zero value at inception (PV fixed leg = PV floating leg). It later moves with rates.

Pay-fixed/receive-float = short duration (gains when rates rise) — like being short a bond.
Swap rate = par yield: S = (1 − D(T_n)) / Σ τᵢ·D(Tᵢ) where D = discount factors.
Uses: convert floating debt to fixed, express rate views, hedge duration. DV01 of a swap behaves like the underlying bond.

Example. $100M 5-yr swap, fixed leg 4%, you pay fixed / receive SOFR. If SOFR averages 5% over a period, on a year you receive 5%·100M = $5M, pay 4%·100M = $4M → net +$1M. Rates rising made your pay-fixed position gain value (negative duration paid off).

Repo & Reverse Repo

A repo = sell a security now + agree to buy it back later at a slightly higher price → economically a collateralized loan. The lender of cash does a reverse repo (buys now, sells back later). The price difference is the repo rate (interest).

Borrower of cash = repo (gives collateral, pays repo rate). Borrower of the security = reverse repo (lends cash).
Lender protects via a haircut (lends less than collateral value).
Special collateral repo rate < general collateral (GC) rate when a bond is in high demand to borrow (e.g. to short).

Example. Borrow $99.5M cash overnight against $100M Treasuries (0.5% haircut) at 5.0% repo. Next day repay 99.5M·(1 + 0.05/360) ≈ 99.5M + $13,819 interest, and get the bonds back. The cash lender earned a near-riskless secured 5% with a $0.5M cushion.

ETFs & Creation/Redemption Arbitrage

An ETF trades intraday like a stock but tracks a basket/NAV. Authorized Participants (APs) keep price ≈ NAV via in-kind creation/redemption:

ETF trading above NAV (premium) → AP buys the underlying basket, delivers it to the issuer, receives new ETF shares, sells them → pushes price down.
ETF below NAV (discount) → AP buys ETF shares, redeems for the basket, sells the basket → pushes price up.
This arbitrage is why liquid ETFs track NAV tightly; in-kind transfers are also tax-efficient.

Example. ETF NAV = $100.00 but it trades at $100.30 (30bp premium). An AP shorts the ETF at 100.30, buys the underlying basket for 100.00, creates a new ETF unit to cover the short → locks ≈ $0.30/share minus costs. Arbitrage flow lifts the basket / sells the ETF until the gap closes.

Securities Lending, Short Selling & Borrow Fees

To short a stock you must borrow it first (locate), sell it, and later buy it back to return. The lender earns a borrow fee (the "cost to borrow", quoted annualized); the short also posts cash collateral and receives a rebate = cash rate − borrow fee.

General collateral (GC): easy-to-borrow, low fee. Hard-to-borrow / special: scarce float → high fee, even negative rebate.
Short must pay any dividends to the lender ("manufactured dividend").
Risks: unlimited loss, short squeeze, recall (lender demands shares back → forced buy-in).

Example. Short 10,000 shares at $40 (=$400k). Borrow fee = 8%/yr (hard-to-borrow). Holding 1 month costs ≈ 400,000·0.08·(1/12) = $2,667 in borrow, plus any dividends paid. The position must beat that carry just to break even — high borrow fees are themselves a bearish-crowding signal.

Market Structure: Exchange vs OTC, CCP, Settlement

Exchange-traded: standardized, lit central limit order book, anonymous, centrally cleared (futures, listed options, most equities).
OTC: bilateral, customizable, bears counterparty credit risk (swaps, forwards, bonds) — though many swaps are now CCP-cleared post-2008.
Central Counterparty (CCP): novates each trade so it becomes buyer-to-every-seller and seller-to-every-buyer; mutualizes default risk via margin + a default fund → removes bilateral counterparty risk.
Settlement: US equities moved to T+1 (trade date + 1 business day) in May 2024; options settle T+1; spot FX T+2; cash/Treasuries often T+0/T+1.

Example. You buy 100 shares Tuesday on an exchange. The CCP becomes your counterparty; cash and shares exchange on Wednesday (T+1). If your broker defaults overnight, the CCP's margin and default fund — not the unknown seller — absorb the risk, which is the whole point of central clearing.

Order Types

Order	Behavior
Market	execute now at best available price; certain fill, uncertain price (slippage)
Limit	execute only at limit price or better; certain price, uncertain fill
Stop (stop-loss)	becomes a market order once a trigger price trades; protects/exits
Stop-limit	becomes a limit order at trigger — won't chase through gaps
IOC (immediate-or-cancel)	fill what you can right now, cancel the rest
FOK (fill-or-kill)	fill the entire size immediately or cancel all

Adding liquidity (resting limit) often earns a maker rebate; taking liquidity (marketable order) pays a fee.

Example. Market shows bid 9.98 / ask 10.02. A buy limit at 10.00 rests (no fill until someone sells to you). A buy IOC for 5,000 at 10.02 fills against the 3,000 offered and cancels the remaining 2,000 — you never sit on the book. A market buy lifts 10.02 then sweeps higher levels for the rest.

Margin & Leverage

Initial margin = equity you must post to open (e.g. Reg-T 50% for equities; futures use SPAN risk-based margin). Maintenance margin = the floor your equity may not fall below; breaching it triggers a margin call → top up or be liquidated.

Leverage = position size / equity; magnifies both return and loss.
Futures margin is a performance bond (a few % of notional), reset daily by MTM — far more leverage than Reg-T equity margin.
Liquidation price = where falling equity hits maintenance.

Example. Buy $100k stock with 50% initial margin → post $50k, borrow $50k (2× leverage). Maintenance = 25%. Equity = market value − loan. A margin call hits when (V − 50,000)/V = 0.25 → V = $66,667, i.e. a 33% drop wipes the position to the maintenance floor. A 10% stock drop (to $90k) cuts equity from $50k to $40k — a 20% equity loss, the leverage at work.

Callable & Convertible Bonds: Negative Convexity, OAS

Callable bond = straight bond − issuer call option. You are short the call, so P_callable = P_straight − C; price is capped near the call price as yields fall.
Callables (and MBS) show negative convexity: when rates drop, the call/prepay option kicks in, so price compression beats price appreciation.
YTM is misleading for callables — quote yield-to-worst (YTW) = min over yield-to-call at each call date and YTM.
OAS (option-adjusted spread) = the spread over the curve after stripping out embedded-option value via a rates model. It is the apples-to-apples spread vs an option-free bond.
Convertible = straight bond + investor equity call. conversion ratio = par/conversion price; parity (conversion value) = ratio × stock. Trades as max(bond floor, parity) plus option time value.
Convert greeks: delta (equity sensitivity) rises as stock climbs — "busted" (deep debt-like, low delta) vs "in-the-money" (equity-like). Convert arb = long bond, short delta×stock.

Example. 5% bond callable at 102. As rates fall, an option-free bond would rally to 110, but the issuer calls — price stalls near 102. Holder's upside is truncated; that truncation is the negative convexity.

FX Forwards: Covered Interest Parity, Forward Points, Cross-Currency Basis

Covered interest parity (CIP): F = S · (1+r_d·τ)/(1+r_f·τ) — the forward locks in the rate differential; no arbitrage if you can borrow/lend/spot freely. Higher-rate currency trades at a forward discount.
FX forward points = F − S quote the rate differential, not a forecast of spot. Points are added/subtracted to spot; sign follows which leg has the higher rate.
FX swap = spot leg + offsetting forward leg; the workhorse for rolling/funding FX positions and the dominant FX instrument by volume.
Cross-currency basis = the persistent CIP violation: the extra spread on the funding leg (esp. borrowing USD via FX swaps). A negative EUR/USD basis means paying a premium to get dollars.
Drivers of the basis: USD funding scarcity, balance-sheet/regulatory costs (limits to arbitrage), quarter/year-end demand. CIP is not a law — it holds only when balance sheet is free.

Example. S=1.10 EUR/USD, 1y r_USD=5%, r_EUR=3%. F≈1.10·1.05/1.03≈1.1214. EUR (lower rate) trades at a forward premium of ~214 points vs USD — exactly offsetting the 2% carry pickup of holding USD.

Total Return Swaps: Synthetic Exposure & Embedded Financing

TRS: total-return receiver gets all price appreciation + coupons/dividends on a reference asset; pays a floating funding leg (e.g. SOFR + spread) and reimburses any depreciation. The payer holds/hedges the asset.
A TRS is synthetic leveraged ownership: economic exposure to the asset with no upfront purchase — the funding leg IS the embedded financing.
Uses: synthetic prime brokerage / leverage, off-balance-sheet exposure, gaining access to assets you can't easily hold (foreign, restricted), and a sec-lending alternative for synthetic shorts.
Risks: counterparty risk (bilateral, collateralized by margin/CSA), concentration hidden from public filings (no 13F on swap exposure), and forced unwind on margin calls.
Archegos (2021) is the canonical cautionary case: massive concentrated TRS positions across prime brokers, no aggregate visibility, gap-down forced liquidations.

Example. Receiver wants $100mm of a stock posting 10% margin. Pays SOFR+spread on $100mm notional, receives the stock's total return. If the stock returns 8% and funding costs 5.5%, net ≈ 2.5% on $100mm but on $10mm posted — ~25% return, and symmetric on the downside.

Core data structures & complexity

Pick the container by access pattern, not by habit. dict/set are hash tables: O(1) average membership; list membership is O(n).

Op	list	dict	set	tuple
index / key get	O(1)	O(1) avg	—	O(1)
`x in c`	O(n)	O(1) avg	O(1) avg	O(n)
append / add	O(1) amort.	O(1) avg	O(1) avg	immutable
insert/pop at front	O(n)	—	—	—
pop end	O(1)	—	—	—
min / max	O(n)	O(n)	O(n)	O(n)

list: ordered, mutable, contiguous; great for stacks (append/pop).
tuple: immutable, hashable (usable as dict keys / set members).
dict: insertion-ordered since 3.7; keys must be hashable.
Need O(1) at both ends? Use collections.deque (O(1) appendleft/popleft).

Example. Membership test in a loop is the classic O(n²) trap.

seen = set(big_list)        # build once: O(n)
hits = [x for x in queries if x in seen]   # each lookup O(1)
# vs. `if x in big_list` -> O(n) per query -> O(n*m) total

Comprehensions (list / dict / set)

Comprehensions are faster than equivalent for-append loops (the append is done in C) and read declaratively. Use them for transform/filter; reach for a loop when there are side effects.

squares  = [x*x for x in range(10)]            # list
evens    = [x for x in range(10) if x % 2 == 0]  # filter
pairs    = [(i, j) for i in range(3) for j in range(3)]  # nested
price_by = {sym: px for sym, px in quotes}     # dict comp
uniq     = {round(x, 2) for x in samples}      # set comp
# ternary goes BEFORE the for:
clip     = [x if x > 0 else 0 for x in vals]

Gotcha. Don't build a list just to consume it once — use a generator expression (no brackets) to avoid materializing:

total = sum(x*x for x in data)   # O(1) extra memory, no list built

Generators & lazy evaluation

A function with yield returns a generator: it computes values on demand and holds O(1) state. Lazy = constant memory for arbitrarily long / infinite streams.

def moving_sum(stream, k):
    from collections import deque
    win, s = deque(), 0
    for x in stream:
        win.append(x); s += x
        if len(win) > k:
            s -= win.popleft()
        if len(win) == k:
            yield s

for ms in moving_sum(prices, 20):   # one value at a time
    ...

A generator is exhausted after one pass — iterate again and you get nothing.
next(gen) pulls one value; StopIteration at the end.

Example. Infinite stream, take first 5:

import itertools
def naturals():
    n = 0
    while True:
        yield n; n += 1
list(itertools.islice(naturals(), 5))   # [0,1,2,3,4]

Mutable default argument trap

Default values are evaluated once at function-definition time, not per call. A mutable default (list/dict) is shared across all calls. Use None as the sentinel and build the real default inside.

def bad(x, acc=[]):     # BUG: same list reused
    acc.append(x); return acc
bad(1)  # [1]
bad(2)  # [1, 2]  <-- leaked!

def good(x, acc=None):
    if acc is None:
        acc = []
    acc.append(x); return acc

Why interviewers ask. It tests whether you understand that def runs once and binds defaults eagerly. The same logic applies to default {} and to class-level mutable attributes.

copy vs deepcopy; is vs ==

Equality vs identity: == compares values, is compares identity (same object). Assignment never copies — it binds another name to the same object.

import copy
a = [[1, 2], [3, 4]]
b = a                  # alias: b is a -> True
c = a.copy()           # shallow: c is a -> False, but c[0] is a[0] -> True
d = copy.deepcopy(a)   # full clone: d[0] is a[0] -> False
a[0].append(99)        # also shows up in c (shared inner list), NOT in d

Shallow copy: new outer container, same inner objects.
Use is None / is not None — never == None.
Small ints (−5..256) and interned strings may share identity; don't rely on is for value equality.

Gotcha. x = y = [] makes both names point at one list. x.append(1) changes y too.

NumPy vectorization vs Python loops

Push the loop into C. Vectorized ops run in compiled code over contiguous memory — often 50–100x faster than a Python loop, and the intent reads cleaner.

# BEFORE: pairwise returns in a Python loop  (slow)
rets = []
for i in range(1, len(px)):
    rets.append(px[i] / px[i-1] - 1)

# AFTER: vectorized  (no Python-level loop)
import numpy as np
px = np.asarray(px)
rets = px[1:] / px[:-1] - 1            # whole-array slice arithmetic
logret = np.diff(np.log(px))          # log returns, one call

Avoid growing lists then converting; preallocate or operate on whole arrays.
Boolean masks replace if-loops: x[x < 0] = 0.
np.where(cond, a, b) is a vectorized ternary.

Example. Z-score a column without a loop:

z = (x - x.mean()) / x.std()

Pandas essentials (groupby, vectorized ops)

Think column-wise, not row-wise. Avoid iterrows / apply(axis=1) in hot paths — use vectorized column ops and groupby.

import pandas as pd
df = pd.DataFrame({"sym": ["A","A","B"], "px": [10.,11.,20.], "qty": [5,3,2]})

df["notional"] = df.px * df.qty            # vectorized, no loop
vwap = (df.groupby("sym")
          .apply(lambda g: (g.px*g.qty).sum()/g.qty.sum()))   # per-group
by_sym = df.groupby("sym").agg(tot=("notional","sum"),
                               n=("px","count"))
df["ret"] = df.groupby("sym").px.pct_change()   # group-aware diff

.loc[mask, col] for label/boolean selection and safe assignment (avoids chained-assignment warnings).
merge = SQL join; concat = stack frames.
shift(1) for lagging — careful not to leak future data.

Gotcha. df[df.a>0]["b"] = 1 may silently fail (chained indexing). Use df.loc[df.a>0, "b"] = 1.

f-strings

Fast, readable interpolation with inline format specs. Format spec after the colon controls width, precision, sign, thousands, alignment.

px, qty = 1234.5678, 50
f"{px:.2f}"        # '1234.57'  (2 decimals)
f"{px:,.2f}"       # '1,234.57' (thousands sep)
f"{px:10.2f}"      # '   1234.57' (width 10, right-aligned)
f"{qty:<6}|"       # '50    |'   (left-align in width 6)
f"{0.0734:.1%}"    # '7.3%'
f"{255:#x}"        # '0xff'
f"{px=}"           # 'px=1234.5678'  (debug: name + value, 3.8+)

Example. Build a quick table row:

f"{sym:<4} {bid:>8.2f} {ask:>8.2f}"

Must-know interview snippets

Two-sum with a dict, Counter, and reservoir sampling come up constantly.

# Two-sum: O(n) using a hash map of value -> index
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return (seen[target - x], i)
        seen[x] = i
    return None

# Counter: frequencies, most common
from collections import Counter
c = Counter("abracadabra")
c.most_common(2)        # [('a', 5), ('b', 2)]

# Reservoir sampling: uniform sample of k from a stream of unknown length
import random
def reservoir(stream, k):
    res = []
    for i, x in enumerate(stream):
        if i < k:
            res.append(x)
        else:
            j = random.randint(0, i)   # inclusive
            if j < k:
                res[j] = x
    return res

Why reservoir works. Item i (0-indexed, i≥k) is kept with prob k/(i+1), and earlier items survive each step with the complementary prob; induction shows every item ends with prob k/n.

Float pitfalls

IEEE-754 doubles can't represent most decimals exactly. Never compare floats with ==; use a tolerance or math.isclose.

0.1 + 0.2 == 0.3          # False
0.1 + 0.2                 # 0.30000000000000004

import math
math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)   # True

sum([0.1]*3) == 0.3       # False (0.30000000000000004)
math.fsum([0.1]*3) == 0.3    # True (exact summation)

For money use decimal.Decimal or integer cents — not floats.
float('nan') == float('nan') is False; test with math.isnan.
Catastrophic cancellation: subtracting two near-equal large floats loses precision.

Example. Sorting by a float key is fine; equality-bucketing by a float key is not — round to a grid first: round(x, 6).

itertools & collections highlights

These turn 10-line loops into one expressive, lazy line.

import itertools as it
from collections import defaultdict, deque, Counter, OrderedDict

it.accumulate([1,2,3,4])         # 1,3,6,10  (running sum -> prefix sums)
it.chain([1,2], [3,4])           # 1,2,3,4   (flatten)
it.combinations([1,2,3], 2)      # (1,2),(1,3),(2,3)
it.product([0,1], repeat=2)      # (0,0),(0,1),(1,0),(1,1)
it.groupby(sorted(data))         # consecutive runs (sort first!)
it.pairwise([1,2,3])             # (1,2),(2,3)   (3.10+)

dd = defaultdict(list)           # auto-creates [] on missing key
dd["A"].append(1)                # no KeyError
dq = deque(maxlen=3)             # ring buffer; oldest drops on overflow

Gotcha. itertools.groupby groups only consecutive equal keys — sort by the key first or you'll get fragmented groups.

Big-O of common operations (quick recall)

Know the amortized vs worst-case distinction.

list.append O(1) amortized; occasional resize is O(n) but rare.
list.insert(0, x) / pop(0) O(n) — shifts everything; use deque.
x in list/tuple O(n); x in set/dict O(1) average, O(n) worst (hash collisions).
sorted() / list.sort() O(n log n), Timsort, stable.
"".join(parts) O(total) — never += strings in a loop (O(n²)).
heapq push/pop O(log n); heapify O(n).
dict/set build from n items O(n); slicing a[i:j] O(j−i) (copies).

Example. Top-k from a stream in O(n log k) with a min-heap:

import heapq
def topk(stream, k):
    h = []
    for x in stream:
        if len(h) < k:
            heapq.heappush(h, x)
        elif x > h[0]:
            heapq.heapreplace(h, x)   # pop smallest, push x
    return sorted(h, reverse=True)

NumPy broadcasting & views vs copies

Broadcasting aligns shapes right to left: dims must be equal or one of them 1; a size-1 dim is stretched without copying.
Shapes (3,1) and (1,4) broadcast to (3,4) — the classic outer-op pattern. Incompatible: (3,) vs (4,) → error.
Basic slicing returns a view (shares memory; writes propagate); fancy/boolean indexing returns a copy.
Check with arr.base is not None (view) and np.shares_memory(a,b).
reshape returns a view when possible; ravel may view, flatten always copies.

Example. Pairwise diffs without a loop: x[:,None] - x[None,:] broadcasts (n,1)-(1,n) to an (n,n) matrix. x[1:][0]=9 mutates x (view); x[x>0][0]=9 does not (copy).

Pandas index alignment & SettingWithCopy

Arithmetic between objects aligns on index/columns first; non-matching labels yield NaN, not positional pairing.
a + b reorders to the union of labels. a.add(b, fill_value=0) stays label-aligned but fills labels missing on one side with 0 instead of NaN; for true positional math use a.values + b.values.
Assign through one .loc, never chain. df[m]["c"]=1 writes to a temporary and may be lost (SettingWithCopyWarning); use df.loc[m,"c"]=1.
.loc is label-based (endpoint inclusive); .iloc is positional (endpoint exclusive).
To detach explicitly use .copy(); reindexing/filtering may return a view or copy ambiguously.

Gotcha. Summing two return series with different date indices: pandas aligns dates, so a day present in only one becomes NaN. Reindex both to a common calendar before combining.

Subtle gotchas: closures, interning, sort keys

Late binding: closures capture the variable, not its value. [lambda: i for i in range(3)] all return 2; fix with a default arg lambda i=i: i.
−5..256 and short string literals are interned, so is can accidentally match == — never rely on it (identity vs value lives in the copies card).
sorted(..., key=...) sorts by the key; it is stable, so equal keys keep input order — chain sorts least-significant-key first.
dict preserves insertion order (3.7+); equality ignores order, iteration respects it.
a += b on a list mutates in place (extend); a = a + b rebinds to a new list.

Example. Sort trades by price desc, ties by time asc: sorted(t, key=lambda x:(-x.px, x.t)) — pack the key as a tuple rather than sorting twice.

Value vs reference vs pointer

Value = a copy; reference = an alias to an existing object; pointer = an address that can be null and reseated.

void by_val(int x)  { x = 9; }   // local copy; caller unchanged
void by_ref(int& x) { x = 9; }   // edits caller's object
void by_ptr(int* p) { if (p) *p = 9; }   // may be null; deref carefully

int a = 1;
by_val(a);  // a == 1
by_ref(a);  // a == 9
by_ptr(&a); // a == 9

A reference must be bound at init and can't be rebound or null.
A pointer can be null, reassigned, and do arithmetic.
Prefer references for "must exist"; pointers for "optional / owns nothing / iterates".

Gotcha. Returning a reference/pointer to a local is undefined behavior — the object dies at function exit. Return by value instead.

const & const-correctness

Mark everything that doesn't mutate as const — it documents intent, enables optimization, and lets callers pass temporaries/const objects.

int sum(const std::vector<int>& v);   // won't modify v; avoids a copy

const int* p;        // pointer to const int  (can't change *p)
int* const q = &a;   // const pointer to int  (can't reseat q)
const int* const r = &a;  // both fixed

struct Px {
    double mid() const { return (bid + ask) / 2; }  // const member fn
    double bid, ask;
};

Read declarations right-to-left: const int* p = "p is a pointer to a const int".
A const member function may not modify members (unless mutable) and can be called on const objects.

Why it matters. Without const on mid(), you can't call it through a const Px& — const-correctness is viral; add it early.

RAII & smart pointers

RAII: a resource's lifetime is tied to an object's scope — the destructor frees it, so there are no leaks even on exceptions.

#include <memory>
auto u = std::make_unique<Order>(id);   // sole owner; freed at scope exit
auto s = std::make_shared<Book>();      // ref-counted shared ownership
std::weak_ptr<Book> w = s;              // non-owning observer (breaks cycles)

void f() {
    auto p = std::make_unique<int>(42);
    may_throw();          // if it throws, p is still freed -- no leak
}                          // destructor runs here

Type	Ownership	Copyable
`unique_ptr`	exclusive	no (move-only)
`shared_ptr`	shared (refcount)	yes
`weak_ptr`	none (observes)	yes

Gotcha. Two shared_ptrs pointing at each other never reach refcount 0 — a leak. Break the cycle with weak_ptr.

STL containers + complexity

Ordered vs hashed is the core choice: map/set are red-black trees (ordered, O(log n)); unordered_* are hash tables (O(1) average).

Container	Lookup	Insert	Ordered?
`vector`	O(1) index / O(n) find	O(1) amort. push_back	—
`map`	O(log n)	O(log n)	sorted by key
`unordered_map`	O(1) avg, O(n) worst	O(1) avg	no
`set`	O(log n)	O(log n)	sorted
`deque`	O(1) index	O(1) both ends	—

std::vector<int> v{3, 1, 2};
v.push_back(4);
std::unordered_map<std::string, double> px;
px["AAPL"] = 190.5;             // insert or update
auto it = px.find("AAPL");      // O(1) avg; it == px.end() if absent
if (it != px.end()) use(it->second);

Tip. Default to vector — contiguous memory is cache-friendly and beats list/map in practice even when big-O looks worse. reserve(n) to avoid reallocations.

Iterators & range-for

Iterators generalize pointers; range-for is the clean way to traverse.

std::vector<int> v{1, 2, 3};

for (int x : v)        { /* copy each element */ }
for (int& x : v)       { x *= 2; }          // modify in place
for (const auto& x : v){ use(x); }          // read-only, no copy

for (auto it = v.begin(); it != v.end(); ++it)
    use(*it);                               // classic iterator loop

Use const auto& to avoid copying non-trivial elements.
Modifying a container (e.g. erase/push_back) during iteration can invalidate iterators.
erase returns the next valid iterator: it = v.erase(it);

Gotcha. vector::push_back may reallocate and invalidate all existing iterators/pointers/references into it.

Templates (generic code)

Templates generate type-specialized code at compile time — zero runtime dispatch cost, full type safety.

template <typename T>
T max_of(const T& a, const T& b) {
    return (a < b) ? b : a;
}
max_of(3, 7);          // T = int
max_of(2.5, 1.5);      // T = double

template <typename It>
auto mean(It first, It last) {
    using V = typename std::iterator_traits<It>::value_type;
    V s{};
    long n = 0;
    for (; first != last; ++first) { s += *first; ++n; }
    return s / n;
}

Note. Errors surface at instantiation (when used), which is why template error messages are long. C++20 concepts constrain types and give clearer diagnostics. Watch types: for an integer range V=int, so s/n is integer division (truncates) — accumulate in double for a real mean.

Move semantics vs copy

A move steals the guts of a temporary (pointer/buffer) instead of deep-copying — O(1) vs O(n) for containers.

std::vector<int> make();        // returns a big vector
std::vector<int> a = make();    // move (or RVO) -- no deep copy

std::vector<int> b = a;             // COPY: duplicates all elements
std::vector<int> c = std::move(a);  // MOVE: c takes a's buffer
// after move, `a` is valid-but-unspecified (typically empty)

std::move is just a cast to an rvalue reference — it enables a move, it doesn't move anything itself.
Don't use an object after moving from it (except to reassign/destroy).
Return local objects by value and let RVO / move handle it — don't std::move a return value (it can disable RVO).

Example. Pass sink params by value then move: void set(std::string s){ name_ = std::move(s); } — one move whether caller passes lvalue or rvalue.

Pass-by-const-ref vs by-value

For non-trivial types, take const T& to avoid a copy; for small/cheap types (int, double, pointers) pass by value.

double price(const std::string& sym);   // no copy of the string
double price(int id);                   // cheap; by value is fine

void consume(std::vector<int> v);        // takes a copy you'll own
void inspect(const std::vector<int>& v); // read-only, no copy

By value when you need your own copy anyway (then you can move it).
const T& binds to temporaries too, so callers can pass literals.

Rule of thumb. "In params: const& for big, by-value for small. Out/in-out params: non-const reference or pointer."

Stack vs heap

Stack allocation is a pointer bump (nearly free) and auto-freed at scope exit; heap (new/malloc) is slower and must be managed.

void f() {
    int x = 5;                 // stack: freed automatically
    std::array<int, 64> buf;    // stack: fixed size, fast

    auto p = std::make_unique<int[]>(n);  // heap: size known at runtime
}                              // x, buf freed here; p frees its heap block

Stack is small (often ~1–8 MB) — large or unknown-size data goes on the heap.
Heap allocation can fragment and stalls latency — avoid in hot loops.
Prefer RAII wrappers (containers, smart pointers) over raw new/delete.

Gotcha. Returning a pointer to a stack local is UB: int* bad(){ int x=1; return &x; } — x is gone.

Why C++ for low latency

No garbage collector, deterministic destruction, and control over memory layout for cache efficiency.

No GC pauses: deallocation is deterministic (destructor at scope end), so there are no unpredictable stop-the-world stalls.
Cache locality: vector/array store data contiguously; sequential access prefetches well — node-based structures chase pointers and miss cache.
Zero-overhead abstractions: templates/inlining compile away; you pay only for what you use.
Control: stack allocation, custom allocators/memory pools, std::array to avoid heap in hot paths.
Mechanical sympathy: avoid allocations, branches, and false sharing in the hot path; measure with a profiler.

Layout matters. Struct-of-arrays (parallel vectors) often beats array-of-structs when you scan one field across many records — fewer cache lines touched.

Idiomatic example: sort with a lambda

A lambda is an inline callable; the compiler can inline the comparator, so it's as fast as a hand-written loop.

#include <algorithm>
#include <vector>
struct Quote { std::string sym; double px; };

std::vector<Quote> book = /* ... */;

// sort by price descending, ties by symbol ascending
std::sort(book.begin(), book.end(),
    [](const Quote& a, const Quote& b) {
        if (a.px != b.px) return a.px > b.px;
        return a.sym < b.sym;
    });

int thresh = 100;
auto n = std::count_if(book.begin(), book.end(),
    [thresh](const Quote& q){ return q.px > thresh; });  // capture by value

[x] capture by value, [&x] by reference, [=]/[&] capture all.
A comparator must be a strict weak ordering — return false for equal elements, never <=.

Gotcha. Capturing by reference ([&]) a local that outlives the lambda is a dangling-reference bug; capture by value when the callable escapes the scope.

Memory layout & cache locality

Latency hierarchy (approx, order-of-magnitude): L1 ~1ns, L2 ~4ns, L3 ~10ns, main RAM ~100ns. A cache miss to RAM costs ~100× an L1 hit.
Cache line is 64 bytes; a load pulls the whole line. Keep hot data contiguous and small so one line serves many accesses.
std::vector is contiguous → linear scans are prefetcher-friendly; std::list/node-based maps chase pointers → a miss per node.
Struct-of-Arrays (SoA) beats Array-of-Structs (AoS) when you touch one field across many records — no wasted bytes loaded.
False sharing: two threads writing different vars on the same line ping-pong the line between cores; pad/align hot per-thread data to a line.

Example. Summing 1M ints in a vector vs a list: same big-O O(n), but the vector can be ~10× faster purely from cache locality and prefetching — why HFT code favors flat arrays.

Copy elision, RVO & perfect forwarding

RVO/copy elision lets the compiler construct a returned object directly in the caller's slot — zero copies or moves. Since C++17 it is mandatory for returning a pure prvalue (e.g. return T{...};).
So return std::move(local); on a return statement is an anti-pattern — it blocks elision and forces a move. Just return local;.
A forwarding (universal) reference T&& in a deduced template binds to lvalues and rvalues; std::forward<T>(x) preserves the original value category.
std::move is just a cast to rvalue ref — it moves nothing by itself; the move happens in the chosen overload.

Example. template<class T> void push(T&& x){ v.emplace_back(std::forward<T>(x)); } — an lvalue arg copies, an rvalue arg moves, one function handles both with no extra temporary.

Iterator invalidation by container

A reallocation or rehash invalidates iterators/pointers/references into a container; know which ops trigger it per type.

Container	Insert invalidates	Erase invalidates
vector	all if capacity grows, else from insert point	at/after erased
deque	all iterators (refs to ends survive)	all unless an end
list	none	only the erased
map/set	none	only the erased
unordered_*	iterators on rehash; refs survive	only the erased

vector::reserve(n) up front avoids growth reallocations and keeps iterators stable.

Example. Erasing in a loop: for(auto it=v.begin(); it!=v.end();) it = v.erase(it); — capture the returned iterator; never ++it after erasing it.

Flashcards

Search, filter by domain, click to flip. Drill out loud — narrate reasoning, don't recite. ◆ reported = appears in real quant interviews.

Lab

Build intuition by visualizing and simulating — an option-payoff explorer, a distribution explorer, a market-making game, a linear-regression simulator, and an in-browser Python sandbox. Switch tools below.

▤ payoff explorer ∿ distributions ⚄ market maker ▦ regression ▶ python sandbox

Interactive option-payoff explorer — pick a structure or build your own legs, tune spot / vol / time, read breakeven, max-profit & max-loss live. Toggle net P&L (folds in the Black–Scholes premium) vs payoff only (gross intrinsic value — the textbook diagram).

▸ control center

premium: included

Spot S₀ 100 Vol σ 20% Time T 0.25y

+ add leg

▸ outcome PIN view

Workshop

Make it yours — add your own subjects, flashcards, and cheatsheet cards. Saved in this browser; they appear instantly across Browse / Flashcards / Cheatsheets.

+ Subject

Creates a new subject in both the cheatsheet and flashcard sections — they share one subject list.