AI Algorithm Interview Formula Derivation Collection: 10 Math Formulas You Must Derive by Hand

Interview TopicsJune 15, 2025Author: BeautyResume Team

10 must-derive formulas for AI interviews: linear regression least squares, logistic regression gradient, SVM dual, Bayesian MAP, EM algorithm, backpropagation, Attention, VAE loss, Diffusion process, KL divergence, with complete derivation steps and assessment points

Background

When I was interviewing for algorithm positions, my biggest fear was the hand-derivation section. The interviewer gives you a pen and paper and asks you to derive the SVM dual problem or backpropagation gradients on the spot—that kind of nervousness is hard to describe. Later, I spent a month specifically practicing hand derivations, going through all 10 of the most commonly tested derivations in AI interviews, from linear regression to Diffusion models. After interviewing at several companies, I found that hand derivations actually follow patterns—once you master the methods, they're not as hard as imagined. Today I'm compiling these 10 must-derive formulas, each with derivation steps and interview assessment points.

Interview Process Review

The interview process for algorithm positions typically follows: resume screening → first technical round (fundamentals + hand-derive 1-2 formulas) → second technical round (projects + hand-derive 1 harder formula) → third technical round (system design or open-ended questions). Hand derivations usually appear in technical rounds, where the interviewer gives you a whiteboard and pen to write and explain simultaneously. The time limit is generally 10-15 minutes, so you can't derive slowly—you must be very familiar with the steps. Among the companies I interviewed with, the most frequently tested were logistic regression gradient derivation, SVM dual derivation, and backpropagation derivation—these three are almost mandatory. Next are EM algorithm and Attention formulas. Diffusion and VAE are less common but tested for LLM positions. What interviewers value is not whether you can derive the final result, but whether your derivation process is clear and the motivation for each step is reasonable.

Question Collection

1. Linear Regression Least Squares Solution

Derivation Steps:

Given data (X, y), linear model y=Xw, least squares objective: L(w) = ||y - Xw||² = (y - Xw)ᵀ(y - Xw)

Expand: L(w) = yᵀy - yᵀXw - wᵀXᵀy + wᵀXᵀXw = yᵀy - 2wᵀXᵀy + wᵀXᵀXw

Take derivative w.r.t. w: ∂L/∂w = -2Xᵀy + 2XᵀXw = 0

Solve: w* = (XᵀX)⁻¹Xᵀy

Assessment points: Matrix differentiation rules (∂(aᵀx)/∂x = a, ∂(xᵀAx)/∂x = (A+Aᵀ)x); conditions for XᵀX invertibility (X full column rank); regularized solution (Ridge: w* = (XᵀX+λI)⁻¹Xᵀy).

2. Logistic Regression Gradient Derivation

Derivation Steps:

Logistic regression model: P(y=1|x) = σ(wᵀx) = 1/(1+exp(-wᵀx))

Log-likelihood for a single sample: ℓ(w) = y·log(σ(wᵀx)) + (1-y)·log(1-σ(wᵀx))

Derivative of σ: σ'(z) = σ(z)(1-σ(z))

Derivative w.r.t. w: ∂ℓ/∂w = y·(1-σ(wᵀx))·x - (1-y)·σ(wᵀx)·x = (y - σ(wᵀx))·x

Batch gradient: ∇L = Σᵢ(yᵢ - σ(wᵀxᵢ))·xᵢ = Xᵀ(y - σ(Xw))

Update rule: w ← w + η·Xᵀ(y - σ(Xw))

Assessment points: Sigmoid derivative derivation technique; intuitive understanding of gradient form (prediction error times features); why logistic regression has no closed-form solution (nonlinear equation); multi-class Softmax derivation.

3. SVM Dual Problem Derivation

Derivation Steps:

Primal problem: min ½||w||² s.t. yᵢ(wᵀxᵢ+b)≥1

Introduce Lagrange multipliers αᵢ≥0, Lagrangian: L(w,b,α) = ½||w||² - Σαᵢ[yᵢ(wᵀxᵢ+b)-1]

Partial derivative w.r.t. w set to 0: ∂L/∂w = w - Σαᵢyᵢxᵢ = 0 → w = Σαᵢyᵢxᵢ

Partial derivative w.r.t. b set to 0: ∂L/∂b = -Σαᵢyᵢ = 0 → Σαᵢyᵢ = 0

Substitute back into Lagrangian, simplify to dual: max Σαᵢ - ½ΣΣαᵢαⱼyᵢyⱼxᵢᵀxⱼ s.t. αᵢ≥0, Σαᵢyᵢ=0

Assessment points: KKT conditions (complementary slackness αᵢ[yᵢ(wᵀxᵢ+b)-1]=0); support vector definition (samples with αᵢ>0); why convert to dual (introduce kernel functions, only depends on inner products); soft-margin derivation (adding slack variables ξᵢ and penalty parameter C, 0≤αᵢ≤C).

4. Bayes' Formula and MAP Derivation

Derivation Steps:

Bayes' formula: P(θ|D) = P(D|θ)P(θ)/P(D)

Maximum A Posteriori (MAP): θ_MAP = argmax P(θ|D) = argmax P(D|θ)P(θ)

Take log: θ_MAP = argmax [log P(D|θ) + log P(θ)]

When P(θ) is Gaussian prior N(0,σ²): log P(θ) = -||θ||²/(2σ²) + C

So MAP is equivalent to: argmax log P(D|θ) - ||θ||²/(2σ²) = argmin -log P(D|θ) + λ||θ||²

i.e., MAP = MLE + L2 regularization

Assessment points: Intuitive understanding of Bayes' formula (prior + data → posterior); MAP vs MLE relationship (MAP=MLE without prior); different priors correspond to different regularization (Laplace prior → L1 regularization); Naive Bayes conditional independence assumption.

5. EM Algorithm Derivation

Derivation Steps:

Goal: Maximize incomplete data log-likelihood log P(X|θ) = log Σ_Z P(X,Z|θ)

Introduce distribution q(Z), use Jensen's inequality: log Σ_Z P(X,Z|θ) = log Σ_Z q(Z)·P(X,Z|θ)/q(Z) ≥ Σ_Z q(Z)·log[P(X,Z|θ)/q(Z)]

E-step: Equality condition, set q(Z) = P(Z|X,θᵗ), compute Q(θ|θᵗ) = E_Z|X,θᵗ[log P(X,Z|θ)]

M-step: θᵗ⁺¹ = argmax Q(θ|θᵗ)

Assessment points: Jensen's inequality equality condition (q(Z) = P(Z|X,θᵗ)); proof of EM monotonic increase; GMM EM derivation (E-step computes posterior γ, M-step updates mean, covariance, mixing coefficients); EM vs K-Means relationship (K-Means is hard-assignment version of GMM-EM).

6. Backpropagation Gradient Derivation

Derivation Steps:

Three-layer network: input x → hidden h=σ(W₁x+b₁) → output ŷ=W₂h+b₂ → loss L

Output layer gradient: ∂L/∂W₂ = (∂L/∂ŷ)·(∂ŷ/∂W₂) = (∂L/∂ŷ)·hᵀ

∂L/∂b₂ = ∂L/∂ŷ

Hidden layer gradient (chain rule): ∂L/∂h = W₂ᵀ·(∂L/∂ŷ)

∂L/∂W₁ = (∂L/∂h ⊙ σ'(W₁x+b₁))·xᵀ, where ⊙ is element-wise multiplication

∂L/∂b₁ = ∂L/∂h ⊙ σ'(W₁x+b₁)

Extension to multiple layers: δˡ = (Wˡ⁺¹)ᵀδˡ⁺¹ ⊙ σ'(zˡ), ∂L/∂Wˡ = δˡ·(aˡ⁻¹)ᵀ

Assessment points: Chain rule application; computation graph representation; gradient vanishing/exploding causes (σ' < 1 multiplied → vanishing, W too large → exploding); why use ReLU (σ'=1 or 0, mitigates vanishing); Batch Normalization's effect on gradient flow.

7. Attention Formula Derivation

Derivation Steps:

Self-Attention: Q = XWq, K = XWk, V = XWv

Attention weights: A = softmax(QKᵀ/√dₖ)

Output: Output = AV

Why divide by √dₖ: When dₖ is large, elements of QKᵀ have variance dₖ (assuming q and k elements are i.i.d. with variance 1), dividing by √dₖ normalizes variance to 1, preventing softmax from entering saturation region (near-zero gradients)

Multi-head attention: headᵢ = Attention(QWqᵢ, KWkᵢ, VWvᵢ), MultiHead = Concat(head₁,...,headₕ)Wᴏ

Assessment points: Mathematical derivation of scaling factor √dₖ; self-attention complexity O(n²d); causal mask implementation (setting future position attention weights to -∞); KV Cache principle (caching K and V during inference to avoid recomputation).

8. VAE Loss Function Derivation

Derivation Steps:

Goal: Maximize data marginal log-likelihood log P(X)

Introduce variational distribution q(Z|X), using ELBO: log P(X) = E_q[log P(X|Z)] - KL(q(Z|X)||P(Z)) + log P(X)/q(Z|X)

Since KL divergence ≥ 0: log P(X) ≥ ELBO = E_q[log P(X|Z)] - KL(q(Z|X)||P(Z))

Maximizing ELBO is equivalent to minimizing: L = -E_q[log P(X|Z)] + KL(q(Z|X)||P(Z))

First term is reconstruction loss, second term is KL divergence regularization

When both q and P are Gaussian: KL(N(μ,σ²)||N(0,1)) = -½Σ(1+log σ²-μ²-σ²)

Reparameterization trick: Z = μ + σ·ε, ε~N(0,1), enabling gradients to flow through sampling

Assessment points: ELBO derivation (Jensen's inequality or KL divergence decomposition); necessity of reparameterization trick (sampling is non-differentiable); VAE vs AE difference (VAE learns distributions, AE learns encodings); why VAE generates blurry images (strong KL constraint + pixel-level MSE loss).

9. Diffusion Forward/Reverse Process Derivation

Derivation Steps:

Forward process (noising): q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

Distribution at any timestep: q(xₜ|x₀) = N(xₜ; √ᾱₜx₀, (1-ᾱₜ)I), where ᾱₜ = Πᵢ₌₁ᵗ(1-βᵢ)

Reverse process: pθ(xₜ₋₁|xₜ) = N(xₜ₋₁; μθ(xₜ,t), Σθ(xₜ,t))

Training objective (simplified): Lsimple = E[||ε - εθ(xₜ,t)||²], where xₜ = √ᾱₜx₀ + √(1-ᾱₜ)ε

The model learns to predict the added noise ε

Sampling: xₜ₋₁ = (1/√αₜ)(xₜ - βₜ/√(1-ᾱₜ)·εθ(xₜ,t)) + σₜz, z~N(0,I)

Assessment points: Reparameterization derivation of q(xₜ|x₀) (using Gaussian additivity); why predict noise instead of x₀ (simpler parameterization, more stable training); DDPM vs DDIM difference (DDIM is deterministic, skip-step sampling for acceleration); Classifier-Free Guidance principle (linear combination of conditional and unconditional model outputs).

10. KL Divergence Derivation

Derivation Steps:

KL divergence definition: KL(P||Q) = ∫P(x)log(P(x)/Q(x))dx = E_P[log P(x) - log Q(x)]

Non-negativity proof (using Jensen's inequality): KL(P||Q) = -E_P[log(Q(x)/P(x))] ≥ -log E_P[Q(x)/P(x)] = -log 1 = 0

Discrete form: KL(P||Q) = Σ Pᵢ log(Pᵢ/Qᵢ)

KL between Gaussians: KL(N(μ₁,σ₁²)||N(μ₂,σ₂²)) = log(σ₂/σ₁) + (σ₁²+(μ₁-μ₂)²)/(2σ₂²) - 1/2

Relationship with cross-entropy: KL(P||Q) = H(P,Q) - H(P) = Cross-entropy - Entropy

Assessment points: KL divergence asymmetry (KL(P||Q)≠KL(Q||P)); forward KL vs reverse KL behavioral differences (forward KL → mean-seeking, reverse KL → mode-seeking); JS divergence advantages (symmetric, bounded); intuition for KL regularization in VAE (making q close to P, ensuring generation quality).

Key Takeaways

The most important thing about hand-deriving formulas is not memorizing the final result, but understanding the motivation for each step. Interviewers hate "memorized derivations"—you write a bunch of formulas but can't explain why you're deriving it that way. I recommend asking yourself "why am I doing this" at each step when practicing. For example, the motivation for converting SVM to dual is "introducing kernel functions"; the motivation for using Jensen's inequality in EM is "moving the summation outside the log."

My second piece of advice is to derive core steps first, then fill in details. Interview time is limited—you can't write every step in detail. I recommend writing key steps and core formulas first, then expanding on details if the interviewer asks follow-up questions. For backpropagation, write the δ recurrence formula first, then expand the specific derivation for each layer.

My third piece of advice is to build connections between formulas. Many formulas are interconnected—MAP + Gaussian prior = L2 regularization, the E-step of EM is Bayesian posterior, VAE's ELBO is a generalization of EM. Once you understand these connections, derivations become natural and don't require rote memorization.

FAQ

Q: How detailed should hand derivations be?

A: Core derivation steps must be complete; auxiliary expansions can be omitted. For example, in SVM dual derivation, the transformation from primal to dual must be written clearly, but matrix multiplication expansion can be explained verbally.

Q: What if I get stuck during derivation?

A: Don't panic—state your approach first, and the interviewer will usually give hints. For example, if you forget a derivative, you can say "I don't remember this derivative exactly, but the approach is..." Interviewers value thinking over memorization.

Q: What math foundations should I prepare?

A: Matrix calculus (mandatory), probability fundamentals (Bayes, expectation, variance), optimization fundamentals (gradient descent, Lagrange multipliers), information theory fundamentals (entropy, KL divergence). I recommend reviewing these math foundations first.

Q: Which formulas are most frequently tested?

A: By frequency: logistic regression gradient > SVM dual > backpropagation > Bayes/MAP > EM algorithm > Attention > KL divergence > VAE > Diffusion > linear regression. The first 5 are almost mandatory; the last 5 depend on the position.

Q: Should I practice on paper or whiteboard?

A: Both are possible. I recommend practicing both—paper derivation is more precise but space-limited, whiteboard has more space but handwriting is harder to read. The key is to write and explain simultaneously, keeping the interviewer following your thought process.

#AI Algorithms#Formula Derivation#SVM#Backpropagation#EM Algorithms#VAE#Diffusion#面试推导