10 Algorithms Frequently Tested in AI Coding Interviews: From Gradient Descent to Attention Implementation
Detailed guide for 10 frequently tested AI coding interview problems: linear regression, logistic regression, K-Means, KNN, decision tree split, backpropagation, Self-Attention, Layer Norm, Beam Search, RAG pipeline, each with assessment points, code framework, and important notes
Background
AI interview coding questions differ from standard algorithm problems — many require implementing core ML/DL algorithms from scratch. During my fall recruitment, I was asked to implement linear regression, Self-Attention, K-Means, and more. Initially, I struggled, but after spending a week practicing intensively, I trained all 10 commonly tested algorithms to muscle memory. Today I'm organizing these 10 algorithms, each with assessment points, code frameworks, and important notes.
All code in this article uses Python/NumPy without deep learning frameworks. Interviews generally require NumPy implementation, not PyTorch.
1. Linear Regression from Scratch (Gradient Descent)
Assessment Points: Gradient descent principles, loss function derivation, learning rate selection, feature normalization
Code Framework:
class LinearRegression:
def __init__(self, lr=0.01, n_iters=1000):
self.lr = lr
self.n_iters = n_iters
def fit(self, X, y):
n_samples, n_features = X.shape
self.w = np.zeros(n_features)
self.b = 0
for _ in range(self.n_iters):
y_pred = X @ self.w + self.b
dw = (2 / n_samples) * (X.T @ (y_pred - y))
db = (2 / n_samples) * np.sum(y_pred - y)
self.w -= self.lr * dw
self.b -= self.lr * db
def predict(self, X):
return X @ self.w + self.b
Important Notes: ① Features must be normalized, otherwise gradient descent converges slowly; ② Learning rate too large = no convergence, too small = slow convergence, plot loss curves to debug; ③ Interviewers may ask about closed-form solution (w = (X^T X)^{-1} X^T y), know when it's not applicable (X^T X non-invertible, data too large).
2. Logistic Regression from Scratch
Assessment Points: Sigmoid function, cross-entropy loss, gradient computation, multi-class extension
Code Framework:
class LogisticRegression:
def __init__(self, lr=0.01, n_iters=1000):
self.lr = lr
self.n_iters = n_iters
def sigmoid(self, z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def fit(self, X, y):
n_samples, n_features = X.shape
self.w = np.zeros(n_features)
self.b = 0
for _ in range(self.n_iters):
z = X @ self.w + self.b
y_pred = self.sigmoid(z)
dw = (1 / n_samples) * (X.T @ (y_pred - y))
db = (1 / n_samples) * np.sum(y_pred - y)
self.w -= self.lr * dw
self.b -= self.lr * db
def predict(self, X):
return (self.sigmoid(X @ self.w + self.b) >= 0.5).astype(int)
Important Notes: ① Sigmoid needs clipping to prevent numerical overflow; ② Must be able to hand-derive cross-entropy loss gradient; ③ Interviewers may ask about multi-class (Softmax regression).
3. K-Means from Scratch
Assessment Points: Clustering principles, initialization strategies, convergence criteria, K selection
Code Framework:
class KMeans:
def __init__(self, k=3, max_iters=100):
self.k = k
self.max_iters = max_iters
def fit(self, X):
idx = np.random.choice(len(X), self.k, replace=False)
self.centroids = X[idx]
for _ in range(self.max_iters):
distances = np.linalg.norm(X[:, None] - self.centroids[None, :], axis=2)
labels = np.argmin(distances, axis=1)
new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(self.k)])
if np.allclose(self.centroids, new_centroids):
break
self.centroids = new_centroids
self.labels = labels
Important Notes: ① Random initialization may lead to local optima — interviewers will ask about K-Means++ initialization; ② How to handle empty clusters (re-initialize randomly); ③ K selection: elbow method, silhouette coefficient.
4. KNN from Scratch
Assessment Points: Distance metrics, K selection, voting strategies, time complexity
Code Framework:
class KNN:
def __init__(self, k=3):
self.k = k
def fit(self, X, y):
self.X_train = X
self.y_train = y
def predict(self, X):
predictions = []
for x in X:
distances = np.linalg.norm(self.X_train - x, axis=1)
k_indices = np.argsort(distances)[:self.k]
k_labels = self.y_train[k_indices]
predictions.append(np.bincount(k_labels).argmax())
return np.array(predictions)
Important Notes: ① Prediction time complexity O(ndk), n = training samples, d = dimensions; ② Interviewers may ask about KD-tree acceleration; ③ K too small = overfitting, K too large = underfitting; ④ Distance-weighted KNN (closer = higher weight).
5. Decision Tree Split from Scratch
Assessment Points: Information gain/gain ratio/Gini index, split strategies, continuous value handling
Code Framework (using Gini index):
def gini(y):
classes, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
return 1 - np.sum(probs ** 2)
def best_split(X, y, feature_idx):
values = np.sort(np.unique(X[:, feature_idx]))
best_gini, best_threshold = 1, None
for i in range(len(values) - 1):
threshold = (values[i] + values[i + 1]) / 2
left_mask = X[:, feature_idx] <= threshold
gini_split = (left_mask.sum() * gini(y[left_mask]) + (~left_mask).sum() * gini(y[~left_mask])) / len(y)
if gini_split < best_gini:
best_gini, best_threshold = gini_split, threshold
return best_threshold, best_gini
Important Notes: ① ID3 uses information gain, C4.5 uses gain ratio, CART uses Gini index; ② Information gain favors multi-value features, gain ratio corrects this; ③ Continuous features need sorting then iterating candidate split points.
6. Backpropagation from Scratch
Assessment Points: Forward propagation, loss computation, backpropagation, gradient updates
Code Framework (single hidden layer network):
class NeuralNetwork:
def __init__(self, input_size, hidden_size, output_size, lr=0.01):
self.W1 = np.random.randn(input_size, hidden_size) * 0.01
self.b1 = np.zeros(hidden_size)
self.W2 = np.random.randn(hidden_size, output_size) * 0.01
self.b2 = np.zeros(output_size)
self.lr = lr
def relu(self, z):
return np.maximum(0, z)
def relu_grad(self, z):
return (z > 0).astype(float)
def forward(self, X):
self.z1 = X @ self.W1 + self.b1
self.a1 = self.relu(self.z1)
self.z2 = self.a1 @ self.W2 + self.b2
return self.z2
def backward(self, X, y, y_pred):
m = X.shape[0]
dz2 = (y_pred - y) / m
dW2 = self.a1.T @ dz2
db2 = np.sum(dz2, axis=0)
da1 = dz2 @ self.W2.T
dz1 = da1 * self.relu_grad(self.z1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0)
self.W2 -= self.lr * dW2
self.b2 -= self.lr * db2
self.W1 -= self.lr * dW1
self.b1 -= self.lr * db1
Important Notes: ① Weight initialization uses small random values, not zero (symmetry problem); ② Gradients must be divided by batch size; ③ Interviewers may ask about vanishing gradients and solutions.
7. Self-Attention from Scratch
Assessment Points: QKV computation, scaled dot-product attention, multi-head attention, masking
Code Framework:
def self_attention(X, W_q, W_k, W_v, mask=None):
Q = X @ W_q
K = X @ W_k
V = X @ W_v
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
if mask is not None:
scores = np.where(mask == 0, -1e9, scores)
attn_weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn_weights = attn_weights / attn_weights.sum(axis=-1, keepdims=True)
output = attn_weights @ V
return output, attn_weights
Important Notes: ① Divide by sqrt(d_k) to prevent large dot products causing Softmax gradient vanishing; ② Softmax numerical stability (subtract max); ③ Mask used in Decoder to prevent seeing future information; ④ Multi-head attention is multiple Self-Attentions concatenated then linearly transformed.
8. Layer Norm from Scratch
Assessment Points: Normalization principles, difference from Batch Norm, why Transformer uses Layer Norm
Code Framework:
def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)
return gamma * x_norm + beta
Important Notes: ① Layer Norm normalizes along feature dimension, Batch Norm along batch dimension; ② Transformer uses Layer Norm because variable sequence lengths make Batch Norm unstable; ③ gamma and beta are learnable parameters; ④ RMSNorm is a simplified Layer Norm (no mean subtraction).
9. Beam Search from Scratch
Assessment Points: Decoding strategies, Beam Size selection, comparison with greedy search/sampling
Code Framework:
def beam_search(model_init_state, decode_step, beam_width, max_len, start_token, end_token):
sequences = [[[], 0.0, model_init_state]]
for _ in range(max_len):
all_candidates = []
for seq, score, state in sequences:
if seq and seq[-1] == end_token:
all_candidates.append((seq, score, state))
continue
logits, new_state = decode_step(state, seq[-1] if seq else start_token)
log_probs = logits - np.max(logits)
log_probs = log_probs - np.log(np.sum(np.exp(log_probs)))
top_indices = np.argsort(log_probs)[-beam_width:]
for idx in top_indices:
all_candidates.append((seq + [idx], score + log_probs[idx], new_state))
sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_width]
return sequences[0][0]
Important Notes: ① Use log probability addition instead of probability multiplication to prevent numerical underflow; ② Larger Beam Size = more thorough search but slower; ③ Length penalty prevents bias toward short sequences; ④ Interviewers may ask about Sampling, Top-k, Top-p decoding strategies.
10. Simple RAG Pipeline from Scratch
Assessment Points: Document chunking, vectorization, similarity retrieval, prompt construction
Code Framework:
def simple_rag(query, documents, embed_fn, generate_fn, top_k=3):
chunks = []
for doc in documents:
words = doc.split()
for i in range(0, len(words), 200):
chunk = " ".join(words[i:i+200])
chunks.append(chunk)
chunk_embeddings = np.array([embed_fn(c) for c in chunks])
query_embedding = embed_fn(query)
scores = chunk_embeddings @ query_embedding
top_indices = np.argsort(scores)[-top_k:][::-1]
retrieved = [chunks[i] for i in top_indices]
context = "\n".join(retrieved)
prompt = f"Based on the following context:\n{context}\n\nQuestion: {query}\nAnswer:"
answer = generate_fn(prompt)
return answer, retrieved
Important Notes: ① Document chunking strategies (fixed length vs semantic vs recursive); ② Retrieval uses cosine similarity (normalize then dot product); ③ Interviewers may ask about hybrid retrieval (vector + keyword), reranking, and handling cases where no relevant documents are retrieved.
Key Questions Summary
1. Linear regression from scratch (gradient descent)
2. Logistic regression from scratch
3. K-Means from scratch
4. KNN from scratch
5. Decision tree split from scratch
6. Backpropagation from scratch
7. Self-Attention from scratch
8. Layer Norm from scratch
9. Beam Search from scratch
10. Simple RAG pipeline from scratch
Insights and Advice
1. Write each algorithm at least 3 times: First time following along, second time from memory, third time under time pressure (within 30 minutes). Only muscle memory prevents freezing in interviews.
2. Pay attention to numerical stability: Softmax subtract max, Sigmoid clip, log probabilities instead of probability products — interviewers value these tricks.
3. Understanding beats memorization: Don't memorize code — understand why each step exists. Interviewers may ask you to modify code (e.g., switch MSE to MAE), and understanding enables flexibility.
4. Be proficient with matrix operations: AI coding questions heavily use matrix operations. NumPy broadcasting and dimension manipulation must be second nature. Draw dimensions on paper first to verify before coding.
5. Explain while writing: Don't silently write code in interviews — explain your thinking as you go. Interviewers value your thought process more than perfect final code.
FAQ
Q: What language should I use in interviews?
A: Most companies accept Python; some require C++. Python with NumPy is recommended.
Q: Does the code need to run?
A: Running is ideal, but interviewers value logical correctness more. If time is short, write the core logic.
Q: Can I use PyTorch?
A: Generally not allowed. Interviewers want to see your understanding of underlying operations — PyTorch's nn.Linear() defeats the purpose.
Q: What if I can't finish in time?
A: Write core logic first (e.g., Self-Attention's QKV computation and attention weights), explain auxiliary parts (Mask, Dropout) verbally.
Q: What other algorithms are commonly tested besides these 10?
A: NMS, IoU, convolution operations, pooling operations, Focal Loss, cross-entropy loss. Practice these too.