10 Algorithms Frequently Tested in AI Coding Interviews: From Gradient Descent to Attention Implementation

AI Coding QuestionsAuthor: BeautyResume Team

Detailed guide for 10 frequently tested AI coding interview problems: linear regression, logistic regression, K-Means, KNN, decision tree split, backpropagation, Self-Attention, Layer Norm, Beam Search, RAG pipeline, each with assessment points, code framework, and important notes

Background

AI interview coding questions differ from standard algorithm problems — many require implementing core ML/DL algorithms from scratch. During my fall recruitment, I was asked to implement linear regression, Self-Attention, K-Means, and more. Initially, I struggled, but after spending a week practicing intensively, I trained all 10 commonly tested algorithms to muscle memory. Today I'm organizing these 10 algorithms, each with assessment points, code frameworks, and important notes.

All code in this article uses Python/NumPy without deep learning frameworks. Interviews generally require NumPy implementation, not PyTorch.

1. Linear Regression from Scratch (Gradient Descent)

Assessment Points: Gradient descent principles, loss function derivation, learning rate selection, feature normalization

Code Framework:

class LinearRegression:

  def __init__(self, lr=0.01, n_iters=1000):

    self.lr = lr

    self.n_iters = n_iters

  def fit(self, X, y):

    n_samples, n_features = X.shape

    self.w = np.zeros(n_features)

    self.b = 0

    for _ in range(self.n_iters):

      y_pred = X @ self.w + self.b

      dw = (2 / n_samples) * (X.T @ (y_pred - y))

      db = (2 / n_samples) * np.sum(y_pred - y)

      self.w -= self.lr * dw

      self.b -= self.lr * db

  def predict(self, X):

    return X @ self.w + self.b

Important Notes: ① Features must be normalized, otherwise gradient descent converges slowly; ② Learning rate too large = no convergence, too small = slow convergence, plot loss curves to debug; ③ Interviewers may ask about closed-form solution (w = (X^T X)^{-1} X^T y), know when it's not applicable (X^T X non-invertible, data too large).

2. Logistic Regression from Scratch

Assessment Points: Sigmoid function, cross-entropy loss, gradient computation, multi-class extension

Code Framework:

class LogisticRegression:

  def __init__(self, lr=0.01, n_iters=1000):

    self.lr = lr

    self.n_iters = n_iters

  def sigmoid(self, z):

    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

  def fit(self, X, y):

    n_samples, n_features = X.shape

    self.w = np.zeros(n_features)

    self.b = 0

    for _ in range(self.n_iters):

      z = X @ self.w + self.b

      y_pred = self.sigmoid(z)

      dw = (1 / n_samples) * (X.T @ (y_pred - y))

      db = (1 / n_samples) * np.sum(y_pred - y)

      self.w -= self.lr * dw

      self.b -= self.lr * db

  def predict(self, X):

    return (self.sigmoid(X @ self.w + self.b) >= 0.5).astype(int)

Important Notes: ① Sigmoid needs clipping to prevent numerical overflow; ② Must be able to hand-derive cross-entropy loss gradient; ③ Interviewers may ask about multi-class (Softmax regression).

3. K-Means from Scratch

Assessment Points: Clustering principles, initialization strategies, convergence criteria, K selection

Code Framework:

class KMeans:

  def __init__(self, k=3, max_iters=100):

    self.k = k

    self.max_iters = max_iters

  def fit(self, X):

    idx = np.random.choice(len(X), self.k, replace=False)

    self.centroids = X[idx]

    for _ in range(self.max_iters):

      distances = np.linalg.norm(X[:, None] - self.centroids[None, :], axis=2)

      labels = np.argmin(distances, axis=1)

      new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(self.k)])

      if np.allclose(self.centroids, new_centroids):

        break

      self.centroids = new_centroids

    self.labels = labels

Important Notes: ① Random initialization may lead to local optima — interviewers will ask about K-Means++ initialization; ② How to handle empty clusters (re-initialize randomly); ③ K selection: elbow method, silhouette coefficient.

4. KNN from Scratch

Assessment Points: Distance metrics, K selection, voting strategies, time complexity

Code Framework:

class KNN:

  def __init__(self, k=3):

    self.k = k

  def fit(self, X, y):

    self.X_train = X

    self.y_train = y

  def predict(self, X):

    predictions = []

    for x in X:

      distances = np.linalg.norm(self.X_train - x, axis=1)

      k_indices = np.argsort(distances)[:self.k]

      k_labels = self.y_train[k_indices]

      predictions.append(np.bincount(k_labels).argmax())

    return np.array(predictions)

Important Notes: ① Prediction time complexity O(ndk), n = training samples, d = dimensions; ② Interviewers may ask about KD-tree acceleration; ③ K too small = overfitting, K too large = underfitting; ④ Distance-weighted KNN (closer = higher weight).

5. Decision Tree Split from Scratch

Assessment Points: Information gain/gain ratio/Gini index, split strategies, continuous value handling

Code Framework (using Gini index):

def gini(y):

  classes, counts = np.unique(y, return_counts=True)

  probs = counts / len(y)

  return 1 - np.sum(probs ** 2)

def best_split(X, y, feature_idx):

  values = np.sort(np.unique(X[:, feature_idx]))

  best_gini, best_threshold = 1, None

  for i in range(len(values) - 1):

    threshold = (values[i] + values[i + 1]) / 2

    left_mask = X[:, feature_idx] <= threshold

    gini_split = (left_mask.sum() * gini(y[left_mask]) + (~left_mask).sum() * gini(y[~left_mask])) / len(y)

    if gini_split < best_gini:

      best_gini, best_threshold = gini_split, threshold

  return best_threshold, best_gini

Important Notes: ① ID3 uses information gain, C4.5 uses gain ratio, CART uses Gini index; ② Information gain favors multi-value features, gain ratio corrects this; ③ Continuous features need sorting then iterating candidate split points.

6. Backpropagation from Scratch

Assessment Points: Forward propagation, loss computation, backpropagation, gradient updates

Code Framework (single hidden layer network):

class NeuralNetwork:

  def __init__(self, input_size, hidden_size, output_size, lr=0.01):

    self.W1 = np.random.randn(input_size, hidden_size) * 0.01

    self.b1 = np.zeros(hidden_size)

    self.W2 = np.random.randn(hidden_size, output_size) * 0.01

    self.b2 = np.zeros(output_size)

    self.lr = lr

  def relu(self, z):

    return np.maximum(0, z)

  def relu_grad(self, z):

    return (z > 0).astype(float)

  def forward(self, X):

    self.z1 = X @ self.W1 + self.b1

    self.a1 = self.relu(self.z1)

    self.z2 = self.a1 @ self.W2 + self.b2

    return self.z2

  def backward(self, X, y, y_pred):

    m = X.shape[0]

    dz2 = (y_pred - y) / m

    dW2 = self.a1.T @ dz2

    db2 = np.sum(dz2, axis=0)

    da1 = dz2 @ self.W2.T

    dz1 = da1 * self.relu_grad(self.z1)

    dW1 = X.T @ dz1

    db1 = np.sum(dz1, axis=0)

    self.W2 -= self.lr * dW2

    self.b2 -= self.lr * db2

    self.W1 -= self.lr * dW1

    self.b1 -= self.lr * db1

Important Notes: ① Weight initialization uses small random values, not zero (symmetry problem); ② Gradients must be divided by batch size; ③ Interviewers may ask about vanishing gradients and solutions.

7. Self-Attention from Scratch

Assessment Points: QKV computation, scaled dot-product attention, multi-head attention, masking

Code Framework:

def self_attention(X, W_q, W_k, W_v, mask=None):

  Q = X @ W_q

  K = X @ W_k

  V = X @ W_v

  d_k = Q.shape[-1]

  scores = Q @ K.T / np.sqrt(d_k)

  if mask is not None:

    scores = np.where(mask == 0, -1e9, scores)

  attn_weights = np.exp(scores - scores.max(axis=-1, keepdims=True))

  attn_weights = attn_weights / attn_weights.sum(axis=-1, keepdims=True)

  output = attn_weights @ V

  return output, attn_weights

Important Notes: ① Divide by sqrt(d_k) to prevent large dot products causing Softmax gradient vanishing; ② Softmax numerical stability (subtract max); ③ Mask used in Decoder to prevent seeing future information; ④ Multi-head attention is multiple Self-Attentions concatenated then linearly transformed.

8. Layer Norm from Scratch

Assessment Points: Normalization principles, difference from Batch Norm, why Transformer uses Layer Norm

Code Framework:

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):

  mean = x.mean(axis=-1, keepdims=True)

  var = x.var(axis=-1, keepdims=True)

  x_norm = (x - mean) / np.sqrt(var + eps)

  return gamma * x_norm + beta

Important Notes: ① Layer Norm normalizes along feature dimension, Batch Norm along batch dimension; ② Transformer uses Layer Norm because variable sequence lengths make Batch Norm unstable; ③ gamma and beta are learnable parameters; ④ RMSNorm is a simplified Layer Norm (no mean subtraction).

9. Beam Search from Scratch

Assessment Points: Decoding strategies, Beam Size selection, comparison with greedy search/sampling

Code Framework:

def beam_search(model_init_state, decode_step, beam_width, max_len, start_token, end_token):

  sequences = [[[], 0.0, model_init_state]]

  for _ in range(max_len):

    all_candidates = []

    for seq, score, state in sequences:

      if seq and seq[-1] == end_token:

        all_candidates.append((seq, score, state))

        continue

      logits, new_state = decode_step(state, seq[-1] if seq else start_token)

      log_probs = logits - np.max(logits)

      log_probs = log_probs - np.log(np.sum(np.exp(log_probs)))

      top_indices = np.argsort(log_probs)[-beam_width:]

      for idx in top_indices:

        all_candidates.append((seq + [idx], score + log_probs[idx], new_state))

    sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_width]

  return sequences[0][0]

Important Notes: ① Use log probability addition instead of probability multiplication to prevent numerical underflow; ② Larger Beam Size = more thorough search but slower; ③ Length penalty prevents bias toward short sequences; ④ Interviewers may ask about Sampling, Top-k, Top-p decoding strategies.

10. Simple RAG Pipeline from Scratch

Assessment Points: Document chunking, vectorization, similarity retrieval, prompt construction

Code Framework:

def simple_rag(query, documents, embed_fn, generate_fn, top_k=3):

  chunks = []

  for doc in documents:

    words = doc.split()

    for i in range(0, len(words), 200):

      chunk = " ".join(words[i:i+200])

      chunks.append(chunk)

  chunk_embeddings = np.array([embed_fn(c) for c in chunks])

  query_embedding = embed_fn(query)

  scores = chunk_embeddings @ query_embedding

  top_indices = np.argsort(scores)[-top_k:][::-1]

  retrieved = [chunks[i] for i in top_indices]

  context = "\n".join(retrieved)

  prompt = f"Based on the following context:\n{context}\n\nQuestion: {query}\nAnswer:"

  answer = generate_fn(prompt)

  return answer, retrieved

Important Notes: ① Document chunking strategies (fixed length vs semantic vs recursive); ② Retrieval uses cosine similarity (normalize then dot product); ③ Interviewers may ask about hybrid retrieval (vector + keyword), reranking, and handling cases where no relevant documents are retrieved.

Key Questions Summary

1. Linear regression from scratch (gradient descent)

2. Logistic regression from scratch

3. K-Means from scratch

4. KNN from scratch

5. Decision tree split from scratch

6. Backpropagation from scratch

7. Self-Attention from scratch

8. Layer Norm from scratch

9. Beam Search from scratch

10. Simple RAG pipeline from scratch

Insights and Advice

1. Write each algorithm at least 3 times: First time following along, second time from memory, third time under time pressure (within 30 minutes). Only muscle memory prevents freezing in interviews.

2. Pay attention to numerical stability: Softmax subtract max, Sigmoid clip, log probabilities instead of probability products — interviewers value these tricks.

3. Understanding beats memorization: Don't memorize code — understand why each step exists. Interviewers may ask you to modify code (e.g., switch MSE to MAE), and understanding enables flexibility.

4. Be proficient with matrix operations: AI coding questions heavily use matrix operations. NumPy broadcasting and dimension manipulation must be second nature. Draw dimensions on paper first to verify before coding.

5. Explain while writing: Don't silently write code in interviews — explain your thinking as you go. Interviewers value your thought process more than perfect final code.

FAQ

Q: What language should I use in interviews?
A: Most companies accept Python; some require C++. Python with NumPy is recommended.

Q: Does the code need to run?
A: Running is ideal, but interviewers value logical correctness more. If time is short, write the core logic.

Q: Can I use PyTorch?
A: Generally not allowed. Interviewers want to see your understanding of underlying operations — PyTorch's nn.Linear() defeats the purpose.

Q: What if I can't finish in time?
A: Write core logic first (e.g., Self-Attention's QKV computation and attention weights), explain auxiliary parts (Mask, Dropout) verbally.

Q: What other algorithms are commonly tested besides these 10?
A: NMS, IoU, convolution operations, pooling operations, Focal Loss, cross-entropy loss. Practice these too.

#代码题#Handwritten Algorithms#Gradient Descent#Self-Attention#K-Means#RAG#Beam Search