20 Must-Know LLM Interview Questions: Complete Coverage from Transformer to RLHF

Interview Topics | June 15, 2025 | Author: BeautyResume Team

Complete coverage of 20 high-frequency LLM interview questions: Transformer fundamentals, pre-training, fine-tuning (SFT/LoRA/RLHF/DPO), inference (KV Cache/quantization/speculative decoding), and applications (RAG/Agent/Prompt Engineering), with assessment points and answer directions.

Background

To be honest, when I first started preparing for LLM interviews, I was completely lost. There was too much scattered material online—some went incredibly deep into Transformer theory, while others only covered the application layer. I had no idea what interviewers would actually ask. After interviewing at about seven or eight companies working on LLMs, ranging from small startups to big tech research institutes, I gradually figured out a pattern: the core assessment areas for LLM interviews are limited to just a few domains, and the high-frequency questions in each domain are equally limited. Today I'm compiling these 20 must-know questions to help anyone currently preparing for interviews.

Interview Process Review

The interview processes at the companies I interviewed with were fairly similar: resume screening first, then a first-round technical interview (mainly testing fundamentals), a second-round deep-dive interview (testing projects + derivations), and a third round that might be a cross-interview or system design interview. A distinctive feature of LLM technical interviews is that fundamental knowledge carries a huge weight, unlike traditional development roles that primarily test project experience. Interviewers start from Transformer principles and work their way through training, fine-tuning, inference, and applications—essentially testing every stage. My most memorable experience was a second-round interview at a major tech company where the interviewer fired fundamental questions at me for 40 minutes straight, from Self-Attention all the way to DPO, with zero project-related discussion in between. Pure fundamentals. So if your foundation isn't solid, it's really hard to pass.

Question Collection

1. Transformer Fundamentals (4 Questions)

1. What is the principle of Self-Attention? Why is it better than RNN?

Assessment point: Understanding the core computation flow of the attention mechanism and its advantages/disadvantages compared to sequential models.

Answer direction: The core of Self-Attention is the QKV computation. The input sequence is multiplied by Wq, Wk, and Wv to obtain the query, key, and value matrices. The dot product of Q and K, scaled by 1/√d_k and passed through softmax, gives the attention weights, which are multiplied by V to produce the output: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Compared to RNN, the biggest advantages are parallel computation and long-range dependency modeling. An RNN must process a sequence step by step, while Self-Attention computes relationships across all positions at once, making training much faster. In addition, an RNN's long-range signal decays as it passes through many recurrent steps, whereas Attention connects any two positions directly in a single step. The disadvantage is that time and memory complexity are O(n²) in the sequence length, which can be mitigated with sparse or sliding-window attention.
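
To make the QKV flow above concrete, here is a minimal single-head sketch in NumPy. The shapes, the random weights, and the `causal` flag are illustrative assumptions, not taken from any specific model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (seq_len, seq_len): the O(n^2) part
    if causal:
        # Mask out future positions so token i only attends to tokens <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8                    # toy sizes (assumption)
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v, causal=True)
print(out.shape, weights.sum(axis=-1))
```

Every row of `weights` sums to 1, and the whole score matrix is computed in one matrix product, which is where both the parallelism and the O(n²) cost come from.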

2. What are the types of positional encoding? Why is RoPE effective?

Assessment point: Understanding how positional information is injected and the advantages of Rotary Position Embedding.

Answer direction: Positional encoding mainly includes absolute positional encoding (sinusoidal, learnable) and relative positional encoding (RoPE, ALiBi). RoPE's core idea is to rotate Q and K by an angle proportional to their position, so that their dot product depends only on the relative offset between the two positions. RoPE's advantages: relative position awareness (relative distances are modeled naturally); computational efficiency (no additional positional embedding parameters); and context extension (plain RoPE degrades beyond the training length, but position interpolation, NTK-aware scaling, or YaRN can extend it to longer contexts with little fine-tuning). Most mainstream LLMs now use RoPE.
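
The relative-position property is easy to check numerically. The sketch below rotates adjacent dimension pairs, which is one common RoPE layout; the vector size and the positions are arbitrary assumptions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each (even, odd) pair of a 1-D vector by position * theta_i."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # theta_i = base^(-2i/d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (2) at very different absolute positions.
score_a = rope(q, 3) @ rope(k, 1)
score_b = rope(q, 103) @ rope(k, 101)
print(score_a, score_b)
```

The two scores agree up to floating-point error, because the product of the two rotations only depends on the position difference.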

3. What is the role of multi-head attention? How do you choose the number of heads?

Assessment point: Understanding the motivation behind the multi-head mechanism and design choices.

Answer direction: Multi-head attention allows the model to attend to different information patterns in different subspaces, similar to multi-channel CNNs. Each head learns different attention distributions—some focus on local syntax, others on long-range semantics. The number of heads is typically dimension / head dimension, e.g., 768 dimensions with 12 heads (64 per head). More heads isn't always better—too many leads to insufficient dimension per head and reduced expressiveness. GQA (Grouped Query Attention) is a recent improvement where multiple Q heads share K/V heads, reducing KV Cache overhead during inference.
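
A rough sketch of the head arithmetic and of GQA, where each group of query heads reads the same K/V head. The 12 query heads and 4 KV heads are illustrative numbers, not a specific model's configuration.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head)."""
    n_q_heads, _, d_head = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each K/V head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

d_model, n_q_heads, n_kv_heads, seq = 768, 12, 4, 6
d_head = d_model // n_q_heads                  # 768 / 12 = 64
rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, seq, d_head))
k = rng.normal(size=(n_kv_heads, seq, d_head))
v = rng.normal(size=(n_kv_heads, seq, d_head))
out = grouped_query_attention(q, k, v)
kv_cache_ratio = n_kv_heads / n_q_heads        # cache size relative to full MHA
print(out.shape, kv_cache_ratio)
```

Only the un-repeated `k` and `v` need to be cached, so in this toy setup the KV Cache is one third the size of standard multi-head attention.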

4. What is the role of the FFN layer? Why use GLU variants?

Assessment point: Understanding the role of the feed-forward network and the advantages of GLU activation.

Answer direction: FFN is often described as the "memory" module in Transformers: Attention handles information routing, while FFN handles information processing and storage. The standard FFN uses two linear transformations with an activation in between (ReLU in the original Transformer). GLU variants such as SwiGLU add a gating mechanism: FFN(x) = (Swish(xW1) ⊙ xW2)W3, where ⊙ is element-wise multiplication and W3 is the output projection. GLU's advantages: stronger expressiveness, since the gate can dynamically filter information; and better empirical results, since the smooth Swish activation tends to perform better than ReLU in deep networks. Models like LLaMA and Mistral use SwiGLU.
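
As a small illustration of the gating formula, here is a SwiGLU feed-forward block in NumPy. The weight names `w_gate`, `w_up`, `w_down` and the toy sizes are my own assumptions; `w_down` is the output projection that maps back to the model dimension.

```python
import numpy as np

def swish(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """(Swish(x W_gate) * (x W_up)) W_down, with * as element-wise product."""
    gate = swish(x @ w_gate)      # decides how much of each hidden unit passes
    up = x @ w_up                 # the candidate hidden values
    return (gate * up) @ w_down   # project back to d_model

rng = np.random.default_rng(0)
seq, d_model, d_ff = 3, 16, 32    # toy sizes (assumption)
x = rng.normal(size=(seq, d_model))
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)
```

Note that the gated variant has three weight matrices instead of two, which is why SwiGLU models usually shrink the hidden size to keep the parameter count comparable.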

2. Pre-training (4 Questions)

5. What is the pre-training objective function? Why use CLM instead of MLM?

Assessment point: Understanding the choice of language modeling objectives.

Answer direction: Pre-training mainly has two objectives: CLM (Causal Language Modeling, autoregressive next-token prediction) and MLM (Masked Language Modeling, predicting masked tokens). Mainstream LLMs now use CLM because: it is naturally aligned with generation tasks, and downstream applications are primarily generative; it has higher training data utilization, since every token serves as a label; and it scales better, with clearer scaling laws. MLM sees bidirectional context, but it only predicts the masked positions (about 15% of tokens in BERT) and has a gap with generative tasks, so it is now mainly used for encoder models like BERT.
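
The difference in label density can be seen by building the training targets by hand. This is a simplified sketch: the token ids are made up, `-100` follows the ignore-index convention used by PyTorch's cross-entropy loss, and BERT's 80/10/10 replacement rule is deliberately left out.

```python
import random

IGNORE = -100  # positions with this label are excluded from the loss

def clm_example(token_ids):
    """Next-token prediction: every position predicts the following token."""
    return token_ids[:-1], token_ids[1:]

def mlm_example(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Masked LM: only masked positions carry a label."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels

tokens = [101, 7, 8, 9, 102]                      # hypothetical token ids
clm_inputs, clm_labels = clm_example(tokens)
mlm_inputs, mlm_labels = mlm_example(tokens, mask_id=0, mask_prob=0.5)
print(clm_inputs, clm_labels)
print(mlm_inputs, mlm_labels)
```

In the CLM pair every input position has a real label, while in the MLM pair most labels are `IGNORE`.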

6. What are the key steps in pre-training data cleaning?

Assessment point: Understanding the impact of data quality on model performance.

Answer direction: Data cleaning is the most easily underestimated part of LLM training. Key steps include: deduplication (MinHash/LSH dedup to avoid memorization effects); quality filtering (scoring with small models, filtering low-quality web pages); harmful content filtering (safety classifiers to filter violent/sexual content); PII removal (anonymizing personally identifiable information); language identification (filtering non-target language data); format cleaning (removing HTML tags, template text, etc.). Data quality largely determines the model's upper bound, and in practice the cleaning pipeline is often more engineering work than the model architecture itself.
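
For the deduplication step, a minimal MinHash sketch using only the standard library looks like this. The shingle size, the number of hashes, and the example documents are assumptions; a production pipeline would add LSH banding so that not every pair has to be compared.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams used as the document's features."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(), "big")
            for item in items))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = doc_a + " today"
doc_c = "large language models are trained on web scale text corpora"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_dup = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
print(near_dup, unrelated)
```

Documents whose estimated similarity exceeds a chosen threshold (0.8 is a common starting point) are treated as near-duplicates and only one copy is kept.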

7. How to troubleshoot training instability? How to handle loss spikes?

Assessment point: Understanding engineering challenges in LLM training.

Answer direction: Training instability is one of the most frustrating problems in LLM training. Troubleshooting approach: check data (any dirty data or abnormal distributions); check gradients (monitor gradient norms for explosion/vanishing); check learning rate (is warmup sufficient, is the peak too high); check precision (any overflow in mixed precision training; BF16 is less prone to overflow than FP16). For loss spikes: the most direct approach is to roll back to a stable checkpoint from before the spike and resume training while skipping the data batches around it; you can also reduce the learning rate or tighten gradient clipping (lower the clipping threshold). For reference, the LLaMA paper reports a cosine learning rate schedule with 2,000 warmup steps and gradient clipping at 1.0.
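
Two of the checks above, gradient norm clipping and loss spike detection, fit in a few lines. This is a toy sketch: the window, the 1.5x threshold, and the loss curve are made-up values, and real training code would use the framework's own clipping utility.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down together if their global L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

def find_loss_spikes(losses, window=5, factor=1.5):
    """Flag steps whose loss exceeds `factor` times the trailing-window average."""
    spikes = []
    for step in range(window, len(losses)):
        baseline = sum(losses[step - window:step]) / window
        if losses[step] > factor * baseline:
            spikes.append(step)
    return spikes

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
losses = [2.5, 2.4, 2.4, 2.3, 2.3, 2.2, 6.8, 2.3, 2.2, 2.2]
spikes = find_loss_spikes(losses)
print(norm_before, clipped, spikes)
```

Logging `norm_before` at every step is the useful part: a spike in the gradient norm usually shows up before, or together with, the spike in the loss.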

8. How to implement long context? What are the technical approaches?

Assessment point: Understanding technical routes for long context extension.

Answer direction: Long context mainly has three approaches: training-time extension (directly training with longer sequences, which costs the most but gives the best results); positional encoding extrapolation (NTK-aware interpolation, YaRN, etc., extending the context window by adjusting RoPE's frequency base); inference-time extension (StreamingLLM with attention sinks, which keeps only the sink tokens and a local window in the KV Cache). In practice, these are often combined: pre-train on 4K, extend to 32K by adjusting the RoPE base and continuing pre-training, then fine-tune on long data. Code Llama documents a similar recipe (raising the RoPE base from 10,000 to 1,000,000 and fine-tuning on 16K-token sequences); how GPT-4 reaches its 128K context has not been publicly disclosed.
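
The "adjust RoPE's frequency base" idea can be sketched with the commonly cited NTK-aware formula base' = base · s^(d/(d-2)), where s is the context scale factor. Treat the formula and the numbers here as an assumption about one popular variant, not as the exact method of any particular model.

```python
def rope_frequencies(d_head, base=10000.0):
    """theta_i = base^(-2i/d) for each rotated dimension pair."""
    return [base ** (-2 * i / d_head) for i in range(d_head // 2)]

def ntk_scaled_base(base, scale, d_head):
    """NTK-aware base adjustment for extending context by a factor `scale`."""
    return base * scale ** (d_head / (d_head - 2))

d_head, scale = 128, 8.0                     # e.g. 4K -> 32K (illustrative)
original = rope_frequencies(d_head)
scaled = rope_frequencies(d_head, ntk_scaled_base(10000.0, scale, d_head))

# Highest frequency is untouched; lowest frequency is slowed down by 1/scale.
print(scaled[0] / original[0], scaled[-1] / original[-1])
```

The high-frequency dimensions that encode local order stay the same, while the low-frequency dimensions are stretched so that positions up to 8x further away stay within the rotation range seen in training.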

3. Fine-tuning (4 Questions)

9. How to construct SFT data? What are the pitfalls?

Assessment point: Understanding practical details of instruction fine-tuning.

Answer direction: The core of SFT data construction is diversity and quality. Data sources include: human annotation, GPT-4 generation, and open-source dataset cleaning. Key pitfalls: format consistency (data from different sources must be unified in format, otherwise the model gets confused); length distribution (can't all be short answers, long-form generation samples are needed too); refusal samples (must include samples the model should refuse to answer, otherwise it answers everything); deduplication (similar instructions must be deduplicated to prevent overfitting). The data volume doesn't need to be huge: the LIMA paper showed that 1,000 high-quality SFT samples can produce decent results.

10. What's the difference between LoRA and QLoRA? How to choose?

Assessment point: Understanding the principles and selection of parameter-efficient fine-tuning.

Answer direction: LoRA freezes the original weight matrix W and adds a low-rank update ΔW=BA alongside it, updating only A and B during training. QLoRA adds three optimizations on top of LoRA: 4-bit NormalFloat quantization (the frozen base weights are stored in the NF4 data type, which fits normally distributed weights better than plain INT4 or FP4, while the LoRA adapters stay in 16-bit); double quantization (quantizing the quantization constants again to save memory); paged optimizers (paging optimizer states to CPU memory to avoid OOM during memory spikes). For choosing: use LoRA if you have enough memory, use QLoRA if memory is tight. The QLoRA paper reports quality close to 16-bit fine-tuning, though training is usually somewhat slower because the weights are dequantized on the fly. The rank r is typically 8-64, with larger r for more complex tasks.
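
A forward-pass sketch of a LoRA layer helps with the follow-up questions ("why is B initialized to zero?", "how many parameters are trained?"). The class name, the `alpha / rank` scaling, and the sizes follow common practice but are assumptions here, and no training step is shown.

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                              # frozen base weight
        self.a = rng.normal(scale=0.01, size=(rank, d_in))
        self.b = np.zeros((d_out, rank))                  # zero init: delta W starts at 0
        self.scaling = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + self.scaling * update

    def trainable_params(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(1)
w = rng.normal(size=(512, 512))
x = rng.normal(size=(2, 512))
layer = LoRALinear(w, rank=8)
print(layer.trainable_params(), w.size)   # 8192 vs 262144
```

Because B starts at zero, the adapted layer is initially identical to the base layer, and with r=8 on a 512x512 matrix only about 3% of the parameters are trained.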

11. What is the RLHF process? What are the challenges of PPO training?

Assessment point: Understanding the complete Reinforcement Learning from Human Feedback pipeline.

Answer direction: Starting from an SFT model, RLHF has three steps: train a reward model (train a scoring model using human preference data); optimize the policy model with PPO (maximize reward while constraining KL divergence from the reference model to prevent drifting too far); iterative optimization (collect new preference data and repeat). PPO's challenges: training instability (the policy and the value model are updated together on a noisy reward signal, so training diverges easily); a hard-to-tune KL constraint (too loose leads to reward hacking, too tight means the model learns nothing); high memory overhead (need to load 4 models simultaneously: policy, reference, reward, value). Getting stable training, as in InstructGPT, requires very careful hyperparameter tuning.
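
The "maximize reward while constraining KL" part is often implemented as a per-token penalty, with the reward model's score added on the final token. The sketch below shows that shaping only; the function name, `beta`, and the log-probabilities are illustrative assumptions, and the PPO update itself is omitted.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_score, beta=0.1):
    """Per-token reward: -beta * (log pi - log pi_ref), plus the RM score at the end."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += reward_score   # the reward model scores the whole response once
    return rewards

policy_logprobs = [-1.0, -0.5, -2.0]   # log pi(token_t | prefix), made-up values
ref_logprobs = [-1.2, -0.5, -1.5]      # same tokens under the frozen reference model
rewards = kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_score=1.0)
print(rewards)
```

A larger `beta` keeps the policy closer to the reference model; a smaller one leaves more room for reward hacking, which is exactly the tuning problem mentioned above.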

12. What are DPO's advantages over RLHF?

Assessment point: Understanding the principles and trade-offs of Direct Preference Optimization.

Answer direction: DPO's core idea is to skip the explicit reward model and optimize the policy directly on preference data. The derivation starts from the KL-constrained RLHF objective, whose optimal policy has a closed form; this lets the reward be rewritten in terms of policy and reference log-probabilities, and substituting it into the Bradley-Terry preference model yields a log-sigmoid loss over chosen/rejected pairs. DPO's advantages: no need to train a reward model, so the pipeline is simpler; more stable training, without PPO's 4-model joint training problem; lower computational cost, since only 2 models are needed (policy + reference). Disadvantages: it tends to generalize less well than RLHF, because DPO only optimizes the preferences present in the training data and does not sample new responses from the policy; it is more sensitive to data quality, because noisy preference data directly affects the policy. In practice, DPO and RLHF each have their use cases: DPO for simple alignment, RLHF for more complex alignment.
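
For a single preference pair, the log-sigmoid loss can be written out directly. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the reference model; the numbers and `beta` below are made up for illustration.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))])."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss goes down.
loss_improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(loss_at_init, loss_improved)
```

The reference log-probabilities act as the implicit KL anchor: the loss only rewards the policy for moving relative to the reference, not for the absolute likelihood of the chosen response.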

4. Inference (4 Questions)

13. What is the principle of KV Cache? How to optimize it?

Assessment point: Understanding the core technology for inference acceleration.

Answer direction: KV Cache caches the already-computed K and V matrices during autoregressive generation to avoid redundant computation. Each new token only needs to compute attention with the cached KV, without recomputing KV for all previous tokens. Optimization methods: GQA/MQA (multiple Q heads share KV heads, reducing KV Cache size); quantization (storing KV Cache in 8-bit or 4-bit); PagedAttention (vLLM's approach—managing KV Cache like OS memory management to avoid fragmentation); sliding window (only keeping KV for the most recent W tokens, like Mistral's sliding window attention). KV Cache is the biggest memory bottleneck in LLM inference, with significant optimization potential.
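
Interviewers often ask for a back-of-the-envelope KV Cache size. A sketch of the arithmetic, assuming a 7B-class configuration (32 layers, head dimension 128) and FP16 storage:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size=1, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes per value."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_value

GIB = 1024 ** 3
mha_fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096)
gqa_fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=4096)
gqa_int8 = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=4096, bytes_per_value=1)

print(mha_fp16 / GIB, gqa_fp16 / GIB, gqa_int8 / GIB)   # 2.0, 0.5, 0.25 GiB per sequence
```

The cost is per sequence, so it multiplies with batch size; this is why GQA, KV quantization, and PagedAttention matter so much for serving throughput.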

14. What are the types of quantization? Is the precision loss significant for INT8 and INT4?

Assessment point: Understanding the principles and effects of model quantization.

Answer direction: Quantization is divided into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ is further divided into weight-only quantization (e.g., W4A16: 4-bit weights, 16-bit activations) and weight+activation quantization (e.g., W8A8). INT8 weight quantization has minimal precision loss with almost no impact on performance; INT4 weight quantization is generally acceptable for models 7B and above, but smaller models (1-3B) show noticeable degradation. Key techniques: GPTQ (layer-wise quantization using Hessian information to compensate for quantization error); AWQ (activation-aware weight quantization protecting important channels); SmoothQuant (migrating activation quantization difficulty to weights, which enables W8A8). W4A16 is currently one of the most widely used inference quantization schemes.
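
To get a feel for INT8 versus INT4 error, here is plain round-to-nearest symmetric quantization with one scale per row. This is the naive baseline, not GPTQ or AWQ, and the random weight matrix is only a stand-in for real weights.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Per-row absmax quantization to signed integers with `bits` bits."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)

codes8, scale8 = quantize_symmetric(w, bits=8)
codes4, scale4 = quantize_symmetric(w, bits=4)
err8 = np.abs(w - dequantize(codes8, scale8)).mean()
err4 = np.abs(w - dequantize(codes4, scale4)).mean()
print(err8, err4)
```

INT4 has only 15 usable levels per row here, so its error is much larger; methods like GPTQ and AWQ exist to recover most of that gap.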

15. What is speculative decoding? How much speedup can it achieve?

Assessment point: Understanding the principles and speedup effects of speculative inference.

Answer direction: The core idea of speculative decoding is to use a small model (draft model) to quickly generate multiple candidate tokens, then have the large model verify these tokens in a single forward pass. If the first n of the k draft tokens are accepted by the large model, that is equivalent to generating n+1 tokens in one forward pass (n accepted + 1 generated by the large model itself). The speedup ratio depends on how well the small and large model distributions match: better matching means a higher acceptance rate and a higher speedup. In practice, a 2-3x speedup is commonly reported, and with the standard accept/resample rule the output distribution is identical to the large model's, so there is no precision loss. Medusa is a variant of speculative decoding that adds multiple prediction heads to the large model for parallel candidate token generation, eliminating the need for a separate small model.
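
The verification step is where the "no precision loss" guarantee comes from. The sketch below implements the standard rule (accept with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) on toy dictionaries; real systems work on logits from a single batched forward pass, and the distributions here are invented.

```python
import random

def sample(dist, rng):
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def verify_draft(draft_tokens, draft_dists, target_dists, rng):
    """draft_dists[i] / target_dists[i]: token -> prob at position i.
    target_dists has one extra entry for the bonus token after all drafts."""
    output = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i].get(token, 0.0)   # large model's probability
        q = draft_dists[i][token]             # draft model's probability
        if rng.random() < min(1.0, p / q):
            output.append(token)              # accepted, keep checking
            continue
        # Rejected: resample from max(0, p - q) and stop.
        residual = {t: max(0.0, pt - draft_dists[i].get(t, 0.0))
                    for t, pt in target_dists[i].items()}
        output.append(sample(residual, rng))
        return output
    # All k drafts accepted: the large model contributes one more token.
    output.append(sample(target_dists[len(draft_tokens)], rng))
    return output

rng = random.Random(0)
same = [{"the": 0.6, "a": 0.4}, {"cat": 0.7, "dog": 0.3}]
all_accepted = verify_draft(["the", "cat"], same, same + [{"sat": 1.0}], rng)
rejected = verify_draft(["cat"], [{"cat": 1.0}],
                        [{"cat": 0.0, "dog": 1.0}, {"sat": 1.0}], rng)
print(all_accepted, rejected)
```

When the draft and target distributions agree, every draft token is accepted and one forward pass yields k+1 tokens; when they disagree completely, the step still produces one correct token, so the worst case is ordinary decoding plus the draft overhead.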

16. What are the mainstream inference frameworks? How to choose?

Assessment point: Understanding the inference framework ecosystem and selection criteria.

Answer direction: Mainstream frameworks: vLLM (PagedAttention, high throughput, suitable for online serving); TGI (by HuggingFace, feature-rich, easy deployment); TensorRT-LLM (by NVIDIA, extreme GPU optimization, but steep learning curve); llama.cpp (CPU/Apple Silicon inference, suitable for local deployment); MLC-LLM (compilation optimization, cross-platform). Selection advice: choose vLLM or TGI for online serving, TensorRT-LLM for extreme performance, llama.cpp for local development. vLLM currently has the most active community and is the first choice for most teams.

5. Applications (4 Questions)

17. What is the RAG process? How to improve its effectiveness?

Assessment point: Understanding the practice of Retrieval-Augmented Generation.

Answer direction: RAG process: user question → Embedding → vector retrieval → context concatenation → LLM generation. Methods to improve effectiveness: retrieval optimization (hybrid retrieval: vector + keyword, reranking: using Cross-Encoder for precision ranking); chunking optimization (semantic chunking instead of fixed-length chunking, parent-child document strategy); query optimization (query rewriting, multi-query expansion, HyDE hypothetical document embeddings); generation optimization (context compression, citation tracking, hallucination detection). The biggest pitfall in RAG is retrieval quality—if relevant documents can't be retrieved, even the best generation is useless. I recommend spending 80% of your effort on optimizing retrieval first, then generation.
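
The retrieval-then-concatenate flow can be sketched end to end without any external service. Here a bag-of-words counter stands in for a real embedding model and the chunks are invented, so this only shows the shape of the pipeline, not realistic retrieval quality.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(query, contexts):
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Answer using only the context below.\n{context_block}\nQuestion: {query}"

chunks = [
    "kv cache stores keys and values from previous tokens",
    "lora trains low rank adapter matrices",
    "rope rotates query and key vectors by position",
]
query = "how does lora train adapter matrices"
top = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Numbering the chunks in the prompt is a cheap way to get citation tracking: the model can be asked to reference `[1]`, `[2]` in its answer.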

18. What is the core architecture of an Agent? How to design one?

Assessment point: Understanding AI Agent design patterns.

Answer direction: The core of an Agent is the perceive-decide-execute loop: LLM as the brain receives environmental information → thinks about the next action → calls tools to execute → observes results → continues thinking. Mainstream architectures: ReAct (alternating reasoning and action, simple but token-intensive); Plan-and-Execute (formulate a complete plan first, then execute step by step, suitable for complex tasks); LATS (Language Agent Tree Search, using Monte Carlo Tree Search for planning). Design considerations: tool definitions must be clear (name, description, parameter schema); error handling must be robust (retry/fallback for failed tool calls); context management must be reasonable (truncate/summarize long conversation histories).
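
A stripped-down version of the think, act, observe loop looks like this. The `scripted_llm` function is a stand-in for a real model call, and the step format (a dict with `thought`/`action`/`input` or `final`) is a hypothetical convention chosen for the example.

```python
def run_agent(llm, tools, question, max_steps=5):
    """ReAct-style loop: the LLM picks a tool, we run it and feed back the observation."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm(transcript)
        if "final" in step:
            return step["final"], transcript
        tool = tools.get(step["action"])
        try:
            observation = tool(step["input"]) if tool else f"unknown tool: {step['action']}"
        except Exception as exc:            # a failed tool call becomes an observation
            observation = f"tool error: {exc}"
        transcript += [
            f"Thought: {step['thought']}",
            f"Action: {step['action']}({step['input']})",
            f"Observation: {observation}",
        ]
    return None, transcript                  # step budget exhausted

def scripted_llm(transcript):
    """Hypothetical stand-in for a model: call the tool once, then answer."""
    observations = [line for line in transcript if line.startswith("Observation: ")]
    if not observations:
        return {"thought": "I should multiply the numbers.", "action": "multiply", "input": "17,23"}
    return {"final": observations[-1][len("Observation: "):]}

def multiply(arg):
    a, b = arg.split(",")
    return str(int(a) * int(b))

answer, transcript = run_agent(scripted_llm, {"multiply": multiply}, "What is 17 * 23?")
print(answer)
```

The two details interviewers tend to probe are both visible here: tool errors are returned to the model as observations instead of crashing the loop, and `max_steps` bounds the cost of a run that never converges.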

19. What are the techniques for Prompt Engineering?

Assessment point: Mastering core prompt engineering methods.

Answer direction: Core techniques: role setting (give the model a professional role for more professional output); few-shot learning (provide examples so the model quickly understands the task format); Chain of Thought (add "let's think step by step" to have the model show its reasoning process); structured output (request JSON/Markdown format for easier post-processing); self-consistency (multiple sampling with majority vote for improved reliability); step-by-step instructions (break complex tasks into steps with clear instructions per step). Advanced techniques include meta-prompting (having the model optimize its own prompts) and automatic prompt optimization (like OPRO using LLM to search for optimal prompts).
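
Self-consistency is the easiest of these techniques to show in code: sample several answers and take the majority. The `fake_sampler` below replays fixed answers in place of real temperature-sampled model outputs, so the numbers are purely illustrative.

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Sample n answers and return the most common one with its vote share."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

def make_fake_sampler(canned_answers):
    """Hypothetical stand-in for calling an LLM with temperature > 0."""
    it = iter(canned_answers)
    return lambda prompt: next(it)

sampler = make_fake_sampler(["42", "42", "41", "42", "40"])
answer, agreement = self_consistency(sampler, "Let's think step by step: ...", n_samples=5)
print(answer, agreement)
```

The vote share doubles as a rough confidence signal: a low agreement rate is a reasonable trigger for falling back to a stronger model or a human check.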

20. How are multimodal LLMs built? What are the challenges?

Assessment point: Understanding the technical roadmap for multimodal models.

Answer direction: Mainstream approach: encoder-alignment-LLM, using a visual encoder (like ViT) to extract visual features, projecting them into the language space through a projection layer, then feeding them into the LLM. Representative models: LLaVA (simple linear projection), Qwen-VL (cross-attention), GPT-4V (details undisclosed but speculated to be similar). Core challenges: modality alignment (large semantic space gap between vision and language, alignment quality determines multimodal understanding capability); high-resolution processing (high image resolution leads to token explosion, requiring dynamic resolution or patching strategies); training data (scarce high-quality image-text pairs, requiring careful construction); hallucination (multimodal models are more prone to visual hallucinations, requiring specialized alignment training).
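
The token explosion and the projection layer can both be sketched briefly. Patch size 14 and a 336-pixel input match a common ViT setup; the feature dimensions are shrunk to toy sizes and the random projection weights are placeholders for trained ones.

```python
import numpy as np

def num_visual_tokens(image_size, patch_size=14):
    """A ViT splits the image into a grid of patches, one token per patch."""
    return (image_size // patch_size) ** 2

def project_to_llm(visual_features, w, b):
    """Linear projection from the vision encoder's space into the LLM embedding space."""
    return visual_features @ w + b

tokens_336 = num_visual_tokens(336)     # 24 x 24 grid
tokens_672 = num_visual_tokens(672)     # doubling resolution quadruples tokens
tokens_1344 = num_visual_tokens(1344)

rng = np.random.default_rng(0)
d_vision, d_llm = 32, 64                # toy dimensions (assumption)
features = rng.normal(size=(tokens_336, d_vision))
w, b = rng.normal(size=(d_vision, d_llm)), np.zeros(d_llm)
visual_embeddings = project_to_llm(features, w, b)
print(tokens_336, tokens_672, tokens_1344, visual_embeddings.shape)
```

The projected rows are simply concatenated with the text token embeddings before entering the LLM, which is why a single high-resolution image can consume thousands of context tokens.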

Key Takeaways

My biggest takeaway from preparing for LLM interviews is that fundamentals must be rock-solid. Every detail of the Transformer must be thoroughly understood—you can't just memorize conclusions. Interviewers love to ask "why"—if you say RoPE is good, they'll definitely ask "why can RoPE extrapolate?" If you say DPO is simple, they'll ask "what's the mathematical derivation of DPO?" So for every knowledge point, you need to understand not just the what, but the why.

My second piece of advice is to stay current with the latest developments. The LLM field evolves incredibly fast—knowledge from six months ago may already be outdated. Before interviews, make sure to read papers from the last 3 months, especially the latest work from major tech companies. During one of my interviews, I was called out for not knowing about GQA—very embarrassing.

My third piece of advice is to get hands-on practice. Reading alone isn't enough—at minimum, you should run SFT and LoRA fine-tuning with HuggingFace, deploy inference services with vLLM, and build a RAG system with LangChain. Interviewers value practical experience highly—being able to describe pitfalls you've encountered is worth a hundred times more than reciting theory.

FAQ

Q: Do I need to practice LeetCode for LLM interviews?

A: It depends on the role. For research-oriented positions, algorithm problems are generally not tested—focus is on theory and derivations. For engineering-oriented positions (inference optimization, training frameworks, etc.), medium-difficulty algorithm problems may be asked. I recommend at least 50 medium problems as a safety net.

Q: What if I don't have LLM training experience?

A: You can run fine-tuning experiments with open-source models, use HuggingFace's TRL library for SFT/DPO, and deploy inference with vLLM. These can all go on your resume. The key is being able to discuss specific technical details and pitfalls you've encountered.

Q: What should I do when asked a question I don't know in an interview?

A: Be honest about not being familiar with it, but share your reasoning direction. For example, if asked about a specific paper, you could say "I haven't read this paper, but based on the problem, I'd guess the approach might be..." Interviewers value thinking ability more than memorization.

Q: Which papers should I read?

A: Must-read: Attention Is All You Need, GPT series, LLaMA series, InstructGPT (RLHF), DPO. Optional: Flash Attention, vLLM, RAG-related papers. At minimum, carefully read the methods sections of the must-read papers.

Q: What's the difference between LLM interviews and traditional ML interviews?

A: Traditional ML interviews focus more on mathematical derivations and statistical foundations, while LLM interviews focus more on systems engineering and cutting-edge technology. But the fundamentals overlap—I recommend building a solid traditional ML foundation first, then studying LLM-specific knowledge.

#LLM #Transformer #RLHF #DPO #LoRA #KV Cache #RAG #Agent #Interview Trivia #Large Language Models