DeepMind NLP Research Interview: Pre-Training, Information Extraction, and Text Generation
A detailed review of DeepMind's three-round NLP Research interview from a candidate with 2 years of NLP experience, covering Transformer, pre-trained models, information extraction, knowledge graph construction, and text generation
Background
Let me introduce myself: math undergrad, switched to CS for my master's with a focus on NLP, then spent 2 years as an NLP algorithm engineer at an AI startup, mainly working on information extraction and text generation. DeepMind has always been my dream offer — their NLP research team is absolutely world-class, so when I saw the opening, I applied without hesitation.
I applied for the NLP Research position at DeepMind, based in London. The whole interview process took about two and a half weeks — three technical rounds with a tight schedule. Honestly, DeepMind's interviews are genuinely challenging — they ask deep and detailed questions, not the kind you can pass by memorizing standard answers. Let me walk through the details.
Interview Process Review
Round 1: NLP Fundamentals + Pre-Training Models
My first interviewer was a composed engineer, likely a core team member. We started with a self-introduction, then moved on to NLP fundamentals.
First question: Can you explain the self-attention mechanism in detail? A classic. I covered the Q/K/V computation, scaled dot-product attention, multi-head attention (concatenating the heads), and positional encoding. The interviewer followed up on why we divide by sqrt(d_k). I explained that the magnitude of the dot products grows with d_k, and large values push softmax into saturated regions where the gradients all but vanish, so the scaling keeps training stable.
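As a rough illustration of the attention computation discussed above, here is a minimal single-head sketch in plain Python. It is a teaching example under simplifying assumptions (lists of row vectors, no masking, no learned projections), not how a production framework implements it; the function names are made up for this sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V on lists of row vectors."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # the scaling factor the interviewer asked about
    outputs, weights = [], []
    for q in Q:
        # One score per key: dot(q, k) divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Two identical keys should receive equal attention weight.
out, w = scaled_dot_product_attention(
    Q=[[1.0, 0.0]],
    K=[[1.0, 0.0], [1.0, 0.0]],
    V=[[1.0, 2.0], [3.0, 4.0]],
)
```

Multi-head attention runs several such heads on different learned projections of the input and concatenates their outputs.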
Then we focused on pre-trained models: What are the differences between BERT's and GPT's pre-training objectives? What are their respective pros and cons? I explained that BERT is trained with MLM (Masked Language Modeling) plus NSP (Next Sentence Prediction) and is a bidirectional encoder suited to understanding tasks, while GPT is an autoregressive language model, a unidirectional decoder suited to generation tasks. The interviewer asked about RoBERTa's improvements over BERT, and I mentioned removing NSP, dynamic masking, larger batches, and more data.
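To make the two objectives concrete, here is a small sketch of how training examples could be built for each. The 15% selection rate and the 80/10/10 split follow the original BERT recipe; the function names, the toy sentence, and the use of the sentence itself as the vocabulary are assumptions made only for illustration.

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, vocab, rng, mask_prob=0.15):
    """BERT-style masking: select about mask_prob of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise keep the original token in the input
    return inputs, labels

def causal_lm_pairs(tokens):
    """GPT-style objective: each position predicts the next token
    from the left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the model predicts masked tokens from both directions".split()
rng = random.Random(0)
# Calling mlm_mask again on the same sentence draws a fresh mask, which is
# the idea behind dynamic masking (as opposed to masking once in preprocessing).
masked_inputs, mlm_labels = mlm_mask(tokens, vocab=tokens, rng=rng)
next_token_examples = causal_lm_pairs(tokens)
```

The MLM model sees context on both sides of a masked position but only gets a training signal at the selected positions, whereas the causal model gets a signal at every position but never sees the right-hand context.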
A deeper question: How do you understand emergent abilities in large language models? I said emergent abilities refer to capabilities that suddenly appear when model parameters reach a certain scale, like chain-of-thought reasoning and few-shot learning. But I also mentioned recent research questioning whether emergence is just an artifact of evaluation metrics — nonlinear metrics show emergence while linear metrics show smooth improvement. The interviewer found this discussion interesting.
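The metric argument can be shown with a bit of arithmetic. The sketch below assumes, purely for illustration, that token errors are independent, so the exact-match rate on an answer of L tokens is the per-token accuracy raised to the power L. Real models do not have independent errors; the point is only the shape of the curve.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves smoothly; exact match on a 10-token answer
# stays near zero for a long time and then rises steeply.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

Under this assumption, going from 0.5 to 0.9 per-token accuracy moves exact match from about 0.001 to about 0.35, which can look like a sudden jump even though the underlying quantity improved steadily.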
A practical question: If you needed to fine-tune a 7B model for domain-specific text classification, how would you approach it? I mentioned parameter-efficient fine-tuning methods like LoRA and QLoRA, along with data preparation (domain data collection, cleaning, annotation), training strategies (learning rate scheduling, gradient accumulation), and evaluation methods. The interviewer asked how to choose LoRA's rank — I said typically start from 8 and adjust based on validation performance.
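To show why the rank matters, here is a minimal sketch of the LoRA idea in plain Python: the frozen weight W is left untouched and a low-rank product B·A, scaled by alpha / rank, is added on top. The 4096 x 4096 shape and the helper names are assumptions chosen for illustration, not details from the interview; in practice a library such as PEFT would handle this.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def matvec(M, x):
    """Matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Tiny example: rank-1 adapter on a 2x2 weight. B starts at zero, so the
# adapted layer initially behaves exactly like the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]      # rank x d_in
B = [[0.0], [0.0]]     # d_out x rank, zero-initialized
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=16, rank=1)

full = 4096 * 4096
adapter = lora_param_count(4096, 4096, rank=8)
```

For a hypothetical 4096 x 4096 projection, rank 8 gives 65,536 trainable parameters against 16,777,216 frozen ones, roughly 0.4%. Doubling the rank doubles the adapter size, which is why it is usually tuned against validation performance.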
Round 1 lasted about 55 minutes. The interviewer said my fundamentals were solid and told me to prepare for Round 2.
Round 2: Information Extraction + NER/RE
Round 2's interviewer was clearly more senior, with questions leaning toward practical application and system design.
We started with information extraction: What are the mainstream approaches for Named Entity Recognition? I covered sequence labeling methods (BiLSTM-CRF, BERT-CRF), span-based methods (SpanBERT), and generation-based methods (using seq2seq models to generate entities). The interviewer asked about practical differences between BERT-CRF and BERT-Softmax. I said a CRF layer learns transition scores between labels, so it avoids invalid tag sequences (for example, an I- tag directly after O) and tends to handle multi-token entities better, but training is slower.
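The difference between per-token softmax and CRF decoding can be sketched as follows. This is a toy example with made-up emission scores and a single hard BIO constraint; a trained CRF learns soft transition scores rather than using a hand-written rule, so treat this as an illustration of the decoding step only.

```python
LABELS = ["O", "B-PER", "I-PER"]
NEG_INF = float("-inf")

def transition_score(prev, curr):
    """Hard BIO constraint: I-PER may not directly follow O."""
    if curr == "I-PER" and prev == "O":
        return NEG_INF
    return 0.0

def argmax_decode(emissions):
    """Softmax-style decoding: pick the best label at each position independently."""
    return [max(LABELS, key=lambda l: e[l]) for e in emissions]

def viterbi_decode(emissions):
    """CRF-style decoding: pick the best whole sequence under the transition scores."""
    # A sequence may not start with I-PER.
    best = [{l: (NEG_INF if l == "I-PER" else emissions[0][l]) for l in LABELS}]
    backpointers = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for curr in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition_score(p, curr))
            scores[curr] = best[-1][prev] + transition_score(prev, curr) + emission[curr]
            pointers[curr] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    path = [max(LABELS, key=lambda l: best[-1][l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical emission scores for a two-token span.
emissions = [
    {"O": 2.0, "B-PER": 1.5, "I-PER": 0.1},
    {"O": 0.5, "B-PER": 0.4, "I-PER": 2.0},
]
independent = argmax_decode(emissions)
constrained = viterbi_decode(emissions)
```

Here the independent decoder outputs the invalid sequence O, I-PER, while Viterbi decoding returns B-PER, I-PER because it scores the sequence as a whole.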
Then relation extraction: What methods exist for relation extraction? How do you handle noisy labels in distant supervision? I discussed pipeline methods (NER then RE) and joint methods (joint entity-relation extraction). For noisy label handling in distant supervision, I mentioned multi-instance learning, attention mechanisms for selecting effective sentences, and rule-based post-processing. The interviewer asked me to sketch a joint extraction model architecture — I drew the CasRel architecture and explained the cascaded decoding process.
A system design question: Design an enterprise knowledge graph construction system, from unstructured text to knowledge graph, end-to-end. This was a big question. I covered data ingestion, entity recognition, relation extraction, event extraction, knowledge fusion (entity alignment, disambiguation), knowledge storage (graph databases), and knowledge applications (QA, recommendations). The interviewer asked about difficulties in knowledge fusion — I said entity disambiguation is the trickiest, as same-name-different-entity cases are common and require contextual and external knowledge base information.
A newer direction: What are the pros and cons of using LLMs for information extraction versus traditional methods? I said LLMs' advantage is strong zero-shot/few-shot capability without needing labeled data; disadvantages are slow inference, high cost, and poor controllability. In practice, you can use LLMs for cold-starting and then distill to smaller models for online serving.
Round 2 lasted about 65 minutes — a deep conversation.
Round 3: Text Generation + Project Deep Dive
Round 3 was with a technical leader in the department — definitely more pressure. This round focused on text generation and project experience.
Started with text generation: What decoding strategies exist for text generation? I covered greedy search, beam search, top-k sampling, top-p (nucleus) sampling, and the role of temperature. The interviewer followed up on beam search's diversity issues — I mentioned diverse beam search and contrastive search.
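The sampling strategies mentioned above can be sketched over a single next-token distribution. This is a minimal plain-Python illustration with hypothetical helper names; real decoding loops apply these filters to model logits at every step, and greedy search is simply the argmax of the same distribution.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p (nucleus sampling), then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token id from a filtered {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
greedy_token = max(range(len(probs)), key=lambda i: probs[i])
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.9)
next_token = sample(nucleus, rng=random.Random(0))
```

Top-k always keeps a fixed number of candidates, while top-p adapts the candidate set to how peaked the distribution is, which is one reason nucleus sampling is often preferred for open-ended generation.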
Then text generation evaluation: What are the limitations of metrics like BLEU and ROUGE? I said they're based on n-gram overlap and can't measure semantic equivalence, nor do they adequately evaluate generation diversity. I mentioned BERTScore, BLEURT and other pre-training-based evaluation metrics, plus the latest trend of using LLMs for evaluation.
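A small example makes the n-gram limitation visible. The sketch below computes clipped n-gram precision, the core quantity inside BLEU, against a single reference and without the brevity penalty; the sentences are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate n-gram counts are clipped
    to the counts observed in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    if not cand:
        return 0.0
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / sum(cand.values())

reference = "the cat sat on the mat".split()
paraphrase = "a feline rested upon the rug".split()   # same meaning, different words
scrambled = "mat the on sat cat the".split()          # same words, no meaning

paraphrase_p1 = clipped_precision(paraphrase, reference, 1)
scrambled_p1 = clipped_precision(scrambled, reference, 1)
scrambled_p2 = clipped_precision(scrambled, reference, 2)
```

The faithful paraphrase shares only one unigram with the reference, while the scrambled sentence gets perfect unigram precision and is only penalized once higher-order n-grams are counted. Neither number says anything about meaning, which is the gap that embedding-based and LLM-based evaluators try to close.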
Deep project dive: How do you ensure generation quality in your text generation project? I covered several aspects: data quality (cleaning, deduplication, diversity), model training (RLHF/DPO alignment), post-processing (rule filtering, re-ranking), and human evaluation. The interviewer asked how to train RLHF's reward model — I said using human preference data to train a scoring model, then using PPO to optimize the generation policy.
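One common way to train a reward model on preference data is a pairwise loss that pushes the score of the preferred response above the score of the rejected one. The post does not say which loss was discussed, so the sketch below is an assumption: it shows the standard -log(sigmoid(r_chosen - r_rejected)) form on scalar rewards, with a hypothetical function name.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the preferred answer already scores higher,
# and large when the ranking is reversed.
agrees = pairwise_preference_loss(2.0, 0.0)
tied = pairwise_preference_loss(0.0, 0.0)
disagrees = pairwise_preference_loss(0.0, 2.0)
```

In a real setup the two rewards would come from the same model scoring two responses to one prompt, and the averaged loss would be minimized by gradient descent before the policy is optimized with PPO against the learned reward.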
An open-ended question: What do you think is the most important future direction for NLP? I mentioned multimodal understanding and generation, long-text understanding, and NLP for Science (using NLP to accelerate scientific discovery). The interviewer was very interested in NLP for Science, and we discussed the possibility of using NLP for literature mining and knowledge discovery.
Round 3 lasted about 50 minutes. The interviewer said "looking forward to having you" at the end and told me to wait for HR.
Key Questions Summary
1. Details of Transformer's self-attention mechanism? Why the scaling factor?
2. Differences between BERT and GPT's pre-training objectives? Pros and cons?
3. What improvements does RoBERTa make over BERT?
4. How do you understand emergent abilities in LLMs?
5. How to fine-tune a 7B model for domain-specific text classification?
6. Mainstream approaches for Named Entity Recognition?
7. Methods for relation extraction? How to handle distant supervision noise?
8. Design an enterprise knowledge graph construction system.
9. LLMs vs traditional methods for information extraction?
10. Decoding strategies for text generation?
11. Limitations of BLEU/ROUGE metrics?
12. How to ensure text generation quality?
13. How to train RLHF's reward model?
14. Most important future direction for NLP?
Insights and Advice
1. Transformer is mandatory: NLP interviews will 100% ask about Transformer — from attention mechanisms to positional encoding, every detail must be clear.
2. Go deep on pre-training models: Don't just know the differences between BERT and GPT — understand their variants (RoBERTa, ALBERT, DeBERTa, etc.) and improvements.
3. System design needs a holistic view: DeepMind especially values system design ability. For questions like knowledge graph construction, answer from an end-to-end perspective, not just one component.
4. Follow LLM frontiers: Emergent abilities, RLHF, LoRA — these hot topics are must-knows. Interviewers value your awareness of cutting-edge developments.
5. Projects need depth: In Round 3's project deep dive, don't just say what you did — explain why, what trade-offs you made, and how you quantified results.
FAQ
Q: What does DeepMind's NLP team value?
A: Solid NLP fundamentals + system design ability + frontier awareness. All three are indispensable.
Q: Can you get in without top conference papers?
A: Yes, but your project experience needs to be strong enough. I didn't have top conference papers either, but my projects were deep.
Q: Will there be coding in the interview?
A: Yes — Round 1 has algorithm questions, Round 2 has model architecture design, Round 3 is more discussion-based.
Q: How to choose an NLP direction?
A: Depends on personal interest and industry demand. Information extraction and text generation are two high-demand directions worth focusing on.
Q: What's the work intensity at DeepMind?
A: Better than most places, but still not easy. The advantage is being able to focus on research with relatively less engineering pressure.