Spotify ML Engineer Interview: Recall, Ranking, and Re-Ranking Full-Chain Assessment
A complete review of the three technical interview rounds for Spotify's ML Engineer role, written by a candidate with 2 years of recommendation algorithm experience. It covers ML fundamentals, feature engineering, recommendation system architecture, multi-objective optimization, and A/B testing, with real questions and preparation tips.
Background
I've been doing recommendation algorithms for 2 years, previously working at a content platform on recommendation system development, primarily responsible for iterating on recall and ranking modules. I'm most familiar with classic models like Two-Tower, DeepFM, and DIN, and I've also done feature engineering and A/B testing work. Spotify's ML Engineer position for recommendations has always been my target — after all, Spotify's recommendations have an excellent reputation in the industry, and the user experience of "Discover Weekly" and "Daily Mix" is fantastic.
I was referred by a recruiter for the ML Engineer position. About a week later, I received an interview invitation. The entire process consisted of three technical rounds, spanning about two and a half weeks.
Interview Process Review
Round 1: Machine Learning Fundamentals + Feature Engineering (~65 minutes)
The first interviewer was a core algorithm engineer on the recommendation team. After a self-introduction, we moved into machine learning fundamentals.
Machine Learning Fundamentals: The interviewer asked about the differences between LR and SVM. I covered model form, loss function, and kernel tricks. Then came questions about overfitting solutions — I mentioned regularization, Dropout, data augmentation, and early stopping. The interviewer followed up on the differences between L1 and L2 regularization. I answered from two perspectives: Bayesian priors (L1 corresponds to Laplace prior, L2 to Gaussian prior) and optimization properties (L1 produces sparse solutions). Then came a deeper question: why is L2 regularization more commonly used than L1 in deep learning? I explained from the perspectives of gradient properties and weight decay.
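The sparsity point about L1 versus L2 can be illustrated with a minimal sketch. It assumes a single scalar weight and compares the proximal update for each penalty; the function names `l1_prox` and `l2_shrink` are illustrative and were not part of the interview.

```python
def l1_prox(w: float, lam: float) -> float:
    """Proximal step for the penalty lam * |w| (soft-thresholding).

    Weights whose magnitude is below lam are set exactly to zero,
    which is why L1 tends to produce sparse solutions.
    """
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w: float, lam: float) -> float:
    """Proximal step for the penalty (lam / 2) * w**2.

    The weight is scaled toward zero but never reaches it exactly,
    which matches the usual "weight decay" intuition.
    """
    return w / (1.0 + lam)


if __name__ == "__main__":
    for w in (0.05, -0.08, 1.0):
        print(w, l1_prox(w, 0.1), l2_shrink(w, 0.1))
```

Small weights are zeroed by the L1 step and only scaled down by the L2 step, which is one way to picture why L2 pairs naturally with weight decay in gradient-based training.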
Feature Engineering Section: The interviewer asked about my understanding of feature engineering and common feature types in recommendation systems. I categorized them into user features, item features, context features, and cross features. The interviewer was particularly interested in cross features, asking me to discuss feature crossing methods — from manual crossing and FM/FFM to automatic feature crossing with DeepFM and DCN. I provided a detailed comparison of each method's pros and cons. Then came a practical question: how to handle categorical features? I discussed One-Hot, Embedding, Target Encoding, and Hash Encoding, along with their applicable scenarios.
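To make the FM part of the feature-crossing discussion concrete, here is a minimal sketch of the second-order FM interaction term. It assumes dense Python lists for the feature vector `x` and the latent factor matrix `v` (both hypothetical inputs), and compares the naive pairwise sum with the well-known linear-time reformulation.

```python
def fm_pairwise_naive(x, v):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, computed pair by pair."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, v):
    """Same quantity via the identity
    0.5 * sum_f [ (sum_i v_if * x_i)^2 - sum_i (v_if * x_i)^2 ],
    which costs O(n * k) instead of O(n^2 * k).
    """
    k = len(v[0])
    total = 0.0
    for f in range(k):
        terms = [v[i][f] * x[i] for i in range(len(x))]
        square_of_sum = sum(terms) ** 2
        sum_of_squares = sum(t * t for t in terms)
        total += square_of_sum - sum_of_squares
    return 0.5 * total


if __name__ == "__main__":
    x = [1.0, 0.0, 2.0]
    v = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(fm_pairwise_naive(x, v), fm_pairwise_fast(x, v))
```

Both functions return the same value; the fast form is the one usually derived on a whiteboard when FM comes up.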
Evaluation Metrics Section: The interviewer asked about common evaluation metrics for recommendation systems. I covered offline metrics (AUC, NDCG, Hit Rate) and online metrics (CTR, CVR, dwell time). The interviewer followed up on the physical meaning of AUC and the relationship between AUC improvement and online CTR improvement. I explained that AUC measures the probability that a randomly chosen positive sample is ranked ahead of a randomly chosen negative sample, but that an AUC improvement doesn't necessarily translate proportionally into an online CTR improvement because of factors like position bias.
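The probabilistic reading of AUC can be checked directly with a small sketch. It assumes binary labels (1 for positive, 0 for negative) and counts ties as half a win; the function name `auc_pairwise` is illustrative, and the quadratic loop is only meant for small examples.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which
    the positive sample receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc_pairwise([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

In this example three of the four positive-negative pairs are ordered correctly, so the result is 0.75. Because the metric only looks at pairwise ordering over the whole sample, it ignores where an item actually appears on the page, which is one reason offline AUC gains and online CTR gains can diverge.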
At the end of Round 1, the interviewer said "good fundamentals, with practical feature engineering experience," which felt encouraging.
Round 2: Recommendation System Architecture + Multi-Objective Optimization (~80 minutes)
Round 2 was the most critical round of the entire interview, with a recommendation system architect as the interviewer.
Recommendation System Architecture Section: The interviewer asked me to draw a complete recommendation system architecture diagram, covering four stages: recall, pre-ranking, ranking, and re-ranking. The interviewer was particularly focused on the recall stage, asking me to detail multi-channel recall strategies, including collaborative filtering recall, vector recall (Two-Tower model), popular item recall, and tag-based recall. They then probed into vector recall implementation details, from ANN retrieval (HNSW, IVF-PQ) to Two-Tower model training methods (In-batch Negative, Hard Negative), and I walked through each. The interviewer then asked a key question: how to handle the misalignment between recall and ranking objectives? I discussed how recall focuses on coverage while ranking focuses on accuracy, and how the two can be aligned through the choice of training samples and negative sampling strategy.
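The In-batch Negative idea can be sketched in a few lines. This is a simplified illustration, assuming the two towers have already produced user and item embeddings as plain Python lists and that row i of each list is a positive pair; the function name and the `temperature` parameter are assumptions, and production systems typically add corrections such as sampling-bias adjustment that are omitted here.

```python
import math


def in_batch_softmax_loss(user_vecs, item_vecs, temperature=1.0):
    """Average softmax cross-entropy where, for user i, item i is the
    positive and every other item in the batch acts as a negative."""
    if len(user_vecs) != len(item_vecs):
        raise ValueError("expected one positive item per user")
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [
            sum(a * b for a, b in zip(u, v)) / temperature for v in item_vecs
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(user_vecs)


if __name__ == "__main__":
    users = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(in_batch_softmax_loss(users, aligned))
    print(in_batch_softmax_loss(users, swapped))
```

The loss is lower when each user embedding points toward its own positive item than when the items are swapped. Hard negatives would enter the same formula as extra item rows that have no matching user.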
Multi-Objective Optimization Section: The interviewer asked a very practical question: recommendation systems typically need to optimize multiple objectives simultaneously — clicks, likes, saves, shares — how to do multi-objective optimization? I discussed three architectures: Shared-Bottom, MMoE, and PLE, with a detailed comparison of their pros and cons. The interviewer was interested in MMoE's Expert selection mechanism, asking about the role and training of gating networks. Then came an advanced question: how to handle the significant differences in Loss magnitude across different objectives in multi-objective optimization? I discussed GradNorm, Uncertainty Weighting, and Dynamic Weight Average as Loss weighting methods.
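Of the loss weighting methods mentioned, Uncertainty Weighting is the easiest to sketch. The version below uses a commonly cited simplified form in which each task has a learnable log-variance s_i and contributes 0.5 * exp(-s_i) * L_i + 0.5 * s_i. The function names are illustrative, and in real training the log-variances would be parameters updated by the optimizer rather than set by hand.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using per-task log-variances s_i:
    total = sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i."""
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task loss")
    return sum(
        0.5 * math.exp(-s) * loss + 0.5 * s
        for loss, s in zip(task_losses, log_vars)
    )


def optimal_log_var(task_loss):
    """For a fixed positive loss L, the term above is minimized at s = log(L)."""
    return math.log(task_loss)


if __name__ == "__main__":
    losses = [2.0, 200.0]  # e.g. a click loss and a much larger-scale loss
    print(uncertainty_weighted_loss(losses, [0.0, 0.0]))
    tuned = [optimal_log_var(l) for l in losses]
    print([0.5 * math.exp(-s) * l for l, s in zip(losses, tuned)])
```

With the log-variances at their optimum, each task's weighted loss term equals 0.5 regardless of its raw scale, which shows how the method equalizes objectives of very different magnitude.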
Cold Start Problem: The interviewer asked how to solve cold start for new users and new items. For new users, I discussed demographic-based recommendations, popular recommendations, and interest exploration strategies. For new items, I discussed content-based recommendations, EE strategies (like LinUCB), and meta-learning. The interviewer followed up on balancing Exploration and Exploitation in EE strategies — I explained the principles of UCB and Thompson Sampling.
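The two selection rules discussed for Exploration versus Exploitation can be sketched as follows. This uses the context-free UCB1 rule rather than LinUCB (which also conditions on features), and Thompson Sampling with a Beta(1, 1) prior over click-style binary rewards. All function and parameter names are illustrative.

```python
import math
import random


def ucb1_select(counts, reward_sums, t):
    """Pick the arm with the highest mean reward plus exploration bonus.
    Arms that have never been tried are selected first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        reward_sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def thompson_select(successes, failures, rng):
    """Sample a plausible click rate per arm from its Beta posterior
    and pick the arm with the highest sample."""
    samples = [
        rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)


if __name__ == "__main__":
    # Arm 1 has been shown only once, so its bonus dominates.
    print(ucb1_select([100, 1], [90.0, 0.0], t=101))
    print(thompson_select([1000, 0], [0, 1000], random.Random(0)))
```

UCB explores deterministically through the bonus term, which shrinks as an item accumulates impressions. Thompson Sampling explores through posterior randomness, so new items with few observations still get occasional exposure.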
Round 2 ran 80 minutes and covered a lot of ground, but the interviewer guided the discussion well, so I was never left completely stuck.
Round 3: Project Deep Dive + A/B Testing (~70 minutes)
The Round 3 interviewer was likely the technical lead of the recommendation department, with questions leaning more toward practical experience and depth of thinking.
Project Deep Dive: The interviewer asked me to discuss my most impactful recommendation project. I chose a recall model upgrade project that moved from the original ItemCF to Two-Tower vector recall. I covered the complete process of model design, training strategy, online deployment, and impact evaluation. The interviewer was interested in online deployment details, asking about vector index construction and update strategies, A/B test traffic splitting schemes, and gradual (canary) rollout of model updates. They then asked about my biggest challenge. I described tuning the negative sampling strategy, moving from random negative sampling to Hard Negative Mining, which ultimately improved offline Recall@50 by 15%.
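For reference, the Recall@K metric mentioned above can be computed per user as in this minimal sketch. It assumes a ranked list of retrieved item IDs and a set of held-out relevant items; the function name and the toy IDs are illustrative, and an offline evaluation would average this value over users.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


if __name__ == "__main__":
    retrieved = ["a", "b", "c", "d"]
    relevant = {"b", "d", "z"}
    print(recall_at_k(retrieved, relevant, 2))
    print(recall_at_k(retrieved, relevant, 4))
```

Here one of three relevant items is found in the top 2 and two of three in the top 4, so the metric rises as K grows.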
A/B Testing Section: The interviewer asked about my understanding of A/B testing, including traffic splitting strategies, metric selection, and significance testing. I discussed the advantages of user-level splitting, the differences between mutually exclusive and orthogonal experiments, and how to use t-tests to judge whether an experiment result is statistically significant. The interviewer followed up with a key question: what if A/B test metrics contradict each other? For example, CTR improved but dwell time decreased. I discussed defining core metrics and guardrail metrics, and launching only when core metrics improve and guardrail metrics don't decline. The interviewer also asked about a practical scenario: what if the experiment group doesn't have enough sample size? I discussed allocating more traffic to the experiment, extending the experiment duration, and reducing variance with CUPED.
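The CUPED adjustment can be sketched in a few lines. This assumes `y` is the in-experiment metric per user and `x` is the same metric for the same users measured before the experiment started; both names are illustrative. In practice the coefficient theta is usually estimated on data pooled across control and treatment, which this single-group sketch does not show.

```python
from statistics import mean


def cuped_adjust(y, x):
    """Return y_i - theta * (x_i - mean(x)) with theta = cov(x, y) / var(x).

    The adjusted metric keeps the same mean as y but has lower variance
    whenever the pre-experiment covariate x is correlated with y.
    """
    if len(y) != len(x) or len(y) < 2:
        raise ValueError("need paired observations for at least two users")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    var_x = sum((a - mx) ** 2 for a in x) / (len(x) - 1)
    if var_x == 0:
        raise ValueError("covariate has zero variance")
    theta = cov / var_x
    return [b - theta * (a - mx) for a, b in zip(x, y)]


if __name__ == "__main__":
    pre = [1.0, 2.0, 3.0, 4.0]
    during = [1.0, 2.1, 2.9, 4.2]
    print(cuped_adjust(during, pre))
```

Because the variance of the adjusted metric is smaller, the same t-test reaches significance with fewer users, which is why CUPED helps when the experiment group is short on samples.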
Comprehensive Assessment: The interviewer asked about my views on recommendation system development trends — I discussed LLM + recommendations, on-device recommendations, and privacy-preserving computation. Then came career planning questions and why I chose Spotify. Finally, an open-ended question: if you were to build a music recommendation system from scratch, how would you approach it? I answered from four dimensions: data collection, feature construction, model selection, and A/B testing infrastructure.
Real Questions Summary
1. Differences between LR and SVM?
2. Differences between L1 and L2 regularization? Why is L2 more commonly used in deep learning?
3. What are common feature types in recommendation systems?
4. What are feature crossing methods? From manual to automatic crossing?
5. How to handle categorical features?
6. Physical meaning of AUC? Relationship between AUC and online CTR improvement?
7. Can you draw a complete recommendation system architecture diagram?
8. Multi-channel recall strategies? Vector recall implementation details?
9. How to handle misalignment between recall and ranking objectives?
10. Multi-objective optimization architectures? Comparison of Shared-Bottom, MMoE, and PLE?
11. How to handle Loss magnitude differences in multi-objective optimization?
12. How to solve cold start for new users and new items?
13. Balancing Exploration and Exploitation in EE strategies?
14. A/B testing traffic splitting strategies and significance testing?
15. What to do when A/B test metrics contradict each other?
16. How would you build a music recommendation system from scratch?
Tips and Advice
1. Machine learning fundamentals must be solid: Recommendation algorithm interviews aren't about just knowing how to call APIs. Interviewers will probe the underlying principles, so you need to be able to explain LR, SVM, tree models, and deep learning fundamentals clearly. I recommend reading "Statistical Learning Methods" and "Deep Learning."
2. Understand the full recommendation system pipeline: Don't just understand ranking models — recall, pre-ranking, re-ranking, and A/B testing all need to be understood. I recommend reading "Recommender Systems: The Textbook" and tech team blog posts about recommendation systems.
3. Multi-objective optimization is a high-frequency topic: Recommendation systems at large companies are almost always multi-objective. You should be able to clearly explain the principles and applicable scenarios of MMoE, PLE, and similar architectures. I recommend reading the original papers to understand their experimental design and ablation studies.
4. A/B testing requires hands-on experience: Recommendation system iteration is inseparable from A/B testing. Traffic splitting strategies, metric systems, and significance testing are fundamental skills. I recommend accumulating A/B testing experience in real projects.
5. Stay current with frontier technology: LLM + recommendations, contrastive learning in recommendations, and privacy-preserving computation are frequently asked new directions. I recommend following the latest papers from top conferences like KDD, RecSys, and WWW.
FAQ
Q: Are programming requirements high for Spotify ML Engineer interviews?
A: Moderately. Round 1 may include Python programming questions, like implementing a simple collaborative filtering algorithm. They won't ask LeetCode-style algorithm problems, though. The coding is more practical.
Q: Can I pass without recommendation system experience?
A: Quite difficult. Spotify's recommendation position explicitly requires recommendation system development experience. If you only have general ML experience, I recommend building a recommendation system project first, like a complete recommendation pipeline using the MovieLens dataset.
Q: Will the interview ask for mathematical derivations?
A: Yes. I was asked to derive the FM formula and Softmax gradient. I recommend preparing mathematical derivations for common models.
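As an example of the level of derivation to prepare, here is a sketch of the standard softmax gradient result, assuming logits z, a one-hot label y, and a cross-entropy loss. The notation is chosen for illustration and is not the interviewer's exact wording.

```latex
% Softmax probabilities and cross-entropy loss
p_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad
L = -\sum_i y_i \log p_i, \qquad \sum_i y_i = 1

% Jacobian of softmax (\delta_{ij} is the Kronecker delta)
\frac{\partial p_i}{\partial z_j} = p_i \,(\delta_{ij} - p_j)

% Chain rule: the gradient collapses to "prediction minus label"
\frac{\partial L}{\partial z_j}
  = -\sum_i \frac{y_i}{p_i}\, p_i \,(\delta_{ij} - p_j)
  = -y_j + p_j \sum_i y_i
  = p_j - y_j
```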
Q: What's the technical atmosphere like on Spotify's recommendation team?
A: From what I understand, the technical atmosphere is excellent, with frequent paper sharing and technical discussions. The interviewer also mentioned the team encourages publishing papers and attending top conferences.