Google System Design Interview: Designing YouTube's Recommendation System

System Design | June 18, 2024 | Author: BeautyResume Team

A backend engineer with 3 years of experience recounts a Google system design interview, detailing the complete process of designing YouTube's recommendation system: recall, ranking, re-ranking, real-time features, A/B testing, and cold start strategies.


Background

I interviewed at Google in March 2024 for a backend software engineer role with 3 years of experience. Honestly, Google's system design interview was nothing like what I expected — it wasn't the "draw an architecture diagram and call it a day" format. The interviewer drilled deep into every detail, pushing until I couldn't go further. The team I was interviewing for worked on YouTube recommendations, so the interviewer asked me to design YouTube's recommendation system from scratch. My palms were sweating, but I had prepared a general recommendation system framework, so I managed to push through. Here's my complete review of this interview.

Interview Process Review

The interviewer was a staff engineer who introduced himself as being on the YouTube Recommendations infrastructure team. The entire interview lasted about 50 minutes at a very tight pace.

First 5 minutes: The interviewer briefly outlined the format and dove straight in: "Design a short-video recommendation system similar to YouTube Shorts." Following my prepared framework, I started by asking requirement-clarification questions.

5-15 minutes: Requirements clarification. I asked several key questions:

1. User scale? — "Assume 300M DAU."

2. Recommendation objective? — "Maximize watch time and engagement rate."

3. Real-time requirements? — "After a user watches a video, the next recommendation should reflect that behavior."

4. Cold start handling? — "Consider both new users and new videos."

5. Content safety? — "Need a content moderation mechanism."

The interviewer seemed satisfied with these questions and nodded, "Continue."

15-35 minutes: This was the core segment. I presented the Recall → Ranking → Re-ranking three-layer architecture. The interviewer added deep follow-up questions at each stage: "How do you solve cold start for collaborative filtering?" "What model do you use for fine ranking?" "How do you ensure recommendation diversity?" I answered each one — some fluently, some with a bit of stumbling.

35-45 minutes: The interviewer shifted to engineering implementation details: "How do you update features in real time?" "How do you run A/B experiments?" "How do you handle video embeddings?" I felt my answers here were mediocre — there were a few points I genuinely hadn't thought through.

Last 5 minutes: The interviewer let me ask questions. I asked about the team's tech stack and iteration cadence for the recommendation system.

Real Question: Design YouTube's Recommendation System

1. Overall Architecture Design

I drew a three-layer architecture diagram: Recall Layer → Ranking Layer → Re-ranking Layer. The interviewer glanced at it and said, "The basic framework is fine — expand on each layer."

2. Recall Layer Design

The recall layer's goal is to filter hundreds of millions of videos down to a few thousand candidates. I designed four parallel recall paths:

1. Collaborative Filtering Recall: Based on the user behavior matrix, use ItemCF to find similar videos. The interviewer followed up: "What about cold start?" I explained that new videos go through content-based recall first and join the collaborative filtering pool once they accumulate enough behavioral data.

2. Content-Based Recall: Use video tags, categories, and embeddings for similarity-based recommendations. I mentioned using a two-tower model to extract user and video features separately, then compute the dot product. The interviewer asked "What do you use for vector search?" I answered approximate nearest neighbor search, for example FAISS with an HNSW or IVF index, which can support millisecond-level retrieval over billion-scale vectors.

3. Popularity Recall: Surface trending videos by region and time slot, ensuring new users see quality content. The interviewer asked "How do you avoid filter bubbles?" — I explained that popularity recall only accounts for a small portion (~10%) of the candidate set, and we add a random exploration mechanism.

4. Social Recall: Videos watched by followed users or liked by friends. This path typically has the highest CTR but limited coverage.
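To make the collaborative filtering path concrete, here is a minimal ItemCF sketch. It assumes a toy in-memory watch history and uses co-occurrence cosine similarity; the function names and the data are made up for illustration and are not what I wrote in the interview. A production system would compute similarities offline over far larger data.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarity(user_histories):
    """Co-occurrence cosine similarity between videos.

    user_histories: dict of user_id -> set of watched video ids (toy input).
    """
    watch_count = defaultdict(int)  # video -> number of users who watched it
    co_count = defaultdict(int)     # (video_a, video_b) -> users who watched both
    for videos in user_histories.values():
        for v in videos:
            watch_count[v] += 1
        for a, b in combinations(sorted(videos), 2):
            co_count[(a, b)] += 1

    sim = defaultdict(dict)
    for (a, b), n in co_count.items():
        score = n / sqrt(watch_count[a] * watch_count[b])
        sim[a][b] = score
        sim[b][a] = score
    return sim


def itemcf_recall(user_id, user_histories, sim, k=3):
    """Score unseen videos by summed similarity to the user's watched videos."""
    watched = user_histories[user_id]
    scores = defaultdict(float)
    for v in watched:
        for other, s in sim.get(v, {}).items():
            if other not in watched:
                scores[other] += s
    # Highest score first; ties broken by video id for determinism.
    return sorted(scores, key=lambda vid: (-scores[vid], vid))[:k]


histories = {
    "u1": {"v1", "v2"},
    "u2": {"v1", "v2", "v3"},
    "u3": {"v2", "v3"},
    "u4": {"v1", "v4"},
}
sim = item_similarity(histories)
print(itemcf_recall("u1", histories, sim))  # ['v3', 'v4']
```

A brand-new video never appears in any history, so it gets no similarity scores at all, which is exactly why it has to enter through content-based recall first.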

The interviewer was generally satisfied with the recall layer design but pointed out something I missed: recall deduplication. Videos the user has already watched or marked as uninterested should be filtered out at the recall stage, otherwise they waste ranking-layer compute resources.
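The deduplication point can be sketched as a merge step that runs right after the parallel recall paths return. This is only an illustration: the per-path quotas, path names, and the `merge_recall` helper are my own assumptions, and real systems usually keep the watched set in a Bloom filter or similar structure rather than a Python set.

```python
def merge_recall(paths, quotas, watched, not_interested):
    """Merge per-path candidate lists into one deduplicated list.

    paths:  dict of path name -> ranked list of video ids
    quotas: dict of path name -> max candidates taken from that path
    """
    blocked = watched | not_interested
    merged, seen = [], set()
    for name, candidates in paths.items():
        taken = 0
        for vid in candidates:
            if taken >= quotas.get(name, 0):
                break
            # Drop watched / "not interested" videos and cross-path duplicates
            # here, so the ranking layer never spends compute on them.
            if vid in blocked or vid in seen:
                continue
            merged.append((vid, name))
            seen.add(vid)
            taken += 1
    return merged


paths = {
    "itemcf": ["v1", "v2", "v3"],
    "content": ["v2", "v4", "v5"],
    "popular": ["v6", "v1", "v7"],
    "social": ["v8"],
}
quotas = {"itemcf": 2, "content": 2, "popular": 1, "social": 1}
result = merge_recall(paths, quotas, watched={"v1"}, not_interested={"v6"})
print([vid for vid, _ in result])  # ['v2', 'v3', 'v4', 'v5', 'v7', 'v8']
```

Keeping the source path next to each candidate is optional, but it makes it easy to check later how much each recall path actually contributes.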

3. Ranking Layer Design

The ranking layer's goal is to narrow thousands of candidates down to roughly a hundred. I split it into coarse ranking and fine ranking:

1. Coarse Ranking: Use a lightweight model (e.g., two-tower) for fast scoring, reducing thousands of candidates to hundreds. The core requirement is low latency — scoring each video in under 1ms.

2. Fine Ranking: Use deep learning models, such as DIN or DIEN for modeling the user behavior sequence and MMoE for multi-objective optimization. I detailed several key design decisions:

- Multi-Objective Optimization: Simultaneously predict CTR, completion rate, like rate, comment rate, and share rate, then fuse them with a weighted formula. The interviewer asked "How do you determine the weights?" — "Through online A/B experimentation."

- Feature Engineering: User features (historical behavior sequence, profile tags), video features (duration, resolution, tag embeddings), context features (time, network, device). The user behavior sequence is the most important feature — use Attention mechanisms to extract behaviors relevant to the current candidate video.

- Training Data: Use impression-but-no-click as negative samples, but watch out for sample selection bias. The interviewer asked "How do you address this?" — Use random negative sampling to supplement unexposed samples.
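The "weighted formula" for multi-objective fusion can be as simple as a weighted sum of the predicted probabilities. The sketch below assumes that form; the weight values and the `fused_score` and `rank` helpers are hypothetical, and in practice the weights would come out of online A/B experiments as described above.

```python
# Hypothetical fusion weights, one per predicted objective.
FUSION_WEIGHTS = {
    "ctr": 1.0,
    "completion": 2.0,
    "like": 0.5,
    "comment": 0.5,
    "share": 1.0,
}


def fused_score(preds, weights=FUSION_WEIGHTS):
    """Weighted sum of the per-objective predictions for one candidate."""
    return sum(weights[name] * preds[name] for name in weights)


def rank(candidates, weights=FUSION_WEIGHTS):
    """Sort candidates by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: fused_score(c["preds"], weights),
        reverse=True,
    )


candidates = [
    {"id": "b", "preds": {"ctr": 0.20, "completion": 0.10, "like": 0.05,
                          "comment": 0.02, "share": 0.02}},
    {"id": "a", "preds": {"ctr": 0.10, "completion": 0.30, "like": 0.02,
                          "comment": 0.01, "share": 0.01}},
]
print([c["id"] for c in rank(candidates)])  # ['a', 'b']
```

With these made-up weights, video "a" wins despite its lower CTR because completion rate is weighted more heavily. That is the kind of trade-off the weights encode.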

4. Re-ranking Layer Design

The re-ranking layer applies business rules and diversity guarantees on top of the ranking results:

1. Diversity Control: Use MMR (Maximal Marginal Relevance) to balance relevance and diversity, preventing consecutive recommendations of the same type.

2. Business Rules: Ad insertion slots, promotional video guarantees, sensitive content filtering. These rules override model rankings.

3. Context Adjustment: Dynamic adjustments based on time of day (e.g., sleep-aid content late at night) and network status (short videos on weak connections).
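For the diversity control step, MMR greedily picks the candidate that maximizes lam * relevance minus (1 - lam) * (maximum similarity to anything already selected). Below is a minimal sketch; the category-based similarity function, the relevance scores, and lam = 0.7 are assumptions for illustration, and a real system would more likely use embedding similarity.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance selection.

    relevance:  dict of candidate -> ranking score
    similarity: function (a, b) -> value in [0, 1]
    lam:        1.0 means pure relevance, 0.0 means pure diversity
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            # Penalize candidates that resemble something already chosen.
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: similarity is 1.0 for videos in the same category, else 0.0.
category = {"c1": "comedy", "c2": "comedy", "c3": "cooking", "c4": "comedy"}
relevance = {"c1": 0.90, "c2": 0.85, "c3": 0.60, "c4": 0.80}


def same_category(a, b):
    return 1.0 if category[a] == category[b] else 0.0


print(mmr_rerank(list(category), relevance, same_category, k=3))
# ['c1', 'c3', 'c2']
```

Pure relevance ordering would return three comedy videos in a row; with the similarity penalty, the cooking video moves up to the second slot.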

5. Real-Time Feature Updates

This was a key area the interviewer drilled into. A core advantage of YouTube's recommendation system is real-time feedback: after a user watches a comedy video, the next recommendation should already reflect that interest.

My design: User Behavior → Kafka → Flink Stream Processing → Feature Update → Model Inference.

Specifically, every play, like, and skip action is written to Kafka in real time. Flink consumes these events and updates the user's real-time features (e.g., last 10 watched videos, current session interest tags). Updated features are written to Redis, and the fine-ranking model reads from Redis during inference.
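The feature-update step itself is small. The sketch below shows only the per-event logic, with an in-memory class standing in for Redis and a plain list standing in for the Kafka stream; it does not use the Flink or Redis APIs. The event schema, action weights, and class name are assumptions for illustration.

```python
from collections import defaultdict, deque

MAX_RECENT = 10  # "last 10 watched videos"
# Hypothetical weights for how much each action shifts a session interest tag.
ACTION_WEIGHTS = {"play": 1.0, "like": 2.0, "skip": -0.5}


class RealtimeFeatureStore:
    """In-memory stand-in for the Redis feature store (illustration only)."""

    def __init__(self):
        self.recent_videos = defaultdict(lambda: deque(maxlen=MAX_RECENT))
        self.session_tags = defaultdict(lambda: defaultdict(float))

    def apply(self, event):
        """Update one user's real-time features from a single behavior event."""
        user = event["user_id"]
        if event["action"] == "play":
            # deque(maxlen=...) evicts the oldest video automatically.
            self.recent_videos[user].append(event["video_id"])
        weight = ACTION_WEIGHTS[event["action"]]
        for tag in event["tags"]:
            self.session_tags[user][tag] += weight

    def features(self, user):
        """What the fine-ranking model would read at inference time."""
        tags = self.session_tags[user]
        positive = [t for t in tags if tags[t] > 0]
        top = sorted(positive, key=lambda t: (-tags[t], t))[:3]
        return {"recent_videos": list(self.recent_videos[user]), "top_tags": top}


store = RealtimeFeatureStore()
events = [
    {"user_id": "u1", "action": "play", "video_id": f"v{i}", "tags": ["comedy"]}
    for i in range(12)
]
events.append({"user_id": "u1", "action": "skip", "video_id": "v12", "tags": ["news"]})
events.append({"user_id": "u1", "action": "like", "video_id": "v11", "tags": ["comedy"]})
for e in events:
    store.apply(e)
print(store.features("u1"))
```

After these events the user's recent list holds v2 through v11, and "comedy" is the only tag with a positive session score, since the skipped news video pushed "news" below zero.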

The interviewer asked about latency requirements. I said "end-to-end latency under 500ms." He followed up: "How do you guarantee that?" I mentioned several optimizations: Flink uses processing-time windows instead of event-time windows, so it doesn't wait on watermarks for late events; Redis uses pipelining to batch reads and writes; model inference runs on a GPU inference service.

6. A/B Experimentation Platform

The interviewer asked "How do you validate recommendation algorithm effectiveness?" I explained the A/B testing design:

1. Layered Experiments: Each layer — recall, ranking, re-ranking — has its own experiment layer. Experiments in different layers are orthogonal and don't interfere with each other. This allows recall experiment A and ranking experiment B to run simultaneously.

2. Traffic Splitting: Hash by user ID for consistent bucketing. Experiment groups typically receive 5-10% of traffic.

3. Metric Monitoring: Core metrics (watch time per user, engagement rate) + guardrail metrics (negative feedback rate, report rate). Guardrail metrics must not degrade, or the experiment rolls back immediately.
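Layered, orthogonal bucketing is usually done by hashing the user ID together with a per-layer salt, so a user's bucket in the recall layer tells you nothing about their bucket in the ranking layer. Here is a minimal sketch; the bucket count, experiment names, and bucket ranges are all made up for illustration.

```python
import hashlib

NUM_BUCKETS = 100


def bucket(user_id, layer):
    """Stable bucket in [0, NUM_BUCKETS) for a user within one experiment layer."""
    # Salting the hash with the layer name makes layers independent of each other.
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(user_id, layer, experiments):
    """experiments: list of (name, start_bucket, end_bucket), end exclusive."""
    b = bucket(user_id, layer)
    for name, start, end in experiments:
        if start <= b < end:
            return name
    return "control"


# Hypothetical experiments, each taking 10% of traffic in its own layer.
RECALL_EXPERIMENTS = [("new_itemcf", 0, 10)]
RANKING_EXPERIMENTS = [("new_fusion_weights", 0, 10)]

user = "user_42"
print(assign(user, "recall", RECALL_EXPERIMENTS),
      assign(user, "ranking", RANKING_EXPERIMENTS))
```

Because the assignment is a pure function of the user ID and layer name, the same user always lands in the same group, and no assignment table needs to be stored.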

7. Cold Start Strategy

New User Cold Start: Select interest tags at registration → mix popularity recall + interest-tag recall → quickly accumulate behavioral data → gradually transition to personalized recommendations.

New Video Cold Start: New videos enter an exploration traffic pool (~5% traffic). Based on initial feedback, the system decides whether to expand recommendations. High-performing videos progressively enter larger pools — similar to a "horse-racing mechanism."
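The "horse-racing" promotion rule for new videos can be sketched as a tiered gate: a video collects a fixed impression budget in its current pool, then either moves up or drops out based on early feedback. The pool sizes, the completion-rate threshold, and the `next_tier` helper are hypothetical numbers and names chosen for illustration.

```python
POOL_SIZES = [1_000, 10_000, 100_000]  # hypothetical impression budget per tier
MIN_COMPLETION_RATE = 0.3              # hypothetical promotion threshold


def next_tier(tier, impressions, completions):
    """Decide what happens to a new video in the exploration pools.

    Returns:
      tier      if the video is still collecting impressions in its pool,
      None      if it underperformed and leaves the exploration pools,
      tier + 1  if it is promoted; a value equal to len(POOL_SIZES) means
                it graduated to the regular recall paths.
    """
    if impressions < POOL_SIZES[tier]:
        return tier
    if completions / impressions < MIN_COMPLETION_RATE:
        return None
    return tier + 1


print(next_tier(0, impressions=500, completions=400))     # 0 (still collecting)
print(next_tier(0, impressions=1_000, completions=400))   # 1 (promoted)
print(next_tier(0, impressions=1_000, completions=100))   # None (dropped)
```

Waiting for the full impression budget before judging keeps a video from being promoted or dropped on a handful of early views.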

Key Takeaways

1. Always present recommendation systems using the "Recall → Ranking → Re-ranking" three-layer framework. This is the industry standard — interviewers immediately recognize that you know the domain.

2. Don't just talk about algorithms — discuss engineering implementation. Interviewers care more about how you operationalize models than how advanced the models are. Real-time feature updates, A/B experiments, and cold start are the differentiators.

3. Proactively discuss trade-offs. For example: "Coarse ranking uses a simpler model, trading accuracy for latency; fine ranking uses a complex model, trading latency for accuracy." This kind of statement earns major points.

4. Prepare specific numbers. For example: "Recall filters from 1 billion to 5,000; coarse ranking from 5,000 to 500; fine ranking from 500 to 100." Numbers are far more persuasive than pure text.

5. Google's interview style is "drill to the bottom" — the interviewer will keep pushing until you can't answer. This is normal. Don't panic — answer to the best of your ability.

FAQ

Q: How many system design rounds does Google typically have?

A: Usually 2-3 system design rounds. The first is general architecture, the second is a specific business scenario, and the third may be a cross-domain comprehensive question.

Q: Which papers should I study for recommendation system interviews?

A: Deep Interest Network (DIN), Deep Interest Evolution Network (DIEN), Deep & Cross Network, Wide & Deep Learning. You don't need to read every paper in depth, but you should be able to articulate the core ideas.

Q: What if I don't have recommendation system experience?

A: Approach it from a general system design perspective, breaking the recommendation system into four modules: "Data Collection → Feature Engineering → Model Training → Online Serving." Interviewers care more about your system design skills than your recommendation algorithm expertise.
