Uber Infrastructure Engineer Interview: Middleware, Containerization, and Service Governance

Interview Experience | Author: BeautyResume Team

I have 4 years of infrastructure experience and interviewed for an Infrastructure Engineer role at Uber. Round 1 covered Java concurrency and middleware principles, Round 2 covered K8s and service governance, and Round 3 was system design plus a deep project dive. This write-up includes the real questions and my preparation tips.

Background

Let me start with my situation: I have 4 years of infrastructure experience at a mid-size internet company, where I was mainly responsible for middleware and the containerization platform. Honestly, before interviewing with Uber's infrastructure team I felt fairly confident. Infrastructure interviews go deep, but the scope is relatively fixed: middleware principles, containerization, and service governance are the three mountains you need to climb. The actual interview turned out to be far deeper than I expected, especially on middleware principles. "I've used it" was not enough; I was expected to explain the underlying implementation.

I applied for Uber's Infrastructure Engineer position, based in San Francisco. Through a referral, it took about 4 days from application to the first round. The entire process consisted of three technical rounds plus an HR round, spanning about 3 weeks. Let me break it down in detail.

Interview Process Review

Round 1: Java Concurrency + Middleware Principles

The Round 1 interviewer was a calm-looking guy who started with a self-introduction, then went straight into technical questions. He said, "This round mainly covers fundamentals, especially Java concurrency and middleware principles."

The first question went straight to Java concurrency: "What's the difference between synchronized and ReentrantLock? What are their respective use cases?" I had prepared for this. synchronized is a lock built into the JVM and the language, while ReentrantLock is a library-level lock in java.util.concurrent. synchronized is released automatically when the block exits, while ReentrantLock must be unlocked manually, normally in a finally block; ReentrantLock supports fair locking, interruptible and timed acquisition, and multiple condition variables, while synchronized doesn't. The interviewer followed up, "Do you understand synchronized's lock escalation process?" I said yes: no lock → biased lock → lightweight lock → heavyweight lock. He asked me to explain biased locks in detail. I said that when there is no contention, a biased lock is biased toward the first thread that acquires it, which is done by using CAS to write that thread's ID into the Mark Word in the object header. When another thread competes for the lock, the bias is revoked and the lock is upgraded to a lightweight lock. (If you prepare this answer, note that biased locking was deprecated and disabled by default in JDK 15 by JEP 374, so the biased step only applies to older JVMs.)
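
The ReentrantLock features listed in that answer can be sketched in a few lines. This is a minimal illustration, not code from the interview: the BoundedBuffer class and its method names are hypothetical. It shows a fair lock, interruptible acquisition, and two separate Condition objects on one lock, none of which synchronized offers.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical bounded buffer, used only to illustrate the ReentrantLock API.
class BoundedBuffer<T> {
    private final ReentrantLock lock = new ReentrantLock(true); // true = fair lock
    private final Condition notFull = lock.newCondition();      // producers wait here
    private final Condition notEmpty = lock.newCondition();     // consumers wait here
    private final Queue<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) { this.capacity = capacity; }

    // Returns false if the buffer stays full for the whole timeout
    // or the calling thread is interrupted.
    boolean offer(T item, long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            lock.lockInterruptibly(); // unlike synchronized, this wait can be interrupted
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            while (items.size() == capacity) {
                if (nanos <= 0) return false;
                nanos = notFull.awaitNanos(nanos);
            }
            items.add(item);
            notEmpty.signal();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock(); // manual release, always in finally
        }
    }

    // Returns null if nothing arrives within the timeout or the thread is interrupted.
    T poll(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            lock.lockInterruptibly();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        try {
            while (items.isEmpty()) {
                if (nanos <= 0) return null;
                nanos = notEmpty.awaitNanos(nanos);
            }
            T item = items.remove();
            notFull.signal();
            return item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            lock.unlock();
        }
    }
}
```

With synchronized, producers and consumers would have to share one wait set and use notifyAll; two conditions let each side wake only the other.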

Next came the main event on middleware principles. "Why is Kafka so fast? How is zero-copy implemented?" I said Kafka is fast for several reasons: sequential disk writes, zero-copy, batch sending, and the page cache. Zero-copy is implemented through the sendfile system call: data is read from disk into the kernel's page cache and handed from there to the network card via DMA, without passing through user space. The interviewer followed up, "What's the difference between sendfile and mmap?" I said sendfile moves the data entirely inside the kernel, while mmap maps the file into the process's address space. mmap saves the copy into a user-space buffer, but the application still has to issue a write, and the kernel still copies the data from the page cache into the socket buffer. He then asked, "What's the relationship between Kafka partitions and consumer groups?" I said one partition can only be consumed by one consumer within a consumer group, while one consumer can consume multiple partitions.
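
In Java, the usual way to reach sendfile is FileChannel.transferTo, which is also the call Kafka's broker relies on when serving log segments. The sketch below is a simplified illustration under assumptions: ZeroCopySend, sendFile, and roundTrip are hypothetical names, and the demo writes into an in-memory channel, where the JDK falls back to an ordinary copy. The kernel-only path applies when the target is a socket channel on a platform that supports it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySend {
    // Streams the whole file into 'target' without reading it into a Java buffer.
    // Assumes a blocking target channel; transferTo may send fewer bytes than asked.
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            while (sent < size) {
                sent += in.transferTo(sent, size - sent, target);
            }
            return sent;
        }
    }

    // Demo helper (hypothetical): writes 'payload' to a temp file, sends it through
    // sendFile into memory, and returns what arrived on the other side.
    static byte[] roundTrip(byte[] payload) {
        try {
            Path tmp = Files.createTempFile("zero-copy-demo", ".log");
            try {
                Files.write(tmp, payload);
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                sendFile(tmp, Channels.newChannel(sink));
                return sink.toByteArray();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a real server the target would be the client's SocketChannel, so the bytes never enter the JVM heap.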

Redis was also mandatory. "What are Redis's persistence solutions? What are their pros and cons?" I said RDB is snapshot persistence—small file size, fast recovery, but may lose data between snapshots; AOF is append-only log persistence—high data safety, but large file size and slow recovery. The interviewer followed up, "Do you understand AOF's rewrite mechanism?" I said AOF rewrite forks a child process that traverses Redis's in-memory data to generate a new AOF file. During rewriting, new commands are written to both the old AOF and the rewrite buffer. After rewriting completes, the new AOF replaces the old one. He then asked, "Do you understand Redis's clustering solution?" I said Redis Cluster uses hash slot sharding, with 16384 slots distributed across different nodes. Clients calculate the slot using CRC16(key) % 16384.
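
The slot formula from that answer is easy to verify by hand. Here is a minimal sketch, assuming the CRC16 variant documented for Redis Cluster (CCITT/XMODEM, polynomial 0x1021, initial value 0). RedisSlot is a hypothetical class name, and the sketch deliberately ignores hash tags, the {...} syntax that Redis uses to force related keys into the same slot.

```java
import java.nio.charset.StandardCharsets;

class RedisSlot {
    // CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000, no bit reflection.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    // Simplification: hashes the whole key and ignores {hash tags}.
    static int slot(String key) {
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }
}
```

The Redis Cluster specification gives 0x31C3 as the CRC16 of the string "123456789", which is a convenient check for any reimplementation.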

Round 1 also covered more Java concurrency questions: "What are the core parameters of a thread pool? What are the rejection policies?" I said the ThreadPoolExecutor constructor takes corePoolSize, maximumPoolSize, keepAliveTime with its TimeUnit, workQueue, threadFactory, and a RejectedExecutionHandler. Rejection policies include AbortPolicy (throw an exception), CallerRunsPolicy (the caller executes the task), DiscardPolicy (silently discard), and DiscardOldestPolicy (discard the oldest queued task). The interviewer followed up, "What happens if core threads are full, the queue is full, and max threads are also full?" I said the rejection policy is executed. He asked, "What rejection policy do you use in production?" I said we use CallerRunsPolicy because it doesn't discard tasks. The submitting thread runs the task itself, which slows down submission and acts as a form of back-pressure.
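
The saturation path described in that answer can be reproduced with a deliberately tiny pool. This is a sketch with made-up sizes (1 core thread, 2 max threads, queue of 1), and CallerRunsDemo is a hypothetical class name. Three blocking tasks fill the core thread, the queue, and the extra thread, so the fourth task is rejected and CallerRunsPolicy runs it on the submitting thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CallerRunsDemo {
    static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(
                1,                                   // corePoolSize
                2,                                   // maximumPoolSize
                30, TimeUnit.SECONDS,                // keepAliveTime + unit
                new ArrayBlockingQueue<>(1),         // workQueue (bounded)
                Executors.defaultThreadFactory(),    // threadFactory
                new ThreadPoolExecutor.CallerRunsPolicy()); // handler
    }

    // Returns the name of the thread that ended up running the overflow task.
    static String runOverflowTask() {
        ThreadPoolExecutor pool = newPool();
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        pool.execute(blocker); // occupies the core thread
        pool.execute(blocker); // waits in the queue
        pool.execute(blocker); // queue is full, so a second thread is created
        String[] ranOn = new String[1];
        // Pool and queue are saturated: the handler runs this on the caller.
        pool.execute(() -> ranOn[0] = Thread.currentThread().getName());
        release.countDown();
        pool.shutdown();
        return ranOn[0];
    }
}
```

Because the submitter is busy running the task, it cannot submit more work in the meantime, which is where the throttling effect comes from.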

Round 1 lasted about 1 hour. The interviewer said, "Your fundamentals are solid, wait for the Round 2 notification."

Round 2: K8s + Service Governance

The Round 2 interviewer was a senior tech expert who jumped straight into project questions. He said, "I see containerization and service governance on your resume—tell me more."

I started with our containerization practices. "How did you migrate from VMs to containers? What problems did you encounter?" I said we did it in three steps: first containerize stateless services, then stateful services, and finally databases. The biggest challenge was networking—container networks and VM networks couldn't communicate. We used Calico as the network plugin with BGP mode for cross-node communication. The interviewer followed up, "What's the difference between Calico's BGP mode and IPIP mode?" I said BGP mode uses direct routing with good performance but requires network BGP support; IPIP mode uses tunnel encapsulation with good compatibility but extra overhead. He then asked, "How do you handle network policies for container networks?" I said we use NetworkPolicy to restrict access between Pods.

K8s scheduling was a key focus. "How does the K8s scheduler work? What scheduling policies are there?" I said the scheduler works in two phases: filtering (called predicates in older versions) removes nodes that cannot run the Pod, and scoring (formerly priorities) ranks the remaining nodes to select the best one. Scheduling policies include node affinity, pod affinity/anti-affinity, taints and tolerations, and resource requests. The interviewer followed up, "If a node has insufficient resources but a Pod must be scheduled there, what do you do?" I said you can use nodeSelector or nodeAffinity to pin the Pod to that node, or adjust its resource requests so that it fits. He then asked, "What if the cluster has overall insufficient resources?" I said you need to scale the cluster or optimize resource usage.
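
The two-phase idea is easier to remember as code. The following is a toy model only, not kube-scheduler's implementation: SchedulerSketch, Node, Pod, and the scoring formula are all assumptions made for illustration. It filters on resource requests and a single taint flag, then prefers the node that would have the largest share of free CPU and memory left after placement.

```java
import java.util.List;
import java.util.Optional;

class SchedulerSketch {
    static final class Node {
        final String name;
        final int cpuCapacity, cpuFree;   // millicores
        final int memCapacity, memFree;   // MiB
        final boolean tainted;
        Node(String name, int cpuCapacity, int cpuFree,
             int memCapacity, int memFree, boolean tainted) {
            this.name = name;
            this.cpuCapacity = cpuCapacity;
            this.cpuFree = cpuFree;
            this.memCapacity = memCapacity;
            this.memFree = memFree;
            this.tainted = tainted;
        }
    }

    static final class Pod {
        final int cpuRequest, memRequest;
        final boolean toleratesTaint;
        Pod(int cpuRequest, int memRequest, boolean toleratesTaint) {
            this.cpuRequest = cpuRequest;
            this.memRequest = memRequest;
            this.toleratesTaint = toleratesTaint;
        }
    }

    // Phase 1 (filter): can this node run the pod at all?
    static boolean feasible(Pod pod, Node node) {
        if (node.tainted && !pod.toleratesTaint) return false;
        return node.cpuFree >= pod.cpuRequest && node.memFree >= pod.memRequest;
    }

    // Phase 2 (score): average fraction of CPU and memory still free after placement.
    static double score(Pod pod, Node node) {
        double cpu = (node.cpuFree - pod.cpuRequest) / (double) node.cpuCapacity;
        double mem = (node.memFree - pod.memRequest) / (double) node.memCapacity;
        return (cpu + mem) / 2;
    }

    // Returns the chosen node name, or empty if the pod would stay Pending.
    static Optional<String> schedule(Pod pod, List<Node> nodes) {
        Node best = null;
        for (Node n : nodes) {
            if (!feasible(pod, n)) continue;
            if (best == null || score(pod, n) > score(pod, best)) best = n;
        }
        return best == null ? Optional.empty() : Optional.of(best.name);
    }
}
```

An empty result corresponds to the follow-up question: when no node passes the filter, the only options are to free up resources, lower the requests, or add nodes.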

Service governance was the core of Round 2. "What's your service governance solution? How do you handle service discovery?" I said we use Nacos for service discovery and configuration management—services register with Nacos, and consumers get service lists from Nacos. The interviewer asked, "What's the difference between Nacos and Eureka?" I said Nacos supports AP and CP mode switching, while Eureka only supports AP mode; Nacos supports configuration management, Eureka doesn't; Nacos supports health checks, Eureka relies on client heartbeats. He then asked, "How do you implement circuit breaking?" I said we use Sentinel for circuit breaking and degradation, supporting three circuit breaking strategies: slow call ratio, exception ratio, and exception count.

Then came a question that left a deep impression: "If service A depends on service B, service B depends on service C, and service C goes down, how do you prevent cascading failures?" I said we implemented several layers of protection: 1) Sentinel circuit breaking—when service C is abnormal, calls from B to C are broken and fail fast; 2) Timeout control—each call has a timeout, with automatic degradation on timeout; 3) Rate limiting—inbound traffic is rate-limited to prevent avalanches; 4) Fallback—after circuit breaking, return default values or cached data. The interviewer said, "Comprehensive approach, but how do you ensure consistency of cached data in the fallback?" I said we use local cache + version numbers, periodically updating from the configuration center.
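
To show how circuit breaking, fail-fast, and fallback fit together, here is a hand-rolled sketch. It is not Sentinel's API: SimpleCircuitBreaker, its parameters, and the consecutive-failure rule are assumptions chosen for brevity. The clock is injected so the behavior can be checked without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before opening
    private final long openMillis;       // how long to fail fast before probing
    private final LongSupplier clock;    // injected time source, in milliseconds
    private State state = State.CLOSED;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // synchronized for brevity; a production breaker would not hold a lock
    // across the remote call.
    synchronized <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get();       // fail fast, do not touch the remote
            }
            state = State.HALF_OPEN;         // window elapsed: allow one probe
        }
        try {
            T result = remote.get();
            failures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();           // degrade to default or cached data
        }
    }

    synchronized State state() { return state; }
}
```

In the A → B → C scenario, B would wrap its call to C in such a breaker, so A keeps getting fast (if degraded) answers while C is down.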

Round 2 lasted about 1.5 hours. The interviewer said, "Your containerization and service governance experience is solid, but some solutions need optimization at Uber's scale."

Round 3: System Design + Deep Project Dive

Round 3 was with the department director. The atmosphere was more formal. He first asked about my understanding of infrastructure. I said, "The core value of infrastructure is improving development efficiency and system stability, reducing business teams' onboarding costs through standardization and automation." He nodded, then started diving deep into projects.

"What's the most technically challenging infrastructure project you've worked on?" I described a configuration center project supporting real-time push of million-level configurations. The biggest challenge was push performance and consistency. We used long polling + version numbers for real-time configuration change push—the server maintains configuration version numbers, and clients periodically check if the version has changed, pulling new configurations when it has. The interviewer followed up, "How does the server handle a million clients doing long polling simultaneously?" I said we did several optimizations: 1) Grouped push—push only to clients subscribed to the configuration group; 2) Batch response—merge responses when multiple clients check the same configuration; 3) Connection reuse—use HTTP/2 multiplexing to reduce connections. He then asked, "How do you ensure configuration change consistency?" I said we use version numbers + MD5 checksums—clients verify MD5 after pulling configurations and retry if inconsistent.

Then came an open-ended system design question: "Design a service registry supporting tens of millions of instances." I said I would design it from several aspects: 1) Storage—use sharded storage where each shard handles a portion of services; 2) Push—use incremental push + compression to reduce data volume; 3) Consistency—use Raft protocol to ensure data consistency; 4) Availability—multi-datacenter deployment with cross-datacenter synchronization. The interviewer asked, "Do you understand Raft's election process?" I said yes—Follower transitions to Candidate after election timeout, initiates an election, and becomes Leader after receiving a majority of votes. He then asked, "What happens during a network partition?" I said after partitioning, the majority partition can elect a new Leader, while the minority partition cannot, ensuring consistency.
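
The election and partition answers both come down to vote counting, which a small model makes concrete. This is a simplified sketch of Raft's RequestVote rule, with hypothetical names (RaftVoteSketch, Node, runElection) and no timers, RPCs, or log replication: a node grants at most one vote per term, only to a candidate whose log is at least as up to date as its own, and a candidate wins with a majority of the full cluster.

```java
import java.util.List;

class RaftVoteSketch {
    static final class Node {
        final int id;
        int currentTerm;
        Integer votedFor;      // null = has not voted in currentTerm
        int lastLogTerm;
        int lastLogIndex;

        Node(int id) { this.id = id; }

        // Decides whether this node grants its vote to the candidate.
        boolean handleRequestVote(int term, int candidateId,
                                  int candLastLogTerm, int candLastLogIndex) {
            if (term < currentTerm) return false;   // stale candidate
            if (term > currentTerm) {               // newer term: forget the old vote
                currentTerm = term;
                votedFor = null;
            }
            boolean logUpToDate = candLastLogTerm > lastLogTerm
                    || (candLastLogTerm == lastLogTerm && candLastLogIndex >= lastLogIndex);
            if ((votedFor == null || votedFor == candidateId) && logUpToDate) {
                votedFor = candidateId;
                return true;
            }
            return false;
        }
    }

    // The candidate bumps its term, votes for itself, and asks the peers it can reach.
    // It becomes leader only with a majority of the whole cluster, not of the partition.
    static boolean runElection(Node candidate, List<Node> reachablePeers, int clusterSize) {
        candidate.currentTerm++;
        candidate.votedFor = candidate.id;
        int votes = 1;
        for (Node peer : reachablePeers) {
            if (peer.handleRequestVote(candidate.currentTerm, candidate.id,
                    candidate.lastLogTerm, candidate.lastLogIndex)) {
                votes++;
            }
        }
        return votes >= clusterSize / 2 + 1;
    }
}
```

With five nodes split 3/2, a candidate on the three-node side can collect three votes and win, while a candidate on the two-node side can never reach three, which is why only one side makes progress.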

Finally, he asked about my thoughts on Uber's infrastructure and career plans. I said Uber's infrastructure team has great industry influence, with top-tier technical depth and business scale, and I hope to deepen my expertise in infrastructure here. The interviewer said, "Welcome aboard."

The HR round was standard salary and start date discussion—nothing special.

Real Interview Questions

Round 1 Questions

1. synchronized vs ReentrantLock differences?
2. synchronized lock escalation process?
3. Why is Kafka fast? How is zero-copy implemented?
4. sendfile vs mmap differences?
5. Kafka partition and consumer group relationship?
6. Redis persistence solutions? AOF rewrite?
7. Redis Cluster hash slots?
8. Thread pool core parameters? Rejection policies?
9. HashMap underlying implementation? Resizing mechanism?
10. volatile vs synchronized differences?

Round 2 Questions

1. VM to container migration process?
2. Calico BGP vs IPIP mode differences?
3. K8s scheduler working principles? Scheduling policies?
4. Service governance solution? Service discovery approach?
5. Nacos vs Eureka differences?
6. Sentinel circuit breaking strategies?
7. How to prevent cascading failures?
8. Network policy handling?
9. Do you understand K8s HPA and VPA?
10. Container resource limit settings? OOM handling?

Round 3 Questions

1. Most technically challenging infrastructure project
2. Configuration center real-time push solution
3. How to optimize million-client long polling?
4. Design a service registry for tens of millions of instances
5. Raft protocol election process?
6. Network partition handling?
7. What is the core value of infrastructure?
8. How to drive technical solution implementation?
9. Your thoughts on Uber's infrastructure
10. Career plans

Key Takeaways

First, middleware principles must be deeply understood. Uber's infrastructure interview doesn't just require "having used" middleware—you need to explain the underlying implementation. Kafka's zero-copy, Redis's persistence, and thread pool principles are mandatory. I recommend reading source code.

Second, Java concurrency is fundamental for infrastructure interviews. synchronized's lock escalation, AQS implementation, and thread pool principles must be second nature. I recommend reading "The Art of Java Concurrent Programming" and JUC source code.

Third, containerization requires hands-on experience. Just knowing how to use Docker and K8s isn't enough—you need to know how to migrate from VMs to containers, handle networking issues, and set resource limits. I recommend setting up your own K8s cluster for practice.

Fourth, understand the full chain of service governance. Don't just know how to use Nacos and Sentinel—understand the relationships between service discovery, load balancing, circuit breaking, degradation, and rate limiting. I recommend building a complete microservices project.

Fifth, system design needs a sense of scale. Uber's infrastructure operates at the scale of millions of clients or even tens of millions of service instances, and interviewers will specifically probe how your design holds up in those large-scale scenarios. I recommend reading Uber's engineering blog to learn about their architectural practices.

FAQ

Q1: Does Uber's infrastructure interview require deep Java knowledge?

Yes, very deep. It's not about knowing how to use Spring Boot—you need to understand JVM, concurrency, and middleware principles. I recommend reading JUC source code and middleware source code.

Q2: What if I don't have containerization experience?

You can set up your own K8s cluster for practice. Use minikube or kind to set up locally, deploy a few microservices, and practice service discovery, configuration management, and rolling updates. The key is understanding the differences between containers and VMs, and K8s core concepts.

Q3: How deeply should I study middleware?

At minimum, read through the source code of core middleware. For Kafka, read the message storage and sending flow; for Redis, read data structures and persistence; for RocketMQ, read message delivery and transactions. Interviewers will ask questions at the level of "how is Kafka's zero-copy implemented."

Q4: How to learn service governance?

I recommend starting with Spring Cloud to understand service discovery, configuration management, and circuit breaking concepts, then learning Nacos and Sentinel's usage and principles. Most importantly, understand why service governance is needed, not just how to use the tools.

Q5: What's the work intensity like for Uber infrastructure?

The infrastructure team's pace is relatively stable, without the promotional pressure of business teams. But on-call is the norm—production issues need timely response. The technical atmosphere is great, and you encounter various large-scale distributed system problems. It's very helpful for infrastructure engineers' growth.
