Google SRE Interview: Kubernetes, Monitoring, and Automation Full Assessment
With 2 years of ops experience, I interviewed for a Google SRE role: three technical rounds covering Linux, networking, K8s architecture, monitoring systems, troubleshooting, and automation. Real questions and prep advice included.
Background
Let me start with my background. I have a bachelor's in Computer Science and spent 2 years doing operations development at a mid-size internet company. I mainly wrote automation scripts in Python and Go, managed a few hundred servers, and built some monitoring systems. Honestly, Google's SRE position was always a dream of mine — Google literally invented the SRE discipline, and their infrastructure scale is world-class. Getting exposure to their operational practices would be invaluable.
I applied in July through a referral channel. About a week later, HR contacted me to schedule the first round. The entire process was three technical rounds plus an HR round, completed in about two weeks. Google's interview pace is fast — each round is spaced just 2-3 days apart, unlike some companies that drag things out. Let me walk through each round in detail.
Interview Process Review
Round 1: Linux Fundamentals + Networking (about 60 minutes)
My first-round interviewer was a sharp and efficient engineer who got straight to the point without much small talk.
Linux fundamentals:
The first question had that distinctive Google flavor — "Explain the difference between processes and threads in Linux." I covered address space, resource usage, scheduling, and creation overhead. The interviewer followed up: "What are the inter-process communication methods? What are the characteristics of each?" I listed pipes, message queues, shared memory, semaphores, signals, and sockets, along with their use cases and trade-offs.
Then came a very practical question: "Explain the Linux file system. What is an inode?" I covered inode structure (metadata, data block pointers), the difference between hard links and soft links, and directory entries. Follow-up: "If df reports the disk as full but du adds up to far less usage, what could be the reason?" I answered that deleted files may still be held open by running processes: the directory entry is gone, so du no longer counts them, but the blocks are not freed until the last file descriptor is closed. These files can be found with lsof.
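For anyone practicing this question, the deleted-but-still-open case can also be checked without lsof. Below is a minimal sketch, assuming a Linux /proc layout where each /proc/PID/fd entry is a symlink and the kernel appends " (deleted)" to the target of an unlinked file. The function name and the proc_root parameter are my own, added so the logic can be tried against a fake directory tree.

```python
import os


def find_deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, target) for open descriptors whose file was unlinked.

    proc_root is a hypothetical parameter for testing; on a real host the
    default /proc is what you want.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed between listdir and readlink
            if target.endswith(" (deleted)"):
                results.append((int(entry), int(fd), target))
    return results
```

Run as root to see every process. Restarting the owning process, or asking it to reopen its log files, is what actually releases the space.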
Networking fundamentals:
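The interviewer asked: "Walk through the TCP three-way handshake and four-way teardown." I answered this smoothly, drawing the state transition diagram and detailing the packets and state changes at each stage. Follow-up: "Why is the handshake three-way instead of two-way? Why does the teardown need TIME_WAIT?" I explained that the third step lets both sides confirm each other's initial sequence numbers and stops a stale, delayed SYN from opening a connection the client no longer wants. TIME_WAIT lasts 2*MSL so that the closing side can resend the final ACK if it is lost, and so that old segments from this connection expire before the same address and port pair is reused.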
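If drawing the diagram from memory is hard, it helps to write the transitions down as a table first. The sketch below covers only the common open and close paths (no simultaneous open or close, no resets). The state names follow what netstat and ss print on Linux; the event strings and the walk helper are my own labels.

```python
# (current state, event) -> next state, for the common paths only.
TRANSITIONS = {
    # Three-way handshake, client side then server side.
    ("CLOSED", "send SYN"): "SYN_SENT",
    ("SYN_SENT", "recv SYN+ACK, send ACK"): "ESTABLISHED",
    ("LISTEN", "recv SYN, send SYN+ACK"): "SYN_RECV",
    ("SYN_RECV", "recv ACK"): "ESTABLISHED",
    # Four-way teardown, active closer.
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "2*MSL timeout"): "CLOSED",
    # Four-way teardown, passive closer.
    ("ESTABLISHED", "recv FIN, send ACK"): "CLOSE_WAIT",
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def walk(start, events):
    """Apply events in order and return every state visited."""
    states = [start]
    for event in events:
        states.append(TRANSITIONS[(states[-1], event)])
    return states
```

Walking the active closer from ESTABLISHED through "send FIN", "recv ACK", "recv FIN, send ACK", and "2*MSL timeout" shows that only the side that closes first passes through TIME_WAIT.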
There was also an HTTP question: "What improvements do HTTP/1.1, HTTP/2, and HTTP/3 each bring?" I compared persistent connections, multiplexing, header compression, server push, and the QUIC protocol. Follow-up: "What's the HTTPS handshake process?" I detailed certificate verification, key exchange, and symmetric encryption communication.
Shell and tools:
Finally, some Shell questions: "Write a command to find the top 10 IPs by access count in nginx logs." I wrote awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10, and the interviewer said that works. Also: "How do you check which files a process has open?" I answered lsof -p PID.
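Since interviewers sometimes ask for the same thing in a scripting language, here is a rough Python equivalent of that pipeline. It assumes, like the awk version, that the client IP is the first whitespace-separated field of each log line, which holds for the default nginx log format but not necessarily for a custom one. The function name top_ips is my own.

```python
from collections import Counter


def top_ips(lines, n=10):
    """Count the first field of each line and return the n most common."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)
```

Passing an open file object (for example the result of open("access.log")) as lines streams the log line by line, so memory use grows with the number of distinct IPs rather than with file size.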
Round 2: K8s + Monitoring Systems (about 75 minutes)
The second-round interviewer was clearly more senior, likely one of the SRE team's core engineers. This round had more depth and breadth.
Kubernetes:
The first question was a big one: "Explain the architecture components of Kubernetes." I detailed the control plane (API Server, etcd, Scheduler, Controller Manager) and worker nodes (kubelet, kube-proxy, container runtime), explaining each component's responsibilities and interactions. Follow-up: "What's the Pod creation process?" I walked through the entire flow: kubectl submits the request to the API Server, which validates it and persists the Pod object in etcd; the Scheduler sees the unscheduled Pod and binds it to a node; the kubelet on that node then starts the containers through the container runtime.
There was also a service discovery question: "What's the difference between Service and Ingress in K8s? What scenarios is each suitable for?" I compared ClusterIP/NodePort/LoadBalancer type Services with Ingress's Layer 7 routing capabilities and their respective use cases.
A scheduling follow-up: "If a node has insufficient resources, what happens to Pods?" I covered the Pending state, scheduling failures, and potential cluster auto-scaling triggers.
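One detail worth stating in this answer is that the scheduler's resource check works on declared requests, not live usage. The sketch below is a toy version of that filter step under simplified assumptions: only CPU (millicores) and memory (MiB) are considered, and the dictionary layout and the feasible_nodes name are hypothetical, not a Kubernetes API. The real scheduler also evaluates taints, affinity, and other constraints.

```python
def feasible_nodes(pod_request, nodes):
    """Return names of nodes whose unreserved capacity covers the Pod's requests.

    pod_request: {"cpu_m": int, "mem_mi": int}
    nodes: {name: {"allocatable": {...}, "requested": {...}}} with the same keys.
    """
    fits = []
    for name, node in nodes.items():
        # Free capacity is allocatable minus the sum of requests already placed.
        free = {r: node["allocatable"][r] - node["requested"][r] for r in pod_request}
        if all(free[r] >= pod_request[r] for r in pod_request):
            fits.append(name)
    return fits
```

An empty result corresponds to the Pod staying Pending with a scheduling failure event, which is also the signal a cluster autoscaler can react to.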
Monitoring systems:
The interviewer asked: "How was your previous monitoring system set up?" I described the Prometheus + Grafana + Alertmanager architecture, detailing the complete pipeline of metrics collection, storage, visualization, and alerting. Follow-up: "What are the pros and cons of Prometheus's pull model vs. push model (Pushgateway)?" I compared them in terms of service discovery, time series consistency, and applicable scenarios.
There was also an alerting question: "How do you solve alert fatigue from too many alerts?" I described several methods: alert tiering (P0-P3), alert aggregation and inhibition, dynamic thresholds, and SLO-based alerting strategies. The interviewer was quite satisfied with this answer.
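To make the SLO-based part concrete, here is a small sketch of a burn-rate check. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The default threshold of 14.4 and the long-window plus short-window pairing are taken from the multiwindow approach in Google's SRE Workbook, where 14.4 corresponds to spending 2% of a 30-day budget in one hour; the function names and the 99.9% default are my own assumptions for illustration.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (for example 1h) shows the problem is significant;
    the short window (for example 5m) shows it is still happening.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Requiring both windows is what cuts noise: a brief spike that has already recovered fails the short-window check and does not page.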
Logging systems:
"Describe the ELK/EFK architecture." I covered the complete pipeline from Filebeat collection, Logstash processing, Elasticsearch storage, to Kibana visualization. Follow-up: "If log volume is too large and Elasticsearch performance degrades, what do you do?" I described hot-cold data separation, index lifecycle management, adding nodes, and query optimization.
Round 3: Troubleshooting + Automation (about 70 minutes)
The third round was with the SRE team lead, mainly assessing troubleshooting ability and automation thinking.
Troubleshooting:
The interviewer gave a failure scenario: "An online service suddenly becomes slow. How would you troubleshoot?" I described the investigation approach from three levels: application layer (logs, traces, profiles), system layer (CPU, memory, I/O, network), and infrastructure layer (database, cache, message queue), emphasizing the principle of stopping the bleeding first, then investigating. The interviewer followed up with several scenarios: "What if it's caused by slow database queries?" "What if it's network jitter?" "What if it's a GC issue?" I gave targeted troubleshooting methods for each.
There was also a capacity planning question: "How do you evaluate a service's capacity limit?" I described load testing (wrk/locust), performance metrics (QPS, latency, error rate), resource bottleneck analysis, and capacity model building.
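A simple capacity model along those lines can be sketched in a few lines. It assumes a stepped load test that yields (QPS, p99 latency, error rate) per step in increasing QPS order, and a fixed headroom fraction kept in reserve. The function names, the tuple layout, and the 30% default headroom are illustrative assumptions, not numbers from the interview.

```python
import math


def max_sustainable_qps(steps, max_p99_ms, max_error_rate):
    """Highest tested QPS before the first step that breaks either limit.

    steps: list of (qps, p99_ms, error_rate), sorted by increasing qps.
    Returns 0 if even the first step violates a limit.
    """
    best = 0
    for qps, p99_ms, error_rate in steps:
        if p99_ms > max_p99_ms or error_rate > max_error_rate:
            break
        best = qps
    return best


def instances_needed(peak_qps, per_instance_qps, headroom=0.3):
    """Instances required to serve peak_qps while keeping `headroom` spare."""
    usable = per_instance_qps * (1.0 - headroom)
    return math.ceil(peak_qps / usable)
```

The per-instance limit comes from the load test, and the headroom covers things the test does not capture, such as instance failures and uneven load balancing.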
Automation:
The interviewer asked: "What automation have you done in your previous work?" I described auto-scaling, automated fault recovery, and automated deployment pipelines. Follow-up: "How do you choose auto-scaling metrics? What are the pitfalls?" I covered CPU/memory/QPS-based scaling strategies and common pitfalls like cold start delays, metric lag, and traffic spikes.
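For metric-based scaling it is worth knowing the rule the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas * current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is a simplified version of that rule. The function name and the min/max defaults are my own, and it ignores details the real controller handles, such as unready Pods and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Simplified HPA-style calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, avoid flapping
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

The pitfalls above map directly onto this formula: metric lag means current_metric is already stale when it is read, and cold starts mean the new replicas do not absorb load as soon as the number changes.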
There was also a chaos engineering question: "Are you familiar with chaos engineering? How do you implement it in production?" I described tools like Chaos Monkey, blast radius control, progressive experimentation, and observability guarantees.
On-Call:
The interviewer's final question was very practical: "What's your take on On-Call? What are good On-Call practices?" I described rotation systems, alert tiering, runbooks, and post-incident reviews, emphasizing that On-Call shouldn't be about "firefighting" but "fire prevention" — reducing On-Call burden through automation and proactive measures.
Key Interview Questions
Linux Fundamentals:
1. Difference between processes and threads
2. IPC methods and their characteristics
3. Linux file system and inodes
4. Reasons df can report a full disk while du shows far less usage
Networking:
5. TCP three-way handshake and four-way teardown
6. Why three-way instead of two-way? TIME_WAIT's purpose
7. Improvements in HTTP/1.1, HTTP/2, HTTP/3
8. HTTPS handshake process
Shell and Tools:
9. Find top 10 IPs by access count in nginx logs
10. Check which files a process has open
Kubernetes:
11. K8s architecture components and responsibilities
12. Pod creation process
13. Difference between Service and Ingress
14. Pod handling when node resources are insufficient
Monitoring:
15. Prometheus + Grafana + Alertmanager architecture
16. Pull model vs. push model pros and cons
17. Solutions for alert fatigue
18. ELK/EFK architecture
19. Elasticsearch performance optimization
Troubleshooting and Automation:
20. Troubleshooting approach for slow service response
21. Capacity evaluation methods
22. Auto-scaling metric selection and common pitfalls
23. Chaos engineering practices
24. On-Call best practices
Lessons and Advice
1. Linux and networking fundamentals are the foundation of SRE. Google's interviews will ask about advanced topics like K8s and monitoring, but if your fundamentals are weak, you'll be exposed the moment the interviewer digs deeper. I recommend thoroughly studying "TCP/IP Illustrated" and "Advanced Programming in the UNIX Environment" — not to memorize, but to truly understand.
2. Understand K8s architecture principles, not just kubectl commands. Many people just run kubectl apply, but interviewers ask about Pod creation flows, scheduling algorithms, and service discovery mechanisms — the underlying principles. I recommend reading K8s source code, at least understanding how API Server and Controllers work.
3. Have a holistic view of monitoring systems. Don't just know how to configure Prometheus rules. Understand the complete pipeline from metric definition, collection, storage, and visualization to alerting, and the trade-offs at each stage. Google's interviewers highly value your overall understanding of monitoring systems.
4. Have a methodology for troubleshooting. Don't just guess blindly. Have a systematic approach: scope the impact, mitigate first to stop the bleeding, then investigate layer by layer to find and fix the root cause. Demonstrating this systematic troubleshooting thinking during the interview will earn you significant bonus points.
5. Automation thinking should be pervasive. The core value of SRE isn't manual operations — it's improving efficiency and reliability through automation. Thinking about problems from an automation perspective during the interview will show the interviewer you have an SRE mindset.
FAQ
Q: Does Google's SRE interview require strong programming skills?
A: There are some requirements. While not as algorithm-heavy as developer roles, you need to be able to write automation scripts and tools. The interview includes Shell and Python coding questions, so prepare in advance.
Q: Can I interview for Google SRE without K8s experience?
A: It's difficult. Google's SRE positions almost always require K8s experience since their infrastructure is fully containerized. If you lack K8s experience, I suggest setting up your own cluster for practice first.
Q: How intense is On-Call at Google?
A: Honestly, Google's On-Call intensity is not low, especially for core service teams. However, Google's automation level is very high — many faults can self-heal, and the On-Call burden is gradually decreasing.
Q: Will there be algorithm questions in the interview?
A: Yes, but they're easier than developer role questions. Generally medium difficulty, focusing more on your coding ability and engineering thinking than algorithmic tricks.
Q: How is the compensation?
A: With 2 years of experience, Google SRE compensation is very competitive, roughly on par with developer roles at the same level. The equity component is also attractive.