Cruise Autonomous Perception Interview: 3D Object Detection and Multi-Sensor Fusion
A detailed review of Cruise's three technical interview rounds from a candidate with 2 years of autonomous driving perception experience: Round 1 (CV basics and 3D detection), Round 2 (LiDAR+Camera multi-sensor fusion), and Round 3 (project deep dive and BEV perception), plus a question summary and tips.
Background
I worked as a perception algorithm engineer at an autonomous driving startup for 2 years, mainly focusing on 3D object detection and multi-sensor fusion. Cruise has always been a company I really wanted to join. They have deep technical expertise in L4 autonomous driving perception, and their BEV perception and multi-sensor fusion work is widely regarded as industry-leading. When I saw they were hiring perception algorithm engineers in March, I applied immediately.
To be honest, I wasn't very confident before applying. My experience was mainly in LiDAR detection, and I had relatively little experience with Camera-based 3D detection. But Cruise's perception position requires knowledge of both LiDAR and Camera, and I wasn't sure if my partial expertise would be enough. So I spent nearly three weeks cramming, going through classic papers like PointPillars, CenterPoint, BEVFormer, and DETR3D, and hand-coding several key modules.
The interview process consisted of three technical rounds, each with a different focus. Let me review each round in detail.
Interview Process Review
Round 1: CV Basics + 3D Detection
My first interviewer was a very capable female engineer who, I later learned, was a core member of the perception team. After the self-introduction, she asked a few CV basics questions to warm up.
1. What are the feature fusion methods in FPN? What are the pros and cons of each?
I said FPN uses top-down feature fusion: semantically rich but low-resolution high-level maps are upsampled and merged into higher-resolution low-level maps. Later, PANet added a bottom-up path, and BiFPN further introduced weighted feature fusion. FPN's limitation is that information flows only one way, so the precise localization cues in low-level features have a long path before they reach the high levels. PANet's extra bottom-up path shortens that path and strengthens the localization information in high-level features. BiFPN uses learnable weights to let the model adaptively fuse features from different levels.
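To make the weighted-fusion idea concrete, here is a minimal sketch of BiFPN-style fast normalized fusion in plain Python. It is an illustration only, not code from the interview: the function name, the toy feature vectors, and the weights are all made up, and real implementations operate on full feature maps rather than short lists.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-length feature vectors with non-negative, normalized weights."""
    # ReLU keeps every learnable weight non-negative.
    weights = [max(w, 0.0) for w in raw_weights]
    # Normalize by the weight sum (plus eps for stability) instead of a softmax.
    total = sum(weights) + eps
    length = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(length)
    ]


# Hypothetical inputs: one top-down feature and one lateral feature.
p_top_down = [1.0, 2.0, 3.0]
p_lateral = [3.0, 2.0, 1.0]
fused = fast_normalized_fusion([p_top_down, p_lateral], raw_weights=[1.0, 3.0])
print(fused)  # roughly [2.5, 2.0, 1.5]: the lateral input dominates
```

A plain FPN merge corresponds to fixed equal weights; the learnable weights are what let the network decide how much each level contributes.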
2. How is the Attention mechanism used in CV?
I said there are mainly two types: Self-Attention and Cross-Attention. Self-Attention captures long-range dependencies within features, like ViT's attention over patches. SE-Net's channel attention is related, though strictly speaking it is a gating mechanism that reweights channels rather than query-key attention. Cross-Attention fuses features from different modalities or scales, like the interaction between queries and feature maps in DETR.
After warming up, we moved to core 3D detection questions:
3. What's the difference between PointPillars and VoxelNet? Why is PointPillars faster?
I said VoxelNet uses 3D voxelization, dividing voxels in the z direction as well, which requires 3D convolution. PointPillars only divides the xy plane into pillars and leaves the z direction undivided, so the encoded pillars form a 2D pseudo-image that can be processed with 2D convolution instead of 3D convolution, greatly improving speed. Also, PointPillars' pillar encoder is a simplified PointNet (MLP+MaxPool per pillar) rather than VoxelNet's stacked VFE layers, which also reduces computation.
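The pillar idea can be sketched in a few lines. The example below is a toy under stated assumptions: the grid range, pillar size, and function names are hypothetical, and a single max over z stands in for the learned per-pillar encoder.

```python
from collections import defaultdict


def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Group (x, y, z) points by xy grid cell; z is never discretized."""
    pillars = defaultdict(list)
    for x, y, z in points:
        # Drop points outside the detection range.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        pillars[(ix, iy)].append((x, y, z))
    return dict(pillars)


def pillar_max_z(pillars):
    """Stand-in for the pillar encoder: max-pool one channel over each pillar."""
    return {cell: max(p[2] for p in pts) for cell, pts in pillars.items()}


points = [(0.5, 0.5, 0.1), (0.7, 0.2, 1.9), (2.5, 3.5, 0.4), (9.0, 0.0, 0.0)]
pillars = pillarize(points)
encoded = pillar_max_z(pillars)
print(encoded)  # {(0, 0): 1.9, (2, 3): 0.4}
```

Because every pillar collapses to one vector at an (ix, iy) location, the output is a 2D grid, which is why a 2D convolutional backbone is enough.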
4. What are the advantages of CenterPoint's anchor-free design?
I said CenterPoint's biggest advantage is that it largely removes the dependence on NMS post-processing. Traditional anchor-based methods generate many overlapping detection boxes that need NMS for deduplication. CenterPoint predicts a center point heatmap and keeps only local peaks, so each object ideally corresponds to one peak point and most duplicate detections never appear (in practice, implementations often still apply a lightweight NMS as a safeguard). Also, anchor-free methods don't require manual anchor size design and adapt better to objects of different shapes and sizes.
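Peak picking on a center heatmap can be sketched as a local-maximum test over each 3x3 neighborhood, which is equivalent to comparing the heatmap with its 3x3 max-pooled version. This is a simplified illustration: the function name, threshold, and heatmap values are invented, and real detectors do this on GPU tensors per class.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for cells that are the max of their 3x3 window."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the 3x3 window, clipped at the borders.
            window = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            if score >= max(window):
                peaks.append((r, c, score))
    return peaks


# Toy heatmap with two objects; the 0.5 next to the 0.6 is suppressed.
heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.0, 0.4, 0.5],
]
peaks = heatmap_peaks(heatmap)
print(peaks)  # [(1, 1, 0.9), (2, 3, 0.6)]
```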
5. What's the accuracy of 3D detection in your actual project? What are the main failure cases?
I said on the Waymo Open Dataset, our model achieves about 72-75 APH Level 2. There are three main failure cases: first, distant small objects can't be detected because the point cloud is too sparse; second, missed detections in heavily occluded scenarios; third, increased false positives in rainy weather due to noisy point clouds. For distant small objects, we tried multi-scale feature fusion and super-resolution methods, with some improvement but still not ideal.
6. How do you handle the class imbalance problem?
I said we use focal loss to handle the foreground-background imbalance, and for rare classes (like pedestrians, cyclists), we do oversampling and hard example mining. At the data level, we also do some augmentation, like copy-paste augmentation for rare classes.
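For reference, the binary focal loss for a single prediction can be written as follows. This is a minimal sketch using the commonly cited defaults (alpha=0.25, gamma=2); the function name is hypothetical and the values are not taken from the project described here.

```python
import math


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Focal loss for one predicted foreground probability p and a 0/1 target."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    p_t = p if target == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = binary_focal_loss(0.01, target=0)  # confident and correct
hard_foreground = binary_focal_loss(0.10, target=1)  # object mostly missed
print(easy_background, hard_foreground)
```

The easy background cell contributes several orders of magnitude less loss than the missed object, which is how the huge number of background locations is kept from dominating training.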
Round 1 lasted about an hour. The interviewer said my "3D detection fundamentals are solid" and told me to wait for the Round 2 notification. I breathed a small sigh of relief.
Round 2: Multi-Sensor Fusion (LiDAR+Camera)
Round 2 was with an experienced engineer who went straight into fusion topics.
1. What are the levels of multi-sensor fusion? What are the characteristics of each?
I said there are mainly three levels: early fusion (data-level), mid-level fusion (feature-level), and late fusion (decision-level). Early fusion directly fuses at the raw data level, preserving the most complete information but requiring high spatiotemporal synchronization. Mid-level fusion at the feature level is currently the most mainstream approach. Late fusion fuses results after each sensor independently detects, which is simplest but has the most information loss.
2. Which fusion method do you use in your project? Why?
I said we use feature-level fusion, specifically fusion in BEV space. We first project Camera image features to BEV space through LSS (Lift-Splat-Shoot), then concatenate them with LiDAR BEV features, and use convolution for fusion. The reason for choosing this approach is that BEV space naturally suits autonomous driving perception, and aligning LiDAR and Camera features in BEV space is relatively convenient.
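A concat-then-convolve fusion step in BEV space can be sketched like this. It assumes both branches already produce features on the same BEV grid, and it uses a 1x1 convolution (a per-cell linear layer plus ReLU) as the simplest stand-in for the fusion convolution; the kernel size, names, and weights are assumptions, not details from the project.

```python
def fuse_bev_cell(cam_feat, lidar_feat, weight, bias):
    """Concatenate two per-cell feature vectors and apply a linear layer + ReLU."""
    concat = list(cam_feat) + list(lidar_feat)
    fused = []
    for w_row, b in zip(weight, bias):
        value = sum(w * x for w, x in zip(w_row, concat)) + b
        fused.append(max(value, 0.0))  # ReLU
    return fused


def fuse_bev(cam_bev, lidar_bev, weight, bias):
    """Apply the same per-cell fusion at every BEV location (a 1x1 convolution)."""
    return [
        [fuse_bev_cell(c, l, weight, bias) for c, l in zip(cam_row, lidar_row)]
        for cam_row, lidar_row in zip(cam_bev, lidar_bev)
    ]


# Hypothetical 1x1 BEV grid: 2 camera channels + 1 LiDAR channel -> 2 fused channels.
cam_bev = [[[1.0, 2.0]]]
lidar_bev = [[[3.0]]]
weight = [[1.0, 0.0, 1.0], [0.0, -1.0, 0.0]]
bias = [0.0, 0.5]
fused_bev = fuse_bev(cam_bev, lidar_bev, weight, bias)
print(fused_bev)  # [[[4.0, 0.0]]]
```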
3. What if LSS's depth estimation is inaccurate?
I said this is indeed a core issue with LSS. LSS needs to predict the depth distribution of each pixel to lift 2D features to 3D. If depth estimation is inaccurate, the projected position in BEV will shift. We have two solutions: first, use LiDAR point clouds for depth supervision to make depth estimation more accurate; second, use deformable attention to replace fixed projection, letting the model adaptively learn feature projection positions, similar to BEVFormer's approach.
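The lift step and the LiDAR depth supervision can both be sketched for a single pixel. This is a toy under stated assumptions: depth bins are taken to be uniform over a hypothetical 1 m to 49 m range, the supervision is a plain cross-entropy on the bin containing the projected LiDAR depth, and all function names are invented.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, depth_logits):
    """Outer product: one depth-weighted copy of the feature per depth bin."""
    return [[p * f for f in feature] for p in softmax(depth_logits)]


def depth_bin_index(depth_m, d_min=1.0, d_max=49.0, num_bins=48):
    """Map a LiDAR depth in meters to a uniform bin index, or None if out of range."""
    if not (d_min <= depth_m < d_max):
        return None
    return int((depth_m - d_min) / (d_max - d_min) * num_bins)


def depth_cross_entropy(depth_logits, lidar_depth_m, **bin_kwargs):
    """Cross-entropy between predicted depth distribution and the LiDAR bin."""
    idx = depth_bin_index(lidar_depth_m, **bin_kwargs)
    if idx is None:
        return None  # no supervision for this pixel
    return -math.log(softmax(depth_logits)[idx])


# A pixel with a flat (uninformative) depth prediction over 4 bins.
lifted = lift_pixel([2.0, 4.0], [0.0, 0.0, 0.0, 0.0])
loss = depth_cross_entropy([0.0] * 4, 2.5, d_min=1.0, d_max=5.0, num_bins=4)
print(lifted[0], loss)
```

A flat depth distribution smears the feature evenly along the whole ray, which is exactly the failure mode the question is about; supervising the bin with LiDAR depth pushes the probability mass toward the correct 3D location.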
4. How do you handle the time synchronization issue between LiDAR and Camera?
I said we use a hardware synchronization solution, triggering LiDAR and Camera acquisition at the same time through PPS signals. But even with hardware synchronization, there's still an offset of up to tens of milliseconds, because the two sensors capture the scene over different time windows (a scanning LiDAR sweeps across its frame period, while a Camera exposes briefly at its own instant). For moving objects, we do motion compensation based on ego vehicle speed and target speed, aligning historical frame features to the current moment.
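A constant-velocity version of this motion compensation can be sketched in 2D. The assumptions are loud ones: x points forward and y points left in the ego frame, the object's velocity is ground-relative and expressed in the ego axes at the first timestamp, and the ego translation over the short interval is approximated as a straight line along its heading. The function name and numbers are hypothetical.

```python
import math


def compensate_point(point_xy, object_velocity_xy, ego_speed, ego_yaw_rate, dt):
    """Predict where a point seen at time t appears in the ego frame at t + dt."""
    # Object position at t + dt, still expressed in the ego frame of time t.
    px = point_xy[0] + object_velocity_xy[0] * dt
    py = point_xy[1] + object_velocity_xy[1] * dt
    # Subtract the ego translation (straight-line approximation along +x).
    px -= ego_speed * dt
    # Rotate by -theta to express the point in the ego frame at t + dt.
    theta = ego_yaw_rate * dt
    c, s = math.cos(theta), math.sin(theta)
    return (c * px + s * py, -s * px + c * py)


# A parked car 10 m ahead, ego driving at 10 m/s, 50 ms between sensor timestamps.
static_target = compensate_point((10.0, 0.0), (0.0, 0.0), 10.0, 0.0, 0.05)
# A lead vehicle moving at the same speed keeps its relative position.
lead_vehicle = compensate_point((10.0, 0.0), (10.0, 0.0), 10.0, 0.0, 0.05)
print(static_target, lead_vehicle)
```

At 10 m/s, a 50 ms offset already moves a static object by half a meter in the ego frame, which is enough to misalign LiDAR points and image features for small targets.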
5. If LiDAR fails, can pure Camera do 3D detection?
I said yes, but the accuracy would drop significantly. Pure Camera 3D detection mainly relies on monocular depth estimation or multi-view stereo matching, with depth accuracy far inferior to LiDAR. Currently popular methods are BEV perception, using images from multiple Cameras to construct BEV features through Transformer, then doing 3D detection. On nuScenes, methods like FCOS3D, BEVFormer, and PETR have achieved good results, but there's still a clear gap compared to LiDAR methods.
6. What's the difference between DETR3D and BEVFormer?
I said DETR3D uses a sparse set of object queries. Each query decodes a 3D reference point, projects it into the multi-view images to sample features, and directly predicts a 3D detection box. BEVFormer uses a predefined grid of BEV queries that sample features from multi-view images to construct a BEV feature map, then does detection on that map. The core difference is sparse object-level queries versus a dense BEV representation: BEVFormer's intermediate BEV feature map is more flexible and can serve segmentation, detection, and other tasks.
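The geometric core of reference-point sampling is a pinhole projection from 3D to pixels. The sketch below assumes the point has already been transformed into the camera frame (z pointing forward) and ignores lens distortion; the intrinsics and image size are made-up values.

```python
def project_to_image(point_cam, fx, fy, cx, cy, width, height):
    """Project a camera-frame 3D point to pixel coordinates, or None if not visible."""
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera
    u = fx * x / z + cx
    v = fy * y / z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None  # projects outside this camera's image


# Hypothetical intrinsics for a 1600x900 camera.
intrinsics = dict(fx=1000.0, fy=1000.0, cx=800.0, cy=450.0, width=1600, height=900)
visible = project_to_image((1.0, 0.5, 10.0), **intrinsics)
behind = project_to_image((1.0, 0.5, -10.0), **intrinsics)
print(visible, behind)  # (900.0, 500.0) None
```

In a multi-camera setup the same 3D point is projected into every camera, and only the views where the result is not None contribute sampled features.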
Round 2 lasted about 1 hour and 10 minutes. The interviewer asked very in-depth questions, especially about LSS depth estimation and temporal alignment. My answers weren't perfect but I didn't get completely stuck either.
Round 3: Project Deep Dive + BEV Perception
Round 3 was with the head of the perception team. He had a strong presence but wasn't intimidating, and the round felt more like a technical discussion.
He first asked me to describe my proudest project in detail. I talked about our LiDAR-Camera fusion detection project. Then he started digging deeper:
1. How much improvement did your fusion detection achieve over pure LiDAR detection?
I said fusion gave about 5 points of AP improvement on pedestrian detection and about 2 points on vehicle detection. The main improvement came from distant targets and occluded scenarios, because the Camera's semantic information helps distinguish targets whose LiDAR point clouds are sparse.
2. How much computational overhead does fusion add? How do you optimize it?
I said the fusion module adds about 30% computation, mainly from the Camera backbone and LSS projection. We did several optimizations: first, used a lighter backbone (ResNet-50 replacing ResNet-101); second, reduced LSS's depth discretization bins from 64 to 48, with minimal accuracy loss but significant speed improvement; third, used TensorRT for inference optimization, keeping overall latency within 100ms.
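A quick back-of-the-envelope calculation shows why fewer depth bins help. The lifted frustum tensor scales linearly with the number of bins, so going from 64 to 48 removes a quarter of it. Only the bin counts come from the text above; the camera count, feature map size, and channel count below are hypothetical.

```python
def frustum_elements(num_cams, depth_bins, feat_h, feat_w, channels):
    """Number of values in the lifted frustum feature tensor."""
    return num_cams * depth_bins * feat_h * feat_w * channels


# Hypothetical setup: 6 cameras, 16x44 feature maps, 80 channels.
before = frustum_elements(6, 64, 16, 44, 80)
after = frustum_elements(6, 48, 16, 44, 80)
saving = 1 - after / before
print(before, after, saving)  # 21626880 16220160 0.25
```

The saving applies only to the depth-dependent part of the pipeline (the lift and the BEV pooling), not to the Camera backbone, so the end-to-end speedup is smaller than 25%.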
3. How do you do temporal fusion for BEV perception?
I said we use temporal BEV fusion: warping historical frame BEV features to the current frame based on ego vehicle motion, then concatenating or doing attention fusion with the current frame's BEV features. The benefit of temporal fusion is leveraging historical information to enhance current frame detection, especially helpful for occluded and distant targets.
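The warping step can be sketched with nearest-neighbor lookup on a small grid. This is a simplified illustration: the ego vehicle is assumed to sit at the grid center with x along columns and y along rows, (dx, dy, dyaw) is the ego motion expressed in the previous frame, and real systems typically use bilinear sampling on feature tensors instead. All names are hypothetical.

```python
import math


def warp_bev(prev_bev, dx, dy, dyaw, cell_size=1.0, fill=0.0):
    """Resample the previous BEV grid into the current ego frame."""
    rows, cols = len(prev_bev), len(prev_bev[0])
    c, s = math.cos(dyaw), math.sin(dyaw)
    warped = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for col in range(cols):
            # Cell center in the current ego frame.
            x_cur = (col - (cols - 1) / 2.0) * cell_size
            y_cur = (r - (rows - 1) / 2.0) * cell_size
            # Same physical location expressed in the previous ego frame.
            x_prev = c * x_cur - s * y_cur + dx
            y_prev = s * x_cur + c * y_cur + dy
            pc = round(x_prev / cell_size + (cols - 1) / 2.0)
            pr = round(y_prev / cell_size + (rows - 1) / 2.0)
            if 0 <= pr < rows and 0 <= pc < cols:
                warped[r][col] = prev_bev[pr][pc]
    return warped


# A static object 2 cells ahead of the ego in the previous frame.
prev_bev = [[0.0] * 5 for _ in range(5)]
prev_bev[2][4] = 1.0
# The ego then drives 1 cell forward, so the object should appear 1 cell ahead.
warped = warp_bev(prev_bev, dx=1.0, dy=0.0, dyaw=0.0)
print(warped[2])  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

After warping, the historical features line up cell by cell with the current frame, so they can be concatenated or attended to directly. Cells that fall outside the old grid get the fill value.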
4. What are the challenges of BEV perception in production?
I said the biggest challenge is computing power. BEV perception requires simultaneous inference from multiple Cameras, which is computationally intensive, and the onboard platform has limited computing resources. Robustness is also an issue: Camera calibration errors and extreme weather (rain, snow, fog) all degrade BEV feature quality. Furthermore, BEV perception's interpretability is not as good as LiDAR detection, making it hard to diagnose problems when they occur.
5. What's your take on the vision-only vs LiDAR debate?
I said I think in the short term, the LiDAR+Camera fusion approach is still more reliable, because the precise depth information LiDAR provides is irreplaceable by Cameras. But in the long run, if vision-only solutions solve the depth estimation and robustness problems, the cost advantage would be enormous. I personally lean toward fusion solutions, but we should also pay attention to vision-only progress in research.
Round 3 lasted 1 hour and 20 minutes. At the end, the interviewer asked if I had any questions. I asked about Cruise's latest progress in BEV perception, and he mentioned some work on temporal BEV and occupancy networks, which sounded very cutting-edge.
Key Questions Summary
CV Basics:
1. FPN feature fusion methods and their pros/cons?
2. Applications of Attention mechanism in CV?
3D Detection:
3. Differences between PointPillars and VoxelNet? Why is PointPillars faster?
4. Advantages of CenterPoint's anchor-free design?
5. 3D detection accuracy and main failure cases?
6. Methods for handling class imbalance?
Multi-Sensor Fusion:
7. Levels and characteristics of multi-sensor fusion?
8. Specific implementation of feature-level fusion?
9. Solutions for inaccurate LSS depth estimation?
10. Time synchronization between LiDAR and Camera?
11. Methods and limitations of pure Camera 3D detection?
12. Differences between DETR3D and BEVFormer?
BEV Perception:
13. Improvement of fusion detection over pure LiDAR detection?
14. Computational overhead from fusion and optimization methods?
15. Temporal fusion methods for BEV perception?
16. Challenges of BEV perception in production?
17. Vision-only vs LiDAR debate?
Tips and Advice
1. Build a solid 3D detection foundation: You must be able to explain the principles and code implementation of classic methods like PointPillars and CenterPoint. Interviewers will ask for details, including loss design, post-processing, and data augmentation.
2. Multi-sensor fusion is key: Cruise's perception position heavily values fusion capability. Methods like LSS, BEVFormer, and DETR3D must be thoroughly understood, especially the details of depth estimation and feature projection.
3. BEV perception is a bonus: If you can clearly explain BEV perception principles, temporal fusion, and production challenges, interviewers will be very interested.
4. Project experience needs data support: Interviewers will ask for specific accuracy numbers, improvement margins, and computational overhead. You can't just say "there was improvement"; you need quantified results.
5. Understand industry trends: Have your own thoughts and judgments on frontier directions such as the vision-only vs LiDAR debate, Occupancy Networks, and end-to-end perception.
6. Plan for about 3 weeks of preparation: If you have about 2 years of perception experience, 3 weeks of focused preparation should be sufficient. Focus on reviewing 3D detection, multi-sensor fusion, and BEV perception.
FAQ
Q: How difficult is Cruise's perception interview?
A: Quite difficult, especially Rounds 2 and 3. Round 1 focuses on basics, Round 2 on fusion depth, and Round 3 on project experience and frontier thinking. Overall difficulty is above average among autonomous driving companies.
Q: Do I need to write code by hand?
A: They didn't ask me to write code by hand, but they asked about code implementation details, like how to implement PointPillars' VFE layer and how to write LSS's depth projection. I'd recommend going through the code of key modules.
Q: Will interviewers ask about paper details?
A: Yes, especially for papers mentioned on your resume. If you say you used a certain method, interviewers will follow up on the paper's core innovations and implementation details. Make sure you truly understand everything on your resume.
Q: What's the salary range?
A: The base salary for perception algorithm positions is roughly in the $140K-$200K range, depending on level and negotiation.
Q: How long do interview results take?
A: I received the Round 2 notification 4 days after Round 1, Round 3 notification 3 days after Round 2, and the offer a little over 1 week after Round 3.