Stability AI AIGC Algorithm Interview: Diffusion Models, Image Generation, and Controllable Generation

Author: BeautyResume Team

A detailed review of the three-round technical interview for an AIGC algorithm engineer role at Stability AI, from a candidate with 2 years of AIGC experience. It covers VAE/Diffusion theory, ControlNet controllable generation, video generation, and more.

Background

I got into AIGC relatively early. My master's research focused on generative models, and after graduating, I spent two years at an AI startup, experiencing the evolution from GANs to VAEs to Diffusion Models. Honestly, AIGC has been insanely hot these past two years, but when I started, it wasn't this crowded — I got into it purely because I thought "teaching machines to create" was just too cool.

Stability AI has been very active in the AIGC space, working on everything from foundational infrastructure to applications, with particularly deep expertise in image generation and controllable generation. When I saw they were hiring AIGC algorithm engineers, I applied without hesitation. During preparation, I focused on VAE and Diffusion Model theory, Stable Diffusion and ControlNet architecture design, image generation evaluation metrics, and the latest advances in video generation. The preparation period was about 1.5 months.

About 5 days after applying, HR called and scheduled the first technical interview. The entire process was three technical rounds plus one HR round. Here's my detailed review.

Interview Process Review

Round 1: Generative Model Fundamentals + VAE/Diffusion (about 65 minutes)

The first-round interviewer was a researcher who'd been working on generative models for years. Questions covered everything from fundamental theory to latest advances, at a fast pace.

Question 1: What's the principle of VAE? How does it differ from a regular autoencoder?

I said VAE's core idea is variational inference — using a neural network-parameterized distribution q(z|x) to approximate the true posterior p(z|x). The difference from regular autoencoders: regular autoencoders learn deterministic mappings where the encoder outputs a fixed latent vector; VAEs learn probabilistic mappings where the encoder outputs distribution parameters (mean and variance), then samples from the distribution to get the latent vector. This brings two benefits: first, the learned latent space is continuous, enabling meaningful interpolation; second, the KL divergence constraint makes the latent space close to a standard normal distribution, so random sampling also produces reasonable outputs. The interviewer followed up on the Reparameterization Trick — I said it transforms sampling from z ~ N(μ, σ²) to z = μ + σ * ε, where ε ~ N(0, I), allowing gradients to backpropagate through μ and σ.
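The reparameterization trick and the KL term described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular library's implementation; the function names `reparameterize` and `kl_to_standard_normal`, the log-variance parameterization, and the batch and latent sizes are assumptions made for the example.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


rng = np.random.default_rng(0)
mu = np.zeros((4, 8))       # batch of 4, latent dimension 8 (illustrative sizes)
log_var = np.zeros((4, 8))  # sigma = 1 everywhere
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because the randomness lives entirely in `eps`, `z` is a deterministic function of `mu` and `log_var`, which is what lets an autodiff framework pass gradients to the encoder. The KL term is zero exactly when the encoder outputs a standard normal.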

Question 2: What are the forward and reverse processes of Diffusion Models?

I said the forward process gradually adds noise to data until it becomes pure Gaussian noise. Specifically, x_t = √(1-β_t) * x_{t-1} + √(β_t) * ε_t, where β_t is the noise schedule and ε_t is standard Gaussian noise. The reverse process learns how to gradually recover data from noise, using a neural network to predict the noise at each step, then denoising step by step. The interviewer asked why Diffusion Models are more stable to train than GANs — I said it's because the optimization objective is a simple denoising loss (MSE between predicted and actual noise), unlike GANs with their complex adversarial training dynamics and mode collapse issues.
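Written out, the single forward step and the "simple denoising loss" mentioned above look roughly like this. The notation follows the DDPM paper; the symbol ε_θ for the noise-prediction network is an assumption of this sketch, since the answer above does not name it.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\,\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\,\right]
```

The loss is a plain regression target with no second network to balance against, which is the core of the stability argument compared with GAN training.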

Question 3: What's the difference between DDPM and DDIM? Why can DDIM accelerate sampling?

I said DDPM's sampling is a stochastic Markov chain process where each step requires sampling from a noise distribution, so it needs many steps (typically 1000) to generate high-quality images. DDIM's core insight is that the reverse process doesn't have to be stochastic: it keeps the same marginals q(x_t|x_0) as DDPM but defines a non-Markovian process, so a model trained with the DDPM objective can be reused without retraining and sampled deterministically, which can be viewed as solving an ODE instead of an SDE. Because the deterministic mapping can skip timesteps, DDIM generates images of comparable quality with far fewer steps (20-50). The interviewer asked about DDIM's mathematical derivation, so I explained the SDE-to-ODE transformation and the η parameter that controls stochasticity (η = 0 is fully deterministic, η = 1 recovers DDPM-style sampling).
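A single DDIM update, including the η parameter, can be sketched as follows. This is a NumPy illustration under stated assumptions, not a production sampler: `ddim_step` is a hypothetical helper, the noise prediction is passed in directly instead of coming from a trained network, and `abar_t` / `abar_prev` stand for the cumulative products ᾱ at the current and target timesteps.

```python
import numpy as np


def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=None):
    """One DDIM update from timestep t to an earlier timestep.

    eta = 0 gives the deterministic sampler; eta = 1 adds DDPM-style noise.
    """
    # Estimate of the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Standard deviation of the fresh noise injected at this step.
    sigma = (
        eta
        * np.sqrt((1.0 - abar_prev) / (1.0 - abar_t))
        * np.sqrt(1.0 - abar_t / abar_prev)
    )
    # Part of the update that points back along the predicted noise.
    direction = np.sqrt(1.0 - abar_prev - sigma**2) * eps_pred
    noise = 0.0
    if eta > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(abar_prev) * x0_pred + direction + noise


# Toy check with a "perfect" noise prediction (illustrative values only).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
abar_t, abar_prev = 0.5, 0.8
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0)
```

With η = 0 and an exact noise prediction, the step lands on the point the forward process would produce at the earlier timestep with the same noise. Nothing in the update requires the two timesteps to be adjacent, which is why the sampler can take large strides.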

Question 4: What's the architecture of Stable Diffusion? Why use Latent Diffusion instead of Pixel Diffusion?

I said Stable Diffusion's core architecture has three components: a VAE (the encoder compresses images into a low-dimensional latent space, and the decoder maps denoised latents back to pixels), a U-Net (performing denoising in latent space), and a CLIP text encoder (whose text embeddings are injected into the U-Net through cross-attention). Reasons for Latent over Pixel Diffusion: first, computational efficiency, since diffusion in latent space (e.g., 64x64) is much faster than in pixel space (e.g., 512x512); second, latent space has more compact semantics, making the denoising process easier to learn; third, memory savings from smaller latent feature maps. The interviewer asked about VAE compression ratio, and I said typically 8x downsampling per spatial side (512→64). Too much compression loses details; too little doesn't reduce computation enough.
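To put a number on the efficiency argument, here is a small back-of-the-envelope calculation. It assumes a 4-channel latent, as used by the Stable Diffusion 1.x VAE; `latent_shape` is a hypothetical helper written for this example.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)


pixel_values = 3 * 512 * 512          # RGB image
c, h, w = latent_shape(512, 512)
latent_values = c * h * w
reduction = pixel_values / latent_values
```

Under these assumptions the U-Net processes 48 times fewer values per sample than a pixel-space model would at the same resolution.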

Coding Problem: Implement a simple forward diffusion process that takes an original image and outputs the noised image.

I implemented DDPM's forward process: gradually adding noise according to the noise schedule β_t. Using the reparameterization trick, and because a sum of independent Gaussians is itself Gaussian, you can sample x_t at any timestep directly from x_0 without iterating. The interviewer asked me to derive the closed-form solution for q(x_t|x_0), and I wrote x_t = √(ᾱ_t) * x_0 + √(1-ᾱ_t) * ε, where ᾱ_t = ∏_{s=1}^{t} (1-β_s) and ε ~ N(0, I).
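A solution along these lines might look like the sketch below. It is a reconstruction for illustration, not the exact code from the interview: the linear schedule from 1e-4 to 0.02 over 1000 steps follows the DDPM paper's default, and the function names and the random stand-in image are assumptions.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule beta_1 .. beta_T."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for an image scaled to [-1, 1]
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Returning `eps` alongside `x_t` is convenient because the same noise is the regression target during training. By the final step the cumulative product is close to zero, so almost no signal from `x0` remains.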

Round 2: Image Generation + Controllable Generation (about 75 minutes)

The second-round interviewer was a senior AIGC researcher. Questions were very cutting-edge, covering many recent research developments.

Question 1: What's the principle of Classifier-Free Guidance (CFG)? Why does it improve generation quality?

I said CFG's core idea is using both conditional and unconditional generation results during inference to amplify the conditional signal. Specifically, during training, condition information is randomly dropped (e.g., 10% probability of not inputting text), so the model learns both conditional and unconditional generation. During inference, the final output = unconditional output + w * (conditional output - unconditional output), where w is the guidance strength. Higher w means results better match conditions but with less diversity. CFG improves quality because it amplifies the conditional signal's influence, making the model more "focused" on satisfying given conditions. The interviewer asked about typical w ranges — I said usually 7-15. Too low means weak conditioning; too high causes over-saturation and artifacts.
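The training-side half of CFG, randomly dropping the condition, can be sketched like this. It is a simplified illustration: `drop_conditions` is a hypothetical helper, and using a fixed "null" embedding (for example, the embedding of an empty prompt) as the unconditional input is an assumption of this sketch.

```python
import numpy as np


def drop_conditions(text_emb, null_emb, drop_prob, rng):
    """Swap each sample's text embedding for the null embedding with probability drop_prob.

    text_emb has shape (batch, dim); null_emb has shape (dim,).
    """
    batch = text_emb.shape[0]
    drop = rng.random(batch) < drop_prob
    out = np.where(drop[:, None], null_emb[None, :], text_emb)
    return out, drop


rng = np.random.default_rng(0)
text_emb = rng.standard_normal((16, 8))  # illustrative batch of text embeddings
null_emb = np.zeros(8)                   # illustrative null embedding
mixed_emb, dropped = drop_conditions(text_emb, null_emb, drop_prob=0.1, rng=rng)
```

Training on `mixed_emb` means a single network sees both conditional and unconditional examples, which is what makes the two-output combination at inference time possible.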

Question 2: What's the architecture of ControlNet? Why can it achieve precise spatial control?

I said ControlNet's core approach is "locking pretrained weights + adding trainable copies." Specifically, the U-Net encoder of Stable Diffusion is duplicated, connected to the original U-Net via zero convolutions. During training, only the duplicated part and zero convolutions are updated; original weights remain frozen. Control signals (edge maps, depth maps, pose maps) are input to the duplicated encoder and injected into the original U-Net through zero convolutions. Zero convolutions' role: at training start, output is zero, not affecting the pretrained model's generation capability; as training progresses, they gradually learn how to inject control signals. The interviewer asked why zero convolutions instead of regular convolutions — I said zero convolutions ensure ControlNet doesn't disrupt the pretrained model at training start. With regular convolution initialization, random weights would immediately corrupt the pretrained model's output.
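The zero-convolution argument is easy to check numerically. The sketch below uses a 1x1 convolution written with NumPy and random arrays as stand-ins for real U-Net features; the shapes and the helper `conv1x1` are assumptions for illustration only.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) feature map."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


rng = np.random.default_rng(0)
frozen_feat = rng.standard_normal((8, 16, 16))   # stand-in for a frozen U-Net block output
control_feat = rng.standard_normal((8, 16, 16))  # stand-in for the trainable copy's output

zero_w, zero_b = np.zeros((8, 8)), np.zeros(8)
rand_w = rng.standard_normal((8, 8)) * 0.1

with_zero_conv = frozen_feat + conv1x1(control_feat, zero_w, zero_b)
with_random_conv = frozen_feat + conv1x1(control_feat, rand_w, zero_b)
```

With zero weights the sum equals the frozen features exactly, so the first training step starts from the pretrained model's behavior. The zero weights can still learn, because the gradient with respect to the convolution weights depends on the control features rather than on the weights themselves.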

Question 3: Besides ControlNet, what other controllable generation methods exist? What are their pros and cons?

I mentioned several: T2I-Adapter (similar to ControlNet but lighter, injecting control signals only at specific U-Net layers); Prompt Engineering (carefully designed text prompts to guide generation — flexible but imprecise); Inpainting (regenerating specified regions, suitable for local editing); IP-Adapter (using reference images as conditions for style transfer); DragGAN/DragDiffusion (controlling generation through drag interaction points — intuitive but limited to deformation control). The interviewer was interested in IP-Adapter and asked about its implementation — I said IP-Adapter uses decoupled cross-attention, injecting image and text features through separate cross-attention layers into U-Net, enabling flexible combination of image and text conditions.
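The decoupled cross-attention idea can be sketched with single-head attention in NumPy. This is a simplified illustration, not IP-Adapter's actual code: the function names, the single head, the absence of learned projections, and the `image_scale` parameter name are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for q (N, d), k (M, d), v (M, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(q, text_kv, image_kv, image_scale=1.0):
    """Attend to text and image tokens separately with a shared query, then sum."""
    text_out = attention(q, *text_kv)
    image_out = attention(q, *image_kv)
    return text_out + image_scale * image_out


rng = np.random.default_rng(0)
q = rng.standard_normal((64, 32))                                     # latent tokens
text_kv = (rng.standard_normal((77, 32)), rng.standard_normal((77, 32)))
image_kv = (rng.standard_normal((4, 32)), rng.standard_normal((4, 32)))
out = decoupled_cross_attention(q, text_kv, image_kv, image_scale=0.5)
```

Keeping the two branches separate means the image branch can be scaled or switched off at inference time without touching the text pathway; setting `image_scale` to 0 gives back plain text cross-attention.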

Question 4: What evaluation metrics exist for image generation? What are the limitations of FID and IS?

I said IS (Inception Score) measures class clarity and diversity of generated images — higher is better. FID (Fréchet Inception Distance) measures the distance between generated and real image distributions — lower is better. Limitations: IS only considers class-level aspects, not image quality details; FID needs many samples for reliable estimation and is sensitive to Inception network choice; neither can evaluate consistency between generated images and given conditions. CLIP Score is now also used to evaluate text-image consistency. The interviewer asked about better evaluation methods — I said there's no perfect automatic metric yet; human evaluation remains most reliable but is too costly.
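For reference, the Fréchet distance at the heart of FID can be computed from feature statistics as sketched below. Real FID uses Inception pooling features of real and generated images; here random vectors stand in for those features, and the function names are assumptions for the example.

```python
import numpy as np


def feature_stats(features):
    """Mean and covariance of an (N, D) feature matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # tr(sqrt(S1 S2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for PSD covariances (up to numerical error).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
real_feats = rng.standard_normal((500, 4))        # stand-in for real-image features
fake_feats = rng.standard_normal((500, 4)) + 0.5  # stand-in for generated-image features
fid_like = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
```

The formula only compares means and covariances, which hints at two of the limitations above: the covariance estimate needs many samples to be stable, and nothing in it looks at whether an image matches its prompt.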

Coding Problem: Implement a simple CFG inference flow that, given conditional and unconditional model outputs, computes the CFG-guided result.

This was straightforward. I implemented CFG's core formula: output = uncond_output + guidance_scale * (cond_output - uncond_output). The interviewer asked about guidance_scale's impact on generation quality — I said too low means weak conditioning, while too high causes over-saturation, excessive contrast, and detail artifacts.
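The formula translates almost directly into code. The sketch below is a reconstruction for illustration rather than the exact interview answer; `apply_cfg` is a hypothetical name and the arrays stand in for the two noise predictions.

```python
import numpy as np


def apply_cfg(uncond_output, cond_output, guidance_scale):
    """Classifier-free guidance: push the prediction along (cond - uncond)."""
    return uncond_output + guidance_scale * (cond_output - uncond_output)


rng = np.random.default_rng(0)
uncond = rng.standard_normal((4, 64, 64))  # stand-in for the unconditional noise prediction
cond = rng.standard_normal((4, 64, 64))    # stand-in for the conditional noise prediction
guided = apply_cfg(uncond, cond, guidance_scale=7.5)
```

A scale of 1 returns the conditional prediction unchanged and a scale of 0 returns the unconditional one, so values above 1 extrapolate beyond the conditional output. In practice the two predictions are often produced in one batched forward pass to save time.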

Round 3: Deep Project Dive + Video Generation (about 90 minutes)

The third round was with the department's technical lead. The style was more open-ended, focusing on project experience and frontier direction understanding.

Question 1: What was the most challenging work you've done in AIGC projects?

I described an e-commerce product image generation project: given a product's white-background image, generate product images in various scenes. Challenges: ensuring product appearance (color, texture, logo) remains completely consistent without any distortion; generating natural scenes with reasonable lighting; and fast generation speed since merchants can't wait too long. My solution: Stable Diffusion + ControlNet (Canny edge control + depth map control) to preserve product contours, IP-Adapter to inject product appearance features for color and texture consistency, and Inpainting to generate scenes only in background regions. The hardest part was appearance consistency — relying solely on IP-Adapter's reference image injection, product logos and text still deformed. I added a post-processing module: using the original image's logo region as a mask, pasting the original logo back, and using Poisson blending for smooth edge transitions. Final result: appearance consistency improved from 78% to 95%, with single-image generation time of about 3 seconds.
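The paste-back step can be illustrated with a plain mask composite. Note that this sketch swaps in simple alpha blending with a soft mask, which is not Poisson blending; the function name and array shapes are assumptions, and the real module described above would use a gradient-domain method for the edge transition.

```python
import numpy as np


def paste_back(generated, original, mask):
    """Composite original pixels over the generated image where mask is 1.

    generated and original have shape (H, W, 3); mask is a float array in [0, 1]
    with shape (H, W). Soft mask values blend the two images at the boundary.
    """
    alpha = mask[..., None]
    return alpha * original + (1.0 - alpha) * generated


rng = np.random.default_rng(0)
original = rng.uniform(0.0, 1.0, size=(64, 64, 3))   # stand-in for the white-background product photo
generated = rng.uniform(0.0, 1.0, size=(64, 64, 3))  # stand-in for the generated scene
mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0                             # illustrative logo region
result = paste_back(generated, original, mask)
```

Inside the masked region the output is pixel-identical to the original, which is what guarantees the logo cannot deform; the blending method only affects how visible the seam is.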

Question 2: What are the main differences between video generation and image generation? What additional challenges exist?

I said video generation adds a temporal dimension compared to image generation. Main challenges: temporal consistency (smooth transitions between adjacent frames without flickering); motion plausibility (generated motion must follow physical laws); and computational cost (video data volume is much larger than images, making training and inference expensive). Current video generation approaches include: image model extensions (like Stable Video Diffusion, or SVD, which adds temporal layers on top of Stable Diffusion), 3D diffusion (performing diffusion directly in spatiotemporal dimensions), and autoregressive approaches (generating frame by frame with previous frames as conditions). The interviewer asked how to ensure temporal consistency, and I said SVD uses temporal attention layers so that frames can exchange information, trains on video data rather than single frames, and uses inter-frame noise correlation during inference for consistency.
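One common way to implement a temporal attention layer is to reshape the video tensor so that attention runs along the frame axis at each spatial location. The sketch below shows only that reshaping step in NumPy; the helper names and the (B, T, C, H, W) layout are assumptions, and it is not taken from any specific model's code.

```python
import numpy as np


def to_temporal_tokens(video):
    """(B, T, C, H, W) -> (B*H*W, T, C): one length-T sequence per pixel location."""
    b, t, c, h, w = video.shape
    return video.transpose(0, 3, 4, 1, 2).reshape(b * h * w, t, c)


def from_temporal_tokens(tokens, b, h, w):
    """Inverse of to_temporal_tokens: (B*H*W, T, C) -> (B, T, C, H, W)."""
    _, t, c = tokens.shape
    return tokens.reshape(b, h, w, t, c).transpose(0, 3, 4, 1, 2)


video = np.arange(2 * 3 * 4 * 5 * 5, dtype=np.float64).reshape(2, 3, 4, 5, 5)
tokens = to_temporal_tokens(video)            # attention would be applied along axis 1
restored = from_temporal_tokens(tokens, b=2, h=5, w=5)
```

Each sequence contains the same spatial position across all frames, so an attention layer applied to `tokens` mixes information over time only, while the pretrained spatial layers keep handling each frame on its own.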

Question 3: What are the pros and cons of LoRA vs. full fine-tuning? When should you use which?

I said LoRA's core is adding a low-rank decomposition matrix (A*B) alongside the original weight matrix, training only A and B while freezing original weights. Advantages: small parameter count (typically 0.1-1% of original model), fast training, and easy switching between different LoRAs. Disadvantages: limited expressiveness, underperforming full fine-tuning for tasks very different from pretraining data. Full fine-tuning advantages: strong expressiveness, capable of learning complex patterns. Disadvantages: large parameter count, slow training, prone to overfitting, inconvenient for multi-task switching. In practice, LoRA suffices for style transfer or small-scale adaptation; full fine-tuning is needed for major behavior changes (like text-to-image to text-to-video).
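A LoRA-augmented linear layer can be sketched as follows. This is a NumPy illustration under stated assumptions: the class name `LoRALinear` is hypothetical, the update is written as B·A with B initialized to zero (the convention in the LoRA paper, which differs only in naming from the A*B above), and the layer size and rank are arbitrary.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (forward pass only)."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_fraction(self):
        """Trainable parameters relative to the frozen weight's parameter count."""
        return (self.A.size + self.B.size) / self.weight.size


rng = np.random.default_rng(0)
weight = rng.standard_normal((1024, 1024))
layer = LoRALinear(weight, rank=4, alpha=4.0, rng=rng)
x = rng.standard_normal((2, 1024))
y = layer(x)
fraction = layer.trainable_fraction()
```

Because B starts at zero, the layer initially reproduces the pretrained output exactly. For this 1024x1024 layer at rank 4, the trainable parameters are under 1% of the frozen weight, consistent with the range quoted above, and switching LoRAs only means swapping the small A and B matrices.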

Question 4: What do you know about Stability AI's AIGC technical direction? Where do you think AIGC is heading?

I said Stability AI's AIGC coverage is comprehensive, spanning foundational large-model infrastructure, intermediate generation models, and end-user applications, so the full stack is covered. I think AIGC has three future directions: unified generation models that simultaneously support text-to-image, image-to-image, text-to-video, image-to-3D, and more; extreme controllability where users can precisely control every generation detail like a director; and AIGC+3D, moving from 2D to 3D generation for metaverse and digital twin content. The interviewer was interested in 3D generation, saying Stability AI is also doing related research.

Interview Questions Summary

1. VAE principles and differences from regular autoencoders

2. Diffusion Model forward and reverse processes

3. DDPM vs. DDIM differences and acceleration principles

4. Stable Diffusion architecture and Latent Diffusion advantages

5. Forward diffusion process implementation

6. Classifier-Free Guidance principles and effects

7. ControlNet architecture and spatial control mechanism

8. Controllable generation methods comparison

9. Image generation evaluation metrics and limitations

10. CFG inference flow implementation

11. Video generation vs. image generation differences

12. LoRA vs. full fine-tuning comparison

Tips and Advice

Stability AI's AIGC interview is very cutting-edge. Interviewers know the latest technologies in depth and ask probing questions. A few tips:

1. Build solid Diffusion Model math foundations: Don't just memorize the pipeline — you need to derive the forward process closed-form solution, reverse process derivations, and CFG's mathematical principles. I recommend thoroughly reading the DDPM and DDIM papers and deriving the formulas yourself.

2. Keep up with latest advances: AIGC evolves incredibly fast, and interviewers will ask about recent work. ControlNet, IP-Adapter, SVD — you need to know them. Follow the latest papers from CVPR, ICCV, NeurIPS, and tech blogs from Stability AI, Runway, etc.

3. Practical experience matters: Stability AI really values whether you've actually built AIGC projects. Being able to discuss your experience fine-tuning Stable Diffusion, training ControlNet, or doing image generation engineering is a big plus. I suggest building one or two projects yourself.

4. Prepare for open-ended questions: The third round includes "what do you think" and "where is the future" type questions requiring your own perspective. Stay current with AIGC industry dynamics and form your own judgments.

FAQ

Q: What does a Stability AI AIGC algorithm engineer do?

A: Mainly responsible for R&D and optimization of image/video generation models, covering model architecture design, training strategy optimization, and controllable generation methods research. Requires both algorithm research and engineering deployment skills.

Q: How high are the math requirements?

A: Fairly high. Diffusion Model mathematical derivations are mandatory. I recommend reviewing probability theory, stochastic processes, and variational inference fundamentals.

Q: Can I apply without AIGC experience?

A: If you have deep learning generative model experience (GANs, VAEs, etc.), transitioning to AIGC is feasible. I suggest running a Stable Diffusion fine-tuning demo first to understand the basic training pipeline.

Q: What's Stability AI AIGC's tech stack?

A: Training uses PyTorch, base models are from the Stable Diffusion series, controllable generation uses ControlNet/IP-Adapter, deployment uses TensorRT and a proprietary inference engine, and data pipelines are written in Python.

Q: How long until results come out?

A: For me, Round 2 was scheduled 4 days after Round 1, Round 3 was 3 days after Round 2, and results came out one week after Round 3. The entire process took about two weeks.
