Video Generation: A Technical Deep-Dive into Diffusion-Based Approaches
March 25, 2026
Video generation has undergone a paradigm shift. A field once limited to short, low-resolution clips now produces coherent, high-fidelity videos spanning tens of seconds. This post traces the technical trajectory from foundational diffusion models to state-of-the-art video generation systems.
1. Foundations: From DDPM to Latent Diffusion
Before discussing video, we need the image generation foundations.
Denoising Diffusion Probabilistic Models
DDPMs model data generation as a reverse denoising process. The forward process gradually adds Gaussian noise to data $x_0$ over $T$ timesteps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$
The model $\epsilon_\theta$ learns to reverse this process by predicting the noise at each step:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t)\right\rVert^2\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
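In code, this objective is only a few lines. A minimal PyTorch sketch (the noise-prediction network `eps_model` is a stand-in for any U-Net or Transformer):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    """One DDPM training step: predict the noise added at a random timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # sample from q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), eps)
```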
This is elegant but operates in pixel space — prohibitively expensive for high-resolution content.
Latent Diffusion Models (LDM)
Stable Diffusion’s key insight: perform diffusion in a compressed latent space. An autoencoder first maps an image $x$ to a lower-dimensional latent $z = \mathcal{E}(x)$, the diffusion process operates on $z$, and the decoder reconstructs the image $\hat{x} = \mathcal{D}(z)$. With a downsampling factor of 4-8 per spatial dimension, the diffusion model processes 16-64x fewer spatial positions while maintaining perceptual quality.
For text conditioning, LDM injects text embeddings (from CLIP or T5) into the denoising U-Net via cross-attention layers at multiple resolution levels.
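A schematic of one such cross-attention layer, assuming flattened spatial features and precomputed text embeddings (dimensions are illustrative, not Stable Diffusion’s actual module):

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Spatial features attend to text: queries from the image, keys/values from text."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, x, text_emb):
        # x: (B, H*W, dim) flattened spatial features; text_emb: (B, L, text_dim)
        out, _ = self.attn(query=x, key=text_emb, value=text_emb)
        return x + out  # residual connection
```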
2. From U-Net to DiT: The Backbone Shift
The U-Net Era
Early video diffusion models (Video LDM / Align Your Latents, AnimateDiff) extended the image U-Net with temporal layers:
- Insert temporal attention (self-attention across frames at each spatial position) after each spatial attention block
- Inflate 2D convolutions to pseudo-3D (a 1D temporal conv after the 2D spatial conv; see the sketch after this list)
- Train temporal layers while keeping spatial layers frozen (from a pre-trained image model)
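A minimal sketch of the pseudo-3D inflation, with the temporal conv zero-initialized so the inflated model initially behaves exactly like the pre-trained image model (this is the general recipe, not any one paper's exact layer):

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Pseudo-3D conv: pre-trained 2D spatial conv, then a new 1D temporal conv."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        nn.init.zeros_(self.temporal.weight)   # identity at init: video starts as "images"
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = y + self.temporal(y)               # residual temporal mixing
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
```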
This approach leverages strong image priors but treats temporal modeling as a bolt-on module. The result: impressive short clips, but temporal consistency degrades beyond 16-32 frames.
The DiT Revolution
Diffusion Transformers (DiT) replaced the U-Net with a plain Transformer operating on patchified latent tokens. The architecture is simpler, more scalable, and empirically performs better at larger scales.
Key design elements of DiT:
- Patchification: Flatten the spatial latent into non-overlapping patches, each becoming a token
- AdaLN-Zero: Adaptive layer normalization conditioned on timestep and class/text embeddings, with zero-initialized gating so each block starts as the identity, stabilizing training (a minimal sketch follows this list)
- Scaling properties: DiT exhibits clear scaling laws — performance improves predictably with model size and training compute, following power-law relationships
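A minimal sketch of AdaLN-Zero, where `block` stands for the wrapped attention or MLP sub-layer (illustrative of the mechanism, not the exact DiT code):

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """AdaLN-Zero: conditioning predicts shift/scale/gate; the zero-initialized
    gate makes each residual block an identity function at the start of training."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mods = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.to_mods.weight)    # the "-Zero": modulations start at 0
        nn.init.zeros_(self.to_mods.bias)

    def forward(self, x, cond, block):
        # x: (B, N, dim) tokens; cond: (B, dim) timestep + class/text embedding
        shift, scale, gate = self.to_mods(cond).unsqueeze(1).chunk(3, dim=-1)
        return x + gate * block(self.norm(x) * (1 + scale) + shift)
```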
For video, DiT naturally extends to 3D patches: a patch might cover 2 frames × 2 height × 2 width in the latent space, jointly encoding spatial and temporal information from the ground up (sketched below).
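The 3D patchification itself is pure reshaping; a minimal sketch (in a real model each token would then be linearly projected to the Transformer width):

```python
import torch

def patchify_3d(latent, pt=2, ph=2, pw=2):
    """Turn a video latent (B, C, T, H, W) into a sequence of spacetime-patch tokens."""
    b, c, t, h, w = latent.shape
    x = latent.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)    # (B, T', H', W', C, pt, ph, pw)
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)
```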
3. Architecture Comparison: Sora, CogVideoX, Open-Sora
Sora (OpenAI)
While OpenAI has not published full architectural details, the technical report and subsequent analysis reveal key design choices:
- Spacetime patches: Video is encoded into a latent space, then patchified along all three dimensions. This creates a unified token sequence where spatial and temporal information are inherently interleaved.
- Variable resolution and duration: Sora processes videos at their native aspect ratio and duration by varying the number of tokens rather than resizing. This is critical for generating content that doesn’t look artificially constrained.
- Scaling: Trained on a massive (undisclosed) compute budget. The qualitative jump in coherence and physical understanding suggests that video DiT, like text LLMs, benefits enormously from scale.
- Recaptioning: Training data was re-captioned using a vision-language model to produce detailed, accurate descriptions — a technique that significantly improves text-video alignment.
CogVideoX (Zhipu AI)
CogVideoX provides more architectural transparency:
- 3D VAE: A fully 3D variational autoencoder that compresses video along spatial and temporal dimensions jointly. The temporal compression factor is 4x (e.g., 48 frames → 12 latent frames), preserving temporal structure better than frame-by-frame encoding (the causal-convolution idea is sketched after this list).
- Expert Transformer: Concatenates text and video tokens into a single sequence processed with full attention, but gives each modality its own “expert” adaptive LayerNorm parameters. This aligns the two modalities’ very different feature distributions without splitting the attention itself.
- Progressive training: Starts with lower resolution and shorter videos, progressively increasing both. This curriculum significantly improves training stability and final quality.
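CogVideoX’s exact layers are not reproduced here, but the causal-convolution idea behind a 3D causal VAE can be sketched; with temporal stride 2, stacking two such layers yields the 4x temporal compression described above:

```python
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D conv that is causal in time: pads only toward the past, so the output
    at frame t never depends on future frames. A minimal sketch, not CogVideoX's
    actual implementation."""
    def __init__(self, c_in, c_out, k=3, t_stride=2):
        super().__init__()
        self.t_pad = k - 1
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=k,
                              stride=(t_stride, 1, 1), padding=(0, k // 2, k // 2))

    def forward(self, x):
        # x: (B, C, T, H, W); pad (k-1) frames of the past, none of the future
        x = nn.functional.pad(x, (0, 0, 0, 0, self.t_pad, 0))
        return self.conv(x)
```

Causality has a practical payoff: the first latent frame depends only on the first input frame, so the same VAE can encode single images as one-frame videos.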
Open-Sora (HPC-AI Tech)
Open-Sora aims to replicate Sora-class capabilities in an open-source framework:
- STDiT (Spatial-Temporal DiT): Alternates between spatial and temporal Transformer blocks. Spatial blocks process all tokens within a frame; temporal blocks process all tokens at the same spatial position across frames (sketched after this list).
- Multi-resolution / multi-duration training: Uses bucket-based batching to train on videos of varying sizes without padding waste.
- Efficiency focus: Implements sequence parallelism and selective activation checkpointing to train large models on accessible GPU clusters.
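A minimal sketch of this factorized pattern, assuming tokens arranged as (batch, frames, spatial positions, channels):

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """STDiT-style factorization: spatial attention within each frame, then
    temporal attention across frames at each spatial position."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, S, D) with S = H*W spatial tokens per frame
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)                  # frames as batch -> spatial attn
        xs = xs + self.spatial(xs, xs, xs)[0]
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        xt = xt + self.temporal(xt, xt, xt)[0]       # positions as batch -> temporal attn
        return xt.reshape(b, s, t, d).transpose(1, 2)
```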
Architectural Comparison
| Aspect | Sora | CogVideoX | Open-Sora |
|---|---|---|---|
| Backbone | Full 3D DiT | Expert DiT | STDiT (factorized) |
| VAE | Spacetime (3D) | 3D Causal VAE | 2D + Temporal |
| Temporal Modeling | Joint attention | Joint 3D attention + 3D VAE | Factorized attention |
| Text Encoder | T5-XXL (likely) | T5 + CLIP | T5 |
| Open Source | No | Partial | Yes |
The fundamental difference: joint vs. factorized spatiotemporal attention. Joint attention (Sora) can model arbitrary spatiotemporal dependencies, but its cost grows quadratically with the total token count (frames × spatial positions). Factorized approaches (Open-Sora) are far cheaper but may miss complex cross-frame spatial relationships.
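A back-of-the-envelope comparison makes the trade-off concrete (token counts are illustrative):

```python
# Pairwise attention interactions, ignoring constant factors.
T, S = 32, 1024                         # e.g., 32 latent frames, 32x32 spatial tokens

joint = (T * S) ** 2                    # one attention over all spacetime tokens
factorized = T * S**2 + S * T**2        # per-frame spatial + per-position temporal

print(f"joint:      {joint:.3e}")       # ~1.1e9
print(f"factorized: {factorized:.3e}")  # ~3.5e7, roughly 31x fewer interactions
```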
4. Spatiotemporal Consistency: The Core Challenge
Generating a single beautiful frame is relatively solved. The hard problem is consistency across frames:
Temporal Coherence
Objects should move smoothly, lighting should change gradually, and the “identity” of entities should persist. Failures manifest as:
- Flickering textures
- Objects morphing or disappearing between frames
- Inconsistent shadows and reflections
The 3D VAE approach (CogVideoX) addresses this at the representation level — by encoding temporal neighbors jointly, the latent space inherently captures local temporal coherence. Frame-by-frame VAEs must rely entirely on the diffusion model to learn consistency, a harder task.
Physical Plausibility
Sora demonstrated surprising (though imperfect) physical understanding: objects fall, liquids flow, and collisions produce plausible reactions. This emergent behavior likely results from scale — a sufficiently large model trained on diverse video data implicitly learns physics priors.
However, failures are still common: objects passing through each other, physically impossible transformations, and gravity violations. True physical reasoning likely requires either explicit physics simulation integration or training on synthetic data with known physics.
Long-Range Coherence
Generating videos beyond 10 seconds while maintaining narrative and visual consistency remains a major challenge. Current approaches use:
- Autoregressive extension: Generate a short clip, then condition subsequent clips on the last few frames of the previous one (see the sketch after this list). Accumulated error leads to drift.
- Hierarchical planning: Generate a sparse set of “keyframes” first, then interpolate. This provides global structure but the interpolation quality is critical.
- Full-length generation: Generate the entire video at once with a sufficiently long context. Only feasible for the largest models (Sora-class).
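As a sketch of the autoregressive scheme, assuming a hypothetical `model.sample` interface that accepts conditioning frames (not any real library’s API):

```python
import torch

def generate_long_video(model, prompt, num_chunks, chunk_len=16, overlap=4):
    """Autoregressive extension: each chunk is conditioned on the last `overlap`
    frames of the previous one. Errors in those conditioning frames compound,
    which is the source of the drift described above."""
    video = model.sample(prompt, num_frames=chunk_len)          # (T, C, H, W)
    for _ in range(num_chunks - 1):
        context = video[-overlap:]                              # condition on the tail
        chunk = model.sample(prompt, num_frames=chunk_len, init_frames=context)
        video = torch.cat([video, chunk[overlap:]], dim=0)      # drop duplicated frames
    return video
```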
5. Training Data: The Silent Differentiator
Architecture gets the attention, but training data arguably matters more:
- Scale: Sora-class models train on hundreds of millions of video clips. The minimum viable scale for coherent generation appears to be ~10M clips.
- Quality filtering: Aesthetic scoring, motion filtering (removing static or overly chaotic clips), and watermark detection are standard pre-processing steps.
- Captioning quality: Re-captioning training data with detailed VLM-generated descriptions (rather than relying on noisy web-scraped alt-text) provides a substantial improvement in text-video alignment. This technique, pioneered by DALL-E 3 for images, transfers directly to video.
- Aspect ratio and duration diversity: Training on a distribution of aspect ratios and durations (rather than a single fixed format) enables more flexible generation. Bucket-based sampling groups similar-sized videos into batches for efficient training (sketched below).
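A minimal sketch of bucket-based sampling, with illustrative metadata fields:

```python
from collections import defaultdict

def make_buckets(clips, batch_size):
    """Group clips by (height, width, num_frames) and emit uniform batches."""
    buckets = defaultdict(list)
    for clip in clips:                      # clip: dict with illustrative metadata
        key = (clip["height"], clip["width"], clip["num_frames"])
        buckets[key].append(clip)
    for group in buckets.values():
        for i in range(0, len(group) - batch_size + 1, batch_size):
            yield group[i:i + batch_size]   # every batch has identical tensor shape
```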
6. Controllable Generation and Editing
Beyond text-to-video, practical applications require fine-grained control:
- Image-to-video: Animate a static image. SVD (Stable Video Diffusion) demonstrated that fine-tuning an image diffusion model on video data produces strong image-conditioned video generation.
- Video editing: Modify specific elements while preserving the rest. TokenFlow and related methods enforce temporal consistency during editing by propagating edits along optical flow trajectories.
- Camera control: Specify camera movements (pan, zoom, orbit) independently of content. MotionCtrl and CameraCtrl inject camera parameters as additional conditioning signals.
- Subject-driven generation: Maintain the identity of a specific subject across generated videos. This combines ideas from DreamBooth-style personalization with temporal consistency.
7. Open Problems
Evaluation. There is no consensus on how to evaluate video generation. FVD (Fréchet Video Distance) is the standard quantitative metric but correlates poorly with human judgment. CLIP-based metrics (CLIPScore) measure text alignment but ignore temporal quality. Human evaluation remains the gold standard but is expensive and non-reproducible.
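For reference, FVD is the Fréchet distance between Gaussians fit to features of real and generated videos (I3D features in the standard formulation). Given precomputed feature matrices, the distance itself is short to compute:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets (rows = videos)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real     # discard tiny imaginary numerical noise
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))
```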
Temporal precision. Current models generate plausible motion but struggle with precise timing — “a ball bouncing exactly three times” is surprisingly difficult. This suggests that temporal understanding in generative models lags behind spatial understanding.
Efficiency. Generating a 10-second video at 1080p with a Sora-class model requires minutes on high-end hardware. Real-time or near-real-time generation requires 100-1000x efficiency improvements. Distillation, caching, and architectural innovations (e.g., consistency models adapted for video) are active research directions.
Audio. Video without synchronized audio feels incomplete. Joint video-audio generation is an emerging frontier, with models like V2A (Video-to-Audio) learning to generate temporally aligned soundtracks.
Closing Thoughts
Video generation is following a trajectory remarkably similar to image generation, with a ~2 year lag. The key enablers are the same: scalable architectures (DiT replacing U-Net), latent space compression (3D VAE), and massive curated training data. The difference is that video’s temporal dimension introduces qualitatively new challenges — consistency, physics, and narrative structure — that push beyond what pure scaling can solve.
The next breakthrough likely lies at the intersection of generation and understanding: models that can both generate and reason about video content, using each capability to improve the other.
References
- Ho et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
- Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR 2022.
- Peebles & Xie. “Scalable Diffusion Models with Transformers.” ICCV 2023.
- Brooks et al. “Video Generation Models as World Simulators.” OpenAI technical report / research post, 2024.
- Yang et al. “CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer.” arXiv 2024.
- Zheng et al. “Open-Sora: Democratizing Efficient Video Production for All.” arXiv 2024.
- Blattmann et al. “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets.” arXiv 2023.