Video Generation: A Technical Deep-Dive into Diffusion-Based Approaches
March 25, 2026
Video generation has undergone a paradigm shift. A field once limited to short, low-resolution clips now produces coherent, high-fidelity videos spanning tens of seconds. This post traces the technical trajectory from foundational diffusion models to state-of-the-art video generation systems.
1. Foundations: From DDPM to Latent Diffusion
Before discussing video, we need the image generation foundations.
Denoising Diffusion Probabilistic Models
DDPMs model data generation as a reverse denoising process. The forward process gradually adds Gaussian noise to data $x_0$ over $T$ timesteps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$
The model $\epsilon_\theta$ learns to reverse this process by predicting the noise at each step:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t)\right\rVert^2\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
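In code, this objective is only a few lines. A minimal PyTorch sketch (the noise-prediction network `eps_model` is a stand-in for any U-Net or Transformer):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def ddpm_loss(eps_model, x0):
    """One DDPM training step: predict the noise added at a random timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # sample from q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), eps)
```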
This is elegant but operates in pixel space — prohibitively expensive for high-resolution content.
Latent Diffusion Models (LDM)
Stable Diffusion’s key insight: perform diffusion in a compressed latent space. An autoencoder first maps an image $x$ to a lower-dimensional latent $z = \mathcal{E}(x)$, the diffusion process operates on $z$, and the decoder reconstructs the image $\hat{x} = \mathcal{D}(z)$. With a downsampling factor of 4-8 per spatial dimension, the diffusion model processes 16-64x fewer spatial positions while maintaining perceptual quality.
For text conditioning, LDM injects text embeddings (from CLIP or T5) into the denoising U-Net via cross-attention layers at multiple resolution levels.
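A schematic of one such cross-attention layer, assuming flattened spatial features and precomputed text embeddings (dimensions are illustrative, not Stable Diffusion’s actual module):

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Spatial features attend to text: queries from the image, keys/values from text."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, x, text_emb):
        # x: (B, H*W, dim) flattened spatial features; text_emb: (B, L, text_dim)
        out, _ = self.attn(query=x, key=text_emb, value=text_emb)
        return x + out  # residual connection
```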
2. From U-Net to DiT: The Backbone Shift
The U-Net Era
Early video diffusion models (Video LDM / Align Your Latents, AnimateDiff) extended the image U-Net with temporal layers:
- Insert temporal attention (self-attention across frames at each spatial position) after each spatial attention block
- Inflate 2D convolutions to pseudo-3D (a 1D temporal conv after the 2D spatial conv; see the sketch after this list)
- Train temporal layers while keeping spatial layers frozen (from a pre-trained image model)
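A minimal sketch of the pseudo-3D inflation, with the temporal conv zero-initialized so the inflated model initially behaves exactly like the pre-trained image model (this is the general recipe, not any one paper's exact layer):

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Pseudo-3D conv: pre-trained 2D spatial conv, then a new 1D temporal conv."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        nn.init.zeros_(self.temporal.weight)   # identity at init: video starts as "images"
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):
        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = y + self.temporal(y)               # residual temporal mixing
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
```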
This approach leverages strong image priors but treats temporal modeling as a bolt-on module. The result: impressive short clips, but temporal consistency degrades beyond 16-32 frames.
The DiT Revolution
Diffusion Transformers (DiT) replaced the U-Net with a plain Transformer operating on patchified latent tokens. The architecture is simpler, more scalable, and empirically performs better at larger scales.
Key design elements of DiT:
- Patchification: Flatten the spatial latent into non-overlapping patches, each becoming a token
- AdaLN-Zero: Adaptive layer normalization conditioned on timestep and class/text embeddings, with zero-initialized gating so each block starts as the identity, stabilizing training (a minimal sketch follows this list)
- Scaling properties: DiT exhibits clear scaling laws — performance improves predictably with model size and training compute, following power-law relationships
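A minimal sketch of AdaLN-Zero, where `block` stands for the wrapped attention or MLP sub-layer (illustrative of the mechanism, not the exact DiT code):

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """AdaLN-Zero: conditioning predicts shift/scale/gate; the zero-initialized
    gate makes each residual block an identity function at the start of training."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mods = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.to_mods.weight)    # the "-Zero": modulations start at 0
        nn.init.zeros_(self.to_mods.bias)

    def forward(self, x, cond, block):
        # x: (B, N, dim) tokens; cond: (B, dim) timestep + class/text embedding
        shift, scale, gate = self.to_mods(cond).unsqueeze(1).chunk(3, dim=-1)
        return x + gate * block(self.norm(x) * (1 + scale) + shift)
```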
For video, DiT naturally extends to 3D patches: a patch might cover 2 frames × 2 height × 2 width in the latent space, jointly encoding spatial and temporal information from the ground up (sketched below).
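The 3D patchification itself is pure reshaping; a minimal sketch (in a real model each token would then be linearly projected to the Transformer width):

```python
import torch

def patchify_3d(latent, pt=2, ph=2, pw=2):
    """Turn a video latent (B, C, T, H, W) into a sequence of spacetime-patch tokens."""
    b, c, t, h, w = latent.shape
    x = latent.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)    # (B, T', H', W', C, pt, ph, pw)
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)
```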
3. Architecture Comparison: Sora, CogVideoX, Open-Sora
Sora (OpenAI)
While OpenAI has not published full architectural details, the technical report and subsequent analysis reveal key design choices:
- Spacetime patches: Video is encoded into a latent space, then patchified along all three dimensions. This creates a unified token sequence where spatial and temporal information are inherently interleaved.
- Variable resolution and duration: Sora processes videos at their native aspect ratio and duration by varying the number of tokens rather than resizing. This is critical for generating content that doesn’t look artificially constrained.
- Scaling: Trained on a massive (undisclosed) compute budget. The qualitative jump in coherence and physical understanding suggests that video DiT, like text LLMs, benefits enormously from scale.
- Recaptioning: Training data was re-captioned using a vision-language model to produce detailed, accurate descriptions — a technique that significantly improves text-video alignment.
CogVideoX (Zhipu AI)
CogVideoX provides more architectural transparency:
- 3D VAE: A fully 3D variational autoencoder that compresses video along spatial and temporal dimensions jointly. The temporal compression factor is 4x (e.g., 48 frames → 12 latent frames), preserving temporal structure better than frame-by-frame encoding (the causal-convolution idea is sketched after this list).
- Expert Transformer: Concatenates text and video tokens into a single sequence processed with full attention, but gives each modality its own “expert” adaptive LayerNorm parameters. This aligns the two modalities’ very different feature distributions without splitting the attention itself.
- Progressive training: Starts with lower resolution and shorter videos, progressively increasing both. This curriculum significantly improves training stability and final quality.
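CogVideoX’s exact layers are not reproduced here, but the causal-convolution idea behind a 3D causal VAE can be sketched; with temporal stride 2, stacking two such layers yields the 4x temporal compression described above:

```python
import torch
import torch.nn as nn

class CausalConv3d(nn.Module):
    """3D conv that is causal in time: pads only toward the past, so the output
    at frame t never depends on future frames. A minimal sketch, not CogVideoX's
    actual implementation."""
    def __init__(self, c_in, c_out, k=3, t_stride=2):
        super().__init__()
        self.t_pad = k - 1
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=k,
                              stride=(t_stride, 1, 1), padding=(0, k // 2, k // 2))

    def forward(self, x):
        # x: (B, C, T, H, W); pad (k-1) frames of the past, none of the future
        x = nn.functional.pad(x, (0, 0, 0, 0, self.t_pad, 0))
        return self.conv(x)
```

Causality has a practical payoff: the first latent frame depends only on the first input frame, so the same VAE can encode single images as one-frame videos.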
Open-Sora (HPC-AI Tech)
Open-Sora aims to replicate Sora-class capabilities in an open-source framework:
- STDiT (Spatial-Temporal DiT): Alternates between spatial and temporal Transformer blocks. Spatial blocks process all tokens within a frame; temporal blocks process all tokens at the same spatial position across frames (sketched after this list).
- Multi-resolution / multi-duration training: Uses bucket-based batching to train on videos of varying sizes without padding waste.
- Efficiency focus: Implements sequence parallelism and selective activation checkpointing to train large models on accessible GPU clusters.
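A minimal sketch of this factorized pattern, assuming tokens arranged as (batch, frames, spatial positions, channels):

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """STDiT-style factorization: spatial attention within each frame, then
    temporal attention across frames at each spatial position."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, S, D) with S = H*W spatial tokens per frame
        b, t, s, d = x.shape
        xs = x.reshape(b * t, s, d)                  # frames as batch -> spatial attn
        xs = xs + self.spatial(xs, xs, xs)[0]
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        xt = xt + self.temporal(xt, xt, xt)[0]       # positions as batch -> temporal attn
        return xt.reshape(b, s, t, d).transpose(1, 2)
```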
Architectural Comparison
| Aspect | Sora | CogVideoX | Open-Sora |
|---|---|---|---|
| Backbone | Full 3D DiT | Expert DiT | STDiT (factorized) |
| VAE | Spacetime (3D) | 3D Causal VAE | 2D + Temporal |
| Temporal Modeling | Joint attention | Joint 3D attention + 3D VAE | Factorized attention |
| Text Encoder | T5-XXL (likely) | T5 + CLIP | T5 |
| Open Source | No | Partial | Yes |
The fundamental difference: joint vs. factorized spatiotemporal attention. Joint attention (Sora) can model arbitrary spatiotemporal dependencies, but its cost grows quadratically with the total token count (frames × spatial positions). Factorized approaches (Open-Sora) are far cheaper but may miss complex cross-frame spatial relationships.
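A back-of-the-envelope comparison makes the trade-off concrete (token counts are illustrative):

```python
# Pairwise attention interactions, ignoring constant factors.
T, S = 32, 1024                         # e.g., 32 latent frames, 32x32 spatial tokens

joint = (T * S) ** 2                    # one attention over all spacetime tokens
factorized = T * S**2 + S * T**2        # per-frame spatial + per-position temporal

print(f"joint:      {joint:.3e}")       # ~1.1e9
print(f"factorized: {factorized:.3e}")  # ~3.5e7, roughly 31x fewer interactions
```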
4. Spatiotemporal Consistency: The Core Challenge
Generating a single beautiful frame is relatively solved. The hard problem is consistency across frames:
Temporal Coherence
Objects should move smoothly, lighting should change gradually, and the “identity” of entities should persist. Failures manifest as:
- Flickering textures
- Objects morphing or disappearing between frames
- Inconsistent shadows and reflections
The 3D VAE approach (CogVideoX) addresses this at the representation level — by encoding temporal neighbors jointly, the latent space inherently captures local temporal coherence. Frame-by-frame VAEs must rely entirely on the diffusion model to learn consistency, a harder task.
Physical Plausibility
Sora demonstrated surprising (though imperfect) physical understanding: objects fall, liquids flow, and collisions produce plausible reactions. This emergent behavior likely results from scale — a sufficiently large model trained on diverse video data implicitly learns physics priors.
However, failures are still common: objects passing through each other, physically impossible transformations, and gravity violations. True physical reasoning likely requires either explicit physics simulation integration or training on synthetic data with known physics.
Long-Range Coherence
Generating videos beyond 10 seconds while maintaining narrative and visual consistency remains a major challenge. Current approaches use:
- Autoregressive extension: Generate a short clip, then condition subsequent clips on the last few frames of the previous one (see the sketch after this list). Accumulated error leads to drift.
- Hierarchical planning: Generate a sparse set of “keyframes” first, then interpolate. This provides global structure but the interpolation quality is critical.
- Full-length generation: Generate the entire video at once with a sufficiently long context. Only feasible for the largest models (Sora-class).
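As a sketch of the autoregressive scheme, assuming a hypothetical `model.sample` interface that accepts conditioning frames (not any real library’s API):

```python
import torch

def generate_long_video(model, prompt, num_chunks, chunk_len=16, overlap=4):
    """Autoregressive extension: each chunk is conditioned on the last `overlap`
    frames of the previous one. Errors in those conditioning frames compound,
    which is the source of the drift described above."""
    video = model.sample(prompt, num_frames=chunk_len)          # (T, C, H, W)
    for _ in range(num_chunks - 1):
        context = video[-overlap:]                              # condition on the tail
        chunk = model.sample(prompt, num_frames=chunk_len, init_frames=context)
        video = torch.cat([video, chunk[overlap:]], dim=0)      # drop duplicated frames
    return video
```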
5. Training Data: The Silent Differentiator
Architecture gets the attention, but training data arguably matters more:
- Scale: Sora-class models train on hundreds of millions of video clips. The minimum viable scale for coherent generation appears to be ~10M clips.
- Quality filtering: Aesthetic scoring, motion filtering (removing static or overly chaotic clips), and watermark detection are standard pre-processing steps.
- Captioning quality: Re-captioning training data with detailed VLM-generated descriptions (rather than relying on noisy web-scraped alt-text) provides a substantial improvement in text-video alignment. This technique, pioneered by DALL-E 3 for images, transfers directly to video.
- Aspect ratio and duration diversity: Training on a distribution of aspect ratios and durations (rather than a single fixed format) enables more flexible generation. Bucket-based sampling groups similar-sized videos into batches for efficient training (sketched below).
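A minimal sketch of bucket-based sampling, with illustrative metadata fields:

```python
from collections import defaultdict

def make_buckets(clips, batch_size):
    """Group clips by (height, width, num_frames) and emit uniform batches."""
    buckets = defaultdict(list)
    for clip in clips:                      # clip: dict with illustrative metadata
        key = (clip["height"], clip["width"], clip["num_frames"])
        buckets[key].append(clip)
    for group in buckets.values():
        for i in range(0, len(group) - batch_size + 1, batch_size):
            yield group[i:i + batch_size]   # every batch has identical tensor shape
```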
6. Controllable Generation and Editing
Beyond text-to-video, practical applications require fine-grained control:
- Image-to-video: Animate a static image. SVD (Stable Video Diffusion) demonstrated that fine-tuning an image diffusion model on video data produces strong image-conditioned video generation.
- Video editing: Modify specific elements while preserving the rest. TokenFlow and related methods enforce temporal consistency during editing by propagating edits along optical flow trajectories.
- Camera control: Specify camera movements (pan, zoom, orbit) independently of content. MotionCtrl and CameraCtrl inject camera parameters as additional conditioning signals.
- Subject-driven generation: Maintain the identity of a specific subject across generated videos. This combines ideas from DreamBooth-style personalization with temporal consistency.
7. Open Problems
Evaluation. There is no consensus on how to evaluate video generation. FVD (Fréchet Video Distance) is the standard quantitative metric but correlates poorly with human judgment. CLIP-based metrics (CLIPScore) measure text alignment but ignore temporal quality. Human evaluation remains the gold standard but is expensive and non-reproducible.
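For reference, FVD is the Fréchet distance between Gaussians fit to features of real and generated videos (I3D features in the standard formulation). Given precomputed feature matrices, the distance itself is short to compute:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets (rows = videos)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real     # discard tiny imaginary numerical noise
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))
```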
Temporal precision. Current models generate plausible motion but struggle with precise timing — “a ball bouncing exactly three times” is surprisingly difficult. This suggests that temporal understanding in generative models lags behind spatial understanding.
Efficiency. Generating a 10-second video at 1080p with a Sora-class model requires minutes on high-end hardware. Real-time or near-real-time generation requires 100-1000x efficiency improvements. Distillation, caching, and architectural innovations (e.g., consistency models adapted for video) are active research directions.
Audio. Video without synchronized audio feels incomplete. Joint video-audio generation is an emerging frontier, with models like V2A (Video-to-Audio) learning to generate temporally aligned soundtracks.
Closing Thoughts
Video generation is following a trajectory remarkably similar to image generation, with a ~2 year lag. The key enablers are the same: scalable architectures (DiT replacing U-Net), latent space compression (3D VAE), and massive curated training data. The difference is that video’s temporal dimension introduces qualitatively new challenges — consistency, physics, and narrative structure — that push beyond what pure scaling can solve.
The next breakthrough likely lies at the intersection of generation and understanding: models that can both generate and reason about video content, using each capability to improve the other.
References
- Ho et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
- Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models.” CVPR 2022.
- Peebles & Xie. “Scalable Diffusion Models with Transformers.” ICCV 2023.
- Brooks et al. “Video Generation Models as World Simulators.” OpenAI technical report / research post, 2024.
- Yang et al. “CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer.” arXiv 2024.
- Zheng et al. “Open-Sora: Democratizing Efficient Video Production for All.” arXiv 2024.
- Blattmann et al. “Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets.” arXiv 2023.