Multimodal Large Language Models: Architecture Evolution and Technical Analysis
March 29, 2026
The rapid progression from text-only LLMs to natively multimodal systems has fundamentally reshaped how we think about vision-language understanding. This post traces the architectural evolution of Multimodal Large Language Models (MLLMs), dissects the core design choices, and examines the trade-offs that matter in practice.
1. A Brief Taxonomy
MLLMs can be broadly categorized by how they integrate visual and textual modalities:
| Generation | Representative Models | Visual Encoder | Alignment Module | LLM Backbone |
|---|---|---|---|---|
| Cross-attention | VisualGPT, Flamingo | Frozen CNN / NFNet | Cross-attention (Perceiver Resampler) | GPT-2 / Chinchilla |
| Projection-based | LLaVA, MiniGPT-4 | CLIP ViT-L/14 / EVA-CLIP ViT-G | Linear / MLP Projection | Vicuna / LLaMA |
| Q-Former family | BLIP-2, InstructBLIP | EVA-CLIP ViT-G | Querying Transformer | FlanT5 / Vicuna |
| Native Multimodal | GPT-4o, Gemini 2.0 | Proprietary / SigLIP | Interleaved / Early Fusion | Proprietary |
The trend is clear: from loosely-coupled frozen encoders toward deeply integrated, end-to-end trainable architectures.
2. Visual Encoders: ViT, SigLIP, and Beyond
The choice of visual encoder profoundly affects a model’s spatial and semantic grounding capabilities.
CLIP ViT remains the most widely adopted backbone. Its contrastive pre-training on 400M+ image-text pairs produces representations that are naturally aligned with language. However, CLIP features tend to be globally biased — they excel at image-level classification but struggle with fine-grained spatial reasoning.
SigLIP (Sigmoid Loss for Language-Image Pre-training) replaces the softmax-based contrastive loss with a per-pair sigmoid loss, removing the dependence on very large batch sizes. This yields comparable or better performance with significantly more efficient training. Gemini and PaLI-3 adopted SigLIP-family encoders.
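To make the objective concrete, here is a minimal PyTorch sketch of the per-pair sigmoid loss. The learnable log-temperature t and negative-initialized bias b follow the paper's formulation; the variable names are mine:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Per-pair sigmoid loss over the full (N, N) similarity matrix.

    img_emb, txt_emb: L2-normalized embeddings, shape (N, D).
    t, b: learnable log-temperature and bias scalars.
    """
    logits = img_emb @ txt_emb.T * t.exp() + b            # (N, N) similarities
    # Matched pairs on the diagonal are positives (+1); all others negatives (-1).
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    # Each pair is an independent binary problem, so there is no batch-wide
    # softmax -- and no need for huge batches to supply hard negatives.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = siglip_loss(img, txt, t=torch.tensor(2.3), b=torch.tensor(-10.0))
```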
DINOv2 offers strong spatial features via self-supervised learning. Some recent architectures (e.g., Cambrian-1) fuse DINOv2 features with CLIP features to combine spatial precision with semantic alignment — a dual-encoder strategy worth monitoring.
A critical design question: freeze or fine-tune? Freezing the visual encoder preserves its pre-trained representations and reduces training cost, but limits adaptation. Recipes such as Qwen-VL and InternVL showed that unfreezing the ViT in later training stages meaningfully improves fine-grained understanding. The current consensus leans toward a staged approach: freeze during pre-training alignment, unfreeze during instruction tuning.
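A minimal sketch of that staged schedule, using placeholder nn.Module stand-ins for the real encoder, projector, and LLM:

```python
import torch.nn as nn

# Placeholders standing in for the real components.
vision_encoder = nn.Linear(1024, 1024)   # e.g., a CLIP ViT in practice
projector = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 4096))
llm = nn.Linear(4096, 4096)              # e.g., a LLaMA-family decoder

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 (alignment pre-training): only the projector learns.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(projector, True)

# Stage 2 (instruction tuning): unfreeze the LLM and, optionally, the ViT,
# typically with a lower learning rate for the encoder.
set_trainable(llm, True)
set_trainable(vision_encoder, True)
```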
3. The Alignment Bottleneck
The interface between the visual encoder and the LLM is arguably the most important architectural decision in current MLLM design.
Linear / MLP Projection
LLaVA demonstrated that a simple two-layer MLP can effectively bridge CLIP features to the LLM’s embedding space. The appeal is simplicity — fewer parameters, faster training, easier debugging. The downside: each image patch maps to one token, leading to long visual token sequences (576 tokens for a 384×384 image at patch size 16). This creates a quadratic attention bottleneck for the LLM.
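A sketch of such a projector, with dimensions matching a CLIP ViT-L (1024-d features) feeding a 4096-d LLaMA-family embedding space; exact layer sizes vary by recipe:

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """LLaVA-1.5-style two-layer MLP bridging ViT features to LLM embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim).
        # Token count is preserved: a 384x384 image at patch size 16 yields
        # (384 / 16) ** 2 = 576 visual tokens, all passed to the LLM.
        return self.proj(patch_features)

tokens = MLPProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The simplicity is the point: every learned parameter sits in two linear layers, so alignment pre-training stays cheap and failure modes are easy to isolate.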
Querying Transformer (Q-Former)
BLIP-2 introduced the Q-Former: a set of learnable query tokens attend to visual features through cross-attention layers, producing a fixed-length visual representation (typically 32-64 tokens). This drastically reduces the sequence length but introduces an information bottleneck. The Q-Former must learn what to extract — and it can miss fine details that weren’t relevant during pre-training.
Perceiver Resampler
Flamingo’s approach uses cross-attention layers interleaved with self-attention over a fixed set of latent tokens. Conceptually similar to Q-Former but with a different architectural flavor. The advantage: flexible token count and graceful handling of variable-length inputs (multiple images, video frames).
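Both modules share one core mechanism: a fixed set of learnable latent tokens cross-attends to an arbitrary number of visual features. A stripped-down sketch of that shared idea (real implementations add self-attention blocks, feed-forward layers, and, for the Q-Former, text-conditioned pre-training objectives):

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress N visual features into K latent tokens via cross-attention --
    the mechanism underlying both the Q-Former and the Perceiver Resampler."""

    def __init__(self, num_latents: int = 32, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, N, dim), N arbitrary (tiles, frames, ...).
        q = self.latents.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        out, _ = self.cross_attn(query=q, key=visual_feats, value=visual_feats)
        return out  # (batch, K, dim): fixed length regardless of N

feats = torch.randn(2, 1024, 768)    # e.g., tokens from several video frames
print(Resampler()(feats).shape)      # torch.Size([2, 32, 768])
```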
Dynamic Resolution
A recent trend is dynamic resolution handling. Instead of resizing all images to a fixed resolution, models like LLaVA-NeXT, InternVL, and Monkey split high-resolution images into tiles, encode each tile independently, and concatenate the resulting tokens. This preserves fine detail at the cost of longer sequences. The practical impact: text-in-image understanding (OCR, document parsing) improves dramatically.
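A minimal sketch of the tiling step (production recipes additionally select an aspect-ratio-matched grid, pad edges, and usually append a downsized global view of the whole image):

```python
import torch

def tile_image(image: torch.Tensor, tile: int = 336) -> torch.Tensor:
    """Split a (C, H, W) tensor into non-overlapping tile x tile crops."""
    c, h, w = image.shape
    rows, cols = h // tile, w // tile
    tiles = (image[:, :rows * tile, :cols * tile]  # drop any remainder pixels
             .unfold(1, tile, tile)                # (C, rows, W', tile)
             .unfold(2, tile, tile)                # (C, rows, cols, tile, tile)
             .permute(1, 2, 0, 3, 4)               # (rows, cols, C, tile, tile)
             .reshape(-1, c, tile, tile))
    # Each tile is encoded independently; the resulting tokens are concatenated.
    return tiles

print(tile_image(torch.randn(3, 672, 1008)).shape)  # torch.Size([6, 3, 336, 336])
```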
4. Training Paradigms
Modern MLLMs typically follow a multi-stage training recipe:
Stage 1: Alignment Pre-training. Train only the alignment module on image-caption pairs. The visual encoder and LLM remain frozen. Objective: teach the projection layer to produce visual tokens the LLM can interpret. Data scale: 500K–5M pairs.
Stage 2: Instruction Tuning. Fine-tune the alignment module and (optionally) the LLM on instruction-following data. This is where conversational and reasoning capabilities emerge. Data: curated instruction datasets (LLaVA-Instruct, ShareGPT-4V, etc.). Some recipes also unfreeze the visual encoder here.
Stage 3: RLHF / DPO (optional). Align the model with human preferences. LLaVA-RLHF and RLHF-V showed measurable reductions in hallucination through reward modeling on visual faithfulness.
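For the DPO route, the objective is the standard preference loss applied to response pairs ranked by visual faithfulness. A sketch, assuming per-response log-probabilities have already been computed and summed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on (faithful, hallucinated) response pairs.

    logp_*: summed response log-probs under the trainable policy, shape (batch,).
    ref_*:  the same quantities under the frozen reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_chosen - ref_rejected
    # Push the policy's preference margin above the reference model's margin.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```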
A notable shift: single-stage training is gaining traction. Models like MM1.5 (Apple) demonstrated that with sufficiently diverse and high-quality data, a single-stage fine-tune from a pre-trained LLM can match multi-stage pipelines — simplifying the recipe considerably.
5. Efficient Inference
Deploying MLLMs at scale requires addressing the visual token overhead:
- Token Merging / Pruning: FastV and similar approaches prune less informative visual tokens after the first few LLM layers, achieving 2-3x speedup with minimal quality loss (a sketch of the idea follows this list).
- KV Cache Compression: Visual tokens dominate the KV cache. Methods like PyramidKV apply progressive compression to visual KV entries, significantly reducing memory footprint during long-context inference.
- Quantization: Standard LLM quantization (AWQ, GPTQ) applies to the language backbone, but the visual encoder and projection layers require separate calibration. Mixed-precision strategies (FP16 visual encoder + INT4 LLM) offer a practical balance.
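The pruning idea is simple to sketch. Below is an illustrative (not FastV-exact) version: after an early decoder layer, rank visual tokens by the attention mass they receive and keep only the top fraction:

```python
import torch

def prune_visual_tokens(hidden, attn, vis_start, vis_len, keep_ratio=0.5):
    """Drop low-attention visual tokens after an early LLM layer.

    hidden: (batch, seq, dim) hidden states after layer k.
    attn:   (batch, heads, seq, seq) attention weights from that layer.
    vis_start, vis_len: span of the visual tokens within the sequence.
    """
    # Mean attention each visual token receives, averaged over heads and queries.
    received = attn.mean(dim=1).mean(dim=1)[:, vis_start:vis_start + vis_len]
    k = max(1, int(vis_len * keep_ratio))
    # Keep the top-k tokens, restored to their original order.
    keep = received.topk(k, dim=-1).indices.sort(dim=-1).values + vis_start

    batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)   # (batch, 1)
    pruned_visual = hidden[batch_idx, keep]                  # (batch, k, dim)
    # Reassemble: tokens before the image, survivors, tokens after the image.
    return torch.cat([hidden[:, :vis_start],
                      pruned_visual,
                      hidden[:, vis_start + vis_len:]], dim=1)
```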
6. Open Challenges
Hallucination remains the central failure mode. MLLMs confidently describe objects and relationships that do not exist in the image. POPE and CHAIR benchmarks reveal that even state-of-the-art models hallucinate 15-30% of mentioned objects.
Spatial reasoning is another persistent weakness. Tasks requiring precise localization (counting, relative positioning, reading spatial layouts) expose the limitations of patch-based representations. Grounding-oriented models (Ferret, Shikra, CogVLM) partially address this through coordinate-aware training, but a general solution remains open.
Video understanding in MLLMs is nascent. Simply sampling frames and concatenating visual tokens does not capture temporal dynamics. This is an active frontier.
Closing Thoughts
The MLLM landscape is converging toward natively multimodal architectures where vision and language are processed in a unified representation space from the start, rather than bolted together through adapters. The key insight from the past two years: alignment quality matters more than scale. A well-designed projection with curated training data consistently outperforms brute-force scaling of either the visual encoder or the LLM.
References
- Liu, H. et al. Visual Instruction Tuning. NeurIPS 2023.
- Li, J. et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023.
- Alayrac, J.-B. et al. Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022.
- Zhai, X. et al. Sigmoid Loss for Language Image Pre-Training. ICCV 2023.
- Liu, H. et al. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge. Jan. 2024.
- Chen, L. et al. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. ECCV 2024.