Redundancy is the bottleneck: Unimodal semantic redundancy (uninformative/common features) and cross-modal relationship redundancy (spurious co-occurrences) jointly inflate compute and degrade retrieval accuracy.
Limits of prior approaches: Relying purely on embedding optimization or on heavy re-ranking tends to amplify redundancy or sacrifice efficiency in large-scale retrieval.
Opportunity: If we first purify representations (mitigate semantic redundancy) and then refine alignments (mitigate relationship redundancy) under a unified objective, we can improve both accuracy and efficiency.
Generic phrases or cluttered visual regions mask salient semantics, hindering compact, discriminative representations.
Spurious cross-modal co-occurrences (e.g., common but irrelevant pairs) inflate similarity for incorrect matches.
Fine-grained attention and re-ranking can help accuracy but are computationally heavy without early filtering.
"Redundancy Mitigation First, Alignment Refinement Second"
Explicitly view image–text retrieval (ITR) through a redundancy lens: purify semantics first, then refine cross-modal relationships under a unified objective.
Consistency Modeling: external-attention weighting between image and candidate texts to capture global semantic association.
Complementarity Modeling: re-rank the top-z candidate texts by how highly each text, in turn, ranks the query image, injecting a complementary reverse-direction signal; combine with the consistency score to produce final scores.
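The two-stage scoring above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the cosine-similarity stand-in for external attention, the reciprocal-rank complementarity signal, and the α/β fusion are all assumptions made for clarity.

```python
import numpy as np

def final_scores(img_emb, txt_embs, img_gallery, alpha=0.7, beta=0.3, z=5):
    """Sketch of consistency + complementarity scoring for one image query.

    Assumes all embeddings are L2-normalized, so dot product = cosine.
    alpha/beta/z are illustrative hyperparameters, not the paper's values.
    """
    # Stage 1 (consistency): global image->text similarity for every candidate.
    cons = txt_embs @ img_emb                      # shape: (num_texts,)

    # Stage 2 (complementarity): for the top-z candidates only, check how
    # highly each text ranks the query image among a gallery of images.
    comp = np.zeros_like(cons)
    topz = np.argsort(-cons)[:z]
    for t in topz:
        t2i = img_gallery @ txt_embs[t]            # text -> image similarities
        rank = int((t2i > cons[t]).sum())          # 0 means query ranked first
        comp[t] = 1.0 / (rank + 1)                 # reciprocal-rank signal

    # Fuse the two signals into the final retrieval score.
    return alpha * cons + beta * comp
```

Restricting the reverse re-ranking to the top-z candidates is what keeps the second stage cheap: the expensive text-to-image pass runs only z times per query.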
Feature-space discrepancy loss (Lf) + encoding loss (Lq), weighted by β1 and β2, stabilize the hash/quantization spaces while preserving the original similarity structure.
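A minimal sketch of how the two regularizers might combine; the concrete forms below (MSE discrepancy for Lf, sign-quantization error for Lq) are assumptions chosen to illustrate the structure, not the paper's exact definitions.

```python
import numpy as np

def redundancy_losses(feat_img, feat_txt, beta1=0.5, beta2=0.5):
    """Illustrative combined objective beta1*Lf + beta2*Lq (forms assumed).

    Lf: feature-space discrepancy between paired image/text features.
    Lq: encoding loss pulling continuous features toward {-1, +1} hash codes.
    """
    # Lf: matched image/text features should occupy nearby points.
    l_f = np.mean((feat_img - feat_txt) ** 2)

    # Lq: quantization error w.r.t. binary codes obtained by sign().
    codes_img = np.sign(feat_img)
    codes_txt = np.sign(feat_txt)
    l_q = np.mean((feat_img - codes_img) ** 2) + np.mean((feat_txt - codes_txt) ** 2)

    return beta1 * l_f + beta2 * l_q
```

Both terms vanish when paired features coincide and already sit on the binary code vertices, which is the stable point the β1/β2-weighted objective pushes toward.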
Dual filtering reduces compute while boosting top-rank accuracy; MEET achieves higher rSum with competitive time cost versus strong baselines.
On Flickr30K and MS-COCO, MEET consistently surpasses prior state-of-the-art methods in rSum, with strong gains on R@1 for both I2T and T2I.
Ablations show that both stages (I and II) and both losses (Lf, Lq) are essential; performance remains stable across settings of β1/β2, α/β, and z.
@article{wang2025meet,
  title={Redundancy Mitigation: Towards Accurate and Efficient Image-Text Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Jie, Lirong and Nie, Liqiang},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2025}
}
