From Representational Drift to Cross-modal Representation Shift: Why Dynamic Alignment Matters
April 20, 2026
Recent neuroscience has made one point increasingly difficult to ignore: neural activity patterns are not as stable as we once hoped. Even when an animal keeps performing the same learned behavior, the neurons participating in that computation may gradually change over days or weeks. This phenomenon is now widely discussed as representational drift.
At first glance, this sounds like a problem. If internal representations keep changing, how can behavior remain stable? But a more interesting possibility is that drift is not merely noise to be tolerated. It may be part of how biological systems achieve continual learning, preserving function while reorganizing the substrate that implements it.
This question becomes even more intriguing when we step outside neuroscience and look at machine learning systems. In my own work, Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval, the term is not drift but representation shift. Yet the underlying tension is surprisingly similar: when supervision is sparse and the learning signal is incomplete, the relationship between internal features and target semantics becomes unstable. The system must then learn not only to match information, but to continuously realign it.
This post is an attempt to place these two stories into the same conceptual frame.
Representational drift: stable behavior, unstable neurons
Large-scale neural recording has made it possible to track population activity across days and even weeks. These studies reveal a striking pattern:
- over short timescales, many neurons preserve similar activity patterns;
- over longer timescales, the mapping between neural activity and sensory, cognitive, or behavioral variables can change substantially;
- individual neurons may stop encoding a variable, while a different subset of neurons begins encoding it;
- yet the overall behavioral output and population-level statistics often remain surprisingly stable.
In other words, the function persists even as the implementation changes.
This phenomenon has been reported in multiple brain regions, including the hippocampus and several cortical areas. A familiar example comes from hippocampal place cells: some neurons that once encoded spatial position gradually drop out, while new neurons take over, even though the environment itself remains unchanged. What stays stable is not the identity of the individual neuron, but the computational role realized at the population level.
That observation alone challenges a very classical intuition: if a system behaves consistently, its representation should also remain fixed. The data suggests otherwise.
Why the term itself is worth questioning
The phrase representational drift is convenient, but also subtly misleading.
“Representation” implies that neural activity is best understood as a relatively direct encoding of some external variable. “Drift” suggests that this encoding then wanders away from its original form. But this framing already assumes a static picture of what the brain is doing.
A different interpretation is possible. Neural circuits may be better understood as ongoing dynamical systems that interact with the world, rather than as static maps of it. On this view, what changes over time is not simply a stored representation, but the system’s trajectory through a changing computational space.
This matters because it changes the question. Instead of asking why the brain fails to preserve a fixed code, we might ask what kinds of adaptive computation are made possible by a substrate that continuously rewrites itself.
Drift as a cost of biology — or a feature for continual learning
There are at least two broad ways to interpret representational drift.
The first is conservative: drift is the inevitable byproduct of biology. Synapses turn over, proteins are replaced, excitability fluctuates, and network parameters are never perfectly fixed. From this perspective, the nervous system must constantly compensate for its own material instability. Drift is something downstream circuits learn to tolerate.
The second interpretation is more ambitious: drift may actually support learning.
This idea connects naturally to classic discussions of the plasticity–stability dilemma. A system that learns too aggressively overwrites previous knowledge. A system that is too stable cannot incorporate new information. Biological intelligence seems to avoid both extremes. One possible reason is that it does not insist on using exactly the same neural substrate forever. By allowing the active computational pool to change over time, new associations can be encoded without always colliding with old ones.
That makes drift relevant to three important functions:
1. Preserving separability across time
If all new experiences were written indiscriminately into the same neurons and synapses, interference would quickly accumulate. A shifting active population can help distribute memories across partially distinct substrates.
2. Linking related experiences
At the same time, drift does not imply complete separation. Revisited memories may be rewritten into newly excitable populations together with newer related information, allowing the system to build bridges across time.
3. Forgetting what no longer matters
Not every learned association deserves permanent maintenance. A system with baseline representational turnover may gradually stop reinforcing obsolete structure, providing a natural path toward selective forgetting.
Seen this way, drift is not merely tolerated. It may be part of the machinery of continual refinement.
A parallel in machine learning: representation shift under weak supervision
This is exactly why I find the notion of cross-modal representation shift in video moment retrieval so compelling.
In point-supervised Video Moment Retrieval (VMR), the goal is to localize the temporal segment in a video that corresponds to a natural-language query. Unlike fully supervised settings, point-supervised VMR uses only a single annotated frame rather than precise start and end boundaries. This dramatically reduces annotation cost, but also creates a serious learning problem: the supervision is too sparse to fully constrain temporal semantics.
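To make the sparsity concrete, here is a toy sketch of the two annotation regimes. The field names and timestamps are invented for illustration and are not the actual dataset format used in the paper.

```python
# Hypothetical annotation formats for a 30-second video and one query.
# All field names and values are illustrative only.

full_supervision = {
    "query": "the man throws the ball to the dog",
    "start_sec": 12.4,   # exact segment boundaries must be annotated
    "end_sec": 17.9,
}

point_supervision = {
    "query": "the man throws the ball to the dog",
    "point_sec": 14.0,   # a single frame known to lie inside the moment
}

# At inference time the model must still predict (start_sec, end_sec),
# but during training it only knows that t = 14.0 s falls somewhere
# inside the target moment; every boundary decision is underdetermined.
```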
As a result, the alignment between visual features and language features becomes unstable. The model may know roughly where something relevant happens, yet still fail to form a robust cross-modal association over the full temporal extent of the event. In the language of the paper, this induces cross-modal representation shift.
The phrase is highly suggestive. Sparse supervision does not just make the problem noisy. It changes the geometry of the learned feature space. Temporal evidence and semantic evidence no longer line up cleanly. What the model needs is not simply stronger matching, but a mechanism to refine alignment as the representation itself shifts.
That is the core motivation behind DRONE.
What DRONE tries to do
In Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval, we proposed DRONE — pseuDo fRame-based tempOral and semaNtic rEfinement — to explicitly address this shift problem.
The idea is built around two components:
1. Pseudo-Frame Temporal Alignment (PTA)
PTA treats textual queries as pseudo-frames, allowing the model to inject language information into the temporal sequence more directly. Rather than forcing language to supervise video only at a coarse global level, PTA encourages finer temporal coherence between query semantics and evolving video features.
Conceptually, this is a temporal realignment mechanism. It says: if sparse point labels distort the temporal structure of learning, then we should create a bridge that helps language participate in the temporal organization itself.
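As a rough illustration of the pseudo-frame idea, the sketch below projects a pooled query embedding into the frame-feature space and prepends it to the frame sequence before a temporal encoder, so language participates in temporal mixing at every step. This is my simplified reading of the general mechanism; the module names, dimensions, and encoder choice are assumptions, not the exact design in the paper.

```python
import torch
import torch.nn as nn

class PseudoFrameTemporalAlignment(nn.Module):
    """Toy sketch: project the query into frame-feature space and insert it
    into the temporal sequence as an extra 'pseudo-frame', letting a temporal
    encoder mix language and video at every time step. Hyperparameters and
    layer choices here are illustrative assumptions."""

    def __init__(self, video_dim=512, text_dim=768, hidden=512, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, video_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=video_dim, nhead=heads, dim_feedforward=hidden,
            batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, frame_feats, query_feat):
        # frame_feats: (B, T, video_dim); query_feat: (B, text_dim)
        pseudo_frame = self.text_proj(query_feat).unsqueeze(1)    # (B, 1, video_dim)
        sequence = torch.cat([pseudo_frame, frame_feats], dim=1)  # prepend query as a frame
        fused = self.temporal_encoder(sequence)
        # drop the pseudo-frame position; the remaining frame features are
        # now conditioned on the query through temporal self-attention
        return fused[:, 1:, :]
```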
2. Curriculum-Guided Semantic Refinement (CSR)
CSR progressively refines semantic consistency using a curriculum-style contrastive strategy, moving from easier cases to harder ones. Instead of assuming that the model can immediately resolve all ambiguous cross-modal correspondences, CSR gradually sharpens the semantic space.
This is important because representation shift is rarely corrected in one shot. It is typically reduced through staged adaptation, where the model first locks onto more reliable correspondences and then extends that structure to harder samples.
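To make the easy-to-hard intuition tangible, here is a minimal curriculum-weighted contrastive loss: early in training only the pairs the model already aligns well contribute, and the kept fraction grows over epochs. The difficulty measure and schedule are illustrative assumptions, not the CSR formulation itself.

```python
import torch
import torch.nn.functional as F

def curriculum_contrastive_loss(video_emb, text_emb, epoch, max_epoch, tau=0.07):
    """Toy curriculum contrastive loss over a batch of matched video/text pairs.
    video_emb, text_emb: (B, D) L2-normalized embeddings, row i matches row i.
    Early on, only the easiest (already well-aligned) pairs contribute; the
    fraction of pairs kept grows linearly with the epoch."""
    logits = video_emb @ text_emb.t() / tau                          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")    # (B,) per-pair loss

    # Curriculum: keep the easiest fraction of pairs (lowest loss) and
    # expand that fraction toward 1.0 as training progresses.
    keep_frac = min(1.0, 0.3 + 0.7 * epoch / max_epoch)
    k = max(1, int(keep_frac * per_pair.numel()))
    easiest, _ = torch.topk(per_pair, k, largest=False)
    return easiest.mean()
```

The key design choice in this sketch is that difficulty is measured online, from the model's current loss, so what counts as "hard" changes as the representation itself shifts.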
Together, PTA and CSR give DRONE a simple but useful bias: when supervision is incomplete, alignment should be progressively rebuilt rather than naively assumed.
Why this connection matters
Of course, representational drift in the brain and cross-modal representation shift in VMR are not the same phenomenon.
One emerges in biological neural populations recorded across days or weeks. The other appears in learned multimodal embeddings under sparse supervision. One is about living circuits, the other about optimization in machine learning.
But the analogy is still productive.
Both cases challenge a static view of representation. In both cases, the central issue is not merely whether a representation exists, but whether it remains functionally aligned with what downstream computation requires. And in both cases, successful systems seem to depend on mechanisms that preserve useful behavior even when the internal basis of that behavior is changing.
This suggests a broader principle:
Intelligent systems may not need perfectly stable representations.
They may instead need robust mechanisms for refinement, compensation, and realignment.
That principle is already familiar in neuroscience, where downstream circuits may use plasticity to compensate for upstream drift. In machine learning, we often rediscover the same need in a different form: temporal alignment modules, curriculum learning, contrastive refinement, memory updates, and adaptive decoders.
The vocabulary changes. The problem remains.
A neuroscience reading of DRONE
If I read DRONE through the lens of representational drift, three ideas stand out.
Sparse supervision behaves like partial observability
In neuroscience, one reason drift is difficult to interpret is that we rarely measure all relevant variables. Some apparent drift may reflect missing latent structure. Point supervision creates a related issue in VMR: the model sees only a tiny fraction of the true temporal annotation, so the target structure is underdetermined from the start.
Alignment must be maintained, not assumed
Biological systems appear able to maintain stable readout despite changing population codes, likely through ongoing plastic compensation. DRONE adopts a computational analogue: rather than relying on raw weak supervision to keep video and text aligned, it introduces explicit refinement modules to restore that alignment.
Continual adaptation may be more realistic than frozen matching
A static representation-learning mindset often assumes that once the right embedding is found, the problem is solved. But both drift and representation shift suggest otherwise. Good systems may need to keep reorganizing intermediate structure while preserving task-relevant output.
That does not make learning unstable; it may be what makes learning possible.
Toward a broader research agenda
The deeper lesson here is not limited to VMR.
Many modern multimodal systems operate under weak, noisy, delayed, or partial supervision. In these conditions, representation spaces rarely remain clean and fixed. They bend under the pressure of optimization, data imbalance, temporal ambiguity, and domain mismatch. If we want more robust models, we may need to think less in terms of static embedding alignment and more in terms of representation shift management.
Neuroscience offers a useful conceptual provocation: maybe instability is not always an error signal. Sometimes it is the price — or even the mechanism — of continual adaptation.
That perspective opens at least two future directions.
First, we can ask whether machine learning systems should include more explicit readout compensation mechanisms, analogous to how downstream biological circuits may adapt to upstream changes.
Second, we can explore whether controlled representational turnover can actually improve continual multimodal learning, just as drift may help biological systems balance separation, integration, and forgetting.
Closing thoughts
Representational drift tells us that stable intelligence does not require a frozen substrate. Cross-modal representation shift tells us that weakly supervised multimodal learning cannot rely on fixed alignment either.
What matters, in both settings, is the capacity to preserve useful structure while internal organization changes.
That is why I see Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval not just as a task-specific method, but as part of a larger conversation about dynamic representation. DRONE is one concrete answer to a general question:
When representations move, how should learning systems respond?
For biological systems, the answer may involve plasticity, memory allocation, and continual reorganization.
For multimodal models, it may involve temporal alignment, semantic curriculum, and explicit refinement.
For both, the future likely belongs to systems that do not demand perfect stability — but know how to stay coherent while changing.
References
- Laura N. Driscoll, Lea Duncker, Christopher D. Harvey. Representational drift: Emerging theories for continual learning and experimental future directions. Current Opinion in Neurobiology, 76, 102609, 2022.
- Kun Wang, Yupeng Hu, Hao Liu, Jiang Shao, Liqiang Nie. Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval. ACM Transactions on Information Systems, 44(3), 1–30, 2026.