Change localization: how Solver finds its way
Authored by Christian Cosgrove, Founding Research Engineer
Ask any developer: a large fraction of their time isn't spent writing new code, but figuring out where the existing code needs to change. Got a bug report? Need to add a feature? The first step is finding the right spot in potentially millions of lines of code.
Existing tools only get you so far. Simple keyword search, like grep, works if you know the exact name of the function or variable you need. But what if you don't? What if the bug report just says "the checkout button is slow"? Good luck guessing the right keywords.
Newer tools use "semantic search." They try to understand the meaning behind your search query, not just the exact words. That's definitely better. You might search for "speed up checkout," and it could find code related to payment processing or the checkout button frontend. But even semantic search often falls short. Why?
Because the way we describe a task ("speed up checkout") looks totally different from the code that actually needs changing (maybe a specific database query or a loop). Finding code about checkout isn't the same as finding the exact lines to modify. This mismatch is especially tough early on when you're not even sure which files are involved.
From semantic search to change localization
Solver is an agent that can tackle software engineering tasks, so, like a human developer, it needs access to tools and signals for searching codebases. When fixing that slow checkout button, Solver can't just search for code related to the task description; it needs to figure out the most likely place where a change is needed. It needs to predict the location of the future change.
Finding related code is like looking up a word in a dictionary. Predicting where a change needs to happen is more like asking, "Based on this idea I want to convey, which words should I change?" It requires understanding the goal and knowing which words have typically changed in response to similar goals in the past.
Standard search tools like grep and semantic search aren't built for this. They look at the code as it is now. True change localization requires understanding the connection between a task description, code changes made so far, and the code changes still needed.
Predicting where code will change
So, how do we teach a model to predict where code should change? Where can we get a low-noise source of ground truth to train such a system? Early on, we found a simple answer: tap into the rich history of public Git repositories.
Every time a developer makes a commit or submits a pull request, they usually write a message describing why they made the change. This gives us millions of examples linking:
- A task description (the commit message, PR description, issue text).
- Previous changes (other edits in the same commit or PR).
- The actual code chunks that were modified.
After encoding this content into tokens, we feed these examples into a pretrained Transformer, which has the expressive power to model the patterns connecting the task description to code changes.
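To make the shape of this data concrete, here is a minimal sketch of how a single commit could be turned into several training examples, with each changed chunk taking a turn as the prediction target. The `LocalizationExample` schema and the `examples_from_commit` helper are illustrative assumptions, not Solver's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class LocalizationExample:
    """One training example mined from a single commit or PR (hypothetical schema)."""
    task_description: str        # commit message, PR description, or linked issue text
    previous_changes: list[str]  # other code chunks edited in the same commit or PR
    positive_chunk: str          # the code chunk whose change we want to predict

def examples_from_commit(message: str, changed_chunks: list[str]) -> list[LocalizationExample]:
    """Let each changed chunk take a turn as the prediction target, treating the
    remaining chunks from the same commit as 'previous changes'."""
    examples = []
    for i, chunk in enumerate(changed_chunks):
        others = changed_chunks[:i] + changed_chunks[i + 1:]
        examples.append(LocalizationExample(message, others, chunk))
    return examples
```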
It's a bit like how an experienced developer builds intuition. They've seen thousands of bug fixes and features added. When they see a new task, they have a good hunch about which parts of the codebase are likely involved because they remember past changes. Our model learns this same kind of pattern recognition, but is trained on vastly more data than a human ever could see.
Change localization is critical for Solver. If the agent relied on basic search alone, it would often get lost, edit the wrong files, or waste costly iterations hunting for the target. Our change localization system lets the agent quickly narrow its search space, so Solver can focus its efforts, work much faster, and produce better results.
Embeddings and contrastive learning
How does the model actually learn these patterns from this data? There are a few key ingredients:
- Transformers + embeddings: Semantic search works with vectors, not text or code directly. So, the first step is to turn both the task description (and its history) and the code chunks into embeddings. Code chunks that do similar things, or task descriptions about similar goals, will have similar embeddings (as measured by cosine similarity). Any pretrained Transformer encoder can produce these embeddings.
- Contrastive learning: Just creating embeddings isn't enough. We need to teach the model which code chunks are relevant to which task descriptions. We do this with contrastive learning:
- We take a task description and past work (the "query").
- We find the code chunk that was changed for that task (the "positive").
- We grab some code chunks that were not changed for that task (the "negatives").
- Via a contrastive loss function, we tell the model: "Make the embedding for the query closer to the positive, and push it further away from all the negatives."
After making many small updates to minimize this contrastive loss, the model learns to recognize the subtle signals that link a task description to the specific code that changed – distinguishing it from all the other code that isn't relevant.
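Here is a minimal PyTorch sketch of that objective, assuming the query, positive, and negative embeddings have already been produced by an encoder; the function name, shapes, and temperature value are illustrative choices rather than Solver's actual training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,      # (d,) embedding of the task description + past work
                     positive_emb: torch.Tensor,   # (d,) embedding of the chunk that was changed
                     negative_embs: torch.Tensor,  # (k, d) embeddings of chunks that were not changed
                     temperature: float = 0.05) -> torch.Tensor:
    # Normalize so that dot products are cosine similarities.
    query = F.normalize(query_emb, dim=-1)
    pos = F.normalize(positive_emb, dim=-1)
    negs = F.normalize(negative_embs, dim=-1)

    pos_sim = (query * pos).sum(-1, keepdim=True) / temperature  # shape (1,)
    neg_sims = negs @ query / temperature                        # shape (k,)
    logits = torch.cat([pos_sim, neg_sims]).unsqueeze(0)         # shape (1, k+1)

    # Cross-entropy with the positive at index 0: pull the query toward the
    # positive chunk and push it away from every negative.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits, target)
```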

At deployment time, we store these embeddings in an approximate nearest neighbors (ANN) index. When our agent queries this index, we transform the request and past work into an embedding vector and find its closest neighbors. These correspond to the most likely sites in the codebase to change next.
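As a rough sketch of this retrieval step, here is what building and querying such an index could look like with FAISS (one possible ANN library); the embedding dimension is a placeholder, and the random vectors stand in for real encoder outputs.

```python
import numpy as np
import faiss  # one possible ANN library; alternatives would work similarly

d = 768  # embedding dimension (placeholder)
chunk_embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for encoder output
faiss.normalize_L2(chunk_embeddings)  # unit-length vectors: L2 ranking matches cosine ranking

index = faiss.IndexHNSWFlat(d, 32)    # graph-based approximate nearest neighbor index
index.add(chunk_embeddings)

query = np.random.rand(1, d).astype("float32")  # stand-in for the query embedding
faiss.normalize_L2(query)
distances, chunk_ids = index.search(query, 5)   # ids of the 5 most likely chunks to change
```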
Picking the right negatives is crucial
Contrastive learning sounds simple, but there's a tricky part: choosing the "negatives" (the code chunks that weren't changed). The quality of these negatives massively affects how well the model learns.
Problem: If you only show the model easy negatives, it doesn't learn much. Imagine trying to teach someone to identify a specific type of bird. If you only show them pictures of that bird alongside pictures of cars and dogs, they'll learn quickly, but they won't be able to tell that bird apart from other similar birds.
It's the same with code:
- Random negatives: The simplest approach is to pick random code chunks from anywhere, in any repo, as negatives. Maybe you pick a chunk of Java code from the Google Guava repo when the real change was in a Python project like Django. The model easily learns to tell them apart, but this doesn't teach it the difference between the correct Python chunk and another, similar-looking but wrong Python chunk in the same project. The model learns superficial things, like programming language and project-level conventions, but not the deeper signals needed for precise intra-repo prediction. (This can be implemented by sampling negatives from the same training batch, e.g., per OpenAI’s embedding training approach.)
- Using the model's mistakes: A better approach is to find negatives that the model currently finds confusing. You can use the model itself to find code chunks that it thinks are relevant to the task, but actually aren't. These "hard negatives" force the model to learn the really subtle differences. The downside? This is slow and expensive, because you have to run the model and query the index just to find the hard examples.
- Smart guesses: This strategy strikes a balance by using cheaper heuristics to find negatives that are plausible but wrong. You might pick code chunks from the same repo as the positive, other chunks from the same file as the real change, or chunks from files that historically change together with it (sketched below). These are harder than random negatives but much cheaper to find than true hard negatives.
There's a distinct trade-off: acquiring higher-quality hard negatives is often more expensive, which can mean a smaller dataset within a given training budget. Balancing this is key, because if the negatives are too weak, the model may only learn superficial patterns—like just recognizing the programming language—instead of developing useful representations for change localization.
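Here is a toy sketch of the "smart guesses" strategy, where most negatives come from the same repository as the positive and the rest are random cross-repo chunks; the chunk dictionaries, their keys, and the mixing ratio are assumptions made for illustration.

```python
import random

def sample_negatives(positive: dict, all_chunks: list[dict],
                     k: int = 8, same_repo_frac: float = 0.75) -> list[dict]:
    """Mostly in-repo negatives (plausible but wrong), padded with random cross-repo chunks.
    Each chunk is assumed to look like {'repo': ..., 'path': ..., 'text': ...}."""
    same_repo = [c for c in all_chunks
                 if c["repo"] == positive["repo"] and c["text"] != positive["text"]]
    cross_repo = [c for c in all_chunks if c["repo"] != positive["repo"]]

    n_in_repo = min(int(k * same_repo_frac), len(same_repo))
    negatives = random.sample(same_repo, n_in_repo)
    negatives += random.sample(cross_repo, min(k - n_in_repo, len(cross_repo)))
    return negatives
```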
Why it matters: For Solver, an AI agent working autonomously, just being "mostly right" isn't good enough. It needs to be precise. Using harder negatives during training is crucial to make the model reliable enough for an agent to use across all sorts of codebases and tasks. We’ve learned that finding the right mix of negative sampling strategies is key.
Avoiding overfitting
Another trap when training change-localization models is overfitting.
If the model sees the exact same piece of code being the "correct" answer (the positive example) for too many different training tasks (maybe it's a popular utility function that gets changed a lot), it might just memorize that specific chunk. Instead of learning the general pattern of why that kind of code gets changed for that kind of task, it just learns "when in doubt, pick this chunk."
This is bad because the model won't generalize to new situations. When it sees a new task that should point to a different piece of code, the overfit model might still incorrectly predict an embedding for the code chunk it memorized, or worse, something entirely random. For an AI agent, this means it could keep trying to edit the wrong piece of code over and over.
How do we prevent overfitting when training change-localization embedding models?
- Limit repetition: We carefully track how often specific code chunks, files, or repositories appear in our training data, especially as positive examples, and cap their frequency so no single chunk is overexposed (see the sketch after this list).
- Augmentation: We generate synthetic variations of task descriptions and code histories, giving the model more diverse but still in-distribution examples.
- Standard tricks: We also use common ML techniques, such as regularization and weight freezing, that help prevent overfitting.
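As one illustration of the repetition limit, a cap on how many times any single chunk can appear as a positive might look like the following; the cap value and the positive_chunk field (from the schema sketched earlier) are hypothetical.

```python
from collections import Counter

def cap_positive_repeats(examples: list, max_per_chunk: int = 20) -> list:
    """Keep at most `max_per_chunk` training examples per distinct positive chunk,
    so no single popular utility function dominates the training set."""
    counts = Counter()
    kept = []
    for ex in examples:
        key = ex.positive_chunk  # assumes the hypothetical schema sketched earlier
        if counts[key] < max_per_chunk:
            counts[key] += 1
            kept.append(ex)
    return kept
```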
The future of efficient software engineering agents
Understanding how to translate intent into code is one thing; knowing where to change a vast codebase is another, often harder, challenge. We’ve trained a change localization system on commits and pull requests that predicts where code needs to change based on user intent and the changes made so far. With change localization as a tool, Solver acts with more precision and less wasted effort than alternatives, speeding up your software development lifecycle.
Discover the difference yourself. Solver is live and ready for you to try.