We analyzed per-task results on SWE-Bench Verified and noticed a pattern that aggregate leaderboard scores hide: many tasks failed by the top-performing model are consistently solved by other models.
For example, Claude Opus 4.5 solves the most tasks overall, but a significant number of tasks it fails are solved by other models like Sonnet or Gemini. The reverse is also true. This suggests strong task-level specialization that a single-model baseline cannot exploit.
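As a concrete illustration of the per-task comparison behind this observation, here is a minimal sketch; the file paths, result format, and model keys are hypothetical placeholders, not our actual pipeline.

```python
import json

def load_resolved(path: str) -> set[str]:
    """Load a JSON list of resolved SWE-Bench Verified task IDs (assumed format)."""
    with open(path) as f:
        return set(json.load(f))

# Hypothetical per-model result files: one list of resolved task IDs each.
resolved = {
    "opus-4.5": load_resolved("results/opus_4_5_resolved.json"),
    "sonnet":   load_resolved("results/sonnet_resolved.json"),
    "gemini":   load_resolved("results/gemini_resolved.json"),
}

top = max(resolved, key=lambda m: len(resolved[m]))
others = set().union(*(ids for m, ids in resolved.items() if m != top))

# Tasks the top model fails that at least one other model resolves.
recoverable = others - resolved[top]
print(f"{top} resolves {len(resolved[top])} tasks; "
      f"{len(recoverable)} of its failures are resolved by another model")
```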
We built a simple routing system to test this idea. Instead of training a new foundation model, we embed each problem description, assign it to a semantic cluster learned from a separate general coding dataset, and route the task to the model with the highest historical success rate in that cluster.
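A minimal sketch of the inference-time router, assuming the cluster centroids and per-cluster success-rate table already exist as offline artifacts; the embedding model, file paths, and model names below are illustrative, not the exact components we used.

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Assumed offline artifacts: centroids fitted on general coding data and a
# per-cluster table of historical model success rates. Paths are placeholders.
embedder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative model choice
centroids = np.load("artifacts/cluster_centroids.npy")  # shape: (n_clusters, dim)
with open("artifacts/cluster_success_rates.json") as f:
    success_rate = json.load(f)  # e.g. [{"opus-4.5": 0.78, "sonnet": 0.74}, ...]

def route(problem_description: str) -> str:
    """Pick a model using only the problem description and cluster statistics."""
    vec = embedder.encode([problem_description], normalize_embeddings=True)[0]
    cluster_id = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
    rates = success_rate[cluster_id]
    return max(rates, key=rates.get)

print(route("TypeError when serializing a queryset to JSON in the admin view"))
```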
Using this approach, the system exceeds single-model baselines on SWE-Bench Verified (75.6% versus ~74% for the best individual model).
A few clarifications up front: we did not train on SWE-Bench problems or patches. Clusters are derived from general coding data, not from SWE-Bench. SWE-Bench is used only to estimate per-cluster model success rates. At inference time, routing uses only the problem description and historical cluster statistics, with no repo execution or test-time search.
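To be explicit about that offline estimation step: per-cluster success rates are just resolve-rate averages over historical (task, model) outcomes, grouped by the cluster each task's description falls into. A toy sketch with an assumed schema:

```python
import pandas as pd

# Assumed schema: one row per (task, model) outcome on SWE-Bench Verified, plus
# the cluster that the task's problem description was assigned to.
runs = pd.DataFrame({
    "task_id":  ["t1", "t1", "t2", "t2", "t3", "t3"],
    "model":    ["opus-4.5", "sonnet", "opus-4.5", "sonnet", "opus-4.5", "sonnet"],
    "cluster":  [3, 3, 3, 3, 7, 7],
    "resolved": [True, False, False, True, True, True],
})

# Mean resolve rate per (cluster, model) becomes the routing table; the router
# simply picks the argmax model within the incoming task's cluster.
table = runs.groupby(["cluster", "model"])["resolved"].mean().unstack("model")
print(table)
print(table.idxmax(axis=1))  # best model per cluster
```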
The main takeaway is not the absolute number, but the mechanism. Leaderboard aggregates hide complementary strengths between models, and even simple routing can capture a higher performance ceiling than any single model.
We call this architecture Mixture of Models (MoM): a routing layer for LLM coding workflows that applies an embedding + clustering approach to general software-engineering data and then evaluates LLMs on each cluster to determine which model performs best there.
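Here is a sketch of the cluster-learning step under the same assumptions: a sentence-embedding model, k-means, and a toy corpus standing in for the general coding dataset. The real corpus is much larger and the number of clusters is tuned, not hard-coded.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy stand-in for the general coding corpus used to learn clusters.
corpus = [
    "Fix off-by-one error in the pagination helper",
    "Add retry with exponential backoff to the HTTP client",
    "Race condition when two workers write to the same cache key",
    "TypeError when serializing a datetime field to JSON",
    "Memory leak in the websocket connection pool",
    "CLI flag --verbose is ignored when a config file is present",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(corpus, normalize_embeddings=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
np.save("artifacts/cluster_centroids.npy", kmeans.cluster_centers_)  # reused by the router
```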