
We analyzed per-task results on SWE-Bench Verified and noticed a pattern that aggregate leaderboard scores hide: many tasks failed by the top-performing model are consistently solved by other models.

For example, Claude Opus 4.5 solves the most tasks overall, but a significant number of tasks it fails are solved by other models like Sonnet or Gemini. The reverse is also true. This suggests strong task-level specialization that a single-model baseline cannot exploit.

We built a simple routing system to test this idea. Instead of training a new foundation model, we embed each problem description, assign it to a semantic cluster learned from a separate general coding dataset, and route the task to the model with the highest historical success rate in that cluster.
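
Roughly, the routing step looks like this (a minimal sketch; the embedding model, the cosine-distance cluster assignment, and all names here are illustrative assumptions, not our exact implementation):

    import numpy as np

    def route(problem_description, embed, centroids, success_rates):
        # embed: callable str -> np.ndarray, any general-purpose embedding model
        # centroids: (n_clusters, dim) array learned from a separate coding dataset
        # success_rates: {cluster_id: {model_name: historical pass rate}}
        vec = embed(problem_description)
        # assign the task to the nearest semantic cluster by cosine similarity
        sims = centroids @ vec / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(vec))
        cluster_id = int(np.argmax(sims))
        # send the task to the model with the best historical pass rate in that cluster
        return max(success_rates[cluster_id], key=success_rates[cluster_id].get)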

Using this approach, the system exceeds single-model baselines on SWE-Bench Verified (75.6% versus ~74% for the best individual model).

A few clarifications up front: we did not train on SWE-Bench problems or patches. Clusters are derived from general coding data, not from SWE-Bench. SWE-Bench is used only to estimate per-cluster model success rates. At inference time, routing uses only the problem description and historical cluster statistics, with no repo execution or test-time search.

The main takeaway is not the absolute number, but the mechanism. Leaderboard aggregates hide complementary strengths between models, and even simple routing can capture a higher performance ceiling than any single model.


We propose a new architecture called Mixture of Models (MoM) to solve LLM routing for coding workflows. We use an embedding + clustering approach on SWE data and then evaluate LLMs on each cluster to find out which one is best. A rough sketch of the offline side is below.
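
Sketch of the offline side (cluster count, library choices, and variable names are assumptions, just to show the shape of it):

    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import KMeans

    def build_router_stats(coding_embeddings, eval_results, n_clusters=32):
        # coding_embeddings: (N, dim) embeddings of a general coding dataset
        # eval_results: iterable of (task_embedding, model_name, passed: bool)
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(coding_embeddings)
        totals = defaultdict(lambda: defaultdict(int))
        wins = defaultdict(lambda: defaultdict(int))
        for emb, model, passed in eval_results:
            c = int(kmeans.predict(np.asarray(emb).reshape(1, -1))[0])
            totals[c][model] += 1
            wins[c][model] += int(passed)
        # per-cluster historical success rate for each model
        success_rates = {c: {m: wins[c][m] / totals[c][m] for m in totals[c]}
                         for c in totals}
        return kmeans, success_rates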

cool!


Thanks!


thanks!


yea exactly!


Were you trying to build with 0.15? It was broken until recently. I'd like to preface that this is not a feature-ready OS, it's just a learning experiment!


Yes, good point. The README was AI-generated, not going to lie, and missed those details. It's updated now to work with 0.15!


Yes exactly, it's based on that!


This is not usable at all; it's just to show people that OSes are not mysterious things, and that at a bare-bones level they're quite simple.


We built our infra on Azure during a hackathon. It made sense at the time, so we stuck with it.

For a while, Container Apps worked fine. Then we launched our AI model router demo, and everything changed.

In just two days, we spent over $250 on GPU compute. Two uni students, a side project, and suddenly we were paying production-level bills.

Autoscaling was slow. Cold starts were bad. Costs were unpredictable.

Then I watched a talk from one of Modal’s founders about GPU infra. We gave Modal a try.

Now we’re running the same workloads for under $100, with fast autoscaling and no lag.

Azure was stable, but Modal gave us speed, control, and real cost efficiency.

Anyone else switch from Azure (or AWS/GCP) to Modal for AI workloads? What was your experience?


