We analyzed per-task results on SWE-Bench Verified and noticed a pattern that aggregate leaderboard scores hide: many tasks failed by the top-performing model are consistently solved by other models.
For example, Claude Opus 4.5 solves the most tasks overall, but a significant number of tasks it fails are solved by other models like Sonnet or Gemini. The reverse is also true. This suggests strong task-level specialization that a single-model baseline cannot exploit.
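As a concrete illustration of the per-task comparison behind this observation, here is a minimal sketch; the file paths, result format, and model keys are hypothetical placeholders, not our actual pipeline.

```python
import json

def load_resolved(path: str) -> set[str]:
    """Load a JSON list of resolved SWE-Bench Verified task IDs (assumed format)."""
    with open(path) as f:
        return set(json.load(f))

# Hypothetical per-model result files: one list of resolved task IDs each.
resolved = {
    "opus-4.5": load_resolved("results/opus_4_5_resolved.json"),
    "sonnet":   load_resolved("results/sonnet_resolved.json"),
    "gemini":   load_resolved("results/gemini_resolved.json"),
}

top = max(resolved, key=lambda m: len(resolved[m]))
others = set().union(*(ids for m, ids in resolved.items() if m != top))

# Tasks the top model fails that at least one other model resolves.
recoverable = others - resolved[top]
print(f"{top} resolves {len(resolved[top])} tasks; "
      f"{len(recoverable)} of its failures are resolved by another model")
```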
We built a simple routing system to test this idea. Instead of training a new foundation model, we embed each problem description, assign it to a semantic cluster learned from a separate general coding dataset, and route the task to the model with the highest historical success rate in that cluster.
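A minimal sketch of the inference-time router, assuming the cluster centroids and per-cluster success-rate table already exist as offline artifacts; the embedding model, file paths, and model names below are illustrative, not the exact components we used.

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Assumed offline artifacts: centroids fitted on general coding data and a
# per-cluster table of historical model success rates. Paths are placeholders.
embedder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative model choice
centroids = np.load("artifacts/cluster_centroids.npy")  # shape: (n_clusters, dim)
with open("artifacts/cluster_success_rates.json") as f:
    success_rate = json.load(f)  # e.g. [{"opus-4.5": 0.78, "sonnet": 0.74}, ...]

def route(problem_description: str) -> str:
    """Pick a model using only the problem description and cluster statistics."""
    vec = embedder.encode([problem_description], normalize_embeddings=True)[0]
    cluster_id = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
    rates = success_rate[cluster_id]
    return max(rates, key=rates.get)

print(route("TypeError when serializing a queryset to JSON in the admin view"))
```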
Using this approach, the system exceeds single-model baselines on SWE-Bench Verified (75.6% versus ~74% for the best individual model).
A few clarifications up front: we did not train on SWE-Bench problems or patches. Clusters are derived from general coding data, not from SWE-Bench. SWE-Bench is used only to estimate per-cluster model success rates. At inference time, routing uses only the problem description and historical cluster statistics, with no repo execution or test-time search.
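To be explicit about that offline estimation step: per-cluster success rates are just resolve-rate averages over historical (task, model) outcomes, grouped by the cluster each task's description falls into. A toy sketch with an assumed schema:

```python
import pandas as pd

# Assumed schema: one row per (task, model) outcome on SWE-Bench Verified, plus
# the cluster that the task's problem description was assigned to.
runs = pd.DataFrame({
    "task_id":  ["t1", "t1", "t2", "t2", "t3", "t3"],
    "model":    ["opus-4.5", "sonnet", "opus-4.5", "sonnet", "opus-4.5", "sonnet"],
    "cluster":  [3, 3, 3, 3, 7, 7],
    "resolved": [True, False, False, True, True, True],
})

# Mean resolve rate per (cluster, model) becomes the routing table; the router
# simply picks the argmax model within the incoming task's cluster.
table = runs.groupby(["cluster", "model"])["resolved"].mean().unstack("model")
print(table)
print(table.idxmax(axis=1))  # best model per cluster
```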
The main takeaway is not the absolute number, but the mechanism. Leaderboard aggregates hide complementary strengths between models, and even simple routing can capture a higher performance ceiling than any single model.
We call this architecture Mixture of Models (MoM): a routing layer for LLM coding workflows that applies an embedding + clustering approach to general software-engineering data and then evaluates LLMs on each cluster to determine which model performs best there.
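Here is a sketch of the cluster-learning step under the same assumptions: a sentence-embedding model, k-means, and a toy corpus standing in for the general coding dataset. The real corpus is much larger and the number of clusters is tuned, not hard-coded.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Toy stand-in for the general coding corpus used to learn clusters.
corpus = [
    "Fix off-by-one error in the pagination helper",
    "Add retry with exponential backoff to the HTTP client",
    "Race condition when two workers write to the same cache key",
    "TypeError when serializing a datetime field to JSON",
    "Memory leak in the websocket connection pool",
    "CLI flag --verbose is ignored when a config file is present",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(corpus, normalize_embeddings=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
np.save("artifacts/cluster_centroids.npy", kmeans.cluster_centers_)  # reused by the router
```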