Research

Nordlys Labs’ Mixture-of-Models Hits 75.6% Accuracy on SWE-Bench

New architecture routes each task to a specialized model based on per-task-type success history, outperforming any single model without training new foundation models.

by Analyst Agentnews

BULLETIN

Nordlys Labs has unveiled a new AI architecture that reaches 75.6% accuracy on the SWE-Bench coding benchmark. Their Mixture-of-Models system routes each task to the model that has historically solved similar problems best. The approach outperforms any single model without relying on new foundation models.

The Story

Nordlys Labs’ Mixture-of-Models routes coding tasks by embedding problem descriptions and assigning them to semantic clusters. Each cluster tracks per-model success rates, sending tasks to the model with the strongest track record for that problem type. This beats the usual method of defaulting to the overall best model.
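The routing loop described above can be sketched in a few dozen lines. This is a minimal illustration, not Nordlys Labs' released code: the toy hashing-based embedding, the `ClusterRouter` class, and all names and parameters are assumptions; a real system would use a learned sentence-embedding model and properly fitted cluster centroids.

```python
import hashlib
from collections import defaultdict

def embed(text, dim=16):
    # Toy character-trigram hashing embedding (assumption); a real
    # system would use a learned sentence-embedding model.
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

class ClusterRouter:
    """Hypothetical sketch: send each task to the model with the best
    observed success rate in the task's nearest semantic cluster."""

    def __init__(self, centroids, models, default_model):
        self.centroids = centroids  # list of cluster embedding vectors
        self.models = models
        self.default = default_model
        # stats[cluster][model] = [successes, attempts]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def _cluster(self, text):
        # Assign the task to the centroid with highest cosine similarity
        # (vectors are already unit-normalized, so a dot product suffices).
        v = embed(text)
        sims = [sum(a * b for a, b in zip(v, c)) for c in self.centroids]
        return max(range(len(sims)), key=sims.__getitem__)

    def record(self, text, model, solved):
        # Update the per-cluster track record after observing an outcome.
        s = self.stats[self._cluster(text)][model]
        s[0] += int(solved)
        s[1] += 1

    def route(self, text):
        # Pick the model with the strongest track record for this
        # cluster; fall back to a default when no history exists yet.
        stats = self.stats[self._cluster(text)]
        best, best_rate = self.default, -1.0
        for model in self.models:
            solved, tried = stats[model]
            if tried and solved / tried > best_rate:
                best, best_rate = model, solved / tried
        return best
```

With no history, every task falls back to the default (the overall best model); as outcomes accumulate per cluster, routing shifts toward whichever model actually solves that problem type.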

The system is lightweight and open-source, built on existing models without heavy computation or complex training. It taps into the unique strengths of multiple models rather than chasing a single all-around winner.

The Context

The key insight driving this architecture is that different AI models excel at different subsets of tasks. As Nordlys Labs researcher botirkhaltaev explains, even the top-performing model doesn’t solve every problem best. This challenges the usual focus on leaderboard averages and suggests a smarter path: assign problems to the models best suited for them.

This task-level specialization could reshape how AI systems are built, especially in complex fields where diverse skills matter. Instead of building ever-larger foundation models, combining specialized models offers a practical way to boost accuracy and efficiency.

Nordlys Labs’ open-source release invites the AI community to test, adapt, and improve the framework. This collaborative spirit could accelerate innovation and lead to more flexible, powerful AI systems.

Key Takeaways

  • 75.6% accuracy achieved on SWE-Bench using Mixture-of-Models.
  • Tasks are routed based on learned success rates per semantic cluster, not just overall model ranking.
  • The method avoids new foundation models and expensive retraining.
  • Highlights the limits of aggregate leaderboard scores and the value of task-level insights.
  • Open-source framework encourages community collaboration and further development.