I have completed my master's thesis, so it is now in the public domain. Here are its key observations.
Kalle Tolonen
June 12, 2026
The quality of the local models' work varies greatly. It is good enough on tightly scoped tasks, but it fails on larger ones. It remains to be seen whether tomorrow's local models can be quantized to a point where they compete with the larger models of today. A big part of that is probably extending the context via permanent memory, with some sort of solution that becomes available in the future. My current take is that local models can serve a purpose for small tasks, but they require an order of magnitude more babysitting than their commercial rivals. Having the models available in the Visual Studio Code Agents panel removes most of the friction of using a proprietary plugin, and it actually seems to work better than the Continue.dev tool that was used in this thesis as a comparison. The key benefit for developers is that the Agents panel is readily available and something they are well accustomed to using. The cost of switching between models, experimenting with them and bolting on future models is thus quite negligible. I have been driven to the conclusion that there is currently no magic bullet that will make local, quantized models perform as well as the billion-dollar models offered by commercial vendors on agentic coding tasks.
The empirical data presented in the diary section of this thesis, across different kinds of development tasks and workflows, highlights a stark difference between operating a locally run, quantized model (such as the Gemma 4 31 billion parameter model and the Qwen-series models) and proprietary frontier models that are run on a remote server by the service provider. Local models do meet the criteria for keeping data private, and they let users safely operate on repositories that are not to be shared with cloud service providers (end customer approval is always needed for specific tooling). On the other hand, the models have severe limitations in the quality and speed of their work when compared to the frontier labs' models.
| Technical metric | Locally run agentic loop | Frontier model |
|---|---|---|
| Multi-file context | Poor performance – context drifts and file paths get lost | Great performance – models can keep the paths in context and reason across the codebase better |
| Code correctness | Variable – models sometimes produce garbled output | Consistent – models produce quality edits without omitting code |
| Debugging with error message | Weak – even simple type mismatch bugs are too complex to handle | Great performance – successfully realizes that a type mismatch is at fault |
In the product management use case for model evaluation (pm-gosvelte customization), local models show blindness when they encounter stale documentation: they do not have enough context or reasoning ability to check the actual implementation for correctness. Local models lack the size to resolve contradictions when they are dealing with a large enough codebase. They are prone to hallucinations when they rely on outdated claims in context files. Frontier models, on the other hand, can use their bigger parameter scale to check documentation against the actual implementation with their tools. The key takeaway is that project documentation must be pristine and non-contradictory for local models to be able to operate with it. It also helps if the documentation is kept small, since each character consumes part of the meager token capacity that locally run models have on developer machines, and thus limits their usability.
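To make the point about documentation size concrete, here is a rough sketch of how one could check how much of a local model's context window the context files would consume. The ratio of 4 characters per token is a rule-of-thumb assumption, not a measurement from the thesis, and real tokenizers vary by model and language. The function names, the 8192-token window and the 25 % share are hypothetical values picked for illustration.

```python
import math

# Assumption: roughly 4 characters per token. Real tokenizers differ per model.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate based on character count only."""
    return math.ceil(len(text) / chars_per_token)


def context_share(docs: dict[str, str], context_window: int) -> dict[str, float]:
    """Return the estimated fraction of the context window each document uses."""
    return {
        name: estimate_tokens(body) / context_window
        for name, body in docs.items()
    }


def fits_budget(docs: dict[str, str], context_window: int,
                max_share: float = 0.25) -> bool:
    """Check that all documents together stay under max_share of the window."""
    total = sum(estimate_tokens(body) for body in docs.values())
    return total <= context_window * max_share


if __name__ == "__main__":
    # Hypothetical context files; in practice you would read them from disk.
    docs = {"README.md": "x" * 4000, "ARCHITECTURE.md": "x" * 6000}
    for name, share in context_share(docs, context_window=8192).items():
        print(f"{name}: about {share:.0%} of the context window")
    print("Within budget:", fits_budget(docs, context_window=8192))
```

If the documentation does not fit the budget, that is a hint to trim it or split it before handing it to a local model.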
The smoothness of the developer experience is key to having people actually use local models, on top of the non-negotiable requirement that the models are actually useful. When we compare a frontier model that costs a business a few hundred euros per month to run with a local model that takes a certain amount of time to set up, each hour of setup is waste and a direct expense that may or may not be billable work, depending on project agreements. The empirical results reveal a caveat: the cognitive win of having an AI assistant at the ready must be balanced against the amount of handholding a local model needs. A human development engineer or product manager has to monitor context, provide absolute paths and split tasks into atomic subtasks while accounting for the correctness of the output, and that requires a feel for the model's capabilities (and an understanding of the project in general). In other words, locally run, quantized models are not a great fit for vibe coding in their current state. On the other hand, they are good companions for simple, scoped and well-defined tasks, such as linter error corrections, where you can leave them working during the breaks humans need to take, or overnight, for the negligible price of the electricity. If they do not manage to solve the task, you can always move it to a more capable actor, be it a human engineer or a frontier model.
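The "try locally first, then escalate" workflow described above can be sketched as a small loop. This is only an illustration under my own assumptions: `Actor`, `solve_with_escalation` and the `verify` callback are hypothetical names, not part of any tool used in the thesis. In practice `attempt` would call whatever drives the model, and `verify` could be a linter or test run.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Actor:
    """A hypothetical worker: a local model, a frontier model or a human queue."""
    name: str
    attempt: Callable[[str], str]  # takes a task description, returns a result


def solve_with_escalation(
    task: str,
    actors: Sequence[Actor],
    verify: Callable[[str], bool],
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Try actors from cheapest to most capable until one result verifies.

    Returns (name of the actor that succeeded, its result, names tried).
    Both of the first two values are None if nobody produced a verified result.
    """
    tried: list[str] = []
    for actor in actors:
        tried.append(actor.name)
        try:
            result = actor.attempt(task)
        except Exception:
            # A crashed attempt counts as a failure; move on to the next actor.
            continue
        if verify(result):
            return actor.name, result, tried
    return None, None, tried


if __name__ == "__main__":
    # Stub actors standing in for real model calls.
    local = Actor("local-model", lambda task: "still has lint errors")
    frontier = Actor("frontier-model", lambda task: "lint clean")
    winner, result, tried = solve_with_escalation(
        "fix linter errors in module X",
        [local, frontier],
        verify=lambda output: output == "lint clean",
    )
    print(winner, tried)
```

The important design choice is that verification is automatic and independent of the model, so an unattended overnight run cannot silently accept a bad result.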
There could be possibilities to improve the hit rate of the local models by applying the harnesses and guardrails mentioned by Zambelli (2026), if you are willing to install more third-party dependencies on your machine. This could help with long-context tasks, where inaccuracies compound. There is also a reasonable chance of successfully implementing a RAG approach for Visual Studio Code's native Agents panel, which could help with the context size limitations of locally run models.
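To illustrate why retrieval could ease the context limits, here is a minimal sketch of the retrieval step only: pick the few documentation chunks most relevant to a question, instead of loading everything into the context. It does not use any Visual Studio Code or Agents panel API, and it swaps the embeddings a real RAG setup would typically use for plain bag-of-words cosine similarity, just to stay dependency-free. All names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to k chunks most similar to the query, best match first."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(chunk))), index)
        for index, chunk in enumerate(chunks)
    ]
    # Highest score first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[index] for score, index in scored[:k] if score > 0]


if __name__ == "__main__":
    # Hypothetical documentation chunks.
    chunks = [
        "The login handler validates the session token.",
        "Svelte components render the project board.",
        "Database migrations live in the migrations folder.",
    ]
    for chunk in retrieve("where is the session token validated", chunks, k=1):
        print(chunk)
```

Only the retrieved chunks would then be placed in the prompt, which keeps the token use small regardless of how large the documentation grows.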