/Tech7h ago

Guide Details Local Coding Agents With Open-Weight Models Running 100% Locally

601.1K1641.1K45.9K

Original post

I put together a new article on setting up local coding agents with open-weight models. Everything runs 100% locally.

I thought it might be useful putting this together because many people asked me about my setup in the past, and I thought it would also motivate people to get started tinkering with local models for serious work (yes, things got incredibly capable this year with better LLMs and better harnesses).

So, here's a walkthrough of how to connect a local LLM to a local coding harness (could be Claude Code or Codex, which you may already be familiar with).

I also included some assessment notes that are useful as a checklist to select between and consider certain LLMs over others:

- Checking RAM usage at long contexts to see if the model is suitable for real work - Measuring prefill and decoding tok/sec to see whether it's fast enough to not be annoying - Making sure the model has sufficient tool-calling capabilities in theory - Assessing whether the model can solve some more challenging tasks when used in a coding harness.

Of course, there are always more specialized tools that can squeeze a bit more performance out of things, but I hope this is a good starter kit that stays flexible; that is you can easily switch to newer models as they are released or even tap into cloud models in your familiar harness if the current ones are not sufficient enough for a given task.

7:07 AM · Jun 27, 2026 · 40.2K Views

Sentiment

Many users praised the guide on local coding agents with open-weight models for its practical checklist and hands-on value, while a few noted overlooked issues like slow prefill latency and loud hardware strain.

Pos

80.0%

Neg

20.0%

25 comments with sentiment.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

Related links

Using Local Coding Agents

SEBASTIANRASCHKA.COMVia

#182

Posts from X

Most Activity

VIEWS5.7KBOOKMARKS89LIKES63RETWEETS12REPLIES3

Sebastian Raschka@rasbt

Link to the full article: https://magazine.sebastianraschka.com/p/using-local-coding-agents

Sebastian Raschka@rasbt

I put together a new article on setting up local coding agents with open-weight models. Everything runs 100% locally.

So, here's a walkthrough of how to connect a local LLM to a local coding harness (could be Claude Code or Codex, which you may already be familiar with).

I also included some assessment notes that are useful as a checklist to select between and consider certain LLMs over others:

7h5.7K6389

Rompel@ukrroot

@rasbt Local agents die on tool-call reliability and long-context recall, not raw quality. On a 4090 the wall isn't tok/s—it's KV cache at 32k+ when the agent dumps a whole repo into context. What model + harness are you on, and where does it top out?

6h9411

Sebastian Raschka@rasbt

@assafbar Yes, in a sense, harnesses let the models interact with your local env etc.

4h1561

Noctus@noctus91

@rasbt I'm not sure why someone would use Ollama over llama.cpp . OpenCode and Hermes Agent is also worth adding to the agentic harness.

6h2343

hacsceo@aihacs

@rasbt

4h311

Thack@DaveThackeray

@rasbt do you think we can essentially get rid of the IDE and go full agentic using local models on reasonable compute?

I'm not a SWE but I still want to play the game. I have a LOT of PM experience and I'm very much a tech-forward marketer.

6h1371

Qarts@QArtsQ

@rasbt Very amazing Article 🙌

6h1091

Marc Compere@Marc_Compere

@rasbt this is helpful

why is vLLM not included? perhaps it is new🤔

3h201

Sebastian Raschka@rasbt

@QArtsQ Thanks!!

6h1572

Sebastian Raschka@rasbt

@noctus91 It's been a few months but I didn't see a noticeable speed difference tbh. Otherwise, Ollama has usually a nice set of quantized models and easy to switch.

5h1911

Sebastian Raschka@rasbt

@DaveThackeray Maybe one can, but I like using an IDE

5h150

Nir Zabari‎@nirzabari

@rasbt Great writeup, thanks!

7h461

ptk@ptkbhv

@rasbt You are amazing!!

6h341

Slop to Signal@SlopToSignal

@rasbt the RAM check tip alone saves so much suffering

people be loading 70b models on 32gb and wondering why their laptop sounds like a jet engine 💀

4h321

Daniel Tobi@DanielTobi0

@rasbt Perfect timing. This post came just when I needed it. Thank you @rasbt

6h261

Ali Asad@sye_voz

@rasbt Nice. Thanks to you for sharing.

6h241

Sebastian Raschka@rasbt

Good question. This was already a long article, and I wanted to focus more on the coding harness choices, so I picked the most convenient and flexible LLM serving tool. (Also, since it's not for serving things across multiple machines where we have to worry about batching etc. vLLM is nice but maybe a bit overkill). But yeah, fair point.

3h55

On chain terrorist 💥@Turbanwarrior01

@rasbt local dev until your laptop fan sounds like a jet engine

5h55

Mikhail Rogov@i_mika_el

@rasbt nice. tok/sec and RAM are the part people skip, but thats what decides if local coding agent gets used daily.

7h37

Sean Song@seansong

@rasbt The underrated piece is the harness staying local and model-agnostic. Then the LLM is just a runtime you can swap: same repo tools, same eval prompts, same traces, different weights. That is what makes local coding agents usable rather than a one-off weekend setup.

5h35