Running Modern Edge LLMs on an RTX PRO 4000 Blackwell Source

June 10, 2026 Markdown source

1---
2title: "Running Modern Edge LLMs on an RTX PRO 4000 Blackwell"
3date: "2026-06-10"
4published: true
5tags: ["ai", "llm", "local-ai", "ollama", "gguf", "nvidia", "blackwell", "gemma", "qwen", "hermes", "vscode", "continue"]
6author: "Gavin Jackson"
7excerpt: "I wanted to know whether a sub-$3k Australian workstation GPU could run useful modern local LLMs. The short answer is yes, if you pick the right model, the right quant, and build around the 24GB VRAM limit instead of pretending it is a tiny cloud."
8---
9
10# Running Modern Edge LLMs on an RTX PRO 4000 Blackwell
11
12I have been spending more time than is probably healthy thinking about local AI.
13
14Part of that is curiosity. Part of it is cost. If LLM pricing keeps moving upward, or if the useful models increasingly sit behind subscriptions, enterprise plans, and usage tiers, then relying entirely on cloud inference starts to feel fragile.
15
16The other part is network reality. Not every environment can connect directly to the Internet. Some developer networks should not have broad outbound access. Some codebases should not leave the building. Some organisations need AI assistance inside a controlled workstation environment, not hanging off whatever API a developer happened to sign up for last Tuesday night.
17
18So I bought an NVIDIA RTX PRO 4000 Blackwell card, installed it in an Ubuntu 26.04 workstation, and started testing what modern edge LLMs look like on hardware that is expensive, but not absurd. In Australia, I have seen the 24GB RTX PRO 4000 Blackwell listed under AUD 3,000. That is not pocket change, but it is very different from buying a datacentre GPU or building a multi-card inference box.
19
20The key constraint is simple:
21
22> The model needs to fit inside 24GB of VRAM and still leave enough headroom to be usable.
23
24That one sentence drives almost every practical decision.
25
26## The Hardware Reality
27
28The card I am talking about is the [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/), with 24GB of GDDR7 ECC memory. My test machine is running Ubuntu 26.04, which is the same base I would use for a managed developer workstation rollout. It is a workstation GPU, not a gaming flagship, and that matters. The attraction is not only raw speed. It is the combination of 24GB VRAM, a professional driver stack, sane power and cooling, and the ability to put useful local inference into a developer workstation without turning the desk into a small server room.
29
30On 24GB, you are not running giant 70B dense models comfortably. You are not replacing frontier cloud models for every task. But you can run genuinely capable 12B, 26B, 27B, 31B, and sparse 35B-class models if you choose the right quantized files and keep context length under control.
31
32That last bit matters. VRAM is not just model weights. You also need memory for the KV cache, runtime overhead, and whatever your inference engine is doing around the edges. A model file that is 23.8GB is technically under 24GB, but that does not mean it is a good daily-driver choice on a 24GB card.
33
34I learned to aim for breathing room.
35
36## Picking Models That Actually Fit
37
38There are two model families I keep coming back to for this size of card.
39
40The first is **Gemma 4**. I have written previously about [why Gemma 4's Apache 2.0 license matters](/post/gemma-4-apache-2-license-matters), and the licensing still matters here. It is much easier to have a serious enterprise conversation about local models when the legal posture is boring and understandable.
41
42The second is **Qwen 3.6**, especially for coding and agentic workflows. The [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) describes a 35B total parameter model with only 3B activated per token. That is the important trick. It is a Mixture-of-Experts style model, so you get a larger pool of parameters without paying the full dense-model cost on every token.
43
44That does not make it free. It still needs memory, and the quantization still matters. But it is exactly the kind of architecture that makes edge inference interesting.
45
46### Parameter Sizes In Plain English
47
48Parameter count is not a perfect measure of intelligence, but it is a good first approximation of memory pressure.
49
50- **E2B / E4B / small models**: fast, easy to run, good for classification, small summaries, simple drafting, and autocomplete-style jobs.
51- **12B class**: the first size where local chat starts to feel genuinely useful. A well-quantized 12B model can be responsive and surprisingly capable.
52- **26B / 27B class**: the sweet spot for a 24GB workstation card if you want better reasoning and code review without falling off a performance cliff.
53- **31B dense models**: possible with aggressive quantization, but you need to be realistic about context size and speed.
54- **35B sparse MoE models**: interesting because only part of the model is active per token. Qwen3.6-35B-A3B is the obvious example for coding.
55
56The raw model is only the starting point. The version you actually run locally is usually a quantized release.
57
58## Quantization Is The Whole Game
59
60A model published in BF16 or FP16 precision is often too large for workstation VRAM. Very roughly, 16-bit weights need about two bytes per parameter before you even think about runtime overhead. A 26B model can therefore be around 50GB in full precision. That is not going to live inside a 24GB card.
61
62Quantization reduces the precision of the weights so the model uses less memory. The trade-off is quality. A good quant loses surprisingly little. A bad or too-aggressive quant can make the model feel vague, brittle, or weirdly forgetful.
63
64In the GGUF world, you will see names like:
65
66- `Q8_0`: large, high quality, often too big for this card once overhead is included.
67- `Q6_K`: very good quality, but can be tight on 24GB with larger models.
68- `Q5_K_M`: a nice middle ground when Q6 is too large.
69- `Q4_K_M`: often the practical default for 24GB cards.
70- `IQ4_NL`, `IQ4_XS`, `IQ3_*`: newer or more aggressive options that can make awkward models fit.
71
72For example, Bartowski's [Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) page lists a full BF16 file at about 50.5GB, a `Q6_K_L` quant at about 23GB, and a `Q4_K_M` quant at about 17GB. On paper the Q6 file fits. In practice, the Q4 or Q5 variants are more comfortable because they leave room for context.
73
74For Qwen3.6, the [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) releases include 4-bit files around the 20GB mark. That is exactly why this model is interesting on a 24GB card: it is large enough to be useful, but the right quant still fits.
75
76The rule I have settled on is simple:
77
78> Pick a quant at least 1-2GB smaller than your VRAM, and preferably more if you want larger context windows.
79
80## Step 1: Download The GGUF File
81
82GGUF is a local model file format used heavily by `llama.cpp` and tools built around that ecosystem. Hugging Face describes it as a binary format optimised for quick loading and inference, with both tensors and model metadata stored in the file.
83
84That metadata part matters. A GGUF file is not just a bag of numbers. It normally includes enough information for local inference tools to understand the model architecture, tokeniser, quantization type, and prompt format.
85
86In a connected environment you can download directly:
87
88```bash
89python -m pip install -U "huggingface_hub[cli]"
90
91mkdir -p ~/models/qwen3.6
92cd ~/models/qwen3.6
93
94huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
95  --include "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
96  --local-dir .
97```
98
99For Gemma 4:
100
101```bash
102mkdir -p ~/models/gemma4
103cd ~/models/gemma4
104
105huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
106  --include "gemma-4-26B-A4B-it-Q4_K_M.gguf" \
107  --local-dir .
108```
109
110In a restricted environment, I would not have each workstation randomly pulling files from Hugging Face. I would download once from a controlled machine, record the hash, put the GGUF into an internal artefact store, and use an orchestration tool such as Ansible to push the approved model files and configuration out to the endpoints.
111
112That gives you:
113
114- repeatability
115- provenance
116- hash verification
117- no surprise model swaps
118- no direct developer workstation Internet dependency
119- a controlled way to replace or retire models across the fleet
120
121## Step 2: Install Ollama
122
123[Ollama](https://ollama.com/) is still the easiest way to get local LLMs running on a workstation. It wraps the model runtime, exposes a local API, and gives you a clean CLI for testing.
124
125On a normal Linux workstation:
126
127```bash
128curl -fsSL https://ollama.com/install.sh | sh
129```
130
131In the sort of environment I care about, that installer should be mirrored or packaged internally. The workstation should install Ollama from an approved repository, not from a random curl pipe to the Internet.
132
133Once installed, check that it is alive:
134
135```bash
136ollama --version
137curl http://127.0.0.1:11434
138```
139
140You should see:
141
142```text
143Ollama is running
144```
145
146I also keep Ollama bound to localhost by default. A local model endpoint should not casually become a LAN service.
147
148## Step 3: Create A Modelfile And Load The Model
149
150Ollama can import a local GGUF with a `Modelfile`. The minimal version is almost comically small:
151
152```text
153FROM /opt/models/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
154
155PARAMETER num_ctx 8192
156PARAMETER temperature 0.6
157PARAMETER top_p 0.95
158```
159
160Save that as:
161
162```bash
163/opt/models/qwen3.6/Modelfile
164```
165
166Then create the Ollama model:
167
168```bash
169ollama create qwen36-local -f /opt/models/qwen3.6/Modelfile
170```
171
172For Gemma:
173
174```text
175FROM /opt/models/gemma4/gemma-4-26B-A4B-it-Q4_K_M.gguf
176
177PARAMETER num_ctx 8192
178PARAMETER temperature 0.7
179PARAMETER top_p 0.9
180```
181
182Then:
183
184```bash
185ollama create gemma4-local -f /opt/models/gemma4/Modelfile
186```
187
188Ollama's [Modelfile reference](https://docs.ollama.com/modelfile) is worth reading because the file is where you can set defaults like context length, stop tokens, temperature, and the system prompt.
189
190## Step 4: Test With Ollama Directly
191
192Before wiring this into anything else, test the model directly.
193
194```bash
195ollama list
196ollama run qwen36-local
197```
198
199Then give it something real:
200
201```text
202You are reviewing a Python function for production readiness.
203Find bugs, security issues, missing tests, and readability problems.
204Be concise.
205```
206
207For a one-shot CLI test:
208
209```bash
210ollama run qwen36-local "Write a PHP function that safely renders a tag link for a blog post."
211```
212
213And check memory usage:
214
215```bash
216ollama ps
217nvidia-smi
218```
219
220This is where the theory becomes visible. If the model spills out of VRAM, gets partially offloaded, or has an over-large context window, you will feel it immediately. Token generation slows down, prompts take longer to process, and the whole thing stops feeling like a useful assistant.
221
222For coding, Qwen3.6 has felt more natural to me than Gemma. It is more comfortable with repository-shaped tasks, code review, and tool-like instructions. Gemma 4 is still a strong general local assistant, especially when licensing and broader local deployment matter.
223
224## Step 5: Get Hermes Agent Running
225
226Once Ollama is working, the next question is: what do you actually do with it?
227
228Chatting in a terminal is useful for testing, but it is not an agent. That is where [Hermes Agent](https://hermes-agent.nousresearch.com/) becomes interesting.
229
230The normal Hermes install path is a script, but in a locked-down workstation environment I prefer to be explicit:
231
232```bash
233sudo apt install python3 python3-venv nodejs npm
234
235python3 -m venv ~/.venvs/hermes-agent
236source ~/.venvs/hermes-agent/bin/activate
237
238python -m pip install --upgrade pip
239pip install hermes-agent
240```
241
242In our environment, package dependencies come from a local PyPI mirror, so the install looks more like this:
243
244```bash
245pip config set global.index-url https://pypi-mirror.example.internal/simple
246pip config set global.trusted-host pypi-mirror.example.internal
247
248pip install hermes-agent
249```
250
251Then point Hermes at Ollama's OpenAI-compatible endpoint:
252
253```bash
254hermes model
255```
256
257Choose a custom or self-hosted endpoint, then use:
258
259```text
260Base URL: http://127.0.0.1:11434/v1
261API key:  leave blank, or use a local placeholder if the UI insists
262Model:    qwen36-local
263```
264
265Hermes stores its configuration under `~/.hermes/`, which makes it fairly easy to inspect and manage. The Hermes provider docs note that custom/self-hosted endpoints work when they expose an OpenAI-compatible `/v1/chat/completions` API, which Ollama does.
266
267### Breakout: Hermes Gives Me An Agent
268
269This is the point where the setup stops feeling like "I installed a model" and starts feeling like "I have a local assistant."
270
271I have called mine **Frank**.
272
273Frank is an arrogant Frenchman who is constantly criticising my code and getting upset with me for wasting his time. This sounds ridiculous, but it is genuinely useful. A local model with a memorable persona is easier to work with than a bland text box. Frank has a job: read my code, complain about the parts that deserve complaint, and begrudgingly suggest better approaches.
274
275The important part is not the accent. The important part is continuity and role.
276
277Hermes gives the model:
278
279- a persistent assistant shell
280- memory across sessions
281- tool access
282- channel support
283- configuration that survives the terminal window
284- the ability to behave like an agent rather than a single prompt-response loop
285
286That distinction matters. Ollama serves the model. Hermes gives it a place to live.
287
288I do not want developers thinking about local AI as "open a terminal and ask a model a question." I want them thinking: "I have a workstation assistant that can inspect code, remember project conventions, and operate inside the security boundary of this machine."
289
290Frank is rude about it, obviously. But that is his burden.
291
292## Step 6: Get VS Code Working With Continue
293
294The next piece was VS Code.
295
296Ollama now has a native VS Code integration path through GitHub Copilot Chat. The awkward bit is that VS Code requires a login for the model selector even when you are using custom local models. Ollama's own [VS Code integration documentation](https://docs.ollama.com/integrations/vscode) says this does not require a paid Copilot account, but it still requires the login path.
297
298In a locked-down or offline-first environment, that is not what I want.
299
300So the workaround is [Continue](https://docs.continue.dev/guides/ollama-guide).
301
302Install VS Code from the internal mirror of Microsoft's repository, then install the Continue extension from your approved extension mirror or a vetted `.vsix`:
303
304```bash
305sudo apt install code
306code --install-extension Continue.continue
307```
308
309For an offline extension package:
310
311```bash
312code --install-extension continue-extension.vsix
313```
314
315Then configure Continue to use Ollama. A minimal `~/.continue/config.yaml` can look like this:
316
317```yaml
318name: Local AI Workstation
319version: 0.0.1
320schema: v1
321models:
322  - name: Qwen 3.6 Local
323    provider: ollama
324    model: qwen36-local
325    apiBase: http://127.0.0.1:11434
326    roles:
327      - chat
328      - edit
329      - apply
330    capabilities:
331      - tool_use
332
333  - name: Gemma 4 Local
334    provider: ollama
335    model: gemma4-local
336    apiBase: http://127.0.0.1:11434
337    roles:
338      - chat
339      - edit
340      - apply
341
342  - name: Small Local Autocomplete
343    provider: ollama
344    model: qwen2.5-coder:1.5b
345    apiBase: http://127.0.0.1:11434
346    roles:
347      - autocomplete
348```
349
350The model names must match `ollama list`. If Ollama knows the model as `qwen36-local:latest`, use that exact name.
351
352Continue's Ollama guide also supports autodetection:
353
354```yaml
355models:
356  - name: Autodetect
357    provider: ollama
358    model: AUTODETECT
359    roles:
360      - chat
361      - edit
362      - apply
363      - autocomplete
364```
365
366That is handy for experimentation. For a managed developer fleet, I would rather be explicit.
367
368### Breakout: What VS Code Is Doing With The Local Model
369
370The exciting thing about this setup is that VS Code does not use the model in only one way.
371
372There are at least three different coding patterns here.
373
374**Autocomplete** is the fast path. The model sees the current file and nearby context, then predicts the next few lines. This wants a smaller, faster model. It should feel instant. A huge reasoning model is usually the wrong choice.
375
376**Question and answer** is the chat path. This is where a developer asks, "Why is this failing?", "What does this function do?", or "Where should I add the validation?" This can use a larger model because latency matters less than understanding.
377
378**Code synthesis and edit/apply** is the workflow path. The assistant proposes changes, rewrites a function, generates tests, or applies an edit across a file. This is where Qwen3.6 starts to make sense because repository reasoning and coding fluency matter more than raw chat pleasantness.
379
380That role separation is important. Local AI in an IDE should not be one model doing everything badly. It should be a small fast model for completion, a stronger model for chat, and a code-capable model for synthesis.
381
382I am genuinely excited to push this out to the devs and see what they do with it. Not because I expect local models to replace every cloud tool on day one, but because the workflow changes when the assistant is local, fast enough, and available inside the editor without sending code to an external API.
383
384## Offline And Controlled Environments
385
386The restricted-network angle is not an afterthought. It is one of the main reasons this experiment matters.
387
388For a serious workstation rollout, I would mirror or internally host:
389
390- Ollama packages
391- approved GGUF model files
392- model hashes and metadata
393- Python packages through a local PyPI mirror
394- Node packages through an internal npm mirror if needed
395- VS Code packages from a Microsoft repository mirror
396- approved VS Code extensions as `.vsix` artefacts
397- Continue configuration templates
398- Hermes configuration templates
399- Ansible roles or playbooks that install the approved runtime, place the approved models, verify hashes, and keep endpoint configuration consistent
400
401That gives developers a usable AI workstation without granting every machine direct Internet access.
402
403It also gives the organisation a model governance story. You can say:
404
405- these are the approved model files
406- these are the hashes
407- this is the license
408- this is the intended use
409- this is the runtime
410- this is the endpoint binding
411- this is how updates are tested
412- this is the orchestration path that pushes the approved model set to endpoints
413
414That is much better than "everyone install whatever model Reddit likes this week."
415
416## What Worked
417
418The RTX PRO 4000 Blackwell is a very practical local inference card for the 24GB class.
419
420The models that felt worth using were not tiny toys. Gemma 4 26B-class and Qwen3.6 35B-A3B-class quantized models are capable enough for real drafting, explanation, code review, and assistant workflows.
421
422Ollama made the runtime easy. GGUF made the model artefacts portable. Hermes made the setup feel like an assistant. Continue made VS Code local-model integration workable without going through the Copilot login path.
423
424The biggest lesson was that local AI is not one component. It is a stack:
425
426```text
427approved GGUF file
428  -> Ollama local runtime
429  -> Hermes for agent workflows
430  -> Continue for IDE workflows
431  -> internal mirrors for repeatable offline deployment
432```
433
434When those pieces line up, the experience becomes surprisingly normal.
435
436## What Did Not Work Perfectly
437
438The 24GB limit is real. You can make very large models load by using more aggressive quants or offloading to system RAM, but there is a point where the experience stops being pleasant.
439
440Context length is the silent killer. A model that works beautifully at 8K context might become sluggish when a tool tries to push it to 32K or beyond. For local coding assistants, this matters because IDE tools love stuffing prompts with file context, diffs, terminal output, and instructions.
441
442Tool support is also uneven. A model may claim tool support, but the integration may still behave strangely. This is why I like testing directly in Ollama first, then Hermes, then Continue. Each layer adds value, but each layer also adds another place to misconfigure the model.
443
444## The Bottom Line
445
446The RTX PRO 4000 Blackwell has changed how I think about local AI workstations.
447
448This is not about beating frontier models. It is about having a capable local baseline that works when cloud access is expensive, inappropriate, unavailable, or simply unnecessary.
449
450For under AUD 3,000, a 24GB workstation GPU can run modern edge LLMs that are good enough to matter. Gemma 4 gives me a strong general local model with a clean licensing story. Qwen3.6 gives me a better coding-oriented model for repository work. Ollama makes them easy to serve. Hermes turns them into an agent. Continue brings them into VS Code.
451
452That feels like the shape of the next developer workstation:
453
454local by default, cloud when allowed, governed by design, and useful enough that developers will actually reach for it.
455
456Frank still thinks my code is beneath him.
457
458Unfortunately, he is often right.
459
460---
461
462## References
463
464- [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/)
465- [Scorptec RTX PRO 4000 Blackwell listing](https://www.scorptec.com.au/product/graphics-cards/workstation/118874-900-5g147-2570-000)
466- [Gemma 4 model overview](https://ai.google.dev/gemma/docs/core)
467- [My earlier post: Why Gemma 4's Apache 2.0 License Matters](/post/gemma-4-apache-2-license-matters)
468- [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
469- [Bartowski Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
470- [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF)
471- [Hugging Face GGUF documentation](https://huggingface.co/docs/hub/gguf)
472- [Hugging Face: use Ollama with GGUF models](https://huggingface.co/docs/hub/ollama)
473- [Ollama importing models](https://docs.ollama.com/import)
474- [Ollama Modelfile reference](https://docs.ollama.com/modelfile)
475- [Ollama VS Code integration](https://docs.ollama.com/integrations/vscode)
476- [Continue Ollama guide](https://docs.continue.dev/guides/ollama-guide)
477- [Hermes Agent documentation](https://hermes-agent.nousresearch.com/docs/)
478- [Hermes Agent AI providers](https://hermes-agent.nousresearch.com/docs/integrations/providers)
479

---
title: "Running Modern Edge LLMs on an RTX PRO 4000 Blackwell"
date: "2026-06-10"
published: true
tags: ["ai", "llm", "local-ai", "ollama", "gguf", "nvidia", "blackwell", "gemma", "qwen", "hermes", "vscode", "continue"]
author: "Gavin Jackson"
excerpt: "I wanted to know whether a sub-$3k Australian workstation GPU could run useful modern local LLMs. The short answer is yes, if you pick the right model, the right quant, and build around the 24GB VRAM limit instead of pretending it is a tiny cloud."
---

# Running Modern Edge LLMs on an RTX PRO 4000 Blackwell

I have been spending more time than is probably healthy thinking about local AI.

Part of that is curiosity. Part of it is cost. If LLM pricing keeps moving upward, or if the useful models increasingly sit behind subscriptions, enterprise plans, and usage tiers, then relying entirely on cloud inference starts to feel fragile.

The other part is network reality. Not every environment can connect directly to the Internet. Some developer networks should not have broad outbound access. Some codebases should not leave the building. Some organisations need AI assistance inside a controlled workstation environment, not hanging off whatever API a developer happened to sign up for last Tuesday night.

So I bought an NVIDIA RTX PRO 4000 Blackwell card, installed it in an Ubuntu 26.04 workstation, and started testing what modern edge LLMs look like on hardware that is expensive, but not absurd. In Australia, I have seen the 24GB RTX PRO 4000 Blackwell listed under AUD 3,000. That is not pocket change, but it is very different from buying a datacentre GPU or building a multi-card inference box.

The key constraint is simple:

> The model needs to fit inside 24GB of VRAM and still leave enough headroom to be usable.

That one sentence drives almost every practical decision.

## The Hardware Reality

The card I am talking about is the [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/), with 24GB of GDDR7 ECC memory. My test machine is running Ubuntu 26.04, which is the same base I would use for a managed developer workstation rollout. It is a workstation GPU, not a gaming flagship, and that matters. The attraction is not only raw speed. It is the combination of 24GB VRAM, a professional driver stack, sane power and cooling, and the ability to put useful local inference into a developer workstation without turning the desk into a small server room.

On 24GB, you are not running giant 70B dense models comfortably. You are not replacing frontier cloud models for every task. But you can run genuinely capable 12B, 26B, 27B, 31B, and sparse 35B-class models if you choose the right quantized files and keep context length under control.

That last bit matters. VRAM is not just model weights. You also need memory for the KV cache, runtime overhead, and whatever your inference engine is doing around the edges. A model file that is 23.8GB is technically under 24GB, but that does not mean it is a good daily-driver choice on a 24GB card.

I learned to aim for breathing room.

## Picking Models That Actually Fit

There are two model families I keep coming back to for this size of card.

The first is **Gemma 4**. I have written previously about [why Gemma 4's Apache 2.0 license matters](/post/gemma-4-apache-2-license-matters), and the licensing still matters here. It is much easier to have a serious enterprise conversation about local models when the legal posture is boring and understandable.

The second is **Qwen 3.6**, especially for coding and agentic workflows. The [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) describes a 35B total parameter model with only 3B activated per token. That is the important trick. It is a Mixture-of-Experts style model, so you get a larger pool of parameters without paying the full dense-model cost on every token.

That does not make it free. It still needs memory, and the quantization still matters. But it is exactly the kind of architecture that makes edge inference interesting.

### Parameter Sizes In Plain English

Parameter count is not a perfect measure of intelligence, but it is a good first approximation of memory pressure.

- **E2B / E4B / small models**: fast, easy to run, good for classification, small summaries, simple drafting, and autocomplete-style jobs.
- **12B class**: the first size where local chat starts to feel genuinely useful. A well-quantized 12B model can be responsive and surprisingly capable.
- **26B / 27B class**: the sweet spot for a 24GB workstation card if you want better reasoning and code review without falling off a performance cliff.
- **31B dense models**: possible with aggressive quantization, but you need to be realistic about context size and speed.
- **35B sparse MoE models**: interesting because only part of the model is active per token. Qwen3.6-35B-A3B is the obvious example for coding.

The raw model is only the starting point. The version you actually run locally is usually a quantized release.

## Quantization Is The Whole Game

A model published in BF16 or FP16 precision is often too large for workstation VRAM. Very roughly, 16-bit weights need about two bytes per parameter before you even think about runtime overhead. A 26B model can therefore be around 50GB in full precision. That is not going to live inside a 24GB card.

Quantization reduces the precision of the weights so the model uses less memory. The trade-off is quality. A good quant loses surprisingly little. A bad or too-aggressive quant can make the model feel vague, brittle, or weirdly forgetful.

In the GGUF world, you will see names like:

- `Q8_0`: large, high quality, often too big for this card once overhead is included.
- `Q6_K`: very good quality, but can be tight on 24GB with larger models.
- `Q5_K_M`: a nice middle ground when Q6 is too large.
- `Q4_K_M`: often the practical default for 24GB cards.
- `IQ4_NL`, `IQ4_XS`, `IQ3_*`: newer or more aggressive options that can make awkward models fit.

For example, Bartowski's [Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) page lists a full BF16 file at about 50.5GB, a `Q6_K_L` quant at about 23GB, and a `Q4_K_M` quant at about 17GB. On paper the Q6 file fits. In practice, the Q4 or Q5 variants are more comfortable because they leave room for context.

For Qwen3.6, the [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) releases include 4-bit files around the 20GB mark. That is exactly why this model is interesting on a 24GB card: it is large enough to be useful, but the right quant still fits.

The rule I have settled on is simple:

> Pick a quant at least 1-2GB smaller than your VRAM, and preferably more if you want larger context windows.

## Step 1: Download The GGUF File

GGUF is a local model file format used heavily by `llama.cpp` and tools built around that ecosystem. Hugging Face describes it as a binary format optimised for quick loading and inference, with both tensors and model metadata stored in the file.

That metadata part matters. A GGUF file is not just a bag of numbers. It normally includes enough information for local inference tools to understand the model architecture, tokeniser, quantization type, and prompt format.

In a connected environment you can download directly:

```bash
python -m pip install -U "huggingface_hub[cli]"

mkdir -p ~/models/qwen3.6
cd ~/models/qwen3.6

huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
  --local-dir .
```

For Gemma 4:

```bash
mkdir -p ~/models/gemma4
cd ~/models/gemma4

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
  --include "gemma-4-26B-A4B-it-Q4_K_M.gguf" \
  --local-dir .
```

In a restricted environment, I would not have each workstation randomly pulling files from Hugging Face. I would download once from a controlled machine, record the hash, put the GGUF into an internal artefact store, and use an orchestration tool such as Ansible to push the approved model files and configuration out to the endpoints.

That gives you:

- repeatability
- provenance
- hash verification
- no surprise model swaps
- no direct developer workstation Internet dependency
- a controlled way to replace or retire models across the fleet

## Step 2: Install Ollama

[Ollama](https://ollama.com/) is still the easiest way to get local LLMs running on a workstation. It wraps the model runtime, exposes a local API, and gives you a clean CLI for testing.

On a normal Linux workstation:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

In the sort of environment I care about, that installer should be mirrored or packaged internally. The workstation should install Ollama from an approved repository, not from a random curl pipe to the Internet.

Once installed, check that it is alive:

```bash
ollama --version
curl http://127.0.0.1:11434
```

You should see:

```text
Ollama is running
```

I also keep Ollama bound to localhost by default. A local model endpoint should not casually become a LAN service.

## Step 3: Create A Modelfile And Load The Model

Ollama can import a local GGUF with a `Modelfile`. The minimal version is almost comically small:

```text
FROM /opt/models/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER top_p 0.95
```

Save that as:

```bash
/opt/models/qwen3.6/Modelfile
```

Then create the Ollama model:

```bash
ollama create qwen36-local -f /opt/models/qwen3.6/Modelfile
```

For Gemma:

```text
FROM /opt/models/gemma4/gemma-4-26B-A4B-it-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then:

```bash
ollama create gemma4-local -f /opt/models/gemma4/Modelfile
```

Ollama's [Modelfile reference](https://docs.ollama.com/modelfile) is worth reading because the file is where you can set defaults like context length, stop tokens, temperature, and the system prompt.

## Step 4: Test With Ollama Directly

Before wiring this into anything else, test the model directly.

```bash
ollama list
ollama run qwen36-local
```

Then give it something real:

```text
You are reviewing a Python function for production readiness.
Find bugs, security issues, missing tests, and readability problems.
Be concise.
```

For a one-shot CLI test:

```bash
ollama run qwen36-local "Write a PHP function that safely renders a tag link for a blog post."
```

And check memory usage:

```bash
ollama ps
nvidia-smi
```

This is where the theory becomes visible. If the model spills out of VRAM, gets partially offloaded, or has an over-large context window, you will feel it immediately. Token generation slows down, prompts take longer to process, and the whole thing stops feeling like a useful assistant.

For coding, Qwen3.6 has felt more natural to me than Gemma. It is more comfortable with repository-shaped tasks, code review, and tool-like instructions. Gemma 4 is still a strong general local assistant, especially when licensing and broader local deployment matter.

## Step 5: Get Hermes Agent Running

Once Ollama is working, the next question is: what do you actually do with it?

Chatting in a terminal is useful for testing, but it is not an agent. That is where [Hermes Agent](https://hermes-agent.nousresearch.com/) becomes interesting.

The normal Hermes install path is a script, but in a locked-down workstation environment I prefer to be explicit:

```bash
sudo apt install python3 python3-venv nodejs npm

python3 -m venv ~/.venvs/hermes-agent
source ~/.venvs/hermes-agent/bin/activate

python -m pip install --upgrade pip
pip install hermes-agent
```

In our environment, package dependencies come from a local PyPI mirror, so the install looks more like this:

```bash
pip config set global.index-url https://pypi-mirror.example.internal/simple
pip config set global.trusted-host pypi-mirror.example.internal

pip install hermes-agent
```

Then point Hermes at Ollama's OpenAI-compatible endpoint:

```bash
hermes model
```

Choose a custom or self-hosted endpoint, then use:

```text
Base URL: http://127.0.0.1:11434/v1
API key:  leave blank, or use a local placeholder if the UI insists
Model:    qwen36-local
```

Hermes stores its configuration under `~/.hermes/`, which makes it fairly easy to inspect and manage. The Hermes provider docs note that custom/self-hosted endpoints work when they expose an OpenAI-compatible `/v1/chat/completions` API, which Ollama does.

### Breakout: Hermes Gives Me An Agent

This is the point where the setup stops feeling like "I installed a model" and starts feeling like "I have a local assistant."

I have called mine **Frank**.

Frank is an arrogant Frenchman who is constantly criticising my code and getting upset with me for wasting his time. This sounds ridiculous, but it is genuinely useful. A local model with a memorable persona is easier to work with than a bland text box. Frank has a job: read my code, complain about the parts that deserve complaint, and begrudgingly suggest better approaches.

The important part is not the accent. The important part is continuity and role.

Hermes gives the model:

- a persistent assistant shell
- memory across sessions
- tool access
- channel support
- configuration that survives the terminal window
- the ability to behave like an agent rather than a single prompt-response loop

That distinction matters. Ollama serves the model. Hermes gives it a place to live.

I do not want developers thinking about local AI as "open a terminal and ask a model a question." I want them thinking: "I have a workstation assistant that can inspect code, remember project conventions, and operate inside the security boundary of this machine."

Frank is rude about it, obviously. But that is his burden.

## Step 6: Get VS Code Working With Continue

The next piece was VS Code.

Ollama now has a native VS Code integration path through GitHub Copilot Chat. The awkward bit is that VS Code requires a login for the model selector even when you are using custom local models. Ollama's own [VS Code integration documentation](https://docs.ollama.com/integrations/vscode) says this does not require a paid Copilot account, but it still requires the login path.

In a locked-down or offline-first environment, that is not what I want.

So the workaround is [Continue](https://docs.continue.dev/guides/ollama-guide).

Install VS Code from the internal mirror of Microsoft's repository, then install the Continue extension from your approved extension mirror or a vetted `.vsix`:

```bash
sudo apt install code
code --install-extension Continue.continue
```

For an offline extension package:

```bash
code --install-extension continue-extension.vsix
```

Then configure Continue to use Ollama. A minimal `~/.continue/config.yaml` can look like this:

```yaml
name: Local AI Workstation
version: 0.0.1
schema: v1
models:
  - name: Qwen 3.6 Local
    provider: ollama
    model: qwen36-local
    apiBase: http://127.0.0.1:11434
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use

  - name: Gemma 4 Local
    provider: ollama
    model: gemma4-local
    apiBase: http://127.0.0.1:11434
    roles:
      - chat
      - edit
      - apply

  - name: Small Local Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://127.0.0.1:11434
    roles:
      - autocomplete
```

The model names must match `ollama list`. If Ollama knows the model as `qwen36-local:latest`, use that exact name.

Continue's Ollama guide also supports autodetection:

```yaml
models:
  - name: Autodetect
    provider: ollama
    model: AUTODETECT
    roles:
      - chat
      - edit
      - apply
      - autocomplete
```

That is handy for experimentation. For a managed developer fleet, I would rather be explicit.

### Breakout: What VS Code Is Doing With The Local Model

The exciting thing about this setup is that VS Code does not use the model in only one way.

There are at least three different coding patterns here.

**Autocomplete** is the fast path. The model sees the current file and nearby context, then predicts the next few lines. This wants a smaller, faster model. It should feel instant. A huge reasoning model is usually the wrong choice.

**Question and answer** is the chat path. This is where a developer asks, "Why is this failing?", "What does this function do?", or "Where should I add the validation?" This can use a larger model because latency matters less than understanding.

**Code synthesis and edit/apply** is the workflow path. The assistant proposes changes, rewrites a function, generates tests, or applies an edit across a file. This is where Qwen3.6 starts to make sense because repository reasoning and coding fluency matter more than raw chat pleasantness.

That role separation is important. Local AI in an IDE should not be one model doing everything badly. It should be a small fast model for completion, a stronger model for chat, and a code-capable model for synthesis.

I am genuinely excited to push this out to the devs and see what they do with it. Not because I expect local models to replace every cloud tool on day one, but because the workflow changes when the assistant is local, fast enough, and available inside the editor without sending code to an external API.

## Offline And Controlled Environments

The restricted-network angle is not an afterthought. It is one of the main reasons this experiment matters.

For a serious workstation rollout, I would mirror or internally host:

- Ollama packages
- approved GGUF model files
- model hashes and metadata
- Python packages through a local PyPI mirror
- Node packages through an internal npm mirror if needed
- VS Code packages from a Microsoft repository mirror
- approved VS Code extensions as `.vsix` artefacts
- Continue configuration templates
- Hermes configuration templates
- Ansible roles or playbooks that install the approved runtime, place the approved models, verify hashes, and keep endpoint configuration consistent

That gives developers a usable AI workstation without granting every machine direct Internet access.

It also gives the organisation a model governance story. You can say:

- these are the approved model files
- these are the hashes
- this is the license
- this is the intended use
- this is the runtime
- this is the endpoint binding
- this is how updates are tested
- this is the orchestration path that pushes the approved model set to endpoints

That is much better than "everyone install whatever model Reddit likes this week."

## What Worked

The RTX PRO 4000 Blackwell is a very practical local inference card for the 24GB class.

The models that felt worth using were not tiny toys. Gemma 4 26B-class and Qwen3.6 35B-A3B-class quantized models are capable enough for real drafting, explanation, code review, and assistant workflows.

Ollama made the runtime easy. GGUF made the model artefacts portable. Hermes made the setup feel like an assistant. Continue made VS Code local-model integration workable without going through the Copilot login path.

The biggest lesson was that local AI is not one component. It is a stack:

```text
approved GGUF file
  -> Ollama local runtime
  -> Hermes for agent workflows
  -> Continue for IDE workflows
  -> internal mirrors for repeatable offline deployment
```

When those pieces line up, the experience becomes surprisingly normal.

## What Did Not Work Perfectly

The 24GB limit is real. You can make very large models load by using more aggressive quants or offloading to system RAM, but there is a point where the experience stops being pleasant.

Context length is the silent killer. A model that works beautifully at 8K context might become sluggish when a tool tries to push it to 32K or beyond. For local coding assistants, this matters because IDE tools love stuffing prompts with file context, diffs, terminal output, and instructions.

Tool support is also uneven. A model may claim tool support, but the integration may still behave strangely. This is why I like testing directly in Ollama first, then Hermes, then Continue. Each layer adds value, but each layer also adds another place to misconfigure the model.

## The Bottom Line

The RTX PRO 4000 Blackwell has changed how I think about local AI workstations.

This is not about beating frontier models. It is about having a capable local baseline that works when cloud access is expensive, inappropriate, unavailable, or simply unnecessary.

For under AUD 3,000, a 24GB workstation GPU can run modern edge LLMs that are good enough to matter. Gemma 4 gives me a strong general local model with a clean licensing story. Qwen3.6 gives me a better coding-oriented model for repository work. Ollama makes them easy to serve. Hermes turns them into an agent. Continue brings them into VS Code.

That feels like the shape of the next developer workstation:

local by default, cloud when allowed, governed by design, and useful enough that developers will actually reach for it.

Frank still thinks my code is beneath him.

Unfortunately, he is often right.

---

## References

- [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/)
- [Scorptec RTX PRO 4000 Blackwell listing](https://www.scorptec.com.au/product/graphics-cards/workstation/118874-900-5g147-2570-000)
- [Gemma 4 model overview](https://ai.google.dev/gemma/docs/core)
- [My earlier post: Why Gemma 4's Apache 2.0 License Matters](/post/gemma-4-apache-2-license-matters)
- [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- [Bartowski Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
- [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF)
- [Hugging Face GGUF documentation](https://huggingface.co/docs/hub/gguf)
- [Hugging Face: use Ollama with GGUF models](https://huggingface.co/docs/hub/ollama)
- [Ollama importing models](https://docs.ollama.com/import)
- [Ollama Modelfile reference](https://docs.ollama.com/modelfile)
- [Ollama VS Code integration](https://docs.ollama.com/integrations/vscode)
- [Continue Ollama guide](https://docs.continue.dev/guides/ollama-guide)
- [Hermes Agent documentation](https://hermes-agent.nousresearch.com/docs/)
- [Hermes Agent AI providers](https://hermes-agent.nousresearch.com/docs/integrations/providers)