---
title: "Running Modern Edge LLMs on an RTX PRO 4000 Blackwell"
date: "2026-06-10"
published: true
tags: ["ai", "llm", "local-ai", "ollama", "gguf", "nvidia", "blackwell", "gemma", "qwen", "hermes", "vscode", "continue"]
author: "Gavin Jackson"
excerpt: "I wanted to know whether a sub-$3k Australian workstation GPU could run useful modern local LLMs. The short answer is yes, if you pick the right model, the right quant, and build around the 24GB VRAM limit instead of pretending it is a tiny cloud."
---

# Running Modern Edge LLMs on an RTX PRO 4000 Blackwell

I have been spending more time than is probably healthy thinking about local AI.

Part of that is curiosity. Part of it is cost. If LLM pricing keeps moving upward, or if the useful models increasingly sit behind subscriptions, enterprise plans, and usage tiers, then relying entirely on cloud inference starts to feel fragile.

The other part is network reality. Not every environment can connect directly to the Internet. Some developer networks should not have broad outbound access. Some codebases should not leave the building. Some organisations need AI assistance inside a controlled workstation environment, not hanging off whatever API a developer happened to sign up for last Tuesday night.

So I bought an NVIDIA RTX PRO 4000 Blackwell card, installed it in an Ubuntu 26.04 workstation, and started testing what modern edge LLMs look like on hardware that is expensive, but not absurd. In Australia, I have seen the 24GB RTX PRO 4000 Blackwell listed under AUD 3,000.
That is not pocket change, but it is very different from buying a datacentre GPU or building a multi-card inference box.

The key constraint is simple:

> The model needs to fit inside 24GB of VRAM and still leave enough headroom to be usable.

That one sentence drives almost every practical decision.

## The Hardware Reality

The card I am talking about is the [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/), with 24GB of GDDR7 ECC memory. My test machine is running Ubuntu 26.04, which is the same base I would use for a managed developer workstation rollout. It is a workstation GPU, not a gaming flagship, and that matters. The attraction is not only raw speed. It is the combination of 24GB VRAM, a professional driver stack, sane power and cooling, and the ability to put useful local inference into a developer workstation without turning the desk into a small server room.

On 24GB, you are not running giant 70B dense models comfortably. You are not replacing frontier cloud models for every task. But you can run genuinely capable 12B, 26B, 27B, 31B, and sparse 35B-class models if you choose the right quantized files and keep context length under control.

That last bit matters. VRAM is not just model weights. You also need memory for the KV cache, runtime overhead, and whatever your inference engine is doing around the edges. A model file that is 23.8GB is technically under 24GB, but that does not mean it is a good daily-driver choice on a 24GB card.

I learned to aim for breathing room.

## Picking Models That Actually Fit

There are two model families I keep coming back to for this size of card.

The first is **Gemma 4**. I have written previously about [why Gemma 4's Apache 2.0 license matters](/post/gemma-4-apache-2-license-matters), and the licensing still matters here.
It is much easier to have a serious enterprise conversation about local models when the legal posture is boring and understandable.

The second is **Qwen 3.6**, especially for coding and agentic workflows. The [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) describes a 35B total parameter model with only 3B activated per token. That is the important trick. It is a Mixture-of-Experts style model, so you get a larger pool of parameters without paying the full dense-model cost on every token.

That does not make it free. It still needs memory, and the quantization still matters. But it is exactly the kind of architecture that makes edge inference interesting.

### Parameter Sizes In Plain English

Parameter count is not a perfect measure of intelligence, but it is a good first approximation of memory pressure.

- **E2B / E4B / small models**: fast, easy to run, good for classification, small summaries, simple drafting, and autocomplete-style jobs.
- **12B class**: the first size where local chat starts to feel genuinely useful. A well-quantized 12B model can be responsive and surprisingly capable.
- **26B / 27B class**: the sweet spot for a 24GB workstation card if you want better reasoning and code review without falling off a performance cliff.
- **31B dense models**: possible with aggressive quantization, but you need to be realistic about context size and speed.
- **35B sparse MoE models**: interesting because only part of the model is active per token. Qwen3.6-35B-A3B is the obvious example for coding.

The raw model is only the starting point. The version you actually run locally is usually a quantized release.

## Quantization Is The Whole Game

A model published in BF16 or FP16 precision is often too large for workstation VRAM. Very roughly, 16-bit weights need about two bytes per parameter before you even think about runtime overhead. A 26B model can therefore be around 50GB in full precision.
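
The two-bytes-per-parameter rule can be turned into a quick back-of-the-envelope calculation. The sketch below is only an approximation under stated assumptions: the `weight_gb` helper is a made-up name for illustration, it uses decimal gigabytes (1GB = 1e9 bytes), and it counts weights only, ignoring the KV cache and runtime overhead.

```python
# Rough weight-only memory estimate: parameters x bits per weight / 8.
# Assumptions: decimal gigabytes (1 GB = 1e9 bytes); KV cache and
# runtime overhead are deliberately left out.


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Nominal bit widths only; real quantized files differ (see note below).
    for label, bits in [("16-bit (BF16/FP16)", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"26B at {label}: ~{weight_gb(26, bits):.0f} GB")
```

Real quantized files tend to come out larger than the nominal 8-bit or 4-bit figure, because they also store scaling data and usually keep some tensors at higher precision. The 16-bit row is the one that matters here: roughly 52GB of weights for a 26B model.
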
That is not going to live inside a 24GB card.

Quantization reduces the precision of the weights so the model uses less memory. The trade-off is quality. A good quant loses surprisingly little. A bad or too-aggressive quant can make the model feel vague, brittle, or weirdly forgetful.

In the GGUF world, you will see names like:

- `Q8_0`: large, high quality, often too big for this card once overhead is included.
- `Q6_K`: very good quality, but can be tight on 24GB with larger models.
- `Q5_K_M`: a nice middle ground when Q6 is too large.
- `Q4_K_M`: often the practical default for 24GB cards.
- `IQ4_NL`, `IQ4_XS`, `IQ3_*`: newer or more aggressive options that can make awkward models fit.

For example, Bartowski's [Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF) page lists a full BF16 file at about 50.5GB, a `Q6_K_L` quant at about 23GB, and a `Q4_K_M` quant at about 17GB. On paper the Q6 file fits. In practice, the Q4 or Q5 variants are more comfortable because they leave room for context.

For Qwen3.6, the [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) releases include 4-bit files around the 20GB mark. That is exactly why this model is interesting on a 24GB card: it is large enough to be useful, but the right quant still fits.

The rule I have settled on is simple:

> Pick a quant at least 1-2GB smaller than your VRAM, and preferably more if you want larger context windows.

## Step 1: Download The GGUF File

GGUF is a local model file format used heavily by `llama.cpp` and tools built around that ecosystem. Hugging Face describes it as a binary format optimised for quick loading and inference, with both tensors and model metadata stored in the file.

That metadata part matters. A GGUF file is not just a bag of numbers.
It normally includes enough information for local inference tools to understand the model architecture, tokeniser, quantization type, and prompt format.

In a connected environment you can download directly:

```bash
python -m pip install -U "huggingface_hub[cli]"

mkdir -p ~/models/qwen3.6
cd ~/models/qwen3.6

huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --include "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" \
  --local-dir .
```

For Gemma 4:

```bash
mkdir -p ~/models/gemma4
cd ~/models/gemma4

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
  --include "gemma-4-26B-A4B-it-Q4_K_M.gguf" \
  --local-dir .
```

In a restricted environment, I would not have each workstation randomly pulling files from Hugging Face. I would download once from a controlled machine, record the hash, put the GGUF into an internal artefact store, and use an orchestration tool such as Ansible to push the approved model files and configuration out to the endpoints.

That gives you:

- repeatability
- provenance
- hash verification
- no surprise model swaps
- no direct developer workstation Internet dependency
- a controlled way to replace or retire models across the fleet

## Step 2: Install Ollama

[Ollama](https://ollama.com/) is still the easiest way to get local LLMs running on a workstation. It wraps the model runtime, exposes a local API, and gives you a clean CLI for testing.

On a normal Linux workstation:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

In the sort of environment I care about, that installer should be mirrored or packaged internally.
The workstation should install Ollama from an approved repository, not from a random curl pipe to the Internet.

Once installed, check that it is alive:

```bash
ollama --version
curl http://127.0.0.1:11434
```

You should see:

```text
Ollama is running
```

I also keep Ollama bound to localhost by default. A local model endpoint should not casually become a LAN service.

## Step 3: Create A Modelfile And Load The Model

Ollama can import a local GGUF with a `Modelfile`. The minimal version is almost comically small:

```text
FROM /opt/models/qwen3.6/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.6
PARAMETER top_p 0.95
```

Save that as:

```bash
/opt/models/qwen3.6/Modelfile
```

Then create the Ollama model:

```bash
ollama create qwen36-local -f /opt/models/qwen3.6/Modelfile
```

For Gemma:

```text
FROM /opt/models/gemma4/gemma-4-26B-A4B-it-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then:

```bash
ollama create gemma4-local -f /opt/models/gemma4/Modelfile
```

Ollama's [Modelfile reference](https://docs.ollama.com/modelfile) is worth reading because the file is where you can set defaults like context length, stop tokens, temperature, and the system prompt.

## Step 4: Test With Ollama Directly

Before wiring this into anything else, test the model directly.

```bash
ollama list
ollama run qwen36-local
```

Then give it something real:

```text
You are reviewing a Python function for production readiness.
Find bugs, security issues, missing tests, and readability problems.
Be concise.
```

For a one-shot CLI test:

```bash
ollama run qwen36-local "Write a PHP function that safely renders a tag link for a blog post."
```

And check memory usage:

```bash
ollama ps
nvidia-smi
```

This is where the theory becomes
visible. If the model spills out of VRAM, gets partially offloaded, or has an over-large context window, you will feel it immediately. Token generation slows down, prompts take longer to process, and the whole thing stops feeling like a useful assistant.

For coding, Qwen3.6 has felt more natural to me than Gemma. It is more comfortable with repository-shaped tasks, code review, and tool-like instructions. Gemma 4 is still a strong general local assistant, especially when licensing and broader local deployment matter.

## Step 5: Get Hermes Agent Running

Once Ollama is working, the next question is: what do you actually do with it?

Chatting in a terminal is useful for testing, but it is not an agent. That is where [Hermes Agent](https://hermes-agent.nousresearch.com/) becomes interesting.

The normal Hermes install path is a script, but in a locked-down workstation environment I prefer to be explicit:

```bash
sudo apt install python3 python3-venv nodejs npm

python3 -m venv ~/.venvs/hermes-agent
source ~/.venvs/hermes-agent/bin/activate

python -m pip install --upgrade pip
pip install hermes-agent
```

In our environment, package dependencies come from a local PyPI mirror, so the install looks more like this:

```bash
pip config set global.index-url https://pypi-mirror.example.internal/simple
pip config set global.trusted-host pypi-mirror.example.internal

pip install hermes-agent
```

Then point Hermes at Ollama's OpenAI-compatible endpoint:

```bash
hermes model
```

Choose a custom or self-hosted endpoint, then use:

```text
Base URL: http://127.0.0.1:11434/v1
API key: leave blank, or use a local placeholder if the UI insists
Model: qwen36-local
```

Hermes stores its configuration under `~/.hermes/`, which makes it fairly easy to inspect and manage.
The Hermes provider docs note that custom/self-hosted endpoints work when they expose an OpenAI-compatible `/v1/chat/completions` API, which Ollama does.

### Breakout: Hermes Gives Me An Agent

This is the point where the setup stops feeling like "I installed a model" and starts feeling like "I have a local assistant."

I have called mine **Frank**.

Frank is an arrogant Frenchman who is constantly criticising my code and getting upset with me for wasting his time. This sounds ridiculous, but it is genuinely useful. A local model with a memorable persona is easier to work with than a bland text box. Frank has a job: read my code, complain about the parts that deserve complaint, and begrudgingly suggest better approaches.

The important part is not the accent. The important part is continuity and role.

Hermes gives the model:

- a persistent assistant shell
- memory across sessions
- tool access
- channel support
- configuration that survives the terminal window
- the ability to behave like an agent rather than a single prompt-response loop

That distinction matters. Ollama serves the model. Hermes gives it a place to live.

I do not want developers thinking about local AI as "open a terminal and ask a model a question." I want them thinking: "I have a workstation assistant that can inspect code, remember project conventions, and operate inside the security boundary of this machine."

Frank is rude about it, obviously. But that is his burden.

## Step 6: Get VS Code Working With Continue

The next piece was VS Code.

Ollama now has a native VS Code integration path through GitHub Copilot Chat. The awkward bit is that VS Code requires a login for the model selector even when you are using custom local models.
Ollama's own [VS Code integration documentation](https://docs.ollama.com/integrations/vscode) says this does not require a paid Copilot account, but it still requires the login path.

In a locked-down or offline-first environment, that is not what I want.

So the workaround is [Continue](https://docs.continue.dev/guides/ollama-guide).

Install VS Code from the internal mirror of Microsoft's repository, then install the Continue extension from your approved extension mirror or a vetted `.vsix`:

```bash
sudo apt install code
code --install-extension Continue.continue
```

For an offline extension package:

```bash
code --install-extension continue-extension.vsix
```

Then configure Continue to use Ollama. A minimal `~/.continue/config.yaml` can look like this:

```yaml
name: Local AI Workstation
version: 0.0.1
schema: v1
models:
  - name: Qwen 3.6 Local
    provider: ollama
    model: qwen36-local
    apiBase: http://127.0.0.1:11434
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use

  - name: Gemma 4 Local
    provider: ollama
    model: gemma4-local
    apiBase: http://127.0.0.1:11434
    roles:
      - chat
      - edit
      - apply

  - name: Small Local Autocomplete
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://127.0.0.1:11434
    roles:
      - autocomplete
```

The model names must match `ollama list`. If Ollama knows the model as `qwen36-local:latest`, use that exact name.

Continue's Ollama guide also supports autodetection:

```yaml
models:
  - name: Autodetect
    provider: ollama
    model: AUTODETECT
    roles:
      - chat
      - edit
      - apply
      - autocomplete
```

That is handy for experimentation.
For a managed developer fleet, I would rather be explicit.

### Breakout: What VS Code Is Doing With The Local Model

The exciting thing about this setup is that VS Code does not use the model in only one way.

There are at least three different coding patterns here.

**Autocomplete** is the fast path. The model sees the current file and nearby context, then predicts the next few lines. This wants a smaller, faster model. It should feel instant. A huge reasoning model is usually the wrong choice.

**Question and answer** is the chat path. This is where a developer asks, "Why is this failing?", "What does this function do?", or "Where should I add the validation?" This can use a larger model because latency matters less than understanding.

**Code synthesis and edit/apply** is the workflow path. The assistant proposes changes, rewrites a function, generates tests, or applies an edit across a file. This is where Qwen3.6 starts to make sense because repository reasoning and coding fluency matter more than raw chat pleasantness.

That role separation is important. Local AI in an IDE should not be one model doing everything badly. It should be a small fast model for completion, a stronger model for chat, and a code-capable model for synthesis.

I am genuinely excited to push this out to the devs and see what they do with it. Not because I expect local models to replace every cloud tool on day one, but because the workflow changes when the assistant is local, fast enough, and available inside the editor without sending code to an external API.

## Offline And Controlled Environments

The restricted-network angle is not an afterthought.
It is one of the main reasons this experiment matters.

For a serious workstation rollout, I would mirror or internally host:

- Ollama packages
- approved GGUF model files
- model hashes and metadata
- Python packages through a local PyPI mirror
- Node packages through an internal npm mirror if needed
- VS Code packages from a Microsoft repository mirror
- approved VS Code extensions as `.vsix` artefacts
- Continue configuration templates
- Hermes configuration templates
- Ansible roles or playbooks that install the approved runtime, place the approved models, verify hashes, and keep endpoint configuration consistent

That gives developers a usable AI workstation without granting every machine direct Internet access.

It also gives the organisation a model governance story. You can say:

- these are the approved model files
- these are the hashes
- this is the license
- this is the intended use
- this is the runtime
- this is the endpoint binding
- this is how updates are tested
- this is the orchestration path that pushes the approved model set to endpoints

That is much better than "everyone install whatever model Reddit likes this week."

## What Worked

The RTX PRO 4000 Blackwell is a very practical local inference card for the 24GB class.

The models that felt worth using were not tiny toys. Gemma 4 26B-class and Qwen3.6 35B-A3B-class quantized models are capable enough for real drafting, explanation, code review, and assistant workflows.

Ollama made the runtime easy. GGUF made the model artefacts portable. Hermes made the setup feel like an assistant. Continue made VS Code local-model integration workable without going through the Copilot login path.

The biggest lesson was that local AI is not one component.
It is a stack:

```text
approved GGUF file
  -> Ollama local runtime
  -> Hermes for agent workflows
  -> Continue for IDE workflows
  -> internal mirrors for repeatable offline deployment
```

When those pieces line up, the experience becomes surprisingly normal.

## What Did Not Work Perfectly

The 24GB limit is real. You can make very large models load by using more aggressive quants or offloading to system RAM, but there is a point where the experience stops being pleasant.

Context length is the silent killer. A model that works beautifully at 8K context might become sluggish when a tool tries to push it to 32K or beyond. For local coding assistants, this matters because IDE tools love stuffing prompts with file context, diffs, terminal output, and instructions.

Tool support is also uneven. A model may claim tool support, but the integration may still behave strangely. This is why I like testing directly in Ollama first, then Hermes, then Continue. Each layer adds value, but each layer also adds another place to misconfigure the model.

## The Bottom Line

The RTX PRO 4000 Blackwell has changed how I think about local AI workstations.

This is not about beating frontier models. It is about having a capable local baseline that works when cloud access is expensive, inappropriate, unavailable, or simply unnecessary.

For under AUD 3,000, a 24GB workstation GPU can run modern edge LLMs that are good enough to matter. Gemma 4 gives me a strong general local model with a clean licensing story. Qwen3.6 gives me a better coding-oriented model for repository work. Ollama makes them easy to serve. Hermes turns them into an agent.
Continue brings them into VS Code.

That feels like the shape of the next developer workstation:

local by default, cloud when allowed, governed by design, and useful enough that developers will actually reach for it.

Frank still thinks my code is beneath him.

Unfortunately, he is often right.

---

## References

- [NVIDIA RTX PRO 4000 Blackwell](https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-4000/)
- [Scorptec RTX PRO 4000 Blackwell listing](https://www.scorptec.com.au/product/graphics-cards/workstation/118874-900-5g147-2570-000)
- [Gemma 4 model overview](https://ai.google.dev/gemma/docs/core)
- [My earlier post: Why Gemma 4's Apache 2.0 License Matters](/post/gemma-4-apache-2-license-matters)
- [Qwen3.6-35B-A3B model card](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
- [Bartowski Gemma 4 26B A4B GGUF](https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF)
- [Unsloth Qwen3.6-35B-A3B GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF)
- [Hugging Face GGUF documentation](https://huggingface.co/docs/hub/gguf)
- [Hugging Face: use Ollama with GGUF models](https://huggingface.co/docs/hub/ollama)
- [Ollama importing models](https://docs.ollama.com/import)
- [Ollama Modelfile reference](https://docs.ollama.com/modelfile)
- [Ollama VS Code integration](https://docs.ollama.com/integrations/vscode)
- [Continue Ollama guide](https://docs.continue.dev/guides/ollama-guide)
- [Hermes Agent documentation](https://hermes-agent.nousresearch.com/docs/)
- [Hermes Agent AI providers](https://hermes-agent.nousresearch.com/docs/integrations/providers)