TL;DR
On April 4, 2026, Anthropic ended Claude Pro and Max coverage for OpenClaw and every third-party agent tool. Anthropic's own framing: $200/month plans had been running $1,000 to $5,000 of compute. Part 1's two signals (capability moving fast, infrastructure costs climbing) just gained a third: pricing-page sovereignty. The platform can rewrite your deal overnight.
The hedge isn't a different vendor. The hedge is owning the loop. Run. Measure. Improve.
The 2026 stack: Gemma 4 + TurboQuant for the model. MLX or CUDA for the chip. Ollama or LM Studio for the runtime. exo + Tailscale to cluster your devices. Hermes as the agent. RunPod when local isn't enough. A spreadsheet for evals. You can ship the first lap of the loop this weekend. The longer you wait, the bigger the second gap gets.
April 4, 12pm Pacific
On April 4, 2026 at noon Pacific, an email landed in OpenClaw users' inboxes. Their Claude Pro and Max subscriptions would no longer cover OpenClaw or any third-party agentic tool. The switch was immediate: usage moved to pay-as-you-go bundles or direct API keys. Affected subscribers got a one-time credit equal to one month of their plan, and pre-purchased bundles got up to 30% off — a softer landing than the headlines implied, but a landing nonetheless.
The numbers came from Anthropic itself. In a statement to VentureBeat, the company pointed out that some $200/month Max subscriptions were running $1,000 to $5,000 worth of agent compute — that is where the "5x to 25x" bill-jump story comes from. Indie developers who had wired OpenClaw into their daily workflow watched their projected April spend balloon inside an afternoon. Threads filled up. Refund requests piled in.
OpenClaw's original creator, Peter Steinberger, had announced his move to OpenAI on February 14, nearly two months earlier. The project was handed to an open-source foundation (with OpenAI's continued support) and remains MIT-licensed. The code survives intact; the economics, for anyone still on a flat Anthropic subscription, do not.
Boris Cherny, who runs Claude Code at Anthropic, said it cleanly: "Our subscriptions weren't built for the usage patterns of these third-party tools." He added: "Capacity is a resource we manage thoughtfully, and we are prioritizing our customers using our products and API." He is not wrong. He is also not the point.
The point is that the rules can change overnight. They just did. Renting intelligence is a business model. Someone else's. And yours now depends on it.
Two signals just became three
Part 1 named two signals — capability moving fast, infrastructure costs climbing. The conclusion was: start building now.
This week added a third signal. Pricing-page sovereignty. The platform you build on can rewrite the deal between Tuesday and Wednesday and there is nothing in your contract that says no. Not a Claude problem. Not an Anthropic problem. A rented-intelligence problem.
The hedge isn't switching vendors. Switching vendors just buys you a different landlord. The hedge is owning the loop.
The loop is three verbs. Run. Measure. Improve. Run a model on a machine you control. Measure it on a task you care about. Improve it through prompt, tool, or training. Then run it again. Reps stack. Reps compound. Reps cannot be revoked by an email.
CUDA vs. MLX — the laptop you own picked the lane
Two compute languages run almost everything in local AI. Pick the wrong one and your hardware sits idle.
CUDA is NVIDIA's. It is the world's GPU language. Every cloud H100, every gaming PC with a 3090 or 4090, every Linux box with a green-stickered card — CUDA. If you have an NVIDIA GPU, you are in CUDA's lane and in the lane the entire research world targets first.
MLX is Apple's. It is the native framework for M-series Macs and the reason a $1,500 MacBook Air now beats a $4,000 PC tower for many local workloads. Unified memory is the trick. The model and the GPU share the same RAM, no copying. A 64GB MacBook can hold a 35-billion-parameter model in working memory and barely warm up the fans.
The laptop you own picked your lane
Mac on M1 or later? You are MLX. Windows or Linux PC with an NVIDIA card? You are CUDA. Don't fight it. Run what runs.
Serious shops run both
CUDA wins on raw FLOPs and training. MLX wins on watts-per-token and inference latency. MLX on the daily drivers, a CUDA box at home (or a rented H100) for the heavy lifts.
The three runtimes everyone argues about
llama.cpp. Ollama. LM Studio. People will tell you which one is best. They are arguing about the paint color on the same engine.
llama.cpp — written in C++ by Georgi Gerganov. The engine. Every other tool on this list is a wrapper around it.
Ollama — a package-manager-style wrapper around llama.cpp. ollama run gemma3:4b is a verb now. Auto-downloads. Auto-quantizes. Exposes a local API on port 11434 that any app can talk to (sketched in code just below).
LM Studio — a desktop GUI for people who don't live in a terminal. Click a model. Click download. Click chat. Same engine underneath.
Pick by comfort, not features. They all run the same models from the same files. The differences are surface.
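That port 11434 API is the piece worth internalizing: any script on your machine can treat the local model like a web service. Here is a minimal sketch, assuming Ollama is running and you have already pulled a model; the tag and the prompt are placeholders.

```python
# Minimal sketch: one prompt against Ollama's local HTTP API (port 11434).
# Assumes Ollama is running and the model tag below has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",  # placeholder tag -- use whatever `ollama list` shows
        "prompt": "Summarize the trade-offs between CUDA and MLX in three sentences.",
        "stream": False,       # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```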
Install LM Studio
Five minutes from download to your first chat with a frontier model. No terminal required.
Ollama on the daily driver
Use llama.cpp directly when you need to compile something custom or run on a headless server. Skip the GUI debate entirely.
What a GGUF actually is
You will see the letters GGUF on every model page on Hugging Face. Nobody explains what they mean.
A GGUF is one file. Inside that file: the model's quantized weights, the tokenizer, the metadata, the prompt template. All packed.
It is the MP3 of AI models. One format, every device, two minutes from download to running. Download a 4GB GGUF, drop it in Ollama or LM Studio, and you are chatting with a frontier-quality model. No PyTorch install. No CUDA toolkit. No Python environment hell. One file. Done.
The format that turned AI models into something you double-click.
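If you would rather prove the one-file claim from code than from a GUI, here is a minimal sketch using the llama-cpp-python bindings. The filename is a placeholder for whatever GGUF you actually downloaded.

```python
# Minimal sketch: load a GGUF and generate, via the llama-cpp-python bindings.
# The model path is a placeholder -- point it at any GGUF you have downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-e4b-q4_k_m.gguf",  # hypothetical filename
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload every layer to the GPU (Metal or CUDA) if available
)

out = llm("Explain what a GGUF file contains, in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```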
Gemma 4 and TurboQuant shipped this week
This part happened this week. April 2, 2026. Google DeepMind shipped Gemma 4 and a new compression method called TurboQuant.
Effective 2B. Runs on a phone. 60.0% MMLU-Pro.
Effective 4B. Runs on any laptop from the last three years. 69.4% MMLU-Pro.
26B Mixture-of-Experts. ~4B active params per token. Fast despite the size.
A fourth size — 31B Dense — is the foundation model. It's the headline variant and the one people are benchmarking against frontier closed models this week.
The bigger story is TurboQuant — unveiled March 25, accepted to ICLR 2026. Real headline numbers: 6x reduction in KV-cache memory at the same accuracy as the BF16 baseline, and up to 8x faster attention on H100 at 4-bit. KV cache compresses to roughly 2.5–3 bits with near-lossless quality. Training-free and data-oblivious — you download it and it works. A model that needed a data-center A100 last year now runs on a MacBook Air today.
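To see why a KV-cache number like that moves a model from data center to laptop, here is a back-of-envelope calculation. The layer count, KV-head count, and head dimension are illustrative placeholders, not Gemma 4's published architecture.

```python
# Back-of-envelope KV-cache memory: BF16 vs ~2.7-bit compression.
# Architecture numbers below are illustrative placeholders, not a published spec.
n_layers, n_kv_heads, head_dim = 48, 8, 128
seq_len = 128_000                      # a long-context session

def kv_cache_gb(bits_per_value: float) -> float:
    # 2x for keys and values, one entry per layer, KV head, head dim, and token.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 1e9

print(f"BF16 KV cache:  {kv_cache_gb(16):.1f} GB")
print(f"~2.7-bit cache: {kv_cache_gb(2.7):.1f} GB   (~{16 / 2.7:.1f}x smaller)")
```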
That is the single biggest hardware-accessibility jump of the year. It happened the same week the rented-intelligence bill arrived. That is not a coincidence. That is the new equilibrium.
Download Gemma 4 E4B tonight
In LM Studio. 69.4% MMLU-Pro on a mid-spec laptop is the best local/laptop trade in 2026. Period.
Pull the 26B A4B GGUF
From Unsloth or Google's Hugging Face org. Near-31B-Dense quality at ~4B-active speed. The model to fine-tune. The model to put under Hermes.
Gemma 4 31B Dense on the real scoreboards (April 2026): 85.2 MMLU-Pro · 84.3 GPQA Diamond · 80.0 LiveCodeBench v6 · 89.2 AIME 2026 · LMArena #3 open / #6 overall. It lands within 7–10 points of GPT-5.4 (92.8 GPQA) and Claude Opus 4.6 on frontier evals — on a single consumer GPU. That is the gap you are buying back.
Your devices are already a cluster: exo + Tailscale
The single most underrated fact about local AI in 2026 is that most people own three or four computers and a tablet, and don't realize they could be one machine.
exo, from exo Labs, shards a model across every Mac, PC, iPhone, and iPad on your network. The old MacBook in the drawer becomes part of a bigger brain. Two M1 Airs can run a model neither could fit alone. Add an iPad and you are past 100GB of pooled memory without buying anything.
Tailscale is a mesh VPN. After a five-minute install on every device, your home rig is reachable from a coffee shop like it is on the same Wi-Fi. End-to-end encrypted. Private. No port forwarding. No public IP. No cloud middleman.
No data left your perimeter. No cloud bill arrived. No pricing page changed underneath you. The model is yours, the cluster is yours, the network is yours, and the answer is private by construction. Private AI from anywhere is not a dream of 2027. It is a Sunday afternoon project.
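Concretely, the laptop at the coffee shop talks to the cluster at home by its Tailscale name. A minimal sketch, assuming the cluster exposes an OpenAI-compatible chat endpoint (exo does); the hostname, port, and model id below are placeholders for whatever your own setup reports.

```python
# Minimal sketch: query a home cluster over Tailscale from anywhere.
# Hostname, port, and model id are placeholders -- use what your own setup reports.
import requests

CLUSTER = "http://home-cluster:52415"  # hypothetical Tailscale machine name + port

resp = requests.post(
    f"{CLUSTER}/v1/chat/completions",   # OpenAI-compatible chat endpoint
    json={
        "model": "gemma-4-26b-a4b",     # placeholder model id
        "messages": [{"role": "user", "content": "Draft a reply to this client email."}],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```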
Hermes: the agent that learns from your work
Now the OpenClaw news pays off.
If you spent the last six months building on OpenClaw inside a Claude subscription, you woke up April 4 with a problem. The replacement is Hermes Agent, from Nous Research, led by Jeffrey Quesnelle — the same lab behind the legendary Hermes model series.
Hermes Agent is MIT-licensed. It launched as v0.1.0 on February 25 and is on v0.8.0 as of April 8 — around 22,000 GitHub stars and climbing fast. It runs on a $5/month VPS. It runs on your laptop. It runs on your exo cluster. It is model-agnostic — Gemma 4, Qwen, Llama, your own fine-tune, anything that speaks the OpenAI API format. Its self-improving skill memory follows the new agentskills.io standard, which means the skills it writes are portable to any other agent that speaks it.
What makes it different from every other agent is a built-in learning loop:
The Hermes learning loop
Do the task once. Save what worked as a skill. Run the skill next time. Patch it when it breaks. Not a model retraining itself in the dark — a practical agent that gets sharper at your work.
A worked example. You ask Hermes to scrape pricing pages from a list of competitors and dump them into a CSV every Monday morning. The first time, it figures out the scrape logic, writes the CSV, hands it to you. Then it writes a skill called weekly-competitor-pricing-scrape with the working code, the schedule, the source list, and the gotchas it hit. Next Monday, it doesn't think. It runs the skill. The Monday after that, when one of the sites changes its layout, it patches the skill in place and keeps going. By month three, you have a folder of skills that look exactly like the work you do, and an agent that opens each week already knowing how to do it.
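What the saved skill looks like on disk depends on Hermes and the agentskills.io spec, but its working core is ordinary code. Below is a purely illustrative sketch of the scrape-to-CSV logic such a skill would capture; every URL, selector, and path in it is made up.

```python
# Purely illustrative: roughly what a weekly-competitor-pricing-scrape skill
# might keep. URLs, selectors, and the output path are made-up placeholders;
# the real on-disk format is whatever Hermes / agentskills.io specifies.
import csv
import datetime

import requests
from bs4 import BeautifulSoup

SOURCES = {
    "competitor-a": ("https://example.com/pricing", ".price-card"),  # hypothetical
    "competitor-b": ("https://example.org/plans", ".plan-price"),    # hypothetical
}

def run() -> str:
    rows = []
    for name, (url, selector) in SOURCES.items():
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        for node in soup.select(selector):
            rows.append({"competitor": name, "price_text": node.get_text(strip=True)})
    out_path = f"pricing-{datetime.date.today().isoformat()}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["competitor", "price_text"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```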
OpenClaw was the cloud-tier sibling. Hermes is the local-tier sibling. After this week, that distinction is the whole game. One can be unplugged from above. The other cannot.
Benchmarks are public. Evals are yours.
Every model launch comes with a chart. MMLU, HumanEval, GPQA, SWE-bench, ARC-AGI. They all matter, in the same way that horsepower matters — they tell you the engine isn't broken. They tell you almost nothing about how the model does on your job.
The moat is your eval set. Twenty prompts that look exactly like the work you need done. Scored every time you swap a model, tweak a prompt, or change a tool.
You do not need a framework. You do not need a research paper. You need a spreadsheet. Columns: Prompt. What a good answer looks like. Model A score (1–5). Model B score (1–5). Notes.
That is an eval suite. It costs you nothing. It will outperform every benchmark chart you have ever read for the work you actually do. The first version takes an hour. The first improvement it surfaces will pay for that hour ten times over.
Open Google Sheets right now
Write five prompts that look like real work. Score Gemma 4 against ChatGPT. You now have an eval set. Add fifteen more this week.
Regression test for intelligence
Graduate the spreadsheet into a JSON file. Run it through a script every time you bump a model version. Auto-commit the scores to git.
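A minimal sketch of that runner, assuming the eval lives in an evals.json file of prompts and the model is served by Ollama on port 11434. File names and the model tag are placeholders, and scoring stays manual for the first versions.

```python
# Minimal sketch of an eval runner: read prompts from JSON, run them through a
# local model, and write the outputs next to the expectations for scoring.
# File names and the model tag are placeholders.
import json

import requests

MODEL = "gemma-4-e4b"  # placeholder tag

with open("evals.json") as f:  # [{"prompt": ..., "good_answer_looks_like": ...}, ...]
    cases = json.load(f)

results = []
for case in cases:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": MODEL, "prompt": case["prompt"], "stream": False},
        timeout=300,
    )
    results.append({**case, "model": MODEL, "output": resp.json()["response"]})

with open(f"results-{MODEL}.json", "w") as f:  # score by hand, 1-5, then commit
    json.dump(results, f, indent=2)
```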
Rent an H100 for the afternoon
Some workloads need real silicon. Fine-tuning a 13B model. Generating training data with a teacher model. Running an embedding pipeline over a million documents. Your laptop will get you started. It will not get you finished.
The answer is not to buy a $10,000 H100. The answer is to rent one for the four hours you need it.
RunPod, Lambda, and Vast.ai rent H100s, A100s, and 4090s by the minute. As of April 2026 on RunPod: RTX 4090 from $0.34/hr community / $0.59/hr secure. H100 80GB from $2.39/hr on-demand secure, up to ~$2.99/hr on community nodes. Spin up a pod from a template, mount a network volume, pull your dataset, run the job, tear it down. The card goes back into the pool. You get billed for the hours your fine-tune actually ran, not for a card sitting in a closet.
A full afternoon on an H100 is ~$10. A weekend-long fine-tune rarely crosses $30. The cost of learning to train models in 2026 is a nice dinner. The cost of not learning is the email you got April 4.
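The arithmetic behind those figures, at the on-demand H100 rate quoted above:

```python
# Back-of-envelope rental cost at the H100 rate quoted above.
rate = 2.39  # $/hr, H100 80GB on-demand secure
print(f"4-hour afternoon: ${4 * rate:.2f}")
print(f"12-hour weekend:  ${12 * rate:.2f}")
```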
Two paths you can start on Saturday
Both paths converge on the same loop. Pick the one that matches how you like to work and get moving.
LM Studio + a Google Sheet
Install LM Studio. Download Gemma 4 E4B. Open Google Sheets and write twenty prompts that look like the work you do every week. Score the model against your current AI tool. Save. Score again next week, and the week after. That is the loop. That is the entire game at this stage.
Ollama + exo + Tailscale + Hermes
Install Ollama on your daily driver. Install exo on every spare device. Wrap them in Tailscale. Drop Hermes on top. Build the same twenty-prompt eval as a JSON file and a runner script. When the spreadsheet isn't enough, rent an H100 on RunPod for an afternoon and fine-tune Gemma 4 26B on your own data.
Same three verbs. Run. Measure. Improve. The only difference between the two paths is how soon the reps start to compound.
Disarmed, in order
"I don't have a GPU." You probably already do. The M-series chip in any Mac sold since 2020 is a GPU. The integrated graphics on a recent iPad is a GPU. The old gaming laptop in the closet is a GPU. With Gemma 4 E4B and TurboQuant, the bar to run a frontier-quality model is lower than it has ever been. If you have a laptop you bought in the last four years, you have enough.
"I'm not technical enough." The beginner path in this post has zero terminal commands. LM Studio is a button. Google Sheets is a spreadsheet. Both already exist on your machine. The only technical skill required is the willingness to write down what a good answer looks like.
"What if it's hard?" The first hour is hard the way every new tool's first hour is hard. The second hour is not. By the end of a Saturday, you have a model running, twenty prompts scored, and a sense of what your local stack can and can't do. That is more reps than 95% of people who talk about AI on LinkedIn will get this year.
"What if local models aren't good enough?" They are good enough for more of your work than you think, and the gap closes every month. Gemma 4 E4B posts 69.4% MMLU-Pro from a laptop — territory that was frontier closed-model work eighteen months ago. The 31B Dense variant sits at 85.2 MMLU-Pro, 84.3 GPQA Diamond, 80.0 LiveCodeBench v6 — roughly 7–10 points off GPT-5.4 and Claude Opus 4.6 on the hardest evals, on a single consumer GPU. The point is not to replace the frontier. It is to own a baseline nobody can take from you, and to reserve the frontier for the work that actually earns its per-token cost.
The excuses are the real cost of waiting. Each one is a rep you did not get.
The compounding reps.
Part 1 said the people already building are not just ahead. They are in a different category. This week proved it. The people who were already running their own models did not get an email from Anthropic. Their bill did not change. Their stack did not break. They closed the laptop and kept working.
The gap is widening. And inside the group of practitioners, a second gap is widening — between the people who run local models and the people who measure and train them. The second gap is the one that compounds.
Part 1 was the door. Part 2 is the staircase. You do not need a PhD. You do not need an H100. You do not need permission.
You need a spreadsheet, a laptop you already own, and a weekend. Start tonight.