Two signals pointing the same direction
A company at the absolute frontier of AI development could not keep its own roadmap from leaking for more than a few hours. Not because it is careless. Because the pace is so fast that operational security is losing the race to the release cadence.
At the same time, compute infrastructure costs are still rising. Data center buildouts at historic scale. GPU lead times stretched out. The big players signing hundred-billion-dollar deals for chips and electricity.
Rising capability plus rising cost means one thing: the gap between the people who have started and the people who have not is widening every single day. The people already building are not just ahead. They are in a different category. This post is about how to join them.
Cloud AI vs. running AI locally: what's actually different
When most people use AI, they are sending their question to a server somewhere, waiting for an answer, and paying per request. That is cloud AI. ChatGPT, Claude, Gemini — all cloud. Your data leaves your device, goes to a data center, gets processed, and comes back.
Running AI locally means the model lives on your machine. Your laptop, your desktop, your server. Nothing leaves your device. No API bill. No latency from a round trip across the internet. No usage limits. You run a command and the model answers directly from hardware sitting in front of you.
Cloud = someone else's computer
ChatGPT and similar tools are powerful but you are renting intelligence. Local AI means you own it. No monthly fee, no data leaving your machine, no rate limits.
Local = full control of the stack
Local inference lets you swap models mid-pipeline, run custom fine-tunes, avoid vendor lock-in, and operate entirely offline. Critical for privacy-sensitive applications.
The models you can download right now
The models available for free download today would have been considered frontier AI two years ago. Llama (Meta), Mistral (Mistral AI), Qwen (Alibaba), Phi (Microsoft) — these are not research toys. They are production-grade systems powering real applications.
The difference between these and ChatGPT is not primarily quality. For most everyday tasks, the gap is small to negligible. The difference is that you can download them, run them yourself, modify them, and build on top of them without asking anyone's permission or paying per query.
Open source models are why the window to build is still open. You do not need a corporate API budget to start. You need a machine and fifteen minutes.
What quantizing means and why it matters
A full AI model is enormous. Llama 3 70B at full 16-bit precision takes up over 140GB of storage and needs high-end server hardware to run. Most people do not have that — and they do not need it.
Quantizing is the process of compressing a model by storing its numbers at lower precision. You trade a small amount of accuracy for a massive reduction in size and memory requirements. Think of it like converting a raw audio file to a high-quality MP3. You lose a tiny bit of detail. You gain something that actually fits on your device and runs fast.
A quantized 7 billion parameter model takes up about 4GB and runs on any modern laptop. A 13B model runs on a machine with a decent GPU. The quality difference from the full-size version is barely noticeable for most tasks.
Quantized models typically ship in a file format called GGUF. When you see a model labeled Q4, Q5, or Q8, those are quantization levels: roughly how many bits are used to store each weight.
Q4: Smallest file size. Fastest on CPU. Slightly lower quality. Good starting point for older hardware.
Q5: Good balance. Recommended for most users. Runs well on modern laptops and mid-range GPUs.
Q8: Closest to full quality. Needs more RAM. Best for machines with 16GB+ VRAM or unified memory.
Start with Q5. It runs well on most hardware and the quality is excellent for everyday use.
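Where do those sizes come from? A back-of-the-envelope sketch in Python: multiply parameter count by bits per weight and divide by eight. It is an approximation, since real GGUF files keep some tensors at higher precision and carry a little overhead, but the arithmetic explains the roughly 4GB figure above.

    def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
        # parameters x bits per weight / 8 bits per byte ~= size in GB
        return params_billions * bits_per_weight / 8

    for label, bits in [("Q4", 4), ("Q5", 5), ("Q8", 8), ("FP16", 16)]:
        print(f"7B at {label}: ~{approx_size_gb(7, bits):.1f} GB")

    # 7B at Q4:   ~3.5 GB
    # 7B at Q5:   ~4.4 GB
    # 7B at Q8:   ~7.0 GB
    # 7B at FP16: ~14.0 GB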
Running your first local model in five minutes
The easiest way to run local models is a tool called Ollama. It handles everything: downloading models, running the inference server, managing versions. Three commands and you are running.
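Assuming a Linux machine (on a Mac or Windows, the first step is the installer from ollama.com instead), the three commands look like this. The llama3 tag is just one example model:

    # 1. Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh

    # 2. Download a quantized model
    ollama pull llama3

    # 3. Chat with it in your terminal
    ollama run llama3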
That is it. You now have a local AI running on your machine with no API key, no cloud dependency, and no per-query cost. Ollama also exposes the model at localhost:11434 — compatible with any tool that supports the OpenAI API format, which means it works as a drop-in replacement in most AI apps and code.
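Here is what that drop-in compatibility looks like in practice: a minimal sketch using the official openai Python package aimed at the local server. It assumes you pulled llama3 above; the api_key value is a placeholder the client library requires but Ollama ignores.

    # pip install openai
    from openai import OpenAI

    # The same client you would point at a cloud provider, aimed at localhost.
    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    response = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    )
    print(response.choices[0].message.content)

Swap base_url and model and the identical code talks to a cloud provider, so moving in either direction is a two-line change.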
Want to go bigger?
Once you are comfortable, step up within the same families for different use cases: larger Llama, Mistral, and Qwen variants are a single pull away, and the Q4/Q5/Q8 tradeoffs above tell you what your hardware can handle.
Stop guessing. Start measuring.
An eval is how you stop wondering if your AI is working and start knowing.
Here is the beginner version: write ten questions. Write the answers you expect. Run your model against them. Compare. That is an eval. A spreadsheet works fine. Questions in column A, expected answers in B, model outputs in C, score in D.
Here is why it matters more than almost anything else: once you have a baseline, every change has a scoreboard. Different model, different prompt, different settings — you can see immediately whether things got better or worse. The teams moving fastest with AI are not using smarter models than everyone else. They are iterating faster because they have better feedback loops.
Manual eval in a spreadsheet
Ten questions, expected answers, model outputs. Score each one 0 or 1. Track your average over time. Simple and effective.
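When the spreadsheet gets tedious, the same loop is a short Python script. A minimal sketch, building on the local setup above: evals.csv is a hypothetical file with question and expected columns, and the containment check is a deliberately crude scoring rule you will want to tighten for your own tasks.

    # pip install openai
    import csv
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="llama3",
            messages=[{"role": "user", "content": question}],
        )
        return resp.choices[0].message.content

    scores = []
    with open("evals.csv", newline="") as f:  # columns: question, expected
        for row in csv.DictReader(f):
            output = ask(row["question"])
            # Score 1 if the expected answer appears in the output, else 0.
            scores.append(int(row["expected"].lower() in output.lower()))
            print(f"{scores[-1]}  {row['question']}")

    print(f"Average: {sum(scores) / len(scores):.0%}")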
Automated eval pipelines
Tools like LangSmith, Braintrust, and PromptFoo let you run hundreds of test cases automatically and track performance across model versions and prompts.
A model that knows your domain
Fine-tuning means taking an existing open source model and training it further on your own data. The result is a model that knows your domain, matches your style, and can outperform far larger general-purpose models on your specific tasks.
This used to require a research team and server infrastructure. It does not anymore. A tool called Unsloth lets you fine-tune a 7B model on a consumer GPU in a few hours. You need a dataset of examples, a GPU with at least 8GB VRAM, and an afternoon.
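The dataset is the part you actually have to produce, and its shape is simple. Here is a sketch of one record in the common instruction/response format; the exact field names vary by tool, so treat them as illustrative rather than canonical.

    import json

    # One training example: a prompt your users might send,
    # paired with the answer you want the tuned model to give.
    example = {
        "instruction": "Summarize this support ticket in one sentence.",
        "input": "Customer says the export button does nothing in Safari.",
        "output": "A Safari user cannot trigger exports; suspected popup-blocker issue.",
    }

    # Fine-tuning tools commonly ingest JSONL: one JSON object per line.
    with open("train.jsonl", "a") as f:
        f.write(json.dumps(example) + "\n")

A few hundred records in this shape is often enough to noticeably shift a small model's tone and domain knowledge.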
You do not need to start here. But knowing this exists changes what you think is possible. A model trained on your customer support history. A model trained on your company's documentation. A model trained on your own writing. All of this is within reach for someone with a laptop and a free weekend.
The compounding advantage is just reps.
Every week you spend running local models, writing evals, and iterating on prompts is a week of distance between you and the person still waiting for the right moment.
The signals say the frontier is moving fast and the cost to play is going up. The map says: install Ollama, pull a model, write ten test cases, and start this week.
That is it. That is the whole thing. The window is open. Start now.