Thursday, June 25, 2026

7 Local Coding LLMs Worth Running on Your GPU (2026)

From Qwen3.6 to North Mini Code — practical picks, real benchmarks, and what to actually run on a 16–24 GB GPU.

If you’ve been paying $20 a month for Copilot and another $20 for Claude Code, and you’ve never actually run a coding model on your own machine, 2026 is the year to fix that. Local coding LLMs have crossed the line where they’re fast enough to feel useful, smart enough to solve real problems, and cheap enough that the only real cost is a GPU you might already own. I went through seven of the strongest open models you can run on a single consumer GPU right now, and the picture is clearer than the marketing would have you believe.

Why local coding LLMs matter in 2026

Three things changed at once. First, the open models got genuinely good — the Qwen, Gemma, and Cohere teams shipped coding-focused releases this year that benchmark close to hosted frontier models on LiveCodeBench and Codeforces. Second, GGUF quantization and llama.cpp matured to the point where a 27B model fits comfortably on a 16–24 GB GPU in 4-bit, and a 30B MoE with 3B active parameters fits on the same card with room to spare. Third, the price gap with hosted models is now obscene. A Claude Code subscription runs about $240 a year; an RTX 3090 is a one-time $600 purchase, and the model weights are free.

The other thing nobody talks about enough is privacy. When you run a coding model locally, your source code, prompts, and any private context never leave your machine. For anyone working under an NDA, in healthcare, finance, or defense, that is the actual reason to go local — not cost, not speed, but the simple fact that you can ship a model behind your firewall and audit it.

Four heavy hitters

Qwen3.6 27B MTP is the one I’d start with. It is a 27B model with a hybrid Gated DeltaNet and Gated Attention architecture, 262K context natively (extendable to 1M with YaRN), and Multi-Token Prediction that Unsloth says runs 1.5–2x faster than standard inference. The 4-bit Q4_K_XL GGUF is 17.9 GB, which fits on a 24 GB card like an RTX 3090 or 4090 with the KV cache intact. It is also multimodal — it can read screenshots and UI bugs, not just code. This is the closest thing to a default “just run this” model on the list.

Gemma 4 31B IT QAT is the multimodal workhorse. Google’s 31B dense model with Quantization-Aware Training hits 80.0% on LiveCodeBench v6 and a 2150 Codeforces ELO. It is also multimodal — text and image — but no audio on the 31B size. The 4-bit Q4_0 GGUF is 17.7 GB. If your work involves screenshots, UI bugs, diagrams, or documentation images, this is the model that does code and vision in the same response.

DiffusionGemma 26B A4B is the experimental speed play. It uses a block-diffusion architecture — 15–20 tokens per forward pass — instead of standard autoregressive decoding. Google claims 1100 tokens/sec on an H100 in FP8; on a single consumer GPU you’ll see a substantial speedup, though you pay for it: LiveCodeBench v6 drops to 69.1% (versus 77.1% for the standard Gemma 4 26B A4B). If you generate code by the screen-full and your bottleneck is throughput, this is the one to try. Note that it needs a custom llama.cpp build from PR #24423 — not a stock release.

Nemotron Cascade 2 30B A3B is for reasoning-heavy workflows. NVIDIA’s hybrid MoE — 30B total but only 3B active per token — behaves more like a reasoning agent than a coding autocomplete. The Q4_K_M GGUF is 24.7 GB, so it wants at least a 32 GB card to leave room for KV cache. NVIDIA claims gold-medal performance on IMO 2025 and IOI 2025. If you find yourself writing multi-step refactors, planning migrations, or debugging distributed systems, this model spends more time thinking than talking.

Three more worth knowing for specific workflows

Qwen3.5 9B MTP is for when your GPU is small. The 4-bit GGUF fits in roughly 6 GB of VRAM, so even an RTX 3060 or a 16 GB Mac can run it. It will not beat the bigger models on hard reasoning, but for daily coding, debugging, shell commands, and explanations, it is more than enough — and it is the safest starting point if you have never run a local coding model before.

EXAONE 4.5 33B is the document and Korean-language pick. LG AI Research’s 33B multimodal model is one of the strongest on document understanding, OCR, and Korean-language workflows. It hits 81.4% on LiveCodeBench v6 in reasoning mode. The 4-bit Q4_K_M GGUF is 20 GB. Two caveats: it is licensed under EXAONE 1.2-NC, which means it is free for personal and research use but not for commercial products, and the vision-language capability is what makes it different from the Qwen3.6 27B MTP — pick this one if your code work involves reading screenshots, PDFs, or Korean-language documentation.

North Mini Code 1.0 is the most coding-focused model on the list. Cohere’s 30B-A3B MoE was post-trained specifically for code generation, agentic software engineering, and terminal workflows. Three billion active parameters, 256K context, native function-calling support, and interleaved thinking. The UD-Q4_K_M GGUF is 19.2 GB. If you want a local coding agent that lives in your terminal and runs as the backend to OpenCode, this is the one — it was built for that use case, not as a general chatbot that happens to write code.

Pick the right local coding LLM for your hardware

Here is the decision table I would actually use, distilled from running these on different setups.

  • Qwen3.6 27B MTP — 17.9 GB in Q4, 24 GB GPU, all-round local coding, agentic workflows, multimodal.
  • - Gemma 4 31B IT QAT — 17.7 GB in Q4, 24 GB GPU, code plus screenshots, UI, diagrams, vision tasks.
  • - DiffusionGemma 26B A4B — 16.8 GB in Q4, 24 GB GPU, fast local generation, experimental, throughput.
  • - Nemotron Cascade 2 30B A3B — 24.7 GB in Q4, 32 GB GPU, reasoning-heavy, planning, multi-step refactors.
  • - Qwen3.5 9B MTP — ~6 GB in Q4, 8–12 GB GPU, small setups, daily coding, first local model.
  • - EXAONE 4.5 33B — 20 GB in Q4, 24 GB GPU, documents, PDFs, Korean-language, OCR (non-commercial).
  • - North Mini Code 1.0–19.2 GB in Q4, 24 GB GPU, terminal coding agents, repo edits, code review.

Frequently asked questions

What GPU do I actually need to run these local coding LLMs?

A 16 GB card (RTX 3060, 4060 Ti, 4070, 3090, 4080) is the entry point — the 9B and 26B-A4B models fit in 4-bit. A 24 GB card (RTX 3090, 4090, 5090) is the sweet spot, where you can run any of the heavy hitters in 4-bit with KV cache headroom. A 32 GB+ card is what you want for Nemotron Cascade 2 in higher quantizations or for long-context workloads above 128K tokens.

Can a local coding model actually replace Copilot or Claude Code?

For autocomplete, one-off scripts, debugging help, and code explanation, yes — in 2026 the gap is narrow enough that you will be happy. For very long agentic tasks, multi-step refactors, and complex planning, hosted frontier models still have an edge. The honest answer is to use both: local for the 80% of work that does not need a frontier model, hosted for the 20% that does.

Are these models really private if I download them from Hugging Face?

The weights themselves are not the privacy risk — the privacy risk is where inference happens. If you download a GGUF and run it with llama.cpp, Ollama, LM Studio, or vLLM on your own hardware, the model never phones home and your prompts never leave the machine. The exception is when you use a hosted inference provider (which routes your prompts to a remote GPU); that is not local, regardless of where the weights came from.

Do I have to use llama.cpp, or is there a simpler way?

For most people, LM Studio is the easiest entry point — it is a desktop app, you point it at a GGUF, and it runs an OpenAI-compatible local server in two clicks. Ollama is a good second choice if you want a CLI. llama.cpp, vLLM, and SGLang are what you reach for when you care about maximum throughput or specific inference features like Multi-Token Prediction on Qwen3.6.

How much does it cost in electricity to run one all day?

A 30B-A3B model at 4-bit on a 24 GB GPU draws roughly 250–350 W under sustained load. At a US average electricity rate of $0.16/kWh, that is about $0.04–$0.06 per hour, or roughly $1 a day if you run it eight hours. A month of hosted Claude Code runs $20, so the break-even on your electricity bill is essentially immediate.

If I had to pick one model to start with on a 24 GB GPU in 2026, it is the Qwen3.6 27B MTP in 4-bit. It is the most balanced of the seven — strong coding, multimodal, fast on real consumer hardware, and the most active community for support. Download the Q4_K_XL GGUF, point LM Studio at it, and you have a private coding model running in under fifteen minutes.

Press enter or click to view image in full size
A close-up photograph of a GPU graphics card with metallic heat sinks and circuit board traces, representing the hardware used to run coding language models locally on consumer hardware
Photo by Đào Hiếu on Unsplash

Wednesday, June 24, 2026

Gemini 3.5 Flash Computer Use: A Developer's First Look

If you've been wiring up browser automation the hard way — Playwright scripts that break the moment a UI changes, brittle CSS selectors, hours of debug time — Google's announcement today matters more than it looks. Gemini 3.5 Flash now ships with computer use as a built-in tool, and the workflow for getting an agent to actually see a screen, reason about it, and click around just got dramatically simpler. The model that used to need a separate integration is now a single API call away.

Bar chart titled Gemini 3.5 benchmarks showing OSWorld-Verified scores for computer use tasks across multiple difficulty tiers, with Gemini 3.5 Flash leading on the headline metric.

What changed: from a separate model to a built-in tool

Until this week, computer use in Gemini lived in a separate Gemini 2.5 computer use model. If you wanted a Flash-tier agent that could drive a UI, you had to wire up two models — a planner on one side, a computer-use specialist on the other — and stitch their outputs together. It worked, but it was the kind of architecture that made every product manager ask "why is this so expensive?"

The new release moves computer use natively into the main Gemini 3.5 Flash model. From a developer's perspective, that means the same function-calling surface you've been using for Search, Maps grounding, and structured outputs now also exposes a computer use tool you can pass into a request. The model does the perception-and-reasoning work, returns an action, and your code executes it. No second model, no second integration.

What you can actually build with it

The capability is the same one Google has been showing off in research demos for a year, but now it's fast and cheap enough to put in production. The model can see and act across browser, mobile, and desktop environments. The two demo screenshots in the announcement tell the story best: one shows Gemini 3.5 Flash using computer use to analyze the Gemini app itself and return a categorized list of features. The other shows it auditing its own documentation for accessibility issues. Both are the kind of multi-step, tool-using work that used to require a chain of prompts and a few human-in-the-loop checks.

For me, the most interesting target is the unsexy one: continuous software testing. A Flash-tier model that can drive a UI is finally affordable enough to run a smoke test against a real browser on every pull request, not just nightly. Knowledge work — the long tail of "go pull this data from a SaaS dashboard and put it in a spreadsheet" — is the obvious second use case.

Your first computer use agent in Python

The actual integration is the same loop pattern as any other tool-using agent: the model returns an action, your code executes it, you send the new screenshot back, and you repeat until the model says it's done. Here is the skeleton:

from google import genai

client = genai.Client()

# Start a turn: ask the model to drive a UI
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Open the staging environment and verify the checkout flow.",
    config={"tools": [{"computer_use": {}}]},  # exact tool surface: see docs
)

# Loop: model returns an action, you execute it, feed screenshot back
while response.candidates:
    action = parse_action(response)         # click, type, screenshot, etc.
    new_screenshot = execute(action)        # your code drives a real browser
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=[new_screenshot],
        config={"tools": [{"computer_use": {}}]},
    )

The exact tool surface lives in the Gemini API computer use docs and the reference implementation on GitHub. What is worth noting is that the loop is yours to own — the model does not ship a sandbox built in, so the safety story is partly on you.

How Google is making it safe

The honest concern with any computer use model is prompt injection: what happens when the model reads a malicious instruction baked into a webpage and acts on it? Google's answer is a defense-in-depth approach. The base model went through targeted adversarial training for computer use, and they are shipping two optional enterprise safeguards: one that requires explicit user confirmation for sensitive or irreversible actions, and another that automatically stops a task if an indirect prompt injection is detected.

The docs are explicit that you should layer these with secure sandboxing, human-in-the-loop verification, and strict access controls. That is not a hedge — it is the correct framing. A model that can click "Delete repository" needs a sandbox, an approval flow, and a kill switch, not just a polite system prompt. I would run anything production-bound behind a confirmation prompt for any action that touches a billing page, a delete button, or an external send.

Frequently asked questions

How is the built-in computer use in Gemini 3.5 Flash different from the older Gemini 2.5 computer use model?

The capability is similar — both let the model perceive a screen and return actions — but the new version is integrated directly into the main Flash model rather than shipped as a separate model. That means you wire it up with the same function-calling API you already use for Search and other built-in tools, and you do not pay for a second model call to hand off between planner and executor.

Can the agent drive any app, or just a browser?

The blog post explicitly calls out browser, mobile, and desktop environments. The model is doing pixel-level perception, so it can drive anything with a visible UI — the practical limit is usually how fast and reliably you can capture and feed back screenshots, not what the model can see.

How do I stop a computer use agent from being tricked by prompt injection on a webpage?

The base model has adversarial training for this, and Google is shipping two optional enterprise safeguards: a confirmation requirement for sensitive actions, and an automatic stop on detected indirect prompt injection. The docs recommend layering those with secure sandboxing, human-in-the-loop review, and strict access controls. In practice I would never let a computer use agent run unattended against a production account without a confirmation prompt on destructive actions.

Where can I try it before I commit to an integration?

There is a Browserbase-hosted demo environment that lets you test the capability without standing up any infrastructure. The reference implementation and a working starter are on GitHub at google-gemini/computer-use-preview, and the API surface is documented at ai.google.dev/gemini-api/docs/computer-use.

If you have been waiting for a reason to take computer use agents out of research and into your own product, this is the week. Spin up the Browserbase demo, run the reference implementation, and time how long it takes you to get a Flash-tier agent driving a real browser. That is the cheapest way to figure out whether the rest of your stack needs to be ready for this or not.

A Pragmatic Intro to RAG for Engineering Managers

I deleted a cold email this morning — for the fourth time this week — and I realized I should be writing about this, not just hitting archive. Most cold emails fail not because the sender is lazy, but because they keep doing the things the recipient was trained to spot and delete. The good news is that almost every one of those failure modes has a fix, and the fix is usually the same shape: more signal, less pitch.



The pitch leads with what the sender wants


The fastest way to lose a stranger is to open with what you want from them. "I'm reaching out to see if you'd be open to a 15-minute call about…" is a request dressed as a sentence. The reader's only question is "why should I give you 15 minutes?", and you've given them no reason to.


What to do instead: lead with the part that's about them, not you. Name a problem they're likely dealing with, in their words. "Most ops leads I talk to say their first 90 days are spent cleaning up data the previous system left behind" tells me you understand my week. The ask can come later, after you've earned a few seconds of attention.



The subject line is doing the whole job


A subject line that reads "Quick question" or "Partnership opportunity" is asking the inbox to do the work for you. If the subject line is generic, the recipient assumes the body is generic, and archives without opening. The subject line is the headline. It needs a real promise.


What to do instead: write the subject line last, after you know what the body actually delivers. Aim for something specific to the recipient's situation — not a clever hook, a clear one. "Saw your team's hiring for a RevOps lead" beats "Quick thought" every time, because it's about them, and it tells me the rest of the email will be too.



The body reads like every other cold email


A cold email that opens "I hope this message finds you well" is a template the recipient has read 200 times this quarter. They know what's coming, and they have a one-key macro to skip past it. The first sentence is the only sentence that has to feel human.


What to do instead: write the opening line the way you'd actually open a conversation. Skip the warm-up. "I read your post on RevOps turnover last week" is more human than "I came across your work and wanted to reach out." If you wouldn't say it out loud, don't put it in the email.



There's no proof, just a promise


"I've helped 50+ companies increase revenue by 3x" tells the reader nothing, because there's no way to verify it. The recipient has seen a hundred such claims. A claim without a specific story behind it is the same as no claim at all.


What to do instead: replace the claim with a single concrete result, told as a small story. "We rebuilt a SaaS billing flow last quarter and cut failed charges from 8% to 1.4% in six weeks" gives the reader something to picture. If you have a public case study, link to it. If you don't, write a two-sentence version inside the email.



The follow-up never comes — or it comes in a panic


A single cold email has a low reply rate. Almost everyone in the industry agrees on that. The mistake isn't sending only one. The mistake is how people send the second one. A triple-blast of "bumping this to the top of your inbox" reads like desperation, and gets you muted faster than silence.


What to do instead: send a small number of follow-ups, each one adding a new piece of value. The first follow-up can answer a question the recipient might have had. The second can share a relevant link. If they don't reply after two or three useful nudges, stop. Silence is also a reply, and respecting it is what makes the next email — three months from now — land.



FAQ


How many cold emails should I send before giving up on a prospect?


Two or three, with at least three or four days between them, is the standard practice. Anything more than that and your follow-ups start to feel like pressure rather than value. If they haven't replied by the third, move them to a low-priority list and circle back in 60 to 90 days with something genuinely new.


What's a good open rate for a cold email?


Open rates vary widely by industry and list quality, but a useful rough target is 40 to 60 percent. Rates far below that usually mean the subject line is generic, or the list is stale. Rates far above that sometimes mean the emails are going to a small, warm list that wasn't really "cold" to begin with.


Is it worth personalizing every cold email?


Full personalization isn't realistic at scale, but every cold email needs at least one specific line that proves the sender looked at the recipient's world. A single line — the company they work at, a recent post they wrote, a product they shipped — is usually enough. The rest of the email can follow a template; that one line can't.


Should I use a tool to send cold emails at scale?


Tools like Instantly, Smartlead, or Apollo make it possible to send hundreds of personalized cold emails a week, and they work well if the underlying message is good. No tool fixes a weak message. If your reply rate is low with a tool, it's the message, not the tool — the tool is just amplifying the problem.



What to do this week


Pick one cold email you've sent recently and rewrite the first two sentences. Strip out the ask. Add one specific line that proves you read something the recipient actually wrote or shipped. Send the rewrite to a small list — 20 to 30 names — and watch the reply rate. That single change will te        ll you more about your cold email than a hundred tips will.

How Small SaaS Teams Ship a Feature in 3 Days in 2026

A team whiteboard covered in sticky notes and a sketched 3-day timeline, used for sprint planning

If you're running product at a 10-person SaaS company, you already know the real bottleneck isn't engineering speed — it's the gap between "we should build this" and "we're actually building it." Last quarter, our team shipped a feature in 3 working days. Here's exactly how we did it, and what we'd change next time.

Cut the spec to one page

Most "slow" features aren't slow because of the code. They're slow because the spec bounces between three documents, four Slack threads, and a 30-minute "quick sync" that derails the whole afternoon. We started writing one-page specs — literally one page, single-spaced, with the problem, the user, the success metric, the out-of-scope list, and a sketched UI.

The shift that mattered wasn't length. It was ownership. One person writes it. One person reviews it. Everyone else comments inline. By the time we hit "Day 1" of the ship, nobody is asking "wait, what does this feature actually do" — the spec has already forced that conversation to happen.

We keep the spec on a single doc with a 24-hour clock. Once it's open, the doc closes to comments and becomes the contract for Day 1. New ideas that arrive after that go in a parking lot, not the spec.

One ritual, not five meetings

The fastest teams I've worked with have fewer meetings, not more. Not zero — but one. A 15-minute daily standup with three fixed slots: what's done, what's blocked, and what's the day's single output. No status updates for the sake of status.

The mistake is layering standups on top of retros, planning demos, and "weekly leadership syncs." Each one costs 30 minutes, and cumulatively they eat the week. Pick the one ritual that surfaces the next bottleneck and drop the rest. Standup is usually the one that survives.

Ship with a deploy list, not a checklist

A checklist tells you what could go wrong. A deploy list tells you what you actually do. The difference matters on Day 3 when the team is tired and the demo is in 4 hours. Our deploy list looks like: run migrations, smoke test the three flows we changed, post the changelog, ping support, click publish. Six items. No "verify the system is working" filler.

We learned this the hard way. Last year we shipped a feature with a 40-item checklist, missed one DB index migration, and rolled back at 11pm. After that, the deploy list has never had more than 8 items, and we've never had a rollback.

The two metrics that actually matter

Speed-of-ship is a vanity number if you measure it without context. The two numbers we now track on every 3-day ship are: time from "spec signed" to "first user in production," and the number of post-deploy hotfixes in the first 7 days. Everything else — commit count, story points, lines of code — is decoration.

The reason these two matter: the first tells you if the ritual is working, and the second tells you if you cut corners to make the ritual work. A team that ships fast and ships broken hasn't actually shipped. A team that ships slowly and ships clean is leaving speed on the table. The 3-day target only counts if the second number stays at zero or one.

Both metrics sit in a single dashboard that anyone on the team can pull up on Monday morning. The point isn't gamification — it's making the cost of a sloppy ship visible to the people who feel it.

FAQ

How do you handle scope creep on a 3-day ship?

Cut, don't negotiate. When a stakeholder asks for an extra screen on Day 2, the default answer is "that becomes the next 3-day ship." Saying yes to one mid-flight ask is how a 3-day ship turns into a 2-week slog. Most of those "tiny additions" aren't actually tiny.

What if the spec isn't clear by the start of Day 1?

Don't start building. The fastest fix is a 60-minute working session with the spec author, the lead engineer, and the designer in the same room (or the same doc) until it's clear enough to write a one-page spec. Most "unclear" specs aren't actually unclear — they're unaligned. The session fixes alignment, not the spec.

Is 3 days realistic for a 10-person SaaS team?

It's realistic if the team is dedicated to that one feature for the full 3 days and the spec is locked before Day 1. At 10 people, you can ship much faster than 3 days if the work is genuinely small. The 3-day target is for "real" features — a new flow, a meaningful UI change, a new integration. Anything smaller should ship same-day, not on the 3-day ritual.

Does this approach scale past 20 engineers?

Honestly, it gets harder. Beyond 20, you start hitting coordination costs that the one-page spec and one ritual can't absorb. For larger teams, the trick is to break the work into multiple 3-day ships running in parallel, each with its own spec author and ritual. The unit of work stays small; the team grows.

Try the 3-day ship on your next feature

The 3-day ship isn't a productivity hack. It's a forcing function for clarity — clarity in the spec, clarity in the ritual, clarity in the deploy list. Most teams I've seen can move at this speed; they just haven't decided to. Pick the next feature on your roadmap, write a one-page spec this week, and ship it on a 3-day clock. Post the result — and the two metrics — in a team channel on Friday. The discipline shows up in the numbers within a month.

 

Sunday, June 11, 2023

Insta Downloader Privacy Policy

     

Insta Downloader

                 Insta Downloader is a instagram video downloader extension. This extension is lightweight and requires less storage space.

Privacy Policy of Insta Downloader

                 This extension doesn't share user's private data with other users.


Universal Video Downloader Privacy Policy

     

Universal Video Downloader

                 Universal Video Downloader is a video downloader extension. This extension is lightweight and requires less storage space.

Privacy Policy of Universal Video Downloader

                 This extension doesn't share user's private data with other users.


Saturday, April 22, 2023

How to Watch Your Favorite Movies for Less Than $5 (Hint: It's Not Netflix)



If you're a movie lover like me, you probably spend a lot of money on streaming services, rentals, or tickets to watch the latest releases. But what if I told you there's a way to watch your favorite movies for less than $5 each? And no, it's not Netflix, Hulu, or Amazon Prime.


It's the Microsoft Store.


Yes, you heard me right. The Microsoft Store has a huge selection of movies that you can buy or rent for as low as $4.99. And the best part is, they have a weekly sale that offers amazing deals on popular titles. For example, this week (4/21 - 4/27), you can get movies like:


- Wonder Woman 1984 for $9.99 (50% off)

- Tenet for $9.99 (50% off)

- Soul for $14.99 (25% off)

- The Croods: A New Age for $5.99 (70% off)

- And many more!


All you need is a Microsoft account and a device that can access the Microsoft Store app, such as a Windows PC, an Xbox, or a smartphone. You can also watch your movies on any device with a web browser by using this link: https://click.linksynergy.com/fs-bin/click?id=1Q4794XxS88&offerid=1160033.10006218&type=3&subid=0


So what are you waiting for? Grab your popcorn and start browsing the Microsoft Store for some amazing movie deals. You'll be surprised by how much you can save and how much fun you can have. But hurry, this sale ends on 4/27, so don't miss this chance to watch your favorite movies for less than $5!