Friday, June 26, 2026

A New Quality Index for Native Speech-to-Speech Models

 

If you’re building a voice agent in 2026, you’ve probably noticed that picking a speech-to-speech model is still mostly vibes. The demos all sound great and the only way to know which one is best for your product is to wire up four APIs and run your own eval. On June 23, 2026, Artificial Analysis tried to make that easier: they published the Speech-to-Speech Index, a single composite score for native S2S model quality.

What the index actually measures

The index isn’t one benchmark — it’s an equally weighted average of three. Each tests a different capability a voice agent needs in production.

Big Bench Audio is the reasoning layer: 1,000 questions across formal fallacies, navigation, object counting, and web of lies. The model has to think before it speaks.

Full Duplex Bench is the conversation layer. It scores pauses, turn-taking, interruptions, and backchannels — the parts of a real phone call that demos never show.

τ-Voice is the agentic layer. It runs end-to-end customer-service scenarios across airline, retail, and telecom and checks whether the model completes the task.

A model has to have valid scores on all three to be ranked at all. That’s a meaningful bar: the index is opinionated about what good looks like for a real product.

The quality leaderboard

The composite scores for the four models in the index:

  1. OpenAI GPT-Realtime-2 (High) — 77.2%
  2. 2. xAI Grok Voice Think Fast 1.0–75.7%
  3. 3. OpenAI GPT-Realtime-1.5–72.0%
  4. 4. Google Gemini 3.1 Flash Live Preview (High) — 69.5%

The top four are within 8 points of each other, so there isn’t a runaway winner. But the shape of each model’s wins is different. GPT-Realtime-2 leads on Conversational Dynamics: it handles the awkward, real-time parts of conversation better than the rest. Grok Voice Think Fast 1.0 leads on Agentic Performance: it actually finishes customer-service tasks.

For builders, that split is the most useful signal in the index. Phone agents should weight Grok’s lead higher; long-form assistants should weight OpenAI’s lead higher.

Speed: time to first audio

For voice products, the metric that actually drives perceived quality is Time to First Audio (TTFA) — how long from the user finishing a sentence to the model starting its response:

Deepslate Opal — 0.44s

GPT-Realtime-1.5–0.82s

Grok Voice Think Fast 1.0–1.25s

GPT-Realtime-2 (High) — 2.33s

Gemini 3.1 Flash Live Preview (High) — 2.98s

The pattern: the higher the model scores on quality, the slower it tends to be. GPT-Realtime-2 is the best model in the test and 5x slower than Deepslate Opal. For a fast-back-and-forth IVR, 2.3 seconds of dead air feels broken. For a thoughtful assistant, that latency may be worth it.

Deepslate Opal isn’t in the quality ranking (it doesn’t yet have valid scores on all three sub-benchmarks), but its 0.44s TTFA opens up new product categories if the quality story holds up under your own testing.

Cost: what you’ll pay per interaction

The index also reports per-interaction cost:

Gemini 3.1 Flash Live Preview (Minimal) — $1.50

Gemini 3.1 Flash Live Preview (High) — $1.75

Grok Voice Think Fast 1.0 — $3.00

GPT-Realtime-2 (High) — $4.14

Deepslate Opal’s cost isn’t reported in the index.

The cheapest model is 2.7x cheaper than GPT-Realtime-2, and Gemini’s pricing is genuinely aggressive for a frontier-tier model. For any product where unit economics matter — a customer-service line, a high-volume voice assistant — the math on Gemini Flash Live is hard to beat. The trade-off is real, though: the cheapest model is also the lowest-scoring on the quality leaderboard.

What this means for builders

The index is a starting point, not a verdict. It tells you how four frontier S2S models compare on a fixed set of tasks. It doesn’t tell you how they’ll perform on your audio, your users, your domain, or your edge cases.

The useful pattern is this: use the index to pick a shortlist of one or two models, then run a small evaluation on your own data. A 50-call eval on your own audio with your own success metric is worth more than any external benchmark.

The split between GPT-Realtime-2’s Conversational Dynamics lead and Grok’s Agentic Performance lead is the most useful signal for product builders. Most voice products in 2026 are either more like a phone call (where dynamics matter) or more like a task (where agentic performance matters). Pick accordingly.

FAQ


What’s the difference between native speech-to-speech and a stitched pipeline?

A native S2S model takes audio in and produces audio out directly, without a separate speech-to-text → LLM → text-to-speech pipeline. Stitched pipelines are slower, lose prosody and emotion through the text step, and cost more to run. All four models in this index are native S2S.


What does Conversational Dynamics measure, and why does it matter?

Conversational Dynamics comes from the Full Duplex Bench subset — it scores pauses, interruptions, backchannels (“uh-huh”), and turn-taking. Models that score well here feel like they are actually listening. Models that score poorly feel like they are waiting for their turn to talk.


Why is the cheapest model not the quality leader?

Quality and cost are correlated in every model category, and voice is no exception. Gemini 3.1 Flash Live Preview is the cheapest ($1.50) and the lowest in the quality ranking (69.5% vs. 77.2% for GPT-Realtime-2). Every percentage point of quality costs more in compute.


Is OpenAI’s GPT-Realtime-2 worth the 2.7x cost over Gemini Flash Live?

It depends on what you’re building. For a 10-minute customer-service call where the user is venting and interrupting, the Conversational Dynamics lead is probably worth the cost. For a 30-second IVR tree, it’s overkill and the 2.33s TTFA is going to hurt.


Should I pick a model based on this index or test on my own data?

Test on your own data. The index is a shortlist tool, not a decision tool. Real production audio has noise, accents, domain jargon, and edge cases that benchmarks don’t capture. Run at least 50 calls against the top two candidates before you commit.


The index gives builders a useful starting point: four frontier models, one composite score, and clear data on what’s slow, cheap, and good at the parts of conversation that matter. Treat it as a shortlist, run a 50-call eval on your own audio, and let your users’ actual call quality be the final tiebreaker.

Thursday, June 25, 2026

Local-First AI: 1-Bit GLM-5.2 vs Claude Opus and GPT-5.5

A 1-bit GLM-5.2 GGUF ran on a 256GB Mac Studio M3 Ultra and held its own against GPT-5.5 and Claude Opus 4.8.

Three models, one workstation, no API calls — and the smallest of them kept up.

If you build with LLMs, the implication is worth sitting with. Local-first AI no longer means “good enough for chat, weak at real work.” The 1-bit GGUF is not a stunt demo. It is a working checkpoint you can pull today and serve from your own machine.

The test that made people pay attention

Daniel Hanchen from Unsloth posted the same prompt to GLM-5.2, Claude Opus 4.8, and GPT-5.5. The prompt itself was deliberately hard: build a technically impressive Backrooms horror game in a single self-contained HTML file, with a raycaster, procedural textures, infinite chunks, dynamic lighting, Web Audio, and a long list of psychological-horror behaviors. Not the kind of thing a one-shot would normally nail.

The point of the exercise, in Hanchen’s own framing, was not to crown a winner. The point was to show that 1-bit can actually perform well. The 1-bit GLM-5.2 GGUF ran at roughly 21.6 tokens per second on the Mac Studio, fast enough that one-shot felt like a normal chat session, not a wait. A side-by-side showcase video came out of the test too — generated, notably, by the Q4 model itself, running inside the Pi Coding Agent, with Puppeteer capturing frames, labels, and ffmpeg encoding the result end to end. The test produced its own documentary.

Why 1-bit is suddenly viable

A 1-bit model sounds like a contradiction. If each weight is one bit, the model is basically a lookup table. In practice, modern 1-bit and sub-1-bit quantization (Unsloth’s dynamic GGUF variants, QuIP# and related techniques) pack two weights per bit and rely on lookup-table decoders that fit neatly on Apple Silicon and modern GPUs. The math is real, and the calibration data matters a lot.

The other half of the story is hardware. Apple Silicon’s unified memory lets a 256GB Mac Studio hold a model that would otherwise need a server with multiple GPUs. When the model lives in unified memory and the quant is small enough, the bottleneck is no longer VRAM, it is memory bandwidth — and that is where M3 Ultra is genuinely fast.

So the 1-bit result is not a free lunch. It is the intersection of a specific quantization recipe, a specific architecture, and a specific workstation. But that intersection is now reachable for an individual developer, not just a research lab.

Where extreme quantization still breaks

A commenter in the same thread ran UD-Q2_K_XL against UD-Q4_K_M on the same game-dev prompt. Q2’s token usage was substantially higher than Q4’s, and on that particular test the Q2 output was slower and visibly broken. Smaller quant does not always mean “smaller and better.” Below a certain point, the model starts rambling, losing structure, or duplicating tokens to compensate for the lost precision.

This is the local-first tradeoff in one screenshot: every halving of the bit width is a coin flip. Q4 and Q5 are usually safe for instruction following. Q3 is where you start watching carefully. Q2 is research-grade — sometimes it surprises you, often it does not. The 1-bit result above is impressive, but it is also the best case the team chose to publish. Treat the lower bounds with the same respect you would give a benchmark from a paper that did not show variance.

What this means for the cost of inference

If a $5,000 Mac Studio can run a 1-bit GLM-5.2 at 21 tok/s and hold its own against two of the most expensive closed models on the planet, the per-token economics of “good enough” work just changed. You do not need to call an API for every refactor, every docstring, every test scaffold. You can run a long context on your own desk, pay nothing per token, and keep the data on the device.

That does not put OpenAI or Anthropic out of business. The frontier keeps moving, and the hardest tasks still benefit from the biggest models. But the floor — the layer of work that used to be “too expensive to use an API for but too important to skip” — is now a local inference problem. For a lot of teams, that is a bigger shift than the next flagship model.

FAQ

What is 1-bit LLM quantization?

1-bit quantization stores each weight using roughly one bit, usually by pairing two weights per bit and decoding them with a lookup table at runtime. Modern 1-bit GGUFs (Unsloth’s UD-Q2_K_XL, QuIP#-style variants) use calibration data to choose which weights to pack together, which is why the resulting model can still follow long instructions.

Can a 1-bit model really match GPT-5.5 and Claude Opus?

In Unsloth’s side-by-side, yes — on the one-shot prompt the team chose, the 1-bit GLM-5.2 was judged comparable to both frontier models. That is one test on one prompt, not a general claim. For most coding and writing tasks, expect a real quality gap; for short, well-scoped tasks, the gap can be small.

What hardware do I need to run 1-bit GLM-5.2 locally?

Unsloth’s 1-bit GGUF ran at about 21.6 tok/s on a Mac Studio M3 Ultra with 256GB of unified memory. Lighter quants like UD-Q4_K_M run on a wide range of Apple Silicon Macs and on most modern CUDA GPUs — the Unsloth docs list the exact memory requirement for each variant. The llama.cpp install is one curl line; Ollama, LM Studio, and Unsloth Studio all wrap the same backend.

How does 1-bit compare to 4-bit or 8-bit quantization?

4-bit and 5-bit GGUFs are the practical sweet spot for most production use: small enough to fit on consumer hardware, accurate enough to trust. 8-bit is closer to lossless and rarely worth the size. 1-bit and 2-bit are extreme: blazingly fast and small, but the failure mode is silent — the model runs fine, it just hallucinates more.

Is 1-bit the future of local AI?

I would not bet on 1-bit as the steady state — Q4 and Q5 will keep doing most of the work for a while. But the fact that 1-bit is now a credible option on a single workstation tells you where the curve is going. Every six months, the smallest quant that “actually works” gets smaller.

Try it on your machine tonight

Pull the 1-bit GLM-5.2 GGUF, serve it with llama.cpp or Unsloth Studio, and run the same Backrooms prompt yourself. Twenty minutes of setup, zero API spend, and you will have a first-hand answer to the only question that matters: does it actually work for you?


7 Local Coding LLMs Worth Running on Your GPU (2026)

From Qwen3.6 to North Mini Code — practical picks, real benchmarks, and what to actually run on a 16–24 GB GPU.

If you’ve been paying $20 a month for Copilot and another $20 for Claude Code, and you’ve never actually run a coding model on your own machine, 2026 is the year to fix that. Local coding LLMs have crossed the line where they’re fast enough to feel useful, smart enough to solve real problems, and cheap enough that the only real cost is a GPU you might already own. I went through seven of the strongest open models you can run on a single consumer GPU right now, and the picture is clearer than the marketing would have you believe.

Why local coding LLMs matter in 2026

Three things changed at once. First, the open models got genuinely good — the Qwen, Gemma, and Cohere teams shipped coding-focused releases this year that benchmark close to hosted frontier models on LiveCodeBench and Codeforces. Second, GGUF quantization and llama.cpp matured to the point where a 27B model fits comfortably on a 16–24 GB GPU in 4-bit, and a 30B MoE with 3B active parameters fits on the same card with room to spare. Third, the price gap with hosted models is now obscene. A Claude Code subscription runs about $240 a year; an RTX 3090 is a one-time $600 purchase, and the model weights are free.

The other thing nobody talks about enough is privacy. When you run a coding model locally, your source code, prompts, and any private context never leave your machine. For anyone working under an NDA, in healthcare, finance, or defense, that is the actual reason to go local — not cost, not speed, but the simple fact that you can ship a model behind your firewall and audit it.

Four heavy hitters

Qwen3.6 27B MTP is the one I’d start with. It is a 27B model with a hybrid Gated DeltaNet and Gated Attention architecture, 262K context natively (extendable to 1M with YaRN), and Multi-Token Prediction that Unsloth says runs 1.5–2x faster than standard inference. The 4-bit Q4_K_XL GGUF is 17.9 GB, which fits on a 24 GB card like an RTX 3090 or 4090 with the KV cache intact. It is also multimodal — it can read screenshots and UI bugs, not just code. This is the closest thing to a default “just run this” model on the list.

Gemma 4 31B IT QAT is the multimodal workhorse. Google’s 31B dense model with Quantization-Aware Training hits 80.0% on LiveCodeBench v6 and a 2150 Codeforces ELO. It is also multimodal — text and image — but no audio on the 31B size. The 4-bit Q4_0 GGUF is 17.7 GB. If your work involves screenshots, UI bugs, diagrams, or documentation images, this is the model that does code and vision in the same response.

DiffusionGemma 26B A4B is the experimental speed play. It uses a block-diffusion architecture — 15–20 tokens per forward pass — instead of standard autoregressive decoding. Google claims 1100 tokens/sec on an H100 in FP8; on a single consumer GPU you’ll see a substantial speedup, though you pay for it: LiveCodeBench v6 drops to 69.1% (versus 77.1% for the standard Gemma 4 26B A4B). If you generate code by the screen-full and your bottleneck is throughput, this is the one to try. Note that it needs a custom llama.cpp build from PR #24423 — not a stock release.

Nemotron Cascade 2 30B A3B is for reasoning-heavy workflows. NVIDIA’s hybrid MoE — 30B total but only 3B active per token — behaves more like a reasoning agent than a coding autocomplete. The Q4_K_M GGUF is 24.7 GB, so it wants at least a 32 GB card to leave room for KV cache. NVIDIA claims gold-medal performance on IMO 2025 and IOI 2025. If you find yourself writing multi-step refactors, planning migrations, or debugging distributed systems, this model spends more time thinking than talking.

Three more worth knowing for specific workflows

Qwen3.5 9B MTP is for when your GPU is small. The 4-bit GGUF fits in roughly 6 GB of VRAM, so even an RTX 3060 or a 16 GB Mac can run it. It will not beat the bigger models on hard reasoning, but for daily coding, debugging, shell commands, and explanations, it is more than enough — and it is the safest starting point if you have never run a local coding model before.

EXAONE 4.5 33B is the document and Korean-language pick. LG AI Research’s 33B multimodal model is one of the strongest on document understanding, OCR, and Korean-language workflows. It hits 81.4% on LiveCodeBench v6 in reasoning mode. The 4-bit Q4_K_M GGUF is 20 GB. Two caveats: it is licensed under EXAONE 1.2-NC, which means it is free for personal and research use but not for commercial products, and the vision-language capability is what makes it different from the Qwen3.6 27B MTP — pick this one if your code work involves reading screenshots, PDFs, or Korean-language documentation.

North Mini Code 1.0 is the most coding-focused model on the list. Cohere’s 30B-A3B MoE was post-trained specifically for code generation, agentic software engineering, and terminal workflows. Three billion active parameters, 256K context, native function-calling support, and interleaved thinking. The UD-Q4_K_M GGUF is 19.2 GB. If you want a local coding agent that lives in your terminal and runs as the backend to OpenCode, this is the one — it was built for that use case, not as a general chatbot that happens to write code.

Pick the right local coding LLM for your hardware

Here is the decision table I would actually use, distilled from running these on different setups.

  • Qwen3.6 27B MTP — 17.9 GB in Q4, 24 GB GPU, all-round local coding, agentic workflows, multimodal.
  • - Gemma 4 31B IT QAT — 17.7 GB in Q4, 24 GB GPU, code plus screenshots, UI, diagrams, vision tasks.
  • - DiffusionGemma 26B A4B — 16.8 GB in Q4, 24 GB GPU, fast local generation, experimental, throughput.
  • - Nemotron Cascade 2 30B A3B — 24.7 GB in Q4, 32 GB GPU, reasoning-heavy, planning, multi-step refactors.
  • - Qwen3.5 9B MTP — ~6 GB in Q4, 8–12 GB GPU, small setups, daily coding, first local model.
  • - EXAONE 4.5 33B — 20 GB in Q4, 24 GB GPU, documents, PDFs, Korean-language, OCR (non-commercial).
  • - North Mini Code 1.0–19.2 GB in Q4, 24 GB GPU, terminal coding agents, repo edits, code review.

Frequently asked questions

What GPU do I actually need to run these local coding LLMs?

A 16 GB card (RTX 3060, 4060 Ti, 4070, 3090, 4080) is the entry point — the 9B and 26B-A4B models fit in 4-bit. A 24 GB card (RTX 3090, 4090, 5090) is the sweet spot, where you can run any of the heavy hitters in 4-bit with KV cache headroom. A 32 GB+ card is what you want for Nemotron Cascade 2 in higher quantizations or for long-context workloads above 128K tokens.

Can a local coding model actually replace Copilot or Claude Code?

For autocomplete, one-off scripts, debugging help, and code explanation, yes — in 2026 the gap is narrow enough that you will be happy. For very long agentic tasks, multi-step refactors, and complex planning, hosted frontier models still have an edge. The honest answer is to use both: local for the 80% of work that does not need a frontier model, hosted for the 20% that does.

Are these models really private if I download them from Hugging Face?

The weights themselves are not the privacy risk — the privacy risk is where inference happens. If you download a GGUF and run it with llama.cpp, Ollama, LM Studio, or vLLM on your own hardware, the model never phones home and your prompts never leave the machine. The exception is when you use a hosted inference provider (which routes your prompts to a remote GPU); that is not local, regardless of where the weights came from.

Do I have to use llama.cpp, or is there a simpler way?

For most people, LM Studio is the easiest entry point — it is a desktop app, you point it at a GGUF, and it runs an OpenAI-compatible local server in two clicks. Ollama is a good second choice if you want a CLI. llama.cpp, vLLM, and SGLang are what you reach for when you care about maximum throughput or specific inference features like Multi-Token Prediction on Qwen3.6.

How much does it cost in electricity to run one all day?

A 30B-A3B model at 4-bit on a 24 GB GPU draws roughly 250–350 W under sustained load. At a US average electricity rate of $0.16/kWh, that is about $0.04–$0.06 per hour, or roughly $1 a day if you run it eight hours. A month of hosted Claude Code runs $20, so the break-even on your electricity bill is essentially immediate.

If I had to pick one model to start with on a 24 GB GPU in 2026, it is the Qwen3.6 27B MTP in 4-bit. It is the most balanced of the seven — strong coding, multimodal, fast on real consumer hardware, and the most active community for support. Download the Q4_K_XL GGUF, point LM Studio at it, and you have a private coding model running in under fifteen minutes.

Press enter or click to view image in full size
A close-up photograph of a GPU graphics card with metallic heat sinks and circuit board traces, representing the hardware used to run coding language models locally on consumer hardware
Photo by Đào Hiếu on Unsplash

Wednesday, June 24, 2026

Gemini 3.5 Flash Computer Use: A Developer's First Look

If you've been wiring up browser automation the hard way — Playwright scripts that break the moment a UI changes, brittle CSS selectors, hours of debug time — Google's announcement today matters more than it looks. Gemini 3.5 Flash now ships with computer use as a built-in tool, and the workflow for getting an agent to actually see a screen, reason about it, and click around just got dramatically simpler. The model that used to need a separate integration is now a single API call away.

Bar chart titled Gemini 3.5 benchmarks showing OSWorld-Verified scores for computer use tasks across multiple difficulty tiers, with Gemini 3.5 Flash leading on the headline metric.

What changed: from a separate model to a built-in tool

Until this week, computer use in Gemini lived in a separate Gemini 2.5 computer use model. If you wanted a Flash-tier agent that could drive a UI, you had to wire up two models — a planner on one side, a computer-use specialist on the other — and stitch their outputs together. It worked, but it was the kind of architecture that made every product manager ask "why is this so expensive?"

The new release moves computer use natively into the main Gemini 3.5 Flash model. From a developer's perspective, that means the same function-calling surface you've been using for Search, Maps grounding, and structured outputs now also exposes a computer use tool you can pass into a request. The model does the perception-and-reasoning work, returns an action, and your code executes it. No second model, no second integration.

What you can actually build with it

The capability is the same one Google has been showing off in research demos for a year, but now it's fast and cheap enough to put in production. The model can see and act across browser, mobile, and desktop environments. The two demo screenshots in the announcement tell the story best: one shows Gemini 3.5 Flash using computer use to analyze the Gemini app itself and return a categorized list of features. The other shows it auditing its own documentation for accessibility issues. Both are the kind of multi-step, tool-using work that used to require a chain of prompts and a few human-in-the-loop checks.

For me, the most interesting target is the unsexy one: continuous software testing. A Flash-tier model that can drive a UI is finally affordable enough to run a smoke test against a real browser on every pull request, not just nightly. Knowledge work — the long tail of "go pull this data from a SaaS dashboard and put it in a spreadsheet" — is the obvious second use case.

Your first computer use agent in Python

The actual integration is the same loop pattern as any other tool-using agent: the model returns an action, your code executes it, you send the new screenshot back, and you repeat until the model says it's done. Here is the skeleton:

from google import genai

client = genai.Client()

# Start a turn: ask the model to drive a UI
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Open the staging environment and verify the checkout flow.",
    config={"tools": [{"computer_use": {}}]},  # exact tool surface: see docs
)

# Loop: model returns an action, you execute it, feed screenshot back
while response.candidates:
    action = parse_action(response)         # click, type, screenshot, etc.
    new_screenshot = execute(action)        # your code drives a real browser
    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=[new_screenshot],
        config={"tools": [{"computer_use": {}}]},
    )

The exact tool surface lives in the Gemini API computer use docs and the reference implementation on GitHub. What is worth noting is that the loop is yours to own — the model does not ship a sandbox built in, so the safety story is partly on you.

How Google is making it safe

The honest concern with any computer use model is prompt injection: what happens when the model reads a malicious instruction baked into a webpage and acts on it? Google's answer is a defense-in-depth approach. The base model went through targeted adversarial training for computer use, and they are shipping two optional enterprise safeguards: one that requires explicit user confirmation for sensitive or irreversible actions, and another that automatically stops a task if an indirect prompt injection is detected.

The docs are explicit that you should layer these with secure sandboxing, human-in-the-loop verification, and strict access controls. That is not a hedge — it is the correct framing. A model that can click "Delete repository" needs a sandbox, an approval flow, and a kill switch, not just a polite system prompt. I would run anything production-bound behind a confirmation prompt for any action that touches a billing page, a delete button, or an external send.

Frequently asked questions

How is the built-in computer use in Gemini 3.5 Flash different from the older Gemini 2.5 computer use model?

The capability is similar — both let the model perceive a screen and return actions — but the new version is integrated directly into the main Flash model rather than shipped as a separate model. That means you wire it up with the same function-calling API you already use for Search and other built-in tools, and you do not pay for a second model call to hand off between planner and executor.

Can the agent drive any app, or just a browser?

The blog post explicitly calls out browser, mobile, and desktop environments. The model is doing pixel-level perception, so it can drive anything with a visible UI — the practical limit is usually how fast and reliably you can capture and feed back screenshots, not what the model can see.

How do I stop a computer use agent from being tricked by prompt injection on a webpage?

The base model has adversarial training for this, and Google is shipping two optional enterprise safeguards: a confirmation requirement for sensitive actions, and an automatic stop on detected indirect prompt injection. The docs recommend layering those with secure sandboxing, human-in-the-loop review, and strict access controls. In practice I would never let a computer use agent run unattended against a production account without a confirmation prompt on destructive actions.

Where can I try it before I commit to an integration?

There is a Browserbase-hosted demo environment that lets you test the capability without standing up any infrastructure. The reference implementation and a working starter are on GitHub at google-gemini/computer-use-preview, and the API surface is documented at ai.google.dev/gemini-api/docs/computer-use.

If you have been waiting for a reason to take computer use agents out of research and into your own product, this is the week. Spin up the Browserbase demo, run the reference implementation, and time how long it takes you to get a Flash-tier agent driving a real browser. That is the cheapest way to figure out whether the rest of your stack needs to be ready for this or not.

A Pragmatic Intro to RAG for Engineering Managers

I deleted a cold email this morning — for the fourth time this week — and I realized I should be writing about this, not just hitting archive. Most cold emails fail not because the sender is lazy, but because they keep doing the things the recipient was trained to spot and delete. The good news is that almost every one of those failure modes has a fix, and the fix is usually the same shape: more signal, less pitch.



The pitch leads with what the sender wants


The fastest way to lose a stranger is to open with what you want from them. "I'm reaching out to see if you'd be open to a 15-minute call about…" is a request dressed as a sentence. The reader's only question is "why should I give you 15 minutes?", and you've given them no reason to.


What to do instead: lead with the part that's about them, not you. Name a problem they're likely dealing with, in their words. "Most ops leads I talk to say their first 90 days are spent cleaning up data the previous system left behind" tells me you understand my week. The ask can come later, after you've earned a few seconds of attention.



The subject line is doing the whole job


A subject line that reads "Quick question" or "Partnership opportunity" is asking the inbox to do the work for you. If the subject line is generic, the recipient assumes the body is generic, and archives without opening. The subject line is the headline. It needs a real promise.


What to do instead: write the subject line last, after you know what the body actually delivers. Aim for something specific to the recipient's situation — not a clever hook, a clear one. "Saw your team's hiring for a RevOps lead" beats "Quick thought" every time, because it's about them, and it tells me the rest of the email will be too.



The body reads like every other cold email


A cold email that opens "I hope this message finds you well" is a template the recipient has read 200 times this quarter. They know what's coming, and they have a one-key macro to skip past it. The first sentence is the only sentence that has to feel human.


What to do instead: write the opening line the way you'd actually open a conversation. Skip the warm-up. "I read your post on RevOps turnover last week" is more human than "I came across your work and wanted to reach out." If you wouldn't say it out loud, don't put it in the email.



There's no proof, just a promise


"I've helped 50+ companies increase revenue by 3x" tells the reader nothing, because there's no way to verify it. The recipient has seen a hundred such claims. A claim without a specific story behind it is the same as no claim at all.


What to do instead: replace the claim with a single concrete result, told as a small story. "We rebuilt a SaaS billing flow last quarter and cut failed charges from 8% to 1.4% in six weeks" gives the reader something to picture. If you have a public case study, link to it. If you don't, write a two-sentence version inside the email.



The follow-up never comes — or it comes in a panic


A single cold email has a low reply rate. Almost everyone in the industry agrees on that. The mistake isn't sending only one. The mistake is how people send the second one. A triple-blast of "bumping this to the top of your inbox" reads like desperation, and gets you muted faster than silence.


What to do instead: send a small number of follow-ups, each one adding a new piece of value. The first follow-up can answer a question the recipient might have had. The second can share a relevant link. If they don't reply after two or three useful nudges, stop. Silence is also a reply, and respecting it is what makes the next email — three months from now — land.



FAQ


How many cold emails should I send before giving up on a prospect?


Two or three, with at least three or four days between them, is the standard practice. Anything more than that and your follow-ups start to feel like pressure rather than value. If they haven't replied by the third, move them to a low-priority list and circle back in 60 to 90 days with something genuinely new.


What's a good open rate for a cold email?


Open rates vary widely by industry and list quality, but a useful rough target is 40 to 60 percent. Rates far below that usually mean the subject line is generic, or the list is stale. Rates far above that sometimes mean the emails are going to a small, warm list that wasn't really "cold" to begin with.


Is it worth personalizing every cold email?


Full personalization isn't realistic at scale, but every cold email needs at least one specific line that proves the sender looked at the recipient's world. A single line — the company they work at, a recent post they wrote, a product they shipped — is usually enough. The rest of the email can follow a template; that one line can't.


Should I use a tool to send cold emails at scale?


Tools like Instantly, Smartlead, or Apollo make it possible to send hundreds of personalized cold emails a week, and they work well if the underlying message is good. No tool fixes a weak message. If your reply rate is low with a tool, it's the message, not the tool — the tool is just amplifying the problem.



What to do this week


Pick one cold email you've sent recently and rewrite the first two sentences. Strip out the ask. Add one specific line that proves you read something the recipient actually wrote or shipped. Send the rewrite to a small list — 20 to 30 names — and watch the reply rate. That single change will te        ll you more about your cold email than a hundred tips will.

How Small SaaS Teams Ship a Feature in 3 Days in 2026

A team whiteboard covered in sticky notes and a sketched 3-day timeline, used for sprint planning

If you're running product at a 10-person SaaS company, you already know the real bottleneck isn't engineering speed — it's the gap between "we should build this" and "we're actually building it." Last quarter, our team shipped a feature in 3 working days. Here's exactly how we did it, and what we'd change next time.

Cut the spec to one page

Most "slow" features aren't slow because of the code. They're slow because the spec bounces between three documents, four Slack threads, and a 30-minute "quick sync" that derails the whole afternoon. We started writing one-page specs — literally one page, single-spaced, with the problem, the user, the success metric, the out-of-scope list, and a sketched UI.

The shift that mattered wasn't length. It was ownership. One person writes it. One person reviews it. Everyone else comments inline. By the time we hit "Day 1" of the ship, nobody is asking "wait, what does this feature actually do" — the spec has already forced that conversation to happen.

We keep the spec on a single doc with a 24-hour clock. Once it's open, the doc closes to comments and becomes the contract for Day 1. New ideas that arrive after that go in a parking lot, not the spec.

One ritual, not five meetings

The fastest teams I've worked with have fewer meetings, not more. Not zero — but one. A 15-minute daily standup with three fixed slots: what's done, what's blocked, and what's the day's single output. No status updates for the sake of status.

The mistake is layering standups on top of retros, planning demos, and "weekly leadership syncs." Each one costs 30 minutes, and cumulatively they eat the week. Pick the one ritual that surfaces the next bottleneck and drop the rest. Standup is usually the one that survives.

Ship with a deploy list, not a checklist

A checklist tells you what could go wrong. A deploy list tells you what you actually do. The difference matters on Day 3 when the team is tired and the demo is in 4 hours. Our deploy list looks like: run migrations, smoke test the three flows we changed, post the changelog, ping support, click publish. Six items. No "verify the system is working" filler.

We learned this the hard way. Last year we shipped a feature with a 40-item checklist, missed one DB index migration, and rolled back at 11pm. After that, the deploy list has never had more than 8 items, and we've never had a rollback.

The two metrics that actually matter

Speed-of-ship is a vanity number if you measure it without context. The two numbers we now track on every 3-day ship are: time from "spec signed" to "first user in production," and the number of post-deploy hotfixes in the first 7 days. Everything else — commit count, story points, lines of code — is decoration.

The reason these two matter: the first tells you if the ritual is working, and the second tells you if you cut corners to make the ritual work. A team that ships fast and ships broken hasn't actually shipped. A team that ships slowly and ships clean is leaving speed on the table. The 3-day target only counts if the second number stays at zero or one.

Both metrics sit in a single dashboard that anyone on the team can pull up on Monday morning. The point isn't gamification — it's making the cost of a sloppy ship visible to the people who feel it.

FAQ

How do you handle scope creep on a 3-day ship?

Cut, don't negotiate. When a stakeholder asks for an extra screen on Day 2, the default answer is "that becomes the next 3-day ship." Saying yes to one mid-flight ask is how a 3-day ship turns into a 2-week slog. Most of those "tiny additions" aren't actually tiny.

What if the spec isn't clear by the start of Day 1?

Don't start building. The fastest fix is a 60-minute working session with the spec author, the lead engineer, and the designer in the same room (or the same doc) until it's clear enough to write a one-page spec. Most "unclear" specs aren't actually unclear — they're unaligned. The session fixes alignment, not the spec.

Is 3 days realistic for a 10-person SaaS team?

It's realistic if the team is dedicated to that one feature for the full 3 days and the spec is locked before Day 1. At 10 people, you can ship much faster than 3 days if the work is genuinely small. The 3-day target is for "real" features — a new flow, a meaningful UI change, a new integration. Anything smaller should ship same-day, not on the 3-day ritual.

Does this approach scale past 20 engineers?

Honestly, it gets harder. Beyond 20, you start hitting coordination costs that the one-page spec and one ritual can't absorb. For larger teams, the trick is to break the work into multiple 3-day ships running in parallel, each with its own spec author and ritual. The unit of work stays small; the team grows.

Try the 3-day ship on your next feature

The 3-day ship isn't a productivity hack. It's a forcing function for clarity — clarity in the spec, clarity in the ritual, clarity in the deploy list. Most teams I've seen can move at this speed; they just haven't decided to. Pick the next feature on your roadmap, write a one-page spec this week, and ship it on a 3-day clock. Post the result — and the two metrics — in a team channel on Friday. The discipline shows up in the numbers within a month.

 

Sunday, June 11, 2023

Insta Downloader Privacy Policy

     

Insta Downloader

                 Insta Downloader is a instagram video downloader extension. This extension is lightweight and requires less storage space.

Privacy Policy of Insta Downloader

                 This extension doesn't share user's private data with other users.