Tuesday, June 30, 2026

China Just Dropped 20M Free AI Tokens (And Nobody Noticed)

 

Photo by Mohammad Rahmani on Unsplash

If you’re a developer who pays $20/month for Cursor Pro, $10 for GitHub Copilot, or burns through API credits like candy — stop. A 756-billion-parameter coding model just landed with a 20-million-token free tier. No credit card. No subscription. And somehow, almost nobody in the Western developer community is talking about it.

What Zhipu AI’s GLM-5.2 Actually Gives You

GLM-5.2 is a massive 756B-parameter model from Zhipu AI, a Beijing-based AI lab often described as China’s closest equivalent to OpenAI. The headline offer is 20 million free API tokens for new developers — not a trial, not a “first month free” gimmick. Create an account and you get the full quota immediately, no billing info required.

Beyond the token grant, you also get 120 free image and video credits, access to GLM-5.2’s “High” and “Max Thinking” reasoning modes, and a 1-million-token context window. That’s large enough to feed an entire codebase into a single prompt and still have room for instructions.

The API is OpenAI-compatible. You can point Cursor, Claude Code, Cline, or any OpenAI SDK at it by swapping the base URL and model name. No custom integration, no new tools to learn. If your editor already speaks OpenAI, it already speaks GLM-5.2.

How It Stacks Up Against What You’re Already Paying For

Do the math on your current AI coding stack. Cursor Pro costs $20/month. GitHub Copilot is $10/month. Claude API charges per token. Add them up and you are looking at $30+ per month for tools that help you write code faster.

GLM-5.2 replaces all of them at zero cost for the first 20 million tokens. For a solo developer or small team experimenting with AI-assisted coding, the savings add up fast. Twenty million tokens goes a long way — hundreds of code completions, dozens of full-file refactors, and plenty of room for trial and error.

GLM-5.2 also supports a “Max Thinking” mode that applies chain-of-thought reasoning to complex coding tasks. In practice, this means better results on multi-step refactors, debugging sessions, and architectural decisions — exactly the places where smaller models fall apart.

Why the Silence?

If the offer is real and the model is competitive, why isn’t everyone talking about it? Three factors explain the gap.

Geographic attention bias. Chinese AI labs rarely receive the same Western media coverage as OpenAI, Anthropic, or Google DeepMind. A breakthrough from Beijing doesn’t trend on Hacker News the same way one from San Francisco does. This isn’t new — it’s been true since the earliest days of China’s AI industry.

Trust and data privacy. GLM-5.2 routes through Chinese infrastructure. For many Western developers and enterprises, that’s a dealbreaker. Data residency requirements, compliance policies, and geopolitical caution create a barrier that no amount of free tokens can overcome.

The U.S.-China AI perception gap. Some developers avoid Chinese models on principle; others assume they can’t be competitive. The assumption is increasingly outdated — several Chinese models now rank in the top tier of coding benchmarks — but the perception lingers. GLM-5.2’s 756B parameter count and benchmark scores are competitive with frontier Western models, but mindshare in the developer community hasn’t caught up.

How to Try It in Two Minutes

Here’s the fastest path to get coding with GLM-5.2:

  1. Register at open.bigmodel.cn
  2. 2. Verify your account with your phone number (OTP arrives in a few minutes)
  3. 3. Create an API key from the dashboard
  4. 4. Set your base URL to the GLM-5.2 endpoint
  5. 5. Select model: glm-5.2

For Cursor users: open Settings, go to Models, add a new model provider, and paste your GLM-5.2 API key and base URL. For OpenAI SDK users: set OPENAI_BASE_URL and OPENAI_API_KEY as environment variables. For Cline and similar tools: the model provider setup screen accepts any OpenAI-compatible endpoint — add GLM-5.2 as a custom provider and you’re done.

The Real Catch — Three Caveats Worth Knowing

A free offer at this scale comes with tradeoffs worth understanding up front.

Data residency. Zhipu AI’s API servers are in China. If your codebase contains sensitive or proprietary code that you can’t route through Chinese infrastructure, this isn’t for you. No free tier is worth a compliance violation.

Phone verification. Registration requires a phone number and OTP. Some non-Chinese users report delays receiving the verification code. If you’re outside China, budget a few extra minutes for this step.

Long-term uncertainty. Zhipu AI hasn’t published clear post-quota pricing. The 20M free tokens are framed as a developer acquisition play rather than a limited promotion, but any free API offering can change. If you build a workflow around it, keep a paid fallback ready.

FAQ

Can I use GLM-5.2 with Cursor or VS Code?

Yes. The API is fully OpenAI-compatible. Add it as a custom model provider in Cursor, Claude Code, Cline, Continue.dev, or any tool that supports OpenAI’s API format. Just swap the base URL and model name — no custom integration needed.

How does GLM-5.2 compare to GPT-4o or Claude 4 Sonnet?

GLM-5.2’s 756B parameters put it in the same weight class as the largest frontier models. On coding benchmarks, it scores competitively. The practical differentiators are the 1M-token context window and the zero-cost entry point. Most developers report it handles complex refactoring and debugging well in Max Thinking mode.

What data does Zhipu AI collect from API calls?

Zhipu AI’s data handling policies are less transparent than Western providers. Review the terms of service carefully before sending proprietary code. For open-source or personal projects this is less of a concern, but enterprise teams should involve legal before routing sensitive code through the API.

Is phone verification required for all users?

Yes, registration requires a phone number. Some users outside China report OTP delivery delays. If you don’t receive the code within a few minutes, try again after an hour — the system sometimes throttles international SMS.

Will the 20M free tokens refresh or is it a one-time grant?

The 20M tokens are a one-time welcome grant for new developers, not a recurring monthly quota. Zhipu AI hasn’t announced post-consumption pricing. Pace your usage accordingly: use it for evaluation and experimentation first, migration second.

Try It Before the Quota Changes

Twenty million tokens is enough to decide whether GLM-5.2 fits into your workflow. Sign up, point your editor at it, and spend an afternoon testing it on your actual codebase. If it works, you’ve just eliminated a monthly subscription. If it doesn’t, you’re out two minutes and zero dollars.

The offer is real. The model is competitive. The silence from the Western developer community won’t last forever — and when the conversation starts, you’ll already have an opinion.

Copilot Beats Claude Code on Cost and Matches It on Quality

 

GitHub benchmarked its agentic harness across 5 test suites and found that model-agnostic agents deliver the same results for fewer tokens. Here’s what that means for your next project.

If you’re a developer deciding between Copilot, Claude Code, and Codex CLI, you’ve probably seen plenty of claims but not much controlled data. Last week, GitHub published a head-to-head comparison of the Copilot agentic harness against the native harnesses that ship with leading models — holding the model and task fixed across five separate benchmarks.

The results land on a scatter plot that’s hard to ignore. Copilot CLI clusters in the top-left corner: high task resolution at low cost. Claude Code and Codex CLI sit to the right, spending more per task for equivalent or worse resolution.

What GitHub Actually Tested

GitHub ran every harness-model combination across five benchmark suites: SWE-bench Verified, SWE-bench Pro, SkillsBench, TerminalBench, and Win-Hill. The methodology held the model constant and varied only the harness, isolating the harness as the variable rather than the model’s capability. On TerminalBench 2 alone, every configuration was run five times with a shaded ±1σ ellipse to capture variance.

This matters because most AI coding tool comparisons conflate model quality with harness quality. A great model inside a wasteful harness gives you expensive, slow results. A decent model inside an efficient harness might outperform it.

The Chart That Changes the Calculation

The scatter plot splits into two clear clusters. On the left, GPT-family models (GPT-5.4, GPT-5.5) run at $0.40–$0.60 per task with 65–70% resolution. On the right, Claude-family models (Sonnet 4.6, Opus 4.7) run at $0.80–$1.40 per task with 68–78% resolution. Copilot CLI holds the top-left position across both clusters — above-average resolution with below-average cost.

Why Token Efficiency Is the Real Story

The headline number isn’t resolution — it’s tokens. Across most configurations, the Copilot agentic harness used fewer tokens to reach the same result. Fewer tokens means faster feedback loops, lower latency during interactive use, and cheaper CI/CD integrations.

For a team running AI-assisted code reviews or automated patch generation, token count maps directly to cost per operation. A harness that wastes tokens on verbose planning traces adds up fast. At scale, the difference between $0.60 and $1.20 per task isn’t academic — it’s your monthly infrastructure bill.

Vendor-native harnesses are the worst offender. They’re optimized for one model’s output format and don’t adapt. The Copilot harness, by contrast, is model-agnostic — it speaks the same protocol to any underlying model and strips out overhead.

The 20-Model Advantage No One’s Talking About

Consider a concrete scenario. You’re iterating on a React component. Quick feedback matters, so you route the task to GPT-5.5 through Copilot — fast, cheap, good enough. Then you hit a tricky race condition in the state management. You route the same conversation to Claude Opus 4.7 — deeper reasoning, more tokens, higher cost, but the bug is complex. Once the fix is validated, you’re back to the fast model.

This is impossible with vendor-native harnesses. Codex CLI locks you into OpenAI’s model line. Claude Code’s harness locks you into Anthropic’s. The Copilot agentic harness supports more than 20 models and lets you switch per task without changing your workflow.

How to Choose Your AI Coding Agent Now

Prioritize model diversity. A team using a single model’s native harness is one API deprecation away from rebuilding their workflow. A model-agnostic harness insulates you.

Watch cost per task, not cost per token. A tool that uses 2x the tokens for the same result is expensive regardless of per-token pricing. The benchmark data gives you the real metric.

Test on your own workload. Run a side-by-side on your most common task — a PR review, a refactor — and measure both resolution and tokens.

The data doesn’t say Copilot is categorically better. It says a well-designed agentic harness beats a vendor-locked one, regardless of the model underneath.

FAQ

Does Copilot support models other than OpenAI?

Yes. The Copilot agentic harness works with more than 20 models from OpenAI, Anthropic, and others. You can switch between them per task without changing your editor or workflow.

How do the benchmarks translate to real-world use?

Benchmarks measure task-completion accuracy in controlled environments. Real-world results vary by codebase, but the relative efficiency advantage — fewer tokens for the same resolution — tends to carry over because it’s a harness property, not a model property.

Should I switch from Claude Code to Copilot based on this data?

Not necessarily. If Claude Code gives you results you’re happy with and cost isn’t a concern, there’s no urgent reason to switch. But if you’re comparing tools from scratch or feeling the cost of verbose agent traces, the data suggests a model-agnostic harness delivers better economics.

What is an agentic harness?

It’s the middleware between you and the model. It decides how to break a task into steps, what context to include, when to call tools, and how to format the response. A good harness minimizes wasted tokens while maximizing task completion.

Can I use Copilot’s harness with my own API keys?

Copilot is a paid GitHub subscription. You can’t bring your own model API keys, but the model choice within the harness — across 20+ models — is included in the subscription.

Pick a Task and Measure

Open GitHub’s chart and look at the scatter plot for yourself. Then pick one task from your daily work — something you’d normally ask an AI assistant for — and run it side by side in Copilot and your current tool. Measure time to resolution and tokens used. The benchmark is a useful signal, but your own workflow is the only test that matters.

Your AI Coding Subscription Is Draining Your Wallet

 

Photo by Luca Bravo on Unsplash

I opened my credit card statement last month and found four charges I didn’t remember approving — $347 total, all for AI coding subscriptions. I cancelled three of them that same afternoon.

For a while, “unlimited” AI coding plans felt like the obvious choice. Pay a flat fee, use the assistant as much as you want, never think about tokens or credits. But that model never made economic sense for the companies running it — advanced models are expensive to serve — and the pendulum has swung hard toward measured usage.

I actually prefer the new direction. Token-based and quota-based plans let you budget your consumption, work in bursts without penalty, and never wonder if the “unlimited” label is about to get throttled. The hard part is figuring out which plan actually delivers.

Here’s what I found after a month of testing five AI coding subscription plans against real development workflows.

The End of Unlimited (And Why You Should Be Happy)

The tipping point was inevitable. Running frontier coding models costs real compute, and the old “all you can eat” pricing was burning VC cash, not building sustainable products.

What emerged instead is a matrix of options: token-based plans where you buy a pool of tokens each month, credit-based plans that meter specific capabilities, and quota-based plans that refresh weekly or daily. Each model suits a different working style, but they all share one upside — you know what you’re paying for.

The plans I tested: MiniMax Token Plan, MiMo Token Plan, GLM Coding Plan, OpenAI Codex (included with ChatGPT), and Kimi Code. Each got at least a week of real coding time.

MiniMax Token Plan — $20 for More Tokens Than You’ll Use

MiniMax’s Token Plan is the easiest recommendation on this list. For $20 a month, you get access to MiniMax’s coding models through their web app and desktop app, plus integrations with Claude Code, Cursor, Cline, Kilo Code, Roo Code, Codex CLI, and OpenCode.

The token allowance is generous. For daily coding — debugging, refactoring, running agentic workflows — I never came close to exhausting it. If you want to start even smaller, prepaid credits begin at $5.

This is the plan I’d recommend to any developer who wants high usage at a low price, no games, no hidden throttles.

MiMo Token Plan — The Speed King Nobody’s Talking About

MiMo surprised me more than any other plan on this list. The responses are fast, it uses fewer reasoning tokens than comparable services, and the UI generation quality is genuinely good.

The plan runs on credits that refresh monthly. You use them across MiMo’s model lineup, including MiMo-V2.5-Pro, which supports up to a 1 million-token context window and is built for agentic coding and long-horizon software tasks. It integrates with tools like OpenCode, Cline, OpenClaw, Kilo Code, and Blackbox.

If you’re building custom AI workflows or testing multiple models in parallel, MiMo’s combination of speed and token efficiency makes it a strong second option. It’s not a full IDE subscription — it’s a model access plan — but for agentic coding, it punches above its price.

GLM Coding Plan — Worth It Only If You Need GLM Models

GLM’s Coding Plan from Z.ai has gone through changes recently, and the price has increased. The company is investing in better models like GLM-5.2 and deeper integrations with coding tools, and the subscription reflects that cost.

Here’s the honest take: if you specifically want GLM models for your coding workflow — they work with Claude Code, Cline, Kilo Code, OpenCode, and OpenClaw — the plan delivers. The models are strong for focused coding agent sessions.

But if you’re just looking for the best generic coding subscription, cheaper options exist. GLM made more sense before the price increase. Today, use it when you need GLM-5.2 specifically.

OpenAI Codex — Free (If You Already Pay for ChatGPT)

OpenAI Codex lives inside the VS Code extension, and it’s the plan I use most days — not because it’s the best, but because it’s included with my ChatGPT subscription.

Codex understands your codebase well, handles code generation, debugging, project edits, and large-codebase navigation. The catch is the daily and weekly limits. In a serious coding session, those limits can disappear within an hour. OpenAI lets you buy extra credits as a cushion, but that adds to the cost.

The math is simple: if you already pay for ChatGPT, use Codex as your daily driver. When you hit the limit, switch to MiniMax or MiMo as a backup. No need for a separate primary subscription.

Kimi Code — Predictable Quota, No Monthly Burnout

Kimi Code uses a weekly refreshed quota instead of a monthly token pool. You get a set amount of usage every week, and it resets — no rollover, no guessing.

The Kimi K2.7 Code model handles codebase understanding, terminal tasks, file edits, debugging, refactoring, and feature building. You can access it through the web app, VS Code extension, and CLI.

The weekly refresh is an interesting tradeoff. If you code consistently every week, it works well. If you have heavy weeks and light weeks, a monthly token pool gives you more flexibility. Kimi Code is a solid choice if you’re already in the Kimi ecosystem or prefer K2.7 over other models.

FAQ

Can I use multiple AI coding subscriptions at once?

Yes, and most developers I know do exactly this. A common setup: OpenAI Codex as the daily driver (included with ChatGPT), with MiniMax or MiMo as a backup for heavy coding sessions when Codex limits run out.

Which AI coding plan is best for someone new to AI coding?

Start with OpenAI Codex if you already subscribe to ChatGPT. If you don’t, the MiniMax Token Plan at $20 a month is the lowest-risk entry point with the broadest tool support.

Do token-based credit plans expire?

It depends on the plan. MiniMax offers prepaid credits starting at $5 that you use when needed. Monthly token subscriptions reset each billing cycle. Kimi Code’s quota refreshes every week and does not roll over. Always check the plan’s expiration policy before buying.

How do I know which plan fits my workflow?

Match the pricing model to your work pattern: monthly token plans for bursty usage (heavy sprints, then lighter weeks), weekly quotas for consistent daily coding, and included subscriptions (Codex with ChatGPT) for the baseline you already pay for.

Here’s the quick-reference comparison:

  • MiniMax Token Plan: $20/month token pool — Best value on the list
  • - MiMo Token Plan: Monthly credits — Fast and token-efficient
  • - GLM Coding Plan: Quota-based subscription — Only if you need GLM
  • - OpenAI Codex: Included with ChatGPT — Free if you’re already paying
  • - Kimi Code: Weekly refreshed quota — Solid but niche

Open your billing page right now. If you’re paying more than $50 a month for any single AI coding subscription on this list, try swapping for a month. Start with Codex if you already have ChatGPT — it’s already on your bill. Add MiniMax or MiMo as a $20 backup. I saved $160 my first month, and my output didn’t drop.

Monday, June 29, 2026

Claude Rewrote 5,000 Lines of My Code — Here’s What I Learned

AI code generation concept with abstract technology patterns representing machine learning and software development
Photo by Numan Ali on Unsplash

If you’re a developer who hasn’t touched Claude yet, I get the skepticism. I was there a few weeks ago. I’d watched the demos, read the tweets, nodded along — and kept writing code the same way I always had. Then I gave it a real test: refactor a legacy module I’d been dreading. Five thousand lines, six months old, written by someone who’d already left. I expected a mess. What I saw changed how I think about AI-assisted development.

Why I Started Skeptical (and You Should Be Too)

Every AI coding tool makes the same promises. “Write code faster.” “Fewer bugs.” “Ship more.” And every one I’d tried before Claude delivered on maybe half of those. GitHub Copilot was great at autocomplete but useless for architecture. ChatGPT could write a function but couldn’t hold context across an entire codebase. I’d learned to use AI as a fancy autocomplete, not a collaborator.

Claude, specifically Claude Code (Anthropic’s terminal-based agent), promised something different: not just writing code, but understanding it. Reading entire projects, reasoning about architecture, and making changes that spanned multiple files. I wanted to believe it. I also wanted proof.

The Test I Threw at It — A 5,000-Line Refactor

The module was an internal dashboard API written in Node.js. Six months of organic growth had turned it into a god object nightmare: one file handled auth, routing, database queries, email notifications, and caching. Every new feature required touching at least four functions in the same file. Tests were sparse. Comments were aspirational.

I pointed Claude Code at the repo and gave it a single instruction: “Refactor this API into a clean layered architecture. Split concerns. Don’t break the tests.”

It started by reading every file in the project. Not the file I pointed at — the entire project. It identified imports, mapped dependencies, and built a mental model of the codebase. Then it wrote a plan: which files to create, what to extract, how to wire the layers together. It asked one clarifying question about the authentication flow before it began.

What Claude Did That No Other AI Could

It maintained context across every file it touched. When it renamed a function in the service layer, it updated every import and every caller across the entire project — not just the file it was editing.

Second, it understood the test suite. It ran the tests after every change and caught regressions I would have missed. When a test broke, it didn’t just report the failure — it read the test, understood what it was testing, and adjusted the implementation until the test passed.

Third, it made judgment calls. At one point it had a choice between two refactoring strategies: extract a base class or use composition. It chose composition, left a comment explaining the tradeoff, and asked me to confirm before proceeding. That felt less like a tool and more like a junior developer who’d read the same books I had.

Where Claude Still Falls Short (I Tried to Break It)

I spent the second week trying to find its limits. I found several.

It struggles with highly unconventional code. If your project uses exotic patterns or undocumented frameworks, Claude can hallucinate APIs that don’t exist. It also has a blind spot for performance — its first pass at a database query used N+1 patterns that would have crushed production. And it lacks domain intuition. It can refactor a payment module’s structure but can’t tell you if the business logic for refunds is wrong.

The tool also defaults to verbose code. It writes defensive, over-documented, enterprise-style code by default unless you explicitly tell it to be concise. The first pass added more comments than I was comfortable maintaining.

How I Get the Best Out of Claude Now (My Playbook)

After two weeks of trial and error, I settled on a workflow that consistently delivers:

Start with a written spec. A vague instruction gives you a vague result. I now write one paragraph describing the outcome, one constraint sentence, and one “don’t do this” sentence. That’s usually enough.

Start with a written spec. A vague instruction gives you a vague result. I now write one paragraph describing the outcome, one constraint sentence, and one “don’t do this” sentence.

Review every change file by file. Claude Code’s diff view is excellent. I read every change before accepting it. This caught the N+1 query and two unnecessary abstractions.

Use it for the boring stuff. The real win wasn’t the architecture decisions — it was Claude handling the boilerplate: writing migration scripts, updating type definitions, syncing documentation, fixing lint errors across 30 files.

Verify the tests yourself. Claude runs tests and reports results, but I still run them locally before committing. Once, a test passed in Claude’s headless environment but failed on my machine due to a timezone issue. Trust but verify.

FAQ

Is Claude better than GitHub Copilot for coding?

They solve different problems. Copilot is excellent at inline autocomplete — finishing your line, generating the next function. Claude is better at multi-file reasoning, refactoring, and architectural changes. I use both: Copilot for the in-the-moment flow, Claude for the structural work.

Can Claude work with legacy codebases?

Yes, and this is where it shines. It reads your entire project context before making changes, so it understands the existing patterns, naming conventions, and dependency graph. I’ve seen similar results on Rails, Python, and Go projects of varying ages.

How much does Claude Code cost?

Claude Code is included with Claude Pro ($20/month) and Claude Max subscriptions. The Pro plan is sufficient for individual developers. The Max plan offers higher usage limits for teams running large refactors daily.

Does Claude write secure code?

It writes code that follows standard security patterns (input validation, parameterized queries, proper error handling), but it won’t catch domain-specific security issues. Always review autogenerated code for business logic vulnerabilities. Claude is a tool, not a security auditor.

Can Claude replace junior developers?

No — and framing it that way misses the point. Claude handles the mechanical parts of coding efficiently, but it can’t attend standups, understand product context, or negotiate tradeoffs with stakeholders. What it does do is remove the grunt work so developers can focus on the parts that require human judgment.

Give Claude One Bad Codebase This Week

Pick the module you’ve been avoiding — the one with the TODO comment that says “refactor this when we have time.” That time is now. Point Claude at it, write a one-paragraph spec, and see what happens. You might be surprised. I was.

The worst case is you review its output and throw it away. The best case is you reclaim a week of your life. Either way, you’ll know whether Claude is right for your workflow. I already know mine.


The White House Just Put OpenAI on a Leash

 

Photo by Andrew Neel on Unsplash

Last week, the White House did something it has never done before: it asked OpenAI to keep its next AI model on a short leash, releasing it only to a small group of government-approved partners before the public ever gets to see it.

If you follow artificial intelligence news, you know government involvement usually comes months after a model ships — not before. This time is different.

What Actually Happened with GPT-5.6

The White House’s Office of the National Cyber Director and the Office of Science and Technology Policy sent a direct request to OpenAI earlier this week. The message was clear: GPT-5.6 should not launch broadly. Instead, it should go to a limited circle of government-approved partners who can test and evaluate it before any wider release.

According to a report by Axios, this marks the first instance of the US government seeking to limit the rollout of a frontier AI model before its public release. Commerce Secretary Howard Lutnick personally discussed the model with OpenAI CEO Sam Altman, pressing to ensure that relevant government agencies had tested and approved it before a broader rollout.

The request is notable for what it is not. It is not a formal regulation, a legal mandate, or a legislative action. It is a direct ask from the executive branch — one that OpenAI appears to be complying with voluntarily. In a world where AI companies have largely set their own release timelines and safety standards, this is a meaningful shift in the balance of power.

Why the White House Stepped In

The administration isn’t taking a heavy-handed approach across the board. It intervened here because GPT-5.6 is not a routine update. The White House described the model as having ‘Mythos-like’ capability — a direct comparison to one of the most powerful and carefully restricted AI models ever built.

This intervention follows President Donald Trump’s executive order earlier this month directing federal agencies to establish a voluntary testing protocol for frontier AI models before their release. According to reports, debates within the administration over how restrictive the framework should be had delayed the order for weeks.

The GPT-5.6 request represents the first real test of that emerging framework. The executive order created the structure; the White House’s direct engagement with OpenAI shows it intends to use it.

Mythos-Like — The Model That Was Too Powerful to Release

Anthropic’s Mythos, unveiled in April 2026, was designed to identify cybersecurity vulnerabilities and potential threats before hackers could exploit them. It was a powerful defensive tool — and Anthropic decided it was too powerful to release openly.

The company chose to make Mythos available only to a limited group of technology companies under Project Glasswing, an initiative focused on securing the world’s most critical software. Anthropic did not want the model in the hands of bad actors, but it also did not want the responsibility of unrestricted access.

Describing GPT-5.6 as Mythos-like tells us two things. First, the model has capabilities in the same league — powerful enough that unrestricted access could pose real risks, particularly in cybersecurity, vulnerability discovery, and automated threat analysis. Second, the government is taking a page from Anthropic’s playbook, treating GPT-5.6 not as just another model update but as a potential systemic risk.

The comparison also raises an uncomfortable question: how many AI models currently in development have capabilities their creators consider too dangerous to release? Mythos and GPT-5.6 are the ones we know about. There are almost certainly others.

OpenAI’s Take — and What Comes Next

OpenAI has been working with the Trump administration on the GPT-5.6 rollout, but CEO Sam Altman made the company’s position clear in a message to employees. A restricted release is “not our preferred long-term model,” he told the team. OpenAI will work toward a more sustainable approach for future launches.

What does more sustainable mean? Likely a return to broader releases once the testing and safety frameworks are established and trusted. OpenAI’s business model depends on widespread adoption — restricting access indefinitely is not viable. But for GPT-5.6 specifically, the company is accepting the conditions.

For now, a small group of government-approved partners will receive early GPT-5.6 access for testing and evaluation. Once the voluntary testing framework is operational and the model passes required safety checks, a broader rollout is expected — though no timeline has been announced. The testing protocol itself is still being designed, and debates within the administration about its scope continue.

FAQ

Is GPT-5.6 canceled or delayed indefinitely?

No. GPT-5.6 is still being released, just in a controlled, limited way. Government-approved partners get early access first. Once testing is complete and the safety framework is in place, a broader release is expected. The question is how it enters the world, not whether.

What does Mythos-like mean in plain English?

Anthropic’s Mythos was a cybersecurity AI so powerful that Anthropic itself decided not to release it publicly. It could find unknown vulnerabilities in software systems that even trained security experts would miss. Saying GPT-5.6 is Mythos-like means it has comparable capabilities and warrants similar caution.

Does this mean the government will control all future AI releases?

Not all, but likely the most powerful ones. The voluntary testing framework being developed will apply to frontier AI models — the most capable systems being built. Smaller models, incremental updates, and narrowly scoped tools will probably not face the same level of scrutiny. The dividing line between frontier and everything else is one of the open questions the framework must answer.

Will other companies like Google and Anthropic face the same restrictions?

The executive order covers all frontier AI models, not just OpenAI’s. Anthropic already self-restricted Mythos. Google, Meta, and other companies building advanced models will likely need to engage with the same framework. What happened with OpenAI this week is the first test case, not the last.

How does this affect regular users of AI tools?

Short term, very little. GPT-4 and existing tools continue working as before. Medium term, if the testing framework becomes standard practice, frontier AI models may ship more slowly and with more safety documentation. For most users, the visible difference will be fewer surprise model launches and more advance notice about new capabilities.

This is the first chapter of a new story — one where governments arent just reacting to AI after it ships, but shaping how it enters the world. Whether you see that as cautious or constraining probably says more about your view of AI than anything else.

Bookmark this post and check back when the Trump administration publishes its testing framework. That document will define the rules for every frontier model that follows — and will tell us whether this was a one-off or the new normal.

Sunday, June 28, 2026

GPT-5.6 Sol: What Developers Need to Know

 

OpenAI announced GPT-5.6 Sol, a new reasoning model with two sibling tiers — Terra and Luna — but the naming change matters more than the benchmark numbers. The number (5.6) tracks the generation; Sol, Terra, and Luna are durable capability tiers that will evolve on their own cadence. That means developers now evaluate a model family rather than a single checkpoint, choosing across intelligence, speed, and cost.

The GPT-5.6 Family: Sol, Terra, and Luna

Sol is the flagship — OpenAI’s strongest model for agentic coding, biology, and cybersecurity workloads. Terra is positioned as a balanced model for everyday work, with competitive performance to GPT-5.5 at half the input price ($2.50 vs $5 per 1M tokens). Luna is the fast, affordable tier at $1 input / $6 output per 1M tokens, aimed at high-volume or latency-sensitive use cases.

The tier structure is new for OpenAI. Instead of releasing one model and obsoleting the last, the Sol/Terra/Luna names stay stable while the underlying checkpoints improve. If you build on Terra today and Terra v2 ships next quarter, your integration continues to work — the capability tier becomes the contract.

New Reasoning Modes: max and ultra

GPT-5.6 introduces two new reasoning modes. max reasoning effort gives Sol more time to reason deeply on complex tasks. ultra mode extends beyond single-agent reasoning by orchestrating subagents to parallelize work.

The results show up in benchmarks. Sol sets a new state of the art on Terminal-Bench 2.1, which tests command-line workflows requiring planning, iteration, and tool coordination. On GeneBench v1, Sol outperforms GPT-5.5 on long-horizon genomics analysis while using fewer tokens. On ExploitBench, Sol is competitive with Mythos Preview at roughly one-third the output tokens.

For developers building agentic workflows, ultra mode is the most interesting new capability. It shifts the model from a single reasoning pass to a multi-agent architecture — something most teams currently build themselves on top of the API. OpenAI is packaging it as a mode parameter.

Safety and the API: What Changes for Developers

GPT-5.6 Sol ships with what OpenAI calls its most robust safety stack to date. For developers, the observable effects are three: real-time content classifiers may pause generation for review, the model is trained to refuse prohibited cyber assistance even under jailbreak attempts, and account-level review can flag suspicious patterns across conversations.

During the preview, some legitimate requests — especially in dual-use areas like vulnerability research — may be blocked or delayed. OpenAI is explicit about this: the preview is as much about testing safeguard reliability for legitimate users as it is about constraining misuse.

OpenAI dedicated over 700,000 A100-equivalent GPU hours to automated red-teaming for this release, targeting universal jailbreaks that work across many contexts rather than narrow single-shot attacks.

Pricing, Prompt Caching, and the API

Per 1M tokens, Sol is $5 input / $30 output, Terra is $2.50 input / $15 output, and Luna is $1 input / $6 output. The preview starts with API and Codex access for select partners.

Prompt caching gets a meaningful update: explicit cache breakpoints and a 30-minute minimum cache life. Cache writes are billed at 1.25x the uncached input rate, while cache reads continue at the 90% discount. This makes long-context workflows — codebase analysis, multi-turn agent conversations, document processing — more predictable to budget.

FAQ

How does GPT-5.6 Sol compare to GPT-5.5 for coding?

Sol sets new state-of-the-art results on Terminal-Bench 2.1, which tests real command-line workflows with planning and tool use. Terra is competitive with GPT-5.5 at half the input price. For most production coding workflows, Terra likely offers the best cost-to-performance ratio.

When will GPT-5.6 models be available to all developers?

OpenAI is running a limited preview for select partners first. Broader API and ChatGPT availability is expected in the coming weeks. The U.S. government requested the phased rollout; OpenAI has stated this should not become the long-term default.

What is the ultra reasoning mode and how do subagents work?

Ultra mode goes beyond a single model pass by spawning subagents that work in parallel on different parts of a task. It is designed for complex, multi-step work where a single reasoning chain would be a bottleneck. It is controlled through a mode parameter in the API.

Does GPT-5.6 Sol cross the Cyber Critical threshold?

OpenAI’s Preparedness Framework assessment says it does not. Sol identified bugs and exploitation primitives in Chromium and Firefox evaluations but did not autonomously produce a functional full-chain exploit under tested conditions. The safety stack is designed to absorb the increased capability.

Will prompt caching work the same way as GPT-5.5?

No. GPT-5.6 introduces explicit cache breakpoints and a 30-minute minimum cache lifetime. Cache writes cost 1.25x the uncached input rate. Cache reads still get the 90% discount. This is a net improvement for teams doing long-context work.

Start Building with the Preview

Request access to the GPT-5.6 API preview, run your existing eval suite against Sol’s max reasoning mode and Terra for comparison, and model your per-query cost at each tier. The Terra tier at $2.50/$15 per 1M tokens likely covers most production needs today — save Sol for the hardest 10% of your traffic.

Friday, June 26, 2026

A New Quality Index for Native Speech-to-Speech Models

 

If you’re building a voice agent in 2026, you’ve probably noticed that picking a speech-to-speech model is still mostly vibes. The demos all sound great and the only way to know which one is best for your product is to wire up four APIs and run your own eval. On June 23, 2026, Artificial Analysis tried to make that easier: they published the Speech-to-Speech Index, a single composite score for native S2S model quality.

What the index actually measures

The index isn’t one benchmark — it’s an equally weighted average of three. Each tests a different capability a voice agent needs in production.

Big Bench Audio is the reasoning layer: 1,000 questions across formal fallacies, navigation, object counting, and web of lies. The model has to think before it speaks.

Full Duplex Bench is the conversation layer. It scores pauses, turn-taking, interruptions, and backchannels — the parts of a real phone call that demos never show.

τ-Voice is the agentic layer. It runs end-to-end customer-service scenarios across airline, retail, and telecom and checks whether the model completes the task.

A model has to have valid scores on all three to be ranked at all. That’s a meaningful bar: the index is opinionated about what good looks like for a real product.

The quality leaderboard

The composite scores for the four models in the index:

  1. OpenAI GPT-Realtime-2 (High) — 77.2%
  2. 2. xAI Grok Voice Think Fast 1.0–75.7%
  3. 3. OpenAI GPT-Realtime-1.5–72.0%
  4. 4. Google Gemini 3.1 Flash Live Preview (High) — 69.5%

The top four are within 8 points of each other, so there isn’t a runaway winner. But the shape of each model’s wins is different. GPT-Realtime-2 leads on Conversational Dynamics: it handles the awkward, real-time parts of conversation better than the rest. Grok Voice Think Fast 1.0 leads on Agentic Performance: it actually finishes customer-service tasks.

For builders, that split is the most useful signal in the index. Phone agents should weight Grok’s lead higher; long-form assistants should weight OpenAI’s lead higher.

Speed: time to first audio

For voice products, the metric that actually drives perceived quality is Time to First Audio (TTFA) — how long from the user finishing a sentence to the model starting its response:

Deepslate Opal — 0.44s

GPT-Realtime-1.5–0.82s

Grok Voice Think Fast 1.0–1.25s

GPT-Realtime-2 (High) — 2.33s

Gemini 3.1 Flash Live Preview (High) — 2.98s

The pattern: the higher the model scores on quality, the slower it tends to be. GPT-Realtime-2 is the best model in the test and 5x slower than Deepslate Opal. For a fast-back-and-forth IVR, 2.3 seconds of dead air feels broken. For a thoughtful assistant, that latency may be worth it.

Deepslate Opal isn’t in the quality ranking (it doesn’t yet have valid scores on all three sub-benchmarks), but its 0.44s TTFA opens up new product categories if the quality story holds up under your own testing.

Cost: what you’ll pay per interaction

The index also reports per-interaction cost:

Gemini 3.1 Flash Live Preview (Minimal) — $1.50

Gemini 3.1 Flash Live Preview (High) — $1.75

Grok Voice Think Fast 1.0 — $3.00

GPT-Realtime-2 (High) — $4.14

Deepslate Opal’s cost isn’t reported in the index.

The cheapest model is 2.7x cheaper than GPT-Realtime-2, and Gemini’s pricing is genuinely aggressive for a frontier-tier model. For any product where unit economics matter — a customer-service line, a high-volume voice assistant — the math on Gemini Flash Live is hard to beat. The trade-off is real, though: the cheapest model is also the lowest-scoring on the quality leaderboard.

What this means for builders

The index is a starting point, not a verdict. It tells you how four frontier S2S models compare on a fixed set of tasks. It doesn’t tell you how they’ll perform on your audio, your users, your domain, or your edge cases.

The useful pattern is this: use the index to pick a shortlist of one or two models, then run a small evaluation on your own data. A 50-call eval on your own audio with your own success metric is worth more than any external benchmark.

The split between GPT-Realtime-2’s Conversational Dynamics lead and Grok’s Agentic Performance lead is the most useful signal for product builders. Most voice products in 2026 are either more like a phone call (where dynamics matter) or more like a task (where agentic performance matters). Pick accordingly.

FAQ


What’s the difference between native speech-to-speech and a stitched pipeline?

A native S2S model takes audio in and produces audio out directly, without a separate speech-to-text → LLM → text-to-speech pipeline. Stitched pipelines are slower, lose prosody and emotion through the text step, and cost more to run. All four models in this index are native S2S.


What does Conversational Dynamics measure, and why does it matter?

Conversational Dynamics comes from the Full Duplex Bench subset — it scores pauses, interruptions, backchannels (“uh-huh”), and turn-taking. Models that score well here feel like they are actually listening. Models that score poorly feel like they are waiting for their turn to talk.


Why is the cheapest model not the quality leader?

Quality and cost are correlated in every model category, and voice is no exception. Gemini 3.1 Flash Live Preview is the cheapest ($1.50) and the lowest in the quality ranking (69.5% vs. 77.2% for GPT-Realtime-2). Every percentage point of quality costs more in compute.


Is OpenAI’s GPT-Realtime-2 worth the 2.7x cost over Gemini Flash Live?

It depends on what you’re building. For a 10-minute customer-service call where the user is venting and interrupting, the Conversational Dynamics lead is probably worth the cost. For a 30-second IVR tree, it’s overkill and the 2.33s TTFA is going to hurt.


Should I pick a model based on this index or test on my own data?

Test on your own data. The index is a shortlist tool, not a decision tool. Real production audio has noise, accents, domain jargon, and edge cases that benchmarks don’t capture. Run at least 50 calls against the top two candidates before you commit.


The index gives builders a useful starting point: four frontier models, one composite score, and clear data on what’s slow, cheap, and good at the parts of conversation that matter. Treat it as a shortlist, run a 50-call eval on your own audio, and let your users’ actual call quality be the final tiebreaker.