Wednesday, July 1, 2026

This Free App Replaced Otter, ChatGPT & NotebookLM


NoteFlow - AI Note Taker - Free download and install on Windows | Microsoft Store

NoteFlow is a free, offline AI note-taker for meetings, document chat, and WhatsApp — all without a subscription. Here’s why I switched.

I was paying for four AI subscriptions and still felt like I was missing something. Otter.ai for meeting notes. ChatGPT Plus for research. NotebookLM for document analysis. And a WhatsApp bot I hacked together with the OpenAI API that kept running out of credits. Total bill: roughly $80 a month, or nearly a thousand dollars a year. Then I found NoteFlow — a free, offline AI note-taker that does all four jobs on my own computer. No cloud. No subscription. No data leaving my machine. Here’s how it works and why I’m not going back.

NoteFlow: Say Goodbye to Cloud Transcription Bills

The most expensive habit I had was meeting transcription. Otter.ai costs $20 a month per seat. Microsoft Copilot for Teams is $30. Fireflies runs $18. And every single one of them sends your meeting audio to the cloud. NoteFlow does the opposite.

It captures both sides of a call — your microphone and computer audio — and transcribes live using AI running entirely on your Windows PC. Words appear on screen as you speak. Nothing hits the internet. After the meeting, AI turns the transcript plus any notes you typed alongside into a polished, structured summary with one tap.

The pricing difference is absurd. NoteFlow’s free tier handles local transcription with a 30-minute recording cap. The Pro plan unlocks unlimited recording, advanced AI models, and full Notebooks access for $9.99 per year. That’s less than what Otter.ai charges in a single month.

Your Free, Offline NotebookLM Alternative

NotebookLM is useful — being able to dump documents into a notebook and ask questions about them is genuinely powerful. But it’s cloud-based, Google-controlled, and the file types are limited. NoteFlow’s Notebooks feature does the same thing locally.

Create a notebook and add meeting transcripts, PDFs, text files, web pages, even audio and video. Then chat with your documents using AI that reads your files and answers questions with source citations — every response shows you exactly which document it came from. You can also generate Study Guides, FAQs, Briefing Docs, and Timelines from any notebook collection with one click. All on your computer. All private.

If you’ve been eyeing NotebookLM but wanted it offline and unlimited, this is the closest thing I’ve found — and it’s free.

An AI Assistant in Your Pocket (For Free)

The feature that surprised me most was the WhatsApp AI bot. You link your WhatsApp number by scanning a QR code in the NoteFlow settings. Approve which contacts can trigger the AI. Then anyone on your whitelist can message your local LLM through WhatsApp — and the AI replies from your own computer, not from a cloud API.

No per-message fee. No usage quota. No cloud relay. The AI runs on your machine and responds through WhatsApp Web. I use it for quick research questions, drafting replies, and bouncing off ideas without opening a browser tab.

What You Actually Save (Hint: It’s a Lot)

NoteFlow’s website has an interactive savings calculator that compares your usage against seven cloud tools. I ran my numbers: 5 meetings a week, 15 notebook chat turns per week, 2 AI artifacts a month, 5 WhatsApp queries a week. The calculator told me I could save up to $12,108 a year compared to Microsoft Copilot for Teams.

The reason NoteFlow can do this is simple: the AI runs on your computer. Cloud tools charge per call because every inference costs them server time. NoteFlow has no per-call infrastructure cost, so it passes that saving to you.

FAQ

How is NoteFlow free when other AI tools charge monthly?

Because the AI runs on your computer, not in a data center. NoteFlow doesn’t have per-call infrastructure costs to amortise, so it can offer unlimited AI at a flat price. The free tier covers local transcription with a 30-minute cap. Pro is $9.99 per year — less than the monthly cost of any cloud competitor.

Can NoteFlow really replace NotebookLM and ChatGPT?

For the use cases most people actually need — meeting transcription, document Q&A, AI-generated summaries, and chat — yes. NoteFlow’s Notebooks feature matches NotebookLM’s core functionality while adding support for audio and video files. The local LLM handles questions similarly to ChatGPT, with the tradeoff that it’s a smaller model running on your hardware. For daily productivity tasks, the difference is negligible.

Does NoteFlow work without internet?

Completely. The app is designed to work offline — on a plane, in a secure facility, or behind an air-gapped network. Every feature, including transcription, document chat, and AI enhancement, runs locally. The only exception is the WhatsApp bot, which needs an internet connection to relay messages.

How does the WhatsApp bot differ from ChatGPT’s mobile app?

ChatGPT’s mobile app sends your messages to OpenAI’s servers. The NoteFlow WhatsApp bot routes messages through WhatsApp to your local LLM on your computer. No data touches a third-party AI API. You also control exactly which contacts can use it — anyone not on your whitelist is silently ignored.

Is my data really private?

NoteFlow uses on-device AI exclusively. Raw audio is deleted after processing by default. Notes are stored in a local encrypted database. There’s no account, no telemetry, no remote logging. The app is verified to work fully offline.

I went from managing four cloud subscriptions and worrying about meeting recording limits to a single free app that handles everything locally. The switch took me about 10 minutes: download from the Microsoft Store, install, open — no account creation, no credit card. If you’re paying for even one cloud AI tool, run your numbers through the savings calculator first. I think you’ll be surprised at what you find.

I Found an AI Model That Costs 1/3 on OpenCode GO

 

Photo by Mohammad Rahmani on Unsplash

I opened my OpenCode GO dashboard last Thursday and stared at the Minimax M3 row. The “3x” badge in the corner didn’t look special, but the math was: three times the output for the same dollar. I’ve been running it for a week alongside Claude, Gemini, DeepSeek, and Qwen — here’s what the deal actually looks like in practice.

What the 3x Usage Deal Actually Is

Minimax M3 is available on OpenCode GO with a 3x usage multiplier. For every dollar you spend against your credit pool, you get three dollars’ worth of M3 API calls. It’s a limited-time promotion, but while it’s active, it changes the calculus on which model makes sense for day-to-day coding.

The OpenCode GO plan costs $10 per month and gives you roughly $60 in API credits across its supported models. With the 3x boost on M3, that $60 effectively becomes $180 worth of M3 usage against the per-hour rate limit. For a developer running multiple agentic loops or frequent sub-agent calls, that’s not a small difference — it’s the difference between carefully rationing your calls and not thinking about cost at all.

The catch: there’s a usage cap per five-hour window. If you’re running constant agent sessions, you’ll hit that ceiling. For those moments, going direct to the API with permanent discounts like DeepSeek’s 75% offer may work better. But for the majority of development work — the daily flow of writing, debugging, testing, and reviewing — the GO plan plus M3 is hard to beat on pure value.

How OpenCode GO Pricing Shapes Your Choice

OpenCode GO isn’t a raw API subscription. You pay $10 and get a pool of credits that apply across models at different burn rates. Some models eat credits fast; others are more economical. The 3x boost on M3 makes it one of the most credit-efficient models on the platform.

This matters more than you’d think. When every sub-agent call, every orchestrator loop, and every tool-use request draws from the same pool, the credit multiplier on M3 means you can run more experiments in the same budget. I found myself trying approaches I would have skipped on other models — not because M3 is always better, but because the effective cost per attempt was low enough that the question wasn’t “is this worth the API call?” but “does this approach make sense?”

How M3 Compares to Qwen, DeepSeek, and Gemini

I ran Minimax M3 against Qwen 3.7 Max, DeepSeek V4 Pro, Gemini 3.5 Flash, and Claude Sonnet on a set of coding tasks over the past week. Here’s what stood out.

Better instruction-following than Qwen. Qwen 3.7 Max is smart but unpredictable. It often ignores parts of the spec, writes overly aggressive code, or adds features nobody asked for. M3 is more disciplined — it follows the prompt more closely and even asks clarifying questions before diving in. That alone saves a round-trip.

More consistent than DeepSeek V4 Pro. DeepSeek V4 Pro can match Claude Sonnet on a good day, but it hallucinates. It’ll “misunderstand” a detailed plan and produce something that looks right architecturally but doesn’t fit the spec. M3 is more conservative — it stays closer to what you asked for, which matters more for production code than raw creativity.

Comparable to Gemini 3.5 Flash in coding, better in reasoning. Several developers in the OpenCode community agree: M3 is on par with Gemini 3.5 Flash for code generation, but it handles multi-step agentic tasks more reliably. Gemini Flash tends to lose context in longer chains; M3 holds the thread better.

Still below Claude for complex tasks. For architecture decisions, multi-file refactors, and nuanced business logic, Claude Sonnet 4 or 5 remains ahead. But M3 closes the gap more than its price tag suggests. The gap is narrower than the cost difference would imply.

Why I Use M3 as My Daily Driver

I use Minimax M3 as my all-purpose model on OpenCode. For orchestrator tasks, sub-agent routing, and day-to-day coding, it handles everything competently. The fact that it costs a third of what I’d pay for other models of similar quality means I can run more experiments, iterate faster, and keep my monthly costs predictable.

The feature that surprised me most: M3 asks questions before it acts. When the spec is ambiguous, it pauses and asks for clarification rather than guessing wrong and producing broken output. That’s rare in this price bracket and makes it significantly safer for agentic workflows where a wrong turn costs minutes, not just tokens.

FAQ

Is Minimax M3 as good as Claude for coding?

For complex architecture and multi-file refactors, no — Claude Sonnet 4 or 5 is still clearly ahead. But for day-to-day coding, sub-agent tasks, and straightforward feature work, M3 is surprisingly close at a fraction of the cost.

How long will the 3x usage promotion last?

It’s a limited-time event, and OpenCode hasn’t announced an exact end date. Promotions like this typically run for weeks to months. Check the OpenCode GO pricing page for the current status.

Should I use OpenCode GO or the official Minimax API?

If you’re a casual to moderate user, OpenCode GO at $10 per month with roughly $60 in credits is the better value. If you’re a power user hitting the five-hour rate limits regularly, the direct API route may give you more flexibility. With the 3x boost, M3 on GO is especially attractive for the middle tier of usage.

What makes Minimax M3 different from Qwen and DeepSeek?

M3 is more careful. It follows instructions more closely, asks clarifying questions, and produces more predictable output. Qwen 3.7 Max is more powerful but erratic — it can produce brilliant results or go off the rails. DeepSeek V4 Pro is inconsistent — impressive one moment, hallucinating the next. M3 trades some peak performance for reliability, which is a worthwhile swap for production work.

Try It for a Week

The 3x usage deal on Minimax M3 is one of the best value propositions in AI coding right now. If you’re already on OpenCode GO, switch M3 on for a week and watch your effective cost per task drop. If you’re not on the plan yet, grab a $5 discount at the OpenCode GO page and see for yourself.

If Minimax M3 isn’t the right fit for your use case, you’ve lost nothing — the GO plan works across dozens of models. But if it does click, you’ve just cut your effective API cost by two-thirds. That’s a bet worth taking.

Grok 4.5 Is Coming for Opus - Every Single Month

 

Photo by Salvador Rios on Unsplash

If you build on LLMs, you’ve gotten used to the rhythm: a new frontier model every 3 to 6 months, a splashy benchmark paper, then radio silence until the next one. On June 28, Elon Musk broke that rhythm. Grok 4.5 — the latest model from xAI — is now in private beta at SpaceX and Tesla, and early evaluations place it “close to, perhaps exceeding” Claude Opus 4.6. That’s not the full story though. The real story is what comes after the benchmark claim.

What Makes Grok 4.5 Different

Grok 4.5 is built on the V9 foundation model — 1.5 trillion parameters, three times the size of the V8 model that currently serves all Grok production traffic. Training completed on May 26, 2026. That’s roughly 4 weeks from training completion to private deployment — already faster than the industry turnaround time.

What’s more unusual is the supplemental training data: a large amount of Cursor developer workflow data. SpaceX acquired Cursor for $60 billion earlier this month, and the developer interaction data is already being folded into the training pipeline. For AI/ML engineers, that’s the detail worth watching — a model trained on real development workflows has a fundamentally different signal set than one trained on internet text and synthetic data alone.

The Real Bombshell — Monthly Foundation Models

Here’s the part that shifts the conversation. Musk announced that SpaceX will release a completely new foundation model, trained from scratch, every month for the rest of 2026.

OpenAI, Anthropic, and Google currently release major frontier models every 3 to 6 months. A monthly cadence isn’t just faster — it’s a different category of operation. It means the training infrastructure, data pipeline, and evaluation stack are all running at a tempo that no other lab has publicly demonstrated. It means the team is structured to iterate, not perfect.

Is every monthly release going to be a leap forward? Almost certainly not. But a monthly cadence means the gap between learning what works and deploying what works next shrinks from quarters to weeks. Over the rest of 2026, that’s 6 more foundation models. Even if only 2 of them are breakthroughs, that’s 2 more than any competitor is promising over the same period.

What the Opus Claim Actually Means

Elon positioning Grok 4.5 against Claude Opus 4.6 is the first public benchmark claim xAI has made against a frontier rival. The phrasing — “close to, perhaps exceeding” — is carefully hedged but still significant. Anthropic’s Claude Opus 4.6 and OpenAI’s GPT-5.5 are the current bar for general-purpose reasoning. Matching either of them would put Grok in the frontier conversation for the first time.

For AI/ML engineers evaluating the claim: private beta results at two companies are not the same as public benchmarks. But the V9 model’s 1.5 trillion parameter count, combined with reinforcement learning that continues post-training, means there’s real headroom. The V8 model that powers current Grok production traffic is already competitive on several coding and reasoning benchmarks. A 3x parameter increase with Cursor-enhanced training data is a credible path to the frontier.

Why Cursor Changes the Equation

The $60 billion Cursor acquisition is usually framed as a talent or product grab. But the training detail is more specific: Grok 4.5 used “a large amount of Cursor developer workflow data” in supplemental training.

This is the most interesting part of the announcement for AI/ML engineers. Cursor captures real developer behavior — how engineers navigate codebases, what they autocomplete, what they reject, what they rewrite. Training on that signal set produces a model that understands developer intent, not just developer output. It’s the difference between a model that can write code and a model that knows how developers actually work.

If this pattern holds for future Grok releases, xAI has a data moat that’s hard to replicate. No other frontier lab has access to real-time developer workflow data at this scale.

FAQ

How does Grok 4.5 compare to Claude Opus 4.6 on benchmarks?

Private evaluations at SpaceX and Tesla show Grok 4.5 performing “close to, perhaps exceeding” Opus 4.6 in internal testing. Public benchmark numbers have not been released yet, so independent verification is not available yet. The comparison is significant because it is the first time xAI has publicly claimed frontier-level performance.

Can xAI really ship a new foundation model every month?

The V9 model training completed on May 26 and Grok 4.5 entered private beta roughly 4 weeks later. That timeline — from training completion to deployment — is already faster than the industry norm. A monthly cadence from scratch means the infrastructure, data pipeline, and evaluation stack are all designed for this tempo. Whether quality holds at that speed is the open question.

When will Grok 4.5 be available to the public?

No public release date has been announced. The current private beta is limited to internal teams at SpaceX and Tesla. Based on xAI previous release patterns, a public beta or API release would follow after the internal testing phase, likely within the next few months.

What does the Cursor acquisition mean for Grok capabilities?

Cursor developer workflow data gives Grok 4.5 training signal that captures real engineering behavior — how developers navigate code, what they accept or reject from AI suggestions. This is a fundamentally different data type than public internet text or synthetic data. If xAI continues this approach, future Grok releases could have a meaningful advantage in code generation and developer tooling.


Grok 4.5 is the first sign that xAI is not trying to catch up with the frontier labs — it is redefining what the frontier means. A monthly foundation model cadence, a training pipeline fed by real developer data, and the first public claim against Opus. For AI/ML engineers, the next 6 months just got a lot more interesting. Watch xAI release channel and benchmark each new foundation model drop against your own systems. The next one arrives in roughly 30 days.

Tuesday, June 30, 2026

DeepSeek Just Shattered the Speed-Accuracy Tradeoff

 

Photo by Solen Feyissa on Unsplash

Every LLM user has felt it — that pause between hitting Enter and seeing the first word appear. In production systems, that pause compounds across hundreds of concurrent requests, turning inference latency into a hard scaling ceiling. DeepSeek’s DSpark framework just made that pause up to 85% shorter without sacrificing a single token of quality. And the code ships under MIT — no waitlist, no managed API, no gatekeeping.

The Token-by-Token Bottleneck

Large language models generate text one token at a time. Each token — roughly a word or sub-word unit — requires a full forward pass through every layer of the model. A 70B-parameter model needs to compute activations across all 70 billion parameters for every single token produced.

This sequential dependency is baked into the architecture. Generating a 500-word response requires 500 serial passes through the full model. There is no parallelism at the token level in standard autoregressive decoding — token N+1 literally cannot start until token N finishes.

The practical consequence is a hard latency floor. You cannot make inference meaningfully faster without upgrading to more expensive hardware or shrinking the model — which usually means trading quality for speed. Engineers have been making that tradeoff for years, and it has never felt good.

Why Speculative Decoding Is Incomplete

Speculative decoding was the first serious attempt to break this ceiling. The idea is elegant: a small, fast draft model guesses several tokens ahead in a single forward pass. The large model then verifies all the guesses in one parallel pass. If the guesses are correct, you saved multiple sequential passes. The larger the accepted block, the bigger the speedup.

The catch is that existing parallel drafters guess independently. Each drafted token has no information about the token the drafter predicted immediately before it. Accuracy collapses toward the end of each block because the draft diverges further from what the large model would actually generate. The speculator rejects more tokens, the accepted block shrinks, and the speedup evaporates.

Various approaches have tried to fix this — Medusa, Eagle, and Self-Speculative decoding all improve on vanilla speculation. But they all share the same fundamental weakness: parallel drafters cannot correct course mid-block.

How DSpark Breaks the Speed-Quality Tradeoff

DSpark attacks the problem from two angles simultaneously.

First, a fast parallel backbone drafts all candidate tokens in a single forward pass — preserving the throughput advantage of parallel speculation. Then a tiny sequential head — adding roughly 1% latency overhead — reads the previous token before predicting the next one. This small dependency chain stabilizes the draft predictions dramatically without negating the parallelism benefit. You get the throughput of parallel drafting with the accuracy of sequential correction.

Second, a confidence head scores each drafted token’s likelihood of being accepted by the target model. A live GPU scheduler then decides, in real time, which tokens are worth verifying based on current hardware load. When the GPU is saturated, it accepts more high-confidence tokens to maintain throughput. When it is idle, it can afford to verify borderline candidates. The scheduler adapts to traffic patterns without human tuning.

Together, these two innovations close the gap that prior speculative methods left open.

What the Benchmarks Actually Say

The results come from production traffic, not synthetic benchmarks — which matters because real inference workloads mix short queries, long generations, and idle periods in unpredictable ways.

  • 60–85% faster generation latency per user
  • - 30% higher acceptance rate over the best prior speculative decoders
  • - No measurable degradation in output quality or coherence

The most striking finding is that DSpark enables latency tiers that were previously impossible under strict response-time guarantees. Teams that had to choose between quality and responsiveness no longer have to.

FAQ

Can I use DSpark with any LLM, or only DeepSeek models?

DSpark is a general speculative decoding framework, not a model-specific optimization. The paper and the training repo describe it as model-agnostic, meaning it can be applied to any autoregressive LLM. The MIT license does not restrict which models you pair it with.

Does DSpark require additional GPU memory?

The framework adds a small sequential head and a confidence scoring layer — the paper reports roughly 1% overhead in latency, and the memory footprint scales with the draft model, not the target model. For most deployments, the memory cost is negligible compared to the speedup.

Is DSpark production-ready or still a research project?

The training code is released under MIT, and the paper presents production traffic results — not just offline benchmarks. The 60–85% speedup figure comes from real user workloads. Deploying it requires familiarity with custom inference pipelines, but the framework is designed for integration, not just experimentation.

What GPU hardware do I need to run DSpark?

DSpark targets standard inference GPUs — the same hardware you are already using for LLM serving. The draft model runs on the same GPU as the target model; the overhead from the sequential head and confidence scorer is negligible. No specialized hardware is required.

How does DSpark compare to Medusa, Eagle, and other speculative decoders?

Prior speculative decoders all share the same weakness: parallel drafters lose accuracy as block length grows. DSpark’s sequential head breaks that pattern. The 30% higher acceptance rate over prior methods quantifies the improvement in concrete terms.

Go Play With It

The repo is on GitHub under MIT. No license fees, no signup wall, no managed API to onboard. Clone it, wire it into your inference pipeline, and measure the speedup on your own traffic. If you are running LLMs in production, this is likely the easiest 60% latency improvement you will see this year.

Fable 5 Was Banned. The Truth Is Wild.

 

Anthropic’s Fable 5 export control crisis, safety classifier war, and a new jailbreak framework that changes everything.

The US government shut down Anthropic’s most advanced model on June 12 — not for what it had done, but for what it might be capable of. For 18 days, Fable 5 vanished for every user worldwide, and Anthropic stayed silent. On June 30, the export controls lifted, and the company published a detailed post explaining everything. The story it told was more nuanced — and more important — than the headlines suggested.

What Actually Happened to Fable 5

On June 9, Anthropic launched Fable 5 and Mythos 5 — two versions of the same underlying model with dramatically different safety profiles. Fable 5 went out broadly with strong safeguards. Mythos 5, with weaker guardrails, went only to a small set of trusted Project Glasswing partners for defensive cybersecurity work.

Three days later, on June 12, the US government applied export controls to both models. The order restricted access to foreign nationals both inside and outside the United States. Since Anthropic had no way to verify nationality in real time, they suspended access for all users. Every developer, every enterprise customer, every Claude user who relied on Fable 5 suddenly lost access with zero warning.

The Amazon Report That Triggered It All

The export control directive came after the government learned about a discovery by Amazon researchers. They had found a method to bypass Fable 5’s safeguards: prompting the model to identify software vulnerabilities. In one case, Fable 5 produced code demonstrating how a vulnerability could be exploited.

Here’s where it gets interesting. When Anthropic tested the same technique across other models, they found that many less capable models — including Claude Opus 4.8, GPT-5.5, and Kimi K2.7 — could identify the same vulnerabilities. Every single model they tested could produce the same exploit demonstration: Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, 4.7, 4.8, GPT-5.4, GPT-5.5, and Kimi K2.7.

The reported technique did not expose any unique Mythos-level cyber capabilities. It was what Anthropic calls a “safety margin” case: a behavior unlikely to be dangerous but blocked anyway out of abundance of caution.

How Fable 5’s Safety Margin Works

Anthropic launched Fable 5 with the strongest safeguards it has ever applied. In the month before launch, they doubled the team working on this problem. The result is a defense-in-depth system where multiple mechanisms work together.

The core mechanism is classifiers — smaller AI systems that monitor each interaction and detect when the model is asked to perform a potentially harmful cybersecurity task. When triggered, they block the model from responding.

Anthropic deliberately calibrated these classifiers to err on the side of caution. The “safety margin” is a zone where requests that are probably benign but could theoretically be harmful are also blocked. For Fable 5, they made this safety margin much larger than in any prior launch. The tradeoff was explicit: more frustrating false positives for users meant fewer genuinely harmful requests would slip through.

Why the Jailbreak Wasn’t a Breakthrough

Against this backdrop, the Amazon researchers’ technique makes more sense. It wasn’t exposing a hidden offensive capability unique to Fable 5. It was poking into the safety margin — finding a behavior that was blocked as a precaution rather than because it was uniquely dangerous.

Anthropic moved quickly anyway. They trained an improved safety classifier that blocks the specific technique in over 99% of cases. If a request hits this new classifier, the user gets notified and the request is routed to Opus 4.8 instead.

The new classifier comes with a real cost: more benign requests during routine coding and debugging will now be flagged. Anthropic says they’ll keep refining the balance.

A New Industry Framework for Jailbreaks

The most significant outcome of this episode might be what Anthropic proposed next. They’re partnering with Amazon, Microsoft, Google, and other Glasswing partners to draft a consensus framework for assessing the severity of AI jailbreaks.

Right now, there’s no industry standard. When a jailbreak is discovered, developers and governments have no agreed-upon method for assessing its severity. Was it a minor edge case or a critical vulnerability? Nobody can say with confidence.

Anthropic’s proposed framework scores jailbreaks on four criteria:

  • Capability gain: How far beyond existing tools does the jailbreak take the user? If weaker models can do the same thing, the score is low.
  • - Breadth: How many distinct offensive tasks does the same technique unlock?
  • - Ease of weaponization: How much human effort is needed to turn the jailbreak into an actual attack?
  • - Discoverability: How easy is it for someone to obtain the technique?

Anthropic also launched a new HackerOne program where security researchers can submit potential cyber jailbreaks for review.

What Comes Next

Anthropic announced four commitments for deeper government collaboration: pre-release government access and evaluation for models on the capability frontier, rapid information sharing on safeguards, dedicated resources for joint research, and a push for a common industry security standard.

Fable 5 is available again starting July 1. Pro, Max, Team, and select Enterprise users get it included for up to 50% of weekly usage through July 7, after which it shifts to usage credits. AWS, Google Cloud, and Microsoft Foundry access is being restored as quickly as possible.

FAQ

Why was Fable 5 banned in the first place?

The US government applied export controls on June 12 after Amazon researchers reported a method to bypass Fable 5’s safeguards, showing it could identify software vulnerabilities and produce exploit code. The concern was that foreign nationals could use the model for offensive cyber purposes. Once controls were lifted on June 30, Anthropic restored access globally.

What did the Amazon researchers actually find?

They discovered a prompt technique that got Fable 5 to identify software vulnerabilities and, in one case, demonstrate how one could be exploited. However, Anthropic’s testing showed that nearly every other major model — including much weaker ones — could produce the same results. The technique didn’t expose any capabilities unique to Fable 5.

What’s the difference between Fable 5 and Mythos 5?

They share the same underlying model architecture. Fable 5 launched with strong safeguards for general use. Mythos 5 has fewer safeguards and was released only to a small number of trusted Glasswing partners for defensive cybersecurity work. Mythos 5 can find and exploit vulnerabilities better than any other model and all but the most skilled human security experts.

Could export controls happen to other AI models?

Yes. The June 2 Executive Order on Promoting Advanced AI Innovation and Security established the framework for this kind of intervention. As AI capabilities in cybersecurity and other sensitive domains advance, governments will increasingly scrutinize powerful models before and after release. A standardized jailbreak assessment framework could help prevent the kind of sudden global shutdown that Fable 5 experienced.

Can I use Fable 5 now?

Starting July 1, Fable 5 is available globally on the Claude Platform, Claude.ai, Claude Code, and Claude Cowork. Pro, Max, Team, and select Enterprise users get it included for up to 50% of weekly usage through July 7, after which usage credits apply.

The Fable 5 saga is a dress rehearsal for decisions governments and AI companies will face repeatedly from here. Anthropic turned an 18-day crisis into a proposal for something the industry badly needs: a shared standard for scoring AI jailbreaks. Open Anthropic’s post, read the jailbreak criteria section, and decide for yourself whether this framework sets the right bar. If you work in AI, share this with your team — the conversation about how we assess risk in frontier models is just getting started.

Claude Sonnet 5 Just Made Opus Look Expensive

 

If you’re a developer who watched Opus 4.8 pull ahead on agentic tasks over the last few months, Sonnet 5 is the model that brings those capabilities back to a price that makes sense for daily API calls. Announced yesterday by Anthropic, Claude Sonnet 5 closes the gap with Opus 4.8 across the dimensions that matter most for shipping real software: reasoning, tool use, coding, and autonomous multi-step execution — at roughly half the cost.

What Makes Sonnet 5 Different

Sonnet 5 is the most agentic Sonnet model Anthropic has released. It can make plans, use browsers and terminals, and run autonomously at a level that previously required Opus-class models. On BrowseComp, a benchmark for agentic search, Sonnet 5 at high effort levels matches Opus 4.8 performance while staying on a much lower cost curve. On OSWorld-Verified, a computer-use evaluation, the story is similar.

The improvement over Sonnet 4.6 is substantial across the board. Anthropic published benchmarks show gains in reasoning, tool use, coding, and knowledge work — the four pillars of agentic development. For developers, this means the model you reach for by default in Free and Pro plans can now handle tasks that used to be Opus-only.

Pricing That Actually Moves the Needle

The pricing story is where Sonnet 5 gets interesting for anyone paying API bills. Through August 31, 2026, introductory pricing sits at $2 per million input tokens and $10 per million output tokens. After that, standard pricing kicks in at $3 and $15 respectively. Compare that to Opus 4.8 at $5 and $25, and the math becomes straightforward.

There is one tradeoff. Sonnet 5 uses an updated tokenizer that maps the same input to roughly 1.0 to 1.35 times more tokens depending on content type, similar to the change Anthropic introduced with Opus 4.7. Anthropic set the introductory pricing to be roughly cost-neutral during the transition, but teams running high-volume pipelines should benchmark their actual token counts before assuming the per-token savings translate directly.

The effort parameter is worth understanding. Sonnet 5 can operate at different effort levels — low, medium, high, and extra high — letting you pay for capability only when you need it. A quick lint check runs at low effort; a full codebase refactor runs at high. The cost-performance curves published by Anthropic show that medium effort on Sonnet 5 already beats Sonnet 4.6 at any effort level, and high effort comes close to Opus 4.8 on many tasks.

What Early Access Partners Found

The early access feedback tells a consistent story: Sonnet 5 finishes tasks where prior Sonnet models stopped short. Zimu Li at Canva described it handling multi-step software engineering work across messy technical contexts. The Rust engineer Neel Chotai reported that Sonnet 5, unprompted, wrote a reproducing test, implemented the fix, then stashed it to confirm the bug came back without the change — all in a single pass.

Dominic Elm, a founding engineer, noted that Sonnet 5 shines on brownfield code — race conditions, hidden tests, the parts nobody wants to touch. For Lovable co-founder Fabian Hedin, the model ability to refuse unsafe requests was just as important as its ability to build. At ClickHouse, Ryadh Dahimene reported that Sonnet 5 reasons in tighter steps and gets users to answers faster.

Safety That Scales with Agency

Giving a model more autonomy raises the stakes on safety. Anthropic pre-deployment evaluations found that Sonnet 5 is better than Sonnet 4.6 at refusing malicious requests, resisting prompt injection hijack attempts, and avoiding hallucination and sycophancy. Its overall rate of misaligned behavior on automated auditing is lower than its predecessor.

On cybersecurity, Anthropic chose restraint. Sonnet 5 was never able to develop a working exploit for Firefox vulnerabilities in testing — something Opus 4.8 and Mythos 5 both managed. Cyber safeguards are enabled by default, and the model is part of the Cyber Verification Program for organizations that need reduced guardrails.

How to Try Sonnet 5 Today

Sonnet 5 is the default model for Free and Pro plans starting today. It is also available to Max, Team, and Enterprise users. For developers, it is accessible through Claude Code and the Claude Platform API using the model ID claude-sonnet-5. Anthropic has raised rate limits across Chat, Cowork, Claude Code, and the Platform to accommodate the higher token usage of higher effort levels.

FAQ

How does Claude Sonnet 5 compare to Opus 4.8 for coding tasks?

Sonnet 5 at high and extra-high effort levels comes close to Opus 4.8 on coding benchmarks and real-world pull requests. Early access partners reported it carrying complex multi-file changes through to tested, verified results. For most daily development work, Sonnet 5 is sufficient — reach for Opus 4.8 only for the hardest problems where you need maximum capability regardless of cost.

What is the effort parameter and how does it affect cost?

The effort parameter lets you choose how much inference compute the model spends on a task. Low effort is cheapest and fastest, suitable for simple lookups or linting. High and extra-high effort spend more tokens to produce better results on complex tasks like code review or architectural analysis. You pay per token, so higher effort costs more — but only when you need it.

Is Sonnet 5 available in Claude Code and the API?

Yes. Sonnet 5 is available in Claude Code and through the Claude Platform API. The model ID is claude-sonnet-5. Rate limits have been increased across all tiers to support the higher token usage from effort-level selection.

Does the new tokenizer make Sonnet 5 more expensive in practice?

The updated tokenizer maps the same input to 1.0 to 1.35 times more tokens depending on content type. Through August 31, introductory pricing is set to be roughly cost-neutral versus Sonnet 4.6. After that, the standard $3/$15 pricing still represents a meaningful improvement in capability per dollar, especially at medium effort levels.

How safe is Sonnet 5 for autonomous agent workflows?

Sonnet 5 performs better than Sonnet 4.6 on every safety metric Anthropic measures: lower hallucination, lower sycophancy, better refusal of malicious requests, and better resistance to prompt injection. Cyber safeguards are enabled by default. If you are building agents that operate on user behalf, Sonnet 5 is the safest Sonnet yet for that use case.

Try Sonnet 5 Today and Decide for Yourself

If you are on a Free or Pro plan, Sonnet 5 is already the default — open Claude and see the difference. If you are building on the API, spin up claude-sonnet-5 at the introductory pricing and run it against your test suite. The model costs less than Opus, handles more than any Sonnet before it, and the only way to know if it works for your specific workload is to try it. Set aside 30 minutes this week to run your most painful agentic task through Sonnet 5 and compare the result.

China Just Dropped 20M Free AI Tokens (And Nobody Noticed)

 

Photo by Mohammad Rahmani on Unsplash

If you’re a developer who pays $20/month for Cursor Pro, $10 for GitHub Copilot, or burns through API credits like candy — stop. A 756-billion-parameter coding model just landed with a 20-million-token free tier. No credit card. No subscription. And somehow, almost nobody in the Western developer community is talking about it.

What Zhipu AI’s GLM-5.2 Actually Gives You

GLM-5.2 is a massive 756B-parameter model from Zhipu AI, a Beijing-based AI lab often described as China’s closest equivalent to OpenAI. The headline offer is 20 million free API tokens for new developers — not a trial, not a “first month free” gimmick. Create an account and you get the full quota immediately, no billing info required.

Beyond the token grant, you also get 120 free image and video credits, access to GLM-5.2’s “High” and “Max Thinking” reasoning modes, and a 1-million-token context window. That’s large enough to feed an entire codebase into a single prompt and still have room for instructions.

The API is OpenAI-compatible. You can point Cursor, Claude Code, Cline, or any OpenAI SDK at it by swapping the base URL and model name. No custom integration, no new tools to learn. If your editor already speaks OpenAI, it already speaks GLM-5.2.

How It Stacks Up Against What You’re Already Paying For

Do the math on your current AI coding stack. Cursor Pro costs $20/month. GitHub Copilot is $10/month. Claude API charges per token. Add them up and you are looking at $30+ per month for tools that help you write code faster.

GLM-5.2 replaces all of them at zero cost for the first 20 million tokens. For a solo developer or small team experimenting with AI-assisted coding, the savings add up fast. Twenty million tokens goes a long way — hundreds of code completions, dozens of full-file refactors, and plenty of room for trial and error.

GLM-5.2 also supports a “Max Thinking” mode that applies chain-of-thought reasoning to complex coding tasks. In practice, this means better results on multi-step refactors, debugging sessions, and architectural decisions — exactly the places where smaller models fall apart.

Why the Silence?

If the offer is real and the model is competitive, why isn’t everyone talking about it? Three factors explain the gap.

Geographic attention bias. Chinese AI labs rarely receive the same Western media coverage as OpenAI, Anthropic, or Google DeepMind. A breakthrough from Beijing doesn’t trend on Hacker News the same way one from San Francisco does. This isn’t new — it’s been true since the earliest days of China’s AI industry.

Trust and data privacy. GLM-5.2 routes through Chinese infrastructure. For many Western developers and enterprises, that’s a dealbreaker. Data residency requirements, compliance policies, and geopolitical caution create a barrier that no amount of free tokens can overcome.

The U.S.-China AI perception gap. Some developers avoid Chinese models on principle; others assume they can’t be competitive. The assumption is increasingly outdated — several Chinese models now rank in the top tier of coding benchmarks — but the perception lingers. GLM-5.2’s 756B parameter count and benchmark scores are competitive with frontier Western models, but mindshare in the developer community hasn’t caught up.

How to Try It in Two Minutes

Here’s the fastest path to get coding with GLM-5.2:

  1. Register at open.bigmodel.cn
  2. 2. Verify your account with your phone number (OTP arrives in a few minutes)
  3. 3. Create an API key from the dashboard
  4. 4. Set your base URL to the GLM-5.2 endpoint
  5. 5. Select model: glm-5.2

For Cursor users: open Settings, go to Models, add a new model provider, and paste your GLM-5.2 API key and base URL. For OpenAI SDK users: set OPENAI_BASE_URL and OPENAI_API_KEY as environment variables. For Cline and similar tools: the model provider setup screen accepts any OpenAI-compatible endpoint — add GLM-5.2 as a custom provider and you’re done.

The Real Catch — Three Caveats Worth Knowing

A free offer at this scale comes with tradeoffs worth understanding up front.

Data residency. Zhipu AI’s API servers are in China. If your codebase contains sensitive or proprietary code that you can’t route through Chinese infrastructure, this isn’t for you. No free tier is worth a compliance violation.

Phone verification. Registration requires a phone number and OTP. Some non-Chinese users report delays receiving the verification code. If you’re outside China, budget a few extra minutes for this step.

Long-term uncertainty. Zhipu AI hasn’t published clear post-quota pricing. The 20M free tokens are framed as a developer acquisition play rather than a limited promotion, but any free API offering can change. If you build a workflow around it, keep a paid fallback ready.

FAQ

Can I use GLM-5.2 with Cursor or VS Code?

Yes. The API is fully OpenAI-compatible. Add it as a custom model provider in Cursor, Claude Code, Cline, Continue.dev, or any tool that supports OpenAI’s API format. Just swap the base URL and model name — no custom integration needed.

How does GLM-5.2 compare to GPT-4o or Claude 4 Sonnet?

GLM-5.2’s 756B parameters put it in the same weight class as the largest frontier models. On coding benchmarks, it scores competitively. The practical differentiators are the 1M-token context window and the zero-cost entry point. Most developers report it handles complex refactoring and debugging well in Max Thinking mode.

What data does Zhipu AI collect from API calls?

Zhipu AI’s data handling policies are less transparent than Western providers. Review the terms of service carefully before sending proprietary code. For open-source or personal projects this is less of a concern, but enterprise teams should involve legal before routing sensitive code through the API.

Is phone verification required for all users?

Yes, registration requires a phone number. Some users outside China report OTP delivery delays. If you don’t receive the code within a few minutes, try again after an hour — the system sometimes throttles international SMS.

Will the 20M free tokens refresh or is it a one-time grant?

The 20M tokens are a one-time welcome grant for new developers, not a recurring monthly quota. Zhipu AI hasn’t announced post-consumption pricing. Pace your usage accordingly: use it for evaluation and experimentation first, migration second.

Try It Before the Quota Changes

Twenty million tokens is enough to decide whether GLM-5.2 fits into your workflow. Sign up, point your editor at it, and spend an afternoon testing it on your actual codebase. If it works, you’ve just eliminated a monthly subscription. If it doesn’t, you’re out two minutes and zero dollars.

The offer is real. The model is competitive. The silence from the Western developer community won’t last forever — and when the conversation starts, you’ll already have an opinion.