JDML

Local LLMs

Gemma's second act: not ready for agents, but worth watching

· 4 min read

Google shipped a new family of Gemma models and we put them through the same agent workloads we push to Claude every day. The short version: if you were hoping to cut your Anthropic bill to zero and run everything locally, you're not there yet. They're closer than they were six months ago, but there's a clear gap, and it lives in exactly the place most people want to use a local model.

The honest disappointment

Agent flows are where Gemma still struggles: multi-step tool use, structured outputs that survive a long context, recovering gracefully when a tool call comes back malformed. These are the things Claude makes look trivial. Gemma doesn't fail catastrophically; it stutters in ways that compound. One bad tool call in a ten-step loop is usually enough to derail the whole run.

For anything where an agent has to plan, call a tool, read the result, decide what to do next, and repeat, we're still reaching for Claude. Nothing we've tested locally has the same combination of instruction-following, tool-use reliability, and the willingness to say "I don't know" instead of confidently hallucinating a JSON blob.
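To make the failure mode concrete, here's a minimal sketch of the kind of guard we end up writing around every tool call in a local-model agent loop. Everything here is illustrative: `parse_tool_call`, `run_step`, and the expected JSON shape (`tool` and `args` keys) are hypothetical names, not any real framework's API. The point is that with a model that occasionally emits malformed JSON, you validate and retry instead of letting one bad step poison the rest of the run.

```python
import json

# Illustrative sketch (hypothetical names): validate a model's tool-call
# output and retry on garbage, so one malformed step doesn't derail a
# ten-step agent loop.

def parse_tool_call(raw: str, required_keys=("tool", "args")):
    """Return a dict if `raw` is a well-formed tool call, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not all(k in call for k in required_keys):
        return None
    return call

def run_step(generate, max_retries=2):
    """Ask the model (any callable: attempt number -> text) for a tool
    call, retrying on malformed output before giving up cleanly."""
    for attempt in range(max_retries + 1):
        call = parse_tool_call(generate(attempt))
        if call is not None:
            return call
    # Bail explicitly rather than passing garbage to the next step.
    return {"tool": "abort", "args": {"reason": "malformed tool call"}}
```

With a strong hosted model the retry branch almost never fires; with the local models we've tested, it fires often enough that you feel it in latency and in run quality.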

Where Gemma actually shines

Single-turn tasks. Summaries. Classification. Extraction from structured documents. Anywhere you want a fast, cheap, local model that doesn't need to reason across twenty steps, Gemma holds its own. The new generation follows tight instructions noticeably better than the last, and the licensing is friendly enough to run on your own hardware without a legal conversation.

The real story: Qwen is still the one to beat

If you're looking at local LLMs today, Qwen is still the benchmark. The Qwen family has a meaningful lead on reasoning, coding, and tool use, the exact places Gemma is weakest. We run Qwen in production for a few internal workloads, and nothing from this Gemma release has given us a reason to switch.

That's not a shot at Google. Gemma is a credible, open, well-documented model family, and having more than one serious open-weight player is how the ecosystem gets better. It just isn't winning on the benchmarks that matter for the work we do, and pretending otherwise wouldn't help anyone.

What we're watching

  • Whether the next Gemma iteration closes the tool-use gap
  • Fine-tuning Gemma for narrow agent workflows where general flexibility isn't needed
  • Hybrid setups: Gemma locally for cheap, high-volume turns, Claude for the hard steps
  • Whether Qwen stays in front as the open-weight leader
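The hybrid setup in that list can be sketched in a few lines. This is an assumption-laden toy, not our production router: the `step` fields (`kind`, `needs_tools`, `depth`) and the task set are made up for illustration. The idea is simply that cheap, single-turn work stays on the local model and anything multi-step or tool-using escalates to the hosted one.

```python
# Hypothetical routing sketch: local model (e.g. Gemma or Qwen) for cheap
# single-turn work, hosted model (e.g. Claude) for multi-step or
# tool-using steps. Field names are illustrative, not a real API.

LOCAL_TASKS = {"summarize", "classify", "extract"}

def pick_model(step: dict) -> str:
    """Return 'local' or 'hosted' for a single agent step."""
    if step.get("needs_tools") or step.get("depth", 1) > 1:
        return "hosted"   # multi-step or tool-heavy work: escalate
    if step.get("kind") in LOCAL_TASKS:
        return "local"    # cheap, high-volume single turns stay on-box
    return "hosted"       # when in doubt, default to the stronger model
```

The useful property of defaulting to "hosted" is that routing mistakes degrade cost, not correctness.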

Our current stance

Claude for agents. Qwen (or Gemma) for everything local and cheap. Revisit in six months. The pace on this stuff is fast enough that any take older than a quarter should be treated as suspect, including this one.

Building something in this space? Let's talk.

We spend a lot of time with these tools. If you're trying to figure out which model fits your workload, we're happy to share what we've learned.

Get in touch