The Explore-Exploit Tradeoff for AI Tools
I've been spending a lot of time lately figuring out which AI tools to invest in. New models, new frameworks, new CLI tools, new IDE integrations (or no IDE at all?!). The ecosystem is moving so fast that just orienting yourself feels like a project in its own right.
At some point, though, I had to stop evaluating and start building.
This tension is something I keep coming back to. In reinforcement learning, it has a name: the explore-exploit tradeoff. The classic framing is the multi-armed bandit problem. You're standing in front of a row of slot machines, each with an unknown payout rate. Do you keep pulling the lever on the machine that's been paying out decently? Or do you try a new one that might be better, but might also be a total dud?
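The bandit setup is easy to simulate. Below is a minimal sketch of an epsilon-greedy policy, one of the simplest answers to the dilemma: with probability epsilon you explore a random machine, otherwise you exploit the one with the best observed average. The payout rates and the epsilon value here are made-up numbers for illustration.

```python
import random

def epsilon_greedy(payouts, epsilon=0.2, pulls=10_000, seed=0):
    """Simulate an epsilon-greedy player on a row of slot machines.

    payouts: true win probability of each machine (unknown to the player).
    epsilon: fraction of pulls spent exploring a random machine.
    """
    rng = random.Random(seed)
    counts = [0] * len(payouts)   # pulls per machine
    means = [0.0] * len(payouts)  # running average reward per machine
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: pull a machine at random.
            arm = rng.randrange(len(payouts))
        else:
            # Exploit: pull the machine with the best average so far.
            arm = max(range(len(payouts)), key=lambda i: means[i])
        reward = 1.0 if rng.random() < payouts[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / pulls, counts

# Three machines; the last one quietly pays out best.
avg, counts = epsilon_greedy([0.2, 0.5, 0.7], epsilon=0.2)
```

Run it and the pull counts concentrate on the best machine, while the 20% exploration budget keeps checking the others just in case.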

That's the dilemma with AI tools right now. Andrej Karpathy, one of the most respected voices in ML, captures the feeling pretty well in this post:
I've never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become…
— Andrej Karpathy (@karpathy) December 26, 2025
The surface area of what we could be learning right now is enormous, and it's expanding faster than anyone can keep up with. The pull to explore feels rational because these tools are genuinely improving at a wild pace. Sometimes the new thing really is a meaningful leap forward.
But there's a cost to perpetual exploration. Every hour spent evaluating a new tool is an hour not spent building with the tools you already have. And building is where the compounding happens.
The 80/20 Split
I've landed on a rough ratio that works for me, at least for now: 80% exploitation, 20% exploration.
- The 80% is about depth. I choose the tools that work for me and commit to them long enough to build real fluency. There's a compounding effect to sticking with something – you start to internalize the quirks and work around the limitations without thinking. At the moment, I’ve doubled down on Claude Code and built out a set of custom skills and agents to automate a lot of the tedious parts of my workflow. These days, over 90% of my PRs involve AI somewhere in the loop.
- The 20% is about staying current without losing your mind. For me, this looks like reading tech blogs, paying attention to what coworkers share in Slack, experimenting with new workflows (I've been playing around with agent teams lately), and occasionally spinning up a tool I haven't tried before just to get a feel for it. The goal isn’t to adopt everything. It’s to maintain enough awareness that when something genuinely useful drops, I recognize it.
This ratio isn't fixed, though. If you're just getting started with AI tools, you probably should be exploring more – maybe 50-50 or even 60-40 in favour of exploration. If you're mid-project and shipping on a deadline, it might be 95% exploit and 5% explore. The point isn't the exact numbers. It's being intentional about the split rather than letting it happen to you.
The 20% in Practice
I've found that exploration only works if I'm deliberate about it. My version of this is usually a couple of hours on Friday afternoon, when my brain is too cooked for deep work anyway. I pick one thing: a new model, a tool a coworker mentioned, a workflow I read about in a blog post. If it doesn't "click" within an hour, I move on.
Most things don’t stick, and that’s expected. I recently tried Pi after hearing a lot of hype about it at work, and it didn’t really change anything for me. I still prefer Claude Code. But occasionally something does land. For example, I tried autoresearch in the same spirit, and it’s already making its way into my daily workflow.
The point of this isn’t to find a winner every time. It’s to keep a light sampling loop running in the background, so that when something is genuinely useful, it stands out quickly.
What I try to avoid is reactive exploration – e.g., seeing a demo of a flashy tool and immediately going down a three-hour setup rabbit hole in the middle of a workday. Sometimes this pays off, but more often it fragments my attention and leaves me with a shallow understanding of something I didn’t spend enough time with to evaluate properly. I still catch myself slipping into this, which is exactly why I rely on the 80/20 split as a guardrail.
Exploration feels productive because it looks like progress. But compounding only happens on the exploitation side. In reinforcement learning, there's no single policy that solves the explore-exploit tradeoff in every setting, only policies that adapt well over time. I think that's true here too. The goal isn't to get the ratio perfect. It's to notice when you've been exploring so long that you've forgotten what you were trying to build.
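"Adapting over time" has a simple analogue in bandit algorithms: decay the exploration rate as your estimates firm up. A hypothetical sketch, where the start, floor, and halflife values are arbitrary placeholders:

```python
def decaying_epsilon(t, start=0.5, floor=0.05, halflife=100):
    """Exploration rate that starts high and settles toward a floor.

    t: how many pulls (or weeks of tool-testing) you've already done.
    Starts at `start`, halves its distance to `floor` every `halflife` steps.
    """
    return floor + (start - floor) * 0.5 ** (t / halflife)

# Early on, explore half the time; later, mostly exploit.
early, late = decaying_epsilon(0), decaying_epsilon(1000)
```

The shape mirrors the advice above: heavy exploration when you're new to the tools, tapering toward a small standing budget once you've committed to a stack.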

Until next time,
✌️