What gets measured

Fri, 24 Apr 2026 16:07:05 GMT

In February, Anthropic published something quietly important. Buried inside an education report, the company released what it called the AI Fluency Index, a baseline measurement of how people actually collaborate with AI, derived from analysis of nearly ten thousand Claude conversations over a single week in January.

The headline finding was simple and, if you sit with it, unsettling. The most effective AI users treat the model as a thought partner rather than an oracle. They iterate. They push back. They question reasoning. Users who iterated on their prompts questioned the model’s reasoning 5.6 times more often, and spotted missing context four times more frequently, than those who didn’t. The least effective users accepted polished outputs without scrutiny. And critically, the more polished the output looked, the less likely anyone was to check it.

A few weeks later, Zapier published a different kind of contribution. Version two of its AI Fluency Rubric, the framework it now uses to assess every candidate it hires, added a fourth assessment dimension beyond mindset, strategy, and building. The new dimension was accountability, captured in a single line that I keep returning to: with AI, you can delegate the work, but not the accountability.

Zapier also introduced a concept it called “fluency slope”: not where a candidate sits today on AI capability, but the trajectory of how they got there. Someone who plateaued eight months ago on the same three tools is a different candidate, in their view, than someone actively experimenting and building on what they’ve learned.

Two reports, separated by a few weeks, from two of the most thoughtful companies operating in the AI economy. Both pointing, I think, at the same conclusion.

AI fluency is now a measurable, observable construct. The organisations that take it seriously are systematically pulling ahead of the ones that don’t. And the behavioural science of how humans work well with AI, which is to say the actual foundation of every productivity claim being made about AI in the economy right now, has crossed the line from interesting idea into something you can put a rubric against.

This is the moment a new category becomes real. And it’s the moment to ask the question those two reports don’t answer.

The question no one can currently answer

Anthropic’s research tells us, with statistical confidence, what good AI usage looks like at the level of an individual conversation. Zapier’s rubric tells us what good AI fluency looks like at the level of an individual hire. Between them, they’ve defined the behavioural unit of analysis with more rigour than we had two years ago.

Neither answers the question that a board, a chief executive, or a CFO actually asks.

How does our organisation compare? Are our people, in aggregate, in the top quartile of AI-enabled performance, or the bottom? Is our investment in tools, training, and transformation producing the business outcomes we’re paying for? And against what reference point, what benchmark, could we possibly know?

That benchmark does not currently exist.

What exists today is a fragmented landscape of partial views. IT dashboards show licence counts. Procurement reports show software spend. Microsoft’s Work Trend Index publishes useful aggregate sentiment data. Ramp’s AI Index tracks adoption through transaction data, a clever move that uses what Ramp can see (what companies pay for) to estimate what they can’t (how those companies actually work). McKinsey, BCG, and Deloitte run bespoke six-figure readiness assessments that aren’t comparable to each other by design. Individual certification platforms test whether a single employee can pass a 40-question quiz.

Each of these is doing something real. None of them tells a leadership team where their organisation actually sits on the dimension that matters, which is whether their people, working with AI, are producing measurably better business outcomes than peers in their industry.

This is the gap. It is going to be filled. The only question is by what, and by when.

Why the measurement problem is harder than it looks

I want to pause on why this gap hasn’t already been closed, because the answer tells you something about what filling it requires.

The obvious move would be to build a telemetry dashboard. Pull admin-API data from the major AI platforms (ChatGPT Enterprise, Claude, Copilot, Gemini), aggregate usage metrics across an organisation, publish a scorecard. Several companies are pursuing some version of this. It is useful. It is not a benchmark.

The problem is that usage and fluency are not the same thing, and fluency and outcomes are not the same thing. Anthropic’s own research demonstrated this directly. People who used more tools were not more fluent than people who used fewer tools more deeply. People who received polished outputs checked them less, not more. Telemetry captures the first-order signal of who is doing what and how much, but not the second-order signal that actually correlates with performance: how they are doing it, and whether it is working.

The second obvious move would be a sentiment survey. Ask employees how confident they feel using AI, whether they think their training is adequate, whether they’re optimistic about the future of AI in their role. Aggregate the responses, produce a score. This is the default instinct of most HR and transformation teams right now, and it is the weakest version of the measurement. Feelings correlate poorly with performance. A confident workforce can produce worse work than a less-confident workforce using AI more rigorously. The opposite is also true. Sentiment is a real signal, but it is not the signal that leaders buying AI actually want.

The harder move, and the one I think is required, is to measure behaviours, drawing on the taxonomies that Anthropic and Zapier have now validated, and link those behaviours to outcomes that leaders already trust. Revenue per employee. Revenue growth. Hiring efficiency. Margin. Time-to-ship. The numbers that get reported to boards, not the numbers that get reported to HR.

This is harder because it requires a longitudinal dataset, a defensible taxonomy, and the statistical discipline to make causal claims rather than correlational ones. It is also the only version of this measurement that produces something a CFO will actually pay for, and more importantly, actually act on.

The Output Index

I’m building The Output Index, a standardised, longitudinal benchmark of how organisations turn AI usage into business performance, scored across companies and segmented by industry, size, and geography.

It works on a simple principle. The behavioural research from Anthropic, Zapier, and others gives us a defensible taxonomy of how people work with AI well. Augmentation over delegation. Iteration over one-shot. Verification over acceptance. Accountability over abdication. I measure that taxonomy at the organisational scale through a standardised survey instrument, taken across a representative sample of an organisation’s workforce. I then link those usage patterns to the outcomes that matter to the people paying for the transformation.

The result, over time, is a single comparable score, the Output Index, that tells any organisation where it sits relative to its industry peers, which specific usage patterns separate the top quartile from the bottom, and what outcome delta to expect from closing the gap.
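To make the idea of a comparable score concrete, here is a minimal sketch of the arithmetic. It is illustrative only and is not the actual instrument or its scoring: the four dimension names, the 1 to 5 answer scale, the equal weighting, the function names, and every number in it are assumptions made for the example.

```python
from statistics import mean

# Hypothetical behaviour dimensions, named after the four contrasts above.
DIMENSIONS = ("augmentation", "iteration", "verification", "accountability")


def org_score(responses):
    """Average each respondent's 1-5 answers, then rescale the mean to 0-100."""
    per_person = [mean(r[d] for d in DIMENSIONS) for r in responses]
    return (mean(per_person) - 1) / 4 * 100


def peer_quartile(score, peer_scores):
    """Return 1 (bottom) to 4 (top), based on the share of peers scoring below."""
    share_below = sum(p < score for p in peer_scores) / len(peer_scores)
    return min(4, int(share_below * 4) + 1)


# Made-up responses from three employees at one organisation.
responses = [
    {"augmentation": 4, "iteration": 5, "verification": 3, "accountability": 4},
    {"augmentation": 3, "iteration": 4, "verification": 2, "accountability": 3},
    {"augmentation": 5, "iteration": 4, "verification": 4, "accountability": 5},
]
# Made-up scores for eight peer organisations in the same industry.
peers = [35.0, 42.5, 51.0, 58.0, 63.5, 70.0, 77.5, 84.0]

score = org_score(responses)
print(round(score, 1), peer_quartile(score, peers))
```

The quartile only means something once the peer list is large and drawn from comparable organisations, which is why the dataset matters more than the formula.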

The dataset itself is the asset. As more organisations are measured, the benchmark sharpens. Industry-specific quartiles become possible. Longitudinal trajectories, Zapier’s “slope” concept applied at the organisational level, become visible. The shared vocabulary that CFOs, COOs, and Chief AI Officers currently lack starts to exist.
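One way to picture slope at the organisational level is as a least-squares trend line through an organisation's scores across successive measurement waves. The sketch below assumes evenly spaced waves and uses made-up scores; the function name and the data are hypothetical, and this is not how Zapier assesses slope in candidates.

```python
def slope(scores):
    """Least-squares slope of scores against wave number 0, 1, 2, ..."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two waves to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Two made-up organisations that finish at almost the same level.
plateaued = [62.0, 62.5, 61.5, 62.0]
climbing = [48.0, 53.0, 57.5, 63.0]

print(round(slope(plateaued), 2), round(slope(climbing), 2))
```

A single snapshot would rate these two organisations as near equals. The trend separates them, which is the point of measuring longitudinally.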

It is worth being honest about where this sits today. I have an instrument, a thesis, and the beginning of a dataset. I do not yet have industry-by-industry quartiles, because those take time and volume to build. What I have is a clear methodology, the intellectual foundation laid down by the people doing serious work in this space, and the conviction that the measurement itself is going to be built by someone in the next eighteen months. I would rather it be built carefully, transparently, and in public. Which is what I intend to do here.

Why now, not in three years

There is an argument that this work is premature. AI tooling changes monthly. The construct is unstable. The methodology will need to evolve.

All of these are true. They are also exactly the reasons to start now.

The companies that defined the language of customer satisfaction (Bain with Net Promoter), employee engagement (Gallup with Q12), credit risk (Moody’s with the rating), and software category authority (Gartner with the Magic Quadrant) all started measuring while the underlying construct was still contested. They did not wait for the category to settle. They settled the category through the act of measuring it consistently, publicly, and credibly over time. The methodology evolved. The standard endured.

The same opportunity exists now in AI-enabled performance, and it has a finite window. The Anthropic and Zapier work has materially shortened that window because it has validated the construct. The remaining question is not whether AI fluency can be measured at scale, but who builds the commercial benchmark layer that lets every organisation see where it sits.

I intend to build that layer.

The next ninety days, in public

The next three months are about building foundations honestly. Three specific things.

The methodology. A first version of the Output Index instrument is live as of this week. It draws on the behavioural taxonomies established by Anthropic, the assessment framework pioneered by Zapier, and the broader emerging science of how humans work effectively with AI. It is anonymous, short (eight questions, four minutes), and open to anyone. You can take it at outputindex.com and contribute to the first wave of data.

The founding cohort. A founding group of ten to fifteen organisations across financial services, professional services, and technology will be the first companies formally measured by the Output Index. Their data will seed the institutional benchmark. Their reports will be the first proof that the methodology works at organisational scale. If you lead AI transformation, people strategy, or operations at an organisation that wants to be in this founding cohort, I would like to talk to you.

The public conversation. This Substack will publish every week or two on what the early data shows, where the methodology is being challenged, and how the conversation with the people building this category is evolving. The goal is not to broadcast conclusions but to build the standard in public, with the people who care about getting it right. Subscribe if you want to follow along.

The closing thought

The line from Zapier’s rubric is worth borrowing as we close. With AI, you can delegate the work, but not the accountability.

The same is true at the organisational scale. Companies can buy AI tools, deploy AI workflows, hire AI specialists, and publish AI strategies. None of that delegates the accountability for whether any of it is actually producing better work, better outcomes, or better results than peers attempting the same. That accountability sits with leaders. Accountability without measurement is just hope.

The Output Index is being built so that hope can become evidence.

If this resonates, subscribe below, take the assessment, or get in touch. This is the beginning of something that will take years to build properly. The first conversations are the most important.

Tom is the founder of The Output Index. This Substack publishes every one to two weeks on the measurement of AI-enabled performance. Get in touch: tom@outputindex.com.