
How to Reap Compound Benefits From Generative AI


In domain after domain, AI has compressed work that used to be expensive — generating drafts, code, prototypes, and analyses. The marginal cost of a first attempt has dropped sharply. What remains expensive is what happens after the output arrives: evaluating what gets generated. That involves separating signals from noise, catching errors, capturing what was learned, and applying those lessons to the next iteration.

This shift changes what organizations should optimize for. The old question was “How do we produce more, faster?” The new question is “How do we systematically learn from, and with, what AI produces?”

Most organizations still overinvest in answering the old question. They treat artificial intelligence as a throughput accelerator: task in, output out, loop closes. This is consumption economics. A serious CFO instantly recognizes the pattern: asset depreciation.

The organizations pulling ahead answer the new question. They treat AI as a capability accelerator: task in, output out. But they also ask, “What worked? What failed? What should change next time?” Insights get captured, converted into shared knowledge, and applied to subsequent interactions. Each cycle makes the next more effective. This is compounding value. Serious CFOs recognize this pattern, too: asset appreciation.

The data bears this out. Organizations that build systematic feedback loops between humans and AI are six times more likely to derive substantial financial benefits from AI, according to research by MIT Sloan Management Review and Boston Consulting Group.1 Organizations that invest in learning with AI are 73% more likely to achieve significant financial impact.2 Yet, as of 2024, 70% of companies had adopted AI, but only 15% were using it for organizational learning.3

Leaders seeking compound returns must build what most companies don’t yet understand, let alone possess: systems that verify AI outputs, evaluate what they reveal, and capture what was learned so that each interaction becomes a building block for the next. This type of ROI with GenAI — return on iteration — doesn’t happen by accident; it requires infrastructure. Let’s examine what that infrastructure looks like.

Why This Moment Is Structurally Different

This is not old productivity advice dressed in new rhetoric. Two complementary economic dynamics, each reinforcing the other in a virtuous cycle, make compounding management an imperative.

In his 1966 book The Tacit Dimension, philosopher Michael Polanyi observed that humans know more than they can articulate. For decades, that tacit knowledge protected knowledge workers. What could not be explicitly described could not be automated. Tacit expertise was a moat.

AI breaches that moat — not by codifying tacit knowledge but by inferring it from behavioral traces at scale. Large language models (LLMs) absorb how experts actually work, including knowledge the experts never articulated. Legal reasoning in briefs and opinions, financial judgment in analyst reports and trading patterns, strategic thinking in board presentations: As these behavioral traces become more legible to AI models, the tacit expertise embedded in them becomes readable by machines.

Boris Cherny, who led the development of Claude Code, described a revealing moment: After he gave Claude the tools to interact with his file system, the AI began exploring the system on its own to find answers. “It was mind-blowing,” Cherny said. He had not programmed that capability. The model inferred how developers work from the traces they had left behind — behaviors that no one had previously formalized.

The second dynamic makes the economic case for compounding even more compelling. In 1865, economist William Stanley Jevons observed that when steam engines became more efficient, coal consumption increased rather than decreased. Efficiency gains made the capability cheaper, stimulating demand. As tacit expertise becomes readable by machines, the cost of sophisticated capability drops dramatically. Projects that were previously too expensive to prototype can proliferate. Iteration cycles that once took months compress to hours. More expertise becomes readable to machines, expanding what AI can access while enhancing the AI’s knowledge base and improving its capability. More capability expands what organizations attempt. The loop feeds itself.

The data supports this structural shift. Organizations that combine strong organizational learning with learning specific to AI are up to 80% more effective at managing uncertainty.4 The implication is direct: Becoming better learners with AI is at least as important as using AI to create efficiencies.

The challenge for organizations worldwide is not whether or how AI will access their people’s domain expertise — that appears computationally inevitable. The issue is developing the competence and commitment to install mechanisms that reap compounding returns on human-AI interactions before competitors do.

Three Steps to Compounding Benefits

What do those essential mechanisms look like? We argue that organizations must prioritize three distinct but interrelated operations. When all three of the following steps are present and connected, organizations can reap compounding benefits on AI use. When any step is missing, organizations merely consume AI outputs.

1. Verification. The question here is “Does this output meet the standard?” This step produces a binary answer: correct or incorrect, usable or not. Verification compares output against a criterion that already exists. Unverified AI output is noise with a confident tone. But verification, used alone, catches errors without generating learning.

2. Evaluation. For this step, the question is “What does this output reveal?” Where verification compares output against existing standards, evaluation may generate standards that did not exist before. This is why evaluation requires domain expertise in ways verification often does not. The expert as evaluator is not merely checking quality. They are discovering what quality means in this new context. With AI outputs, evaluation is required across three dimensions: volume, variety, and velocity. Human bandwidth to do evaluations, not AI access, becomes the binding constraint.

3. Learning capture. The third question is “How do we ensure that this insight persists?” When evaluation is not recorded, knowledge does not compound; it evaporates after each interaction. Learning capture converts single insights into organizational knowledge, such as documented criteria, updated prompts, and shared repositories of what worked and why. Think of it as version control for organizational judgment. Without it, evaluation is a one-time event. And learning capture alone (documentation without verification or evaluation upstream) produces nothing but organized noise.

Those three steps dynamically reinforce one another. Better verification produces cleaner signals for evaluation. Better evaluation generates richer material for capture. Better capture improves the criteria used in the next round of verification. The cycle is the point.
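The three-step cycle is compact enough to sketch in code. The following is a minimal, illustrative Python sketch, not an implementation from any organization described here: the reviewer-marker convention and the `verify` and `evaluate` helpers are assumptions standing in for real checks and real expert judgment.

```python
# Illustrative sketch of the verify -> evaluate -> capture cycle.
# All names and criteria are assumptions, not a prescribed implementation.

def verify(output: str, standards: list[str]) -> bool:
    """Binary check: does the output meet every existing standard?"""
    return all(s in output for s in standards)

def evaluate(output: str, standards: list[str]) -> list[str]:
    """Stand-in for expert judgment: surface candidate new standards the
    current checklist does not yet name (here, crudely, any line an
    expert reviewer flagged with a leading '!')."""
    return [line.strip("! ") for line in output.splitlines()
            if line.startswith("!") and line.strip("! ") not in standards]

def run_cycle(output: str, standards: list[str]) -> list[str]:
    """One pass of the loop: verify, evaluate, capture new criteria."""
    if not verify(output, standards):
        return standards          # noise: fix the output before learning from it
    learned = evaluate(output, standards)
    return standards + learned    # capture: the next round verifies against more

standards = ["brand tone"]
draft = "brand tone respected\n! lead with customer identity"
standards = run_cycle(draft, standards)
print(standards)  # the captured insight is now part of the checklist
```

The point of the sketch is the return value: each pass can enlarge the standards that the next pass verifies against, which is what makes the loop compound rather than merely repeat.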

There is yet another valuable and scalable learning dividend: Most experts cannot fully articulate what makes their judgment good. Forcing that judgment into written standards, such as the way developers write CLAUDE.md files that specify what “good” code looks like, makes the tacit explicit for colleagues and for AI alike. The gap between what an LLM delivers and what the expert wanted surfaces unspoken knowledge.

At Anthropic, Cherny gives the AI a way to verify its own work — a test suite, a browser check — before a human ever sees it. To evaluate the work’s quality, he concurrently runs 10 to 15 Claude instances that generate swarms of smart subagents: One checks style while another hunts bugs, then a second cohort challenges the first for false positives. Capture is key: A CLAUDE.md file gathers mistakes, corrections, and design principles inside the workflow itself — not after its completion but while it is happening. Each new session inherits what every prior session learned. For Cherny and his developers, the benefits compound.

There are analogous questions for leaders of other business functions: What is your equivalent of version control for organizational decisions? Of automated testing for new approaches? Of code review to make evaluation criteria explicit and shared? The “verification-evaluation-learning capture” flywheel offers both challenge and opportunity for managers and executives who want to use AI to do measurably more than simply cut costs and improve efficiencies.

Consider a marketing team using AI to generate campaign briefs. Verification asks whether the brief meets basic brand standards, such as consistent tone, correct product claims, and regulation-compliant disclaimers. Automation is fast and cheap. Evaluation asks what the brief reveals: Did AI surface customer insights the team hadn’t named? Did it miss the emotional register entirely? Are these insights “actionable” — meaning, can they trigger interactions and offers to cultivate relationships and/or close deals? These judgments require a senior strategist, not a checklist.

Learning capture asks whether that strategist’s correction — “Our brand never leads with product features; it leads with customer identity” — gets written into a shared prompt template or brief standard for the whole team to use the next time. Without that last step, the strategist’s insight dies with the session. With it, every subsequent brief starts smarter. And perhaps that brief becomes the charter for designing an intelligent marketing agent.

The moment a CMO and/or CFO builds dashboards around those questions and criteria, the organization has begun compounding.

When Verification Masquerades as Evaluation

The machinery requires a human who holds the loop open when every instinct says to close it.

Jaana Dogan, a principal engineer at Google responsible for developer infrastructure on the Gemini API, ran a revealing experiment. She pointed Claude Code — a rival’s tool — at a problem her team had spent many months solving. Given a short prompt with no proprietary Google data, Claude Code generated a design solution comparable to the one her team had landed on, along with a working prototype.

Most managers, seeing that output, would just verify: “Does this match what we built? Close enough? Adopt or reject.” Verification is fast, comfortable, and binary. It answers the question already in your head.

Dogan did something different. She decided, “It’s not perfect and I’m iterating on it.”

Evaluation interrogates what the output reveals — about the problem, about your assumptions, and about what you haven’t yet named. Dogan could do this because she had months of judgment to bring to the encounter. AI compressed the implementation; it could not compress the formation of expertise. Without that prior work, only two moves exist: Accept or reject. With it, a third move opens up: Stay in the encounter and learn.

This is the distinction most organizations miss. They treat AI outputs as verdicts to be confirmed rather than starting points to be interrogated. The result is consumption dressed up as adoption — verification mistaken for the whole job.

The implication: Deploy AI first in domains where your people already have deep expertise, not because AI needs hand-holding but because evaluation requires someone capable of recognizing what “not perfect” actually means and knowing what iteration may reveal. The expert as evaluator is not a transitional role.

But Dogan’s insight lives only in her head until infrastructure captures it. The question for any organization is not whether individual experts can hold loops open — some always will. It’s whether the machinery exists to convert their judgment into shared knowledge that persists.

That machinery is what most organizations lack. They have experts. Some even have experts with the right disposition. What they don’t have is the infrastructure that makes compounding automatic rather than incidental.

Building the Capability

Translating these practices into infrastructure for business functions beyond software is the work that remains for leaders. This requires a minimum of five moves.

1. Preserve your company’s evaluation expertise. To reap compound returns, you depend on people who can accurately evaluate AI output. This is domain expertise repositioned: the expert as evaluator rather than the expert as producer. Organizations that let people’s deep expertise atrophy because “AI can do that now” will lose this very valuable capability.

2. Build verification mechanisms. As noted above, the cycle cannot begin without verification of output. Software verification is cheap: Code runs or it doesn’t. Finance has moderate verification costs; models can be stress-tested against historical data, for example. Strategic planning has expensive verification: Long bets may not resolve for years. Most organizations treat high verification costs as a reason not to start that work with AI tools. Instead, the smart move is minimally viable verification: the cheapest credible check that an AI output is not wrong. Consider multijudge systems that surface disagreement, and consistency checks that compare outputs across different formulations of the same problem. None of these guarantees correctness, but each offers enough verification to start the cycle.
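Minimally viable verification can be made concrete in a few lines. In this illustrative Python sketch, the “judges” are trivial stand-in functions (a length sanity check, an overclaiming check, a complete-sentence proxy); in a real multijudge system each would be a separate model call or rubric. The disagreement threshold and the field names are assumptions.

```python
# Sketch of "minimally viable verification": several independent judges
# score the same output, and disagreement flags it for human review.
# The judges below are toy stand-ins for separate model calls or rubrics.
from statistics import pstdev

def multijudge(output: str, judges) -> dict:
    """Collect a score from each judge; high spread means low confidence."""
    scores = [judge(output) for judge in judges]
    return {"scores": scores,
            "flag_for_review": pstdev(scores) > 0.2}  # threshold is illustrative

judges = [
    lambda o: 1.0 if 20 < len(o) < 2000 else 0.0,         # length sanity
    lambda o: 0.0 if "guaranteed" in o.lower() else 1.0,  # overclaiming check
    lambda o: 1.0 if o.rstrip().endswith(".") else 0.0,   # complete-sentence proxy
]

clean = multijudge("Revenue rose 4% last quarter. Margins held.", judges)
risky = multijudge("Guaranteed 50% returns", judges)
print(clean["flag_for_review"], risky["flag_for_review"])
```

When the judges agree, the output passes cheaply; when they disagree, the output is routed to a human. That is the whole design: not a guarantee of correctness, just enough signal to let the cycle start.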

3. Institute evaluation practices. Few organizations systematically evaluate AI outputs. After every significant AI interaction, users should ask three questions: What worked? What failed? What was interestingly wrong — wrong in a way that reveals something about the problem the team has not previously articulated? That third question is where hidden value lives. An output that fails in a way the expert noticed but had not yet named becomes new organizational knowledge: It is tacit expertise becoming explicit. People must be prompted to ask these questions as part of the existing workflow. Build evaluation into workflows to pave the way for value to compound.

4. Create capture systems. Evaluation without capture evaporates. Capture systems operate on two levels: inferential (learning from patterns in accumulated traces, the way AI learns from historical data) and explicit (recording human judgment in retrievable form). Both matter. A practical approach to both is lightweight infrastructure: decision journals that record not just what was decided but why; prompt repositories that preserve what worked and what failed instructively; and evaluation logs that make the team’s evolving standards searchable. The design principle is retrievability, not comprehensiveness. A marketing team’s capture system is a prompt library and a shared brief template. A finance team’s is an annotated model log. Every function can build its equivalent of CLAUDE.md. Discipline, not cost or creativity, is the true constraint.
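Reduced to its skeleton, such a capture system is an append-only, retrievable log of judgments. The Python sketch below is illustrative: the field names (`tag`, `decision`, `rationale`) and the in-memory backing store are assumptions; a real system would persist to a shared file or database.

```python
# Lightweight capture sketch: an append-only evaluation log, keyed for
# retrieval. Field names and storage are illustrative assumptions.
import io
import json

class EvaluationLog:
    """Records not just what was decided but why, and makes it searchable."""
    def __init__(self, stream):
        self.stream = stream

    def capture(self, tag: str, decision: str, rationale: str) -> None:
        record = {"tag": tag, "decision": decision, "rationale": rationale}
        self.stream.write(json.dumps(record) + "\n")

    def retrieve(self, tag: str) -> list[dict]:
        self.stream.seek(0)
        return [rec for line in self.stream
                if (rec := json.loads(line))["tag"] == tag]

log = EvaluationLog(io.StringIO())   # a shared file in practice
log.capture("brief", "rejected draft v2",
            "Brand never leads with product features; lead with identity.")
hits = log.retrieve("brief")
print(hits[0]["rationale"])
```

The design principle from the text shows up directly: retrievability, not comprehensiveness. A one-line record that the next person can find beats an exhaustive record no one reads.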

5. Measure the cycle, not just the output. Most organizations judge an AI deployment’s success using measures like tools adopted, hours saved, or tasks completed. These are consumption metrics. Organizations trying to reap compound returns measure the cycle: How many interactions were verified? How many were evaluated? How much learning was captured? How quickly did captured learning change subsequent practice? Did your team leaders learn things from AI interactions last week that changed how they worked this week? If not, the cycle is not running.
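The cycle metrics above can be computed from a simple interaction log. In this sketch, the records and field names (`verified`, `evaluated`, `captured`, `applied_after_days`) are assumptions about what a team might track, not a prescribed schema.

```python
# Sketch of cycle metrics (vs. consumption metrics): given a week of
# logged AI interactions, compute verification, evaluation, and capture
# rates, plus how fast captured learning changed practice.
# The schema below is an illustrative assumption.

def cycle_metrics(interactions: list[dict]) -> dict:
    total = len(interactions)
    verified = sum(i["verified"] for i in interactions)
    evaluated = sum(i["evaluated"] for i in interactions)
    captured = sum(i["captured"] for i in interactions)
    # days from capture to first changed practice, where recorded
    lags = [i["applied_after_days"] for i in interactions
            if i["captured"] and i.get("applied_after_days") is not None]
    return {
        "verified_rate": verified / total,
        "evaluated_rate": evaluated / total,
        "captured_rate": captured / total,
        "median_days_to_apply": sorted(lags)[len(lags) // 2] if lags else None,
    }

week = [
    {"verified": True, "evaluated": True, "captured": True, "applied_after_days": 2},
    {"verified": True, "evaluated": True, "captured": False},
    {"verified": True, "evaluated": False, "captured": False},
    {"verified": False, "evaluated": False, "captured": False},
]
print(cycle_metrics(week))
```

A dashboard built on numbers like these answers the closing question directly: if `captured_rate` is near zero, or captured learning never produces an `applied_after_days` value, the cycle is not running.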

The Deeper Transformation

Most leaders still want to consume AI. They ask, “How do we produce faster, better, cheaper with AI?” The new question is “How do we systematically learn, at speed, from what AI produces?”

Productivity in the era of generative AI is no longer just output per unit of input. It is also measurable learning per unit of interaction. Organizations that build the machinery to run the cycle — verify, evaluate, capture, apply — will compound that capability over time. Those that do not will consume AI without converting it into knowledge. They’ll be busy, perhaps, but not learning and not reaping compound benefits.

Dogan’s eight words embody this shift: “It’s not perfect and I’m iterating on it.” She verified that the output was usable. She evaluated what it revealed.

She is iterating; her learning is being applied to the next interaction. The compounding cycle is running. It is available to any organization willing to build the machinery that makes it possible.