
What AI Can Teach Us About Designing Better KPIs


In 2016, Wells Fargo found itself embroiled in scandal when headlines revealed that its employees, under pressure to meet aggressive sales targets, had opened millions of unauthorized customer accounts. The root cause wasn’t just unethical behavior but a flawed approach to performance measurement. When Wells Fargo’s leadership incentivized employees to sell eight financial products per customer, they inadvertently encouraged gaming behaviors that harmed customers, employees, and ultimately the bank itself. The metric had become more important than its underlying business purpose.

Goodhart’s law holds that “when a measure becomes a target, it ceases to be a good measure.” The law in action only sometimes rises to the level of scandal, but companies everywhere fall prey to it. Despite decades of warnings against metric fixation, leaders continue to build incentives around narrow indicators, a habit that invites gaming and ethical lapses and harms business performance.

Traditional remedies for Goodhart’s law, like balanced scorecards and KPIs, often fail because they remain vulnerable to narrow optimization and gaming behaviors in the absence of careful oversight. The persistence of metric fixation signals the need for a more sophisticated approach to the problem.

A New Lens: Insights From AI Training

Leaders increasingly view organizations as systems to be optimized for specific outcomes, much as machine learning researchers optimize algorithms. Since both contexts involve optimizing proxy measures that can diverge from the true goals, solutions from AI research could help solve persistent organizational measurement problems. Of course, organizations consist of people with the agency and complex motivations that algorithms lack, so these techniques provide frameworks that require human adaptation, not mechanistic solutions.

AI researchers have long studied a phenomenon called overfitting, in which machine learning models perform well on training data but fail in real-world scenarios because they cannot generalize to new data.

Consider a simple hypothetical example: An image recognition model tasked with distinguishing dogs from cats is trained on thousands of pet photos. During training, the system appears to perform flawlessly, correctly identifying every animal in the data. But when shown new photos in the real world, it fails dramatically. Why? The model didn’t learn to recognize the essential features of dogs and cats; instead, it discovered incidental patterns in the training data. Perhaps the dog photos had grass backgrounds while cat photos were taken indoors. The system optimized for measurement (training accuracy), not the underlying goal (identifying real features).
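To make the failure concrete, here is a minimal sketch in Python (scikit-learn on synthetic data; the “snout length” and “grass background” features are hypothetical stand-ins, not from any real system). The model scores essentially perfectly on its training data but falls apart once the incidental grass pattern no longer holds.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n, spurious_correlated):
    label = rng.integers(0, 2, n)                    # 1 = dog, 0 = cat
    snout_length = label + rng.normal(0, 1.5, n)     # weak "real" signal
    if spurious_correlated:
        grass = label.astype(float)                  # background perfectly tracks the label
    else:
        grass = rng.integers(0, 2, n).astype(float)  # background is unrelated to the label
    return np.column_stack([snout_length, grass]), label

X_train, y_train = make_data(1000, spurious_correlated=True)   # "training photos"
X_real, y_real = make_data(1000, spurious_correlated=False)    # "real world"

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))     # close to 1.0
print("real-world accuracy:", model.score(X_real, y_real))     # much lower
```

The model optimizes the measurement (training accuracy) by leaning on the background feature, exactly the shortcut the article describes.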

AI researchers recognize that optimizing too aggressively for proxy measures eventually undermines real-world outcomes. Jascha Sohl-Dickstein, a prominent AI researcher who has pioneered work in this area, has said that “increased efficiency can sometimes, counterintuitively, lead to worse outcomes,” noting that this happens “almost everywhere,” whether in machine learning or organizational settings.1 In machine learning, researchers observe that performance on a model’s true objective follows a particular pattern: It initially improves as the model optimizes for the proxy measure, then plateaus, and eventually deteriorates dramatically as optimization intensifies. This demonstrates how the aggressive pursuit of metrics can destroy the very outcomes those metrics were intended to promote.

The phenomenon occurs in organizations, too, when teams pursue metrics in ways that disconnect them from their strategic intent: Sales teams hit targets by pushing unwanted products, schools improve test scores by teaching to the test rather than building deeper understanding, or customer service representatives rush through calls to optimize for call-time metrics while sacrificing how completely they solve each customer’s problems. Short-term wins on such metrics can hide or exacerbate deeper issues.

Understanding how AI researchers avoid the problem of overfitting can help managers rethink how they design and track performance indicators to avoid gaming behaviors or narrow optimization and instead drive meaningful results.

The Metric Intelligence Framework in Action

Machine learning has evolved four strategies to combat overfitting, each with direct implications for organizational design.

1. Early stopping: Prevent overoptimization through timely reassessment. Machine learning researchers discovered that training a model past a certain point harms its performance on new data. The solution is early stopping, which entails monitoring performance on separate validation data and halting optimization when validation performance peaks, even if training performance could improve further.

In a typical image recognition system, engineers might train the model on thousands of labeled images while regularly testing it on separate validation images. They may track the error rate on both sets and stop training when validation error starts to increase — a signal of overfitting — even if training error is still decreasing.
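A rough sketch of the same mechanism, using scikit-learn’s gradient boosting on synthetic data (the dataset, round count, and learning rate below are illustrative choices, not from any real system): training error keeps falling with every round, but error on the held-out validation set bottoms out and then rises, and that minimum marks the early-stopping point.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.4, 300)        # true signal plus noise

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Train for many rounds; training error keeps falling throughout.
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  random_state=0).fit(X_train, y_train)

# Validation error after each round is the early-stopping signal.
val_errors = [mean_squared_error(y_val, pred)
              for pred in model.staged_predict(X_val)]
best_round = int(np.argmin(val_errors)) + 1

print(f"validation error bottoms out at round {best_round} of 500")
print(f"round 500 validation error: {val_errors[-1]:.3f} vs "
      f"{val_errors[best_round - 1]:.3f} at the early-stopping point")
```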

Organizations can implement early stopping by regularly reassessing metrics against true objectives. Amazon applies this principle by periodically reviewing and adjusting its performance metrics; the company famously shifted its primary metric focus from profit margins to free cash flow per share in 2004. Such periodic reviews prevent teams from overoptimizing for any single metric and keep measurement aligned with changing business needs.

Intel’s experience in the late 1990s illustrates the process. The company had long emphasized processor clock speed as its primary performance metric and marketing focus. When auditing the metric, Intel discovered that continued optimization for speed was causing excessive power consumption, overheated chips, and diminishing returns in actual computing performance. This realization prompted Intel to shift toward a more balanced set of metrics that included energy efficiency and overall computing capabilities — measurements that better aligned with customer value.

Microsoft’s shift to a more fluid goal-setting process exemplifies these same principles. When Satya Nadella became Microsoft’s CEO in 2014, he devoted significant attention to helping the organization understand that metrics like revenue and profit were imperfect proxies for the company’s broader mission.2 This cultural shift away from metric fixation and toward an understanding of measurement limitations helped Microsoft employees make decisions that balanced quantitative targets with qualitative judgment. Instead of adhering to annual targets regardless of market changes, the company now conducts quarterly business reviews where teams can adjust their metrics and goals based on changing conditions and emerging data. The approach allows the organization to stop optimization for metrics that are becoming counterproductive.

Another way to apply early stopping is by setting minimum targets rather than maximization goals. Unlike Wells Fargo’s “eight products per customer” — which drove aggressive selling — true minimum targets establish a threshold of acceptable performance without rewards for exceeding it. For instance, a customer service department might set a minimum target of resolving 85% of inquiries within 24 hours. Once teams meet these minimum standards, they can focus on other dimensions of performance, preventing overfitting to any single metric.

2. Noise injection: Build robustness through controlled randomness. AI systems benefit from deliberate randomness. When models rely too heavily on specific patterns, they memorize quirks rather than learning genuine relationships. As in our earlier example, if all dog photos in the training data have grass backgrounds, the model might learn “grass means dog” rather than identifying actual canine features. To prevent such overfitting, some connections in the model are randomly turned off during training. Similarly, slightly altering training data, such as rotating images or adding background noise to audio, ensures that the AI learns meaningful patterns instead of irrelevant coincidences. This approach is known as noise injection.
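Both techniques can be sketched in a few lines of Python (NumPy only; the dropout rate and pixel shifts below are illustrative defaults, not prescriptions): dropout randomly silences units during training, and a simple augmentation jitters each image so exact pixel positions stop being reliable cues.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Randomly silence a fraction of units during training only."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)   # rescale so the expected value is unchanged

def augment(image, max_shift=2):
    """Shift an image a few pixels so exact positions stop being reliable cues."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dy, dx), axis=(0, 1))

hidden = rng.normal(size=(4, 8))     # a small batch of activations
print(dropout(hidden)[0])            # roughly half of the values are zeroed

image = rng.random((28, 28))         # a toy grayscale image
print(augment(image).shape)          # same shape, content shifted
```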

Organizations can apply noise injection in other contexts by conducting random audits, rotating team compositions, shuffling decision makers, or varying operational conditions. A sales team that always presents to the same manager might optimize for that person’s preferences rather than customer value. By introducing controlled variability, organizations can prevent people from gaming predictable patterns.

Financial institutions like JPMorgan Chase use random compliance audits that mirror AI’s noise injection techniques. By making the timing of evaluations unpredictable, these audits discourage employees from developing workarounds or gaming behaviors that might emerge under predictable assessment schedules. The effectiveness of this approach is well documented in the IRS’s random-audit program, where studies show each dollar invested in random audits generates up to $6 in additional revenue through deterrence effects.3 Research has also found that randomly audited taxpayers maintain higher compliance for at least five years following an audit, with long-term benefits that exceed the immediate revenue recovered.4

The Food and Drug Administration (FDA) implements this principle through its unannounced-inspection program for pharmaceutical manufacturers. Unlike scheduled audits, where facilities can temporarily optimize for evaluation day, surprise inspections prevent companies from hiding compliance issues. These random evaluations consistently reveal problems that would have been concealed during planned assessments, thus encouraging companies to maintain continuous compliance. The approach proved so effective that the FDA expanded its unannounced-inspection program in 2025.

Customer service organizations apply this principle when they use mystery shoppers who appear at unpredictable times, preventing retail staff members from providing excellent service only when they expect to be evaluated. Sales organizations might randomly audit customer interactions rather than focusing exclusively on closed deals, ensuring that representatives maintain ethical standards throughout the sales process. The Transportation Security Administration rotates screening personnel across different checkpoints and functions — baggage screening, metal detectors, document verification — to prevent security gaps that could emerge from predictable assignments. For instance, if the same guard always screened the same checkpoint, bad actors could study their screening patterns or even cultivate relationships with them to engage in social engineering. Random rotation ensures that no one can predict or influence who will be conducting their screening.

3. Capacity alignment: Match metric complexity to organizational capabilities. Machine learning researchers match model complexity to their data and monitoring capabilities. This means that a team with millions of training examples and strong validation methods can safely deploy complex models with many parameters, whereas teams with limited data must use simpler models to avoid overfitting.
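A brief illustration of the trade-off, using scikit-learn on a toy dataset (the polynomial degrees and sample sizes are arbitrary choices for demonstration): with only 20 noisy observations, a high-capacity degree-15 model memorizes the noise, while a simpler degree-2 model generalizes far better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n):                                   # a noisy quadratic relationship
    X = rng.uniform(-3, 3, size=(n, 1))
    y = X[:, 0] ** 2 + rng.normal(0, 1.0, n)
    return X, y

X_small, y_small = sample(20)                    # very little data
X_world, y_world = sample(500)                   # the wider world

for degree in (2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_small, y_small)
    print(f"degree {degree:2d}: "
          f"train MSE {mean_squared_error(y_small, model.predict(X_small)):.2f}, "
          f"real-world MSE {mean_squared_error(y_world, model.predict(X_world)):.2f}")
```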

Organizations should also match metric complexity with their oversight abilities. A company with advanced analytics infrastructure and skilled data scientists can implement complex, multidimensional metrics while maintaining integrity. Organizations with limited oversight capacity should stick to simpler, more robust metrics, even if they capture fewer nuances. The risk rises when metric complexity outpaces the quality of data and oversight. The same failure happens in machine learning: Overly complex models trained on limited data quickly collapse, especially when the same small validation set is used again and again.5

WeWork’s “community adjusted EBITDA” illustrates this dangerous middle ground. The company created a metric that excluded not just standard EBITDA adjustments but also basic operating expenses like marketing, administrative costs, and development spending. This metric was complex enough to obscure the company’s financial position from investors who lacked the expertise to dissect it, yet it wasn’t a sophisticated measure that captured real business performance. When the Securities and Exchange Commission intervened, requesting the metric’s removal from WeWork’s S-1 filing, the company attempted to rename it “contribution margin” to satisfy regulators. By then, however, the metric had already helped attract billions in investment despite being meaningless for evaluating business health. WeWork’s eventual bankruptcy proved that its business model was fundamentally flawed.

Similar dynamics have played out in the financial sector. In the early 2010s, several major banks implemented sophisticated risk assessment metrics following the financial crisis, but many lacked the analytical infrastructure to properly validate those measures. JPMorgan Chase’s “London Whale” trading incident in 2012 revealed this problem. The bank implemented a sophisticated Value-at-Risk model that proved to be fatally flawed, with spreadsheet errors that understated risk by 50%. The model’s complexity allowed traders to game the system by manipulating price inputs within technically acceptable ranges, and risk managers lacked the analytical infrastructure to detect either the errors or the manipulation. This toxic combination of a sophisticated but broken model, deliberate gaming, and inadequate oversight capabilities resulted in over $6 billion in losses.

Startups excel at alignment by focusing on simple metrics suited to their stage, such as customer acquisition cost, monthly recurring revenue, and retention rates. These are easy to track and directly tied to goals like product-market fit and cash flow. As businesses grow, metrics expand to include unit economics like lifetime value to acquisition cost, net revenue retention, cohort analysis, and service-level objectives. Large enterprises use more advanced analytics, tracking predictive customer health scores, experiment uplift, model performance and fairness in AI systems, and risk-adjusted ROI across portfolios. These are often integrated into balanced scorecards that connect financial, customer, and operational indicators.

4. Regularization: Create balance through simplicity incentives. Regularization addresses a core problem in both AI and organizations: the tendency to optimize so effectively for a proxy measure that it actively harms the true goal leaders care about.

In machine learning, regularization keeps models from becoming overly specific by penalizing excessive complexity. Some methods, for instance, penalize overly detailed rules, encouraging models to identify simpler, more general patterns. Without regularization, models become too tailored to training data and perform poorly on new data. Organizations can implement this principle similarly: Just as regularization pushes AI models toward simpler patterns, well-designed constraints make straightforward good performance easier and cheaper than elaborate gaming strategies.
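In code, the machine learning version of this idea is a one-line change. The sketch below (scikit-learn, synthetic data; the penalty strength alpha and polynomial degree are illustrative choices) compares an unpenalized polynomial fit with a ridge fit that penalizes large coefficients; the penalized model typically holds up better on data it has not seen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 30)           # training data
X_new = rng.uniform(-3, 3, size=(300, 1))              # unseen data
y_new = np.sin(X_new[:, 0]) + rng.normal(0, 0.3, 300)

for name, estimator in [("no penalty", LinearRegression()),
                        ("L2 penalty", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=12), estimator)
    model.fit(X, y)
    print(f"{name}: unseen-data MSE "
          f"{mean_squared_error(y_new, model.predict(X_new)):.2f}")
```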

Amazon’s approach to seller ratings illustrates this principle. Its algorithm can detect unusual patterns in customer feedback, such as sudden spikes in five-star reviews from accounts with a minimal purchase history or clusters of generic reviews posted within short time windows. Such patterns indicate that a seller has been soliciting positive reviews, which is prohibited and results in severe penalties. This constraint prevents sellers from gaming the review system and maintains alignment between the proxy measure (star ratings) and the true goal (seller quality). Without such constraints, sellers would find ways to inflate their ratings at the cost of real service quality.

Netflix recognized that optimizing solely for hours viewed could incentivize the production of content that captures attention without providing value that retains subscribers. To prevent this distortion, Netflix implemented a multi-metric approach. As the company stated, “Success on Netflix comes in all shapes and sizes, and is not determined by hours viewed alone.” In 2023, it switched to a views metric (hours viewed divided by runtime), calling it “our best measure of member satisfaction and a key driver of retention.” By using multiple complementary metrics that balanced one another, Netflix created the organizational equivalent of regularization: preventing any single measurement from being gamed at the expense of true viewer satisfaction. This strategy has enabled the company to invest in diverse content with varying time horizons rather than focusing on immediate viewership metrics, supporting its long-term competitive position.

Organizations can balance growth and efficiency through regularization by adding complexity penalties to highly optimized processes. After discovering that pure profit maximization was leading to environmental concerns, Unilever incorporated sustainability requirements as constraints on profit-seeking through its Sustainable Living Plan. These requirements function as regularization terms that prevent extreme profit optimization by requiring balanced performance across multiple dimensions.

Progressive compensation structures also serve as regularization tools. Whole Foods (before its acquisition by Amazon) implemented executive salary caps tied to average worker compensation, creating a natural constraint against optimizing solely for leadership enrichment. Some organizations have found success with increased oversight mechanisms for teams that consistently exceed targets, triggering automatic reviews to verify that exceptional results aren’t coming at the expense of unmeasured dimensions. Ben & Jerry’s provides another example (notwithstanding recent turmoil at the company), with its required community engagement investments tied to growth initiatives, ensuring that profit maximization remains balanced with social impact considerations.

Regularization isn’t about making things simpler. It’s about setting constraints that keep optimization aligned with real goals. These constraints work best when they create balanced incentive structures that discourage extreme behaviors while still rewarding genuine improvement.

By adapting these solutions, organizations can maintain the benefits of measurement while avoiding the pitfalls of metric fixation. Each strategy addresses a different aspect of the Goodhart’s law problem, creating a framework for intelligent organizational metrics that drive performance without encouraging gaming behaviors.

The Future of Intelligent Measurement

The combination of artificial intelligence research and organizational design offers practical solutions to metric fixation. By applying techniques that keep AI models aligned with real-world goals, managers can improve how organizational performance is measured.

These approaches must work in concert rather than isolation. Organizations adopting metric intelligence treat measurement as an ongoing practice requiring constant refinement, not a “set it and forget it” solution. Successful implementations involve those who will be measured in designing the measurement system, resulting in both better metrics and greater buy-in for the new practices.

Adopting this framework means overcoming hurdles familiar to both managers and AI developers. Leaders need to have patience and resist pressure for short-term results, just as AI teams hold off on deploying models until they’re fully trained and tested. Cultural resistance to new metrics parallels the trust issues with black-box algorithms, whose decisions can seem opaque. And success depends on integration: Metrics must connect across organizational silos, just as new algorithms must be woven into existing software and operations.

Moreover, while these AI-inspired approaches improve measurement, they require careful human oversight. As Jerry Muller argues in his book The Tyranny of Metrics, not everything that matters can be quantified, and — given that organizations consist of individuals with complex motivations and behaviors — managerial judgment remains essential.6 The ideal approach balances structure with flexibility, adding enough friction to prevent gaming without stifling initiative and employing sufficient measures without drowning in data.

This matters now because metrics are becoming more numerous and detailed. Without improved measurement strategies, organizations risk repeating past mistakes of narrow optimization. As AI agents increasingly take on direct roles within organizations, robust performance metrics become even more important. The insights from AI training won’t just help managers measure human performance effectively; they will guide how goals are set and evaluated for AI workers.