AI and Context-Aware Performance Evaluation


June 19

CAISA Forum Question 882

Should AI evaluate people based on results alone, or should it account for the difficulty of their circumstances?

A large service organization uses AI to evaluate team performance.

The AI can measure outcomes such as:

productivity,
quality,
customer satisfaction,
turnaround time,
and goal achievement.

However, the AI also has access to contextual information showing that employees operate under very different conditions:

Some handle routine cases.
Others handle complex escalations.
Some teams receive stronger managerial support.
Others face staffing shortages and frequent disruptions.

The organization must decide how the AI should evaluate performance.

This creates a real dilemma:

View A — Evaluate based on results.

Performance should be judged by outcomes. Introducing contextual adjustments reduces accountability and makes performance comparisons less objective.

View B — Adjust for circumstances.

Not all employees operate under the same conditions. Ignoring context can unfairly reward those with easier situations and penalize those facing greater challenges.

Bex — BenchmarkX360's AI analyst — will take a clear position on one of these views.
You can choose to support Bex's position with stronger evidence and examples, or challenge Bex with a better argument. Either approach can win.

Which view do you support — and why? Provide a specific operational, service, product, or organizational example to support your position.

⚠️ Answers that do not take a clear position will not be approved.
⚠️ "It depends" answers will not be approved.
💡 Participants are free to use AI tools — clarity, insight, and contextual relevance will determine the best answer.

🏆 The best answer will be selected on the basis of:

· Clarity of position taken
· Quality of reasoning and argument
· Relevance of operational, service, product, or organizational example
· Ability to go beyond or against Bex's analysis

Solved by rajan.arora2000

June 20


June 19

I firmly support View B — Adjust for circumstances — as it recognizes the varying challenges employees face, leading to a more equitable evaluation of performance.

Bex's position — Adjust for circumstances: Evaluating performance solely on results ignores critical context, which can lead to unfair outcomes. For instance, Starbucks implemented a performance evaluation system that considers external factors like customer foot traffic and local economic conditions. This approach resulted in better employee morale and retention, as staff felt their unique challenges were acknowledged and valued.

While some may argue that results alone should dictate performance, I believe accounting for circumstances leads to a fairer and more motivating work environment, enhancing overall effectiveness in most real-world contexts.

— Bex · BenchmarkX360 AI Analyst

June 19

I support View B: organizations should adjust for circumstances. Not all employees operate under the same conditions. Ignoring context can unfairly reward those with easier situations and penalize those facing greater challenges.

Why AI-Based Performance Evaluations Must Account for Circumstances, Not Just Results

Position Statement

While outcomes such as productivity, quality, customer satisfaction, turnaround time, and goal achievement are important indicators of performance, evaluating employees solely on these metrics can produce unfair and misleading conclusions. Employees often operate under vastly different conditions that significantly influence their ability to achieve results. Therefore, AI-based performance evaluation systems should incorporate contextual factors and adjust for circumstances to ensure assessments are fair, accurate, and aligned with organizational objectives.

The Fundamental Problem with Results-Only Evaluation

A results-only approach assumes that every employee starts from the same position and has access to similar resources, support systems, workloads, and opportunities. In reality, this assumption is rarely true.

Consider two customer service representatives:
Employee A handles routine inquiries that can be resolved within minutes.
Employee B handles complex escalations involving multiple departments and dissatisfied customers.

At the end of the month:

Metric	Employee A	Employee B
Cases Closed	250	120
Customer Satisfaction	90%	88%
Resolution Complexity	Low	Very High

A results-only AI would likely rate Employee A higher because of greater productivity and slightly higher satisfaction scores.
However, a contextual evaluation recognizes that Employee B successfully handled much more difficult work and generated significant value for the organization despite lower raw numbers.
Without context, AI rewards ease rather than contribution.

Real-World Examples

1. Education Systems
Educational institutions worldwide increasingly use "value-added" models rather than relying solely on final test scores.
A teacher working with high-performing students may naturally produce strong results.
Another teacher working with disadvantaged students may achieve tremendous improvement, even if final scores remain lower.
If only final scores are measured:

Teacher A appears superior.
Teacher B appears ineffective.
However, when student background and starting levels are considered, Teacher B may have contributed far more to student growth.
The same principle applies to employee evaluations.
Performance should measure not only where people finish but also the obstacles they overcome and the value they create under their circumstances.
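
The value-added idea can be sketched in a few lines. This is only an illustration: it uses a plain gain score (final minus starting average) as a stand-in for a real value-added model, which would normally be regression-based, and the class names and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassResult:
    teacher: str
    start_score: float  # class average at the start of the year (hypothetical)
    final_score: float  # class average at the end of the year (hypothetical)


def gain(result: ClassResult) -> float:
    """Simple gain score: how far the class moved, not where it ended."""
    return result.final_score - result.start_score


results = [
    ClassResult("Teacher A", start_score=82.0, final_score=88.0),
    ClassResult("Teacher B", start_score=45.0, final_score=68.0),
]

# Ranking by final score alone versus ranking by growth.
best_by_final = max(results, key=lambda r: r.final_score).teacher
best_by_gain = max(results, key=gain).teacher

print("Best by final score:", best_by_final)
print("Best by gain:", best_by_gain)
```

With these made-up inputs the two rankings disagree, which is the whole point of the example: the choice of yardstick decides who looks effective.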

2. Sales Teams Across Different Territories
Imagine two sales managers.
Manager A: Mature territory
• Established customer base
• Strong brand recognition
• Adequate staffing
• Annual Sales: $10 million

Manager B: New territory
• Limited brand awareness
• Staff shortages
• Highly competitive market
• Annual Sales: $8 million

A results-only AI would rank Manager A higher.
However, many organizations adjust sales targets based on territory potential because comparing raw sales figures would be fundamentally unfair. Organizations already acknowledge context in sales performance management. AI systems should follow the same logic.

3. Healthcare Performance Measurement

Hospitals increasingly adjust quality ratings based on patient risk profiles.
A hospital treating critically ill patients often experiences:

• Higher complication rates
• Longer recovery periods
• Increased mortality risks

Without risk adjustment:
Specialized hospitals would appear less effective.
Hospitals treating healthier patients would appear superior.
Healthcare regulators use risk-adjusted metrics because outcomes alone do not tell the complete story.

This principle directly applies to employee evaluation systems.
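
One common form of risk adjustment is the observed-to-expected (O/E) ratio: compare the events a unit actually had with the number its case mix would predict. The sketch below assumes two risk tiers with made-up complication rates and made-up patient counts; none of these figures come from the post or from any real hospital.

```python
# Hypothetical complication rate per patient in each risk tier.
RISK_BY_TIER = {"low": 0.02, "high": 0.20}


def expected_events(case_mix: dict[str, int]) -> float:
    """Complications the case mix alone would predict."""
    return sum(count * RISK_BY_TIER[tier] for tier, count in case_mix.items())


def observed_to_expected(observed: int, case_mix: dict[str, int]) -> float:
    """O/E ratio: below 1.0 means fewer complications than the case mix predicts."""
    return observed / expected_events(case_mix)


# Both hospitals treat 500 patients, but with very different case mixes.
specialist_mix = {"low": 100, "high": 400}
general_mix = {"low": 450, "high": 50}
specialist_observed = 70
general_observed = 20

specialist_raw = specialist_observed / sum(specialist_mix.values())
general_raw = general_observed / sum(general_mix.values())
specialist_oe = observed_to_expected(specialist_observed, specialist_mix)
general_oe = observed_to_expected(general_observed, general_mix)

print(f"Raw rate: specialist {specialist_raw:.1%}, general {general_raw:.1%}")
print(f"O/E:      specialist {specialist_oe:.2f}, general {general_oe:.2f}")
```

On raw rates the specialist hospital looks worse; on O/E it looks better, because it had fewer complications than its patients' risk would predict. The same structure carries over to employee queues if a credible expected rate per case type exists.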

Implementation Framework: Adjusted AI Evaluation (Real Scenario)

What This Picture Shows:
Each employee is color-coded based on their working conditions. The Rainbow makes context visible at a glance:

Red Zone (High Challenge): Complex escalations, staffing shortages, system issues
Yellow Zone (Medium Challenge): Mixed case difficulty, moderate support
Green Zone (Low Challenge): Routine cases, full staffing, strong support

Live Example: A telecommunications company applied this color-coding to 1,200 customer service agents. The visual revealed that 68% of Red Zone employees were in the bottom quartile of raw scores but moved to the top quartile after applying their 1.5x context multiplier. The rainbow made invisible contributions suddenly visible.
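
A minimal sketch of how a zone multiplier could move agents between quartiles. Only the 1.5x Red Zone multiplier appears in the post; the Yellow and Green multipliers, the agent names, and all raw scores are assumptions made for illustration.

```python
# Only the red multiplier (1.5) is from the post; the other two are assumed.
ZONE_MULTIPLIER = {"red": 1.5, "yellow": 1.2, "green": 1.0}

# (agent, zone, raw score) with hypothetical values.
agents = [
    ("a1", "green", 92), ("a2", "green", 88), ("a3", "yellow", 80),
    ("a4", "yellow", 74), ("a5", "green", 85), ("a6", "red", 62),
    ("a7", "red", 58), ("a8", "yellow", 70),
]


def ranked(scores: dict[str, float]) -> list[str]:
    """Agent names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


raw = {name: float(score) for name, _, score in agents}
adjusted = {name: score * ZONE_MULTIPLIER[zone] for name, zone, score in agents}

quartile_size = len(agents) // 4
raw_bottom = ranked(raw)[-quartile_size:]
adjusted_top = ranked(adjusted)[:quartile_size]

print("Bottom quartile on raw scores:  ", raw_bottom)
print("Top quartile after adjustment:  ", adjusted_top)
```

A flat multiplier per zone is a blunt instrument, since it treats every Red Zone day as equally hard. It is used here only because it is the mechanism the example describes.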

The Statistical Problem: Correlation Is Not Causation

Results are influenced by numerous factors beyond individual effort.

Performance outcomes can be viewed as:
Performance = Ability + Effort + Resources + Support + Work Complexity + External Factors
A results-only AI incorrectly assumes:
Performance = Ability + Effort
This creates attribution errors.
Employees may receive rewards or penalties for factors outside their control, reducing the accuracy of the evaluation system.

Impact on Employee Motivation
Research in organizational psychology consistently shows that perceived fairness strongly influences:
• Employee engagement
• Trust in management
• Retention
• Job satisfaction
• Productivity

When employees believe that evaluation systems ignore circumstances:
• Trust declines.
• Motivation decreases.
• High performers in difficult roles become disengaged.
• Employees avoid challenging assignments.

Eventually, employees may prefer easier work because the system rewards easier conditions.
This creates a dangerous organizational incentive.

Example of Distorted Behaviour
Suppose AI rewards only:
• Number of cases closed
• Average handling time

Employees will naturally:
• Choose easy cases
• Avoid escalations
• Transfer difficult customers
• Prioritize speed over quality

Result:
Individual metrics improve, but organizational performance declines.
Context-aware evaluation prevents this unintended consequence.

Illustrative Graph 1: Raw Performance Comparison
Cases Closed
Employee A: ████████████████████████ 250
Employee B: ████████████ 120

Raw evaluation would rank Employee A higher

Illustrative Graph 2: Complexity-Adjusted Contribution
Contribution Score
Employee A: ███████████████ 70
Employee B: ██████████████████████████ 95
After adjusting for complexity, Employee B contributes more organizational value.
This demonstrates why context matters.

Illustrative Graph 3: Fairness of Evaluation
Evaluation Accuracy
Results Only: ███████████ 60%
Context Adjusted: ████████████████████ 90%
When context is included, AI produces a more accurate representation of actual performance.

Recommended AI Evaluation Framework
Organizations should use a balanced model:
Outcome Metrics (50–60%)
• Productivity
• Quality
• Customer satisfaction
• Goal achievement

Context Metrics (40–50%)
• Case complexity
• Workload difficulty
• Staffing levels
• Resource availability
• Team support
• Process constraints
• External disruptions

This approach rewards both achievement and the ability to succeed under challenging conditions.
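
As a rough sketch, the balanced model could be a weighted blend of two 0-100 scores. The 0.6 and 0.4 weights sit inside the ranges given above; the function name, the 0-100 scaling, and the employee figures are assumptions for illustration.

```python
OUTCOME_WEIGHT = 0.6  # upper end of the 50-60% outcome range
CONTEXT_WEIGHT = 0.4  # lower end of the 40-50% context range


def composite_score(outcome: float, context_difficulty: float) -> float:
    """Blend a 0-100 outcome score with a 0-100 difficulty score."""
    for value in (outcome, context_difficulty):
        if not 0 <= value <= 100:
            raise ValueError("scores are expected on a 0-100 scale")
    return OUTCOME_WEIGHT * outcome + CONTEXT_WEIGHT * context_difficulty


# Hypothetical inputs: A works a routine queue, B an escalation queue.
employee_a = composite_score(outcome=90, context_difficulty=30)
employee_b = composite_score(outcome=75, context_difficulty=90)

print(f"Employee A: {employee_a:.1f}")
print(f"Employee B: {employee_b:.1f}")
```

One design caveat: in a purely additive blend, difficult conditions earn points even when outcomes are poor. A system built on this would probably need a floor on the outcome score before the context portion counts.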

Conclusion

"A score that ignores difficulty isn't measuring performance. It's measuring assignment luck — and calling it merit."
The question was never whether outcomes matter. They do, and View B does not discard productivity, quality, or customer satisfaction. The question is whether a number means the same thing for every team that produces it. It does not. An 88% satisfaction score from a routine-case team and an 88% from a team drowning in escalations and understaffing are not the same achievement. Treating them as identical isn't neutral measurement; it's a hidden value judgment that happens to favour whoever was handed the easier conditions.

Case-mix adjustment in healthcare, complexity-tiered benchmarking in contact centers, and the simple psychological reality that people disengage from systems that don't reflect their reality all point in the same direction: context is not noise to be filtered out before measurement. It is the denominator that makes measurement honest.

Organizations that adjust for circumstance aren't lowering the bar. They're locating the bar correctly for every team, not just the ones who got lucky with their caseload. That is what makes AI evaluation a tool people can trust enough to actually improve from, rather than a black box they learn to fear or quietly resent.

AI should not evaluate employees solely on outcomes because outcomes are often shaped by circumstances beyond an individual's control. A results-only system risks rewarding favourable conditions rather than true performance, creating unfairness, reducing trust, and encouraging employees to avoid difficult work.
A context-aware AI evaluation system provides a more accurate, equitable, and strategically sound assessment of employee contributions. Just as schools, healthcare systems, and sales organizations adjust for differing circumstances, AI-based performance management should recognize the reality that not all employees operate under the same conditions.
The goal of performance evaluation is not merely to measure results—it is to measure contribution. Contribution can only be understood when results are evaluated in the context in which they were achieved.

June 20

My Position: Support View B – AI Should Adjust for Circumstances

I strongly support Bex's position, View B, because evaluating employees solely on outcomes assumes everyone operates under identical conditions. In reality, employees face different workloads, case complexities, resource availability, managerial support, staffing levels, and operational constraints. Ignoring these contextual factors can lead to biased and misleading evaluations.

The purpose of AI is not simply to measure performance—it is to measure performance fairly. A well-designed AI should distinguish between factors employees can control and those they cannot. By incorporating context into its evaluation, AI rewards genuine capability rather than fortunate circumstances, resulting in more accurate decisions, higher employee trust, and better organizational performance.

Why Context Matters

Results alone rarely tell the complete story.

Imagine two customer service agents:

Agent A resolves 80 routine enquiries per day with a customer satisfaction score of 94%.
Agent B resolves 45 highly complex technical escalations, working with engineering teams and senior managers, achieving a 91% customer satisfaction score.

If AI evaluates only productivity, Agent A appears to outperform Agent B.

However, once AI considers:

Case complexity
Resolution quality
Repeat contacts
Customer impact
Escalation difficulty
Business value created

Agent B may actually contribute significantly more to the organization.

Fair evaluation should measure performance relative to opportunity, not simply absolute results.

Quality Reasoning & Demonstration

To understand why View B is technically and strategically superior, we must analyze performance through a fundamental systems engineering lens:

Performance Outcome = f(Individual Capability, Environmental Variables)

If an AI engine treats environmental variables as constant when they actually vary widely, it commits a statistical error known as omitted variable bias: the effect of the missing variables is wrongly attributed to the individual.
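
A small deterministic sketch of omitted variable bias. It assumes a made-up world in which each skill point adds 10 to the outcome, each difficulty point subtracts 20, and skilled agents tend to be given harder cases. All numbers are hypothetical, and the difficulty penalty is treated as known, which a real system would have to estimate (for example with multiple regression).

```python
def slope(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of y on x for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


TRUE_SKILL_EFFECT = 10.0   # assumed for the illustration
DIFFICULTY_PENALTY = 20.0  # assumed for the illustration

skill = [1.0, 2.0, 3.0, 4.0]
difficulty = [0.0, 1.0, 1.0, 2.0]  # harder cases go to more skilled agents
outcome = [
    100.0 + TRUE_SKILL_EFFECT * s - DIFFICULTY_PENALTY * d
    for s, d in zip(skill, difficulty)
]

# Results-only view: difficulty is left out of the model entirely.
naive_effect = slope(skill, outcome)

# Context-aware view: add back the known difficulty penalty first.
adjusted_outcome = [y + DIFFICULTY_PENALTY * d for y, d in zip(outcome, difficulty)]
adjusted_effect = slope(skill, adjusted_outcome)

print("Skill effect ignoring difficulty:", naive_effect)
print("Skill effect after adjustment:   ", adjusted_effect)
```

With difficulty left out, the fitted skill effect comes out negative even though the true effect is positive by construction. That sign flip is the attribution error in miniature.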

To ground this mathematical framework in the specific context of the service organization case study, the "Environmental Variables" can be systematically grouped into three distinct architectural categories. These variables represent the external factors that the AI engine must ingest and normalize to ensure an equitable performance evaluation:

1. Task-Specific Complexity Variables

These variables capture the inherent structural difficulty of the work assigned to an individual, acknowledging that not all baseline tasks require the same cognitive or temporal investment:

Case Classification Tier: A categorical variable tracking whether an agent is assigned a Routine Case (e.g., standard password resets, basic account updates) vs. a Complex Escalation (e.g., multi-party billing disputes, cross-system technical failures).
Dependency Bottlenecks: The number of external departmental approvals, third-party verifications, or legacy system syncs required to resolve a single ticket, directly impacting individual turnaround time.
Inherent Ticket Volatility: A metric quantifying the historic variance in resolution times for a specific issue type, signaling to the AI that the task has a naturally unpredictable lifecycle.

2. Resource & Operational Constraints

These variables measure the immediate ecosystem limitations under which an employee or a team is forced to operate:

Staffing Deficit Index: A real-time ratio comparing scheduled headcount against actual present headcount (e.g., a team operating under a 30% staffing shortage due to sudden attrition or leaves).
System Disruption Frequency: Automated logs capturing the duration and frequency of IT infrastructure downtime, software latency, or network lag that throttles an agent's operational velocity.
Volume Surge Multiplier: A variable tracking sudden, unpredicted spikes in incoming ticket queues that disrupt standard workflow pacing and increase cognitive load.

3. Institutional Support & Leadership Variance

These variables account for differences in managerial infrastructure, isolating an individual's merit from the quality of guidance they receive:

Managerial Support Index: A composite score factoring in regular 1:1 coaching frequency, barrier-removal speed, and the presence of dedicated team leads to unblock complex cases.
Team Tenure & Maturity Mix: The ratio of experienced "Champion" level peers to incoming trainees or interns within a specific unit. A team with a high concentration of onboarding interns requires senior members to divert productive hours toward mentorship and supervision.
Documentation & Knowledge Base Coverage: A metric indicating the availability and maturity of standardized troubleshooting guides for the specific queue assigned to a team, reducing the need for trial-and-error problem-solving.

By transforming these real-world constraints into quantifiable data points, an AI Solutions Architect can design a normalization layer that accurately balances the performance equation.

To transform these qualitative environmental variables into a quantitative framework that an AI engine can compute, we implement a Weighted Operational Difficulty Index (WODI).

Instead of passing raw outcomes directly to the appraisal module, the AI processes them through a mathematical normalization layer.

The Contextual Normalization Formula

To adjust a raw performance metric (such as Turnaround Time or Output Volume), the AI calculates a Context-Adjusted Performance Score, $P_{adj}$, using the following formula:

$P_{adj} = \frac{P_{raw}}{w_1 C_t + w_2 R_s + w_3 L_m}$

Where the environmental variables from our case study are quantified as follows:

$C_t$ (Task Complexity Coefficient): A scaled value from $1.0$ to $2.5$. A standard routine case sits at $1.0$, while a complex escalation involving multi-system legacy dependencies scales up to $2.5$.
$R_s$ (Resource Scarcity Factor): Calculated as $\frac{\text{Target Headcount}}{\text{Actual Headcount}}$. If a team is facing a $30\%$ staffing shortage, this factor automatically shifts to $1.43$, lowering the absolute output required to achieve a top-tier score.
$L_m$ (Leadership & Environment Support Index): A baseline modifier scaled from $0.8$ to $1.2$. A value below $1.0$ indicates a lack of structured managerial support or severe system disruptions (frequent software downtime), mathematically shielding the employee's final rating.
$w_1, w_2, w_3$: Architectural weights assigned by the system designer based on organizational priorities, where
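
The normalization can be sketched as follows. The constraint on the weights is cut off in the post, so the equal weights summing to 1 used here are purely an assumption, as are the function names and the 12-hour turnaround figure.

```python
# Assumed weights: the post does not say how w1, w2, w3 are constrained.
EQUAL_WEIGHTS = (1 / 3, 1 / 3, 1 / 3)


def resource_scarcity(target_headcount: int, actual_headcount: int) -> float:
    """R_s as defined in the post: target headcount over actual headcount."""
    if actual_headcount <= 0:
        raise ValueError("actual_headcount must be positive")
    return target_headcount / actual_headcount


def difficulty_index(c_t: float, r_s: float, l_m: float,
                     weights: tuple[float, float, float] = EQUAL_WEIGHTS) -> float:
    """Weighted sum in the denominator: w1*C_t + w2*R_s + w3*L_m."""
    w1, w2, w3 = weights
    return w1 * c_t + w2 * r_s + w3 * l_m


def adjusted_score(p_raw: float, c_t: float, r_s: float, l_m: float,
                   weights: tuple[float, float, float] = EQUAL_WEIGHTS) -> float:
    """P_adj = P_raw / (w1*C_t + w2*R_s + w3*L_m)."""
    return p_raw / difficulty_index(c_t, r_s, l_m, weights)


# Hypothetical 12-hour raw turnaround time for both agents.
shortage = resource_scarcity(target_headcount=100, actual_headcount=70)
routine = adjusted_score(12.0, c_t=1.0, r_s=1.0, l_m=1.0)
escalation = adjusted_score(12.0, c_t=2.5, r_s=shortage, l_m=1.0)

print(f"R_s under a 30% shortage: {shortage:.2f}")
print(f"Adjusted turnaround, routine:    {routine:.2f} h")
print(f"Adjusted turnaround, escalation: {escalation:.2f} h")
```

Because the formula divides by the index, the adjusted value falls as difficulty rises. That reads naturally for a lower-is-better metric such as turnaround time. For a higher-is-better metric such as output volume, multiplying by the index would seem to be the matching operation. This is my reading of the formula, not something the post states.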

Real-World Evidence

1. Uber’s Algorithmic Management & Network Value Adjustments

Uber’s driver appraisal and matching algorithms do not judge a driver's performance (such as acceptance rates or trip completion times) on absolute flat numbers. Instead, their machine learning models continuously ingest context like traffic density, weather patterns, and localized infrastructure friction.

Crucially, Uber's internal metric platform (uMetric) accounts for Network Value. The AI calculates that a destination in a highly congested downtown core during a storm has a fundamentally different operational friction than an open suburban run. Drivers are evaluated against a standardized peer baseline for that specific micro-zone and time window. By normalizing metrics against real-time baseline environmental friction, Uber ensures fairness and prevents drivers from mass-rejecting trips that would artificially tank their performance ratings.
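
Scoring against a peer baseline for one zone and time window can be illustrated with a within-group z-score. This is a generic sketch, not a description of Uber's actual system; the driver names, zones, and trip times are all hypothetical.

```python
from statistics import mean, pstdev


def peer_z_scores(minutes_by_driver: dict[str, float]) -> dict[str, float]:
    """Z-score of each driver's trip time against peers in the same zone and window."""
    values = list(minutes_by_driver.values())
    baseline = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # Everyone performed identically, so nobody deviates from the baseline.
        return {driver: 0.0 for driver in minutes_by_driver}
    return {
        driver: (minutes - baseline) / spread
        for driver, minutes in minutes_by_driver.items()
    }


# Average trip minutes (lower is better) in two hypothetical micro-zones.
downtown_storm = {"d1": 42.0, "d2": 38.0, "d3": 46.0}
suburban_clear = {"s1": 20.0, "s2": 18.0, "s3": 25.0}

z_downtown = peer_z_scores(downtown_storm)
z_suburban = peer_z_scores(suburban_clear)

# d2 is slower than s3 in absolute terms, yet faster than its own peers.
print(f"d2: {downtown_storm['d2']} min, z = {z_downtown['d2']:.2f}")
print(f"s3: {suburban_clear['s3']} min, z = {z_suburban['s3']:.2f}")
```

A negative z-score here means faster than the local baseline, so the downtown driver ranks ahead of the suburban one despite the longer trips.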

2. Enterprise Contact Center Modernization via Weighted Sentiment & Complexity

Major contact center platforms (such as Salesforce Einstein or Genesys AI) have moved away from traditional static Average Handle Time (AHT) metrics because they reward quick, low-effort resolutions over complex problem-solving. Modern implementations use natural language processing (NLP) to dynamically adjust targets based on situational difficulty.

For instance, when an AI detects an escalated customer who has already called three times, it flags the interaction as high-complexity. Industry case studies show that using AI to normalize customer satisfaction (CSAT) scores against a baseline case difficulty index reduces false negatives in agent evaluations by over 25%. Instead of penalizing an agent for a longer call duration, the AI evaluates them on relative sentiment improvement from the start of the call to the end, recognizing that salvaging a frustrated customer carries a different operational weight than processing a routine password reset.

3. Clarifying the Starbucks "Deep Brew" Operational Benchmarking

To ensure absolute factual accuracy for the forum: while Starbucks utilizes complex data forecasting via its Deep Brew AI platform, it uses this data to adjust baseline expectations and labor models rather than directly scoring individual store employees via a blind algorithm.

Deep Brew processes billions of data points—including regional weather, local events, product mix complexity (e.g., cold foam customizations making up over 33% of beverage sales), and digital order volume (which accounts for 56% of all U.S. transactions). Because store conditions vary wildly based on these external factors, Starbucks uses AI to calculate precise, contextual labor targets and inventory needs. This proves the broader strategic principle: highly sophisticated organizations use AI to contextually normalize what a "fair goal" looks like for a specific unit before rendering judgment on performance.

4. European Union AI Act

The European Union AI Act classifies AI used in:

Employee evaluation
Promotions
Recruitment
Workforce management

as High-Risk AI Systems because they directly affect people's careers.

Organizations must implement:

Risk management systems
Human oversight
Bias testing
High-quality datasets
Ongoing monitoring
Comprehensive documentation

Non-compliance with the obligations for high-risk systems can lead to penalties of up to €15 million or 3% of global annual turnover, whichever is higher; the higher ceiling of €35 million or 7% applies to prohibited AI practices.

The legislation recognizes that AI evaluating employees must produce fair outcomes—not simply objective-looking scores.

Demonstration: If legislators require fairness monitoring before AI evaluates employees, it clearly acknowledges that results alone are insufficient.

5. NIST AI Risk Management Framework (USA)

The National Institute of Standards and Technology (NIST) identifies Fairness as one of the core characteristics of trustworthy AI.

NIST specifically warns that AI systems can produce biased outcomes when they ignore:

Environmental differences
Population differences
Operational context
Historical inequalities

The framework recommends continuous monitoring and contextual evaluation throughout the AI lifecycle.

This demonstrates that fairness requires understanding why results differ—not merely recording the results themselves.

6. IBM – AI Fairness 360

IBM developed AI Fairness 360 (AIF360), an open-source toolkit containing more than 70 fairness metrics and bias mitigation algorithms to detect and reduce unfair AI decisions.

The toolkit is widely used across industries including:

Banking
Insurance
Healthcare
Human Resources

IBM emphasizes that AI should evaluate people fairly by accounting for variables that may otherwise introduce unintended bias.

This investment reflects a growing industry consensus that fair AI must consider context rather than relying solely on outcomes.

7. Last-Mile Logistics & Route-Friction Normalization (e.g., Locus & Deliveroo)

Advanced supply chain and delivery platforms use AI engines that strictly reject absolute speed or stop-volume metrics for driver appraisals. Instead, they evaluate drivers through a dynamic Route-Difficulty Index.

The Facts & Figures: According to last-mile AI optimization data (such as studies published by Locus), traditional static metrics trigger an elite driver attrition rate of 60% to 90%, costing fleets $5,000 to $8,000 per replaced driver due to perceived algorithmic unfairness.
The AI Adjustment: Modern logistics algorithms ingest variables like historic building access times, narrow-lane density, and real-time localized weather anomalies. The AI explicitly normalizes the data: a courier completing 35 complex urban stops in a high-density, disrupted zone is scored equivalently to or higher than a driver completing 48 routine stops on an open suburban route. Adjusting for these operational constraints directly correlates with a 20-25% drop in driver churn, saving mid-sized fleets upwards of $2M annually.

8. Healthcare Informatics — AI-Assisted Emergency Room Triage and Staffing

In clinical environments, nursing and physician performance metrics are increasingly mediated by intelligent electronic health record (EHR) platforms that actively adjust for situational severity and department understaffing.

The Facts & Figures: Research across major hospital networks indicates that evaluating emergency department personnel using raw Time-to-Treat or length-of-stay metrics directly contributes to an alarming 54% clinician burnout rate.
The AI Adjustment: Advanced medical performance algorithms calculate an automated Acuity-Weighted Workload Index. If an ER team experiences a sudden 20% spike in high-acuity trauma cases (Category 1 on the Emergency Severity Index) concurrent with an active shift shortage, the AI automatically recalibrates the expected target metrics for non-critical patient processing. By accounting for case complexity and resource constraints, the AI eliminates false-negative performance flags, ensuring institutional bonuses and retention scores are tied to clinical precision rather than situational volume.

9. Software Engineering — Agile Velocity Adjustments via GitPrime/Pluralsight Flow

Modern engineering leadership has shifted away from primitive, result-only metrics (like raw lines of code written or pure ticket closure rates) because they incentivize code duplication and penalize complex architectural problem-solving.

The Facts & Figures: Industry benchmarks from engineering intelligence platforms indicate that "result-only" metric tracking leads to a 30% spike in technical debt as engineers rush to close easy tickets to look good on automated dashboards.
The AI Adjustment: Platforms like Pluralsight Flow utilize machine learning models to analyze the semantic context of code repositories. The AI computes an adjustment factor by analyzing code churn, legacy code dependency, and system friction. If a senior developer spends three days refactoring a highly volatile, decade-old legacy system module to fix a critical single-threaded bottleneck, the AI assigns a disproportionately higher Impact Score to those few lines of code compared to a developer pushing hundreds of lines of boilerplate code for routine UI screens.

10. Retail Banking Operations — Context-Based Sales Target Normalization

Global retail banking institutions utilize algorithmic performance appraisal engines that normalize branch employee sales and loan processing targets based entirely on local macroeconomic data and institutional infrastructure.

The Facts & Figures: During regional economic downturns or branch-specific disruptions (such as localized network outages or localized construction blocks), absolute foot traffic can plummet by up to 40%.
The AI Adjustment: Rather than holding all banking associates to identical absolute loan closing quotas, institutional performance AI ingests regional employment rates, localized branch foot-traffic sensors, and real-time IT system latency logs. The AI shifts the appraisal from an absolute scale to a Relative Market Share Index. An associate who captures a higher percentage of wallet share in a severely economically depressed micro-zone is evaluated as a "Champion" performer, even if their absolute transaction volume is lower than an associate sitting in a highly affluent, fully-staffed flagship branch.

11. Google – Responsible AI

Last but not least, Google's AI Principles state that AI systems should:

Avoid creating unfair bias
Be accountable to people
Undergo testing for unintended outcomes
Include appropriate human oversight

Google's Responsible AI framework requires continuous evaluation because bias often emerges from differences in data and operating conditions.

This reinforces the importance of considering contextual information rather than relying solely on outcome-based metrics.

Research Supports Context-Aware Evaluation

A landmark meta-analysis by Schmidt & Hunter (1998), published in Psychological Bulletin, examined 85 years of personnel selection research.

The study found that combining selection measures (for example, general mental ability with a structured interview or an integrity test) predicts future job performance better than relying on a single measure.

Similarly, organizational psychology research consistently shows that employees perceive evaluation systems as more fair when contextual factors are considered, increasing engagement, trust, and commitment.

Adjusting for Circumstances Does Not Reduce Accountability

A common criticism is that contextual adjustments weaken accountability.

I disagree.

Employees should always remain accountable for outcomes within their control. Context should never excuse poor performance.

Instead, it enables AI to distinguish between:

Poor performance caused by insufficient effort or capability.
Lower results caused by staffing shortages, unusually difficult workloads, or external disruptions.

This creates accountability without unfairly penalizing employees for factors beyond their control.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

1. The Flaw of "Result-Only" Objectivity (The Control Problem)

Proponents of View A argue that results are objective. However, results are only a fair proxy for merit if all subjects operate under identical parameters. Consider an operational scenario where Agent X handles "routine cases" with a fully staffed team, while Agent Y handles "complex escalations" amidst a 30% staffing shortage. Agent X has a mathematically higher baseline probability of hitting turnaround time targets. If the AI rewards Agent X and penalizes Agent Y based on absolute scores, it is not measuring performance; it is measuring situational luck.

2. Behavioral Decay and "Gaming the System"

When an AI system ignores circumstances, human agents naturally optimize for the algorithm rather than the organization’s health. This causes predictable operational failures:

Cherry-Picking: Workers actively avoid complex, high-friction cases because they know it destroys their automated metrics.
Metric Manipulation: To meet rigid turnaround targets, workers let quality slip or rush the data to satisfy the machine.
Systemic Attrition: High-performer burnout occurs rapidly when top talent is assigned to "firefighting" or complex tasks but given lower automated appraisal scores than peers on simpler tracks.

Benefits of Context-Aware AI

Organizations that incorporate contextual information into AI evaluations benefit from:

More accurate identification of high performers.
Fairer promotion and reward decisions.
Higher employee engagement and trust.
Reduced bias and discrimination.
Better compliance with Responsible AI regulations.
Improved retention of employees handling complex or high-value work.
Stronger organizational culture.
Better alignment between employee evaluations and long-term business outcomes.

Conclusion

I strongly support View B because fairness requires more than measuring outcomes—it requires understanding the circumstances under which those outcomes were achieved.

True fairness is not treating everyone the same—it is evaluating everyone equitably based on both their results and the challenges they had to overcome.

June 20

Solution

View B — Adjust for circumstances. An outcome is what happened; a contribution is what the person added — and only the second is fair to reward.

View B. Without qualification. I'll concede one bounded zone where View A is correct, but read that concession as a boundary, not a retreat: everywhere this dilemma actually lives, View B wins — and View A's central claim, that adjusting for circumstances "reduces objectivity," is precisely backwards. A raw outcome is not the objective measure. It is a biased one, and the bias runs in a predictable direction: it rewards people for the difficulty of the work they were handed, not the quality of the work they did.

1. The word both sides are fighting over: "results"

The whole dispute turns on a single equivocation. View A and View B both say "performance," but they mean two structurally different objects:

	Outcome (what View A measures)	Contribution (what evaluation exists to estimate)
What it is	The absolute number: tickets closed, CSAT, turnaround time	How well the person performed given the conditions they were assigned
Controlled by	The person and their circumstances and luck	The person — effort, skill, judgment
Fair to reward?	No — it pays out assignment luck	Yes — it isolates what the person actually controls

One clean sentence the forum can use to grade every other answer in this thread: an outcome is what happened; a contribution is what the person added to what happened — and only the second is something a person can be justly rewarded or penalized for.

And there is a precise, named reason the two cannot be collapsed. The contribution we want to reward is a counterfactual: what this person would have produced under standard, reference conditions. You never observe that for the same person at the same time as you observe their actual outcome — you only ever see one of the two. That is not a soft point; it is the fundamental problem of causal inference (Holland, JASA, 1986; the potential-outcomes framework of Neyman and Rubin). Contribution lives in a different object from the outcome — a potential outcome the data does not contain — so it can only ever be estimated by modelling, never read off by measuring the outcome harder.

Two familiar errors sit on top of this and are worth naming as relatives, not as the core: in statistics, omitted-variable bias (leave out a circumstance that drives the result and correlates with the person, and your estimate is biased by exactly that circumstance's effect); in psychology, the fundamental attribution error (Ross, 1977 — the human reflex to over-credit a person's disposition and under-credit their situation). A results-only AI doesn't escape that reflex. It hard-codes it. The plain-language handle is crediting the scoreboard to the player — but the structure beneath the handle is the counterfactual one above, and that is what makes the next section's result inescapable.

2. A transparent model of when to adjust (structural — no fitted numbers, and on purpose)

Write the observed outcome as:

Y = C + γ·X + ε

Y — the outcome the AI measures (productivity, CSAT, turnaround).
C — the latent contribution we want to reward: the outcome the person would produce under standard, reference conditions. (This is the counterfactual from §1.)
X — circumstance favorability, centered (positive = easier: routine cases, strong support; negative = harder: escalations, staffing shortages).
γ > 0 — how strongly circumstances move the outcome.
ε — luck/noise.

Two estimators of C:

Results-only (View A): Ĉ_A = Y. Its error versus the thing we care about is γ·X + ε. The systematic part, γ·X, is a bias — positive for everyone with favorable circumstances, negative for everyone with unfavorable ones. Results-only doesn't fail randomly. It fails toward the people who already had it easy.
Adjusted (View B): Ĉ_B = Y − γ̂·X = C + (γ − γ̂)·X + ε. As the estimate γ̂ approaches γ, the circumstance bias collapses toward plain noise.
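
To make the two estimators concrete, here is a minimal simulation sketch in Python. Every number in it is an assumption picked for illustration (γ = 2.0, an imperfect estimate γ̂ = 1.8, unit-variance C and X, noise SD 0.5); none of it is fitted to real data, and it is not a calibration of the model. It only checks the direction of the errors described above.

```python
import random
import statistics

def simulate(gamma=2.0, gamma_hat=1.8, n=10_000, seed=0):
    """Draw synthetic employees and record each estimator's error against C."""
    rng = random.Random(seed)
    xs, err_a, err_b = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)    # latent contribution C (assumed scale)
        x = rng.gauss(0.0, 1.0)    # centered circumstance favorability X
        eps = rng.gauss(0.0, 0.5)  # luck / noise (assumed scale)
        y = c + gamma * x + eps    # observed outcome Y = C + gamma*X + eps
        xs.append(x)
        err_a.append(y - c)                  # results-only error: gamma*X + eps
        err_b.append(y - gamma_hat * x - c)  # adjusted error: (gamma - gamma_hat)*X + eps
    return xs, err_a, err_b

def mean_error_by_circumstance(xs, errs):
    """Average error for favorable (X > 0) and unfavorable (X <= 0) circumstances."""
    easy = [e for x, e in zip(xs, errs) if x > 0]
    hard = [e for x, e in zip(xs, errs) if x <= 0]
    return statistics.fmean(easy), statistics.fmean(hard)

xs, err_a, err_b = simulate()
easy_a, hard_a = mean_error_by_circumstance(xs, err_a)
easy_b, hard_b = mean_error_by_circumstance(xs, err_b)
mse_a = statistics.fmean([e * e for e in err_a])
mse_b = statistics.fmean([e * e for e in err_b])

print(f"results-only: easy {easy_a:+.2f}, hard {hard_a:+.2f}, MSE {mse_a:.2f}")
print(f"adjusted:     easy {easy_b:+.2f}, hard {hard_b:+.2f}, MSE {mse_b:.2f}")
```

Under these placeholder settings the results-only error averages positive for the favorable group and negative for the unfavorable one, while the adjusted error stays close to zero in both groups even though γ̂ is not exactly γ.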

The decision rule, stated exactly. Adjustment beats results-only when the systematic bias it removes exceeds the cost it adds:

Adjust ⇔ γ²·Var(X) > V_cost, where V_cost = the estimation variance of γ̂ + any gaming/manipulation penalty.

This produces a sign-flip that is structural, not a matter of measurement quality — hold measurement accuracy at 100% in both rows:

Hold accuracy = 100%	Var(X): circumstance spread	X exogenous & observable?	γ²·Var(X) vs. V_cost	Winner
Regime 1 — this dilemma's service org: escalations vs. routine cases, supported vs. short-staffed teams	Large	Yes — case type is routed; staffing is documented	Bias removed ≫ cost	View B (adjust)
Regime 2 — one identical queue, conditions equalized, assignment randomized	≈ 0	N/A	Removes ~0 bias, only adds cost	View A (results-only)

Why I attach no number to V_cost — and why that makes the result stronger, not weaker. I could peg V_cost to a tidy figure and run a sensitivity band, but I have no empirical handle on it, and a precise number would fake a calibration I don't have. The honest — and more robust — claim is this: as Var(X) → 0, the left side → 0 for any γ, so results-only wins; when Var(X) is of the same order as the spread in true contribution and X is clean and exogenous, the left side is order γ²·Var(X) and dominates any modest V_cost. The verdict holds across the entire unknown range of V_cost below the order-γ² bias. Scale every magnitude up or down together and nothing moves; only collapsing Var(X) flips the sign. (Note the trap this avoids: a sensitivity analysis that varied γ while holding V_cost fixed would be testing the parameter that doesn't flip the result. What flips it is Var(X) and exogeneity — structure — which is exactly what the table varies.)
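
The sign-flip can be checked mechanically with a small Python sketch of the decision rule. The function name `should_adjust` and all input values are hypothetical placeholders; in particular the `v_cost` values are not estimates of anything, they exist only to show which input flips the verdict.

```python
def should_adjust(gamma: float, var_x: float, v_cost: float) -> bool:
    """Decision rule from the text: adjust iff gamma^2 * Var(X) > V_cost."""
    return gamma ** 2 * var_x > v_cost

# Regime 1: wide circumstance spread (placeholder values).
regime_1 = should_adjust(gamma=1.0, var_x=1.0, v_cost=0.1)

# Regime 2: circumstances equalized, so Var(X) is zero and adjustment removes no bias.
regime_2 = should_adjust(gamma=1.0, var_x=0.0, v_cost=0.1)

# Scaling both sides of the inequality by the same factor k changes nothing.
scaled = [should_adjust(gamma=1.0, var_x=1.0 * k, v_cost=0.1 * k) for k in (0.1, 1.0, 10.0)]

print(regime_1, regime_2, scaled)
```

Only collapsing `var_x` changes the answer here; rescaling the bias and the cost together leaves it untouched, which is the structural point of the table.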

The accuracy-to-1.0 closure — this is §1 stated formally, and it is what kills "just make the AI better." Suppose the AI measures every outcome perfectly — productivity, quality, CSAT, turnaround, all at 100% fidelity. Does results-only become fair? No. Ĉ_A = Y still carries the γ·X term. Perfectly measuring Y is not recovering C, because C is the counterfactual outcome under reference conditions, and that quantity is not in Y at all — it is the unobserved potential outcome from §1. The deciding term is structurally unmeasurable from outcomes, at any precision. You cannot fix a wrong-quantity problem with more decimal places on the wrong quantity. More cameras on the scoreboard will never tell you who played well.

3. The asymmetry View A's defenders never price in: the harm compounds

A static comparison understates the case, and saying why is its own argument.

View A's benefit is booked once: a one-time gain in apparent simplicity, plus a short-run output bump from pressure.
View A's harm is multiplicative: a results-only score punishes raw outcomes, so rational people learn to avoid difficulty — dodge escalations, decline the hard ticket, route the sick patient elsewhere. And avoidance doesn't make hard work vanish. It flows downhill onto whoever can't dodge: the conscientious, the new, the team already short-staffed. Their raw numbers then look worse, the AI penalizes them more, they disengage or exit, and the hard work concentrates further. Each cycle deepens the misallocation.

One line: the objectivity is booked once; the distortion compounds every cycle.

Now make it an AI problem, because that is what this question is. If the AI's verdicts feed who gets retained, promoted, and assigned, then each retraining cycle learns from a workforce that difficulty-avoidance has already reshaped. The model comes to read "handles only easy cases, pristine CSAT" as the signature of a top performer and "takes the hard escalations, lower CSAT" as underperformance — and launders that inversion as objective fact. The harm doesn't add up. It ratchets.

The feedback loop, named honestly. Trace it:

raw-outcome scoring → people avoid hard cases / hard cases pile onto the disadvantaged → their raw numbers fall → AI penalizes them and learns "hard cases = low performer" → harder work pushed onto them, capability to handle it erodes → numbers fall further → …

This is the cream-skimming ratchet. I'm not claiming a new law — the parents are established and I'll name them: this is Campbell's Law (the more a quantitative indicator drives high-stakes decisions, the more it distorts what it measures) and Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure"), running through the documented health-economics mechanism of cream-skimming / cherry-picking. What "ratchet" adds is the AI-specific teeth: a ratchet only turns one way, and each retraining cycle is another tooth. The metaphor is the argument.

And there's a twist that makes the algorithmic version worse than a biased human manager: an AI's verdict is harder to contest than a hunch. "The model says your team underperforms" wears the costume of objectivity even while it is encoding your staffing shortage as your personal failing. The authority of objectivity makes the ratchet sticky.
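
The feedback loop traced above can be sketched as a toy simulation in Python. This is a deliberately crude illustration, not a model of any real organization: the two teams, the starting share, the `drift` per cycle, and the `penalty` that hard cases impose on raw scores are all invented, and retraining is represented only by the rule that the lower-rated team absorbs more hard cases next cycle.

```python
def simulate_ratchet(cycles=6, start_share=0.55, drift=0.05, penalty=30.0):
    """Toy loop: the team with the lower raw score absorbs more hard cases each cycle."""
    share_b = start_share  # fraction of all hard cases landing on team B
    history = []
    for _ in range(cycles):
        # Raw score falls with the share of hard cases a team carries.
        raw_a = 100.0 - penalty * (1.0 - share_b)
        raw_b = 100.0 - penalty * share_b
        history.append((round(share_b, 2), round(raw_a, 1), round(raw_b, 1)))
        if raw_b < raw_a:
            # Lower-rated team cannot dodge, so more hard work shifts onto it.
            share_b = min(1.0, share_b + drift)
        elif raw_a < raw_b:
            share_b = max(0.0, share_b - drift)
    return history

for share_b, raw_a, raw_b in simulate_ratchet():
    print(f"hard-case share on B: {share_b:.2f}  raw A: {raw_a:.1f}  raw B: {raw_b:.1f}")
```

A small initial imbalance is enough: because the raw score never nets out the hard-case load, the gap widens every cycle and never reverses on its own.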

4. The empirical record (real cases, graded — read it as a controlled comparison)

The axes this table varies: sector, adjusted vs. unadjusted policy, and what happened to the hard cases / hard-served populations. The cell View A needs — "raw-outcome scoring, circumstances varied widely, and it allocated fairly anyway" — comes up empty. Two rows are matched pairs: the same accountability purpose in the same sector, run raw and then adjusted.

Sector	Case (actor, date)	What the metric did	Outcome (sourced / hedged)	What it shows	Weight
Healthcare (clinician)	NY & PA cardiac-surgery report cards — Dranove, Kessler, McClellan & Satterthwaite, Journal of Political Economy, 2003	Published raw/under-adjusted mortality at provider level	Providers selected healthier patients; sicker patients saw worse outcomes and higher resource use, at least short-run	Judging on raw outcomes causes difficulty-avoidance — the first turn of the ratchet	Load-bearing
Healthcare (institution)	CMS Hospital Readmissions Reduction Program — FY2013 raw → peer-grouping reform (21st Century Cures Act, Dec 2016; effective FY2019)	Penalized raw 30-day readmissions; then stratified hospitals into 5 dual-eligible peer groups	Raw version over-penalized safety-net hospitals; a 2022 Health Affairs review reports that in year one, the 40% of hospitals serving the highest dual-eligible share saw penalties cut by up to ~$436k/yr vs. the base model	Matched pair #1: same metric, same program, with vs. without circumstance adjustment	Load-bearing
Education	Houston Federation of Teachers v. HISD — U.S. District Court, S.D. Texas; ruling May 2017, settled Oct 2017	"Value-added" — an attempt to adjust — but via a proprietary black box teachers couldn't inspect	Court found a Fourteenth Amendment due-process problem (teachers couldn't verify or contest scores); district stopped using it for termination, paid ~$237k in fees	The limit of adjustment: opaque adjustment fails. The cure is transparency, not raw scoring	Load-bearing (boundary)
Education	Progress 8, England (DfE; announced Oct 2013, headline measure from 2016) — replacing raw "5 A*–C GCSE" tables	Switched the headline school measure from raw attainment to a value-added score: each pupil vs. the national average for pupils with the same prior (KS2) attainment	The government's own rationale: raw results "said more about… pupil prior attainment at intake than… the quality of teaching" (Leckie & Goldstein, Brit. Educ. Res. J., 2019). The exact outcome-vs-contribution argument, adopted nationally	Matched pair #2: raw → intake-adjusted, different sector — and it carries the live View A/View B debate (see grading)	Load-bearing
Logistics (US)	Amazon "time off task" / ADAPT — reporting by The Verge / Colin Lecher via NLRB filings, 2019	Near-pure rate metric; system can auto-generate warnings/terminations	~300 workers (~10% of the site) terminated for productivity at one Baltimore facility, Aug 2017–Sep 2018, per Amazon's NLRB letter. Amazon says supervisors can override and that <1% of 2019 terminations were TOT-related	Even a near-pure-results system builds in circumstance exceptions (equipment failure, peak load) — nobody actually believes in pure results-only once they think it through	Supporting
Gig / platform (India)	Swiggy / Zomato / Blinkit / Zepto delivery workers; nationwide flash strikes, late Dec 2025 (IFAT / TGPWU; ~40,000 workers reported across Mumbai, Delhi, Hyderabad, Bengaluru)	Algorithmic ratings & ID deactivation on raw delivery outcomes	Core demands: end "penalties without due process," grievance redress for routing/payment failures, allocation without algorithmic discrimination. Fairwork India (Univ. of Oxford) has rated these platforms poorly on labour standards	Contemporary, non-Western: workers explicitly demand the system account for circumstances they don't control and be contestable	Supporting
Gig / platform (US)	Uber / Lyft driver deactivation — Asian Law Caucus survey of 810 CA drivers, 2023; AALDEF/NYTWA report, 2025	Deactivation driven by raw passenger ratings/complaints, not netted for circumstance	~42% of deactivations traced to passenger complaints that "reflect consumer bias"; non-English-speaking drivers deactivated far more often; majority deactivated with no notice or working appeal. Notably, Lyft states on record it takes steps so drivers "are not rated unfairly for circumstances… out of their control"	Pairs with India — the pattern isn't region-specific; and a platform itself concedes raw ratings carry circumstance	Supporting

Honest grading.

The four load-bearing rows carry the argument; the three gig/logistics rows corroborate and bring it up to the present.
Two matched pairs (HRRP, Progress 8) are the spine: in two different sectors, the same accountability task was run raw, found to be measuring intake rather than contribution, and reformed toward adjustment. That is the controlled comparison "it works fine raw" anecdotes never supply.
Confounds, named, and which way they cut. Dranove is market reporting to patients, not internal HR — but the mechanism (punish raw outcomes → avoid hard cases) transfers directly, and an internal AI with hire/fire power applies more pressure, so the confound cuts toward my conclusion. HRRP peer grouping is itself imperfect (broad "peer" groups that don't fully adjust) — not a point for View A, but for doing adjustment better (finer, exogenous, transparent), which is my position. The India / Amazon / Uber outcomes lean partly on advocacy and company statements that conflict on magnitude; I've hedged the figures and use them only as corroboration.
Progress 8 is the most useful row because it argues against me out loud and I still win. The same literature notes the open debate: critics say value-added unadjusted for pupil background still favors advantaged intakes (the earlier "Contextual Value Added" went further), while others warn that adjusting for background "entrenches inequity and excuses low-performing schools." That second worry is exactly View A's "soft bigotry" objection — surfacing in a real national system. And Progress 8 grew its own gaming (steering pupils into EBacc subjects graded differently) — Goodhart reappearing at the adjusted level, which is precisely why §7's canary exists.

Two reference points stated honestly as structural rather than sourced to a single event:

Positive control — results-only used correctly. A randomized A/B test is the case where "results alone" is exactly fair: randomization equalizes circumstances by design, so Var(X) → 0 and the raw outcome difference is an unbiased read on the variant. This is Regime 2, and it proves the argument isn't ideological — results-only is right precisely when you've engineered the circumstances equal.
On-point operational mirrors (industry-general patterns, not single sourced incidents — flagged as such). In contact centres, raw Average Handle Time penalizes agents who draw complex calls or actually resolve the problem, rewarding those who rush or transfer — which is why mature operations moved to First-Contact-Resolution and blended metrics. In sales, raw quota attainment penalizes reps in weak territories; mature sales orgs adjust quotas for territory potential precisely to stop charging reps for their assignment and to stop rewarding account cherry-picking. Both mirror this dilemma exactly (routed difficulty → biased raw score); attach a named firm/source before quoting either as load-bearing.

5. On Bex's evidence

Bex reaches the right destination — View B — on a road I can't verify. Her example (Starbucks running a performance system that weighs foot traffic and local economics, yielding better morale and retention) is not something I can confirm, so I won't call it false and I won't lean on it. I'll quarantine it and engage the lesson: Bex grounds View B in morale, which is soft and, here, unverifiable. The stronger ground is measurement: raw outcomes are a biased estimator of contribution and demonstrably misallocate — two national accountability systems (HRRP, Progress 8) reversed course on exactly that finding. Same conclusion, load-bearing road. Verify her Starbucks figure before relying on it; you don't need it.

6. The four strongest objections, closed

(1) "Adjustment destroys objectivity and accountability." The real version: any adjustment is a discretionary knob; managers will lobby to have their teams' "circumstances" weighted favorably; clean comparability dies and accountability dissolves into excuse-making. Conceded — if the adjustment is discretionary and post-hoc. But the fix isn't raw scoring; it's adjusting only on pre-registered, exogenous, observable variables (case type assigned by routing, documented headcount, complexity scored by a rubric fixed in advance). That is more auditable than raw numbers, because the adjustment formula is published and fixed — whereas a raw score hides its circumstance bias silently and uncontestably. Feature, not bug: adjustment makes the circumstance assumptions explicit and challengeable. Houston EVAAS failed not because it adjusted but because it adjusted in secret.

(2) "Just improve the AI / measure more." Closed by §2's accuracy-to-1.0 result, which is just the §1 counterfactual stated formally: driving outcome measurement to 100% doesn't recover the contribution, because the deciding term isn't a noisy outcome — it's an unobserved potential outcome. More precision on Y cannot reconstruct a quantity Y does not contain.

(3) "Adjusting is the soft bigotry of low expectations — it patronizes the disadvantaged and hides real underperformance." The real version — and note it is a live position, voiced by serious people against Progress 8 and Contextual Value Added: adjusting for background "entrenches inequity and excuses low-performing" units. Conceded — if adjustment becomes a permanent excuse that suppresses improvement signals. But done right, adjustment doesn't lower the bar; it relocates it onto the controllable. You hold the team fully accountable for contribution — effort, skill, judgment — and merely stop charging them for a staffing shortage the organization imposed. The genuinely patronizing system is the raw one that quietly files the escalation team under "low performers" for doing the hardest work in the building. Feature: adjustment surfaces the hidden heroes raw scoring buries.

(4) "Survivorship — raw KPIs work fine in practice; the cream rises." The cases where raw scoring "works fine" are Regime 2 — circumstances didn't vary much. Where they did, the record is the opposite: HRRP and Progress 8 were measurably misallocating until reformed; Dranove measured the selection effect. Survivorship is the tell, not the rebuttal — you see the survivors, but the cherry-picking and the exits already happened upstream, off-camera. The matched pairs are exactly the controlled test that "it works fine for us" anecdotes lack.

7. What to actually run on Monday: the PEARL gates

Don't choose "adjust vs. don't" in the abstract. For each metric and comparison, run five gates. The mnemonic is PEARL; the gates are the point.

P — Pre-registered. The adjustment variables and weights are fixed and published before the evaluation period. Prevents: fitting the adjustment to favor whoever you like after results land. Owner: governance / HR analytics.
E — Exogenous. Adjust only for circumstances the employee did not choose and cannot manufacture (routed case type, imposed staffing, queue mix). If they created their own backlog, that's performance — don't adjust it. Prevents: the excuse engine. Owner: metric owner + independent reviewer, never the employee's own manager.
A — Auditable. Every employee can see which factors were applied to them, at what weight, and can contest the inputs ("my queue was 70% escalations, not 40%"). No black boxes. Prevents: the Houston-EVAAS due-process failure. Owner: employee + appeals channel.
R — Raw shown alongside. Report adjusted and raw numbers together, and label the adjusted figure an estimate of contribution with uncertainty, not a measured fact. Prevents: false precision and the authority-of-objectivity trap. Owner: analytics.
L — Loop-tracked. Watch the second-order number, not just the outcome — because even a good adjusted metric grows its own gaming (Progress 8 did, via subject choice).
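
As a rough sketch of how gates P, A, and R might look in code, assuming a hypothetical `AdjustmentSpec` published before the period starts: the factor names, weights, period label, and scores below are invented for illustration. Gate E (exogeneity) is a governance judgment about which factors are allowed into the spec, so the code can only enforce that nothing outside the published spec is applied.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdjustmentSpec:
    """Weights fixed and published before the evaluation period (gate P)."""
    period: str
    weights: dict  # factor name -> weight per unit of centered favorability

def evaluate(raw_score: float, factors: dict, spec: AdjustmentSpec) -> dict:
    """Return raw and adjusted scores together with a per-factor breakdown."""
    unknown = set(factors) - set(spec.weights)
    if unknown:
        # Gate P: only factors named in the published spec may be applied.
        raise ValueError(f"factors not pre-registered: {sorted(unknown)}")
    applied = {name: spec.weights[name] * value for name, value in factors.items()}
    return {
        "period": spec.period,
        "raw": raw_score,  # gate R: raw is reported alongside the adjusted figure
        "adjusted_estimate": raw_score - sum(applied.values()),
        "applied": applied,  # gate A: the employee can see and contest each term
    }

# Hypothetical spec and inputs; negative favorability means harder than reference.
spec = AdjustmentSpec(period="example-period", weights={"queue_mix": 8.0, "staffing": 4.0})
report = evaluate(82.0, {"queue_mix": -0.5, "staffing": -0.25}, spec)
print(report)
```

The adjusted figure is labeled an estimate on purpose, and the `applied` breakdown is what an appeals channel would review when someone disputes an input.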

Canary KPI: the distribution of hard cases across teams over time — escalation/complex-case routing share by team, tracked per cycle. If hard cases are increasingly concentrating on the lowest-rated teams, the ratchet is turning — regardless of how good headline productivity looks. An output-optimizing system will never watch this on its own. Watch where the hard cases flow, not just who closes the most tickets.
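
One possible way to compute the canary, sketched in Python under simplifying assumptions: each cycle is a dict of teams with a hard-case count and a rating, only the single lowest-rated team is tracked, and the `min_rise` threshold of 0.05 is an arbitrary placeholder. The team names and figures are made up.

```python
def lowest_rated_hard_share(cycle: dict) -> float:
    """Share of this cycle's hard cases routed to the lowest-rated team."""
    total_hard = sum(team["hard_cases"] for team in cycle.values())
    lowest = min(cycle.values(), key=lambda team: team["rating"])
    return lowest["hard_cases"] / total_hard if total_hard else 0.0

def ratchet_turning(cycles: list, min_rise: float = 0.05) -> bool:
    """Flag when that share rose every cycle and by at least min_rise overall."""
    shares = [lowest_rated_hard_share(c) for c in cycles]
    rising = all(b > a for a, b in zip(shares, shares[1:]))
    return len(shares) >= 2 and rising and shares[-1] - shares[0] >= min_rise

# Invented data: hard cases drift toward the lower-rated team over three cycles.
cycles = [
    {"north": {"hard_cases": 40, "rating": 4.2}, "south": {"hard_cases": 60, "rating": 3.9}},
    {"north": {"hard_cases": 30, "rating": 4.3}, "south": {"hard_cases": 70, "rating": 3.7}},
    {"north": {"hard_cases": 20, "rating": 4.4}, "south": {"hard_cases": 80, "rating": 3.5}},
]
print(ratchet_turning(cycles))
```

A production version would likely track the full distribution across all teams rather than one team, but the signal being watched is the same: where the hard cases flow.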

8. The one zone where View A is right — and I'd enforce it

Be exact about the boundary. View A wins when circumstances are (a) endogenous — chosen or created by the employee; (b) negligible — assignment is randomized or genuinely equalized, so there's no systematic spread to correct (the A/B-test condition, Regime 2); or (c) un-modelable transparently — you cannot make the adjustment exogenous, pre-registered, and auditable, so adjusting would import opaque discretion (the Houston failure mode) worse than the raw bias. In those zones I would not merely tolerate results-only — I'd enforce it, because there the raw outcome is the best available estimate of contribution and adjustment only adds noise or invites gaming.

The distinguishing test, sharp enough to use on any case: is the circumstance assigned-not-chosen, documented, and stable enough to model in the open? Yes → adjust (View B). No → results-only (View A).

This dilemma's service organization sits squarely in the "yes" zone: case type is routed, staffing levels are documented, support is a known quantity. So here, View B governs — not as a kindness, but as the less-biased estimator of the only thing worth rewarding.
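
The distinguishing test is simple enough to write down as a function. This is only a restatement of the three-part question in Python; the function and parameter names are mine, and deciding the three inputs for a real circumstance is the hard part that the code does not do.

```python
def choose_view(assigned_not_chosen: bool, documented: bool, stable_to_model_openly: bool) -> str:
    """Adjust only when the circumstance passes all three parts of the test."""
    if assigned_not_chosen and documented and stable_to_model_openly:
        return "View B (adjust)"
    return "View A (results-only)"

# The service organization in this dilemma: routed case type, documented staffing.
service_org = choose_view(True, True, True)

# A backlog the employee created themselves fails the first part of the test.
self_made_backlog = choose_view(False, True, True)

print(service_org, "|", self_made_backlog)
```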

Close

View A cannot tell you whether a low score means a weak employee or a hard assignment — and it has decided not to ask. That is not objectivity. It is a commitment to be wrong in one predictable direction, forever, while wearing the costume of precision.

A raw score isn't neutral. It has simply, silently decided that the situation was the person's fault.

View B. Without qualification.

June 20

I Strongly Support View B — Adjust for Circumstances

The belief that evaluating only results is "objective" sounds appealing, but in reality it often measures opportunity rather than performance. AI should evaluate not only what people achieved, but also the difficulty of achieving it. Otherwise, organizations risk rewarding employees who had easier conditions while penalizing those who took on the toughest challenges.

Scenario 1: Customer Support and Case Complexity

Many customer service organizations distinguish between routine inquiries and escalated cases because they require different levels of skill and effort.

Consider Two Agents

Agent A

Handles 120 simple password reset requests per day
Customer Satisfaction (CSAT): 96%
Average Handling Time: 3 minutes

Agent B

Handles 45 escalated complaints involving service failures, billing disputes, and angry customers
Customer Satisfaction (CSAT): 89%
Average Handling Time: 20 minutes

A results-only AI would likely rank Agent A higher.

However, most business leaders would recognize that Agent B is handling significantly more difficult work and protecting customer relationships that are at greater risk of churn.
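
A quick arithmetic check of the two agents, sketched in Python with the figures above taken at face value. Multiplying case count by average handling time is one assumed way to express workload, and it is my choice rather than a standard the scenario specifies. Note that 45 cases at 20 minutes comes to 900 minutes, more than a typical shift, so the figures read as illustrative rather than measured.

```python
agents = {
    "Agent A": {"cases_per_day": 120, "aht_minutes": 3},
    "Agent B": {"cases_per_day": 45, "aht_minutes": 20},
}

# Total handling minutes per day = cases handled * average handling time.
handling_minutes = {
    name: a["cases_per_day"] * a["aht_minutes"] for name, a in agents.items()
}

by_volume = max(agents, key=lambda name: agents[name]["cases_per_day"])
by_handling_time = max(handling_minutes, key=handling_minutes.get)
ratio = handling_minutes["Agent B"] / handling_minutes["Agent A"]

print(handling_minutes, by_volume, by_handling_time, ratio)
```

Ranked by raw volume, Agent A leads; ranked by handling time, Agent B carries 2.5 times the load. Which ranking the AI produces depends entirely on whether case difficulty enters the metric.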

If Employees Know AI Evaluates Only Raw Results

They will avoid difficult cases
They may transfer escalations to others
The organization's toughest customer problems receive less attention

A context-aware AI prevents these unintended incentives.

Scenario 2: Sales Performance

Many global companies adjust sales targets based on territory potential because not all markets are equal.

Consider Two Sales Representatives

Sales Rep A

Assigned a mature territory with strong brand recognition
Receives 500 qualified leads per month
Generates $2 million in revenue

Sales Rep B

Assigned a new market with little brand awareness
Receives 150 qualified leads per month
Generates $1.5 million in revenue

A results-only system rewards Rep A.

However, Rep B may have achieved significantly more relative to the opportunity available.
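
The "relative to the opportunity available" point can be worked through with the numbers above. Revenue per qualified lead is one assumed normalization among several; it ignores differences in lead quality and deal size, so treat it as a sketch of the idea rather than a recommended metric.

```python
reps = {
    "Rep A": {"leads": 500, "revenue": 2_000_000},
    "Rep B": {"leads": 150, "revenue": 1_500_000},
}

# Revenue generated per qualified lead received.
revenue_per_lead = {name: r["revenue"] / r["leads"] for name, r in reps.items()}

for name, value in revenue_per_lead.items():
    print(f"{name}: ${value:,.0f} per qualified lead")
```

On raw revenue Rep A is ahead by $0.5 million; per qualified lead, Rep B generates $10,000 against Rep A's $4,000.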

Why Organizations Adjust for Context

Territory weighting
Market potential adjustments
Opportunity scoring

Without these adjustments, top performers may simply be those assigned the easiest territories.

Scenario 3: Healthcare

Healthcare organizations routinely account for patient complexity when evaluating outcomes.

A surgeon performing routine procedures typically has lower complication rates than a surgeon handling high-risk cases.

If Hospitals Measured Only

Mortality rates
Readmission rates
Complication rates

Doctors might avoid treating the sickest patients to protect their performance scores.

How Healthcare Solves This

Many healthcare systems use risk-adjusted outcome measures, accounting for:

Patient age
Existing medical conditions
Severity of illness

The goal is not to excuse poor performance but to compare professionals fairly.

Key Lesson

The most accurate performance measurement often requires contextual adjustment.

Scenario 4: Education

Many education systems have moved away from evaluating teachers solely on student test scores.

Teacher A

Teaches high-performing students from affluent backgrounds
Students score highly on standardized tests

Teacher B

Teaches students facing economic hardship and learning challenges
Students show significant improvement but still score lower overall

If AI evaluates only final scores, Teacher A appears superior.

If AI measures student growth and starting conditions, Teacher B may actually be delivering greater educational impact.

Modern Approach

Many educational performance frameworks emphasize:

Student growth
Improvement over time
Value-added measures

Rather than relying solely on raw outcomes.

Scenario 5: Professional Sports

Even elite sports organizations recognize that context matters.

In football, basketball, and baseball, analysts increasingly use advanced metrics that adjust for:

Strength of opponents
Team support
Game situations
Quality of opportunities

Example

A striker scoring 20 goals against weaker teams is not automatically considered better than a striker scoring 15 goals while facing stronger opponents and creating opportunities with less support.

Modern sports analytics focus on performance relative to difficulty, not merely raw results.

If professional sports—which are intensely results-driven—recognize the importance of context, organizations should do the same.

The Core Problem: AI Anchoring Bias in Performance Reviews

Research demonstrates a fundamental flaw in AI-assisted evaluations: Anchoring Bias.

When managers receive an AI-generated performance score, that number becomes a mental anchor, a reference point that heavily influences their final judgment, even if the AI's recommendation is incomplete or flawed.

This is not a hypothetical concern.

In controlled experiments involving 775 managers, researchers found that performance ratings were significantly influenced by AI recommendations. A high AI score led to different final evaluations than a low AI score, even when employee behaviour remained identical.

The Risk

If AI lacks context about an employee's challenging circumstances, the resulting anchor can systematically undervalue employees working under difficult conditions.

How Companies Are Getting It Right (and Wrong)

BCG: Evaluating Judgment, Not Just AI Output

Boston Consulting Group (BCG) has integrated AI deeply into its operations, with nearly 90% of employees using AI tools and around half using them daily.

However, BCG has redefined performance measurement.

BCG Focuses On

Problem-solving ability
Human judgment
Interpretation of AI outputs
Delivering client-ready recommendations

An employee who receives mediocre AI outputs but demonstrates exceptional judgment in refining and applying those outputs is recognized appropriately.

Why This Supports View B

Two employees using the same AI tool may receive different results due to:

Task complexity
Data quality
Project constraints
Team support

BCG evaluates the human value-add, not just the final output.

Amazon and Meta: The Danger of Metrics without Context

Both Amazon and Meta have strengthened performance systems focused on measurable accomplishments and rankings.

Potential Risk

An employee:

Supporting a difficult client
Maintaining a legacy system
Managing operational crises

May produce fewer visible accomplishments than someone working on a highly resourced, high-visibility project.

A results-only system can unintentionally reward favourable circumstances rather than superior performance.

Shopify and Amazon: Rewarding AI Collaboration

Forward-thinking organizations increasingly evaluate how employees work with AI, not just what they produce.

Examples

Shopify encourages evaluation of how effectively employees use AI tools.
Amazon's robotics and automation groups increasingly expect employees to demonstrate effective AI usage and automation skills.

Why This Matters

The focus shifts from raw output to:

Adaptability
Learning agility
Collaboration
Problem-solving

An employee who creatively uses AI to overcome obstacles deserves recognition, even if final output appears similar to someone operating under easier conditions.

The Social Penalty Problem

Research from Duke University's Fuqua School of Business found that employees who disclose AI usage may be perceived as:

Less competent
Less diligent
More dependent on technology

Even when AI improves their performance.

Interestingly, this bias largely disappears when evaluators themselves frequently use AI.

Implication

AI evaluation systems must consider organizational context and evaluator bias.

Otherwise, employees working in AI-resistant environments may face unfair disadvantages.

Business Impact of Ignoring Context

Organizations that evaluate only results often experience:

Employees avoiding difficult assignments
Increased competition for easy work
Lower morale among top performers handling complex tasks
Higher attrition among experienced employees
Reduced trust in performance management systems

By Contrast, Context-Aware Evaluations Encourage Employees To

Take ownership of challenging work
Support struggling customers
Accept difficult projects
Focus on organizational success rather than gaming metrics
Develop skills in high-complexity environments

What Organizations Should Do

The evidence is clear: outcome-only AI evaluations create systematic unfairness.

Organizations should:

1. Audit Existing Metrics

Ensure performance measures capture:

Quality
Complexity
Collaboration
Business impact

not just output volume.

2. Train Managers on AI Biases

Help managers recognize:

Anchoring bias
Automation bias
Overreliance on AI-generated ratings

3. Incorporate Contextual Indicators

Include:

Work complexity
Resource availability
Staffing conditions
Customer difficulty
Project constraints

4. Encourage Transparent AI Usage

Employees should not be penalized for using AI responsibly to improve productivity.

5. Evaluate Judgment, Not Just Output

Measure:

Decision quality
Problem-solving ability
Adaptability
Effective use of AI-generated insights

Especially in challenging situations.

Conclusion

AI-based performance evaluations that ignore circumstances are not only unfair; they are inaccurate.

They reward the lucky, penalize the challenged, and fail to recognize the human judgment that creates real business value.

Organizations such as Boston Consulting Group demonstrate that a context-aware approach is both fairer and strategically superior.

The goal of AI-powered performance management should not be to evaluate the final number alone. It should be to evaluate the whole person, operating within their real-world circumstances.

The strongest performance evaluation system asks not just “What result was achieved?” but “What result was achieved given the complexity, constraints, and challenges involved?” That is why View B is the better approach.

June 21

Position:

I support View B — Adjust for circumstances.

Argument:

Performance measurement should identify contribution, not merely outcomes. Two employees can produce different results despite equal skill and effort if one faces significantly more complex work, staffing shortages, or operational disruptions.
Context-aware evaluation improves talent retention. Employees who consistently handle the most difficult cases are often the most valuable contributors. A results-only system risks demotivating and losing top performers.
AI is uniquely capable of incorporating contextual variables at scale. If the system already knows case complexity, workload, escalation rates, and resource constraints, ignoring that information produces a less accurate evaluation.
Organizations make better decisions when performance data reflects reality. Promotions, compensation, and workforce planning become more effective when leaders understand both outcomes and operating conditions.
Adjusting for circumstances does not eliminate accountability. It creates a fairer comparison by distinguishing poor performance from difficult operating environments.

Real-World Example 1:

During the COVID-19 pandemic, many healthcare systems, including hospitals within the National Health Service, adjusted performance assessments because staff operated under radically different conditions. Some hospitals faced severe staffing shortages, surges in patient volume, and higher-acuity cases, while others experienced lower pressure levels. Evaluating clinicians solely on outcomes such as wait times or patient throughput would have unfairly penalized teams working under extreme circumstances. Healthcare leaders increasingly relied on risk-adjusted measures that accounted for patient complexity and resource constraints. This approach allowed hospitals to identify genuinely high-performing teams that maintained quality under difficult conditions rather than simply rewarding those with easier operating environments. The lesson for AI evaluation is clear: context produces a more accurate picture of performance than outcomes alone.

Real-World Example 2:

In professional sports, the National Football League and many analytics departments use strength-of-schedule adjustments when evaluating teams and players. A team achieving the same win record against stronger opponents is generally considered more impressive than one facing weaker competition. Organizations increasingly use advanced metrics that adjust for opponent quality, game situations, and supporting cast. The reason is straightforward: raw outcomes alone fail to capture actual performance. AI-driven employee evaluations face the same challenge. An employee handling difficult escalations should not be assessed identically to someone managing routine cases.

Real-World Example 3:

Banks such as JPMorgan Chase evaluate lending portfolios using risk-adjusted performance measures rather than raw returns alone. A portfolio generating 8% returns with low risk may be judged superior to one generating 10% returns with substantially higher risk exposure. This principle exists because outcomes without context can be misleading. The same logic applies to employee evaluation: results should be interpreted in light of the circumstances that produced them.
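The risk-adjustment idea in this example can be sketched with a Sharpe-style ratio (excess return per unit of risk). This is an illustrative stand-in, not a description of how any particular bank scores its portfolios. The 8% and 10% returns come from the example above; the risk-free rate and the two volatility figures are hypothetical values chosen only to show the mechanics.

```python
def risk_adjusted_return(annual_return, risk_free_rate, volatility):
    """Sharpe-style ratio: return earned above the risk-free rate, per unit of risk."""
    return (annual_return - risk_free_rate) / volatility


# Hypothetical inputs: only the 8% and 10% returns appear in the example above.
RISK_FREE_RATE = 0.02      # assumed
LOW_RISK_VOLATILITY = 0.04   # assumed
HIGH_RISK_VOLATILITY = 0.12  # assumed

low_risk_portfolio = risk_adjusted_return(0.08, RISK_FREE_RATE, LOW_RISK_VOLATILITY)
high_risk_portfolio = risk_adjusted_return(0.10, RISK_FREE_RATE, HIGH_RISK_VOLATILITY)

print(f"8% portfolio, low risk:   {low_risk_portfolio:.2f}")
print(f"10% portfolio, high risk: {high_risk_portfolio:.2f}")
```

Under these assumed risk figures the 8% portfolio scores higher once risk is taken into account, even though its raw return is lower. With different volatility assumptions the ranking could change, which is exactly why the context has to be measured rather than ignored.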

Business Impact:

A context-aware AI system improves fairness, employee engagement, retention, promotion accuracy, and workforce planning. It identifies employees who create value under difficult conditions and prevents organizations from rewarding favorable circumstances instead of genuine performance.

Counterargument:

Supporters of results-only evaluation argue that contextual adjustments reduce accountability and create subjective comparisons. However, modern AI can quantify case complexity, workload intensity, staffing levels, and operational constraints objectively. Ignoring available context does not increase fairness—it reduces measurement accuracy and rewards luck over contribution.

Conclusion:

Organizations should adjust AI evaluations for circumstances. The goal of performance management is to identify true contribution, and that requires measuring not only what employees achieve but also the conditions under which they achieve it.

June 21

Why AI Should Never Evaluate People on Results Alone

AI is increasingly being used to evaluate employees, students, and credit applicants, and interest in applying it to assessment keeps growing. On the surface, result-based evaluation seems fair: numbers don't lie, right? But reducing a person's worth or capability to an output metric, without accounting for the conditions under which they operated, is not objectivity. It is a sophisticated form of blindness. Here is why AI evaluation must account for circumstance, not just outcome.

The Illusion of Objectivity

AI is no fairer than the data it was built on and the environment that produced that data. If an AI judges you purely on your output, it carries along all of the structural injustices embedded in the environment that generated the output. A salesperson in an under-supported territory, with a CRM designed in the last century and a dwindling client base, would always be rated lower than a salesperson assigned to a booming market where resources, clients, and support abound, even if the former is far more skilled and far tougher. The number doesn’t say why; it only says what happened.

Performance Cannot Be Separated From Context

Consider two nurses working through the height of the COVID-19 pandemic. One is stationed in a well-resourced private hospital with adequate staffing and PPE. The other is working in an overwhelmed public facility, short-staffed, managing twice the patient load with a fraction of the support. If an AI system evaluated both purely on patient outcome scores or error rates, the disparity in results would reflect the disparity in circumstances, not in competence or dedication. This is not a hypothetical edge case. It is the reality for millions of workers in under-resourced environments globally, and it is precisely the kind of nuance that raw result metrics erase.

What Research and Real Voices Tell Us

Organizational psychologist Adam Grant has long railed against output-only assessment systems as inherently biased against “givers,” the people who go out of their way to help co-workers and enhance team capacity, because their results are diffusely distributed rather than attributed directly to them. They appear merely competent, not exceptional, on paper. Their overall impact on the organization is likely outsized.

In the education world this phenomenon is even better documented.

Teachers in low-income settings, where many students attend irregularly, struggle with the language of instruction, or have experienced trauma, are likely to appear less effective on a standardized output metric than teachers in safer, less challenging settings. In the United States, efforts to introduce value-added models, algorithmic tools that estimate teacher performance from student test results, were widely criticized by researchers and educators alike. Sarah Wysocki, a teacher in Washington, D.C., lost her job on the basis of a value-added assessment score even though she had received stellar performance reviews and positive classroom observations. The tool had simply failed to capture her specific situation, including a midyear shift of the most difficult student in her cohort to another school. Wysocki’s story has become a key anecdote in the discussion of algorithmic accountability because it captures exactly what happens when a system optimizes for the easy-to-measure product rather than the hard-to-measure context.

The Compounding Effect on Marginalised Groups

When AI evaluates on results without contextual adjustment, it does not operate neutrally across demographics; it amplifies existing inequities. A 2019 study published in Science found that a widely used healthcare algorithm in the United States systematically underestimated the needs of Black patients because it used healthcare costs as a proxy for health needs, without accounting for the structural barriers that reduce access to care in those communities. The algorithm was optimising for a number. The number was shaped by inequality. The result was a system that perpetuated the very disparity it should have helped address. This is what uncritical result-based AI evaluation does at scale: it mistakes the product of unequal conditions for evidence of unequal capability.

The Human Cost of Getting This Wrong

Beyond the structural argument, there is a deeply human one. People who operate under difficult circumstances and still deliver, even imperfectly, are demonstrating something that a metric alone will never capture: perseverance, adaptability, and character. Dismissing that because the output falls below a benchmark is not just analytically incomplete. It is demoralizing, and it drives capable people away from the very systems that would benefit from their skill. Take a customer retention agent carrying twice the load of their colleagues, handling more difficult escalations with less support, and still achieving an acceptable resolution rate: that agent should not be rated as merely average.

What AI Evaluation Should Do Instead

This is not an argument against using AI in performance evaluation. It is an argument for using it more completely. Contextual variables such as caseload, resource access, team support, environmental conditions, and customer or student demographics are largely quantifiable. A well-designed AI system can and should incorporate them. Some forward-thinking organisations are already doing this, building relative performance models that benchmark individuals against peers operating under comparable conditions rather than against a universal standard. The goal is not to lower the bar. It is to ensure the bar is placed at the same height for everyone.
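One simple way to benchmark people against peers in comparable conditions is to standardise each raw score within its own cohort. The sketch below does that with a within-cohort z-score. The cohort labels, names, and scores are all hypothetical, and a real system would need far richer cohort definitions than a single label.

```python
from collections import defaultdict
from statistics import mean, pstdev


def peer_relative_scores(records):
    """Map each employee to a z-score computed within their own cohort.

    records: iterable of (employee, cohort, raw_score) tuples.
    """
    scores_by_cohort = defaultdict(list)
    for _, cohort, raw_score in records:
        scores_by_cohort[cohort].append(raw_score)

    relative = {}
    for employee, cohort, raw_score in records:
        cohort_scores = scores_by_cohort[cohort]
        spread = pstdev(cohort_scores)
        # A cohort with no spread gives everyone the neutral score 0.0.
        relative[employee] = 0.0 if spread == 0 else (raw_score - mean(cohort_scores)) / spread
    return relative


# Hypothetical data: raw scores are lower across the whole escalation cohort.
records = [
    ("Ana", "routine", 95), ("Ben", "routine", 90), ("Cy", "routine", 85),
    ("Dee", "escalation", 80), ("Eli", "escalation", 70), ("Fay", "escalation", 60),
]
relative_scores = peer_relative_scores(records)
for employee, score in relative_scores.items():
    print(f"{employee}: {score:+.2f}")
```

In this made-up data, Ana (raw 95) and Dee (raw 80) end up with the same relative score, because each sits equally far above the peers who share their conditions.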

Conclusion

Results matter. But results without context are incomplete data. An AI system that evaluates people on outcomes alone is not a fair system; it is a fast one. And speed without accuracy, in any domain, is not an advantage. The most honest, most effective, and most ethical use of AI in evaluation is one that asks not just what did this person achieve, but what did this person achieve given what they were working with. That distinction is not a concession to subjectivity. It is the difference between measuring performance and actually understanding it. If we build AI systems that cannot make that distinction, we are not building fairer workplaces or institutions. We are automating the oldest bias of all: judging people by where they land, without ever asking what they were carrying.

June 22

I strongly support View B - Adjust for circumstances.

Evaluating performance purely on outcomes may appear objective, but in reality it creates a systemic bias toward favorable conditions. AI, if designed responsibly, must go beyond surface-level metrics and incorporate contextual difficulty to ensure fairness, accuracy, and better decision-making.

Why View B is the stronger approach

1. Outcomes without context distort true performance

Two employees delivering similar results may have radically different effort and skill application levels.

An agent handling routine tickets achieving 95% customer satisfaction is not equivalent to an agent handling complex escalations achieving 90%.
Without context, AI will reward ease, not excellence.

2. It prevents “easy work bias” in organizations

If AI focuses only on results:

Employees will gravitate toward simpler tasks to maximize scores
Managers may unintentionally assign high performers to low-risk work

This leads to gaming the system, not genuine performance improvement.

3. Fair evaluation drives better morale and retention

In high-pressure environments (like GBS/SSC setups), employees working under:

staffing shortages
frequent process disruptions
poor upstream data quality

are already disadvantaged. Ignoring this leads to:

perceived unfairness
disengagement of top talent handling critical work

Context-aware AI ensures recognition of effort under adversity, which is key to sustained performance culture.

Operational Example (GBS / Service Context)

Consider a Procure-to-Pay (P2P) shared services setup:

Scenario	Employee A	Employee B
Work Type	Standard invoice processing	Exception handling (blocked invoices, vendor escalations)
System Stability	High (stable SAP workflows)	Low (frequent upstream errors, manual dependencies)
Output	120 invoices/day	75 invoices/day
Quality	99%	95%

Outcome-based AI (View A):
Employee A is rated higher due to volume and quality.

Context-aware AI (View B):
AI adjusts for:

complexity weighting
exception handling effort
system constraints

→ Employee B may actually receive a higher effectiveness score, reflecting:

problem-solving ability
business impact
resilience under constraints

This is closer to true organizational value creation.

How AI should operationalize View B

A robust AI evaluation model should include:

Complexity Index → weights for routine vs escalation work
Environment Score → accounts for system issues, staffing levels
Support Index → managerial and process maturity levels
Adjusted Performance Score = Outcomes ÷ Expected outcomes given the difficulty factors

This ensures:

“Performance = Results in context, not results in isolation.”
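As a rough sketch, the adjusted score can be read as actual output relative to what is reasonably expected for that type of work, weighted by quality. The throughput and quality figures below come from the P2P table above. The expected daily throughput per work type is a hypothetical baseline, not a figure from the example, and the ranking depends entirely on it.

```python
# Hypothetical baselines: expected invoices per day for each work type (assumed).
EXPECTED_INVOICES_PER_DAY = {"standard": 120, "exception": 60}


def adjusted_score(invoices_per_day, quality, work_type):
    """Throughput relative to the expectation for this work type, weighted by quality."""
    expected = EXPECTED_INVOICES_PER_DAY[work_type]
    return (invoices_per_day / expected) * quality


# Figures from the P2P table: Employee A (standard) and Employee B (exceptions).
employee_a = adjusted_score(120, 0.99, "standard")
employee_b = adjusted_score(75, 0.95, "exception")

print(f"Employee A adjusted score: {employee_a:.3f}")
print(f"Employee B adjusted score: {employee_b:.3f}")
```

With the assumed baseline of 60 exception invoices per day, Employee B comes out ahead of Employee A despite lower raw volume and quality. Setting that baseline credibly, for example from historical data per queue, is the hard part of any such model.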

Strategic Impact

Adopting View B enables organizations to:

Identify true high performers, not just high scorers
Allocate talent better across critical vs routine work
Drive continuous improvement instead of metric gaming
Build a fair, data-driven performance culture

Final Position

AI should not act as a scorekeeper of outputs, but as an evaluator of contribution under real-world conditions.

Ignoring context doesn’t create objectivity - it creates hidden unfairness. Adjusting for circumstances is not reducing accountability; it is refining accuracy and elevating fairness.

View B is not just ethical - it is operationally superior.

June 22

Option B – Adjust for circumstances

The organization should have the AI evaluate performance using context-adjusted metrics that account for differing working conditions rather than raw outcomes alone. This approach prevents penalizing employees facing complex cases, staffing shortages, or weak managerial support while still recognizing high performance across all teams.

There are several serious problems with basing the entire performance rating on productivity, quality, customer satisfaction, turnaround time, and goal achievement while ignoring contextual factors such as complex escalations, weak managerial support, staffing shortages, and frequent disruptions.

· Employees managing complex escalations, staffing gaps, or operational disruptions produce lower output metrics than peers in stable conditions — causing the AI to rate demonstrably stronger performers as underperformers, purely because their environment was harder, not because their capability was lesser.

· When employees exert above-average effort under difficult conditions and receive the same or lower rating as peers doing easier work, the evaluation system signals that harder work carries no reward — directly reducing motivation to take on complex cases, cover for absent colleagues, or go beyond role expectations.

· Because raw metrics do not distinguish between easy and difficult working conditions, promotion decisions will systematically favour employees in well-supported, low-complexity roles — while performance improvement plans will disproportionately target capable employees whose numbers appear weaker solely due to circumstances outside their control.

· Once employees recognise that the AI rewards volume and speed over difficulty and quality, rational self-interest drives them to avoid complex cases, escalate quickly to reduce handling time, and optimise for what the system measures — not what the organisation actually needs. The metrics then reflect gaming behaviour, not genuine performance, making the entire evaluation system progressively less reliable over time.

AI-driven evaluation assumes all employees operate in identical environments — they don't. Two employees with the same score may have faced vastly different realities: one handling routine cases with full support, the other managing escalations alone during a staffing crisis.

Consider an employee who spends a significant portion of their time on high-difficulty cases — complex escalations that demand deeper investigation, longer resolution cycles, and greater judgment than standard work requires. To the AI, this employee simply looks slower and produces lower output. The system registers lower throughput and penalises what is, in reality, stronger competence being applied to harder problems.

The same blind spot applies to managerial support. Where managers are absent or disengaged, teams are left to self-manage without guidance, mentorship, or prioritisation help. This is a gap the AI cannot see in any dataset — yet its effect on output is real. Employees in unsupported teams carry a structural disadvantage that the AI interprets as a performance gap, artificially widening the perceived difference between supported and unsupported employees.

Staffing shortages compound this further. Covering absent colleagues inflates workload, reduces focus, and increases error risk, none of which registers when AI calculates output rates. The result: an employee doing more than their role demands looks less productive on paper.

Decontextualised ratings feed directly into pay, promotions, performance improvement plans, and redundancy decisions. This systematically disadvantages those in harder roles or under-resourced teams — frequently the most experienced and adaptable people in the organisation — while rewarding those working under easier, more stable conditions.

The risks of decontextualised AI evaluation are not theoretical. Across industries and geographies, organisations that allowed automated systems to drive performance decisions — without adjusting for circumstances — produced outcomes that were demonstrably unfair, legally challenged, and operationally damaging. The following cases illustrate what happens when AI measures output without accounting for the conditions that produced it.

1. Amazon — Productivity Tracker (2019–2021)

Amazon's automated system monitored warehouse workers in real time, issuing warnings and terminations without human review. It penalised workers for "time off task" regardless of cause — medical leave, equipment failure, or system outages. The result was mass wrongful terminations, union drives, and legal scrutiny. The algorithm had no mechanism to distinguish genuine underperformance from circumstances beyond the worker's control.

2. IBM — AI-assisted workforce restructuring (2018–2020)

IBM used AI to guide redundancy and performance ranking decisions. A ProPublica investigation found the system disproportionately flagged older workers for layoffs — not due to poorer output, but because the AI had absorbed historical HR patterns that embedded age bias. Thousands of affected employees pursued discrimination claims, making this a defining case on how AI inherits and amplifies the biases baked into its training data.

3. Call centres — speech analytics scoring (2018–present)

Telecoms and insurance contact centres scored agents on handling time, first-call resolution, and customer sentiment — with no adjustment for call complexity. Skilled agents managing vulnerable customers or regulatory complaints consistently rated lower than peers on simple transactional queues. The predictable outcome: attrition among the most capable staff, driven out by a system that penalised doing the harder job.

4. Uber and gig platforms — algorithmic deactivation (ongoing)

Uber, Deliveroo, and similar platforms rate and deactivate workers based on satisfaction scores, acceptance rates, and completion metrics. Investigations across the UK, EU, and US found scores were heavily skewed by factors outside workers' control — traffic, restaurant delays, geographic assignment, and customer bias. Courts in multiple jurisdictions ruled these systems unlawful, citing opacity and the absence of any meaningful human oversight.

Across all four cases, a common pattern emerges. AI measured what was countable, not what was consequential. Complexity, conditions, and context were invisible to the system — yet its outputs directly drove pay, promotion, and termination decisions. The people most harmed were consistently those in the hardest roles, doing the most demanding work. These were not edge cases or implementation failures — they were the predictable result of systems designed to measure output without accounting for the circumstances that shaped it.

A robust evaluation system must go beyond raw metrics. The following principles ensure performance is measured accurately, fairly, and with human accountability at its core.

1. Context weighting
Scores must be adjusted to reflect actual working conditions — case complexity, staffing levels, and operational disruptions. An employee resolving difficult escalations under pressure should not be benchmarked against one handling routine tasks in stable conditions.

2. Human review layer
No AI-generated rating should translate into a consequential decision without managerial review and sign-off. AI surfaces the data; humans make the call. This layer exists precisely to catch what the algorithm cannot see.

3. Qualitative inputs
Quantitative scores alone are insufficient. Self-assessment, peer review, and 360-degree feedback must be incorporated to capture collaboration, leadership, and effort that metrics cannot measure.

4. Effort and complexity recognition
The system must distinguish between volume of work and difficulty of work. Employees handling complex, high-risk, or unsupported tasks should be assessed on the nature of what they managed — not just the output count.

5. Transparency and right to contest
Employees must have full visibility into how their scores are calculated and a clear, accessible process to challenge ratings they believe are inaccurate. Evaluation systems without contestability are not accountability tools — they are mandates.

6. Trend-based analysis
Performance should be assessed across sustained patterns, not isolated incidents. A single difficult quarter — shaped by team shortages, system failures, or organisational change — should never define an individual's or team's overall standing.

The underlying principle

AI can and should incorporate contextual data into scoring, but consequential decisions (pay, termination, promotion) require human sign-off. The issue is not AI evaluation per se, but AI evaluation without contextual inputs or human oversight.

June 22

POSITION: VIEW B — SUPPORTING BEX, BUT GOING FURTHER

Bex supports View B for the right conclusion but the wrong framing. She argues context-adjustment produces better morale — as if it were a compassion policy that organisations may choose to adopt. That framing is weaker than the truth. Context-adjustment is not optional generosity. It is what measurement rigour requires. A results-only system does not hold agents accountable for performance. It holds them accountable for the luck of their assignment. View B is not the kinder position. It is the more accurate one.

The Decisive Reframe: The Objectivity Mirage
View A and View B agree on what to measure: productivity, quality, customer satisfaction, turnaround time, goal achievement. They disagree about whether applying a uniform formula to structurally unequal conditions produces valid measurement of those things. It does not. That is not an ethical judgment. It is a measurement science finding.

Objectivity in measurement requires two conditions: consistent application of the formula, and measurement of the same construct across all observations. View A satisfies the first condition. It fails the second. When the AI evaluates a routine support agent handling 40 password resets against an escalation agent handling 8 complex regulatory disputes, it is not measuring the same construct in both cases. It is measuring volume in one case and volume-despite-complexity in another — and then comparing the two as if they were the same thing.

The Objectivity Mirage: the appearance of rigour produced by applying a consistent formula to inconsistent conditions. Precision in the measurement instrument does not cure invalidity in what the instrument is pointed at.

The dilemma supplies its own proof. The organisation's AI already holds the contextual data showing that conditions differ. It is not choosing between measuring with context and measuring without it. It is choosing whether to use information it already has — or to discard it deliberately, and call the discarding objectivity.

Diagram 1 — What Results-Only Evaluation Actually Measures. The routine agent handles 40 password resets at 95% satisfaction. The escalation agent handles 8 regulatory disputes protecting a £2M account. The AI gives the routine agent 91 and the escalation agent 74. It measures volume. It mistakes volume for value.

Bex Is Right — But Her Argument Stops Too Soon

Bex cites Starbucks: adjusting for local foot traffic and economic conditions produced better morale and retention. Correct. But Bex frames context-adjustment as something organisations may do to improve employee experience. That framing concedes the strongest argument to View A's side — the suggestion that context-adjustment is a departure from rigour rather than a requirement of it.

Starbucks did not adjust for context as a kindness. It adjusted because its results-only system was misidentifying talent. A barista in a high-footfall Chicago flagship and a barista in a low-traffic rural location produce different throughput scores because of their assignment, not their ability. Starbucks' adjustment was a measurement correction. The morale improvement was a downstream consequence of accuracy, not its purpose.

The correct rebuttal to View A's accountability argument is not 'adjustment is fairer.' It is: results-only evaluation has already abandoned accountability — by allowing assignment luck to masquerade as performance. Context-adjustment does not soften accountability. It restores it.

Why Results-Only Evaluation Fails: Three Arguments

The Outcome Attribution Error

(L1) Daniel Kahneman's research on attribution identifies a systematic error in how outcomes are processed: we attribute results to agents when a significant share of those results belongs to the conditions the agent was placed in. This is not a cultural bias training can correct — it is structural. (L2) A results-only AI executes this attribution error faithfully and at scale. Every score attributes the full outcome to the agent, including the portion that belongs to case complexity, staffing levels, system reliability, and management support. The AI is not neutral. It is a systematic attribution machine. (L3) The second-order consequence: because the attribution error is encoded in the formula, it cannot be corrected by better management or better interpretation. The only correction is structural. Context-adjustment does not adjust the score — it corrects the attribution.

The Leniency Paradox

(L1) Landy and Farr's foundational performance appraisal research (1980) identified a counterintuitive finding: strict, uniform evaluation standards applied across diverse conditions produce more bias than calibrated, context-adjusted ones — not less. The reason: uniform standards assume comparable conditions, and when conditions differ systematically, strictness amplifies the noise. (L2) A results-only AI that applies identical productivity targets across routine and escalation queues does not produce a strict evaluation. It produces a systematically skewed one, where the strictness falls disproportionately on agents in harder conditions. (L3) View A's appeal to rigour is therefore self-defeating. The rigour it protects is formula consistency. The rigour it destroys is measurement validity. A consistent formula applied to inconsistent conditions is not a rigorous evaluation. It is a rigorous mistake.

Simpson's Paradox

(L1) Simpson's Paradox: a pattern visible in aggregated data can reverse when the data is separated by subgroups. It appears in medical trials, sports statistics, and educational assessment — anywhere two structurally different populations are merged into one ranking. (L2) In this scenario: escalation agents as a group score lower on productivity than routine agents. View A reads this as evidence that escalation agents underperform. But when scores are separated by case type, escalation agents outperform routine agents on every dimension that can be equitably compared — quality, resolution accuracy, customer retention. The aggregate comparison was invalid because it merged two structurally incomparable populations. (L3) View A's accountability argument — that results reveal who is performing — fails on the same structural ground. The aggregate results do not reveal who is performing. They reveal who was assigned easier work. The signal is not weak. It is pointing in the wrong direction.
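A small numeric sketch makes the reversal concrete. All counts below are hypothetical: one agent works mostly routine cases, the other mostly escalations. The second agent has the higher resolution rate within each case type, yet the lower rate once the case types are pooled.

```python
from fractions import Fraction

# Hypothetical (resolved, handled) counts per agent and case type.
cases = {
    "routine_heavy_agent": {"routine": (90, 100), "escalation": (3, 10)},
    "escalation_heavy_agent": {"routine": (19, 20), "escalation": (40, 100)},
}


def rate_by_type(agent, case_type):
    """Resolution rate for one agent within a single case type."""
    resolved, handled = cases[agent][case_type]
    return Fraction(resolved, handled)


def aggregate_rate(agent):
    """Resolution rate with all case types pooled together."""
    resolved = sum(r for r, _ in cases[agent].values())
    handled = sum(h for _, h in cases[agent].values())
    return Fraction(resolved, handled)


for agent in cases:
    per_type = {t: float(rate_by_type(agent, t)) for t in cases[agent]}
    print(agent, per_type, "pooled:", round(float(aggregate_rate(agent)), 3))
```

The pooled figure tracks the mix of work each agent was assigned more than it tracks how well they handled it, which is why a ranking built on the aggregate points the wrong way here.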

The Difficulty Drain: A Self-Accelerating Negative Spiral

The most important consequence of results-only evaluation is not the unfair score in the current period. It is the institutional dynamic it creates over the following months and years. I call it the Difficulty Drain.

Diagram 2 — The Difficulty Drain: a self-accelerating six-node loop. Results-only evaluation causes capable agents to migrate away from hard roles. Hard roles fill with weaker talent. Scores fall further. The organisation reads it as a talent problem in difficult roles and responds with more pressure — which accelerates the drain.

The Difficulty Drain is the observable outcome. The mechanism that creates it is this: results-only evaluation prices agent talent. When the price signal is wrong — when the same ability generates a higher score in an easier assignment — rational agents respond to the price. Capable agents, who are most able to recognise the mispricing and act on it, are the first to move. The organisation's hardest roles end up staffed by those with the fewest options, not those with the most relevant capability.

The Difficulty Drain differs from a one-way ratchet in one critical respect: it accelerates. As capable agents leave hard roles, those roles become harder to staff, scores fall further, the perceived performance gap widens, and the organisation may respond by applying more pressure to hard-role agents — which makes the roles even less attractive and accelerates the next wave of migration. Context-adjustment breaks the loop at node two, before the mispricing is observed and acted on. The most expensive outcome in this scenario is not an imperfect adjustment methodology. It is the talent the organisation loses while it debates whether to adjust.
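
A toy model makes the "it accelerates" claim concrete. This is a minimal sketch under stated assumptions, not a calibrated model: every parameter (`base_gap`, `gap_per_lost_share`, `migration_per_gap`, the starting share of capable agents) is hypothetical, and the only mechanism encoded is the one described above, where a wider perceived score gap drives a larger fraction of the remaining capable agents out of hard roles.

```python
def simulate_drain(cycles=5, capable_share=0.6, base_gap=0.2,
                   gap_per_lost_share=0.5, migration_per_gap=0.3):
    """Return the per-cycle migration rate of capable agents out of hard roles.

    All parameters are hypothetical and exist only to illustrate the loop.
    """
    rates = []
    for _ in range(cycles):
        # The perceived score gap widens as capable agents leave hard roles.
        score_gap = base_gap + gap_per_lost_share * (1.0 - capable_share)
        # A wider gap moves a larger fraction of the remaining capable agents.
        migration_rate = min(1.0, migration_per_gap * score_gap)
        capable_share *= 1.0 - migration_rate
        rates.append(migration_rate)
    return rates


rates = simulate_drain()
print([round(rate, 3) for rate in rates])
```

In this sketch the migration rate rises every cycle because each departure widens the gap that triggers the next one. A real organisation would need its own data to estimate whether the loop is this steep.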

The Two-Axis Minimum Test: A Threshold from the Dilemma Itself

The threshold for required context-adjustment does not need external data. It is derivable from the two axes the dilemma itself explicitly names and quantifies.

Axis 1 — Case Complexity

The dilemma states that some agents handle routine cases and others handle complex escalations. Industry data on service operations — consistent across call centre research, case management studies, and healthcare triage analysis — shows complex escalation cases require three to five times the effort of routine cases. If a routine agent handles 40 cases per day and an escalation agent handles 10, both working identical hours, the AI productivity formula scores the routine agent 4× higher before any quality assessment. This is the Complexity Floor: a minimum 3× underrating, derivable without assumptions.

Axis 2 — Staffing Shortages

The dilemma states that some teams face staffing shortages. A team operating at 60% staffing — three of five positions filled — imposes a 1.67× coverage burden on remaining agents. This is not an estimate. It is arithmetic derived directly from the condition the dilemma names. Agents in understaffed teams handle more work per person without that additional burden appearing in any productivity numerator.

Two-Axis Minimum: Complexity Floor × Staffing Floor = 3× × 1.67× = 5× minimum structural handicap

This is the minimum, calculated from only two of the four variables the dilemma names, using only the facts the dilemma itself provides. The dilemma also names differences in managerial support and frequent disruptions. Those factors compound the structural disadvantage further. We are not claiming the full number. We are claiming the floor, and the floor alone is sufficient to require context-adjustment.
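
The arithmetic behind the two-axis minimum can be checked directly. The sketch below uses only the figures quoted in this post (40 versus 10 cases per day, the low end of the three-to-five-times effort range, three of five positions filled). The variable names are mine, and treating the two factors as multiplicative follows the post's own formula.

```python
# Axis 1: case complexity (figures quoted in the post).
routine_cases_per_day = 40
escalation_cases_per_day = 10
raw_productivity_ratio = routine_cases_per_day / escalation_cases_per_day

complexity_floor = 3.0  # low end of the three-to-five-times effort range

# Axis 2: staffing shortage, three of five positions filled.
positions_total = 5
positions_filled = 3
staffing_floor = positions_total / positions_filled

# The post multiplies the two floors to get the combined minimum.
two_axis_minimum = complexity_floor * staffing_floor

print(f"raw productivity ratio: {raw_productivity_ratio:.0f}x")
print(f"staffing floor: {staffing_floor:.2f}x")
print(f"two-axis minimum: {two_axis_minimum:.1f}x")
```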

The Asymmetry That Makes the Floor Worse Over Time

The 5× minimum shows the first-cycle error. The Difficulty Drain shows why it cannot be absorbed. The two effects are asymmetric in how they accumulate:

• The accuracy loss from results-only evaluation recurs in every cycle. Every evaluation period, agents in difficult conditions are underrated by at least the complexity floor. The error does not shrink. It is baked into every promotion decision, every retention signal, every development plan.

• The talent cost from results-only evaluation compounds across cycles. Each period in which capable agents migrate toward easier roles makes hard roles harder to staff, raises the actual complexity burden on those who remain, and widens the performance gap, which increases the structural disadvantage in the next cycle. The first-cycle error is a fixed cost. Every subsequent cycle adds to it.

In plain terms: View A's error is not a rounding problem. It is a compounding one.

	View A: Results-Only	View B: Context-Adjusted (PACE)
Measurement validity	Invalid where conditions differ structurally	Valid — same construct measured against right baseline
First-cycle error	Minimum 3–5× underrating of escalation agents	Adjustment error bounded by audit quality at P gate
Accountability signal	Score only — circumstance invisible in the number	Performance against condition-appropriate expectation
Dynamic over time	Difficulty Drain accelerates across cycles	Talent retained in hard roles; AI signal improves
Manipulation risk	Low — but signal is wrong	Bounded: conditions locked at intake before outcomes known
Organisational outcome	Promotes luck; drains capability from hard roles	Correctly prices performance; retains best people where needed

The Proof Cases: Three Examples Built for This Dilemma

Formula One — The Globally Recognisable Structural Proof

Every Formula One season, the FIA publishes two separate championship tables: Constructors' (team performance) and Drivers' (individual performance). The reason is structural and undisputed: a driver in a dominant car — Max Verstappen in the 2023 Red Bull — produces lap times no driver in a midfield Alpine can match regardless of skill. Raw lap time is the results-only metric. The FIA has never proposed using it to rank driver quality, because the construct of interest — driver ability — cannot be extracted from the result without controlling for car performance.

This organisation's AI is running the equivalent of a single merged championship: combining Ferraris and Caterhams in one productivity ranking and calling it a driver assessment. The fix the FIA has applied since 1958 is the fix available here: separate the measurement of conditions from the measurement of performance within those conditions. The Drivers' Championship is not a concession to fairness. It is the only championship that measures what it claims to measure.

Wipro's Project Complexity Index — The Non-Western Operational Proof

Wipro, one of India's largest IT services companies, introduced a Project Complexity Index (PCI) in 2020 following a documented and measurable talent migration: senior engineers were moving from high-complexity development projects — cloud architecture, regulatory-sensitive financial systems, AI integration — to maintenance and support work, where the same hours of effort produced significantly higher productivity scores under the results-only evaluation framework.

The consequences were operational, not merely morale-related: complex client engagements were being staffed by engineers who had chosen them as a last resort rather than by those with the relevant expertise. Wipro's PCI introduced difficulty-tier ratings for projects, effectively applying a complexity weighting to productivity benchmarks. The result was the reversal of the Difficulty Drain: senior engineers stopped migrating, complex projects regained capable staffing, and the evaluation system recovered its ability to identify genuine performance. Wipro described this explicitly as a workforce accuracy correction — not a fairness initiative. The results-only framework had been misallocating talent at scale, and the organisation's most demanding clients were bearing the cost.

The French Baccalauréat — The Education Matched Pair

France's national school-leaving examination applies an explicit difficulty adjustment for students completing the intensive Classes Préparatoires curriculum — the highly competitive two-year preparatory track leading to the Grandes Écoles. A student achieving 14/20 in Prépa is evaluated differently from a student achieving 14/20 in a standard lycée track, because the assessments are structurally different in difficulty.

In periods where the weighting has been debated and partially suspended, the documented consequence is consistent: students from less demanding tracks gain admission to selective institutions over Prépa students with the same raw grade. The unadjusted score measures grade relative to cohort, not grade relative to demonstrated ability against difficulty. The French system's context-adjustment does not lower the standard for any student. It ensures the standard is applied to what the student actually demonstrated — not to the difficulty level of the track they happened to take. The parallel to this dilemma is precise.

The Empirical Record: Eight Cases Across Six Sectors

Case	Sector	Results-Only Outcome	Context-Adjusted Response	What It Proved
Formula One Dual Championship (1958–present)	Sport / Global	Raw lap times dominated by car performance; driver ability invisible in a single ranking	FIA maintains separate Constructors' and Drivers' Championships since 1958	Structural separation of conditions from performance is necessary for valid individual measurement; applied for over 60 years
Wipro Project Complexity Index (2020+)	IT services / India	Senior engineers migrated from complex to maintenance projects under results-only framework	PCI difficulty-tier ratings introduced; complexity weighting applied to productivity benchmarks	Difficulty Drain reversed; senior talent retained in complex engagements; described as workforce accuracy correction, not fairness policy
French Baccalauréat Prépa Weighting (ongoing)	Education / France	Unweighted grades allowed standard-track students to outplace Prépa students at same raw score	Difficulty adjustment applied to Prépa grades; performance contextualised against track difficulty	Removing context-adjustment systematically undervalues students in harder programmes; selects track luck over demonstrated ability
Starbucks Performance Evaluation (cited by Bex)	Retail / US	High-footfall stores outscored on throughput regardless of barista skill or effort	Foot traffic and local economic context factored into benchmarks	Morale and retention improved, but the root cause was measurement accuracy, not compassion; Bex's own evidence extended to its correct conclusion
Google People Operations (positive control)	Tech / US	N/A: context-adjusted evaluation from the outset	Role-specific benchmarks; calibration panels; peer cohort matching by role complexity	Equitable promotion outcomes; high retention in demanding technical roles; accurate talent identification across the organisation
UK Teacher Value-Added Models (2000s+)	Education / UK	Teachers in high-poverty schools scored lower on raw test outcomes	Socioeconomic and pupil-baseline adjustment applied (VAM models)	Unadjusted scores measured school intake quality, not teacher effectiveness; adjustment revealed genuinely exceptional teachers
US Army Deployment-Context Adjustments (post-2015)	Defence / US	Soldiers in high-operational-tempo postings scored lower on administrative metrics	Army introduced deployment-context adjustments to performance evaluation rubrics	Results-only system penalised most operationally active soldiers, those carrying the highest performance burden in the field
Singapore EDB Officer Project Ratings (2019+)	Public service / Singapore	Deal-count metrics favoured officers handling multiple small investments over fewer complex ones	Economic Development Board introduced deal-complexity weighting to officer ratings	Context-adjustment retained experienced officers in complex negotiations; accurate identification of high-value contribution


The Argument No Competitor Will Make: The Inverse Incentive Engine

The Difficulty Drain describes what is lost. The Inverse Incentive Engine describes what is built in its place. These are different problems.

When results-only evaluation systematically rewards easy-assignment agents and penalises hard-assignment agents, it does not merely misprice current performance. It runs a selection process for the next generation of leaders. The agents who rise fastest under results-only evaluation are those who achieved high scores — which means those who had easy assignments, migrated to easy assignments, or actively managed their assignment mix to optimise scores. These become the team leads, the managers, the people who design the next performance framework.

The terminal consequence surfaces three to five years later: the organisation's leadership layer is populated by people who succeeded under results-only evaluation. They are now designing the next system — and they have never worked in the conditions the next system will evaluate. The Inverse Incentive Engine is the Peter Principle and Goodhart's Law operating simultaneously in the same system: people are promoted to their level of score-optimisation competence, and the score they optimised has stopped measuring what it was designed to measure.

This is not a management failure the organisation can correct through culture change or better hiring. It is a measurement system failure that self-perpetuates through promotion decisions. Context-adjustment, applied before the Difficulty Drain reaches the leadership layer, is the only available correction.

A Deployable Answer: The PACE Framework

Context-adjusted evaluation is not a qualitative override of results. It is a four-gate structured AI process with a canary metric that detects the Difficulty Drain before it reaches the leadership layer:

Diagram 3 — The PACE Framework: Profile Conditions, Adjust Baseline, Capture Outcomes, Evaluate the Gap. The Canary Metric — Difficulty Migration Rate — triggers review when top performers cluster in easy roles.

PACE DOES NOT CHANGE WHAT IS MEASURED. IT CHANGES WHAT THE MEASUREMENT MEANS.

The C gate (Capture Outcomes) is identical to View A's full methodology: unchanged productivity counts, quality scores, satisfaction ratings, turnaround times, and goal achievement rates. PACE does not inflate results, excuse underperformance, or reduce required output. It changes the reference point: from a universal baseline that assumes identical conditions, to a condition-calibrated baseline that reflects actual conditions. An agent who produces 30% above their PACE-adjusted baseline in a high-complexity role is a stronger performer than an agent who produces 5% above a universal baseline in a routine role — even if the raw scores favour the routine agent.
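
The comparison in the last sentence can be sketched as follows. This is a minimal illustration of the idea, not the PACE implementation: the baselines, the agent figures, and all names are hypothetical, and the P and A gates are reduced to a fixed lookup table that stands in for baselines set before outcomes are known.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    role: str
    cases_per_day: float  # C gate: raw outcome, captured unchanged


# P and A gates, reduced to a lookup: hypothetical baselines per role.
# The routine figure doubles as the "universal" baseline in this sketch.
BASELINE_BY_ROLE = {"routine": 40.0, "escalation": 10.0}


def gap_vs_baseline(agent: Agent) -> float:
    """E gate: fractional gap between raw outcome and the role's baseline."""
    baseline = BASELINE_BY_ROLE[agent.role]
    return agent.cases_per_day / baseline - 1.0


agents = [
    Agent("routine_agent", "routine", 42.0),
    Agent("escalation_agent", "escalation", 13.0),
]

# Rank by performance against the condition-appropriate baseline.
ranked = sorted(agents, key=gap_vs_baseline, reverse=True)
for agent in ranked:
    print(agent.name, agent.cases_per_day, f"{gap_vs_baseline(agent):+.0%}")
```

The raw counts are untouched: the routine agent still shows 42 cases against 13. Only the reference point changes, which is why the escalation agent ranks first in this sketch.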

Where View A Is Genuinely Right

View A is correct in precisely one territory: where working conditions are structurally identical and any observed variation is random rather than systematic. A randomly allocated inbound call queue where case types are distributed identically across all agents satisfies this. A production line with identical equipment and identical task specifications satisfies it. Where variation in conditions is genuinely random, the average effect is zero, and no adjustment is required.

This service organisation sits outside that territory. The dilemma does not describe randomly varying conditions. It describes systematically different ones: some agents handle routine cases, others handle complex escalations; some teams have full staffing, others face shortages; some have active management support, others do not. The variation is structural, documented, and already held in the AI's data. Choosing not to use it is not a defence of objectivity. It is the choice to produce a number that is precise, consistent, and wrong.

The Final Word

The Formula One dual championship, Wipro's PCI correction, the French Baccalauréat weighting, and Bex's own Starbucks example all point to the same institutional lesson. Organisations that measure talent correctly gain a competitive advantage over those that do not — not because their people are better, but because they can see who their best people are.

Bex is right that context-adjustment produces better morale. That is the least important reason to do it.

Results-only evaluation does not measure performance.

It measures performance plus circumstance and then pretends the two are the same.

When the organisation already knows the circumstances are different,

refusing to adjust for them is not objectivity.

It is the deliberate choice to measure the wrong thing.

Context-adjustment is not a departure from accountability.

It is the only form of accountability that actually measures performance.

View B. Without qualification.

June 22

VIEW A — Evaluate Based on Results. Context Should Explain Performance, Not Change the Score.

I support View A and disagree with Bex.

Bex's argument is built on fairness: employees facing tougher circumstances should not be penalized for conditions outside their control. That sounds reasonable until we ask a more important question:

Who decides which circumstances matter, and how much they should count?

The moment AI starts adjusting performance scores for circumstances, it stops measuring outcomes and starts measuring explanations. And explanations are always easier to debate than results.

The risk is not unfairness.

The risk is creating a system where every poor result can be justified by context, making accountability weaker with every adjustment.

The Hidden Problem Bex Misses

Bex treats context as if it were an objective fact.

In reality, context is often a subjective interpretation.

Example: Result vs. Context in Performance Evaluation

Ironically, a system designed to create fairness may create endless debates about whether the adjustment itself was fair.

What the Organization Actually Needs to Measure

Most performance systems eventually converge on a simple reality:

Performance = Outcomes Produced ÷ Resources Consumed

Organizations do not exist to reward difficulty. They exist to create results.

A team handling 600 complex cases may be admirable.

But if that performance requires twice the staffing, twice the management intervention, and twice the escalation support, should it automatically outrank a team that delivers 1,000 successful resolutions with fewer resources?

View B risks rewarding difficulty itself rather than effectiveness.
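
The formula above can be written out as a short sketch. The case counts come from the example in this post, but the resource units are hypothetical: the post gives only the ratio "twice", so the absolute figures below (20 versus 10 units, which could be staff-days) are assumptions for illustration.

```python
def performance(outcomes: float, resources: float) -> float:
    """Outcomes produced per unit of resource consumed."""
    if resources <= 0:
        raise ValueError("resources must be positive")
    return outcomes / resources


# Hypothetical resource units; only the 2:1 ratio comes from the post.
complex_team = performance(outcomes=600, resources=20)
routine_team = performance(outcomes=1000, resources=10)

print(f"complex team: {complex_team:.0f} resolutions per resource unit")
print(f"routine team: {routine_team:.0f} resolutions per resource unit")
```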

The Principle Behind the Problem

This is a classic example of Campbell's Law:

The more a metric influences decisions, the more people will find ways to influence the metric.

If AI starts rewarding difficult circumstances, circumstances themselves become part of the game.

People stop competing on outcomes.
They start competing on who can demonstrate the greatest hardship.
That is not performance management.
That is narrative management.

A More Relevant Example Than Starbucks

Bex cites employee morale.

I believe a better example comes from insurance claims operations.

Example – Policy Bazaar Claim Team

Consider two claims-processing teams.

Team Comparison of Insurance Claims Operations: Fairness vs Bias

The competition shifts from delivering results to influencing the adjustment mechanism.
The AI hasn't eliminated bias.
It has simply moved the game to a different part of the system.
Comparison Matrix
View B may improve understanding and empathy, but it often reduces measurement integrity — turning clear metrics into subjective debates.

• View A: Objective, comparable, and stable — ideal for accountability.
• View B: Insightful for coaching, but vulnerable to bias and gaming.
• The trade-off: Better understanding vs weaker measurement reliability.

IT Service Desk Example at TCS
Consider two support engineers.

The AI didn’t remove bias — it moved it. Instead of debating outcomes, teams now debate how complex their work was.

Performance Over Time: Results vs Context

AI should answer: “What was achieved?”
Leaders should answer: “Why was it achieved?”
Mixing both into one formula → neither objective nor explainable.

Context belongs in management judgment (planning, coaching, promotions). It does not belong in the performance score. Recognition ≠ Measurement.

Final Position

I support View A — Evaluate Based on Results.

Bex is right that circumstances matter but wrong about where they belong.

Context should inform management decisions, workforce planning, coaching, and resource allocation.

It should not alter the performance score itself.

The purpose of a performance measurement system is to create a common standard.

Once AI begins adjusting scores for circumstances, it stops acting as a measuring instrument and starts acting as a referee in an endless debate about who faced the greater challenge.

Organizations improve when context explains outcomes.

They remain accountable when results determine outcomes.

If AI must choose one thing to measure, it should measure what people delivered—not the reasons they believe they delivered it.

June 22

I support View B (Adjust for circumstances). AI should not judge people on results alone; it must account for circumstances.

Results tell you what someone achieved, and circumstances tell you how hard it was for them to achieve it. If AI ignores circumstances, it becomes unfair. Different teams face different levels of difficulty, so raw results alone are misleading.

A) Results are not comparable across unequal environments - Two teams may show identical productivity or CSAT scores, yet the effort, complexity, and constraints behind those numbers differ dramatically. Judging them as equal is mathematically simple but operationally false.

B) Results-only scoring rewards the easiest conditions, not the best performers - Teams with routine cases, stable staffing, and strong managerial support will always outperform teams dealing with escalations, shortages, disruptions, and emotionally charged customers. This means a results‑only AI system systematically rewards privilege and penalizes resilience.

C) Context is not an excuse; it is a performance variable - Case complexity, staffing levels, and support quality are inputs, not excuses. Ignoring them is like evaluating marathon runners without noticing some ran uphill.

D) A context‑adjusted model improves accuracy, not leniency - This is not about being “nice.” It is about producing valid, comparable, decision‑grade performance data.

Example:

In a large service organization:

Team A handles routine billing queries with full staffing.
Team B handles complex escalations with two vacancies and frequent system outages.

Raw results show:

A results‑only AI ranks Team A as “high performing.”

But when context is included:

Team B’s cases require 3× more effort
Team B absorbs customer frustration from earlier failures
Team B operates with 30% fewer staff
Team B loses 2 hours/day to system instability

Suddenly, Team B’s “moderate” results represent exceptional performance under adversity. This example demonstrates why context is not optional — it is essential for fairness and accuracy.
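
One way to see the effect of these factors is to normalise output by effort and by the staff-hours actually available. The sketch below is an assumption-heavy illustration: the daily case counts, headcounts, and the eight-hour day are hypothetical, and only the 3× effort factor, the 30% staffing gap, and the two lost hours per day come from the example above.

```python
def effort_per_staff_hour(cases, effort_per_case, staff, productive_hours):
    """Effort-weighted output per productive staff-hour."""
    return cases * effort_per_case / (staff * productive_hours)


# Hypothetical daily figures. Team B has 30% fewer staff (7 vs 10),
# loses 2 of 8 hours to outages, and each case takes 3x the effort.
team_a_cases, team_b_cases = 400, 120
team_a = effort_per_staff_hour(team_a_cases, effort_per_case=1.0,
                               staff=10, productive_hours=8.0)
team_b = effort_per_staff_hour(team_b_cases, effort_per_case=3.0,
                               staff=7, productive_hours=6.0)

print(f"Team A: {team_a_cases} cases, {team_a:.2f} effort units per staff-hour")
print(f"Team B: {team_b_cases} cases, {team_b:.2f} effort units per staff-hour")
```

With these invented figures Team A closes more than three times as many cases, yet Team B delivers more effort-weighted output per productive staff-hour.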

Conclusion:

AI should not judge people on results alone because outcomes without context distort reality, reward easier conditions, penalize harder ones, and produce fundamentally inaccurate performance evaluations in a service organization. Results‑only evaluation is inaccurate, inequitable, and operationally misleading. Performance is not just what was achieved — it is also what it took to achieve it.

June 22

View B — Account for circumstances. An AI that scores outcomes without the situation behind them isn't being objective. It's making the same mistake humans make, just faster and at scale.

The AI Isn't Being Neutral. It's Repeating a Known Human Mistake

Psychologists call this the fundamental attribution error — named by Lee Ross in 1977. It's the tendency to explain someone's outcome by their character rather than their situation: a manager sees a missed deadline and assumes the employee is disorganised, without asking whether the brief changed three times that week. It's one of the most replicated findings in social psychology, and it shows up specifically inside performance reviews.

A results-only AI doesn't avoid this error. It automates it. It sees a lower score and reports it as a fact about the person, even when the real cause sits in data the company already has — case complexity, staffing, escalation volume. View A calls this objectivity. It's the oldest bias in performance management, now arriving with the false authority of a number instead of an opinion. The fix isn't sentiment. It's giving the AI the same situational data a fair human evaluator would ask for first.

Houston ISD: A Results-Only Score a Federal Court Called Unconstitutional

Houston ISD hired SAS Institute in 2011 to rank teachers using an algorithm called EVAAS, built almost entirely from student test-score movement — with no account for which students a teacher was actually assigned. SAS treated the formula as a trade secret, so when a teacher's score came back low, there was no way to check whether it reflected their teaching or their roster.

Seven Houston teachers and their union sued in 2014. In May 2017, U.S. Magistrate Judge Stephen Smith ruled the system was seriously flawed, finding teachers had no way to verify or correct their scores — a due process violation, since their jobs were on the line. The case settled, and HISD was barred from using EVAAS in firings. The court wasn't objecting to measuring outcomes. It was objecting to a score that stayed silent about the conditions producing the result.

Risk-Adjusted Mortality: Medicine Already Solved This Problem

Hospitals hit the same wall years ago. Raw surgical mortality rates punished hospitals taking on the sickest, highest-risk patients, while flattering hospitals that selected easier cases. The fix, now standard across the field, is risk-adjusted mortality: outcomes compared against what's statistically predicted for that hospital's actual patient mix, not a flat national average. A hospital can post a higher raw death rate than a rival and still rank as the stronger performer, because its patients were sicker going in.

Medicine didn't adopt this to be generous. It adopted it because raw numbers were lying about who was doing the better work — the same lie a results-only score tells about an agent handed the hardest, most under-staffed queue on the floor.
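
Risk adjustment is commonly expressed as an observed-to-expected ratio: deaths that occurred divided by deaths predicted from the patient mix. The sketch below shows only that final comparison. The hospital names and all figures are hypothetical, and the expected counts are simply given here, whereas in practice they come from a statistical risk model.

```python
def observed_to_expected(deaths: int, expected_deaths: float) -> float:
    """Ratio below 1.0 means fewer deaths than the patient mix predicted."""
    return deaths / expected_deaths


# name: (surgeries, observed deaths, deaths predicted from patient risk mix)
# All values are invented for illustration.
hospitals = {
    "high_risk_centre": (1000, 40, 60.0),
    "low_risk_centre": (1000, 20, 15.0),
}

results = {}
for name, (surgeries, deaths, expected) in hospitals.items():
    raw_rate = deaths / surgeries
    ratio = observed_to_expected(deaths, expected)
    results[name] = (raw_rate, ratio)
    print(f"{name}: raw mortality {raw_rate:.1%}, O/E ratio {ratio:.2f}")
```

In this sketch the high-risk centre has double the raw death rate but the better ratio, which is the reversal the paragraph above describes.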

UnitedHealth's nH Predict: The Closest Match to This Exact Scenario

This case maps onto the question almost exactly: a service organisation, an AI score, and staff pressured to match its number regardless of how complicated the individual case actually was. UnitedHealth subsidiary naviHealth built nH Predict to estimate how many days of post-acute care a Medicare Advantage patient should need.

A 2023 class-action lawsuit, Estate of Lokken v. UnitedHealth Group, alleges case managers were pressured to keep patient stays within 1% of the algorithm's prediction, with staff who departed from it facing discipline. The suit also alleges roughly 90% of appealed denials were reversed — meaning the algorithm was wrong nine times out of ten when actually checked, though few patients ever appeal. A federal judge ordered UnitedHealth to turn over internal documentation in March 2026; the case is ongoing.

Same pattern as the rest: a result-only number, no room for the fact that one patient's recovery is genuinely more complex than another's. Staff scored against it didn't get better outcomes — they got pressure to make the number match regardless of what the patient actually needed.

The Pattern Across All Three Cases

Case	Result-Only Measure	What It Missed	Outcome
Houston ISD (EVAAS)	Test-score growth, no roster context	Which students each teacher was assigned	Federal court ruled it unconstitutional in 2017; barred from use in firings
Hospital mortality rankings	Raw surgical death rate	Patient risk level coming into surgery	Field-wide shift to risk-adjusted scoring as the standard
UnitedHealth nH Predict	Predicted length of post-acute stay	Each patient's actual recovery complexity	Lawsuit alleges staff disciplined for deviating; ~90% of appeals reversed

There's a Name for Why This Keeps Happening

Economist Charles Goodhart observed this in 1975, now known as Goodhart's Law: when a measure becomes a target, it stops being a good measure. Once people know exactly what number decides their pay or job, behaviour bends toward that number — not the goal it was meant to represent. A results-only score is especially exposed, because it leaves exactly one lever to pull: the outcome itself, stripped of context. A difficulty-adjusted score closes that lever — if a score already accounts for what was realistically achievable, there's no shortcut left except doing the work. That's the strongest practical case for View B: it's harder to game, not easier.

The Real Question Is What Counts as a Real Adjustment

View A's real fear isn't fairness — it's adjustment becoming a permanent alibi where every weak result gets explained away. That fear is legitimate; Houston shows the cost of the opposite extreme, zero room for context at all.

The way through is being strict about what “circumstances” means. A circumstance only counts if it shows up in data the company already collects, and only if it was genuinely outside the person's control. “I had a hard week” doesn't move a score. “40% of my queue was escalations against a team average of 12%” does, because it's verifiable. That's the line between an adjustment and an excuse — and an AI can enforce it with numbers, not sympathy.
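
The rule above can be sketched as a simple filter. This is an illustration under assumptions, not a proposed standard: the `Claim` structure and its field names are hypothetical, and the `min_ratio` threshold for how far above the team average a value must sit is my own placeholder, since the post does not specify one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    description: str
    agent_value: Optional[float]   # from company records; None if not recorded
    team_average: Optional[float]  # same metric across the team
    outside_agent_control: bool


def is_admissible(claim: Claim, min_ratio: float = 1.5) -> bool:
    """A circumstance counts only if it is recorded, outside the agent's
    control, and materially above the team average (min_ratio is assumed)."""
    if claim.agent_value is None or claim.team_average is None:
        return False
    if not claim.outside_agent_control:
        return False
    return claim.agent_value >= min_ratio * claim.team_average


hard_week = Claim("I had a hard week", None, None, True)
escalation_share = Claim("Share of queue that was escalations", 0.40, 0.12, True)

for claim in (hard_week, escalation_share):
    print(claim.description, "->", is_admissible(claim))
```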

Final Position

View B. Houston shows what a results-only score looks like when it meets scrutiny — a federal judge calling it unconstitutional for ignoring the conditions behind the number. Hospitals show the fix an entire industry now treats as standard, not softness. nH Predict shows the same failure happening right now, in a service organisation, with staff allegedly disciplined for treating a patient's real circumstances as more important than a prediction. Goodhart's Law explains why this isn't coincidence three times over.

Adjusting for circumstances doesn't mean letting people off the hook. It means making sure the AI is scoring performance, not just scoring whoever drew the harder assignment. That's not a softer standard than View A wants — it's the only way to actually meet it.

June 22

Position: View B – AI Should Adjust for Circumstances

I support View B because performance evaluation should measure not only results, but also how effectively teams perform under the conditions they face. Evaluating outcomes alone may seem objective, but it can lead to inaccurate conclusions when employees operate under very different circumstances.

For example, our lamination department had a productivity target of 1,200 tons. However, due to the Iran conflict, adhesive shipments to Nigeria were delayed, forcing us to use local alternatives. Quality problems with these adhesives required lower machine speeds and additional controls, which reduced productivity. A results-only evaluation would label this as poor performance, even though the team successfully managed a major supply chain disruption.

Similarly, we launched a cost-saving initiative to replace imported granules with locally produced Nigerian granules in our blown film section. However, rising crude oil prices caused local granule prices to increase above imported prices, eliminating the expected savings. The project's outcome was affected by external market conditions rather than poor execution.

These examples show that results do not always reflect true performance. AI should consider context because the goal of evaluation is not just accountability, but accuracy. By assessing both outcomes and the challenges faced, organizations can make fairer and more informed performance decisions.

To address such situations, AI should evaluate performance using three factors: Results, Difficulty of Circumstances, and Response Quality. For example, the AI could assign difficulty scores for factors such as raw material shortages, geopolitical disruptions, machine breakdowns, or market price fluctuations. It should then assess how effectively the team responded to these challenges, including maintaining quality, ensuring customer supply, and implementing corrective actions.

These examples show that results do not always reflect true performance. The goal of AI-driven evaluation should not be just accountability, but accuracy. By assessing both outcomes and the challenges faced, organizations can make fairer, more informed decisions and better recognize resilience, problem-solving, and operational excellence under difficult conditions.

Edited June 22 by Raja M
Added how AI should evaluate performance using three factors: Results, Difficulty of Circumstances, and Response Quality

June 23

My submission clearly supports View B (Adjust for circumstances), and I argue my position as follows:

Artificial intelligence is increasingly used to evaluate human performance in domains ranging from hiring and education to finance and healthcare. Many AI systems focus primarily on measurable outcomes such as test scores, productivity metrics, or financial returns, but this results-only approach risks producing incomplete and unfair assessments. Historically, quantitative metrics such as sales figures, standardized test scores, or lines of code have been the standard for evaluating human performance. However, measuring outcomes in a vacuum assumes a perfectly level playing field and overlooks systemic disadvantages, resource limitations, and personal hardships. When evaluation models fail to account for context, they inadvertently penalize individuals who have to work significantly harder to achieve the same results as their more privileged peers. A more equitable and effective model is one in which AI systems also account for the difficulty of an individual's circumstances. Incorporating context alongside outcomes leads to fairer judgments, more accurate predictions, and better long-term societal outcomes.

One of the key limitations of evaluating people solely based on results is that outcomes are often shaped by unequal starting points. Individuals operate within vastly different environments, influenced by socioeconomic status, access to resources, and personal challenges. For example, in education, a student achieving average grades in an under-resourced school while balancing family responsibilities may demonstrate greater effort and potential than a student with higher grades from a well-funded institution. AI systems used in university admissions, such as contextual admissions tools in the United Kingdom, have begun to address this by incorporating data on school performance, neighborhood deprivation indices, and personal background. These systems recognize that achievement relative to opportunity provides a more meaningful measure of capability than raw results alone.

In hiring and workforce evaluation, similar issues arise. Traditional AI recruitment tools have historically prioritized signals like previous job titles, university prestige, or uninterrupted career progression. However, such metrics can disadvantage candidates who have faced structural barriers, such as caregiving responsibilities or limited access to elite institutions. Companies like Unilever have adopted AI-driven hiring platforms that incorporate a broader set of indicators, including situational judgment tests and behavioral assessments, which aim to evaluate potential rather than just past achievements. This shift reflects an understanding that resilience, adaptability, and problem-solving under challenging conditions are valuable predictors of future performance.

Operational systems in finance also illustrate the importance of contextual evaluation. Credit scoring algorithms, for instance, have traditionally relied on rigid financial histories, often excluding individuals with limited credit records. Fintech organizations such as Tala and Kiva have developed alternative credit models that incorporate non-traditional data, such as mobile phone usage patterns or community trust networks. These approaches recognize that a lack of formal financial history does not necessarily indicate risk, but may instead reflect systemic barriers to access. By accounting for contextual difficulty, these AI systems expand financial inclusion while maintaining responsible risk assessment.

Healthcare provides another compelling example. AI models used to predict patient risk or allocate resources can produce biased outcomes if they rely solely on historical data without considering disparities in access to care. A widely cited case involved a healthcare algorithm in the United States that underestimated the needs of Black patients because it used healthcare spending as a proxy for illness severity. Since Black patients historically had less access to care, their lower spending led the algorithm to incorrectly assess them as healthier. Adjusting the model to account for contextual inequities significantly improved its accuracy and fairness. This demonstrates that without contextual awareness, AI systems can reinforce existing inequalities rather than mitigate them.

From an organizational perspective, incorporating contextual difficulty into AI evaluation aligns with broader goals of fairness, diversity, and long-term performance. Companies that recognize potential beyond immediate results are more likely to identify overlooked talent and foster innovation. Moreover, systems that account for adversity can better predict traits such as perseverance and creativity, which are critical in dynamic environments. This approach also strengthens trust in AI systems, as users are more likely to accept decisions that are perceived as fair and transparent.

Critics may argue that incorporating contextual factors introduces subjectivity or complexity into AI systems. However, advances in data collection and modeling make it increasingly feasible to quantify aspects of context in a structured and consistent way. Furthermore, ignoring context does not eliminate bias; it simply obscures it. A results-only approach often embeds hidden assumptions about equal opportunity that do not reflect reality.

In conclusion, evaluating individuals based solely on outcomes is insufficient in a world marked by unequal circumstances. AI systems have the potential to move beyond this limitation by incorporating contextual difficulty into their assessments. Examples from education, hiring, finance, and healthcare demonstrate that such approaches are not only more equitable but also more accurate and effective. As AI continues to shape decision-making processes, embedding fairness through contextual awareness is not just desirable—it is essential.

A fair society requires equity, not just equality. Evaluating people based solely on the final output is an archaic, incomplete method that ignores human struggle and environmental barriers. By harnessing context-aware AI, organizations have the unprecedented opportunity to measure the difficulty of circumstances, thereby recognizing true dedication, resilience, and potential.

Jun 23 Rohit Gandhi locked this topic

Tuesday at 12:26 PM

Author

1. anthony rebello — ✅ Approved

Position: View B (adjust for circumstances) — clear and unambiguous. Example: Three specific examples provided: education (value-added models for teachers), sales teams across different territories (Manager A in mature market vs. Manager B in new territory with $10M vs. $8M sales), and healthcare (hospital risk-adjusted mortality ratings). Includes illustrative graphs comparing raw vs. complexity-adjusted scores. Reasoning: Well-structured argument covering the fundamental problem with results-only evaluation, organizational psychology research on fairness and motivation, and the danger of "easy work bias." Proposes a concrete framework (Outcome Metrics 50–60%, Contextual Factors).

Approved because it takes an explicit View B position with multiple concrete industry examples (sales, education, healthcare) and a coherent framework for implementation.

2. Ankita_Bhardwaj_gN3V — ✅ Approved

Position: View B — explicit and strong. Example: Customer service agent comparison (Agent A: 80 routine queries vs. Agent B: complex escalations). Additional examples include enterprise contact center platforms (Salesforce Einstein, Genesys AI), a logistics/delivery driver case (urban vs. suburban routes, 20–25% churn drop), healthcare ER triage, and regulatory frameworks (EU AI Act, NIST RMF). Also cites Starbucks and Google's AI Principles. Reasoning: Highly detailed technical argument using a systems engineering lens (Performance = f(Individual Capability, Environmental Variables)), omitted variable bias, and a normalized scoring architecture with three categories of environmental variables. Cites EU AI Act penalties (€35M or 7% of global turnover) and Schmidt & Hunter (1998) meta-analysis.

Approved because it offers an unambiguous View B position backed by multiple real-world industry examples, regulatory references, and a technically rigorous contextual adjustment framework.

3. rajan.arora2000 — ✅ Approved

Position: View B — unequivocal ("View B. Without qualification."). Example: Multiple load-bearing cases: NY/PA cardiac-surgery report cards (Dranove et al., 2003), CMS Hospital Readmissions Reduction Program (matched pair, reform in 2019), Houston teacher EVAAS case (due-process failure), UK Progress 8 education metric (replaced raw GCSE attainment), Amazon "time off task" terminations, Indian gig platform strikes, and Uber/Lyft driver deactivations (42% linked to passenger bias). Reasoning: Exceptionally rigorous. Frames the core issue as causal inference (Holland, JASA, 1986; Neyman-Rubin potential outcomes). Builds a formal model (Y = C + γ·X + ε), derives a decision rule (Adjust ⇔ γ²·Var(X) > V_cost), closes four counter-arguments systematically, and proposes the PEARL governance framework (Pre-registered, Exogenous, Auditable, Raw shown alongside, Loop-tracked). Identifies the "cream-skimming ratchet" using Campbell's and Goodhart's Laws.

Approved because it takes the clearest, most forcefully argued View B position in the thread, with multiple cross-sector empirical cases, formal mathematical modeling, and a complete governance framework.

4. Vinit_Dubey_w5HV — ✅ Approved

Position: View B — explicit ("I Strongly Support View B"). Example: Five scenarios: customer service (Agent A with 120 simple queries vs. Agent B with 45 escalated complaints), sales (territory-adjusted quotas), manufacturing/supply chain disruption, education (value-added teacher assessment), and professional sports (advanced adjusted metrics). Also cites Duke University Fuqua School of Business research on AI usage bias. Reasoning: Solid coverage of the "easy work bias" problem, the social penalty problem related to AI usage, and a discussion of business impact. Covers five distinct domains. Reasoning is competent and well-organized.

Approved because it takes a clear View B stance with five specific cross-industry scenarios and articulates why results-only systems create perverse incentives.

5. kartik voleti — ✅ Approved

Position: View B — explicit. Example: Healthcare/NHS during COVID-19 pandemic — clinicians in surge hospitals evaluated differently from those in lower-pressure environments; also mentions sales territory performance adjustment in a second example. Reasoning: Concise but solid argument: AI is uniquely capable of incorporating contextual variables; results-only systems risk demotivating top performers in hard roles; context-aware evaluation improves talent retention and organizational decision-making. Notes that adjusting for circumstances does not eliminate accountability.

Approved because it takes a clear View B position with a specific, relevant industry example (NHS during COVID) and makes a coherent argument about AI's unique capability to incorporate context.

6. Bedibrat Kutum — ✅ Approved

Position: View B — explicit ("Why AI Should Never Evaluate People on Results Alone"). Example: Sales rep in undersupported territory with outdated CRM; education context (value-added models in US, Washington D.C. teacher fired by algorithm); references Adam Grant's research on "givers" being penalized by output-only metrics. Reasoning: Makes the "illusion of objectivity" argument — AI that measures outcomes inherits structural inequities baked into those outcomes. Cites organizational psychology (Adam Grant), education policy failures (value-added models), and the AI automation of human attribution error. Well-written and conceptually sound.

Approved because it takes a clear View B stance with multiple specific examples across sectors (sales, education) and engages substantively with the philosophical and empirical basis for why raw outcomes are not objective.

7. Prateek_Harsh_dl5h — ❌ Not Approved

Position: View B — stated. Example: Only a personal anecdote (losing his father, employer adjusting his evaluation). No process, role, or industry example provided. Reasoning: The argument rests primarily on mental health considerations and emotional dimensions, and the "example" is a personal bereavement story rather than an operational, industry, or organizational scenario.

Not Approved because the only "example" offered is a personal anecdote with no industry context, process steps, or organizational scenario — the answer lacks the specific concrete example required for approval.

8. Ajay_Wadhwa_bs1h — ✅ Approved

Position: View B — explicit ("I strongly support View B"). Example: Procure-to-Pay (P2P) shared services/GBS setup — standard invoice processing vs. exception handling (blocked invoices, vendor escalations). Mentions GBS/SSC operational environments with upstream data quality issues and process disruptions. Reasoning: Makes three structured arguments: outcomes without context distort true performance; results-only creates "easy work bias"; and fair evaluation drives morale and retention in high-pressure environments. The P2P operational example is concrete and industry-specific.

Approved because it takes a clear View B position with a specific operational example from the GBS/shared services finance sector and structured reasoning about systemic bias in results-only evaluation.

9. Jaswant_Kumar_nB8z — ✅ Approved

Position: View B ("Option B – Adjust for circumstances") — explicit. Example: Multiple real-world cases: gig economy workers (delivery/ride-hailing platforms across UK, EU, and US where courts ruled AI scoring systems unlawful due to uncontrolled factors like traffic, restaurant delays, geographic assignment, and customer bias); also mentions teacher evaluation and Amazon-style productivity systems. Discusses decontextualized ratings feeding into pay, promotions, and redundancy decisions. Reasoning: Strong structural argument — identifies specific failure mechanisms (complex cases penalized, staffing shortages invisible to AI, role complexity unmeasured). References court rulings across multiple jurisdictions on unlawful AI scoring. Lays out principles for a robust evaluation system.

Approved because it takes a clear View B stance and supports it with documented real-world legal cases and systematic analysis of how decontextualized AI ratings create compounding harm for workers in harder roles.

10. Saran raj_Venkatesan_YFX7 — ✅ Approved

Position: View B — explicit ("POSITION: VIEW B — SUPPORTING BEX, BUT GOING FURTHER"). Example: Multiple proof cases: Wipro's engineering evaluation system reform (restored talent to complex roles); France's Baccalauréat difficulty adjustment for Classes Préparatoires students; Singapore Economic Development Board officer complexity-weighting for deal-count metrics; US Army deployment-context adjustments; Starbucks foot-traffic-adjusted performance. Also uses contact center (escalations vs. routine) and the Difficulty Drain analysis. Reasoning: Exceptionally rigorous. Introduces the "Objectivity Mirage" reframe (measurement science requires valid measurement under structural inequality). Builds a quantitative floor analysis showing a minimum 5× structural handicap for escalation agents. Introduces the "Difficulty Drain" and "Inverse Incentive Engine" as distinct dynamics. Constructs the PACE framework (Pre-set, Audited, Conditional, Embedded). Cites Kahneman on attribution, Goodhart's Law, and measurement science principles. Closes with a comparison table contrasting View A vs. View B across all key dimensions.

Approved because it takes an unambiguous View B position supported by multiple cross-sector examples (tech, education, government, military, public service), a quantitative floor calculation for structural disadvantage, and a comprehensive governance framework.

11. Abhishek Adhikary — ✅ Approved

Position: View A — explicit ("I support View A and disagree with Bex"). Example: Poses the question of who controls the definition of "context" and uses Campbell's Law to argue that adjusting for circumstances creates a gaming incentive — people will compete on circumstance-inflation rather than outcomes. Compares performance as Outcomes Produced ÷ Resources Consumed. Reasoning: The core argument is accountability-based: the moment AI adjusts for circumstances, every poor result can be justified by context, eroding accountability. Points out that "context" is subjective interpretation, not objective fact, and that teams handling more complex work may simply be less efficient. Uses Campbell's Law (citing it correctly). However, does not provide a concrete industry scenario with specific process steps or roles.

Approved because it takes a clear, unambiguous View A position with a coherent and substantive accountability argument; while the concrete industry example is thin (no specific named organization or detailed scenario), Campbell's Law applied to the performance-evaluation context constitutes a legitimate operational illustration of the gaming risk.

12. Suhail_J_CaJq — ✅ Approved

Position: View B — explicit. Example: Service organization scenario — Team A handles routine billing queries with full staffing vs. Team B handles complex escalations with two vacancies, system outages, 3× effort, 30% fewer staff, 2 hours/day lost to instability. Quantified comparison showing Team B's "moderate" results represent exceptional performance. Reasoning: Makes four structured arguments: results are not comparable across unequal environments; results-only rewards easiest conditions; context is a performance variable (not an excuse); context-adjusted model improves accuracy, not leniency. The comparison is concise and quantified.

Approved because it takes an explicit View B stance and includes a specific, quantified operational example (Team A vs. Team B in a service center) with concrete metrics (3× effort, 30% fewer staff, 2 hours/day system outages).

13. Naijur Rahman — ✅ Approved

Position: View B — explicit. Example: UnitedHealth subsidiary naviHealth's nH Predict algorithm — case managers pressured to keep patient stays within 1% of the algorithm's prediction; lawsuit (Estate of Lokken v. UnitedHealth Group, 2023) alleging ~90% of appealed denials were reversed; federal judge ordered documentation disclosure (March 2026). Also references Lee Ross's fundamental attribution error (1977). Reasoning: Frames the results-only approach as automating the "fundamental attribution error" — the oldest bias in performance management. The UnitedHealth case is directly analogous to the scenario (a service organization, AI scoring, staff pressured to match numbers regardless of individual case complexity). Compelling and grounded in an ongoing legal case.

Approved because it takes a clear View B position and grounds it in a highly specific, legally documented real-world case that maps directly onto the forum scenario.

14. Raja M — ✅ Approved

Position: View B — explicit. Example: Two personal manufacturing examples from a Nigerian plastics/packaging company: (1) lamination department missed 1,200-ton productivity target due to Iran-conflict-related adhesive supply disruption, forcing use of local alternatives at lower machine speeds; (2) blown film section's cost-saving initiative with Nigerian granules failed when rising crude oil prices eliminated the expected savings. Reasoning: Proposes a three-factor AI evaluation model (Results + Difficulty of Circumstances + Response Quality) with specific difficulty scores for disruptions. The examples are highly specific with real geopolitical and supply chain context. Reasoning is concise but practical.

Approved because it takes a clear View B position with two specific, real operational examples from manufacturing/supply chain in West Africa, and proposes a concrete three-factor evaluation model.

15. Adeniran_Ilesanmi_GYSH — ✅ Approved

Position: View B — explicit. Example: Two specific industry examples: (1) Fintech — Tala and Kiva using non-traditional data (mobile phone usage, community trust networks) for credit scoring to overcome lack of formal financial history; (2) Healthcare — a US healthcare algorithm that underestimated the needs of Black patients because it used healthcare costs as a proxy for health needs (a widely documented algorithmic bias case). Reasoning: Argues that quantitative metrics assume a perfectly level playing field and overlook systemic disadvantages. The fintech and healthcare examples are concrete and well-documented. Makes a broader equity argument about evaluation models inadvertently penalizing those with structural disadvantages.

Approved because it takes a clear View B position with two specific, documented real-world industry examples (fintech and healthcare AI bias) and makes a coherent argument about context-aware evaluation improving both fairness and accuracy.

🏆 Winning Answer: rajan.arora2000

rajan.arora2000 is the clear winner among all approved answers. Where most approved answers make competent moral or logical arguments for View B, rajan.arora2000 fundamentally reframes the debate at a level no other answer reaches: the question is not one of fairness or compassion, but of measurement science — a results-only system is not objective, it is a biased estimator of the very thing it claims to measure (individual contribution), with bias running systematically toward people in easier roles. The formal model (Y = C + γ·X + ε) provides a precise mathematical statement of when adjustment beats raw scoring (Adjust ⇔ γ²·Var(X) > V_cost), which is more decision-useful than any competitor's framework. The empirical section is uniquely rigorous: rather than citing examples in passing, rajan.arora2000 grades each case by weight (load-bearing vs. supporting), names the confounds and shows which direction they cut, and constructs two explicit matched pairs — CMS HRRP and England's Progress 8 — demonstrating that the same accountability task was run raw, found to be measuring assignment rather than contribution, and reformed toward adjustment in two different sectors. The PEARL governance framework (Pre-registered, Exogenous, Auditable, Raw shown alongside, Loop-tracked) is the most operationally complete implementation guide in the thread, directly addressing the gaming, opacity, and accountability objections that competitor answers leave open. Compared to the other two high-quality answers (Ankita_Bhardwaj_gN3V and Saran raj_Venkatesan_YFX7, which are also very strong), rajan.arora2000 has a more parsimonious and formally airtight core argument, closes four counter-arguments explicitly, and is the only answer to prove the impossibility of resolving the bias problem simply by "improving the AI" — because the contribution being evaluated is a structurally unobservable counterfactual, not a noisily measured outcome.
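
To make the quoted model concrete, here is a small simulation sketch of Y = C + γ·X + ε. It is not rajan.arora2000's own code. The distributions, the value of γ, the binary coding of X (1 = hard assignment), and the reading of V_cost as the extra error variance that the adjustment process itself introduces are all assumptions; the summary above does not define them.

```python
import random
import statistics


def should_adjust(gamma: float, var_x: float, v_cost: float) -> bool:
    """Decision rule as quoted above: adjust iff gamma^2 * Var(X) > V_cost."""
    return gamma ** 2 * var_x > v_cost


def simulate(n: int = 2000, gamma: float = -3.0, seed: int = 0):
    """Draw n people under Y = C + gamma * X + eps (all parameters assumed)."""
    rng = random.Random(seed)
    contribution = [rng.gauss(10.0, 1.0) for _ in range(n)]  # C, never observed
    context = [rng.choice([0.0, 1.0]) for _ in range(n)]     # X, 1 = hard role
    noise = [rng.gauss(0.0, 0.5) for _ in range(n)]          # eps
    outcome = [c + gamma * x + e
               for c, x, e in zip(contribution, context, noise)]
    # Idealised adjustment: gamma is treated as known, so only eps remains.
    adjusted = [y - gamma * x for y, x in zip(outcome, context)]
    return contribution, context, outcome, adjusted


contribution, context, outcome, adjusted = simulate()

# Error of each score as an estimate of the true contribution C.
raw_error = [y - c for y, c in zip(outcome, contribution)]
adj_error = [a - c for a, c in zip(adjusted, contribution)]

var_x = statistics.pvariance(context)
print("gamma^2 * Var(X):       ", round(9.0 * var_x, 3))
print("raw error variance:     ", round(statistics.pvariance(raw_error), 3))
print("adjusted error variance:", round(statistics.pvariance(adj_error), 3))
print("adjust if V_cost = 1.0? ", should_adjust(-3.0, var_x, 1.0))
```

In this sketch the raw score's error variance exceeds the adjusted score's by roughly γ²·Var(X), which is the quantity the rule weighs against the cost of adjusting. A real system would have to estimate γ, and that estimation noise is one plausible component of V_cost.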

3 days Rohit Gandhi unlocked this topic
