How Should Performance Metrics Change When AI Becomes Part of the Workflow?

February 13

Q846
When AI becomes part of a process — offering recommendations, automating steps, or influencing decisions — traditional performance metrics may no longer tell the full story.
If individuals are still measured the same way as before, they may either over-rely on AI or resist it altogether.

Think of a specific process in your domain where AI is involved.
How should performance expectations or success measures change to ensure people use AI responsibly and effectively?
What behaviors should be encouraged — and what unintended behaviors should be prevented?

⚠️ Any answer that is generic or does not connect with a specific process will not be approved.

🏆 The best answer will be selected on the basis of:

Relevance of the chosen process
Thoughtfulness in aligning incentives with AI usage
Practicality of the revised performance approach

February 14

Context: At Amazon, the FOAA (Finance Operations Account and Analysis) Bangalore team prepares and submits balance sheet reconciliations in the Account Reconciliation Module (ARM) tool every month. There are 16 teams in FOAA Bangalore. Each balance sheet account reconciliation has two levels of review: Reviewer 1 (internal, Bangalore) and Reviewer 2 (external, the Central Accounting teams).

Of 212,966 annual reconciliations submitted by preparers, Reviewer 1 rejected 17,581 (8.26%); these were classified as L1 rejections. Similarly, of the 149,114 annual reconciliations that reached Reviewer 2 after Reviewer 1’s review and approval, Reviewer 2 rejected 2,255 (1.51%). The total rejection rate, taken as the sum of the two rates, was 9.77%. The rejection list alone did not make the reasons for the rejections clear.

Automating Rejection Reasons in ARM tool: Previously, the FOAA BLR team approached the L1/L2 reviewers of the respective teams to ask for the reasons behind rejections. This process was completely manual and very subjective, because the whole analysis depended on the reviewers’ comments, so it did not scale. After analyzing the manual inputs provided by both reviewers, I designed a set of rejection reason codes and built them into ARM as a drop-down menu, listed below. Reviewers 1 and 2 must both select the right rejection category from the drop-down before rejecting an account. The model is scalable, tracks rejections in a timely manner, and is auditable for future reference.

Major Rejection Categories List:

1) Missing/inappropriate Authoritative Supporting Documents (ASDs) updated in ARM or in the reconciliation spreadsheet (ASD enhancement)

2) Missing/inappropriate or incorrect classification of reconciling items

3) Missing/inappropriate explanation (commentary) updated in ARM

4) Incorrect dates updated (impact on aging of the reconciling items)

5) White board journals (correction journals above $250K) impact during quarter close

6) Reconciliation assigned to incorrect reviewers

7) Incorrect supporting documents/links updated, or links and attachments that cannot be opened

8) Missing/inappropriate action plan to clear the reconciling items updated in ARM

9) Reconciliation rejected in error

10) Not submitted on time or missed deadlines

Framework established to track performance metrics:

Based on the rejection reasons selected, on average 21% of rejected recons were due to ASD enhancement, 18% were rejected in error, 16% were due to format enhancement, 14% to incorrect classification or categorization of open items, 12% to commentary enhancement, and 10% to invalid supporting links.

I developed a framework to monitor the L1 and L2 rejections monthly. This mechanism allows FOAA teams to categorize rejections and set up monthly review meetings to understand the root cause of rejections and take proactive measures to avoid future ones. The control mechanism consists of:

1) Maintaining a tracker of reconciliation rejection categories.

2) Holding a monthly review with FOAA leadership to provide the details of rejections.

3) Identifying recons that were rejected repeatedly in the past two months for a similar reason.

4) Conducting brainstorming sessions and defining control measures to avoid rejections.

5) Documenting the reconciliation rejection scenarios, impact, and control measures for training purposes.

6) Adopting best practices implemented in other areas to reduce rejections.
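
Step 3 of the control mechanism can be sketched in a few lines of Python. This is only an illustration, not the ARM implementation: the record layout (recon_id, month, reason_code), the function name repeat_rejections, and the sample rows are all assumptions, and "repeated" is read here as "rejected for the same reason code in both of the two most recent months".

```python
from collections import defaultdict

def repeat_rejections(rejections, last_two_months):
    """Return sorted (recon_id, reason_code) pairs rejected in BOTH given months."""
    months_seen = defaultdict(set)
    for recon_id, month, reason_code in rejections:
        if month in last_two_months:
            months_seen[(recon_id, reason_code)].add(month)
    wanted = set(last_two_months)
    return sorted(pair for pair, months in months_seen.items() if months == wanted)

# Hypothetical tracker rows: (recon_id, month, reason code 1-10 from the category list)
tracker = [
    ("ACC-101", "2024-01", 1),
    ("ACC-101", "2024-02", 1),  # same reason two months running -> repeat
    ("ACC-202", "2024-01", 3),
    ("ACC-202", "2024-02", 7),  # different reason each month -> not a repeat
    ("ACC-303", "2024-02", 9),  # rejected in one month only -> not a repeat
]

repeats = repeat_rejections(tracker, ["2024-01", "2024-02"])
print(repeats)
```

Keying on the pair (recon, reason) rather than on the recon alone keeps the monthly review focused on root causes that did not get fixed, which is what the brainstorming step needs as input.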

ARM Reconciliation Status Dashboard: A live visual dashboard is displayed to track the reconciliation completion status at each team level and raise flags on due dates. The tool is designed to download and publish the metrics to leadership monthly and provide updates on goal status. This ARM (AI) tool reduced the metrics preparation time and provided more accurate details.

In a planned, phased manner, we reduced the Level 2 (L2) reconciliation rejection rate in ARM from 1.51% to 0.35% (on a TTM basis), delivering a 90bps YoY improvement. The rate is calculated as:

(Total L2 rejections / Total number of accounts submitted for L2 approval) x 100

Similarly, we reduced the Level 1 (L1) reconciliation rejection rate from 8.26% to 2.25%, calculated as:

(Total L1 rejections / Total number of accounts submitted for L1 approval) x 100
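
As a quick check of the two formulas, here is a minimal Python sketch that recomputes the baseline rates from the counts quoted earlier in this post. The function and variable names are illustrative only; just the four counts come from the post.

```python
def rejection_rate_pct(rejections: int, submitted: int) -> float:
    """Rejections as a percentage of reconciliations submitted for that review level."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return rejections / submitted * 100

# Baseline counts quoted earlier in this post.
l1_baseline = rejection_rate_pct(17_581, 212_966)
l2_baseline = rejection_rate_pct(2_255, 149_114)

print(f"L1 baseline rejection rate: {l1_baseline:.2f}%")  # 8.26%
print(f"L2 baseline rejection rate: {l2_baseline:.2f}%")  # 1.51%
```

Note that the two rates use different denominators (all submissions for L1, only L1-approved submissions for L2), so they are best tracked side by side.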

These two metrics (time and quality) helped to evaluate not only the team’s performance but also individual performers (within the team) on quality aspects and decide the annual rating, promotion cycles, and hikes.

February 14

In the banking world (payment processing), AI is increasingly used to recommend repair actions, prioritize investigations, and suggest risk outcomes. While the AI provides decision support, the final action typically remains with the operations analyst. This shift fundamentally changes how performance should be measured. If individuals continue to be evaluated only on traditional productivity metrics such as turnaround time or case volumes, it can unintentionally drive two harmful behaviors:

· Blind acceptance of AI recommendations to maintain speed

· Resistance to AI usage due to fear of accountability.

Therefore, performance expectations must evolve to measure decision quality, risk awareness, and responsible AI usage, rather than only output volume. By embedding responsible AI behaviors into performance expectations, organizations can sustain both human expertise and technological advancement while maintaining regulatory and business integrity.

Incorporate Learning and Adaptability Indicators

Since AI systems evolve continuously, performance measurement should reward employees who adapt effectively.

Examples include:

1. Participation in AI feedback and calibration exercises

2. Contribution to identifying AI errors or drift

3. Ability to handle complex exceptions beyond AI capability

Practical Performance Framework for AI-Supported Processes

A balanced performance model should combine three evaluation pillars:

Operational Efficiency: Measures turnaround time and throughput while maintaining defined quality.
Responsible AI Interaction: Measures appropriate usage, override governance, and feedback contributions.
Learning and Improvement Contribution: Recognizes employees who help refine AI capabilities and identify improvement opportunities.

Edited February 14 by Kush Singh
Spellcheck correction

February 14

We have a human-in-the-loop AI automation in which the AI extracts data from documents according to predefined rules. If the AI cannot extract the data because it is unavailable, that case moves to an exception queue to be handled by a human.

We have developed two sets of metrics to evaluate the performance of individuals after the AI implementation:

If the AI extracts all data points and marks the case as completed, we review a 30% sample to ensure that the AI is performing accurately. The processor is expected to identify any assumptions or incorrect processing by the AI.

If the AI marks a case as failed, the processor is expected to identify whether it was correctly marked as failed or whether the data was available but the AI was unable to extract it. In this case the user needs to assess whether the error was due to rule building or was an actual failure.
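
The two checks above can be sketched as follows. This is a hedged illustration rather than our production logic: the function names, the label strings, and the choice to round the sample size up are assumptions; only the 30% rate and the rule-gap versus actual-failure distinction come from the description above.

```python
import random

def qa_sample(case_ids, rate_pct=30, seed=None):
    """Pick rate_pct% of AI-completed cases (rounded up) for human review."""
    cases = list(case_ids)
    sample_size = -(-len(cases) * rate_pct // 100)  # integer ceiling division
    return random.Random(seed).sample(cases, sample_size)

def classify_failed_case(data_was_available):
    """Label an AI-failed case after the processor has checked the document."""
    # Data present but not extracted points at the extraction rules;
    # data genuinely absent means the AI was right to fail the case.
    return "rule_gap" if data_was_available else "genuine_failure"

completed = [f"CASE-{n}" for n in range(1, 11)]  # hypothetical case IDs
picked = qa_sample(completed, seed=42)
print(len(picked), "of", len(completed), "completed cases sampled")
print(classify_failed_case(data_was_available=True))
```

Passing a seed makes a given month's sample reproducible for audit, while leaving it as None keeps the selection unpredictable for day-to-day use.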

In summary, the measure should be:

How well were AI recommendations evaluated and applied?

It should include:

Override rationale quality

Exception handling accuracy

Risk mitigation effectiveness

This is based purely on my experience with the AI implemented in our process, but I am keen to hear other perspectives to broaden my horizons.

February 15

Solution

The Process: AI-Augmented IT Service Desk (Tier 1 Support)

In this process, an AI "Co-pilot" drafts responses to user tickets and suggests troubleshooting steps based on past data. The human agent reviews the draft, edits it for context, and sends the final solution to the user.

Revised Success Measures (KPIs)

Traditional metrics like "Tickets Resolved per Hour" are dangerous here because they encourage agents to mindlessly accept AI suggestions to hit their numbers. We should replace them with:

Metric 1: The AI-Validation Rate (AVR)

Instead of measuring speed, we measure how often an agent identifies and corrects a technical error in the AI’s draft. This rewards critical thinking over "blind clicking".

Metric 2: Knowledge Base (KB) Evolution Contribution

We measure how many times an agent updates a system article because the AI provided outdated or incorrect advice. This shifts the agent’s role from a "Consumer" to a "Curator" of AI knowledge.

Metric 3: High-Complexity First Contact Resolution (HC-FCR)

Success is measured only on complex tickets where the AI had "Low Confidence". This highlights the human’s unique value in solving what the machine cannot.
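
To make Metric 1 and Metric 3 concrete, here is a minimal Python sketch. The Ticket fields, the function names, and the denominators are assumptions made for illustration: AVR is taken as the share of AI drafts with an audited error that the agent corrected, and HC-FCR as the first-contact resolution rate on low-confidence tickets only.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    ai_low_confidence: bool       # AI flagged its own draft as "Low Confidence"
    draft_had_error: bool         # established by a later quality audit (assumption)
    agent_corrected_error: bool   # agent fixed that error before sending
    resolved_first_contact: bool

def ai_validation_rate(tickets):
    """Share of erroneous AI drafts that the agent caught and corrected."""
    flawed = [t for t in tickets if t.draft_had_error]
    if not flawed:
        return None  # nothing to validate in this period
    return sum(t.agent_corrected_error for t in flawed) / len(flawed)

def hc_fcr(tickets):
    """First-contact resolution rate on low-confidence (complex) tickets only."""
    hard = [t for t in tickets if t.ai_low_confidence]
    if not hard:
        return None
    return sum(t.resolved_first_contact for t in hard) / len(hard)

# Hypothetical week of tickets for one agent.
week = [
    Ticket(False, True, True, True),
    Ticket(False, True, False, True),
    Ticket(True, False, False, True),
    Ticket(True, True, True, False),
    Ticket(False, False, False, True),
]
print("AVR:", ai_validation_rate(week))
print("HC-FCR:", hc_fcr(week))
```

Returning None when there is no denominator avoids scoring an agent as 0% on a metric they had no chance to demonstrate.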

Encouraged vs. Prevented Behaviors

1. Behavior to Encourage: "The Critical Editor"

We must reward agents who treat AI as a junior assistant, not a boss.

The Incentive: Agents who flag the most "AI Hallucinations" (errors) should be promoted to "Process SME". This ensures that "questioning the machine" is seen as a sign of high skill, not a waste of time.

2. Behavior to Prevent: "The Rubber Stamp" (Automation Bias)

The biggest risk is "The Rubber Stamp": an agent copies and pastes AI text without reading it, in order to finish their shift faster.

The Prevention: Shift quality audits to include "AI-Attribution". If an agent passes through an AI error that a human should have caught, they receive a "Double Penalty." This ensures accuracy is never sacrificed for the sake of AI-powered speed.

The Practical Result

By changing these metrics, the agent is no longer a "button-pusher" competing with a machine. Instead, they become the Quality Controller. This structure aligns human intuition with AI speed, ensuring the system improves over time rather than just producing faster, low-quality outputs.


February 15

Use case: Motor accident claims (insurance operations)

Importance of the process

In insurance operations, motor accident claim assessment is very important. Whether to approve, reject, or investigate a claim has a direct financial impact. To guide the officers, an AI system checks the accident photos, the driver’s history, the police report, and so on. Even though the AI can detect fraud and predict the cost of repair, a human takes the final decision. A wrong rejection can result in customer complaints and regulatory issues, whereas a wrong approval can cause financial loss.

Scenario 1: AI was right but the human action was wrong

A customer submitted a claim for vehicle damage. The AI reviewed the images and reports, found an inconsistency, and suggested further investigation, indicating possible fraud. But the officer, who was evaluated the traditional way on speed, focused on the daily case closure count. To keep turnaround time low and avoid further delays, the officer approved the claim and ignored the AI’s alert.

An internal audit later found that the claim was fraudulent and that the company had taken a big loss. The officer still received a good performance score, because the old metrics reward speed and not decision quality. This shows that if the metrics are based on low turnaround time, the agent is encouraged to ignore the AI even when it is correct.

Here the officer’s performance metrics should shift from speed to quality.

Scenario 2: Human followed AI but the officer got a negative score

Here the AI reviewed an accident claim and recommended APPROVE because the repair estimate and the damage in the photos matched the incident description. The officer trusted the AI and approved immediately. A week later, the claim was flagged because the AI had failed to consider that the vehicle model had a known recall that could have changed the liability rules. This is a system gap, and the officer is not at fault.

The officer here received a negative score for approving incorrectly. This discourages the adoption of AI and is not fair.

When AI is part of the workflow, performance metrics must separate human error from AI error. Humans should not be punished for following system guidance unless they missed mandatory verification steps.
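
The separation rule can be written down as a small decision function. This is a sketch under stated assumptions, not an actual scoring system: the function name, the three inputs, and the label strings are hypothetical, and a wrong decision where the officer followed the AI but skipped mandatory checks is attributed to the human here.

```python
def attribute_error(decision_correct, followed_ai, mandatory_checks_done):
    """Decide who owns a wrong claim decision once audit has established the facts."""
    if decision_correct:
        return "no_error"
    if followed_ai and mandatory_checks_done:
        # Officer did everything required; the recommendation itself was wrong.
        return "ai_error"
    # Officer either overrode a correct AI or skipped a mandatory verification step.
    return "human_error"

# Scenario 1: officer ignored a correct fraud alert and approved.
print(attribute_error(decision_correct=False, followed_ai=False, mandatory_checks_done=True))
# Scenario 2: officer followed the AI, which had missed the recall.
print(attribute_error(decision_correct=False, followed_ai=True, mandatory_checks_done=True))
```

Under this rule the officer in Scenario 2 would not receive a negative score, while the AI’s own error count would go up.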

How performance metrics must change based on these scenarios

The scenarios above clearly show that, when AI is involved, performance should be evaluated on quality and not on speed. Officers should be measured on whether they used the AI appropriately, followed all guidelines, paid more attention to high-risk cases, and whether the outcome reflects sound decision-making. Instead of focusing on low turnaround time and the number of claims closed, the focus should be on the quality of claim closure. If the officer overrides the AI recommendation, check whether the override was justified and whether it caused any financial loss. At the same time, the AI’s performance should also be tracked: the AI should receive a negative score when its recommendation is wrong, and it should be evaluated on rule awareness, accuracy, and reliability. Officers must check that all required elements, such as vehicle model rules, repair eligibility, and policy exclusions, have been considered instead of blindly following the AI.

Conclusion

In our case, performance metrics change when AI is part of the workflow. Human metrics should focus on decision accuracy, proper use of AI recommendations, justification for overriding AI recommendations, financial impact, and compliance checks. The AI should also be measured on its rule coverage and reliability. This creates a fair and balanced evaluation in which responsibility is shared between human and AI, and both can improve the process together.

February 16

When AI becomes part of a workflow, performance metrics must change from gauging individuals to evaluating the collaboration of human and AI.

Let’s take an example from supply chain: demand forecasting. AI is used extensively to generate demand forecasts by SKU and location, and it also suggests safety stocks.

Traditional KPIs like forecast accuracy, bias %, or task completion rate might no longer work: they can push the organization or individuals towards completely blind acceptance or, if accuracy drops, towards resistance.

After AI implementation, the performance metrics can be changed to, e.g., % of overrides that improved accuracy or % of ignored alerts that caused issues. We can have an impact score based on overrides. In the same way, we can have KPIs for safety stocks.

If KPIs are not changed, they can either demotivate individuals or make them completely blind to risks.
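
A minimal sketch of the override KPI, assuming each record holds the AI forecast, the planner’s final forecast, and the actual demand for one SKU/location/period. The function name, the record layout, and the definition of the impact score (total reduction in absolute error across overrides) are illustrative assumptions.

```python
def override_metrics(records):
    """records: iterable of (ai_forecast, final_forecast, actual_demand)."""
    overrides = [(ai, final, actual) for ai, final, actual in records if final != ai]
    if not overrides:
        return {"override_count": 0, "pct_improved": None, "impact_units": 0}
    # Positive gain = the override moved the forecast closer to actual demand.
    gains = [abs(ai - actual) - abs(final - actual) for ai, final, actual in overrides]
    improved = sum(1 for gain in gains if gain > 0)
    return {
        "override_count": len(overrides),
        "pct_improved": 100 * improved / len(overrides),
        "impact_units": sum(gains),
    }

# Hypothetical history for one planner.
history = [
    (100, 100, 110),  # AI forecast accepted as-is
    (100, 120, 118),  # override helped: error 18 -> 2
    (200, 150, 195),  # override hurt: error 5 -> 45
    (80, 90, 95),     # override helped: error 15 -> 5
]
result = override_metrics(history)
print(result)
```

In this made-up history two of three overrides helped, yet the net impact is negative because one bad override outweighed both good ones, which is why the percentage and the impact score are worth reporting together.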

February 16

Previous Stage: Pre-AI deployment

In a BPO payment processing process, the older KPIs were defined as listed below. Invoices were allocated manually, and processors issued payments after validating the information manually and updating it in the system. Performance was evaluated on the KPIs below, with these weightages:

· Number of payments issued (35% weightage)

· Quality achieved (No. of errors received on the processed payment) (45% weightage)

· Innovation (If any idea or kaizen implemented in the process resulting in efficiency gain or quality improvement) (10% weightage)

· Professionalism which included planned leaves, Collaboration with the teams, helping team in completing the numbers (Overtime) etc. (10% weightage)

After R&D, management decided to implement a BOT for issuing the payments. The BOT would issue payments up to a specific threshold, e.g. $2,500, without anyone’s intervention or approval. (The threshold was chosen on the basis of historical data, in which 60% of invoices received for payment were within $2,500.) Anything above this threshold would come to humans, who would validate the payment and route it back to the BOT for issuing.

Post AI Deployment

Since 60% of payments are now automated, the productivity metric needs to be revised, and this has a domino effect on the quality metric too, as the majority of payments are issued automatically. Given these facts, the metrics should change as below.

· Number of payments reviewed correctly and returned to the BOT within the TAT (the 40% of payments that come for human review) (revised weightage 20%)

· Creating a dashboard comparing the number of payments issued by the BOT vs. human review, the incorrect recommendations provided by the AI, and the overrides made on those incorrect recommendations, along with analysis of how the issue was fixed (5% weightage)

· Quality achieved (this would relate to the number of overrides made on recommendations the AI got wrong) (revised weightage 40% + 10%, total 50%). The extra 10% is an incentive for humans flagging incorrect information provided by the AI

· Innovation (any idea or kaizen implemented in the process that results in an efficiency gain or quality improvement) (weightage 10%). This would remain as is, because continuous improvement is a very important metric and can be used to further enhance the AI’s recommendations and efficiency

· Professionalism, which includes planned leave, collaboration with the teams, helping the team complete the numbers (OT), etc. (5%). OT and leave will be less of a problem because the majority of payments are issued by the BOT, hence the revised weightage for this parameter

· Handling exceptions: resolving cases that are ambiguous and that the AI cannot classify (5% weightage)

Along with the KPI metrics defined above, some behaviours (both to encourage and to prevent) need to be defined, with a weightage of 5%. A few of these behaviours are listed below.

Behaviours that need to be encouraged

Validating and documenting the recommendations provided by AI
Using AI to reduce the workload which has repetitive and redundant steps but simultaneously reviewing it
Documenting the exact reason whenever manual intervention is done
Humans should take accountability and ownership of the outcomes even though AI is providing automated decisions

Unintended behaviours to prevent

Blindly following the AI’s recommendation and not reviewing it with human intel and logic
Ignoring or overlooking compliance and regulatory standards
Not documenting reasons for overriding AI recommendations
Blaming AI for providing incorrect information
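
As a sanity check on the weightages in this post, the sketch below encodes the pre-AI and post-AI scorecards and confirms that each adds up to 100%. The dictionary keys and the weighted_score helper are illustrative names only; the percentages are the ones listed above, with the 5% behaviours weightage included in the post-AI set.

```python
PRE_AI_WEIGHTS = {
    "payments_issued": 35,
    "quality": 45,
    "innovation": 10,
    "professionalism": 10,
}

POST_AI_WEIGHTS = {
    "payments_reviewed_within_tat": 20,
    "bot_vs_human_dashboard": 5,
    "quality_and_ai_flagging": 50,  # 40% quality + 10% incentive for flagging AI errors
    "innovation": 10,
    "professionalism": 5,
    "exception_handling": 5,
    "behaviours": 5,
}

def weighted_score(scores, weights):
    """Combine per-KPI scores (0-100) into one weighted score (0-100)."""
    if sum(weights.values()) != 100:
        raise ValueError("weightages must add up to 100")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[kpi] * scores[kpi] for kpi in weights) / 100

# Hypothetical processor who is strong on quality but slow on turnaround.
scores = {kpi: 80 for kpi in POST_AI_WEIGHTS}
scores["quality_and_ai_flagging"] = 100
scores["payments_reviewed_within_tat"] = 60
print(weighted_score(scores, POST_AI_WEIGHTS))
```

With half of the weight on quality, a processor who reviews carefully but a little slowly still scores well, which is the behaviour the revised weightages are meant to encourage.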

February 16

Concrete BPO Initiative: AI-assisted insurance claim adjudication/resolution (back office)

Why this process is relevant

Insurance claims processing is a classic BPO process in which agents read documentation, verify policy rules, identify inconsistencies, and approve, reject, or escalate a claim.

AI is now typically used in this workflow as:

A document extractor (extracts data from documents, forms, IDs)
A next-best-action recommender (approve vs reject vs request info)
A fraud-risk scorer (flags suspicious patterns)

This is decision-heavy work, which makes it an ideal illustration of how conventional measures (speed + throughput) can backfire once AI is implemented.

The Issue with the Existing Metrics When AI Is Involved

A standard agent/associate scorecard could include:

Average handling time (AHT)
Claims processed per day
Error rate (audit findings)
Escalation rate

These metrics are no longer neutral once AI makes suggestions. They begin to drive dangerous behavior.

For example:

If AHT remains the main indicator of success, agents may rubber-stamp AI approvals to go faster.
If audit mistakes are punished harshly, agents may avoid using AI to minimize perceived risk.
If escalations are penalized, agents may disregard fraud flags or avoid borderline cases.

The objective is therefore no longer "the fastest agent wins" but "the best human-AI decision outcomes win".

How Performance Expectations (Same Initiative) Should Change

1) Replace Speed First with Decision Quality Using AI Assistance

Earlier requirement: process 60 claims/day.

New requirement: make accurate decisions, apply AI in the right way, and stay within a throughput band.

AHT should be a supporting measure, not the primary success metric.

Practical procedure:

- Set a productivity band (e.g. 40-55 claims/day)

- Within this band, incentivize decision quality and responsible use of AI.

This discourages agents from chasing speed, without compromising capacity.

2) Apply AI Decision Metrics Instead of Just Output or Productivity Metrics

Metric A: AI Approval Agreement Rate (but it must have context)

This should not be “higher is better”.

Instead, track:

Agreement rate by claim type
Agreement rate by the AI’s confidence score
Agreement rate by agent experience level

What good looks like:

High agreement when the AI’s confidence is high and the claim is simple.
Lower agreement on complicated claims where uncertainty is involved.

This makes it safe to disagree with the AI where required.
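
Metric A can be sketched as an agreement rate split by the AI’s confidence. This is an illustration under assumptions: the record layout (confidence, AI action, agent action), the 0.8 cut-off between "high" and "low" confidence, and the sample decisions are all hypothetical.

```python
from collections import defaultdict

def agreement_by_confidence(decisions, high_threshold=0.8):
    """Agreement rate between agent and AI, split into high/low AI-confidence bands."""
    counts = defaultdict(lambda: [0, 0])  # band -> [agreed, total]
    for ai_confidence, ai_action, agent_action in decisions:
        band = "high" if ai_confidence >= high_threshold else "low"
        counts[band][1] += 1
        if ai_action == agent_action:
            counts[band][0] += 1
    return {band: agreed / total for band, (agreed, total) in counts.items()}

# Hypothetical decisions: (AI confidence, AI recommendation, agent's final action)
decisions = [
    (0.95, "approve", "approve"),
    (0.90, "approve", "approve"),
    (0.85, "reject", "reject"),
    (0.55, "approve", "escalate"),
    (0.40, "reject", "reject"),
]
rates = agreement_by_confidence(decisions)
print(rates)
```

A healthy pattern would show a higher rate in the "high" band than in the "low" band; near-100% agreement in both bands would be a warning sign of rubber-stamping rather than a success.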

Metric B: Override Quality Score

Perform sample QA on cases where an agent overrides the AI (approve vs reject vs escalate).

Reward:

Correct overrides
Properly documented reasoning
Correct escalation when the uncertainty is real

This counters the toxic behavior of: "Always obey the AI so that I cannot be blamed."

Metric C: Practical Escalation Rate

A rise in escalations is commonly considered inefficient. In AI-assisted claims, however, escalation is the safety valve.

Track:

Escalation appropriateness (QA)
Escalation timing (late or early)
Escalation reasons (missing docs, fraud risk, ambiguous policy)

What you want to stop: agents suppressing escalations so that they appear efficient.

3) Shift QA from “Agent Errors” to “System + Human Outcomes”

Old QA approach:

"Did the agent make a mistake?"

New QA approach:

Was the AI wrong?
Did the agent identify the AI’s weakness?
Was the final decision the right one?
Was the documentation sufficient for downstream compliance?

This matters so that agents do not use AI as a scapegoat or as a shield.

Practical implementation:

QA scorecards should consist of two elements:

Decision correctness
Quality of the decision reasoning (especially when AI confidence is low)

Which Behaviors to Promote (in This Initiative)?

Encourage these:

Question the AI when the signals disagree: the AI advises approval, but the documents do not match.
Start with AI but do not make it the ultimate decision maker: the AI summarizes, the agent or associate verifies.
Document why you overrode the AI: not long essays, just simple structured reasons.
Escalate early on fraud or policy ambiguity: in particular, when the AI raises a risk flag and evidence is missing.
Learn from AI errors: agents who highlight repeated AI errors should be recognized.

Which Unintended Behaviors to Avoid?

Avoid these:

Blind approval: "The AI said approve, so we approved."
Automation denial: agents refuse to use AI in order to protect themselves.
Metric gaming: fast approvals made to hit the throughput metric, leading to rework downstream.
Silent disagreement: agents override the AI without recording a reason (which kills learning loops).
Escalation suppression: avoiding escalations to look efficient, which increases fraud leakage.

Practical Updated Scorecard (for AI-Assisted Claims)

A practical scoring model could look like this:

1) Judgement Quality (40%)

Final judgement correctness (QA)
Compliance adherence

2) Responsible usage of AI (30%)

Proper use of AI recommendations
Override quality score
Proper rationale tagging

3) Productivity (20%)

Throughput within the expected band
AHT used as a guardrail (not a weapon)

4) Risk & Learning Contribution (10%)

Valid fraud escalations
Identifying patterns of AI failure

Practical Result of This Initiative (What Is Different for the Business)

Provided it is implemented in the right way, the results of this performance approach are measurable:

Reduced claims leakage (fewer wrong approvals)
Less rework (better first-pass decision-making)
Increased fraud capture rate (escalations are valued, not penalized)
Stable productivity (agents do not chase speed at the expense of everything else)
Faster AI improvement (override reasons feed back into training)

Bottom Line

Once AI enters claims adjudication, the job is no longer "process claims fast".

The job is to make defensible decisions with the help of AI.

Therefore performance measures should evolve to:

Quality of decisions + responsible use of AI judgement + regulated productivity.

This is how you get both adoption and safety, encouraging neither blind trust nor stubborn resistance.

February 16

AI doesn’t just automate tasks. It changes decision-making itself. If you keep measuring people the old way, you’ll get the wrong behaviours. Let’s explore this in a real healthcare process.

When AI Enters Clinical Workflow, Your Performance Metrics Must Change

Healthcare organizations are rapidly embedding AI into clinical workflows, from radiology triage to predictive risk scoring. But here’s the uncomfortable truth: if you introduce AI and keep legacy performance metrics, you will incentivize the wrong behaviours. And in healthcare, wrong behaviours don’t just affect KPIs. They affect patient safety, regulatory exposure, and institutional trust.

Let’s take a concrete example.

The Scenario: AI-Assisted Radiology

Imagine a radiology department using AI to detect lung nodules in chest CT scans. The system flags suspicious regions and provides confidence scores. The radiologist makes the final call.

Traditionally, radiologists are measured on:

Turnaround time
Volume of scans read
Diagnostic accuracy
Peer review discrepancy rates

These metrics worked in a fully human workflow.

They are insufficient and potentially dangerous in a human + AI workflow.

The Core Shift: From Individual Performance to Human - AI System Performance

When AI becomes part of the workflow, performance measurement must expand beyond output and accuracy.

It must measure:

Clinical outcomes
Human oversight quality
Interaction behaviour between clinician and AI
Risk monitoring participation
System-level safety contribution

Anything less creates blind spots.

Regulatory Reality: This Is No Longer Optional

Across jurisdictions, AI in medical diagnosis is classified as high-risk.

Under the EU AI Act, healthcare diagnostic AI requires:

Demonstrable human oversight
Risk management systems
Post-market monitoring
Auditability

In India, the Digital Personal Data Protection Act, 2023 reinforces accountability around data usage, transparency, and lawful processing.

In the U.S., the FDA’s AI/ML Software as a Medical Device framework demands real-world performance monitoring and controlled model updates.

What Should Change in Performance Measurement?

1. Measure Appropriate AI Engagement — Not Just Agreement

Blindly accepting AI output is not efficiency. Ignoring AI systematically is not expert judgment.

Instead, measure:

Documented overrides with clinical reasoning
Appropriate agreement rates
Escalation behaviour when AI and clinician disagree
Review of AI confidence levels

The goal is not high agreement. The goal is informed agreement.

2. Redesign Productivity Expectations

If you continue pushing volume-based targets without adjustment, clinicians will treat AI as a speed enhancer instead of a safety enhancer.

That leads to:

Automation bias
Shallow review
Over-trust in high-confidence outputs

Organizations should temporarily recalibrate turnaround expectations during AI integration and incorporate review-quality indicators. Speed without scrutiny is a liability multiplier.

3. Make AI Literacy a Performance Competency

If clinicians are expected to use AI responsibly, they must understand:

Model limitations
Bias risks
Confidence scoring
Update cycles

Performance systems should track:

AI training completion
Participation in model review sessions
Error reporting contributions

Responsible AI use is now a professional skill. Treat it that way.

4. Reward Risk Reporting, Not Just Low Error Rates

A department that reports zero AI errors is not necessarily safe. It may simply lack psychological safety.

Healthy AI-integrated systems track:

AI-related incident reporting rates
Detection time for model drift
Participation in post-market monitoring reviews

Transparency should improve evaluations, not harm them.

5. Shift from Hero Culture to System Culture

AI reduces variability when integrated properly.

Stop over-rewarding:

Highest individual throughput
Lone high performers who bypass AI

Start rewarding:

Contribution to system improvement
Peer collaboration during AI disagreements
Engagement in model validation reviews

The future is system excellence, not individual heroics.

Behaviours to Encourage

Critical engagement with AI outputs
Clear documentation of overrides
Responsible escalation of uncertainty
Continuous learning as models evolve
Transparent communication with patients about AI involvement

Behaviours to Prevent

Automation bias
Deskilling through over-reliance
Metric gaming (accepting AI to protect speed KPIs)
Blame shifting onto “the algorithm”
Using AI outside validated scope

The Strategic Imperative

Performance metrics shape behaviour. Behaviour shapes safety. Safety shapes regulatory exposure and public trust. AI does not reduce accountability. It redistributes and amplifies it.

Healthcare leaders who redesign performance systems alongside AI deployment will:

Strengthen regulatory defensibility
Improve patient outcomes
Build resilient digital culture
Future-proof their institutions

Those who don’t will spend years correcting preventable cultural damage. AI transformation is not a technology project. It is a governance project. And governance begins with what you choose to measure.

February 16

Domain: Manufacturing: Oils and Gases

Context:

In an air separation unit, the process is highly hazardous and very sensitive. Even in stable conditions it is always close to a process failure, which can be triggered by small errors or by internal or external noise. This creates a high safety risk to employees, the environment, the surroundings, and assets, owing to the critical characteristics of the products and the process and to sensitive operation at high pressures and temperatures, among other parameters. On-site supply to the on-site customer is also critical, so it is mandatory to stay vigilant and manage the process and plant consistently.

Intent:

The intent was to build a stable AI predictive and control model that ensures high “safety, customer service, and performance” right from the start of the process, and then maintains stable end-to-end process parameters, leading to better temperature and flow distribution and pressure ratios to attain the desired cryogenic product output.

How the AI model was built, and what considerations were made:

AI has emerged as an assistant, guide, and consultant: it reviews the current operating conditions from real-time online data, analyses them within a few seconds, and makes a suitable decision to bias the process so that the output matches the intended output, ensuring high safety, customer service, and performance.

To build the trust of the operators and the process operation team in the AI model, we involved them in the design. We considered all technical details and design conditions of the end-to-end process (for example, from the high-capacity air compressors and high-capacity turbines through to the communication flow to end-to-end stakeholders) and took on board every suggestion that called for inclusion in the AI model, since operators and process owners are the face of the process, are close to reality, and know the process very well.

Major methodology executed :

Risk analysis and brainstorming.
Suitable considerations were made, and solutions were identified and evaluated.
The solution was built and tested.
Simulations were run with the involvement of shop-floor operators and process owners.
Trust and empowerment of shop-floor operators and process owners were ensured by involving them in end-to-end development, through to go-live commissioning.
Control measures were taken to sustain the implementation and the trust that had been built.
New KPIs were defined and existing KPIs redefined to suit the upgraded, AI-assisted performance.

During deployment, the key question we had was: “How should performance metrics change when AI becomes part of the workflow?”

Redefining and reviewing the performance KPIs, and defining new KPIs

We used the concepts of the DMAIC project methodology: in each phase we defined the baseline performance of each KPI, took measurements, redefined existing KPIs, and created new KPIs, so that “control” is in place and the parameters and process remain stable and capable.

Approach :

Brainstorming :

As a first step, we conducted brainstorming to identify which KPIs should be redefined and restructured, and which new KPIs were needed.

The required KPIs were shortlisted using multi-voting and rated by experts and process owners.

Defining Baseline Performance :

The baseline was compared with previous performance, since the process had improved with AI predictive modelling.

Example : The baseline power consumption of the air compressor was 5200 kWh, in other words 185 kWh/t of oxygen produced. As a first-level change this baseline had to be redefined to 5132 kWh (168 kWh/t), because the air compressor's performance improved: power consumption fell as a result of the pressure profile distribution/reduction.
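To make the size of the baseline shift explicit, the arithmetic can be sketched as below. This is only an illustration using the four figures quoted in the example; the function name `percent_reduction` is an assumption for this sketch and not part of the plant's tooling.

```python
def percent_reduction(old: float, new: float) -> float:
    """Return the reduction from `old` to `new` as a percentage of `old`."""
    if old <= 0:
        raise ValueError("old baseline must be positive")
    return (old - new) / old * 100.0


# Figures quoted in the post: absolute consumption (kWh) and
# specific consumption (kWh per tonne of oxygen produced).
absolute_cut = percent_reduction(5200, 5132)
specific_cut = percent_reduction(185, 168)

print(f"Absolute consumption reduced by {absolute_cut:.2f}%")
print(f"Specific consumption reduced by {specific_cut:.2f}%")
```

The specific (per-tonne) figure moves far more than the absolute one, which is one reason to redefine the KPI target on the per-tonne basis rather than on total kWh alone.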

Measure phase concept :

MSA (Measurement System Analysis) was performed for all sensors: logical validation and calibration were carried out to ensure measurement accuracy. We also verified and reset the lower and higher span of the pressure control valves so that the valves can follow the fine-tuned, small variations provided by the AI predictive model.

Example : Earlier, the turbine discharge pressure control valve operated at 4.5 to 4.8 barg; with the new performance, the valve was set and tuned for a lower span of 3.9 barg.

Analyse Phase concept :

The existing process failure mode analysis was revisited: new RPNs were calculated and a new list of high-priority failures was identified. New failure modes were also identified in concurrence with the operators and process owner team, in order to look for new controls, especially by applying POKA YOKE.
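As a reminder of how an RPN is conventionally computed in failure mode analysis (severity × occurrence × detection, each rated 1 to 10), here is a minimal sketch. The failure modes, their ratings, and the cut-off of 100 are all hypothetical values chosen for illustration; they are not taken from the plant's actual analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (hazardous)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (undetectable)

    def __post_init__(self) -> None:
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def high_priority(modes, threshold=100):
    """Return failure modes at or above the threshold, highest RPN first."""
    selected = [m for m in modes if m.rpn >= threshold]
    return sorted(selected, key=lambda m: m.rpn, reverse=True)


# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("Compressor discharge pressure drift", 8, 4, 5),
    FailureMode("Control valve span set incorrectly", 7, 3, 3),
    FailureMode("AI model misses external noise", 9, 3, 6),
]

ranked = high_priority(modes)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

Re-rating after an AI deployment typically changes the occurrence and detection scores, which is why the priority list has to be regenerated rather than reused.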

The POKA YOKE methodology was redefined for all necessary parameters to ensure the safety of the system, keeping process, people, and asset safety as the priority.

POKA YOKE “Warning” :

Example : The settings of the existing “Alarm” were redefined to match the new AI predictive model results and the process change.

POKA YOKE “Control” :

The span settings of the pressure and temperature transmitters were changed to bring in tighter control than before.
New measurement points were identified and installed during plant shutdown.

POKA YOKE “Shutdown” :

The shutdown settings were reviewed against the new AI process requirements, so that the process and plant shut down when the AI fails to predict internal or external noise that could affect safety and lead to an undesired or uncontrollable process deviation.

Control Phase concepts implemented :

The new baseline, new specification limits, targets, and MIN and MAX measurement settings were included.
The frequency of new alarms (alarm count) was included to track how many times the process deviated.
Communication and reports are circulated to all concerned functions and teams.
Three-shift technical support was provided until the performance was established across all team members on all shifts.
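The alarm-count KPI in the list above can be sketched as a count of distinct excursions outside the MIN and MAX limits. This is a minimal illustration: the readings and the limits passed in are made-up values, and counting "entries into alarm" rather than every out-of-limit sample is an assumption about how such a KPI might be defined.

```python
def count_alarm_events(readings, low, high):
    """Count how many times the signal enters an out-of-limits state.

    Consecutive out-of-limit samples count as one event, so a long
    excursion is not counted once per sample.
    """
    events = 0
    in_alarm = False
    for value in readings:
        out_of_limits = value < low or value > high
        if out_of_limits and not in_alarm:
            events += 1
        in_alarm = out_of_limits
    return events


# Hypothetical pressure readings (barg) and limits, for illustration only.
readings = [4.2, 4.3, 4.9, 5.0, 4.4, 3.8, 4.1, 4.85]
events = count_alarm_events(readings, low=3.9, high=4.8)
print(f"Alarm events in this window: {events}")
```

Tracked per shift, a count like this shows whether deviations are becoming rarer after go-live or whether operators are simply acknowledging more alarms.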

Conclusion :

The AI model provides solutions that probably come close to the stable process required by the standards. Even so, a strong workflow built with end-to-end thinking is still needed to avoid and address any deviation or failure, so that the process runs stable and capable enough to deliver safety as the first priority, along with better customer service and high efficiency.

February 16

In the Operations domain, we've deployed a few AI applications. Although we hope that AI results are always better than manual work, there have been times when the AI output has not been up to the mark.

In traditional operations work, a lot of manual or automated work is done on a daily basis. Tasks are normally repetitive, but time and again new scenarios come up that require human input.

Metrics normally depend on what kind of process we are running in the organization.

In a traditional metrics environment, we use a balanced scorecard covering the production and people aspects. For example, in Reverse Supply Chain:

- Units Received

- Units WIP

- Units in Ready State

- Units Sold

- Margin

- Man hours utilized

- QA

And a few more.

With AI, the business metrics will not change much, but we should see improvements across the board. For example:

- Received, WIP, and Ready State cycle times should be reduced

- Man-hours utilized should show a significant change

- Error count in QA should be reduced significantly

The behaviors that we should encourage:

- Monitor the AI output and ensure its accuracy

- Make callouts early so that damage can be contained as soon as possible

- Look for opportunity areas to implement AI

- Check for cycle time benefits

- Assess the cost benefit of implementing AI

February 16

As a follower of the cards and payments industry, I think one of the areas where AI is most useful is the detection of fraudulent transactions. The increased use of AI to find these transactions is very helpful, but it also produces false positives. These false positives could be more costly than missing some actual fraudulent transactions. Balancing speed and correctness is of utmost importance. Metrics like the number of suspicious transactions flagged or the number of incidents handled will not be the right measures once AI is involved.

The emphasis should be more on metrics like how many new patterns were discovered, the trend in the reduction of false positives, and the capture of real fraud. Human involvement will remain, to identify scenarios that AI has missed and to make sure compliance is met. The anomalies detected by AI should be scored, and a threshold should be determined to map out the fraudulent transactions.
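The scoring-and-threshold idea can be sketched as follows: given anomaly scores and a confirmed fraud label for each transaction, compute how many real frauds are captured and how many legitimate transactions are wrongly flagged at a given threshold. The scores, labels, and thresholds below are invented for illustration, and the function name `evaluate_threshold` is an assumption for this sketch.

```python
def evaluate_threshold(scored, threshold):
    """Summarise flagging quality at one threshold.

    `scored` is a list of (anomaly_score, is_fraud) pairs; a transaction
    is flagged when its score is at or above the threshold.
    """
    tp = fp = fn = tn = 0
    for score, is_fraud in scored:
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
        else:
            tn += 1
    return {
        "true_positives": tp,
        "false_positives": fp,
        "missed_fraud": fn,
        # Share of flagged transactions that were real fraud.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real fraud that was captured.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of legitimate transactions wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical scored transactions: (anomaly score, confirmed fraud?)
scored = [
    (0.95, True), (0.80, True), (0.75, False), (0.60, True),
    (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]

strict = evaluate_threshold(scored, threshold=0.7)
loose = evaluate_threshold(scored, threshold=0.5)
print("threshold 0.7:", strict)
print("threshold 0.5:", loose)
```

Comparing thresholds side by side like this replaces the raw "number of transactions flagged" count with measures of captured fraud and false positives, which is the shift the post argues for.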

Encouraged behavior includes people using AI in assistive mode to refine detection models so that they catch newer methods of fraud. It also includes quicker, continuous feedback loops in which team members can override AI decisions and agents learn from those overrides.

Overreliance on AI solutions should be avoided. A correct and valid but slightly unusual transaction might be flagged. This could lead to customer dissatisfaction, which could be more costly. Agent training sets should be diverse so that bias does not creep in.

February 17

I am from a scientific publishing solutions company, and to answer this question I would like to talk about an AI solution that we recently implemented in our customer service department.

Customer service in the scientific publishing domain mainly resolves authors' queries about their manuscripts and articles. Some queries are very complex, such as corrections or typesetting-related changes in the articles. Others are very simple, such as the current stage of an article, the timelines, or the expected publication date.

Earlier, the author query resolution process was entirely manual, but recently we introduced AI (vendor) automation, and AI now handles most author queries. I would say around 80% of author queries are now resolved by AI and automation.

Changing the metrics of success:

While the process for handling author queries was manual, the common metrics were average handling time (AHT) and volume of tickets. Once we introduced AI and automation, we had to look for other metrics.

Handling time has already been reduced significantly by AI, so average handling time is no longer the relevant metric. Instead, we need to focus on the accuracy of the AI's responses.

Since AI cannot handle every kind of query, a few categories still need human intervention. These are the most critical ones, and their average handling time has also changed, because handling them manually needs a large amount of time. Where we earlier had an average handling time of two to three days, for these special cases it has now increased to 5 days, and that is how we started tracking the revised metric.

Importantly, in the scientific publishing industry we need to be very careful about what content we approve and what details we share with authors. We cannot rely on automation and AI 100%, so a human in the loop will always be there. I would not quite call it a new metric, but we do identify the percentage of touchless responses: earlier we tracked the volume of tickets, and now we track what percentage of that volume is handled using AI. That is another metric we introduced when we moved from the manual to the AI system.
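The three measures described above (touchless rate, AI response accuracy, and handling time for the cases that still go to humans) can be sketched as below. The `Ticket` fields and the sample data are hypothetical and chosen only to illustrate the calculations; in particular, measuring accuracy on a human-audited sample of AI responses is an assumption, not something the post specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ticket:
    handled_by_ai: bool
    # Result of a human audit of the AI response; None if not audited.
    ai_response_correct: Optional[bool] = None
    # Handling time in days, recorded for manually handled tickets.
    handling_days: Optional[float] = None


def touchless_rate(tickets):
    """Share of all tickets resolved by AI without human handling."""
    return sum(t.handled_by_ai for t in tickets) / len(tickets)


def ai_accuracy(tickets):
    """Share of audited AI responses that were judged correct."""
    audited = [t for t in tickets
               if t.handled_by_ai and t.ai_response_correct is not None]
    return sum(t.ai_response_correct for t in audited) / len(audited)


def manual_aht_days(tickets):
    """Average handling time, in days, of manually handled tickets."""
    manual = [t.handling_days for t in tickets
              if not t.handled_by_ai and t.handling_days is not None]
    return sum(manual) / len(manual)


# Hypothetical sample: 8 AI-handled tickets (5 of them audited), 2 manual.
tickets = (
    [Ticket(True, ai_response_correct=True) for _ in range(4)]
    + [Ticket(True, ai_response_correct=False)]
    + [Ticket(True) for _ in range(3)]
    + [Ticket(False, handling_days=4.0), Ticket(False, handling_days=6.0)]
)

print(f"Touchless rate: {touchless_rate(tickets):.0%}")
print(f"AI accuracy (audited sample): {ai_accuracy(tickets):.0%}")
print(f"Manual AHT: {manual_aht_days(tickets):.1f} days")
```

Reporting the touchless rate next to the audited accuracy keeps the first number honest: a high touchless share means little if the sampled AI responses turn out to be wrong.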

To conclude, I would say that with the introduction of AI, the conventional manual-work metrics have changed and new AI-oriented metrics have been introduced. For the areas where manual work is still required, the conventional metrics remain, but their target values have changed.

February 17

Author

✅ Vijay Yivaturi — Strongly tied to a real FOAA ARM process; clear incentive redesign via rejection reason codes, dashboards, and measurable quality outcomes; good linkage between behavior and outcomes.
🟡 Kush Singh — Relevant process (banking payments) and the “blind trust vs resistance” risk is well stated, but the revised scorecard needs 2–3 concrete measures and examples of encouraged/penalized behaviors. In future, anchor your metrics and behaviors with a concrete example and thresholds.
✅ Himanshu Lohani — Clear specific process (document extraction + exception queue) and a practical split of human evaluation: validating AI success + diagnosing AI failure; good responsibility framing.
✅ iambpawan — Excellent concrete process (AI co-pilot in IT service desk) with crisp revised metrics (validation rate, KB contribution, complex-ticket FCR) + clear behaviors to encourage/prevent; very practical.
✅ Preethi Bijesh — Strong process (motor accident claims) and very thoughtful incentive alignment: separates human vs AI error, emphasizes decision quality and override justification; good unintended-behavior prevention.
🟡 Abhishek Chaudhary — Relevant example (demand forecasting) and right direction (override impact KPIs), but it’s too brief—needs a clearer “new scorecard” and examples of gaming behaviors to prevent. In future, include a clearer metric set and an example of how it changes decisions.
✅ Vijay Gonsalves — Good specificity (payments bot thresholding) with revised weightages, exception handling, and behavior guardrails; practical and directly usable.
✅ Taby Sheikh — Very strong, concrete process (AI-assisted claims adjudication) with a structured scorecard and clear incentive logic; well covers “agreement ≠ good” and escalation behavior.
✅ Jinad Padiyath — Strong concrete healthcare use case; good incentive redesign and behavior controls; slightly heavy on regulatory references but still coherent and practical.
❌ Dhruva Kapur — Not specific enough to one process; mostly general thoughts about cycle time decomposition.
🟡 Bharath CN — Specific high-risk process and lots of detail, but the answer drifts into DMAIC narrative; the “people performance measures” piece needs to be sharper. In future, state the revised KPIs and the exact behaviors they drive (good vs bad).
✅ Manish Gupta — Explains how performance metrics and their movement can be tracked better and regularly using AI.
🟡 Anil Kumar (CAISA) — Relevant domain (fraud) but still broad; needs a specific workflow step (e.g., sanctions alert adjudication) and explicit metrics for overrides, false positives, and accountability. In future, use one concrete process and define 3–5 measurable success indicators.
✅ Aditya Bhavsar — Clear process (AI handling author queries) and sensible metric shift (accuracy + touchless rate + revised targets for complex cases); good practicality.

🏆 Best answer : iambpawan — most crisp, process-specific, and incentive-aligned with clear behaviors and safeguards.
