Benchmark Six Sigma Forum

What Should Teams Learn When AI Advice Is Ignored — or Proven Wrong?

Featured Replies

Q845

In AI-assisted processes, there will be moments when:

  • A team ignores the AI’s recommendation and later discovers the AI was right.

  • The team follows the AI’s recommendation and later finds it was flawed.


These moments are not just operational events — they are powerful learning opportunities. Think of a specific process in your domain where AI provides recommendations.
How should teams systematically learn from such cases?

What questions should be asked, and how should those insights improve both human judgment and the AI system itself?

⚠️ Any answer that is generic or does not connect with a specific process will not be approved.

🏆 The best answer will be selected on the basis of:

  • Relevance and clarity of the chosen process

  • Depth of insight into learning from divergence

  • Practicality of the feedback and improvement approach

Solved by Tabrez Shaikh

There can definitely be instances where AI predictions are ignored or, on the flip side, are flawed. It all comes down to the confusion matrix, and for the sake of example I would like to discuss two of its cells.

  1. False Positive - the model or agent predicts positive, but the actual value is negative.

  2. False Negative - the model or agent predicts negative, but the actual value is positive.

Item 1 corresponds to a Type I error (alpha risk) in hypothesis testing, and item 2 to a Type II error (beta risk).

In many processes a False Positive does less harm, while a False Negative can be really harmful.

In our organisation there is a team that monitors models, otherwise known as ML Ops, where the model owner continuously studies model behaviour by means of SHAP values.

The model owner also tracks the recall rate, and the moment the recall rate declines, human intervention (human in the loop) is triggered.

Usually the recall target is set high so that model predictions can be studied properly.

Recall is the proportion of actual positives that the model correctly identifies, expressed as a percentage or a fraction.

Recall = True Positives / (True Positives + False Negatives).

This whole process is called Model Monitoring. Other measures, such as R-squared and decision thresholds, are also important in a model monitoring process.
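To make the recall check concrete, here is a minimal sketch of the formula above together with a human-in-the-loop trigger. The function names and the 0.90 floor are assumptions for illustration, not values from our monitoring setup.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        return 0.0
    return true_positives / actual_positives


def needs_human_review(true_positives: int, false_negatives: int,
                       recall_floor: float = 0.90) -> bool:
    """Trigger human-in-the-loop review when recall falls below the floor.

    recall_floor is a hypothetical threshold chosen for this example.
    """
    return recall(true_positives, false_negatives) < recall_floor


# Example: 85 actual positives caught, 15 missed.
print(recall(85, 15))              # 0.85
print(needs_human_review(85, 15))  # True, because 0.85 is below 0.90
```

In practice the counts would come from a recent window of labelled predictions, so that a decline shows up quickly.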

Context: Determining the tax mapping rules for tax calculation for non-resident ESS (electronically supplied services) entities in Canadian provinces, where multiple tax rates apply to sellers and buyers, and where registrations and exceptions must be followed based on country-specific rules in the three regions below.

This is one of the most difficult phases in Indirect ESS taxation and the following is an interesting case study.

PST Province - British Columbia: Non-CA resident (‘Foreign’) service providers supplying 15,000 CAD (11K USD) of in-scope digital services will be required to register, collect and remit BC PST at the standard rate of 7% to BC-based customers. BC PST is a sales and use-type tax. The rules in BC do not distinguish between B2B and B2C transactions i.e. both B2B and B2C supplies are in-scope and subject to BC PST at 7%. Certain customer-based exemptions may apply.

GST\HST Province - Canada Federal: Non-Canadian resident (‘Foreign’) service providers supplying in-scope ESS to CA-based customers must register, collect and remit GST/HST at the standard rate of 5% for customers based in Quebec and British Columbia provinces. There is a GST/HST registration threshold for foreign service providers of ESS and marketplace operators amounting to CAD 30,000 (approx. USD 22K) over a 12-month period (both prospectively and retrospectively). Where a business or digital platform operates with a simplified GST/HST registration, the rules distinguish between B2B and B2C transactions, where only in-scope B2C transactions will be subject to GST/HST of 5%. However, where businesses or digital platforms are registered or proceed to register under the regular GST/HST regime, all goods/services sold will be subject to GST/HST regardless of the B2B/B2C distinction.

QST Province - Quebec: The legislation provides that supplies of digital goods and services made to consumers in Quebec by companies not located in Quebec, and which are not already registered to collect QST, be taxed at the standard QST rate of 9.975%. Further, GST/HST-registered non-resident businesses which are not already registered to collect QST under the existing QST regime will be required to collect QST at 9.975% on sales of physical goods to consumers in Quebec. The registration threshold is $30,000 CAD (approx. $22K) of in-scope supplies made to consumers in Quebec in the preceding rolling 12-month period. Consumers are defined as people (individuals or businesses) who ordinarily reside in Quebec and are NOT registered for QST (i.e., generally B2C transactions). Residency can be determined by using two non-contradictory pieces of information: QST registration and address. Whether a customer is considered a business consumer can be determined by requesting that the customer provide a valid QST number. If a valid QST number is provided, then QST is not required to be charged.
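As a rough illustration of how the three rule sets above could be written down, here is a simplified sketch. It assumes the seller is already over every threshold and registered for each tax, ignores customer exemptions, and covers only BC and QC; the function and parameter names are made up for this example and this is not the production logic.

```python
from decimal import Decimal

# Rates as quoted in the post; not a substitute for the legislation.
GST_RATE = Decimal("0.05")
BC_PST_RATE = Decimal("0.07")
QST_RATE = Decimal("0.09975")


def applicable_taxes(province: str, is_b2c: bool,
                     customer_has_valid_qst_number: bool = False,
                     simplified_gst_registration: bool = True) -> dict:
    """Return {tax_name: rate} for an in-scope digital sale (simplified sketch)."""
    if province not in ("BC", "QC"):
        raise ValueError("this sketch only covers BC and QC")
    taxes = {}
    # GST: simplified registration taxes B2C only; the regular regime taxes all sales.
    if is_b2c or not simplified_gst_registration:
        taxes["GST"] = GST_RATE
    if province == "BC":
        # BC PST does not distinguish between B2B and B2C.
        taxes["PST"] = BC_PST_RATE
    elif not customer_has_valid_qst_number:
        # Quebec: QST is charged unless the customer provides a valid QST number.
        taxes["QST"] = QST_RATE
    return taxes


# A B2C sale of CAD 100 to a BC customer: GST 5% + PST 7%.
rates = applicable_taxes("BC", is_b2c=True)
print(sum(rates.values()) * Decimal("100"))  # 12.00
```

Keeping the rules in one small function like this makes it easier for a reviewer to compare the AI-generated logic against the written rules line by line.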

After the tax business rules, threshold amounts, tax rates, exceptions, customer and seller locations (addresses), and tax registration numbers were updated, the AI created the logic correctly. However, the preparers ignored the tax calculations produced by the AI's logic. In fact, they overrode the calculations, applied incorrect rates, and remitted tax as below:

For the British Columbia PST registration, they applied the 5% GST rate instead of the 7% PST rate, which resulted in an underpayment of approx. CAD $4.9M.

For the Quebec QST registration, they applied the 5% GST rate instead of the 9.975% QST rate, which resulted in an underpayment of approx. $24.7M.

For the British Columbia and Quebec provinces under Canada Federal, they applied PST at 7% and QST at 9.975% instead of GST at 5%, which resulted in an overpayment of approx. $48.9M.

It was later identified that the tax calculations produced by the AI's logic were accurate, and the team ended up amending the filings and paying penalties and interest.

In the above example, the AI provided accurate tax calculations. In the second example, however, the AI was at fault while designing the balance sheet liability accounts: my team blindly followed the AI's recommendation and ended up posting balance sheet reclass journals.

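Region                     | Tax Type and Rate | Correct Balance Sheet account for liability | Incorrectly recorded in multiple accounts | Amount (CAD)
British Columbia (BC)      | PST – 7%          | 26113                                       | 26435                                     | $4.8M
British Columbia (BC)      | PST – 7%          | 26113                                       | 26111                                     | $1.1M
British Columbia (BC)      | PST – 7%          | 26113                                       | 26427                                     | $3.4M
Canada Federal (BC and QC) | GST\HST – 5%      | 26435                                       | 26113                                     | $9.7M
Canada Federal (BC and QC) | GST\HST – 5%      | 26435                                       | 26436                                     | $19.3M
Canada Federal (BC and QC) | GST\HST – 5%      | 26435                                       | 26111                                     | $22.5M
Canada Federal (BC and QC) | GST\HST – 5%      | 26435                                       | 26427                                     | $34.3M
Quebec (QC)                | QST – 9.975%      | 26436                                       | 26435                                     | $54.M
Quebec (QC)                | QST – 9.975%      | 26436                                       | 26427                                     | $41.6M
Quebec (QC)                | QST – 9.975%      | 26436                                       | 26111                                     | $2.3M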

The AI was supposed to design the balance sheet liability mapping based on region and tax rate only. However, it also took into account additional "logic" or "terms" such as sales (26111), non-resident (26427), and virtual location (26435), which resulted in the incorrect balance sheet mapping. This had a high impact on balance sheet account reconciliations and led to rework and correction journals for almost 7 months. After investigating the root cause, I fixed the balance sheet tax mapping rules, which resulted in the right liability account allocation.
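A check like the following could have caught the mis-postings early. It is only a sketch: the account numbers come from the table in this post, while the region labels, the field names and the `find_misposted` helper are assumptions for illustration.

```python
# Correct liability account per (region, tax type), as listed in the post.
CORRECT_ACCOUNT = {
    ("BC", "PST"): "26113",
    ("Federal", "GST/HST"): "26435",
    ("QC", "QST"): "26436",
}


def find_misposted(postings):
    """Return postings whose account differs from the one implied by region and tax type."""
    return [
        p for p in postings
        if p["account"] != CORRECT_ACCOUNT[(p["region"], p["tax_type"])]
    ]


# Hypothetical journal lines: the second one uses the "sales" account 26111.
postings = [
    {"region": "BC", "tax_type": "PST", "account": "26113"},
    {"region": "BC", "tax_type": "PST", "account": "26111"},
    {"region": "QC", "tax_type": "QST", "account": "26436"},
]
for bad in find_misposted(postings):
    print(bad["region"], bad["tax_type"], "posted to", bad["account"])
```

The point of the design is that the lookup key contains only region and tax type, so extra terms such as "sales" or "non-resident" cannot influence the account.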

At every phase, teams should thoroughly study and understand the logic the AI proposes before it enters production. This avoids rework and financial impact, saves time, and avoids unwanted attention from tax departments.

Any overridden calculation should be thoroughly reviewed and approved, and the AI's logic should be amended accordingly with proper approvals in place. This must be documented in SOPs to avoid repeating the error. It is also suggested to build flags into the AI tools that highlight overridden tax calculations (with reference to country-specific examples) to both preparers and reviewers, so that controls are in place before the final tax amounts are approved.
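The override flag suggested above could look something like this. It is a minimal sketch with made-up field names and a hypothetical one-cent tolerance; a real tool would also record who overrode the rate and who approved it.

```python
from decimal import Decimal


def flag_overrides(lines, tolerance=Decimal("0.01")):
    """Flag lines where the applied tax differs from the AI-calculated tax."""
    flagged = []
    for line in lines:
        ai_tax = line["taxable_amount"] * line["ai_rate"]
        applied_tax = line["taxable_amount"] * line["applied_rate"]
        difference = applied_tax - ai_tax
        if abs(difference) > tolerance:
            # Negative difference = underpayment, positive = overpayment.
            flagged.append({**line, "difference": difference, "needs_approval": True})
    return flagged


# Hypothetical BC line where 5% GST was applied instead of 7% PST.
lines = [
    {"region": "BC", "taxable_amount": Decimal("1000"),
     "ai_rate": Decimal("0.07"), "applied_rate": Decimal("0.05")},
    {"region": "QC", "taxable_amount": Decimal("1000"),
     "ai_rate": Decimal("0.09975"), "applied_rate": Decimal("0.09975")},
]
for item in flag_overrides(lines):
    print(item["region"], item["difference"])  # BC -20.00
```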

  • Solution

Practical BPO process: AI-based dispute and chargeback triage for a FinTech / e-commerce client.

Why this process fits well

BPO dispute/chargeback work is so high-volume and time-sensitive that agents have to make decisions with limited information. The most common areas where the AI makes recommendations include:

Triage resolution (accept dispute, contest, seek more evidence)

Priority level (SLA risk)

The list of evidence to check (what documents to retrieve)

Win probability (probability of succeeding in contest)

It is an ideal place to learn because the consequences of decisions are visible: win/loss, financial cost, SLA breach, customer escalation.

The two learning moments (within the same process)

Moment A: The team ignores the AI, and the AI is later proven right.

Example:

The AI recommends, with high confidence: contest the dispute and include delivery proof and the customer IP match.

The agent accepts the dispute because they believe it will save time.

Two days later, the client's internal audit shows that the dispute was winnable and the company lost money.

Moment B: The team follows the AI, and the AI is later proven wrong.

Example:

The AI recommends Contest and auto-generates the evidence list.

The agent follows it.

The dispute is subsequently lost because the AI did not factor in one crucial rule: this transaction type required a different form of evidence, which the submission did not include.

How teams should learn from these cases systematically

Step 1: Treat every divergence as a Decision Incident.

Rather than blaming the human or the AI, teams record a structured "incident" whenever:

AI advice is overridden, or

AI advice is followed and the outcome turns out badly.

This is not an exception log; it is a learning pipeline.

Minimum fields to capture

Dispute type and dispute reason code

AI recommendation + confidence + explanation

Human decision + rationale (mandatory drop-down + optional note)

Outcome (win/loss, cost, SLA)

Evidence submitted and rejection reason (where applicable)
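The minimum fields above map naturally onto a small record type. The sketch below is one possible shape; every field name and the example values are assumptions for illustration, not an existing schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DecisionIncident:
    """One logged divergence between AI advice and the final result."""
    dispute_type: str
    reason_code: str
    ai_recommendation: str          # e.g. "accept", "contest", "seek_evidence"
    ai_confidence: float            # 0.0 to 1.0
    ai_explanation: str
    human_decision: str
    human_rationale: str            # value picked from the mandatory drop-down
    outcome: str                    # "win" or "loss"
    cost: float
    sla_met: bool
    human_note: Optional[str] = None
    evidence_submitted: List[str] = field(default_factory=list)
    rejection_reason: Optional[str] = None

    @property
    def overridden(self) -> bool:
        """True when the agent did not follow the AI recommendation."""
        return self.human_decision != self.ai_recommendation


# Hypothetical Moment A: AI said contest, agent accepted to save time.
incident = DecisionIncident(
    dispute_type="chargeback", reason_code="item_not_received",
    ai_recommendation="contest", ai_confidence=0.92,
    ai_explanation="delivery proof + customer IP match",
    human_decision="accept", human_rationale="Too busy",
    outcome="loss", cost=120.0, sla_met=True,
)
print(incident.overridden)  # True
```

`asdict(incident)` gives a plain dictionary that can be written to whatever log or table the team already uses.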

Step 2: Ask the right questions explicitly (they differ by situation)

A) Cases where the team ignored the AI and the AI was right.

Objective: improve human trust in, and adoption of, sound AI advice.

Key questions

What signal did the AI pick up that the agent was not aware of?

(e.g. IP match, device fingerprint, delivery signature)

Was the AI's explanation understandable at decision time?

If the AI was correct but the explanation was not visible, that is a UX failure.

Did the agent override because of workflow pressure?

Example: accepting is the faster option when queue volume spikes.

Was the override based on outdated tribal knowledge?

Typical in BPO: people believe "we tend to lose these", even after the policy has changed.

Were incentives misaligned?

If agents are rewarded on speed, they will ignore sound AI advice.

Practical outcomes

Update the contest playbook with the patterns behind high-confidence AI recommendations.

Add micro-training: a 10-minute weekly review of 3 cases where the AI was right.

Adjust KPIs: reward net recovery, not only AHT (average handle time).

B) Cases where the team followed the AI and the AI was wrong.

Objective: improve AI reliability and human verification behaviour.

Key questions

Did the AI fail because of missing data or faulty logic?

Missing data: the data was not available in the system.

Faulty logic: the model misinterpreted the rules.

Was this a rule-change situation?

Chargeback policies change often. An AI can silently become outdated after an update.

Was the confidence well calibrated?

If the AI was highly confident and wrong, that is dangerous.

If it was low confidence and agents still treated the AI's output as fact, that is a training failure.

What did the agent fail to verify because the AI sounded convincing?

This is automation bias.

Can we create a "must-check" checklist for high-risk cases?

Example: transaction type, reason code, evidence type, due date.

Practical outcomes

Add guardrails: e.g. for reason code X, two validations must pass before a recommendation is issued.

Introduce policy-aware checks (a policy check layer).

Use failure cases as labelled examples for retraining.

Display confidence more carefully ("High confidence" only when the rule checks also pass).
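To check whether confidence is calibrated, one simple approach is to compare how often the AI was right in the high-confidence bucket versus the rest, and to show the "High confidence" label only when the rule checks pass as well. The sketch below assumes a hypothetical 0.8 cut-off and a made-up fallback label.

```python
from collections import defaultdict


def accuracy_by_confidence(cases, high=0.8):
    """cases: iterable of (ai_confidence, ai_was_right). Returns accuracy per bucket."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, ai_was_right in cases:
        label = "high" if confidence >= high else "low"
        buckets[label][1] += 1
        if ai_was_right:
            buckets[label][0] += 1
    return {label: correct / total for label, (correct, total) in buckets.items()}


def displayed_confidence(confidence, rules_passed, high=0.8):
    """Only show 'High confidence' when the model score AND the rule checks agree."""
    if confidence >= high and rules_passed:
        return "High confidence"
    return "Needs verification"  # hypothetical fallback label


# Hypothetical closed cases: (confidence, was the AI right?)
cases = [(0.9, True), (0.95, False), (0.85, True), (0.9, True), (0.5, True), (0.4, False)]
print(accuracy_by_confidence(cases))      # {'high': 0.75, 'low': 0.5}
print(displayed_confidence(0.95, False))  # Needs verification
```

If the high bucket is not clearly more accurate than the low bucket, the confidence score is not telling agents much and should not be displayed as if it were.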

Step 3: Turn insights into changes (for both humans and the AI).

A. Improve human decision-making.

1) Provide a dashboard showing the top 10 override reasons.

This can be used to identify trends such as:

"Too busy"

"Didn't trust AI"

"Didn't understand the explanation"

"Customer is VIP"

"Evidence retrieval takes too long"
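The data behind such a dashboard can be as simple as a count of the drop-down rationale recorded on each override. A minimal sketch, with made-up sample data:

```python
from collections import Counter


def top_override_reasons(reasons, n=10):
    """Return the n most frequent override reasons as (reason, count) pairs."""
    return Counter(reasons).most_common(n)


# Hypothetical rationale values taken from logged overrides.
reasons = ["Too busy", "Didn't trust AI", "Too busy",
           "Customer is VIP", "Too busy", "Didn't trust AI"]
print(top_override_reasons(reasons, n=3))
# [('Too busy', 3), ("Didn't trust AI", 2), ('Customer is VIP', 1)]
```

This only works if the rationale is a mandatory drop-down; free-text notes would need to be categorised first.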

2) Introduce calibrated autonomy.

Low-risk cases: the AI's routing is applied automatically.

High-value or high-ambiguity cases: a human reviews them.

AI-human disagreement cases: the team lead reviews them.
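The three tiers above can be expressed as one small routing function. This is a sketch only: the CAD 500 value cut-off, the 0.8 confidence cut-off and the queue names are assumptions, and "ambiguity" is approximated here by low AI confidence.

```python
def route_case(amount, ai_confidence, ai_recommendation, agent_decision=None,
               high_value=500.0, min_confidence=0.8):
    """Decide who handles a dispute under calibrated autonomy (illustrative thresholds)."""
    # Any AI-human disagreement goes to the team lead.
    if agent_decision is not None and agent_decision != ai_recommendation:
        return "team_lead_review"
    # High-value or ambiguous (low-confidence) cases need a human reviewer.
    if amount >= high_value or ai_confidence < min_confidence:
        return "human_review"
    # Everything else is low risk: apply the AI's routing automatically.
    return "auto_route"


print(route_case(50.0, 0.95, "accept"))                            # auto_route
print(route_case(900.0, 0.95, "contest"))                          # human_review
print(route_case(50.0, 0.95, "contest", agent_decision="accept"))  # team_lead_review
```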

3) Hold a 30-minute calibration session.

Not a blame meeting, just:

2 cases where AI beat humans

2 cases where humans beat AI

1 case where both failed (process problem)

B. Improve the AI system.

1) Prepare a disagreement training kit.

The most valuable data is:

AI said contest, human accepted (or vice versa)

The outcome shows who was right.

These become "gold" labels.
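One way to assemble such a kit is to filter the incident log for disagreements with a known result. The sketch below assumes each incident carries a hypothetical `correct_action` field that is filled in once the outcome or audit shows what the right call was; all field names are illustrative.

```python
def build_gold_labels(incidents):
    """Keep only AI-human disagreements whose correct action is known."""
    gold = []
    for inc in incidents:
        if inc["ai_recommendation"] == inc["human_decision"]:
            continue  # no disagreement, so not part of the kit
        correct = inc.get("correct_action")
        if correct is None:
            continue  # outcome not known yet
        gold.append({
            "case_id": inc["case_id"],
            "label": correct,
            "ai_was_right": inc["ai_recommendation"] == correct,
        })
    return gold


# Hypothetical incident log entries.
incidents = [
    {"case_id": 1, "ai_recommendation": "contest", "human_decision": "accept",
     "correct_action": "contest"},
    {"case_id": 2, "ai_recommendation": "contest", "human_decision": "contest",
     "correct_action": "accept"},
    {"case_id": 3, "ai_recommendation": "accept", "human_decision": "contest",
     "correct_action": None},
]
print(build_gold_labels(incidents))
# [{'case_id': 1, 'label': 'contest', 'ai_was_right': True}]
```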

2) Add a policy/rule validation layer.

Many AI failures in disputes are not problems of model intelligence but of rule compliance.

So implement checks for:

reason code → required evidence format

transaction type → eligibility to contest

due date → feasibility of submission
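A rule layer of this kind can sit between the AI's recommendation and the agent's screen. The sketch below is illustrative only: the reason codes, evidence names and transaction types in the lookup tables are invented, and real values would come from the card scheme and client policy.

```python
from datetime import date

# Hypothetical policy tables; real ones come from scheme rules and client policy.
REQUIRED_EVIDENCE = {
    "item_not_received": {"delivery_proof"},
    "fraud": {"ip_match", "device_fingerprint"},
}
CONTESTABLE_TRANSACTION_TYPES = {"card_not_present", "card_present"}


def validate_contest(reason_code, transaction_type, evidence, due_date, today):
    """Return a list of rule violations; an empty list means the contest may proceed."""
    problems = []
    missing = REQUIRED_EVIDENCE.get(reason_code, set()) - set(evidence)
    if missing:
        problems.append(f"missing evidence: {sorted(missing)}")
    if transaction_type not in CONTESTABLE_TRANSACTION_TYPES:
        problems.append(f"transaction type not eligible: {transaction_type}")
    if today > due_date:
        problems.append("submission deadline has passed")
    return problems


print(validate_contest("item_not_received", "card_not_present",
                       ["delivery_proof"], date(2024, 5, 10), date(2024, 5, 1)))  # []
print(validate_contest("fraud", "card_not_present",
                       ["ip_match"], date(2024, 5, 10), date(2024, 5, 1)))
# ["missing evidence: ['device_fingerprint']"]
```

Because the checks are plain lookups, a policy update means editing the tables rather than retraining the model.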

3) Fix explanation quality

If the AI is correct but ignored, its explanation probably failed to connect with the agent's reality.

Improve explanation to show:

2-3 strongest signals

what evidence to attach

why this is winnable

What good learning looks like after 60-90 days

Within this one dispute-triage process, teams should be able to show:

Lower losses from avoidable accepts (AI-right overrides drop).

Fewer failed contests caused by evidence errors (AI-wrong impact drops).

Higher agent confidence and faster decisions (improved judgment).

Better AI calibration (fewer false high-confidence recommendations).

A visible culture shift: the AI is no longer seen as either a shortcut or a threat.

Final takeaway

The win is not that the AI is right more often.

The win is building a loop in which every divergence becomes structured learning that improves both human judgment and the AI within the same workflow, without slowing down BPO operations.

What Should Teams Learn When AI Advice Is Ignored or Proven Wrong?

The following case study is from credit underwriting in frontline risk operations.

Importance of the process

Credit underwriting is a vital process in which AI risk models help underwriters decide whether to approve, decline, or verify applications. Although these AI models can assess data such as credit bureau reports, bank statements, income patterns, and device signals, humans are needed to make the final decision. A wrong approval can result in financial and regulatory issues.

Underwrite

Teams should be able to identify when the AI is wrong and when the humans are wrong. The AI always recommends one of Approve, Decline, or Verify, and gives its reasons.

In Scenario 1, the AI was wrong but the human followed it. The AI approved the loan with high confidence; however, the loan later defaulted. The reason was that the AI model missed indicators such as an address change and income stability. The explanation given by the model was not clear, and the confidence score was not reliable. For high-risk cases, the process should push towards Verify, and the underwriter should understand the reasons provided by the AI. In this case the responsibility lies with both the human and the model owners.

In Scenario 2, the human made the mistake, declining the loan based on intuition even though the AI recommended Verify. The application was genuine, yet the loan was rejected. The AI should have explained its recommendation in a better and clearer way. Also, the human decision should have been based on reasoning and facts rather than bias or intuition. The AI system should keep a track record of cases where human decisions went wrong. There should be a proper justification for each rejection by a human.

Below are important approaches to improve the process.

Team-->There should be a regular review of AI and human mistakes, and the findings should be fed back to the AI regularly so that it improves further. Keep in mind that both AI and human mistakes are common, and act accordingly.

Human-->Judge and record decisions. Treat the AI as a support, not an authority. Decisions should be based on evidence, not intuition or bias.

AI-->Should show the pros and cons, the evidence, and relevant past cases.

Model Owners-->Should manage model quality. Tools should include tracking of override patterns, calibration, and decision time.

Process Owners-->Set rules and review them on a regular basis.

Records-->Every decision should be logged with the AI output, the human action, and the evidence.

Policy-->There should be a clear policy on who can override the data.

Conclusion

When we follow this approach, it reduces errors, speeds up decisions, improves fairness, and builds trust in the AI. It also helps to have clear documentation and steady improvements. When the AI or humans fail, teams must understand why. Model gaps should be fixed, explanations should be improved, the process should be strengthened, and human judgment should be guided. Responsibility is shared across the model, the people, and the process.

  • Author
🏆 Best Answer

Taby Sheikh
Exceptional response with a complete operational learning loop. Clear domain anchoring (BPO dispute/chargeback triage), structured “Decision Incident” framework, explicit capture fields (AI confidence + human rationale + outcome), and strong separation of learning when AI is right vs when humans are right. Also provides practical governance upgrades (calibrated autonomy, KPI redesign, disagreement training kit). This directly answers what teams should learn and how to institutionalize it.


Approved

Vijay Yivaturi
Very strong real-world tax compliance case with quantified impact (multi-million CAD under/overpayments). Clearly shows both failure modes — AI correct but ignored, and AI flawed but blindly followed. Strong governance takeaways: validation before production, override flags, documentation in SOPs, and structured review controls. High operational credibility.

vijay gonsalves
Good BPO/claims processing examples covering both scenarios clearly. Practical mitigation steps like override logging, hierarchical approval, confidence thresholds, dashboard tracking, and QA audits. Strong control mindset and clearly linked to regulatory and financial impact.

Preethi Bijesh
Clear credit underwriting example with balanced human-AI learning framing. Identifies model gap (missed signals), human bias/intuition risk, and governance improvements (override logging, calibration, model owner tracking, documentation). Well connected to a real risk-sensitive domain.


🟡 Conditionally Approved

(Good intent but lacks sufficient process depth or structured learning mechanism.)

Dhruva Kapur
Good conceptual linkage to confusion matrix, recall rate, SHAP, and model monitoring. However, the response is largely model-metrics focused and not anchored deeply in a defined operational process or structured team-learning mechanism. Needs clearer translation from metrics → actionable team learning structure.


Not Evaluated (High AI Content)

Manish_Gupta_Tpgl
Aloke Biswas
