RANSINGH SATYAJIT RAY

RANSINGH SATYAJIT RAY's post in Class Imbalance was marked as the answer on October 16, 2020

Overview:
Class imbalance is one of the challenges in classification model prediction: the dominating class drives the output of the model, so the model ends up predicting the majority class for almost every record.
For example, say 990 customers are not churning and 10 are churning. A classification model built to determine whether someone is churning will show high accuracy as long as it gets the not-churning customers right, even when it predicts the churning customers as not churning. This kind of problem often deceives the model developer into the false hope that the model is great when accuracy and error rate are the only performance criteria.
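The churn example above can be checked with a few lines of Python. This is only a sketch: the label names "stay" and "churn" are made up for illustration, and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical data from the example above: 990 customers stay, 10 churn.
actual = ["stay"] * 990 + ["churn"] * 10

# A "model" that ignores its input and always predicts the majority class.
predicted = ["stay"] * len(actual)

# Accuracy: share of all predictions that match the actual label.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall for the churn class: share of actual churners the model caught.
churners_caught = sum(
    a == "churn" and p == "churn" for a, p in zip(actual, predicted)
)
churn_recall = churners_caught / actual.count("churn")

print(f"accuracy     = {accuracy:.2%}")      # 99.00%
print(f"churn recall = {churn_recall:.2%}")  # 0.00%
```

The accuracy looks excellent, yet not a single churning customer is identified.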

Before getting into the problem of class imbalance, let's understand what a classification model is, what a confusion matrix is, and what class imbalance is.

Classification Model:
Classification modelling refers to developing a predictive model whose output is a categorical/discrete feature. The prediction depends on independent features (categorical and/or continuous), and the relationship is learned from historical records of the independent features together with their known categorical output.
e.g.
1. Product pass or fail, based on independent features like product dimensions, material properties, shift information, workforce profile.
2. Employee churn or not, based on employee education profile, manager profile, employee personal profile.
3. Loan application accepted or rejected, based on applicant professional profile etc.

Performance measurement with Confusion Matrix:
We can think of the confusion matrix for a classification model as analogous to the result table of a hypothesis test.
Hypothesis Testing ->
                        As per Hypothesis
                        Ho: Not Guilty    Ha: Guilty
Actual   Not Guilty     Right Decision    Type 1 Error
         Guilty         Type 2 Error      Right Decision

Confusion Matrix ->
Here Not Guilty is treated as the positive class (P) and Guilty as the negative class (N).
                             As per Model
                             Not Guilty (P)                          Guilty (N)
Actual Result   Not Guilty   TP (Truly Determined as Not Guilty)     FN (Falsely Determined as Guilty)
                Guilty       FP (Falsely Determined as Not Guilty)   TN (Truly Determined as Guilty)

The confusion matrix is called so because it shows where the model confuses one class with another, i.e. where the predicted output doesn't match the actual output.

The confusion matrix gives us multiple model performance indicators, like
1. Sensitivity or Recall
2. Specificity
3. F-1 Score
4. Accuracy
5. Error Rate
6. AUC

The performance of a classification model is determined using any of these indicators, depending on the use case we are working on.
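Most of the indicators listed above follow directly from the four cells of the confusion matrix. Below is a minimal sketch; the function name `classification_metrics` and the example counts are hypothetical, and precision is included because the F-1 score needs it. AUC is left out because it is computed from predicted scores across many thresholds, not from a single confusion matrix.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return common indicators computed from the confusion-matrix counts."""
    total = tp + fn + fp + tn
    # Sensitivity (recall): share of actual positives predicted as positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives predicted as negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: share of predicted positives that are actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F-1 score: harmonic mean of precision and recall.
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / total
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }


# Made-up counts with churning as the positive class:
# 10 actual churners (6 caught, 4 missed), 990 non-churners (20 false alarms).
metrics = classification_metrics(tp=6, fn=4, fp=20, tn=970)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

With these counts the accuracy is above 97%, while sensitivity and precision make the weak performance on the churning class visible.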

Class Imbalance:
As the name suggests, it is the disproportion between the number of records in each class of a classification problem. It becomes a challenge when the majority class has significantly more records than the minority class, leading to the false hope that model performance is high with respect to accuracy and error rate even when the model is predicting the minority class wrongly. Say the majority class is people not churning and the minority class is people churning: accuracy stays high because the model rightly detects the people not churning, even when it wrongly predicts the minority class (people churning) as not churning. We can think of this as a bias towards the majority class, wherein the model effectively learns to output the majority class.
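A quick way to see how disproportionate the classes are is to count the labels before building any model. The sketch below assumes the labels are available as a plain Python list; the helper name `class_distribution` and the label strings are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of records that falls into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels matching the churn example: 990 vs 10 records.
labels = ["not churning"] * 990 + ["churning"] * 10

shares = class_distribution(labels)
counts = Counter(labels)

# Ratio of the largest class to the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Accuracy that a model gets "for free" by always predicting the majority class.
majority_baseline = max(shares.values())

print(shares)
print(f"imbalance ratio   = {imbalance_ratio:.0f} : 1")
print(f"majority baseline = {majority_baseline:.2%}")
```

Any accuracy figure reported for a model is worth comparing against the majority baseline: a model that does not clearly beat it has learned little about the minority class.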

This is often a challenge in
1. Fraud detection
2. Anomaly detection
3. Security threat detection
4. Medical diagnosis

How to know when we are falling into the trap of class imbalance?
Firstly, certain kinds of complex models (algorithms) are prone to overfitting or underfitting.
*Overfitting: Model performing well on training data (the data on which we build the model) and performing terribly on testing data (the data on which we test the model built on training data)
*Underfitting: Model performing poorly on both training and test data

We usually experiment with the model to check for this, e.g. by
1. experimenting with various model performance measures
2. conducting SME (subject matter expert) interviews

How to overcome this?
Any one of the following, or a combination of them, can be used to mitigate the class imbalance problem
1. Collecting more data
2. Choosing a performance measure other than accuracy and error rate, like Precision, Sensitivity/Recall, Specificity, ROC Curves, AUC, F-1
3. Creating samples synthetically (artificially creating data based on the data distribution pattern of the minority class)
4. Applying resampling techniques, wherein we take many sub-samples from all the records and try to develop a sort of consensus between them.
5. Choosing a different mathematical model altogether.
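Two of the options above, resampling and synthetic samples, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a production method: the function names and toy data are hypothetical, the problem is assumed to be binary with numeric features, and the synthetic step interpolates between random pairs of minority rows, whereas established techniques such as SMOTE interpolate towards nearest neighbours.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    # Draw minority rows with replacement to fill the gap.
    extra = [rng.choice(minority) for _ in range(majority_count - len(minority))]
    return rows + extra, labels + [minority_label] * len(extra)


def interpolate_synthetic(minority_rows, n_new, seed=0):
    """Create rows on the line segment between two random minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_rows, 2)  # needs at least 2 minority rows
        t = rng.random()                     # position along the segment
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy data: 6 majority rows ("stay") and 2 minority rows ("churn").
rows = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
        [0.9, 0.8], [1.3, 1.0], [4.0, 4.0], [5.0, 6.0]]
labels = ["stay"] * 6 + ["churn"] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels, "churn")
print(balanced_labels.count("stay"), balanced_labels.count("churn"))  # 6 6

minority_rows = [r for r, y in zip(rows, labels) if y == "churn"]
new_rows = interpolate_synthetic(minority_rows, n_new=4)
print(new_rows)
```

Resampling should be applied to the training data only, so that the test data still reflects the real class proportions.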
