Benchmark Six Sigma Forum

A full-screen app on your home screen with push notifications, badges and more.

To install this app on iOS and iPadOS
  1. Tap the Share icon in Safari
  2. Scroll the menu and tap Add to Home Screen.
  3. Tap Add in the top-right corner.
To install this app on Android
  1. Tap the 3-dot menu (⋮) in the top-right corner of the browser.
  2. Tap Add to Home screen or Install app.
  3. Confirm by tapping Install.

RANSINGH SATYAJIT RAY

Solutions

  1. RANSINGH SATYAJIT RAY's post in Class Imbalance was marked as the answer   
    Overview:
    Class imbalance is one of the challenges in classification modelling: the dominant class drives the model's output, so the model tends to predict the majority class for almost every record.
    For example, suppose 990 customers are not churning and 10 are churning. A classification model that predicts "not churning" for everyone is 99% accurate, even though it labels every churning customer as not churning. This kind of result often deceives the model developer into believing the model is great when accuracy and error rate are the only performance criteria.
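The churn example above can be checked with a few lines of Python. This is a minimal sketch: the label strings and the always-predict-the-majority "model" are assumptions made for illustration, not a real classifier.

```python
# Illustrative data from the example: 990 customers not churning, 10 churning.
actual = ["not_churn"] * 990 + ["churn"] * 10

# A naive "model" that always predicts the majority class.
predicted = ["not_churn"] * len(actual)

# Accuracy: share of all records predicted correctly.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall on the minority class: share of real churners the model found.
churners_found = sum(a == "churn" and p == "churn" for a, p in zip(actual, predicted))
churn_recall = churners_found / actual.count("churn")

print(f"accuracy     = {accuracy:.2%}")
print(f"churn recall = {churn_recall:.2%}")
```

Accuracy looks excellent while recall on the churning class is zero, which is exactly the trap described above.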
     
    Before looking at the problem of class imbalance, let's understand what a classification model is, what a confusion matrix is, and what class imbalance is.
     
    Classification Model:
    Classification modelling means building a predictive model whose output is a categorical (discrete) feature. The prediction depends on independent features (categorical and/or continuous), and the model is learned from historical records of those independent features together with their known categorical output.
    e.g.
    1. Product pass or fail, based on independent features such as product dimension, material property, shift information and workforce profile.
    2. Employee churn or not, based on employee education profile, manager profile and employee personal profile.
    3. Loan approval accepted or rejected, based on applicant professional profile, etc.
     
    Performance measurement with Confusion Matrix:
    We can think of the confusion matrix for a classification model as analogous to the result table of a hypothesis test.
    Hypothesis Testing ->
                              As per Hypothesis
                              Ho (Not Guilty)     Ha (Guilty)
    Actual    Not Guilty      Right Decision      Type 1 Error
              Guilty          Type 2 Error        Right Decision
     
    Confusion Matrix ->
                                  As per Model
                                  Not Guilty (P)                             Guilty (N)
    Actual    Not Guilty (T)      TP (Truly Determined as Not Guilty)        FN (Falsely Determined as Guilty)
    Result    Guilty (F)          FP (Falsely Determined as Not Guilty)      TN (Truly Determined as Guilty)
     
    The confusion matrix is so called because it shows where the model confuses one class with another, that is, where the predicted output does not match the actual output.
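The four cells of the confusion matrix can be counted directly from paired actual and predicted labels. The sketch below follows the layout used in this post, with "Not Guilty" as the positive class. The function name `confusion_counts` and the five sample verdicts are hypothetical, chosen only for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive):
    """Count TP, FN, FP and TN, treating `positive` as the positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            # Actual positive: either found (TP) or missed (FN).
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Actual negative: either wrongly flagged positive (FP) or correct (TN).
            counts["FP" if p == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FN", "FP", "TN")}

# Hypothetical verdicts; "Not Guilty" is the positive class.
actual    = ["Not Guilty", "Not Guilty", "Not Guilty", "Guilty", "Guilty"]
predicted = ["Not Guilty", "Guilty",     "Not Guilty", "Not Guilty", "Guilty"]

cells = confusion_counts(actual, predicted, positive="Not Guilty")
print(cells)
```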
     
    The confusion matrix gives us several model performance indicators, such as:
    1. Sensitivity or Recall 
    2. Specificity
    3. F-1 Score
    4. Accuracy
    5. Error Rate
    6. AUC 
     
    Classification model performance is judged using one or more of these indicators, depending on the use case we are working on.
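Most of the indicators listed above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions. The counts are hypothetical (a churn model that finds 2 of 10 churning customers and wrongly flags 5 of 990 non-churning ones), with churning treated as the positive class. AUC is left out because it needs predicted scores rather than counts.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fn, fp, tn):
    """Compute common indicators from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = safe_div(tp, tp + fn)          # sensitivity
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "sensitivity": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": safe_div(tp + tn, total),
        "error_rate": safe_div(fp + fn, total),
    }

# Hypothetical counts for an imbalanced churn data set (churn = positive class).
metrics = classification_metrics(tp=2, fn=8, fp=5, tn=985)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

With these counts, accuracy is close to 99% while sensitivity is only 20%, so the choice of indicator changes the verdict on the model.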
     
    Class Imbalance:
    As the name suggests, it is the disproportion between the classes in a classification data set. It becomes a challenge when the majority class has significantly more records than the minority class, which gives false hope that the model performs well in terms of accuracy and error rate even when it predicts the minority class wrongly. Say the majority class is people not churning and the minority class is people churning: accuracy stays high because the model correctly detects the people not churning, even when it wrongly predicts the churning people as not churning. We can think of this as a sort of generalization in which the model memorizes the output of the majority class.
     
    This is often a challenge in
    1. Fraud detection
    2. Anomaly detection
    3. Security threat detection
    4. Medical diagnosis
     
    How to know when we are falling into the trap of class imbalance?
    Firstly, certain kinds of complex models (algorithms) are prone to overfitting or underfitting.
    *Overfitting: the model performs well on training data (the data on which we build the model) and performs terribly on testing data (the data on which we test the model built on the training data).
    *Underfitting: the model performs poorly on both training and test data.
     
    We usually experiment with the model to check for this, for example by
    1. Experimenting with various model performance measures
    2. Conducting SME (subject matter expert) interviews
     
     
    How to overcome this?
    Any one of the following, or a combination of them, can be used to mitigate the class imbalance problem:
    1. Collecting more data.
    2. Choosing performance measures other than accuracy and error rate, such as Precision, Sensitivity/Recall, Specificity, ROC curves, AUC and F-1.
    3. Creating samples synthetically (artificially creating data based on the data distribution pattern of the minority class).
    4. Applying resampling techniques, in which we take many sub-samples from all the records and try to develop a sort of consensus between them.
    5. Choosing a different mathematical model altogether.
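Item 3 above (creating samples synthetically) can be illustrated with a simplified interpolation scheme. This is only a sketch in the spirit of SMOTE, not SMOTE itself: real SMOTE interpolates towards nearest neighbours, while this version picks random pairs of minority records. The function name `synthesize_minority` and the sample rows are hypothetical.

```python
import random

def synthesize_minority(minority_rows, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between random pairs
    of existing minority-class rows (numeric features only)."""
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_rows, 2)   # two distinct existing rows
        t = rng.random()                      # position between a and b, in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical minority-class records with two numeric features each.
minority = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
new_rows = synthesize_minority(minority, n_new=5)

for row in new_rows:
    print([round(value, 2) for value in row])
```

Each synthetic row lies on the line segment between two real minority records, so it stays inside the range the minority class already covers. Synthetic rows should be added to the training data only, so that the test data still reflects the real class proportions.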
     
     
