
# Class Imbalance

Solved by Ransingh Satyajit Ray.

Class imbalance is the situation where the distribution of observations across known classes is skewed or biased. One of the classes has either an unusually high or an unusually low number of observations.

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Ransingh Satyajit Ray and Sourabh Nandi.

Applause for all the respondents - Ransingh Satyajit Ray, Glory Gerald, Sourabh Nandi, Sanat Kumar.

## Question

Q 305. What is a class imbalance and how does it affect the outcome of a classification predictive model? Provide suitable examples where class imbalance exists.

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

• Solution

Overview:

Class imbalance is one of the challenges in building classification models: the dominant class drives the model's output, so the model effectively generalizes from the majority class alone.

For example, say 990 customers are not churning and 10 are churning. A classification model built to determine whether a customer churns can reach high accuracy simply by predicting "not churning" for everyone, even though it labels every churning customer as not churning. This often deceives the model developer into the false hope that the model is great when accuracy and error rate are the only performance criteria.
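To make the 990/10 churn example concrete, here is a minimal sketch in plain Python. The "model" is a made-up baseline that always predicts the majority class; the labels `stay` and `churn` are illustrative names, not from any library.

```python
# Churn example from the post: 990 customers stay, 10 churn.
actual = ["stay"] * 990 + ["churn"] * 10

# Hypothetical baseline "model" that always predicts the majority class.
predicted = ["stay"] * len(actual)

# Accuracy: share of all predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall on the churn class: share of actual churners the model catches.
churners_caught = sum(
    a == "churn" and p == "churn" for a, p in zip(actual, predicted)
)
churn_recall = churners_caught / actual.count("churn")

print(f"accuracy     = {accuracy:.2%}")      # accuracy     = 99.00%
print(f"churn recall = {churn_recall:.2%}")  # churn recall = 0.00%
```

The accuracy looks excellent, yet not a single churning customer is identified, which is exactly the trap described above.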

Before looking at the problem of class imbalance, let's understand what a classification model is, what a confusion matrix is, and what class imbalance is.

Classification Model:

Classification modelling refers to building a predictive model whose output is a categorical (discrete) feature. The prediction depends on independent features (categorical and/or continuous), and the relationship is learned from historical records of the independent features together with their categorical outcomes.

e.g.

1. Product pass or fail based on independent features such as product dimensions, material properties, shift information, and workforce profile.

2. Employee churn or not based on employee education profile, manager profile, and employee personal profile.

3. Loan approval accepted or rejected based on applicant professional profile, etc.

Performance measurement with Confusion Matrix:

We can think of the confusion matrix for a classification model as analogous to the result table of a hypothesis test.

Hypothesis Testing ->

| Actual \ As per hypothesis | Ho: Not Guilty | Ha: Guilty |
|---|---|---|
| Not Guilty | Right Decision | Type 1 Error |
| Guilty | Type 2 Error | Right Decision |

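Confusion Matrix ->

| Actual result \ As per model | Not Guilty (P) | Guilty (N) |
|---|---|---|
| Not Guilty | TP (truly determined as Not Guilty) | FN (falsely determined as Guilty) |
| Guilty | FP (falsely determined as Not Guilty) | TN (truly determined as Guilty) |

Here Not Guilty is treated as the positive class (P) and Guilty as the negative class (N).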

The confusion matrix is so called because it shows where the model confuses one class with another, i.e. where the predicted output doesn't match the actual output.

The confusion matrix gives us several model performance indicators, such as

1. Sensitivity or Recall

2. Specificity

3. F-1 Score

4. Accuracy

5. Error Rate

6. AUC

Classification model performance is determined using one or more of these indicators, depending on the use case we are working on.
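As a rough illustration, most of the indicators listed above can be computed directly from the four confusion-matrix counts. The sketch below is plain Python; the function name `classification_metrics` and the example counts are made up for illustration, and which class counts as "positive" is a choice you make per use case. AUC is left out because it needs predicted scores at many thresholds, not a single confusion matrix.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common indicators from the four confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    total = tp + fn + fp + tn
    return {
        "sensitivity_recall": ratio(tp, tp + fn),  # positives correctly found
        "specificity": ratio(tn, tn + fp),         # negatives correctly found
        "precision": ratio(tp, tp + fp),           # predicted positives that are right
        "f1": ratio(2 * tp, 2 * tp + fp + fn),     # harmonic mean of precision and recall
        "accuracy": ratio(tp + tn, total),
        "error_rate": ratio(fp + fn, total),
    }


# Hypothetical churn model (churn = positive class): it catches 6 of the
# 10 churners and raises 20 false alarms among the 990 non-churners.
metrics = classification_metrics(tp=6, fn=4, fp=20, tn=970)
for name, value in metrics.items():
    print(f"{name:20s} {value:.3f}")
```

In this made-up case accuracy is still above 97%, while recall and precision on the churn class tell a much less flattering story.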

Class Imbalance:

As the name suggests, class imbalance is a disproportion between the classes in a classification problem. It becomes a challenge when the majority class has significantly more records than the minority class, because it creates the false hope that model performance, measured by accuracy and error rate, is high even when the model predicts the minority class wrongly. Say the majority class is people not churning and the minority class is people churning: accuracy stays high because the model correctly detects the people not churning, even when it wrongly predicts the churning people as not churning. We can think of this as a sort of over-generalization in which the model memorizes the outputs of the majority class.

This is often a challenge in

1. Fraud detection

2. Anomaly detection

3. Security threat detection

4. Medical diagnosis

How to know when we are falling into the trap of class imbalance?

Firstly, certain kinds of complex models (algorithms) are prone to overfitting or underfitting.

*Overfitting: the model performs well on training data (the data on which we build the model) and terribly on testing data (the data on which we test the model built on the training data).

*Underfitting: the model performs poorly on both training and test data.

We usually experiment with the model to check for this, for example by

1. experimenting with various model performance measures

2. conducting SME interviews

How to overcome this?

Any one of the following, or a combination of them, can be used to mitigate the class imbalance problem:

1. Collecting more data

2. Choosing performance measures other than accuracy and error rate, such as precision, sensitivity/recall, specificity, ROC curves, AUC, and F-1

3. Creating samples synthetically (artificially creating data based on the distribution pattern of the minority class)

4. Applying resampling techniques, in which we take many sub-samples from all the records and try to develop a sort of consensus between them

5. Choosing a different mathematical model altogether
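For point 3 in the list above, a minimal sketch of synthetic sample creation could look like the following. It is a simplified, SMOTE-like idea: new minority records are placed on the line between two existing minority records. Real SMOTE interpolates towards one of a point's k nearest minority neighbours; this sketch just picks random pairs, and the function name `synthesize_minority` and the sample data are made up for illustration.

```python
import random


def synthesize_minority(minority_points, n_new, seed=0):
    """Create n_new synthetic points by interpolating between random pairs
    of existing minority-class points (simplified, SMOTE-like sketch).
    Needs at least two minority points with numeric features."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_points, 2)  # two distinct existing points
        t = rng.random()                       # position along the segment a -> b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical minority-class records with two numeric features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = synthesize_minority(minority, n_new=6)

for point in new_points:
    print(tuple(round(value, 3) for value in point))
```

Because every synthetic point lies between two real minority points, the new data follows the region the minority class already occupies instead of merely repeating existing records.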


What is Class imbalance?

Data are said to suffer from the class imbalance problem when the class distributions are highly imbalanced. In this situation, many classification learning algorithms have low predictive accuracy for the rare class. Cost-sensitive learning is a common approach to solving this problem.

Class-imbalanced data-sets occur in many real-world applications. In the two-class case, we can assume without loss of generality that the minority or rare class is the positive class and the majority class is the negative class. Often the minority class is very rare, such as 1% of the data-set. If we apply most traditional (cost-insensitive) classifiers to such a data-set, they are likely to predict everything as negative (the majority class). This has often been regarded as a problem in learning from highly imbalanced data-sets.

However, traditional cost-insensitive classifiers make two fundamental assumptions. The first is that the classifier's goal is to maximize accuracy (or minimize the error rate); the second is that the class distribution of the training and test data-sets is the same. Under these two assumptions, predicting everything as negative for a highly imbalanced data-set is often the right thing to do.

Thus, the class imbalance problem becomes meaningful only if one or both of the two assumptions above do not hold; i.e., if the costs of the different types of error (false positive and false negative in binary classification) are not the same, or if the class distribution in the test data is different from that of the training data. The first case can be dealt with effectively using cost-sensitive meta-learning methods.

When the misclassification costs are not equal, it is usually more expensive to misclassify a minority (positive) example into the majority (negative) class than a majority example into the minority class (otherwise, it is more plausible to predict everything as negative), i.e., FNcost > FPcost. Thus, given the values of FNcost and FPcost, a range of cost-sensitive meta-learning methods can be, and have been, used to solve the class imbalance problem. If the values of FNcost and FPcost are not known explicitly, they can be set in proportion to the numbers of positive and negative training cases.

If the class distributions of the training and test data-sets are different (e.g., if the training data is highly imbalanced but the test data is more balanced), an obvious approach is to sample the training data so that its class distribution is the same as that of the test data. This can be accomplished by oversampling (creating multiple copies of examples of) the minority class and/or undersampling (selecting a subset of) the majority class. Note that sometimes the number of minority class instances is too small for classifiers to learn adequately. This is the problem of insufficient (small) training data, which is different from that of imbalanced data-sets.
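Oversampling and undersampling as described above can be sketched in a few lines of plain Python. This is only a minimal illustration under two assumptions: the target is a 1:1 class ratio (in practice you would aim for the class distribution of the test data), and the majority class is the larger list. The function names and the integer "records" are made up.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Copy random minority examples (with replacement) until both classes match in size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Hypothetical record IDs: 990 majority (negative) and 10 minority (positive) cases.
majority = list(range(990))
minority = list(range(990, 1000))

over_major, over_minor = random_oversample(majority, minority)
under_major, under_minor = random_undersample(majority, minority)

print(len(over_major), len(over_minor))    # 990 990
print(len(under_major), len(under_minor))  # 10 10
```

Oversampling keeps all the data but repeats minority records; undersampling discards most of the majority records, which can be costly when data is scarce.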

Methods for addressing class imbalance can be divided into two main categories. The first category is data-level methods, which operate on the training set. The other category covers classifier (algorithmic) level methods, which keep the training data-set unchanged and adjust the training or inference algorithms.

1. Data level methods
a. Oversampling
b. Undersampling

2. Classifier level methods
a. Thresholding
b. Cost-sensitive learning
c. One-class classification
d. Hybrid of methods
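Thresholding and cost-sensitive learning from the list above are closely related. As a small sketch, assume a classifier that outputs a probability for the positive (minority) class and that correct predictions cost nothing. Predicting negative then has expected cost `p * FNcost`, predicting positive has expected cost `(1 - p) * FPcost`, so predicting positive is cheaper whenever `p >= FPcost / (FPcost + FNcost)`. The function names and the cost values below are illustrative assumptions.

```python
def cost_threshold(fn_cost, fp_cost):
    """Probability threshold above which predicting positive has the lower
    expected cost (assumes correct predictions cost nothing)."""
    return fp_cost / (fp_cost + fn_cost)


def predict_positive(prob_positive, fn_cost, fp_cost):
    """Return True when the positive class is the cheaper prediction."""
    return prob_positive >= cost_threshold(fn_cost, fp_cost)


# Hypothetical predicted probabilities for the positive (minority) class.
scores = [0.05, 0.2, 0.6]

# Equal costs give the usual 0.5 threshold.
equal = [predict_positive(p, fn_cost=1, fp_cost=1) for p in scores]

# Missing a positive is assumed to be 9 times as costly as a false alarm,
# which lowers the threshold to 0.1.
weighted = [predict_positive(p, fn_cost=9, fp_cost=1) for p in scores]

print(equal)     # [False, False, True]
print(weighted)  # [False, True, True]
```

Lowering the threshold leaves the trained model untouched; it only changes how its scores are turned into class labels, which is why thresholding counts as a classifier-level method.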

How does Class imbalance affect the outcome of a predictive classification model?

Class imbalance poses a challenge for predictive classification modeling. Most machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class, which results in models with poor predictive performance, specifically for the minority class. This is a problem because, typically, the minority class is more important, so the problem is more sensitive to classification errors for the minority class than for the majority class.

There are perhaps two leading causes for the imbalance: data sampling and the properties of the domain.

It is possible that the imbalance in the examples across the classes was caused by the way the examples were collected or sampled from the problem domain, which might include biases introduced during data collection and errors made during data collection.

Examples where class imbalance exists?
This problem is widespread in practice and can be observed in various disciplines, including

1. Fraud detection
2. Medical diagnosis
3. Spam detection
4. Claim prediction
5. Default prediction
6. Oil spillage detection
7. Facial recognition
8. Churn prediction
9. Anomaly detection
10. Outlier detection
11. Intrusion detection
12. Conversion prediction
13. Binary classification
14. Software defect prediction
15. Building decision trees for the multi-class imbalance problem
16. Non-linear gradient boosting for class-imbalance learning
17. Hybrid sampling with bagging for class imbalance learning

Class imbalance is a problem that commonly occurs in machine learning when one class of data occurs far more often than the other classes. The algorithm then becomes biased towards predicting the majority class, as there is not enough data to learn the patterns present in the minority class.

For better understanding, I have explained the concept through a simple example below, which also gives a brief idea of how class imbalance impacts the outcome of a classification predictive model.

Let's say you have moved from your hometown to a new city and have been living there for the past month.

Class 1 (Hometown) - You are familiar with all the locations, like your home, routes, essential shops, tourist spots, etc., because you spent your entire childhood there.

Class 2 (New Town) - You do not have much idea of where each location exactly is, and the chances of taking the wrong route and getting lost are very high.

Hometown is the Majority Class and New town is the Minority Class.

The same happens with class imbalance. The model has sufficient information about the majority class but insufficient information about the minority class, which leads to high misclassification errors for the minority class.

Examples:

Below are various disciplines where Class Imbalance is generally observed:

• Medical Diagnosis
• Oil Spillage Detection
• Facial Recognition
• Anomaly Detection
• Fraud Detection
• Intrusion Detection
• Spam Detection
• Conversion Prediction
• Churn Prediction
• Claim Prediction
• Default Prediction
• Outlier Detection


Class imbalance refers to the situation where the classes are skewed or biased (i.e. not in proportion).

E.g. if we want to compare the performance of boys against girls in a higher secondary school and the boys:girls ratio is not 1:1, then the classes are imbalanced (e.g. 5:1, 1:5 and so on).

Class imbalance can be mild or severe; the more severe the imbalance, the less reliable the predictions for the minority class tend to be.

How class imbalance affects predictive learning: a simple example

E.g. suppose a school wants to create a predictive model to understand student performance based on mother tongue. Assume the mother tongue of 600 students is language "A", of 200 students language "B", of 150 students language "C" and of 50 students language "D".

In the above scenario there are 4 classes with different frequencies, which leads to class imbalance (since the example has more than 2 classes, it can be referred to as multi-class imbalance).

Many predictive models implicitly assume that all classes have roughly equal frequencies. Since the classes in the above scenario are imbalanced, the result is a poor predictive model (especially for language D, as it has the lowest frequency).
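To put numbers on the language example, the sketch below computes each class's share of the data, the ratio between the largest and smallest class, and one common inverse-frequency class weight, `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn uses for `class_weight="balanced"`; using it here is my own assumption for illustration, not something the example prescribes.

```python
# Class counts from the mother-tongue example above.
counts = {"A": 600, "B": 200, "C": 150, "D": 50}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of the data held by each class.
shares = {label: count / n_samples for label, count in counts.items()}

# Inverse-frequency weights: rare classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label in counts:
    print(f"{label}: share {shares[label]:.1%}, weight {weights[label]:.2f}")
print(f"imbalance ratio (largest : smallest) = {imbalance_ratio:.0f} : 1")
```

With these weights, one mistake on a language D student counts as much as twelve mistakes on language A students, which is one way to stop the majority class from dominating training.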

The most common places where we find class imbalance are:

1) Banking sector – Fraud detection (the majority of transactions are genuine and only a few are fraudulent)

2) Spam – Only a few emails out of all email traffic are spam

3) Brand loyalty – Very few people (in the current scenario) are brand loyal because of the wide range of options

4) Manufacturing – Only a few cars out of a lot have some performance issue


Sourabh Nandi and Ransingh have provided the best answer to the question for highlighting the method to evaluate a classification problem, ways to address class imbalance issues and providing multiple examples.
