Suppose you are tasked with developing a simple machine learning algorithm, whether supervised (with a clear target variable) or unsupervised (with no predefined outcome variable).
But looking at your data, you notice that one class heavily dominates the other. In such a case, your model will have a hard time learning from your data to predict future classes.
Still unsure what the issue is here? Let’s look at a concrete example.
Visualizing Unbalanced Datasets
Let’s look at what imbalanced data look like in practice. For the purpose of this article, and an upcoming series of articles focused on detecting fraudsters, I picked the Credit Card Fraud dataset from Kaggle.
This dataset is a collection of roughly 285k credit card transactions, each described by a set of anonymized features V1 to V28 along with the transaction amount, plus a categorical variable indicating whether the transaction was fraudulent (Class == 1) or not (Class == 0).
# Import libraries & read data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("creditcard.csv")
Let’s now count the number of times a fraudulent case arises versus a non-fraudulent one.
occ = df['Class'].value_counts()
So it seems we have a total of 492 fraudulent cases out of 284,807 credit card transactions.
print(occ / len(df))
That’s just about 0.17% of all cases. That is arguably too small a share of fraudulent data to train our model on, and this is what we refer to as a class imbalance problem.
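To make the ratio concrete, here is a minimal sketch of the same check on a synthetic stand-in for the Class column (the counts below are invented for illustration, not taken from the Kaggle file):

```python
import pandas as pd

# Synthetic stand-in for the Class column: 9,950 legitimate, 50 fraudulent
y = pd.Series([0] * 9950 + [1] * 50, name="Class")

occ = y.value_counts()   # count of each class label
ratio = occ / len(y)     # share of each class in the data
print(ratio[1])          # fraction of fraud cases: 0.005, i.e. 0.5%
```

Dividing the counts by the number of rows (not by `df.shape`, which is a tuple) gives the class proportions directly.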
Let’s look at it visually, before we make any changes.
First, let’s define a function to create a scatterplot of our data and labels.
def plot_data(X, y):
    plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
    plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
    plt.legend()
    plt.show()
We want to avoid as much noise and dimensionality as possible on our graph. So we’ll create a vector X that puts all the features together, and a vector y that contains the target variable (Class variable).
We’ll define an additional function for that.
def prep_data(df):
    X = df.iloc[:, 1:29].values.astype(np.float64)  # features V1 to V28
    y = df.iloc[:, 30].values  # the Class column
    return X, y
We’ll now define our variables X and y:
X, y = prep_data(df)
Ready? Let’s go:
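If you are following along without the Kaggle file, a minimal sketch on synthetic, deliberately imbalanced 2D data reproduces the same picture (all sizes and distributions here are made up for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in: 2,000 "non-fraud" points and only 20 "fraud" points
X = np.vstack([rng.normal(0, 1, size=(2000, 2)),
               rng.normal(3, 1, size=(20, 2))])
y = np.array([0] * 2000 + [1] * 20)

plt.scatter(X[y == 0, 0], X[y == 0, 1], label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.savefig("imbalance.png")  # the red class is barely visible among the blue
```

The resulting scatterplot shows a sea of blue points with only a handful of red ones, which is exactly the situation in the fraud dataset.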
You can easily see that the fraud cases in red are clearly outnumbered by the non-fraud cases in blue. What a machine learning algorithm mostly sees in this case is non-fraud occurrences. This will prove challenging when it comes to training our model successfully, since it will most likely generate false negatives if the dataset is left as is.
So what can we do about this? This is where resampling comes in handy as a way to improve our fraud to non-fraud ratio.
What is Resampling & When to use it?
Resampling is the process of drawing repeated samples from the original dataset. The intuition behind resampling methods is to create “similar” cases for our underrepresented data classes, so that the data becomes representative of the population we wish to investigate, thereby feeding the algorithm enough examples to output more accurate results.
What are some types of Resampling Methods?
- Undersampling the majority class
- Oversampling the minority class
- Synthetic Minority Oversampling Technique (SMOTE)
There are obviously more resampling methods (e.g. bootstrapping, cross-validation, etc.) but we’ll focus on defining the above three for this article.
Undersampling the majority class (in this case, the non-fraudulent cases) means taking random draws of the dominating class out of the dataset until it matches the number of cases in the non-dominating class. As a general rule, this is usually the least desirable approach, as it causes us to lose valuable data by throwing it away; but when you have a very large dataset, it might prove computationally cheaper to undersample (unless you brew coffee in the meantime).
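As a quick illustration, a random undersampler can be sketched in a few lines of NumPy (the class sizes below are invented for the example; imbalanced-learn also ships a ready-made RandomUnderSampler):

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented example: 1,000 majority (0) and 30 minority (1) labels
y = np.array([0] * 1000 + [1] * 30)
X = rng.normal(size=(len(y), 2))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Randomly keep only as many majority rows as there are minority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.concatenate([kept_majority, minority_idx])

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # [30 30] -- the classes are now balanced
```

Note how 970 majority rows are simply discarded, which is exactly the data loss described above.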
Oversampling the minority class is the opposite. Instead of the previous approach, we take random draws of the non-dominating class and create “fake” copies to match the number of cases in the dominating class. In essence, we are creating duplicates of the data and training our model on them. This may not be an ideal approach when our non-dominating class is not well scattered across the feature space, since duplication only recreates identical instances without any “synthetic” variety.
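The duplication-based variant can be sketched the same way (again with invented sizes; imbalanced-learn’s RandomOverSampler does the equivalent):

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented example: 500 majority (0) and 25 minority (1) labels
y = np.array([0] * 500 + [1] * 25)
X = rng.normal(size=(len(y), 2))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have equal counts
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx),
                   replace=True)
keep = np.concatenate([np.arange(len(y)), extra])

X_over, y_over = X[keep], y[keep]
print(np.bincount(y_over))  # [500 500]
```

Every added row is an exact copy of one of the 25 minority points, which is why this approach adds no new variety.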
Which brings us to the third and last method!
Synthetic Minority Oversampling Technique (SMOTE) is another type of minority oversampling, except that this one takes into account the characteristics of existing cases of the non-dominating class and creates synthetic examples in a “nearest neighbors” fashion (couldn’t help myself with the South East Asian saying: same same, but different!).
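The core idea can be sketched by hand: pick a minority sample, find one of its nearest minority neighbors, and interpolate a new point somewhere on the segment between them. This is a simplified, illustrative version of the mechanism, not the library implementation, and the points below are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# A handful of invented minority-class points in 2D
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_sample(points, rng):
    i = rng.integers(len(points))
    # Nearest neighbor of point i by Euclidean distance (excluding itself)
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf
    j = int(np.argmin(dists))
    # Synthetic point at a random spot on the segment between the two
    lam = rng.random()
    return points[i] + lam * (points[j] - points[i])

new_point = smote_like_sample(minority, rng)
print(new_point)  # lies between two existing minority points
```

The synthetic point resembles its parents without being an exact copy of either: same same, but different.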
For all three methods, a rule of thumb is to resample only on your training data!
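To make that rule concrete, here is a hedged sketch of the order of operations, using a plain NumPy split and simple duplication-based oversampling as a stand-in for SMOTE (all sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented imbalanced dataset: 900 majority, 100 minority
y = np.array([0] * 900 + [1] * 100)
X = rng.normal(size=(len(y), 2))

# 1) Split FIRST, so the test set keeps the real-world class ratio
perm = rng.permutation(len(y))
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Resample ONLY the training split (oversampling with replacement)
min_idx = np.flatnonzero(y_train == 1)
maj_idx = np.flatnonzero(y_train == 0)
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
keep = np.concatenate([np.arange(len(y_train)), extra])
X_train_res, y_train_res = X_train[keep], y_train[keep]

# The training set is balanced; the test distribution is untouched
print(np.bincount(y_train_res), np.bincount(y_test))
```

Resampling before splitting would leak copies of the same minority rows into both sets and inflate your evaluation scores.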
Resampling Unbalanced Class — The SMOTE Way
(see what I did there? never miss an opportunity to put a pun in edgewise)
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
method = SMOTE()
Let’s apply the method on our features & target variable.
X_resampled, y_resampled = method.fit_resample(X, y)
Let’s look at our new & balanced dataset now:
As you can see, our non-dominating class (the fraudulent cases in red) is much more visible in the data. If we run a quick numerical check, we’ll see that the two classes in y_resampled now have equal counts.
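The numerical check itself is one line per array; here it is sketched on invented stand-ins for y and y_resampled:

```python
import numpy as np

# Invented stand-ins: the original labels and a balanced, resampled version
y = np.array([0] * 1000 + [1] * 10)
y_resampled = np.array([0] * 1000 + [1] * 1000)

print(np.bincount(y))            # [1000   10] -- heavily imbalanced
print(np.bincount(y_resampled))  # [1000 1000] -- balanced after resampling
```

On the real dataset, the same bincount (or pandas value_counts) comparison confirms the resampling worked.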
SMOTE has effectively synthesized new fraudulent cases by interpolating between existing ones in feature space (for the more mathematically inclined, an understanding of k-nearest neighbors and Euclidean distance might come in handy at this point).
That being said, there are some limitations to keep in mind: the synthetic cases are interpolated blindly, so they may land close to genuine non-fraud cases. It would be a wise decision to combine SMOTE with other rule-based systems for more accuracy, but it depends on the task and data at hand.
Ready to model?
For now, it’s fair to say you’re ready to move forward with your predictive model (but don’t forget to only apply SMOTE on the training set)!