One Potential Reason for Model Failure in Production

Rule of thumb: Split your dataset into train and test sets before doing any data processing. Otherwise there may be data leakage that makes your model evaluation overly optimistic.

Some of you might have these experiences:

  • Your model scores well on your local test set but performs badly on the Kaggle leaderboard
  • Your model works normally during the POC but degrades drastically in the production environment

Both problems can be caused by data leakage.

What is Data Leakage?

Data leakage refers to the accidental sharing of information between the training and testing datasets. This shared information gives the model a ‘heads-up’ about the testing dataset and produces deceptively optimistic evaluation scores. However, because the model has effectively overfit the testing data, it cannot predict accurately on future unseen data, such as the Kaggle test set or live production data.

One of the most common causes of data leakage is performing the train-test split after data processing.

Example: Data Leakage due to Split after Processing

Assume we have the following dataset and want to use columns 1 and 2 to predict the target.

The following steps are carried out in order:

[Figure: raw dataset for classification]
[Figure: train-test split after processing]
  1. Replace Y and N in column 1 with their respective average values of target.
  2. Label encode column 2 based on lexical order.
  3. Split the dataset. Use first 7 rows for training and last 3 rows for testing.
  4. Train model using the training data and validate on testing data.
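The leaky steps above can be sketched in code. The dataset itself is not reproduced in this article (it appeared as a figure), so the values below are a hypothetical reconstruction chosen only to match the numbers quoted later: 6 Ys overall, 4 of them with a positive target, and the first 7 rows used for training.

```python
# Hypothetical 10-row dataset reconstructed to match the article's numbers;
# the actual values in the original figure may differ.
col1   = ["Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y", "N"]
col2   = ["a", "b", "b", "a", "e", "d", "d", "c", "c", "a"]
target = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Leaky step 1: target-mean encode column 1 using ALL 10 rows,
# including the rows that will later become the test set.
def target_mean(value):
    rows = [t for v, t in zip(col1, target) if v == value]
    return sum(rows) / len(rows)

y_mean = target_mean("Y")   # 4 / 6 — computed from train AND test targets

# Leaky step 2: label encode column 2 from the full set of categories.
codes = {c: i + 1 for i, c in enumerate(sorted(set(col2)))}

# Step 3: split only after encoding — the test-set information has
# already bled into the features the model will train on.
train_rows, test_rows = list(range(7)), list(range(7, 10))
```

The key point is that `y_mean` is computed from targets in rows 8–10, which the model is never supposed to have seen.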

Data leakage occurred!

For step 1, the value for Y is computed as 4 / 6: there are 6 Ys in the full dataset and 4 of them have a positive target. Because the data is split after processing, this statistic is accidentally shared between the training and testing datasets. At step 4, the model is trained on 4 Ys and will, in effect, ‘expect’ two Ys in the test data with one of them having a positive target. This leads to an overly optimistic estimate of model performance.

Similarly, for step 2, the letter ‘e’ at row 5 is encoded as 5 because the encoder has seen ‘a’, ‘b’, ‘c’ and ‘d’ in the full dataset. This information is likewise leaked between the training and testing data.
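The shift in the code assigned to ‘e’ can be shown in isolation. The category lists below are assumptions consistent with the article: the full dataset contains ‘a’ through ‘e’, while ‘c’ never appears in the first 7 rows.

```python
# Hypothetical category sets; 1-indexed lexical label encoding.
full_categories  = ["a", "b", "c", "d", "e"]   # seen across all 10 rows
train_categories = ["a", "b", "d", "e"]        # 'c' is absent from the training rows

def lexical_codes(categories):
    # Map each category to its 1-based position in sorted order.
    return {c: i + 1 for i, c in enumerate(sorted(set(categories)))}

code_full  = lexical_codes(full_categories)["e"]    # 5 — encoder fit on everything (leaky)
code_train = lexical_codes(train_categories)["e"]   # 4 — encoder fit on training rows only
```

The two encoders disagree on ‘e’, which is exactly the discrepancy the article points out between this section and the split-first version below.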

Simple Solution: Split before Processing

The issue described above can have a severely negative impact, but the solution is simple: split before processing. Let's use the same example with different steps:

[Figure: train-test split before processing]
[Figure: transform train and test data with the same encoders]
  1. Split the dataset. Use first 7 rows for training and last 3 rows for testing.
  2. For training data: Replace Y and N in column 1 with their respective average values of target. Label encode column 2 based on lexical order. Save these encoders.
  3. For testing data: transform the data with the same encoder.
  4. Train model using the training data and validate on testing data.
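The corrected steps can be sketched against the same hypothetical dataset reconstructed earlier (again, the real values sat in a figure, so these are assumptions that merely match the quoted statistics):

```python
# Same hypothetical 10-row dataset as before; reconstructed, not original.
col1   = ["Y", "N", "Y", "N", "Y", "Y", "N", "Y", "Y", "N"]
col2   = ["a", "b", "b", "a", "e", "d", "d", "c", "c", "a"]
target = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Step 1: split FIRST — rows 1-7 for training, rows 8-10 for testing.
tr = slice(0, 7)

# Step 2: fit both encoders on the training rows only, then save them.
y_rows = [t for v, t in zip(col1[tr], target[tr]) if v == "Y"]
y_mean = sum(y_rows) / len(y_rows)          # 3 / 4 = 0.75
codes = {c: i + 1 for i, c in enumerate(sorted(set(col2[tr])))}   # no 'c' seen

# Step 3: transform the test rows with the SAME encoders. 'c' was never
# seen in training, so it maps to None and needs an explicit fallback.
test_col1 = [y_mean if v == "Y" else 0.0 for v in col1[7:]]
test_col2 = [codes.get(c) for c in col2[7:]]
```

Handling the unseen ‘c’ is a deliberate design decision (a sentinel value, the most frequent training category, or dropping the row are all common choices); the point is that the decision is made without peeking at the test data.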

Note that Y is replaced with 0.75 because there are 4 Ys in the training data and 3 of them have positive targets. The letter ‘e’ is encoded as 4 because there is no ‘c’ in the training data. The testing data is assumed to be unseen and to follow similar patterns, so it is transformed with the same encoders fitted on the training data.

Splitting the data first has two major benefits:

  1. It reduces the risk of data leakage.
  2. Future unseen data will be processed in exactly the same way as the testing data, ensuring consistency in model performance.