Semi-Supervised Fraud Detection


dataset: Credit Card Fraud Detection

kaggle notebook: Semi-Supervised Fraud Detection

keywords: fraud detection, novelty detection, ensembling, learning representation, semi-supervised learning

Datasets for fraud detection are usually highly imbalanced because, among all transactions, only a few are fraudulent. Instead of trying to balance the dataset with a resampling method, we are going to approach this problem as a Novelty Detection problem.

In this kernel, we are going to describe the dataset briefly and then compare multiple machine learning algorithms using several evaluation metrics.

The most common machine learning algorithms for this kind of task are the following:

  1. Isolation forest
  2. One class SVM
  3. Autoencoders

Isolation Forest



The Isolation Forest algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The argument goes: isolating anomalous observations is easier because only a few conditions are needed to separate those cases from the normal observations. On the other hand, isolating normal observations requires more conditions. Therefore, an anomaly score can be calculated from the number of conditions required to separate a given observation: the fewer conditions needed, the more anomalous the observation.
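The idea can be tried out with a few lines of plain Python. The following is only a toy sketch of the splitting argument above, not the scikit-learn implementation: the helper names `isolation_depth` and `mean_depth`, the synthetic data and the number of trees are all assumptions made for illustration.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))           # pick a random feature
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:                                  # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                   # random split between min and max
    # keep only the rows that fall on the same side of the split as `point`
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=100, seed=0):
    """Average the isolation depth over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

rng = random.Random(42)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]  # synthetic "genuine" points
outlier = (8.0, 8.0)                                               # synthetic "fraud" point
data = normal + [outlier]

depth_outlier = mean_depth(outlier, data)
depth_normal = sum(mean_depth(p, data) for p in normal[:20]) / 20
print(depth_outlier, depth_normal)
```

The outlier should need noticeably fewer splits than a typical normal point, which is exactly what the anomaly score exploits.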

One Class SVM (OCSVM)


A One-Class Support Vector Machine is an unsupervised learning algorithm that is trained only on the ‘normal’ data, in our case the negative examples. It learns the boundaries of these points and is therefore able to classify any points that lie outside the boundary as, you guessed it, outliers.
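As a small illustration of the boundary idea, the sketch below fits scikit-learn's `OneClassSVM` on synthetic "normal" points and then queries two points far away from them. The data is made up for the example; the kernel, `gamma` and `nu` values simply reuse the ones that appear later in this kernel and are not tuned for this toy data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(500, 2))        # synthetic "genuine" samples
X_far = np.array([[6.0, 6.0], [-7.0, 5.0]])       # synthetic points far from the training cloud

model = OneClassSVM(kernel='rbf', gamma=0.3, nu=0.01).fit(X_normal)

pred_far = model.predict(X_far)                   # -1 marks an outlier, +1 an inlier
pred_train = model.predict(X_normal)
is_fraud_like = (pred_far == -1).astype(int)      # map {-1, 1} to {1, 0}
print(is_fraud_like, (pred_train == 1).mean())
```

Note that scikit-learn reports outliers as -1, which is why the predictions are mapped to {1, 0} before they are compared with the `Class` labels.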



Autoencoders

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h).


In our case, we will train the autoencoder using samples of normal transactions only. After that, when a fraudulent sample is given as input to the model, its reconstruction should have a larger error. Therefore, by defining a loss threshold, this ANN model will work as a novelty detection model.
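The thresholding step can be sketched without a neural network. In the example below, PCA stands in for the autoencoder (`transform` plays the encoder, `inverse_transform` the decoder), which is a deliberate simplification. The synthetic data and the rule of taking the 99th percentile of the training error as threshold are assumptions for illustration, not the method used in this kernel.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic "normal" samples lie close to a 2-D plane embedded in 6-D space
basis = rng.normal(size=(2, 6))
X_normal = rng.normal(size=(1000, 2)) @ basis + 0.05 * rng.normal(size=(1000, 6))
X_anom = rng.normal(scale=3.0, size=(20, 6))      # synthetic anomalies without that structure

coder = PCA(n_components=2).fit(X_normal)         # linear stand-in for the autoencoder

def reconstruction_error(X):
    """Mean squared error between each sample and its reconstruction."""
    X_hat = coder.inverse_transform(coder.transform(X))
    return np.mean((X - X_hat) ** 2, axis=1)

# assumed rule: flag anything worse than 99% of the normal training samples
threshold = np.quantile(reconstruction_error(X_normal), 0.99)
flags = (reconstruction_error(X_anom) > threshold).astype(int)
false_alarm_rate = (reconstruction_error(X_normal) > threshold).mean()
print(threshold, flags.mean(), false_alarm_rate)
```

Deriving the threshold from the error distribution of the training data is one possible alternative to a hand-picked constant.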



import numpy as np
import pandas as pd

from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, f1_score, accuracy_score, precision_score
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.decomposition import PCA, IncrementalPCA, LatentDirichletAllocation
from sklearn.manifold import TSNE

from tqdm.notebook import tqdm, trange
from typing import NoReturn, Union, List
from mlxtend.classifier import EnsembleVoteClassifier
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from umap import UMAP
df = pd.read_csv('../input/creditcardfraud/creditcard.csv')
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns


  • V{1-28}: PCA decomposition outcome
  • Class: label (0: normal, 1: fraudulent)
  • Time: Number of seconds elapsed between this transaction and the first transaction in the dataset
  • Amount: transaction amount
# use a dataset sample for development purposes
dev = False
df.groupby(['Class']).Class.count().plot(kind='pie', title='Fraudulent VS Genuine Transactions')


df_s = df.sample(2000)
X = df_s[[_ for _ in df.columns if _ != 'Class']]

pca = PCA(n_components=2)
tsne = TSNE(n_components=2)
ump = UMAP(n_components=2)
ipca = IncrementalPCA(n_components=2)

x_pca = pca.fit_transform(X)
x_tsne = tsne.fit_transform(X)
x_umap = ump.fit_transform(X)

PCA, UMAP and t-SNE are used for further dimensionality reduction in order to visualize a sample of the dataset.

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

sizes = pd.Series(df_s['Class']+1).pow(5) # represent fraud with bigger point

axes[0].scatter(x_pca[:, 0], x_pca[:, 1], s=sizes, c=df_s['Class'].values)
axes[1].scatter(x_tsne[:, 0], x_tsne[:, 1], s=sizes, c=df_s['Class'].values)
axes[2].scatter(x_umap[:, 0], x_umap[:, 1], s=sizes, c=df_s['Class'].values)





Time and Amount fields should be scaled
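As a quick illustration of what the scaling does, the sketch below standardizes the five `Amount` values shown in the table above by hand. It mirrors the per-column computation of `StandardScaler` (subtract the mean, divide by the population standard deviation); using only five values is of course just for demonstration.

```python
import numpy as np

# Amount values of the first five rows shown above
amount = np.array([149.62, 2.69, 378.66, 123.50, 69.99])

# zero mean and unit variance, as StandardScaler computes per column
scaled = (amount - amount.mean()) / amount.std()
print(scaled.round(3))
```

After scaling, `Time` and `Amount` live on a scale comparable to the PCA components V1 to V28.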

During pre-processing, the dataset will be split into two parts:

  1. ~283K samples of genuine transactions (Training set)
  2. All fraudulent samples and equal number of genuine samples (Test set)

Novelty detection is considered a semi-supervised task because only the normal samples are used during training. During evaluation, a balanced subset of genuine and fraudulent samples will be used.

# time and amount scaling
df['Time'] = StandardScaler().fit_transform(df['Time'].values.reshape(-1, 1))
df['Amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))

df_anom = df[df['Class'] == 1]
df_norm = df[df['Class'] == 0]

if dev:
    df_norm = df_norm.sample(5000, random_state=42)
df_test_norm = df_norm.sample(df_anom.shape[0])
df_test = pd.concat([df_anom, df_test_norm])  # all fraud samples plus an equal number of genuine ones
df_train = df_norm.drop(df_test_norm.index)

feature_cols = [_ for _ in df.columns if _ != 'Class']
X_train = df_train[feature_cols]
y_train = df_train['Class'] # will not be used
X_test = df_test[feature_cols]
y_test = df_test['Class'] # for evaluation
print('''
train: [{:>8} x {:<5}]
 test: [{:>8} x {:<5}]
'''.format(*X_train.shape, *X_test.shape))
train: [  283823 x 30   ]
 test: [     984 x 30   ]
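def sensitivity_keras(y_true, y_pred):
    """Keras metric. Despite its name, it computes specificity: tn / (tn + fp).

    y_pred - Predicted labels
    y_true - True labels
    Returns the specificity score
    """
    neg_y_true = 1 - y_true
    neg_y_pred = 1 - y_pred
    fp = tf.keras.backend.sum(neg_y_true * y_pred)
    tn = tf.keras.backend.sum(neg_y_true * neg_y_pred)
    # epsilon avoids a division by zero when a batch has no negative samples
    specificity = tn / (tn + fp + tf.keras.backend.epsilon())
    return specificity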


We are going to define some wrappers. These classes work as adapters, so that every model exposes the same interface and returns 1 for fraud and 0 for genuine transactions.

class Scaled_IsolationForest(IsolationForest):
    """The purpose of this sub-class is to transform prediction values from {-1, 1} to {1, 0}."""
    def predict(self, X):
        pred = super().predict(X)
        # -1 (outlier) becomes 1 (fraud); 1 (inlier) becomes 0 (genuine)
        return np.where(pred == -1, 1, 0)

class Scaled_OneClass_SVM(OneClassSVM):
    """The purpose of this sub-class is to transform prediction values from {-1, 1} to {1, 0}."""
    def predict(self, X):
        return np.where(super().predict(X) == -1, 1, 0)

class NoveltyDetection_Sequential(tf.keras.models.Sequential):
    """This custom `tf.keras.models.Sequential` sub-class transforms the autoencoder's output into {1, 0}.

    The output value is determined by the reconstruction (decode) loss. If the reconstruction loss exceeds a threshold, the input sample is considered an anomaly (outlier).
    Based on a few experiments, 1.5 is a decent threshold (don't ask why :P). Future work: determine the threshold using a more sophisticated method.
    """
    def predict(self, x, *args, **kwargs):
        pred = super().predict(x, *args, **kwargs)
        # per-sample mean squared reconstruction error
        mse = np.mean(np.power(x - pred, 2), axis=1)
        return np.where(mse > 1.5, 1, 0)
# define early stop in order to prevent overfitting and useless training
early_stop = tf.keras.callbacks.EarlyStopping(

# it's a common practice to store the best model
checkpoint = tf.keras.callbacks.ModelCheckpoint(

def get_autoencoder() -> tf.keras.models.Sequential:
    """Build an autoencoder."""
    model = NoveltyDetection_Sequential([
        tf.keras.layers.Dense(X_train.shape[1], activation='relu', input_shape=(X_train.shape[1], )),
        # add some noise to prevent overfitting
        tf.keras.layers.Dense(2, activation='relu'),
        tf.keras.layers.Dense(X_train.shape[1], activation='relu')
    ])
                        metrics=['acc', sensitivity_keras])
    return model

The following dictionary contains the models and the parameters that we are going to use during training.

clfs = {
    'isolation_forest': {
        'label': 'Isolation Forest',
        'clb': Scaled_IsolationForest,
        'params': {
            'contamination': 'auto',
            'n_estimators': 300
        },
        'predictions': None,
        'model': None
    },
    'ocsvm': {
        'label': 'OneClass SVM',
        'clb': Scaled_OneClass_SVM,
        'params': {
            'kernel': 'rbf',
            'gamma': 0.3,
            'nu': 0.01,
        },
        'predictions': None,
        'model': None
    },
    'auto-encoder': {
        'label': 'Autoencoder',
        'clb': get_autoencoder,
        'params': {},
        'fit_params': {
            'x': X_train, 'y': X_train,
            'validation_split': 0.2,
            'callbacks': [early_stop, checkpoint],
            'epochs': 64,
            'batch_size': 256,
            'verbose': 0
        },
        'predictions': None,
        'model': None
    }
}

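t = trange(len(clfs))
for name in clfs:
    clfs[name]['model'] = clfs[name]['clb'](**clfs[name]['params'])
    if 'fit_params' in clfs[name]:
        clfs[name]['model'].fit(**clfs[name]['fit_params'])
    else:
        # models without explicit fit parameters are trained on the genuine transactions only
        clfs[name]['model'].fit(X_train)
    clfs[name]['predictions'] = clfs[name]['model'].predict(X_test)
    t.update()  # advance the progress bar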

CPU times: user 3h 9min 26s, sys: 38.6 s, total: 3h 10min 4s
Wall time: 3h 8min 34s
def print_eval_metrics(y_true, y_pred, name='', header=True):
    """Function for printing purposes"""
    if header:
        print('{:>20}\t{:>10}\t{:>10}\t{:>8}\t{:>5}'.format('Algorithm', 'Accuracy', 'Recall', 'Precision', 'f1'))
    acc = accuracy_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    # number formats follow the precision shown in the results table
    print('{:>20}\t{:.8f}\t{:.8f}\t{:.6f}\t{:.3f}'.format(
        name, acc, recall, prec, f1
    ))


y_preds = np.column_stack([clfs[_]['predictions'] for _ in clfs])
enseble_preds = []

Hard Voting

This is one of the simplest ways to combine multiple models in order to generalize better and achieve better performance.
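To make the mechanism concrete, here is a minimal NumPy sketch of a majority vote. The prediction matrix `y_preds_demo` is hypothetical (rows are samples, columns are models, 1 means fraud) and is not taken from the models trained above.

```python
import numpy as np

# hypothetical predictions: rows = samples, columns = three models, 1 = fraud
y_preds_demo = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
])

votes_for_fraud = y_preds_demo.sum(axis=1)
# a sample is flagged when more than half of the models vote for fraud
hard_vote = (votes_for_fraud > y_preds_demo.shape[1] / 2).astype(int)
print(hard_vote)  # prints [1 0 1 0]
```

With an odd number of binary models there are no ties, so every sample gets a clear decision.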

hard_vot = EnsembleVoteClassifier([clfs[_]['model'] for _ in clfs], fit_base_estimators=False).fit(X_test, y_test)
enseble_preds.append((hard_vot.predict(X_test), 'Hard Voting'))

Weighted Hard Voting

With weighted hard voting, the votes of the best-performing models count more than the votes of the weaker ones.
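The effect of the weights can be seen in a small NumPy sketch. Both the prediction matrix and the weights below are hypothetical values chosen for illustration; they are not the weights used in this kernel.

```python
import numpy as np

# hypothetical predictions: rows = samples, columns = three models, 1 = fraud
y_preds_demo = np.array([
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
])
weights = np.array([0.6, 0.2, 0.2])   # hypothetical: the first model is trusted the most

score_fraud = y_preds_demo @ weights          # total weight voting for class 1
score_normal = (1 - y_preds_demo) @ weights   # total weight voting for class 0
weighted_vote = (score_fraud > score_normal).astype(int)

majority_vote = (y_preds_demo.sum(axis=1) > y_preds_demo.shape[1] / 2).astype(int)
print(weighted_vote)  # prints [1 0 1 0]
print(majority_vote)  # prints [0 1 1 0]
```

In the first two rows the heavily weighted model overrules the other two, which is the whole point of weighting, and also the risk if the weights are chosen badly.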

wei_hard_vot = EnsembleVoteClassifier([clfs[_]['model'] for _ in clfs], weights=[
    ], fit_base_estimators=False).fit(X_test, y_test)
enseble_preds.append((wei_hard_vot.predict(X_test), 'Weighted Hard Voting'))


Blending

We are going to use the predicted values as input to another model.
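The sketch below shows the mechanics on synthetic data. The three "base models" are simulated by flipping the true label with a hypothetical error rate each, so the numbers say nothing about the real models; only the flow (stack the predictions, split them, train a meta-model on one half, evaluate on the other) is the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=400)             # synthetic ground truth

# three simulated base models with hypothetical error rates of 10%, 15% and 20%
base_preds = np.column_stack([
    np.where(rng.random(400) < err, 1 - y_true, y_true)
    for err in (0.10, 0.15, 0.20)
])

# half of the stacked predictions train the meta-model, the other half evaluates it
x_tr, x_ts, y_tr, y_ts = train_test_split(base_preds, y_true, test_size=0.5, random_state=0)
blender = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_tr, y_tr)
blend_acc = blender.score(x_ts, y_ts)
print(blend_acc)
```

Because the meta-model is trained on labelled predictions, it has to be evaluated on samples it has not seen, which is why the predictions are split first.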

rf = RandomForestClassifier()

x_tr_ens, x_ts_ens, y_tr_ens, y_ts_ens = train_test_split(y_preds, y_test, test_size=.5)
rf.fit(x_tr_ens, y_tr_ens)


It’s crucial to detect fraudulent transactions; therefore, sensitivity (recall) is a particularly important evaluation metric. For every trained model and ensembling method, the following evaluation metrics will be calculated:

  • Accuracy
  • Recall
  • Precision
  • f1 Score
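All four metrics follow from the counts of the confusion matrix. The sketch below spells out the formulas; `binary_metrics` is a hypothetical helper, and the counts are not reported anywhere in this kernel but were chosen so that they are consistent with the Isolation Forest row of the results on the 984-sample test set.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)            # share of fraudulent samples that were caught
    precision = tp / (tp + fp)         # share of alarms that were really fraud
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# assumed counts: 492 fraudulent and 492 genuine test samples
acc, rec, prec, f1 = binary_metrics(tp=415, fp=21, fn=77, tn=471)
print(f'{acc:.8f} {rec:.8f} {prec:.6f} {f1:.3f}')
# prints 0.90040650 0.84349593 0.951835 0.894
```

Since the test set is balanced, accuracy is informative here; on the full, highly imbalanced dataset it would be dominated by the genuine class.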
print_header = True
for k, v in clfs.items():
    print_eval_metrics(y_test, v['predictions'], v['label'], print_header)
    print_header = False


for prds, l in enseble_preds:
    print_eval_metrics(y_test, prds, l, print_header)
    print_header = False

print_eval_metrics(y_ts_ens, rf.predict(x_ts_ens), 'Blending using RF', False)
           Algorithm	  Accuracy	    Recall	Precision	   f1
    Isolation Forest	0.90040650	0.84349593	0.951835	0.894
        OneClass SVM	0.87398374	0.93699187	0.832130	0.881
         Autoencoder	0.89939024	0.87398374	0.920771	0.897

         Hard Voting	0.90548780	0.88008130	0.927195	0.903
Weighted Hard Voting	0.89939024	0.87398374	0.920771	0.897

   Blending using RF	0.92479675	0.90909091	0.942623	0.926

Future Work

  • Find the autoencoder's loss threshold in a more sophisticated way
  • Test more models and different configurations

