Handling imbalanced datasets is a hard problem in machine learning, and it shows up in areas such as payment fraud detection, cancer and disease diagnosis, and even cyber security attack detection. What all of these have in common is that only a very small percentage of the overall records are positive cases (fraudulent transactions, in the payments example), and those are the ones that we really care about detecting. In this post, we will boost accuracy on a popular Kaggle fraud dataset by training a generative synthetic data model to create additional fraudulent records. Uniquely, this model will incorporate features from both fraudulent records and their nearest neighbors, which are labeled as non-fraudulent but are close enough to the fraudulent records to be a little “shady”.
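To make the "shady" neighbor idea concrete, here is a minimal sketch of how such records might be selected. It is not the implementation used in this post: the helper name `shady_neighbors`, the toy records, the Euclidean distance metric, and the choice of `k=2` are all assumptions made for illustration.

```python
import math


def shady_neighbors(records, labels, k=2):
    """Return sorted indices of non-fraud records that are among the k
    nearest non-fraud neighbors of at least one fraud record.

    Hypothetical helper: Euclidean distance and k are assumed choices.
    """
    fraud = [i for i, y in enumerate(labels) if y == 1]
    normal = [i for i, y in enumerate(labels) if y == 0]
    shady = set()
    for f in fraud:
        # Rank all non-fraud records by distance to this fraud record.
        ranked = sorted(normal, key=lambda j: math.dist(records[f], records[j]))
        shady.update(ranked[:k])
    return sorted(shady)


# Toy data (made up): two numeric features; label 1 = fraud, 0 = non-fraud.
records = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (6.0, 5.0), (0.05, 0.05)]
labels = [0, 0, 0, 0, 0, 1]

shady = shady_neighbors(records, labels, k=2)

# Rows a generative model could be trained on: fraud plus its shady neighbors.
training_rows = [i for i, y in enumerate(labels) if y == 1] + shady
```

On a real dataset, the brute-force ranking above would likely be replaced by an indexed nearest-neighbor search, and features would usually be scaled first so that no single column dominates the distance.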