Predictive models fall into two categories depending on the task at hand: regressors and classifiers. Regression models aim to predict continuous outcomes, while classifiers predict discrete outcomes.
In regression, a skewed outcome distribution can throw off some models. For instance, using linear regression to predict a skewed outcome can reduce the performance of the model. In the context of linear regression, this happens because skewness tends to violate our assumption of Normally distributed noise. (You can also think in terms of the underlying squared-error loss, which is heavily influenced by a few “outlier” outcomes and impedes the learning process; Huber regression can help here because its loss is less sensitive to such outcomes.) One way to deal with this is to use a different family of models, such as decision trees or generalized linear models. In essence, the outcome distribution plays an important role in data modeling.
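To make the point about the loss function concrete, here is a minimal sketch comparing the squared-error loss with the Huber loss on a handful of residuals. The residual values and the threshold `delta=1.0` are made up for illustration; they do not come from any dataset used in this article.

```python
def squared_loss(residual):
    """Squared-error loss, as minimized by ordinary linear regression."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * residual ** 2
    return delta * (size - 0.5 * delta)


# Hypothetical residuals; the last one plays the role of an "outlier" outcome.
residuals = [0.2, -0.5, 0.1, 0.3, 12.0]

for r in residuals:
    print(f"residual={r:6.2f}  squared={squared_loss(r):7.3f}  huber={huber_loss(r):7.3f}")
```

For small residuals the two losses agree, but for the residual of 12 the squared loss is 72 while the Huber loss is 11.5, so the single extreme outcome pulls far less on a fit that minimizes the Huber loss.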
Discrete outcomes (classification) have an analogous problem in the form of imbalanced outcome counts. For example, let’s suppose we are trying to model a dataset where the outcomes are 1s and 0s. For instance, 1 could indicate credit card fraud. Most people don’t commit fraud, so in such a case the counts would be skewed heavily toward 0s. In this paper, the authors introduce ROC curves as a tool for comparing the relative performance of different classifiers. The tool is motivated precisely by the problem of dealing with imbalanced datasets. The paper is an interesting read for understanding the concept of ROC curves in depth.
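As a rough illustration of the imbalance problem, the sketch below builds a made-up fraud dataset, scores a "classifier" that never flags fraud, and then traces ROC points by sweeping a threshold over some hypothetical scores. All of the data, and the helper names `rates` and `roc_points`, are assumptions for illustration only; this is not the procedure from the paper.

```python
def rates(y_true, y_pred):
    """Return (accuracy, true positive rate, false positive rate).

    Assumes y_true contains at least one 1 and at least one 0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn)  # share of actual 1s that were caught
    fpr = fp / (fp + tn)  # share of actual 0s that were wrongly flagged
    return accuracy, tpr, fpr


def roc_points(y_true, scores):
    """Sweep a threshold over the distinct scores; return (fpr, tpr) pairs."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        _, tpr, fpr = rates(y_true, y_pred)
        points.append((fpr, tpr))
    return points


# Hypothetical imbalanced data: 5 fraud cases (1) among 1000 transactions.
y_true = [1] * 5 + [0] * 995
never_flag = [0] * len(y_true)
baseline_acc, baseline_tpr, baseline_fpr = rates(y_true, never_flag)
print(baseline_acc, baseline_tpr, baseline_fpr)

# Hypothetical scores for a small sample with 3 positives and 7 negatives.
small_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
small_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
curve = roc_points(small_true, small_scores)
print(curve[:5])
```

The never-flag baseline reaches 99.5% accuracy while catching none of the fraud cases, which is why accuracy alone says little here. The ROC points instead track the true positive rate against the false positive rate at every threshold, so they do not reward a model simply for predicting the majority class.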
In this article, I will try to further deconstruct the idea of ROC curves and hopefully give you an intuition for how they work. I will be using the PIMA Diabetes dataset to study the ROC curve. I will start by describing the problem, show why a simple accuracy measure is a bad choice, and finally introduce a dashboard to look at an ROC curve in action.