Machine learning often has to cope with highly imbalanced data. When detecting fraud or isolating manufacturing defects, for example, the target event is extremely rare, often well below 1 percent of cases. A model that is 99 percent accurate can therefore still misclassify virtually every one of these rare events.
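To make this concrete, here is a minimal Python sketch with hypothetical counts: on a data set with a 1 percent event rate, a model that always predicts the majority class scores 99 percent accuracy while catching no events at all.

```python
# Accuracy paradox on a hypothetical 1%-fraud data set: a "model"
# that always predicts "not fraud" is 99% accurate yet useless.
n_total = 10_000
n_fraud = 100                       # 1% event rate

# Majority-class model: flag nothing as fraud.
true_positives = 0                  # no fraud case is ever flagged
true_negatives = n_total - n_fraud  # every legitimate case is correct

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / n_fraud   # share of fraud actually caught

print(f"accuracy = {accuracy:.2%}")  # 99.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```

Accuracy alone clearly says nothing about how well the rare class is handled.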
What can you do to find the needles in the haystack?
A lot of data scientists frown when they hear the word sampling. I prefer the term focused data selection: constructing a deliberately biased training data set by oversampling or undersampling. As a result, my training data ends up more balanced, often with an event level of 10 percent or more (see Figure 1). This higher ratio of events can help the machine learning algorithm learn to better isolate the event signal.
For reference, undersampling randomly removes observations to downsize the majority class, while oversampling randomly duplicates minority-class observations to reduce the class disparity.
Figure 1: Developing biased samples through under- and oversampling. The plus sign represents duplicated examples.
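The under- and oversampling shown in Figure 1 can be sketched in a few lines of plain Python. The row counts and the 10 percent target event rate below are illustrative assumptions, not values from the text's data.

```python
import random

random.seed(0)

# Hypothetical data: 990 majority ("ok") rows and 10 minority ("fraud") rows.
majority = [("ok", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

target_rate = 0.10  # desired event level in the training set

# Undersampling: randomly drop majority rows (sampling without
# replacement) until the minority class makes up 10% of the set.
n_keep = int(len(minority) * (1 - target_rate) / target_rate)  # 90 rows
under_train = random.sample(majority, n_keep) + minority

# Oversampling: randomly duplicate minority rows (sampling with
# replacement) until they make up 10% of the combined set.
n_need = int(len(majority) * target_rate / (1 - target_rate))  # 110 rows
over_train = majority + random.choices(minority, k=n_need)

print(len(under_train))  # 100 rows total, 10% events
print(len(over_train))   # 1100 rows total, 10% events
```

Undersampling discards information from the majority class, while oversampling repeats minority rows verbatim, which can encourage overfitting; which trade-off is acceptable depends on how much data you have.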
Another rare event modeling strategy is to use decision processing to place greater weight on correctly classifying the event.
TABLE 1: The cost of inaccurately identifying fraud

                       Classified fraudulent    Classified not fraudulent
Actually fraudulent    $0                       $500
Actually not fraud     $100                     $0
The table above shows the cost associated with each decision outcome. In this scenario, classifying a fraudulent case as not fraudulent (a false negative) has an expected cost of $500, while falsely classifying a non-fraudulent case as fraudulent (a false positive) costs $100.
Rather than selecting a model by some statistical assessment criterion, the goal here is to select the model that minimizes total cost:

Total cost = (false negatives × $500) + (false positives × $100)

With this strategy, accurately specifying the costs of the two types of misclassification is the key to the success of the approach.
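One common way to apply this idea is to search for the decision threshold that minimizes total cost. The sketch below uses the $500 and $100 costs from the table; the model scores and labels are hypothetical.

```python
# Cost-based decision processing: choose the probability threshold
# that minimizes total cost = FN * $500 + FP * $100.
FN_COST = 500  # missing a fraudulent case
FP_COST = 100  # flagging a legitimate case

# Hypothetical (model score, true label) pairs; 1 = fraud.
scored = [(0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1),
          (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]

def total_cost(threshold):
    """Total dollar cost of classifying score >= threshold as fraud."""
    fn = sum(1 for s, y in scored if y == 1 and s < threshold)
    fp = sum(1 for s, y in scored if y == 0 and s >= threshold)
    return fn * FN_COST + fp * FP_COST

# Grid-search thresholds from 0.00 to 1.00 for the lowest total cost.
best = min((t / 100 for t in range(101)), key=total_cost)
print(best, total_cost(best))
```

Because a false negative costs five times a false positive here, the cost-minimizing threshold sits lower than the default 0.5: the model tolerates extra false alarms to avoid missing fraud.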