Fraud Analytics: Like Finding a Needle in a Haystack
Abstract
We are drowning in the data being collected worldwide while starving for knowledge. Despite the enormous volume of data, the events of interest are often quite rare, and modeling rare events is common in practice: network intrusion detection, credit card fraud detection, and medical diagnostics are some of the areas where the study of rare events becomes important. We often build models to detect the occurrence of these events or to identify the riskiest cases. One common approach is to oversample in order to raise the proportion of rare events. More often than not, however, one is interested in classifying events by their probability of occurrence rather than in the actual probabilities themselves. In this paper, we compare different modeling techniques on U.S. data and show that segmentation is not affected by the sampling scheme or by the choice of regression technique, such as logistic or probit, and that inferences can be made using the entire population directly.
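As an informal illustration of why a risk ordering can survive resampling, consider the standard result that sampling events and non-events at different rates multiplies the odds of the event by a constant. The sketch below is not the analysis performed in this paper; the probabilities, the factor of 50, and the names `resampled_probability` and `risk_order` are all hypothetical and chosen only for illustration.

```python
def resampled_probability(p, odds_factor):
    """Event probability after the odds p / (1 - p) are scaled by a constant.

    Assumption: events and non-events are sampled at different rates, which
    multiplies the odds by odds_factor (a hypothetical parameter name).
    """
    odds = p / (1.0 - p)
    shifted = odds * odds_factor
    return shifted / (1.0 + shifted)


def risk_order(probs):
    """Indices of the records sorted from riskiest to safest."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


# Hypothetical population-scale event probabilities for five records.
population_p = [0.001, 0.020, 0.005, 0.150, 0.0003]

# Assume non-events are kept at 1 in 50, so the odds grow 50-fold.
sample_p = [resampled_probability(p, 50.0) for p in population_p]

population_order = risk_order(population_p)
sample_order = risk_order(sample_p)

# The probabilities change, but the ordering of records does not.
print(population_order == sample_order)  # prints True
```

Because scaling the odds is a monotone transformation, the resampled probabilities are larger than the population ones but rank the records identically, which is the intuition behind segmenting on rank rather than on the probability values.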