Back to questions
Assume a classifier that produces a score between 0 and 1 representing the probability of a particular loan application being fraudulent.
In this scenario, what do false positives and false negatives represent? What are the tradeoffs between them in terms of dollars, and how should the model be weighted accordingly?
This is the same question as problem #16 in the Machine Learning Chapter of Ace the Data Science Interview!
Assume that a positive label is identifying a loan application as fraudulent, and so a false positive occurs when the model labels a loan application as a fraud when in reality it is not. In this case, there is immediate loss of revenue. If the loan was for some amount X, and the interest rate was 10% for instance, then the immediate loss is 10% of X. A false negative, on the other hand, occurs when the model deems an application to not be a fraud when in reality it is. In this case, the loss is the immediate loan amount X.
In combination, the two points mean that weighting by revenue would yield a false negative worth about 10 false positives, i.e., one bad loan would be the equivalent of missing 10 good loans. Therefore, for the model, false negatives should be weighted 10:1 with respect to false positives in terms of cost.
Generalizing, we see that not making a loan to what would be a good client leads to the loss of the interest this client would then pay (usually just a low percentage of the total amount of money borrowed). On the other hand, giving the loan to someone who will not pay it back means that all the money will be lost, not just a percentage. In terms of dollars, given a set interest rate, we can therefore calculate the loss ratio between the two types of losses (not making the loan versus making the loan and not being paid back) to quantify the tradeoff between false positives and false negatives.