Applying ML to flag risky payments
Background
From cash to cards
In a transaction, a buyer makes a payment to a seller in exchange for goods or services. As a starting point, let’s assume the payment is made in cash.

Cash keeps things simple. However, it requires the buyer to carry cash around and, more crucially, the buyer may not yet have the full amount. Credit cards address both issues, offering convenience through portability and access to credit.
Credit card transactions involve additional parties. There’s the credit card issuer, often referred to as the issuing bank. The funds must now be transferred to the seller's bank, referred to as the acquiring bank. Finally, an intermediary is needed to connect the two banks. For simplicity, assume a single entity, the payment processor, serves as the intermediary.

More parties means more fees. A transaction fee of ~3% is deducted from the payment, leaving the seller with 97 cents for every dollar. The majority of this fee (roughly two-thirds) goes to the issuing bank.
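As a rough worked example of this arithmetic, here is a minimal sketch that assumes a flat 3% fee and an exact two-thirds issuer share. Both are simplifications of the approximate figures above, and the function and field names are made up for illustration.

```python
FEE_BPS = 300  # assumption: a flat 3% fee, expressed in basis points


def split_payment(amount_cents: int) -> dict:
    """Split a card payment (in cents) into the seller's net and the fee shares."""
    fee = amount_cents * FEE_BPS // 10_000
    issuer_fee = fee * 2 // 3  # assumption: exactly two-thirds goes to the issuing bank
    return {
        "seller_net": amount_cents - fee,
        "issuer_fee": issuer_fee,
        "other_fees": fee - issuer_fee,  # acquiring bank, processor, etc.
    }


print(split_payment(100))  # a one-dollar payment
# {'seller_net': 97, 'issuer_fee': 2, 'other_fees': 1}
```

Working in integer cents avoids floating-point rounding surprises when dealing with money.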
Fraud
This convenience also introduces the risk of fraud. A fraudster using stolen credit cards might 'cash out' by impersonating a seller and processing fraudulent payments as legitimate sales; this is known as seller fraud. Alternatively, the fraudster might masquerade as a buyer and acquire goods or services from a legitimate seller; this is known as buyer fraud. A third possibility involves compromising a legitimate seller's account and redirecting its funds to the fraudster's bank account; this is known as account takeover.
The payment processor may also bear responsibility for a well-intentioned seller's poorly-managed business. Imagine a seller who sells concert tickets but is then unable to organize the concert. The buyers may be able to reverse their transactions, but the seller may already be out of funds. In this situation, the payment processor has effectively 'fronted' the payments to the seller and ends up assuming the credit risk.
A buyer's request to reverse a transaction is known as a chargeback. It’s actually more akin to a ‘mini court case’ to determine which party should be on the hook for the loss. The issuing bank will ask the buyer for more details, then the acquiring bank will ask the seller for more details, and finally a determination will be made (who decides often depends on the type of dispute).
Applying ML
Models
Consider another hypothetical scenario: you start a payment processing company. Initially everything proceeds smoothly, but you soon notice fraudsters processing transactions with stolen credit cards. You hire someone to lead the operations team, going through every payment to make sure it’s legitimate. Over time, your payment volume increases. To manage the load, you hire a software engineer to create a case queue that flags payments above $1,000 from IP addresses outside the US for review.
Though this decision tree is straightforward to understand and implement, it can easily be bypassed. Fraudsters may uncover the $1,000 threshold and respond by processing $999 payments. In addition, what was thought of as a fraud signal may not actually be one. For example, instead of the IP address being the red flag, it could actually be the language setting of the web browser.
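The rule above, and how easily it is sidestepped, can be sketched in a few lines. The Payment class and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

AMOUNT_THRESHOLD_USD = 1_000


@dataclass
class Payment:
    amount_usd: float
    ip_country: str  # hypothetical field: country code resolved from the IP address


def needs_review(payment: Payment) -> bool:
    """Flag payments above $1,000 that come from IP addresses outside the US."""
    return payment.amount_usd > AMOUNT_THRESHOLD_USD and payment.ip_country != "US"


print(needs_review(Payment(1_500, "FR")))  # True: sent to the case queue
print(needs_review(Payment(999, "FR")))    # False: slips just under the threshold
```

A hard-coded rule like this has exactly one boundary to discover, which is what makes the $999 workaround possible.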
This is where machine learning (ML) models can help. First, instead of a ‘review’ or ‘no review’ determination, the model returns a score between 0 (not fraud) and 1 (fraud). We can set up the case queue so that payments with a score above a certain threshold are reviewed; the threshold is often chosen based on review capacity and risk tolerance. Second, the model can take many features as input, and it uses historical data to assign higher weights to the more predictive ones.
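Here is a minimal sketch of both ideas: a score between 0 and 1 computed from weighted features, and a threshold derived from how many cases the ops team can review. The feature names, weights, and bias are invented for illustration; in practice they would be learned from historical data, and the logistic form is just one common choice.

```python
import math

# Hypothetical weights: a real model would learn these from historical data.
WEIGHTS = {"amount_over_1000": 1.2, "ip_outside_us": 0.8, "browser_lang_mismatch": 2.0}
BIAS = -3.0


def fraud_score(features: dict) -> float:
    """Map feature values to a score strictly between 0 (not fraud) and 1 (fraud)."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def threshold_for_capacity(scores: list, capacity: int) -> float:
    """Return the lowest score still reviewed if ops can handle `capacity` cases."""
    if capacity < 1 or not scores:
        raise ValueError("need at least one score and a capacity of at least 1")
    ranked = sorted(scores, reverse=True)
    return ranked[min(capacity, len(ranked)) - 1]


scores = [0.1, 0.9, 0.4, 0.7]
threshold = threshold_for_capacity(scores, capacity=2)
to_review = [s for s in scores if s >= threshold]
print(threshold, to_review)  # 0.7 [0.9, 0.7]
```

Unlike the fixed $1,000 rule, no single feature decides the outcome, and the cutoff can move as review capacity or risk tolerance changes.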

Concerning the trade-offs between various types of models, logistic regression models are often more 'explainable' due to the direct relationship between the model weights and the likelihood of fraud. There are also ‘black box’ models like random forests, which employ a ‘wisdom of crowds’ approach by training a ‘forest’ of decision trees and combining their predictions. These are harder to understand but often have better predictive performance.
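To make the 'explainable' point concrete: in a logistic regression, each weight is the change in the log-odds of fraud per unit increase in its feature, so exponentiating a weight gives an odds ratio. A minimal sketch with made-up weights (the feature names and values are hypothetical, not taken from a trained model):

```python
import math

# Hypothetical learned weights for two binary features.
weights = {"ip_outside_us": 0.8, "browser_lang_mismatch": 2.0}


def odds_ratio(weight: float) -> float:
    """Factor by which the odds of fraud are multiplied when the feature goes from 0 to 1."""
    return math.exp(weight)


for name, weight in weights.items():
    print(f"{name}: odds of fraud multiplied by {odds_ratio(weight):.2f}")
```

A random forest has no equivalent per-feature number to read off, which is part of why it is treated as a black box.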
Metrics
The ‘precision’ of each model can be thought of as the number of cases reviewed by the ops team that end up being blocked, divided by the total number of cases the model created. The ‘recall’ is the total dollar amount of blocked payments for a single model divided by the total dollar amount across all blocked payments. Tracking precision alone may end up solving for risky but low-dollar payments, so we want to optimize for both.
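These two definitions can be sketched as follows. The Case class, its fields, and the sample numbers are all hypothetical, and the sketch assumes every blocked payment originated from some model's case.

```python
from dataclasses import dataclass


@dataclass
class Case:
    model: str         # hypothetical: name of the model that created the case
    amount_usd: float
    blocked: bool      # outcome of the ops team's review


def precision(cases: list, model: str) -> float:
    """Blocked cases for a model divided by all cases that model created."""
    own = [c for c in cases if c.model == model]
    return sum(c.blocked for c in own) / len(own) if own else 0.0


def dollar_recall(cases: list, model: str) -> float:
    """Blocked dollars for a model divided by blocked dollars across all models."""
    blocked = [c for c in cases if c.blocked]
    total = sum(c.amount_usd for c in blocked)
    own = sum(c.amount_usd for c in blocked if c.model == model)
    return own / total if total else 0.0


cases = [
    Case("A", 50, True), Case("A", 20, True), Case("A", 500, False),
    Case("B", 900, True), Case("B", 100, False),
]
for m in ("A", "B"):
    print(m, round(precision(cases, m), 2), round(dollar_recall(cases, m), 2))
```

In this made-up data, model A has the higher precision but accounts for only a small share of the blocked dollars, which is the failure mode of tracking precision alone.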
Another consideration that makes the optimization process tricky is ‘leading’ vs ‘lagging’ indicators. Chargebacks are our ‘ground truth’ but can take up to three months from the payment time to arrive; this is our lagging indicator. Having a ‘human in the loop’ workflow helps provide a leading indicator, especially against fraud rings. The complication is that payments blocked by the operations team would not necessarily have resulted in chargebacks, so the leading and lagging indicators can disagree.
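One way to handle the lag when assembling labels is to treat a payment as clean only once the chargeback window has passed. The sketch below assumes a 90-day window as a stand-in for "up to 3 months"; the label names and function are hypothetical.

```python
from datetime import date, timedelta

CHARGEBACK_WINDOW = timedelta(days=90)  # assumption: roughly three months


def label_for(payment_date: date, had_chargeback: bool, blocked_by_ops: bool, today: date) -> str:
    """Assign a training label, separating lagging, leading, and immature outcomes."""
    if had_chargeback:
        return "chargeback"   # lagging indicator: the ground truth
    if blocked_by_ops:
        return "ops_blocked"  # leading indicator: not confirmed by a chargeback
    if today - payment_date >= CHARGEBACK_WINDOW:
        return "clean"        # window has passed with no chargeback
    return "pending"          # too recent to know yet


today = date(2024, 6, 1)
print(label_for(date(2024, 1, 1), False, False, today))  # clean
print(label_for(date(2024, 5, 1), False, False, today))  # pending
print(label_for(date(2024, 5, 1), False, True, today))   # ops_blocked
```

Keeping ops blocks as their own label, rather than merging them with chargebacks, makes it possible to check later how often the two signals agree.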
Once there’s an effective baseline workflow, we can potentially layer on additional processes to automatically block payments that are extremely likely to be fraudulent. This helps free up ops team capacity. In addition, if we think of high-scoring payments as ‘bad’, then low-scoring payments can highlight ‘good’ customers we may want to do more business with. In other words, we get a credit-like model for free.
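A layered workflow like this can be sketched as a routing function with several thresholds. The cutoff values and route names below are placeholders, not recommendations; in practice they would be tuned against review capacity and risk tolerance.

```python
# Hypothetical cutoffs, for illustration only.
AUTO_BLOCK_THRESHOLD = 0.98
REVIEW_THRESHOLD = 0.70
LOW_RISK_THRESHOLD = 0.05


def route(score: float) -> str:
    """Decide what happens to a payment based on its model score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= AUTO_BLOCK_THRESHOLD:
        return "auto_block"        # extremely likely fraud: skip manual review
    if score >= REVIEW_THRESHOLD:
        return "review"            # goes to the ops team's case queue
    if score <= LOW_RISK_THRESHOLD:
        return "approve_low_risk"  # candidate 'good' customer
    return "approve"


for s in (0.99, 0.80, 0.30, 0.01):
    print(s, route(s))
```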
Further reading
For a more detailed breakdown of the different parties and transaction fees in a credit card transaction, this Bloomberg article has an excellent graphic. Patrick McKenzie’s Bits about Money newsletter is a great read for fraud (and payments more generally), as is Modern Treasury’s Learn site. For chargebacks, Square has a Chargebacks 101 post.
The follow-up post based on responses to questions on this post can be found here.