Editor’s note: Alex & Matt are speakers for ODSC East 2022. Be sure to check out their talk, “Introducing Model Validation Toolkit,” there!
There are a number of tools and concepts that go into assuring a machine learning pipeline. We've built solutions to some of the most common problems we've encountered into the Model Validation Toolkit. In this tutorial, we'll share our approach to two of those problems: measuring concept drift and assessing credibility from sample size.
After deploying a model, there is no guarantee your training or validation set will still be representative of production data years down the road. Statistical differences in subsequent batches of data are known as concept drift. As a toy example, let’s say we have built a model that ingests stock prices. The stock market tanks months later, and all the prices average considerably lower than before. The model might be fine, but the training data is definitely distinguishable from what’s flowing through the production pipeline, so there’s a possibility that the model may no longer be valid.
This kind of warning can be useful before ground truth has been established. A simple approach to detecting this kind of drift might be to compare means of prices, but that wouldn’t necessarily catch a change in volatility. A more general approach is to try to find a summary statistic that reflects the greatest difference between the training (or validation) set and the production data. By constraining your model in different ways, you can explore different classes of summary statistics.
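To make that limitation concrete, here is a minimal sketch with made-up prices: two batches with identical means but different volatility. A mean comparison reports no drift at all, while a spread-based statistic immediately flags the change.

```python
import numpy as np

# Hypothetical price batches: same average price, different volatility.
train_prices = np.array([99.0, 100.0, 101.0])  # low volatility
prod_prices = np.array([90.0, 100.0, 110.0])   # high volatility

# Comparing means suggests nothing has changed...
mean_drift = abs(train_prices.mean() - prod_prices.mean())
print(mean_drift)  # 0.0

# ...but comparing standard deviations reveals the shift.
std_drift = abs(train_prices.std() - prod_prices.std())
print(std_drift)  # roughly 7.35
```

A learned summary statistic generalizes this idea: instead of hand-picking means or standard deviations, the model searches for whatever statistic separates the two datasets most.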
The greatest measurable difference after training is our measure of concept drift. This type of measurement is known as an integral probability measure: you train a model representing the learned summary statistic on a loss function given by the difference between its average output (perhaps over a minibatch) on each of the two datasets you wish to compare (e.g. training and production). Unfortunately, this kind of analysis is not coordinate invariant. If we simply changed the units of price from dollars to euros, we would get a correspondingly different measure of concept drift. And if we're assessing multiple features simultaneously (such as price and volume), features with larger values can have a disproportionate impact on our measure of drift.
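The coordinate dependence is easy to see with a toy mean-difference statistic and a made-up exchange rate: re-expressing the same prices in a different currency rescales the drift measure, even though nothing about the data has actually changed.

```python
import numpy as np

# Hypothetical prices (in dollars) for training vs. production batches.
train_usd = np.array([100.0, 102.0, 104.0])
prod_usd = np.array([90.0, 92.0, 94.0])

# A mean-difference "drift" statistic measured in dollars...
drift_usd = abs(train_usd.mean() - prod_usd.mean())

# ...shrinks if we merely re-express prices in euros
# (assuming an illustrative rate of 1 USD = 0.9 EUR).
rate = 0.9
drift_eur = abs((train_usd * rate).mean() - (prod_usd * rate).mean())

print(drift_usd, drift_eur)  # 10.0 vs. 9.0: same data, different drift score
```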
There is also a class of coordinate invariant measures of statistical divergence known as f-divergences that can be applied to this kind of problem. These effectively train a model to learn the likelihood ratio that a given sample was drawn from one or the other of your two datasets. A particularly interpretable example is the area (or hypervolume, for multiple features) between the pdfs of the two datasets' distributions.
This is typically divided by two so it varies between 0 (perfect overlap) and 1 (no common samples will ever be drawn). This is known as total variation. We have a simple but fast implementation of total variation in the Model Validation Toolkit built on a dense, fully connected neural network.
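As a rough illustration of the quantity being estimated (not the toolkit's neural-network implementation), total variation for a single feature can be approximated by halving the L1 distance between binned empirical densities:

```python
import numpy as np

def total_variation(xs, ys, bins=50):
    """Histogram approximation of total variation between two 1-D samples.

    Returns a value in [0, 1]: 0 for perfect overlap,
    1 when no common samples would ever be drawn.
    """
    lo = min(xs.min(), ys.min())
    hi = max(xs.max(), ys.max())
    p, _ = np.histogram(xs, bins=bins, range=(lo, hi), density=True)
    q, edges = np.histogram(ys, bins=bins, range=(lo, hi), density=True)
    width = edges[1] - edges[0]
    # Half the area between the two (binned) pdfs.
    return 0.5 * np.sum(np.abs(p - q)) * width

rng = np.random.default_rng(0)
same = total_variation(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = total_variation(rng.normal(0, 1, 10_000), rng.normal(5, 1, 10_000))
print(same, shifted)  # near 0 vs. near 1
```

The histogram version scales poorly to many features, which is one motivation for the neural estimator the toolkit uses instead.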
We have more information on integral probability measures, f-divergences, and more in the Model Validation Toolkit's Supervisor User Guide!
Based on a true story: a data scientist claims 100% recall but has only 7 positive examples in their dataset. You say the sample size is too small, but a director asks how big the sample needs to be to say anything conclusive. You know larger datasets are better, but you can never be 100% confident in a statistical measure. By the same token, a good performance measure on 7 samples isn't completely equivalent to taking no measurements at all, but it clearly doesn't carry much weight. In contrast to the frequentist school of thought, we will focus on making statements about the true (infinite sample limit) performance of the model given the validation set we do have, however small it may be.
Most performance measures are averages, and in classification problems you're almost certainly dealing with some ratio of counts. For example, accuracy measures the number of times the model was correct divided by the size of the dataset, and recall measures the number of times the model was right about an example being positive divided by the total number of positive examples. We can therefore treat these performance measures as the probability of some binary outcome (like a coin coming up heads). The Model Validation Toolkit's credibility submodule comes with tools for making statements about these types of performance measures. We could, for example, compute the probability that a model that correctly classified 7 of 7 positive examples in the validation set would in fact have a recall below 90% if we had an infinitely large number of positive examples to test it on as follows:
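The snippet below sketches that calculation with scipy rather than quoting the toolkit's exact API (see the Credibility User Guide for the real interface). With a uniform prior over true recalls, observing 7 successes and 0 failures gives a Beta(8, 1) posterior:

```python
from scipy.stats import beta

successes, failures = 7, 0  # 7 of 7 positive examples correctly recalled

# Uniform Beta(1, 1) prior over the true recall; the posterior is
# Beta(1 + successes, 1 + failures) = Beta(8, 1).
posterior = beta(1 + successes, 1 + failures)

# Probability the true (infinite-sample) recall is below 90%.
# For Beta(8, 1) this is simply 0.9 ** 8, about 0.43.
print(posterior.cdf(0.9))
```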
The result shows there is a 43% chance the true recall would in fact be below 90%. Quite high! This does make some assumptions under the hood (which you can adjust) about a hypothetical distribution of true recalls across all models floating around your company. By default, it assumes maximal uncertainty about such a distribution by modeling it as uniform. Check out the Model Validation Toolkit's Credibility User Guide for more!
Alex Eftimiades is a data scientist at FINRA. He applies machine learning and statistics to identify anomalous and suspicious trading and has helped to develop model validation procedures and tools. Alex originally studied physics and is passionate about applying math to solve real-world problems. He previously worked as a data engineer and as a software engineer.
Matthew Gillett is an Associate Director at FINRA who manages a team of Software Development Engineers in Test (SDET) across multiple projects. In addition to his primary focus in software development and assurance engineering, he also has an interest in various other technology topics such as big data processing, machine learning, and blockchain.