Just this past month, an article was shared showing that over 30% of the data used by Google for one of its shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data used by that model was full of mistakes as well. How could anyone using Google’s model ever hope to trust the results if it’s full of human-induced errors that computers can’t fix? And Google isn’t alone in major data mislabeling. An MIT study in 2021 found that almost 6% of the images in the industry-standard ImageNet database are mislabeled and, furthermore, found “label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets”. How can we hope to trust or use these models if the data used to train them is so bad?
The answer is that you can’t trust that data or those models. As AI goes, garbage in is most definitely garbage out, and AI projects are suffering from a significant amount of garbage data. If Google, ImageNet, and others are making this mistake, you are almost certainly making it as well. Research from Cognilytica shows that over 80% of AI project time is spent managing data, from collecting and aggregating that data to cleaning and labeling it. Even with all that time spent, mistakes are bound to happen, and that’s if the data is good quality to begin with. Bad data equals bad results. That's been the case for all sorts of data-oriented projects for decades, and now it’s a significant problem for AI projects as well, which are basically just big data projects.
Data quality is more than just “bad data”
Data is at the heart of AI. What drives AI and ML projects is not programmatic code, but rather the data from which learning must be derived. Far too often, organizations move too quickly with their AI projects, only to realize later that the poor quality of their data is causing their AI systems to fail. If your data is not in a good quality state, don’t be surprised when your AI projects are plagued with problems.
There is more to data quality than just “bad data” such as incorrect data labels, missing or erroneous data points, noisy data, or low quality images. Major data quality issues also emerge when you’re acquiring or merging data sets. They also arise when capturing the data and enhancing it with third-party data sets. Each of these actions, and more, introduces many potential sources of data quality issues.
So how do you assess the quality of your data before you even start your AI project? It’s important to evaluate the state of your data up front, so that you don’t move forward with your AI project only to realize too late that you don’t have the good quality data your project needs. Teams need to identify their data sources, such as streaming data, customer data, or third-party data, and then work out how to successfully merge and combine the data from these different sources. Unfortunately, most data doesn’t arrive in a clean, usable state. You need to remove extraneous data, incomplete data, duplicate data, or otherwise unusable data. You're also going to need to filter this data to help minimize bias.
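To make the cleanup step concrete, here is a minimal sketch of one such pass, assuming records arrive as simple dictionaries. The function name clean_records, the field names, and the sample rows are all hypothetical and chosen only for illustration; a real project would define "incomplete" and "duplicate" according to its own data.

```python
def clean_records(records, required_fields):
    """Drop incomplete and duplicate records, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for record in records:
        values = [record.get(field) for field in required_fields]
        # Incomplete: a required field is missing, None, or blank.
        if any(v is None or str(v).strip() == "" for v in values):
            continue
        # Duplicate: same required fields after trimming and lowercasing.
        key = tuple(str(v).strip().lower() for v in values)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical sample data for illustration only.
raw = [
    {"customer_id": "C1", "email": "ann@example.com"},
    {"customer_id": "C1", "email": "Ann@Example.com "},  # duplicate once normalized
    {"customer_id": "C2", "email": ""},                   # incomplete
    {"customer_id": "C3", "email": "bo@example.com"},
]
cleaned = clean_records(raw, ["customer_id", "email"])
print(f"kept {len(cleaned)} of {len(raw)} records")
```

Even a simple pass like this is worth running before any modeling starts, because the count of dropped records is itself an early signal of how much work the data will need.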
But we’re not done yet. You’ll also need to think about how data must be transformed to meet your specific requirements. How are you going to implement data cleansing, data transformation, and data manipulation? Not all data is created equal and, over time, you will have data decay and data drift.
Have you thought about how you are going to monitor and evaluate this data to make sure that its quality stays at the level you need? If you need labeled data, how are you getting that data? There are also data augmentation steps to consider. If you need to do additional data augmentation, how are you going to monitor that? Yes, there are a lot of steps involved in ensuring quality data, and these are all aspects you need to think about for your project to be successful.
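As one illustration of what monitoring for data drift can look like, the sketch below compares a new batch of a numeric feature against a baseline and reports how far the mean has moved, measured in baseline standard deviations. This is a deliberately simple check, not a standard or complete drift test. The function name drift_score, the threshold of 2.0, and the sample values are assumptions made for this example.

```python
import statistics


def drift_score(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline std devs."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has no variation to compare against")
    return abs(statistics.mean(current) - base_mean) / base_std


# Assumed alert threshold; tune it for your own data.
DRIFT_THRESHOLD = 2.0

# Hypothetical feature values for illustration only.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
stable_batch = [10.2, 9.8, 10.1, 9.9]
shifted_batch = [14.0, 15.0, 13.5, 14.5]

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    score = drift_score(baseline, batch)
    status = "DRIFT" if score > DRIFT_THRESHOLD else "ok"
    print(f"{name}: score={score:.2f} {status}")
```

A check like this only catches a shift in the average of a single feature. Changes in spread, in categorical values, or in the relationship between features need their own checks.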
Data labeling specifically is a common area where a lot of teams get stuck. For supervised learning approaches to work, they need to be fed good, clean, well-labeled data so that they can learn from example. If you're trying to identify images of boats in the ocean, then you need to feed the system good, clean, well-labeled images of boats to train your model. That way, when you feed it an image that it's never seen before, it can tell you with some high degree of certainty whether or not the image has a boat in it. If you’re only training your system with boats in the ocean on sunny days with no cloud coverage, then how is the AI system expected to react when it sees a boat at night or a boat with 50% cloud coverage? If your training data does not match real-world data or real-world scenarios, then you’re going to have a problem.
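The boat example suggests a simple check you can run before training: count how many training images you have for each condition you expect to see in the real world, and flag the ones that are thin or missing. The sketch below assumes each image already carries a condition tag. The function name underrepresented, the 10% minimum share, and the condition names are hypothetical.

```python
from collections import Counter


def underrepresented(conditions, expected, min_share=0.10):
    """Return expected conditions that make up less than min_share of the data."""
    counts = Counter(conditions)
    total = len(conditions)
    return sorted(
        condition
        for condition in expected
        if total == 0 or counts[condition] / total < min_share
    )


# Hypothetical condition tags for 100 training images of boats.
training_conditions = ["sunny"] * 90 + ["cloudy"] * 8 + ["night"] * 2
# Conditions we assume the deployed system will encounter.
expected_conditions = ["sunny", "cloudy", "night", "fog"]

gaps = underrepresented(training_conditions, expected_conditions)
print("Conditions needing more training data:", gaps)
```

Here the check flags cloudy, fog, and night, which is exactly the situation described above: a model trained almost entirely on sunny-day boats.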
Even when teams spend a lot of time making sure their test data is perfect, the quality of their training data often doesn’t mirror real-world data. In a publicly documented example, AI industry leader Andrew Ng discussed how, in his project with Stanford Health, the quality of the data in his test environment didn’t match the quality of medical images in the real world, rendering his AI models useless outside of the test environment. This caused the entire project to basically stall and fail, jeopardizing millions of dollars and years of investment.
All of this data quality-centric activity may seem overwhelming, which is why these steps are often skipped. But as stated above, bad data is what’s killing AI projects, so not paying attention to these steps is a major cause of overall AI project failure. This is why organizations are increasingly adopting best-practice approaches such as CRISP-DM, Agile, and CPMAI to ensure that they are not missing or skipping the crucial data quality steps that will help avoid AI project failure.
It is all too common for teams to move forward without planning for project success. Indeed, the second and third phases of both the CRISP-DM and CPMAI methodologies are “Data Understanding” and “Data Preparation”. These phases come before any model building begins, and following them is considered a best practice for organizations looking to succeed with AI.
Indeed, if the Stanford medical project had adopted CPMAI or a similar approach, its team would have realized well before the million-dollar, multi-year mark that data quality issues would sink the project. While it might be comforting to realize that even luminaries like Andrew Ng and companies like Google are making significant data quality mistakes, you still don’t want to needlessly be part of that club and let data quality issues plague your AI projects.