The future is fake: The rise of synthetic data in training AI models

Synthetic data provides a way to have unlimited amounts of perfectly labeled data at a fraction of the cost of manually labeled data. Collecting high-quality data from the real world can be complicated, costly, and time-consuming. Data can be particularly expensive to clean and label, and it may be biased depending on the methods used to collect and process it. Real-world data is not always accurate, or of high enough quality, to train AI models.

For example, real-world image data must be labeled by a human before it can be used to train an AI model. The labeling can depend on the person doing it (and how tired or hungover they are when they carry out said task), and it may be inaccurate or incomplete, which degrades the AI model’s performance and can be costly. In May 2022, gaming software company Unity lost $100 million, citing ‘bad data’ that corrupted an ML model and ultimately hurt its ad business.

Generating synthetic data can be particularly useful for training ‘edge cases’ in models. Such edge cases are abundant in autonomous vehicle (AV) computer vision training, and AI model training for these seldom-encountered scenarios is paramount for safety.

For example, training an AV model to react properly to a kangaroo crossing the road would require hours of high-quality, well-labeled video footage. Collecting and labeling that much high-quality data might not be feasible, so the footage can instead be simulated, for example using generative adversarial networks (GANs). The same video creation and data simulation can then be repeated for koalas, instead of collecting hours of new video footage.

Though this is a silly example, vast quantities of data in the ‘long tail’ are often needed to train AI models, and this data may not be readily available. Data engineers may spend valuable time extracting, organizing, and cleaning data from siloed platforms even though they cannot feasibly capture data on all real-world scenarios. Instead, synthetic data can be created algorithmically, using classical statistical models or traditional machine learning (ML). Because the labels are produced as part of generation, it does not require manual data labeling, which is often an expensive, time-consuming, and error-prone process.
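To make the "classical statistical model" route concrete, here is a minimal Python sketch. It assumes a single numeric feature per record and one Gaussian per label. The function names, the tiny "real" sample, and the height measurements are all made up for illustration and do not come from any particular product or dataset.

```python
import random
import statistics


def fit_per_class(real_rows):
    """Estimate the mean and standard deviation of the feature for each label."""
    by_label = {}
    for value, label in real_rows:
        by_label.setdefault(label, []).append(value)
    return {
        label: (statistics.mean(values), statistics.stdev(values))
        for label, values in by_label.items()
    }


def sample_synthetic(params, n_per_class, seed=0):
    """Draw labeled synthetic rows from the fitted per-class Gaussians."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        (rng.gauss(mean, std), label)
        for label, (mean, std) in params.items()
        for _ in range(n_per_class)
    ]


# Made-up "real" measurements (say, animal height in metres), illustration only.
real = [
    (1.6, "kangaroo"), (1.8, "kangaroo"), (1.7, "kangaroo"),
    (0.7, "koala"), (0.8, "koala"), (0.75, "koala"),
]
params = fit_per_class(real)
synthetic = sample_synthetic(params, n_per_class=1000)
```

Every synthetic row carries its label from the moment it is sampled, which is why no separate labeling pass is needed. Real image or video generation is far more involved than this, but the principle is the same.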

However, synthetic data may be biased depending on the algorithms used to create it. It is most likely that deployments using synthetic data will use a mix of synthetic and real-world data to train and test models.

New formats of synthetic data will emerge in the coming years. There is a notable gap in the audio market, although research papers do exist on the matter. High-dimensional data, such as that collected on myriad Internet of Things (IoT) devices, will likely also be an area of active research.

Questions to be answered in the coming years include how high-fidelity synthetic data has to be in order to be useful, where fidelity means how similar a synthetic dataset is to a real dataset. Before that can be answered, ways of measuring and quantifying fidelity must be designed.
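One simple candidate for quantifying fidelity on a single numeric column is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distributions of the real and synthetic values. The sketch below is an illustration of that one metric, written with the Python standard library, and is not a settled industry measure. A value of 0 means the two samples have identical empirical distributions and 1 means they do not overlap at all.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two numeric samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


identical = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A per-column check like this ignores relationships between columns, so it can only be one part of any fuller fidelity measure.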

Users of synthetic data will then look to maximize fidelity while preserving privacy. If synthetic data is too similar to real-world data, sensitive information may be given away, such as medical records or financial transactions. Health insurer Anthem has partnered with Google Cloud to generate synthetic patient medical data to address data privacy concerns, which have hitherto held back AI and data analytics deployment in healthcare.
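The "too similar" risk can be checked in a rough way by measuring how close each synthetic record sits to its nearest real record. The Python sketch below assumes records are tuples of numbers on comparable scales. The function name and the threshold are hypothetical choices for illustration, and passing this check is not a formal privacy guarantee.

```python
import math


def too_close(real_rows, synthetic_rows, threshold):
    """Return the synthetic rows that lie within `threshold` of some real row."""
    flagged = []
    for synth in synthetic_rows:
        # Euclidean distance to the closest real record.
        nearest = min(math.dist(synth, real) for real in real_rows)
        if nearest < threshold:
            flagged.append(synth)
    return flagged


# Made-up two-column records, illustration only.
real_rows = [(0.0, 0.0), (10.0, 10.0)]
synthetic_rows = [(0.01, 0.0), (5.0, 5.0)]
flagged = too_close(real_rows, synthetic_rows, threshold=0.1)
```

Here the first synthetic record is flagged because it is nearly a copy of a real one, while the second is far from both real records. Raising the threshold trades fidelity for privacy, which is the tension described above.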

As synthetic data makes quality training data more accessible and affordable, the value of synthetic data may outstrip that of messy real-world data, which often requires significant processing. Though it is unlikely that synthetic data will pose an existential threat to the likes of Google, Meta, and Amazon—which derive much of their profit from the huge datasets they collect, own, and monetize—the desire for synthetic data is real.
