What is synthetic data?

Synthetic data

synthetic data, artificial data, generated data

Synthetic data consists of artificially generated examples used to train or evaluate models when real data is scarce or sensitive. It needs quality control, because it can reproduce and amplify the flaws of its source.

It is produced artificially rather than collected from reality, by anything from rules and simulations to other models.
It's used when real data is too scarce or too sensitive to use directly.
Its quality has to be verified, because it can reproduce and amplify the errors and bias of the source data.

Synthetic data consists of examples produced artificially rather than collected from real events. It can be created in several ways: through rules and simulations, by transforming existing records, or by generating new examples with another model. What these methods have in common is that the resulting examples don't come directly from real users or real measurements.
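The rule-based route is the simplest to illustrate. The following is a minimal sketch, not a reference implementation: the function name `make_synthetic_orders`, the field names, and the value ranges are all hypothetical and chosen only for the example.

```python
import random


def make_synthetic_orders(n, seed=0):
    """Generate n rule-based synthetic order records (illustrative fields only)."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for i in range(n):
        quantity = rng.randint(1, 5)                   # assumed range: 1 to 5 items
        unit_price = round(rng.uniform(5.0, 50.0), 2)  # assumed price range
        rows.append({
            "order_id": f"SYN-{i:05d}",                # prefix marks the record as synthetic
            "quantity": quantity,
            "unit_price": unit_price,
            "total": round(quantity * unit_price, 2),  # rule: total follows from the other fields
        })
    return rows


orders = make_synthetic_orders(3)
```

Marking every record as synthetic (here with the `SYN-` prefix) makes it easier to keep generated examples separate from real ones later on.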

It's most often used for two reasons. First, when training data is simply lacking: for instance, rare cases that appear too few times in a real dataset for a model to learn them. Second, when the real data is sensitive: synthetic data lets you build and test solutions without exposing personal data, which makes it directly relevant to data privacy in AI. It's also sometimes used for fine-tuning a model for a specific, narrow task.

The main risk is that synthetic data inherits the flaws of its source. If it's generated by a model that itself has gaps or bias, the synthetic dataset can reproduce those flaws and even amplify them. That's why, in a deployment, it's treated as a supplement to real data rather than a substitute for it, and it's checked to confirm that it matches the actual distribution of the cases the model is meant to handle.
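One simple way to run such a check on a categorical field is to compare how often each label appears in the real and synthetic sets. The sketch below uses total variation distance for this; it is one possible measure among several, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def label_shares(labels):
    """Return the share of each label in a non-empty list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def total_variation_distance(real_labels, synthetic_labels):
    """0.0 means identical label shares, 1.0 means no overlap at all."""
    real = label_shares(real_labels)
    synth = label_shares(synthetic_labels)
    all_labels = set(real) | set(synth)
    return 0.5 * sum(abs(real.get(l, 0.0) - synth.get(l, 0.0)) for l in all_labels)


# Illustrative data: the synthetic set over-represents the rare "fraud" label.
real = ["ok"] * 90 + ["fraud"] * 10
synthetic = ["ok"] * 60 + ["fraud"] * 40
distance = total_variation_distance(real, synthetic)  # roughly 0.3
```

A large distance does not always mean the synthetic set is wrong, since rare cases are sometimes over-sampled on purpose, but it does show that the mismatch should be a deliberate choice rather than an accident.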
