Synthetic Data
Definition
Artificially generated data that reproduces statistical or structural properties of real data — produced by language models, generative image models, simulations or rule-based methods. Used for training, augmentation, testing or privacy-compliant analytics.
Noise — Signal
Synthetic data is sold as a way out of data scarcity and privacy problems. Both claims are only partially true. Synthetic data inherits the biases of its generator and rarely captures the edge cases that occur in real operation: it is a feedback amplifier for known patterns, not a generator of unknown ones. Under data-protection law it counts as non-personal only if re-identification is demonstrably ruled out, and many implementations do not deliver that proof.
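One common first screen for the re-identification concern is to check whether any synthetic record sits on top of, or very close to, a real one. The sketch below is only an illustration under stated assumptions: the toy records, the function names and the threshold of 0.5 are hypothetical, and the features are assumed to be numeric and already on comparable scales. Passing such a check is a necessary condition at best. It does not amount to proof that re-identification is ruled out.

```python
import math

def nearest_distance(row, others):
    """Euclidean distance from `row` to its closest neighbour in `others`."""
    return min(math.dist(row, other) for other in others)

def near_copies(synthetic, real, threshold):
    """Return the synthetic rows that lie within `threshold` of some real row."""
    return [row for row in synthetic if nearest_distance(row, real) <= threshold]

# Hypothetical toy records: (age, income in thousands).
real = [(34.0, 52.0), (51.0, 88.0), (29.0, 41.0)]
synthetic = [(34.0, 52.0), (45.0, 70.0), (29.1, 41.2)]

# The threshold is an assumption; in practice it has to be justified per data set.
flagged = near_copies(synthetic, real, threshold=0.5)
print(flagged)  # the exact copy and the near copy are flagged
```

Flagged rows are candidates for removal or for a closer look at the generator, which may be memorising training records instead of modelling their distribution.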
The right question
Not: "Can we train the model on synthetic data?" But: "Which gap in our real data set should the synthetic augmentation close, how do we validate that it does, and what privacy and audit evidence do we need for the synthetic data to hold up under regulatory scrutiny?"
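The validation part of that question can be made concrete: name the gap as a slice of real held-out data, then compare a model trained on real data alone with one trained on real plus synthetic data, scoring both on real records only. The following is a minimal sketch, not a recommended method. The data is invented, the nearest-centroid classifier is a stand-in chosen for brevity, and all names are hypothetical.

```python
from collections import defaultdict

def fit_centroids(rows):
    """Fit a nearest-centroid classifier: mean feature value per label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in rows:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def accuracy(centroids, rows):
    """Share of rows whose predicted label matches the true label."""
    return sum(predict(centroids, x) == label for x, label in rows) / len(rows)

# Hypothetical data: (sensor reading, label). Real training data has one fault only.
real_train = [(1.0, "ok"), (2.0, "ok"), (3.0, "ok"), (9.0, "fault")]
synthetic_faults = [(5.2, "fault"), (6.0, "fault")]

# Real held-out records, split into the targeted gap and everything else.
holdout_gap = [(5.0, "fault"), (5.4, "fault")]
holdout_rest = [(1.5, "ok"), (2.5, "ok"), (3.5, "ok")]

baseline = fit_centroids(real_train)
augmented = fit_centroids(real_train + synthetic_faults)

report = {
    "gap_before": accuracy(baseline, holdout_gap),
    "gap_after": accuracy(augmented, holdout_gap),
    "rest_before": accuracy(baseline, holdout_rest),
    "rest_after": accuracy(augmented, holdout_rest),
}
print(report)
```

The design choice that matters is that synthetic records never enter the evaluation sets. The augmentation counts as closing the gap only if the score on the real gap slice improves while the score on the remaining real data does not regress.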