Researchers in the University of York's Department of Sociology will lead one of the first large-scale, systematic social science studies of synthetic data.
 A new project will investigate the societal effects of synthetic data.
Synthetic data are information generated by machine learning algorithms and AI models such as Gemini and GPT-4. They are increasingly used to fill the gaps in real-world data and are often seen as a solution to some of the challenges presented by AI development, such as lack of diversity and representation and issues of privacy and confidentiality.
Although already in use within sectors such as healthcare, finance, biometrics and surveillance, their societal effects are currently not well understood and important ethical and political questions have yet to be addressed.
Consequences
To shed new light on the subject, the European Research Council-funded SYNDATA project, led by Dr Benjamin Jacobsen, will look at the practical and political consequences of using synthetic data. As one of the first large-scale social science studies of this kind, the project aims to generate new knowledge on how synthetic data are transforming AI and society.
Dr Jacobsen said: "A crucial part of the allure of synthetic data is how they promise to address some of the ethical issues associated with the extraction of real-world data and challenges associated with large training datasets, such as class imbalance and lack of racial and gendered representation. If something cannot be found or collected in the real world, it can be generated via algorithms.
"However, this has significant and disruptive ethical implications, because synthetic data intervene in our understanding of long-standing issues such as bias, fairness and algorithmic injustice."
Under-researched
While there is a substantive literature about the effects of algorithms on society, the area of synthetic data remains both under-researched and under-theorised. For example, what happens when you can generate synthetic data of minority populations to make your algorithm less biased? And, what happens to data privacy when algorithms can be trained on data of people that are not real?
By examining how they are produced, what kinds of people or groups they depict and how they challenge or reinforce existing power structures, the SYNDATA project will investigate how synthetic data, algorithms and AI models shape society.
With recent developments in generative AI models, it has never been easier to generate realistic synthetic data at scale. Synthetic data are likely to become an issue not only for regulators but also for how we think about the ethics of data and algorithms on a global scale.
Pressing questions
To answer these pressing questions, the project will combine archival research, fieldwork and case studies, covering both historical predecessors of synthetic data and defining studies of the different areas where they are currently being generated.
Dr Jacobsen said: "By developing new ways of thinking about the ethics of data in an age where the line between the 'real' and the 'synthetic' is increasingly blurred, the SYNDATA project will shed light on both the contemporary and future use of AI data."
Further information:
The Ethics of Synthetic Data in the Age of Machine Learning and AI is funded by the European Research Council and will start in January 2026.