The real promise of synthetic data

Massachusetts Institute of Technology
MIT researchers release the Synthetic Data Vault, a set of open-source tools meant to expand data access without compromising privacy.


After years of work, MIT’s Kalyan Veeramachaneni and his collaborators recently unveiled a set of open-source data generation tools – a one-stop shop where users can get as much data as they need for their projects, in formats from tables to time series. They call it the Synthetic Data Vault.

Image: Arash Akhgari

Each year, the world generates more data than the previous year. In 2020 alone, an estimated 59 zettabytes of data will be “created, captured, copied, and consumed,” according to the International Data Corporation – enough to fill about a trillion 64-gigabyte hard drives.
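The hard-drive comparison can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the article does not specify: one zettabyte as 10^21 bytes and one gigabyte as 10^9 bytes. The variable names are illustrative only.

```python
# Assumption: decimal (SI) units, since the article does not say which it uses.
ZETTABYTE = 10**21  # bytes
GIGABYTE = 10**9    # bytes

total_bytes = 59 * ZETTABYTE   # IDC estimate quoted above
drive_bytes = 64 * GIGABYTE    # capacity of one 64-gigabyte drive

# Number of 64 GB drives needed to hold 59 ZB.
drives_needed = total_bytes / drive_bytes
print(f"{drives_needed:.3e} drives")
```

Under those assumptions the result comes out a little under one trillion drives, consistent with the "about a trillion" figure.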

But just because data are proliferating doesn’t mean everyone can actually use them. Companies and institutions, rightfully concerned with their users’ privacy, often restrict access to datasets – sometimes within their own teams. And now that the Covid-19 pandemic has shut down labs and offices, preventing people from visiting centralized data stores, sharing information safely is even more difficult.

Without access to data, it’s hard to make tools that actually work. Enter synthetic data: artificial information developers and engineers can use as a stand-in for real data.
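To make the idea concrete, here is a minimal sketch of one very simple way to produce stand-in data: estimate a mean and standard deviation for each numeric column of a small table, then draw new rows from those distributions. This is not the Synthetic Data Vault's method or API. The function names (`fit_columns`, `sample_rows`) and the toy table are hypothetical, invented for illustration, and the approach ignores relationships between columns and offers no formal privacy guarantee.

```python
import random
import statistics


def fit_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    params = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        params[name] = (statistics.mean(values), statistics.stdev(values))
    return params


def sample_rows(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        {name: rng.gauss(mu, sigma) for name, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


# Hypothetical "real" table: toy values made up for this example.
real_rows = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": 61000},
    {"age": 29, "income": 48000},
    {"age": 52, "income": 75000},
]

params = fit_columns(real_rows)
synthetic_rows = sample_rows(params, n=5)

for row in synthetic_rows:
    print(row)
```

The synthetic rows have the same columns and roughly the same per-column statistics as the original table, but none of them corresponds to an actual record. Real tools have to do considerably more, such as preserving correlations between columns and handling non-numeric data.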
