Database Input Rosetta Stone Uncovers Major Security Flaw

The data inputs that enable modern search and recommendation systems were thought to be secure, but an algorithm developed by Cornell Tech researchers successfully teased out names, medical diagnoses and financial information from encoded datasets.

People are able to search large databases because an encoder has transformed each piece of data into an "embedding" - a series of numbers representing the meaning of the text, image, sound recording or any other type of information. The new algorithm, called vec2vec, can translate databases of text embeddings back into English - with no knowledge of the original data or how it was encoded. Until recently, companies had assumed these embeddings were as good as encrypted.
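The mapping from text to a vector of numbers can be illustrated with a toy example. The bag-of-words encoder below is a hypothetical, minimal stand-in for the neural encoders discussed in the article; it only shows the core idea that texts with similar meanings land near each other in embedding space.

```python
# Toy illustration of an encoder mapping text to an "embedding".
# Real encoders are neural networks producing dense vectors; this
# bag-of-words sketch only demonstrates the text -> vector idea.
import numpy as np

VOCAB = ["patient", "diagnosis", "invoice", "payment", "meeting", "lunch"]

def embed(text: str) -> np.ndarray:
    """Map text to a unit vector: one dimension per vocabulary word."""
    words = text.lower().split()
    vec = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Similar texts produce nearby vectors (high cosine similarity).
a = embed("patient diagnosis report")
b = embed("diagnosis for the patient")
c = embed("lunch meeting invoice")
print(np.dot(a, b))  # close to 1.0: similar meaning
print(np.dot(a, c))  # 0.0: unrelated meaning
```

Search over such a database then reduces to finding stored vectors closest to the query's vector.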

"Everybody should think of these embeddings as being as sensitive as the underlying text," said senior author Vitaly Shmatikov, professor of computer science in the Cornell Ann S. Bowers College of Computing and Information Science and at Cornell Tech. "Trusting anyone with your embeddings is the same as trusting them with your data."

The researchers presented the study, "Harnessing the Universal Geometry of Embeddings," at the Annual Conference on Neural Information Processing Systems on Dec. 5 in San Diego.

Taken together, the embeddings in a database form a coded map of the input data used for search. Each model encodes embeddings differently, however, so embeddings couldn't previously be translated across models.

With vec2vec, a person can translate embeddings from any model into a universal representation - essentially creating a Rosetta Stone for the different models. Like the original Rosetta Stone, which carried the same message in Ancient Greek, Egyptian hieroglyphs and Demotic script, this universal representation allows anyone to convert embeddings between models.
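The idea that different models' embedding spaces share a common geometry can be sketched numerically. Note this is a deliberate simplification: vec2vec learns its translation *without* any paired examples, whereas the sketch below cheats by using paired embeddings and a supervised orthogonal-Procrustes fit, purely to illustrate that one space can be mapped onto another.

```python
# Simplified sketch: aligning two embedding spaces with a linear map.
# vec2vec does this unsupervised; here we use paired examples and
# orthogonal Procrustes as an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "model A" embeddings for 200 items, 16 dimensions each.
emb_a = rng.normal(size=(200, 16))

# "Model B" encodes the same items; simulate its space as a rotation
# of A's (the shared-geometry assumption) plus a little noise.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
emb_b = emb_a @ q + 0.01 * rng.normal(size=(200, 16))

# Fit the translation A -> B via orthogonal Procrustes (SVD).
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
w = u @ vt

translated = emb_a @ w
err = np.linalg.norm(translated - emb_b) / np.linalg.norm(emb_b)
print(f"relative alignment error: {err:.3f}")  # small: spaces align
```

When the two spaces really do share a geometry, the learned map translates embeddings almost perfectly, which is what makes a universal representation possible.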

Additionally, translating embeddings to a known model makes it possible to approximate their original meanings. This creates a new security risk: If a database of embeddings is compromised, the original inputs can be partially recovered.
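The recovery risk described above can be sketched with the toy encoder again. The vec2vec-style attacks in the study reconstruct text directly; this hypothetical nearest-neighbor version only shows the weaker point that leaked embeddings can be matched against embeddings of guessed texts to infer content.

```python
# Sketch of the security risk: given leaked embeddings, an attacker can
# compare them against embeddings of candidate texts to infer content.
# (A minimal, hypothetical illustration; the actual attacks go further.)
import numpy as np

VOCAB = ["patient", "diagnosis", "invoice", "payment", "meeting", "lunch"]

def embed(text: str) -> np.ndarray:
    words = text.lower().split()
    vec = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Candidate texts the attacker guesses might be in the database.
candidates = [
    "patient diagnosis summary",
    "invoice payment due",
    "lunch meeting schedule",
]

# A "leaked" embedding; the attacker never sees the original text.
leaked = embed("payment of the invoice")

# Score each guess by cosine similarity to the leaked embedding.
scores = [float(np.dot(leaked, embed(c))) for c in candidates]
best = candidates[int(np.argmax(scores))]
print(best)  # → "invoice payment due"
```

Even this crude matching reveals the topic of the leaked record, which is why the researchers argue embeddings should be treated as sensitively as the text itself.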

The team demonstrated that vec2vec successfully extracted information from several embedding databases. They recovered the topics of tweets, medical conditions from a set of anonymized hospital records, and the content of emails - including names, dates and financial information - from the now-defunct Enron Corporation.

The algorithm couldn't translate the embeddings word for word. "There is some distortion," Shmatikov said. "We don't recover everything, but at least you get the gist of it." Even so, it recovered lunch orders from emails and medical symptoms as specific as "alveolar periostitis."

Co-author John Morris, Ph.D. '25, had previously shown that if you know how an encoder creates the embeddings, you can use that knowledge to extract the original meaning.

"This work shows that, even if you don't know the encoder, by just having a bunch of these embeddings, you can translate them," said co-author Collin Zhang, a doctoral student in the field of computer science.

The discovery may explain a common observation: When you ask the same question to multiple AI chatbots - which use embeddings to provide answers - they often give highly similar responses, even though they were created by different companies with different training data. Many AI researchers have long suspected that all of the large language models that power these chatbots share a universal underlying structure because they are each encoding the same concepts from human language - essentially creating their own version of the same collection of embeddings.

"All these different models are kind of reinventing the same thing," said Morris. "I think our work validated a lot of people's beliefs about that."

This advance also has interesting technological applications, said co-author Rishi Jha, a doctoral student in the field of computer science. For example, if someone has an encoder that functions in only one language, vec2vec can create a translation to allow it to function in multiple languages - or even different data formats. "A text encoder can now interface with images or can interface with audio in an elegant way," Jha said.

Theoretically, vec2vec may even be able to help with more extreme types of translation. The research team is looking into the possibility of translating whale noises into human text. Currently this idea is purely theoretical, Morris said, but "if we kept doing this for real, that would be the best possible outcome of our work."

This research received support from the Google Cyber NYC Institutional Research Program and the National Science Foundation.

Patricia Waldron is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.
