Privacy-Preserving Data Collection and Analysis with Missing Values

The University of Electro-Communications

To control pandemics like the novel coronavirus infection (COVID-19), data such as the age, gender, family composition, and medical history of infected individuals are required. While patients themselves may provide this information to medical institutions, these details are highly confidential. If the data is properly handled for privacy protection, it can be shared with researchers worldwide without identifying the infected individual, which can help clarify the state of the pandemic and more accurately predict its progression.

Information provided by patients may contain missing values, but existing methods for collecting personal data under privacy guarantees do not take missing values into account. This significantly reduces the accuracy of data analysis.

Differential Privacy, the privacy protection metric addressed in this paper, is adopted by many organizations, including Apple, Google, Microsoft, and LINE. Numerous methods have been proposed to collect and analyze personal data based on Differential Privacy. However, none of the existing methods take into account the presence of missing values. When considering medical data, as during the COVID-19 pandemic, it is conceivable that different hospitals can obtain different information, and many patients may feel comfortable providing only some data after privacy protection processing. Under the current methodology, the accuracy of analysis is greatly reduced in such scenarios, which has prevented sufficient data analysis for pandemic mitigation.
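To make the idea of collecting data under Differential Privacy more concrete, the following is a minimal sketch of randomized response, a textbook mechanism for privately reporting a single yes/no attribute. It is an illustration only and is not the method from this research; the function names, the privacy parameter `epsilon`, and the example attribute are assumptions chosen for the sketch. It also assumes every person reports a value, which is exactly the assumption that breaks down when data has missing values.

```python
import math
import random


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize_bit(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    if rng.random() < truth_probability(epsilon):
        return true_bit
    return 1 - true_bit


def estimate_proportion(reports: list, epsilon: float) -> float:
    """Unbiased estimate of the true share of 1s, computed from noisy reports only.

    If pi is the true share and p the truth probability, the expected share of
    reported 1s is pi * p + (1 - pi) * (1 - p); solving for pi gives the formula.
    """
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

In use, each individual would call `randomize_bit` on their own device and send only the noisy bit; the analyst would then call `estimate_proportion` on the collected reports. No single report reveals the individual's true answer with certainty, yet the aggregate share can still be estimated.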

Professor Sei has demonstrated that the copula model, used primarily in finance, can restore the true statistical model from data processed with Differential Privacy even when many values are missing, enabling highly accurate data analysis. He also proves mathematically that each individual's privacy is protected at exactly the same level as with existing methods. Real-world data typically contains various missing elements. With the proposed method, not only medical information but also a wide range of societal and personal information with missing values can be analyzed safely and with high accuracy. This research is therefore expected to have a significant impact on society.
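As background on what a copula model does, the sketch below samples from a two-dimensional Gaussian copula, one common member of the copula family. It is an illustration under stated assumptions and not the procedure from this research: the function names, the correlation parameter `rho`, and the example marginals are hypothetical. The point it shows is that a copula describes the dependence between attributes separately from the distribution of each attribute on its own.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used only for its CDF


def sample_gaussian_copula(rho: float, n: int, rng: random.Random) -> list:
    """Draw n pairs (u1, u2), each uniform on [0, 1], linked by a Gaussian copula.

    rho is the correlation of the underlying normal variables (-1 < rho < 1).
    """
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal variable with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Mapping each normal value through its CDF yields uniform marginals.
        pairs.append((_STD_NORMAL.cdf(z1), _STD_NORMAL.cdf(z2)))
    return pairs


def to_marginals(pairs: list, inv_cdf_a, inv_cdf_b) -> list:
    """Turn copula samples into attribute values using each attribute's inverse CDF."""
    return [(inv_cdf_a(u1), inv_cdf_b(u2)) for u1, u2 in pairs]
```

For example, `to_marginals(pairs, lambda u: 100 * u, lambda u: 1 if u > 0.7 else 0)` would produce a hypothetical age-like value between 0 and 100 paired with a yes/no attribute, with the strength of their association controlled by `rho` alone.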
