Privacy-Safe Data Sharing

Members and Collaborators

Overview

Networking and cybersecurity research critically needs publicly available, fresh and diverse user data for data mining and validation. Very few public sources of such data exist, because institutions that collect user data are concerned that releasing it for research would jeopardize user privacy. Anonymization and differential privacy have emerged as two approaches to privacy-safe data sharing, yet each has drawbacks that prevent wide adoption. Anonymization can be successfully attacked by leveraging auxiliary data, and differential privacy may lose too much research utility by adding large noise to heavy-tailed data. Our current research follows two directions. First, we investigate new ways to share data safely, by introducing Commoner Privacy. Second, we have developed a framework called Critter@home, which empowers volunteer users to share their network traffic data with researchers in an anonymous, privacy-safe, aggregated manner.

Commoner Privacy

Commoner Privacy is a data-processing approach that fuzzes (omits, aggregates or adds noise to) the outputs of queries run over private data. It fuzzes only those output points where an individual’s contribution to the point is an outlier. By hiding outliers, our mechanism hides the presence or absence of an individual in a dataset. We propose one mechanism that achieves commoner privacy: interactive k-anonymity. We also show that commoner privacy holds for query composition, either via presampling or via query introspection.
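As a rough illustration of fuzzing by omission, the sketch below releases a per-point count only when at least k distinct individuals contributed to that point, so a point dominated by a single user is hidden. It is a simplified, k-anonymity-style stand-in and not the mechanism from the paper; the function name fuzz_by_suppression, the (user_id, output_point) record layout and the sample port numbers are all assumptions made for this example.

```python
from collections import defaultdict


def fuzz_by_suppression(records, k):
    """Return {output_point: record_count}, omitting sparse points.

    records: iterable of (user_id, output_point) pairs (hypothetical layout).
    A point is released only if at least k distinct users contributed to it.
    """
    contributors = defaultdict(set)  # output point -> distinct users seen
    counts = defaultdict(int)        # output point -> number of records
    for user_id, point in records:
        contributors[point].add(user_id)
        counts[point] += 1
    # Omit every point whose contributor set is smaller than k.
    return {p: c for p, c in counts.items() if len(contributors[p]) >= k}


# Hypothetical query: number of flows per destination port.
records = [
    ("u1", 80), ("u2", 80), ("u3", 80),
    ("u1", 443), ("u2", 443),
    ("u3", 6667), ("u3", 6667),  # only one user contributes to this point
]
released = fuzz_by_suppression(records, k=2)
print(released)
```

With k=2, the counts for ports 80 and 443 are released, while port 6667 is omitted because only one user contributed to it. Aggregating sparse points or adding noise to them, the other two fuzzing options named above, would replace the final filtering step.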

Critter@home

Critter@home is a continuously updated archive of content-rich network data, contributed by volunteer users. Data contributors join the Critter overlay whenever online, offering their data to interested researchers. Privacy of data contributors is protected in multiple ways:

Contributors have the option of hosting their own data locally, thus retaining full control over it.
Before data is stored, it is modified via a PPI-sanitization process to replace all personal and private information (PPI).
Data is always stored and transmitted in an encrypted format.
No human apart from the contributor will ever access the raw data, even in its PPI-sanitized form. Instead, researchers access data via a query system that returns only aggregate statistics.
All contact with a contributor is at her discretion and is done via an anonymizing network where contributor identities are hidden both from researchers and the Internet at large.
Contributors (if they so desire) can have full, fine-grained control over their data at all times via policy settings.

Our work relies on a secure query framework, which uses Commoner Privacy. This framework allows only queries about aggregate features of the data, such as counts and distributions, and preserves user privacy by applying k-anonymity and l-diversity principles.
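The sketch below shows one way an aggregate-only query could apply both principles: a group's count is returned only if the group covers at least k distinct contributors (k-anonymity) and at least l distinct values of a sensitive field (l-diversity). This is an assumption-laden illustration, not Critter's actual query interface; the function aggregate_query and the field names user, protocol and dest are hypothetical.

```python
from collections import defaultdict


def aggregate_query(rows, group_key, sensitive_key, k, l):
    """Count rows per group, releasing only sufficiently mixed groups.

    rows: list of dicts with a "user" field plus the two named keys
    (hypothetical schema). A group is released only if it has at least
    k distinct users and at least l distinct sensitive values.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = {}
    for key, members in groups.items():
        users = {m["user"] for m in members}
        sensitive_values = {m[sensitive_key] for m in members}
        if len(users) >= k and len(sensitive_values) >= l:
            result[key] = len(members)  # only the aggregate leaves the system
    return result


rows = [
    {"user": "u1", "protocol": "https", "dest": "a.example"},
    {"user": "u2", "protocol": "https", "dest": "b.example"},
    {"user": "u3", "protocol": "https", "dest": "a.example"},
    {"user": "u1", "protocol": "ssh", "dest": "c.example"},
    {"user": "u2", "protocol": "ssh", "dest": "c.example"},
    {"user": "u3", "protocol": "irc", "dest": "d.example"},
]
answer = aggregate_query(rows, "protocol", "dest", k=2, l=2)
print(answer)
```

Here only the https group is released. The ssh group has two users but a single destination, so it fails the l-diversity check, and the irc group has a single user, so it fails the k-anonymity check.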

Software and Datasets

Publications

Commoner Privacy And A Study On Network Traces, Xiyue Deng and Jelena Mirkovic, In Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2017
Critter: Content-Rich Traffic Trace Repository, V. Sharma, G. Bartlett and J. Mirkovic, In Proceedings of Workshop on Information Sharing and Collaborative Security (WISCS), 2014

This material is based upon work supported by the National Science Foundation under grants #1224035 and #0914780. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.