Privacy-Safe Data Sharing


Members and Collaborators

Overview

Networking and cybersecurity research critically needs publicly available, fresh, and diverse user data for data mining and for validation. Very few public sources of such data exist because of privacy concerns: institutions that collect user data worry that releasing it for research would jeopardize user privacy. Anonymization and differential privacy have emerged as two approaches to privacy-safe data sharing, yet each has drawbacks that prevent wide adoption. Anonymization can be successfully attacked by leveraging auxiliary data, and differential privacy may lose too much research utility by adding large noise to heavy-tailed data. Our current research follows two directions. First, we investigate new ways to share data safely by introducing Commoner Privacy. Second, we have developed a framework called Critter@home, which empowers volunteer users to share their network traffic data with researchers in an anonymous, privacy-safe, aggregated manner.

Commoner Privacy

Commoner Privacy is a data-processing approach that fuzzes (omits, aggregates, or adds noise to) the outputs of queries run over private data. It fuzzes only those output points where an individual's contribution to the point is an outlier. By hiding outliers, our mechanism hides the presence or absence of an individual in a dataset. We propose one mechanism that achieves commoner privacy: interactive 𝑘-anonymity. We also show that commoner privacy holds for query composition, either via presampling or via query introspection.
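The idea of fuzzing only outlier output points can be illustrated with a minimal Python sketch. It is not the interactive 𝑘-anonymity mechanism itself. It assumes, purely for illustration, that an output point counts as an outlier when fewer than k distinct individuals contribute to it or when one individual supplies more than a chosen share of its total, and it fuzzes by omission only. The function name fuzzed_histogram, the record layout, and the parameters k and max_share are all hypothetical.

```python
from collections import defaultdict


def fuzzed_histogram(records, k=3, max_share=0.5):
    """Sum values per bin, omitting bins dominated by an outlier.

    records: iterable of (user_id, bin_key, value) tuples (assumed layout).
    k: assumed minimum number of distinct contributors per released bin.
    max_share: assumed largest fraction of a bin total one user may supply.
    """
    # Track how much each individual contributes to each output point.
    per_bin = defaultdict(lambda: defaultdict(float))
    for user_id, bin_key, value in records:
        per_bin[bin_key][user_id] += value

    released = {}
    for bin_key, contributions in per_bin.items():
        total = sum(contributions.values())
        largest = max(contributions.values())
        too_few = len(contributions) < k
        dominated = total > 0 and largest / total > max_share
        if too_few or dominated:
            continue  # fuzz by omission; aggregation or noise are alternatives
        released[bin_key] = total
    return released


records = [
    ("a", "http", 10), ("b", "http", 12), ("c", "http", 9),
    ("a", "ssh", 500), ("b", "ssh", 1), ("c", "ssh", 2),
    ("d", "dns", 5),
]
result = fuzzed_histogram(records)
print(result)
```

In this example only the "http" bin is released: "ssh" is dominated by a single user and "dns" has a single contributor, so both are omitted while the ordinary bin passes through unchanged.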

Critter@home

Critter@home is a continuously updated archive of content-rich network data, contributed by volunteer users. Data contributors join the Critter overlay whenever they are online, offering their data to interested researchers. Privacy of data contributors is protected in multiple ways:

Our work relies on a secure query framework that uses Commoner Privacy. This framework allows only queries about aggregate features of the data, such as counts and distributions, and preserves user privacy by applying k-anonymity and l-diversity principles.
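A rough Python sketch of how an aggregate-only count query might apply k-anonymity and l-diversity checks follows. It is an illustration under stated assumptions, not the Critter@home query framework: the function safe_group_counts, the dictionary row layout with a "user" field, and the thresholds k and l are hypothetical. A group's count is released only if at least k distinct users contribute to it and it contains at least l distinct values of the sensitive field.

```python
from collections import defaultdict


def safe_group_counts(rows, group_by, sensitive, k=2, l=2):
    """Return per-group row counts, suppressing groups that fail the checks.

    rows: list of dicts, each with a "user" key (assumed layout).
    group_by: field that defines the groups of the aggregate query.
    sensitive: field whose values must be diverse within a released group.
    """
    users = defaultdict(set)
    values = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        key = row[group_by]
        users[key].add(row["user"])
        values[key].add(row[sensitive])
        counts[key] += 1

    # Only aggregate counts leave this function; raw rows are never returned.
    return {
        key: counts[key]
        for key in counts
        if len(users[key]) >= k and len(values[key]) >= l
    }


rows = [
    {"user": "u1", "port": 443, "domain": "a.example"},
    {"user": "u2", "port": 443, "domain": "b.example"},
    {"user": "u3", "port": 443, "domain": "a.example"},
    {"user": "u1", "port": 22, "domain": "x.example"},
    {"user": "u1", "port": 22, "domain": "y.example"},
    {"user": "u1", "port": 80, "domain": "z.example"},
    {"user": "u2", "port": 80, "domain": "z.example"},
]
counts = safe_group_counts(rows, group_by="port", sensitive="domain")
print(counts)
```

Here port 22 is suppressed because only one user contributes to it, and port 80 is suppressed because every row shares the same sensitive value, so only the count for port 443 is returned.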

Software and Datasets

Publications


This material is based upon work supported by the National Science Foundation under grants #1224035 and #0914780. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.