Privacy: The state or condition of being free from observation (Oxford Dictionary). While this is a tidy definition, it doesn't really hold for activities on the Internet and other connected systems.
A better definition for the privacy we talk about in this class is:
The right of people to choose freely under what circumstances and to what extent they will reveal themselves, their attitude, and their behavior to others.
Privacy is a fuzzy concept because what I consider private information/data may not be considered private by you. There are many types of data generated and collected on the Internet today. Some examples of private information are:
Privacy is achieved through policy enforced via technology, law and ethics. Privacy requires security, in that information is only private if the system holding the data is secure enough to enforce the policy. On the other hand, security mechanisms often rely on authentication and accountability, which can be at odds with privacy.
Pressures on businesses and government to increase security while streamlining interactions, plus the increasing prevalence of new technologies, lead to many privacy concerns. For example, while low-dose X-ray machines streamlined security checks for the Transportation Security Administration (TSA) at airports, the machines led to protests and were eventually removed due to privacy concerns (other scanners are still in use, since they produce a more generic outline of a body and therefore raise fewer privacy concerns).
"If you're not paying for something, you're not the customer; you're the product being sold." --Andrew Lewis
Tracking Internet users in order to monitor trends, target advertising and predict behavior is big business. Companies like Google and Facebook monetize the data they collect on users.
Third-party cookies (cookies that a user receives from a party whose domain they did not visit directly) are a big privacy concern and also a big money maker. Gary Kovacs (CEO of Mozilla Corporation) discusses how prevalent the tracking of browsing behavior is on the Internet today in his TED talk "Tracking the Trackers". In his talk, he introduces "Collusion", a tool to visualize who uses third-party cookies and when you receive third-party cookies while browsing.
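As a rough illustration of the definition above, the sketch below labels a cookie as first-party or third-party by comparing the site that set it with the site of the page the user actually visited. The function names and the "last two labels of the hostname" rule are assumptions made for illustration; real browsers use the Public Suffix List to decide where one site ends and another begins.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Naive 'site' of a URL: the last two labels of its hostname.

    Assumption for illustration only. This is wrong for suffixes such as
    co.uk, which is why browsers rely on the Public Suffix List instead.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-2:])


def is_third_party(page_url, cookie_setter_url):
    """True when the cookie comes from a site the user did not visit directly."""
    return site_of(page_url) != site_of(cookie_setter_url)


# The user visits a news page; the page embeds an image from another site.
page = "https://news.example.com/story"
print(is_third_party(page, "https://static.example.com/logo.png"))  # same site
print(is_third_party(page, "https://ads.example.org/pixel.gif"))    # different site
```

A tracker embedded on many unrelated pages sees the same cookie come back from each of them, which is what lets it build a browsing profile.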
While there is nothing inherently wrong in exchanging your data for a service (e.g., telling Google what you're interested in finding out about in exchange for using their search engine), the right to privacy requires understanding and explicit consent on the user's part. Sometimes, as is the case with third-party cookies, it's unclear how much consent is given and how well the average user understands what personal information they are revealing.
Any one of these techniques may limit the usefulness of the data collected. Often there is tension between the versatility of the data collected and privacy.
Using one or more of these techniques to limit privacy concerns does not guarantee that the collected data is truly safe to release. There are many examples where anonymized and sanitized data was considered safe enough to share, but its release led to privacy breaches.
Part of the problem is that data does not exist in a vacuum. One dataset can be correlated with another dataset and new information can be inferred.
For example, the Massachusetts Group Insurance Commission released an anonymized and sanitized dataset on state employees which showed every hospital visit for each employee. Employee identities were anonymized and PII such as street addresses was removed. The ZIP code, birth date and gender for each employee were not removed, however, since the purpose of releasing the data was for researchers to study health and demographics. Latanya Sweeney, a CS grad student at the time, was able to de-anonymize this dataset by correlating it with public voting records. With this correlation she was able to identify the state governor (among others) and demonstrate that anonymized data is not necessarily safe to release. Sweeney went on to show that 87 percent of all Americans could be uniquely identified using only three pieces of information: ZIP code, birth date and gender.
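The correlation described above is essentially a join on the fields the two datasets share. Here is a minimal sketch of that kind of linkage, using made-up records; every name, field and value below is hypothetical and not taken from the actual datasets.

```python
from collections import defaultdict

# Fields left in the "anonymized" data that also appear in the public data.
QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

# Hypothetical anonymized hospital records: names removed, other fields kept.
medical = [
    {"zip": "02139", "birth_date": "1960-01-15", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1972-03-02", "gender": "M", "diagnosis": "fracture"},
    {"zip": "02140", "birth_date": "1985-11-30", "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical public voter roll: names present.
voters = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1960-01-15", "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1972-03-02", "gender": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_date": "1972-03-02", "gender": "M"},
    {"name": "Dana Example", "zip": "02141", "birth_date": "1985-11-30", "gender": "F"},
]


def key(record):
    """The combination of quasi-identifiers used to join the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(medical_records, voter_records):
    """Return (name, diagnosis) pairs where exactly one voter matches a record."""
    names_by_key = defaultdict(list)
    for voter in voter_records:
        names_by_key[key(voter)].append(voter["name"])
    reidentified = []
    for record in medical_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            reidentified.append((names[0], record["diagnosis"]))
    return reidentified


print(link(medical, voters))
```

In this toy data only the first medical record is re-identified: the second matches two voters, and the third matches none. The more people who are unique on the shared fields, the more records a join like this exposes.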
Another example is the Netflix Prize dataset: a dataset of movie ratings available for study by anyone. Netflix offered a prize to the team that most improved the accuracy of its movie recommendations using the dataset. User identities were removed from the dataset, which appeared to make it safe to release publicly. (Note that while movie ratings may seem innocuous, the movies a user rates can reveal personal things like their sexual orientation and/or political views, and such information can have real consequences if revealed.) A pair of researchers from the University of Texas combined the Netflix Prize dataset with the Internet Movie Database and showed that people could quite easily be picked out and identified from the Netflix dataset.
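To give a feel for how such a match can work, the sketch below scores each anonymized record by how many publicly known (movie, rating) pairs it contains, and accepts the best candidate only if it clearly stands out from the runner-up. This is a deliberately simplified illustration, not the researchers' actual algorithm; the data, the function names and the `margin` threshold are all hypothetical.

```python
# Hypothetical anonymized records: pseudonymous ID -> {movie: rating}.
anonymized = {
    "user_17": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "user_42": {"Movie A": 5, "Movie E": 3},
    "user_99": {"Movie B": 4, "Movie C": 4},
}


def match_score(known, candidate):
    """Count the known (movie, rating) pairs that also appear in the candidate."""
    return sum(1 for movie, rating in known.items() if candidate.get(movie) == rating)


def best_match(known, records, margin=2):
    """Return the ID of the record that stands out, or None if nothing does."""
    scores = sorted(
        ((match_score(known, ratings), uid) for uid, ratings in records.items()),
        reverse=True,
    )
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    return top_id if top_score - runner_up >= margin else None


# Ratings that a named person posted publicly on another site.
public_ratings = {"Movie A": 5, "Movie B": 2, "Movie D": 1}
print(best_match(public_ratings, anonymized))

# A single popular movie is not enough to single anyone out.
print(best_match({"Movie A": 5}, anonymized))
```

Because most people rate an unusual mix of movies, only a handful of known ratings is needed before one record stands out from all the others.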
As an aside, check out NNDB and NNDB mapper to play with some social network data.
Observation of Internet traffic can give a great deal of information about associations. For example, if a user connects to unicornsarewaywayawesome.com, the content exchanged isn't needed to infer that the user probably likes unicorns.
During transit, we can protect the privacy of application-level content by encrypting the application payload, but this does not protect the privacy of the association between the two communicating computers. Every observer along the path that a packet travels can see the source and destination of the communication.
One way to protect the privacy of such associations is onion routing. Onion routing uses layered encryption and a series of application-level routers, each of which peels off one layer of encryption and routes the packet according to the contents it reveals. Along a path, each hop knows only the previous and next hop, not the source and destination. For those of you who missed class, we discussed Tor: The Onion Router. Check out the class slides and/or MIT's How Tor Works video.
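The layering idea can be sketched in a few lines. The code below is a toy: the XOR "cipher" is not secure and only stands in for real encryption, and the router names, keys and JSON layer format are assumptions made for illustration rather than how Tor actually builds its cells or agrees on keys.

```python
import hashlib
import json


def toy_cipher(key, data):
    """Toy XOR stream cipher (NOT secure). Applying it twice with the same key decrypts."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(message, destination, path):
    """Build the onion from the inside out.

    path is a list of (router_name, key) pairs in the order the packet travels.
    """
    next_hop, payload = destination, message
    for name, key in reversed(path):
        layer = json.dumps({"next": next_hop, "data": payload.hex()}).encode()
        payload = toy_cipher(key, layer)
        next_hop = name
    return payload


def peel(key, onion):
    """What one router does: remove its own layer and learn only the next hop."""
    layer = json.loads(toy_cipher(key, onion))
    return layer["next"], bytes.fromhex(layer["data"])


# Hypothetical three-router path with keys the sender shares with each router.
path = [("R1", b"key-1"), ("R2", b"key-2"), ("R3", b"key-3")]
packet = wrap(b"hello", "example.org", path)

for name, key in path:
    next_hop, packet = peel(key, packet)
    print(name, "forwards to", next_hop)
print(packet)
```

R1 sees who sent the packet but learns only that it goes on to R2; R3 learns the destination but sees only that the packet came from R2. No single router sees both ends.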
Tor is susceptible to a number of attacks, including denial of service (DoS) by blocking the publicly known Tor routers, and timing attacks which correlate the first hop's traffic with the last hop's traffic to de-anonymize users.
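A timing attack of this kind needs no decryption at all, only packet arrival times at both ends of the path. The sketch below is one simple way to picture it, assuming an observer who can timestamp traffic entering and leaving the network: count packets per time window on each side and compare the two series with a Pearson correlation. The timestamps, window size and function names are made up for illustration.

```python
from math import sqrt


def bin_counts(timestamps, bin_size, n_bins):
    """Count how many packets fall into each time window."""
    counts = [0] * n_bins
    for t in timestamps:
        index = int(t // bin_size)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y) if var_x and var_y else 0.0


# Hypothetical packet times (seconds) seen entering the network from one user...
entry = [0.1, 0.2, 0.3, 2.1, 2.2, 4.5]
# ...and leaving the network toward two different destinations.
exit_a = [t + 0.05 for t in entry]          # same pattern, slightly delayed
exit_b = [1.1, 1.2, 3.3, 3.4, 3.5, 3.6]     # unrelated pattern

entry_counts = bin_counts(entry, 1.0, 5)
print(pearson(entry_counts, bin_counts(exit_a, 1.0, 5)))
print(pearson(entry_counts, bin_counts(exit_b, 1.0, 5)))
```

The flow whose timing pattern tracks the user's entry traffic scores close to 1, while the unrelated flow does not, which is enough to link the user to a destination even though every hop in between was encrypted.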