I heard Lotus Cafe is gone. And, Rififi may not be happy either. Ah, old timers.
-
Say you have data set you wish to release for research purposes but you don’t want the individual people identified (e.g., a medical data base) and thus tied to particular sensitive attributes (e.g., medical conditions). So, you have this data set consisting of what you consider to be sensitive attributes (e.g., medical conditions) and non-sensitive attributes (e.g., social security number, date of birth, zip code). In order to anonymize the data, you strip out the attributes that serve as blatantly unique identifiers (e.g., names, social security numbers).
Now, there is an immediate issue here. The obvious identifiers have been removed, but lets say this data set also contains attributes (e.g., date of birth, zip code) that when linked to external information lead to the identification of particular individuals (e.g., there is only one person with a given birthday in a given zip code, and this information is readily available someone trying to identify that individual in the data set).
This is where k-anonymity comes in. Now, you have already identified the sensitive and non-sensitive attributes, and you have removed the outright identifiers. From the remaining set of non-sensitive attributes, you can then identify those attributes, which are called quasi-identifiers, that could be linked to individuals through correlation against external data sources.
Here we come to one of the key assumptions made in the k-anonymizing process - the sanitizer can identify the attributes in the data set that can be tied to other, external sources. Not only do they have to identify these attributes, but they also have to assess the level of resources required to use those attributes to penetrate the anonymity of the data set, and make judgment calls on modifying the data for anonymity and/or privacy versus the usefulness of the resulting data set. This is hard stuff.
You can see many issues with this assumption. Does the sanitizer really know what will be quasi-identifiers in a data set? Even if they do, are their judgments about the risk posed by those identifiers realistic? For example, the sanitizer might not know about various public records or even google. Another example, one may assume that only large governments would have access to name, date of birth, and zip code information for individuals. However, most of us have access to this information for at least out friends and family.
Anyway, once you have picked out the quasi-identifiers, you then ensure that at least k records in the data set possess the same set of values for quasi-identifiers (e.g., generalize the date of birth to be year of birth, such that at least 10 records turn up for all years of birth and zip code combinations). In other words, instead of being able to uniquely identify a unique record using particular values for the quasi-identifiers, one always ends up identifying a set of k records with any particular values for the quasi-identifiers (e.g., you pull up 10 records at least for every year of birth and zip code combination).
Important note for the paper that follows later in this post: You may be wondering, what happens when dealing with all these sparse data sets with long-tail distributions? For example, a particular attribute may have a large number of values that are unique or at least very minimally spread across records, which could mean huge impacts on the data set if these values were generalized or removed for k-anonymity purposes. Which means there may be a big trade-off between anonymity and/or privacy, and the usefulness of the data here, which will minimize k if not render k-anonymity infeasible in order to keep the data at all useful. And, it is notable that most transactions databases fall into this category (e.g., credit card records, amazon purchases).
But, say k-anonymity is reasonable for the data without too much loss of usefulness of the data. Great, that covers establishing the exact identity of records, but there is an issue here - we may still be able to tie sensitive attributes to individuals. For example, if the sensitive attributes you are trying to unlink from an individual are applicable to all of the k records pulled up by the quasi-identifiers (e.g., all 10 records have the same medical condition), then one can assume that the individual possesses this attribute even if they can’t uniquely identify which of the k records is that individual. Or, if one knows particular sensitive values do or do not apply to the individual, they can rule out those records, perhaps pinpointing the applicable value of the sensitive attribute for the individual concerned (e.g., the medical condition in the group is either a broken arm or severe depression, and one knows the individual does not have a broken arm).
Which leads to the concept of l-diversity, wherein a set of k records should have l number of values for sensitive attributes contained within it. Now, when one pulls up that set of k records, all of those records do not have the same sensitive values and, even if I know certain sensitive values to do/do not apply to the individual in question, there is potentially still a range of records with other values for sensitive attributes that are applicable, making it hard for me to establish exactly which values for the sensitive attributes apply to a particular individual.
l-diversity can result in the data set undergoing significant modifications. Like with k-anonymity, judgment calls must be made on privacy versus the usefulness of the data set.
But, say l-diversity is applicable to the data set without too much loss of usefulness of the data. Nice, but we may still have some semantic ties or non-uniformity that can be exploited. For example, if all the k records have some related values for the sensitive values (e.g., all the medical conditions were mental problems) while the overall data set covered a larger variety of values, then information is learned about the members of k. Or, if the sensitive attributes in k records have one of set of odds of having particular values (e.g., 75% of members have cancer), while in the overall data set the sensitive attributes have some set of odds (e.g., 10% of members have cancer), this can be used to reveal information about the set of k records distinguishable for the overall data set.
Which leads to t-closeness, wherein the distribution of values of sensitive attributes in a set of k records will mimic the distribution of those values across the whole data set within a range t.
t-closeness can result in the data set undergoing even major changes. Like with k-anonymity and l-diversity, judgment calls must be made on privacy versus the usefulness of the data set.
And so forth.
…
I typed out that summary because I just noted this paper interesting. [via what is left of the cypherpunks list]
We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge.
We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
A few paragraphs of note…
Micro-data are characterized by high dimensionality and sparsity. Informally, micro-data records contain many attributes, each of which can be viewed as a dimension (an attribute can be thought of as a column in a database schema). Sparsity means that a pair of random records are located far apart in the multi-dimensional space defined by the attributes. This sparsity is empirically well-established [6, 4, 16] and related to the “fat tail” phenomenon: individual transaction and preference records tend to include statistically rare attributes.
This applies to most of those real-world databases out there containing information about us, which is something to note when policies say that information that could identify you has been removed from a data set before that data set is distributed.
Our de-anonymization algorithms are designed to work against databases that have been anonymized and “sanitized” by their publishers. The three main sanitization methods are perturbation, generalization, and suppression [23, 8]. Furthermore, the data publisher may only release a (possibly non-uniform) sample of the database. For example, he may attempt to k-anonymize the records, and then release only one record out of each cluster of k or more records.
If the database is released for collaborative filtering or similar data mining purposes (as in the case of the Netflix Prize dataset), the “error” introduced by sanitization cannot be large, otherwise its utility will be lost. We make this precise in our analysis. Our definition of privacy breach (see below) allows the adversary to identify not just his target’s record, but any record as long as it is sufficiently similar (via Sim) to the target and can thus be used to determine its attributes with high probability.
The tradeoff between privacy and/or anonymity, and usefulness comes into play, and the authors make sure to take advantage of it. The real-world is a fun place.
Moreover, the linkage between an individual and her movie viewing history has implications for her future privacy. In network security, “forward secrecy” is important: even if the attacker manages to compromise a session key, this should not help him much in compromising the keys of future sessions. Similarly, one may state the “forward privacy” property: if someone’s privacy is breached (e.g., her anonymous online records have been linked to her real identity), future privacy breaches should not become easier. Now consider a Netflix subscriber Alice whose entire movie viewing history has been revealed. Even if in the future Alice creates a brand-new virtual identity (call her Ecila), Ecila will never be able to disclose any non-trivial information about the movies that she had rated within Netflix because any such information can be traced back to her real identity via the Netflix Prize dataset. In general, once any piece of data has been linked to a person’s real identity, any association between this data and a virtual identity breaks anonymity of the latter.
Anonymity tends to be a one way street. This can be particularly dangerous when it comes to persistent pseudonyms.
We have presented a de-anonymization methodology for multi-dimensional micro-data, and demonstrated its practical applicability by showing how to de-anonymize movie viewing records released in the Netflix Prize dataset. Our de-anonymization algorithm works under very general assumptions about the distribution from which the data are drawn, and is robust to perturbation and sanitization. Therefore, we expect that it can be successfully used against any large dataset containing anonymous multi-dimensional records such as individual transactions, preferences, and so on.
An interesting topic for future research is extracting social relationships, networks and clusters from the anonymous records. This knowledge can be a source of information for further de-anonymization [13]. In the case of the Netflix Prize dataset, de-anonymization of individual records may also have interesting implications for winning the Netflix Prize. We discuss this briefly in appendix B.
Data is recorded, filtered, and mined. You, your actions are not random. There will be non-uniformity, patterns. Identities and attributes will always appear.
I tend to think pseudonymity is the best you get in practice, and even that is quite difficult.
-
So, a while back I wrote this.
Side note, this is even more interesting in combination with things like the reality it generally takes couples trying to have a baby on the average of a few months to achieve pregnancy. Such drawn out periods seems to protect against a number of attacks, such as misrepresentation and quick judgement - for example, it exposes potential mates that seem good at first glance but aren’t quite so good once a closer look is provided.
Fitting right in here, I noted this.
However, Provost and her colleagues say there is in fact no contradiction between this research and other studies, as they are investigating two different kinds of signal. The previous research investigating men’s response to fertile women focused on signals such as smells and facial expressions, which can only be detected at close range. That makes evolutionary sense, as it would benefit a woman to advertise her fertility to a man that she has decided is worth having children with and has therefore allowed to get close to her.
In contrast, men can pick up on the attractiveness of a woman’s walk from long distance, and it can therefore act as an unwitting signal to less appealing males who she might not want to choose. So the advantage of having a less sexy walk around the time of ovulation becomes clear: it allows a woman to hide her fertile period from undesirable men who might take advantage of her at that time.
-
I noticed this request.
As many of you know, NIST has specified two standard KDFs for use with key agreement algorithms (e.g., Diffie-Hellman or MQV) in NIST SP 800-56A. NIST is considering supplementing the 800-56A KDFs with a more broadly applicable KDF. In particular, NIST is considering a proposal for an HMAC-based KDF. Before committing resources to this effort, we would like to get a better handle on the requirements seen by protocol developers and evaluate the level of support for such a standard. We would also like to identify alternative designs that should be considered.
PBKDF2 (something you may already use possibly without knowing it) and S2V were pointed to in replies.
-
Look at that, Banksy in New York.
VANINA HOLASEK GALLERY·502 West 27TH Street New York, NY 10001 T: 212-367-9093
FOR IMMEDIATE RELEASE:
BANKSY DOES NEW YORK
DECEMBER 2ND – DECEMBER 29TH, 2007
OPENING RECEPTION: SUNDAY, DECEMBER 2ND, 1 PM -5 PM
BANKROBBER GALLERY, London, in collaboration with VANINA HOLASEK GALLERY, are pleased to present for the first time in New York, an exhibition of works by Banksy, on view from December 2nd through December 29th, 2007.
There may even be a comment on security in here.
He’s the maniac who got on the news for managing to smuggle one of his pieces of art into Tate Britain and embarrassed everyone because nobody seemed to notice…He’s the wit behind the stencilled “Mind the Crap” writing that appeared overnight on the steps to Tate Modern. He is the prankster who smuggled 500 alternative copies of the Paris Hilton CD into record stores. He is the subversive who placed a life-size replica of a Guantanamo Bay detainee in Disneyland. He’s the jester who gave LA a painted elephant. He is the trickster whose hoax cave painting of a man pushing a supermarket trolley sat in the British Museum unnoticed for three days. He is the infiltrator who disguised as a pensioner hung his perfectly framed pieces in the Metropolitan, MOMA, Brooklyn Museum and his “dead beetle with glued on sidewinder missiles and satellite dish” had pride of place in the Museum of Natural History NYC. Get the picture, get this. Banksy images are even being used to sell 900k condos in Williamsburg.
Anyway, there is a party Saturday (2007-12-01) night from 6-9pm at the gallery celebrating the opening. [via thisheartsonfire]
-
Finally, here are some generic odds.
The National Safety Council has been compiling and reporting on injury data every year since the 1920s. The table below was prepared in response to frequent inquiries to the Council concerning the odds of dying from or being killed by a specific incident or occurrence such as a lightning strike or a plane crash.
The odds given below are statistical averages over the whole U.S. population and do not necessarily reflect the chances of death for a particular person from a particular external cause. Any individual’s odds of dying from various external causes are affected by the activities in which they participate, where they live and drive, what kind of work they do, and other factors.