Archive for November, 2007

data sets, walks, kdfs, banksy, odds

Wednesday, November 28th, 2007

I heard Lotus Cafe is gone. And, Rififi may not be happy either. Ah, old timers.

-

Say you have data set you wish to release for research purposes but you don’t want the individual people identified (e.g., a medical data base) and thus tied to particular sensitive attributes (e.g., medical conditions). So, you have this data set consisting of what you consider to be sensitive attributes (e.g., medical conditions) and non-sensitive attributes (e.g., social security number, date of birth, zip code). In order to anonymize the data, you strip out the attributes that serve as blatantly unique identifiers (e.g., names, social security numbers).

Now, there is an immediate issue here. The obvious identifiers have been removed, but lets say this data set also contains attributes (e.g., date of birth, zip code) that when linked to external information lead to the identification of particular individuals (e.g., there is only one person with a given birthday in a given zip code, and this information is readily available someone trying to identify that individual in the data set).

This is where k-anonymity comes in. Now, you have already identified the sensitive and non-sensitive attributes, and you have removed the outright identifiers. From the remaining set of non-sensitive attributes, you can then identify those attributes, which are called quasi-identifiers, that could be linked to individuals through correlation against external data sources.

Here we come to one of the key assumptions made in the k-anonymizing process – the sanitizer can identify the attributes in the data set that can be tied to other, external sources. Not only do they have to identify these attributes, but they also have to assess the level of resources required to use those attributes to penetrate the anonymity of the data set, and make judgment calls on modifying the data for anonymity and/or privacy versus the usefulness of the resulting data set. This is hard stuff.

You can see many issues with this assumption. Does the sanitizer really know what will be quasi-identifiers in a data set? Even if they do, are their judgments about the risk posed by those identifiers realistic? For example, the sanitizer might not know about various public records or even google. Another example, one may assume that only large governments would have access to name, date of birth, and zip code information for individuals. However, most of us have access to this information for at least out friends and family.

Anyway, once you have picked out the quasi-identifiers, you then ensure that at least k records in the data set possess the same set of values for quasi-identifiers (e.g., generalize the date of birth to be year of birth, such that at least 10 records turn up for all years of birth and zip code combinations). In other words, instead of being able to uniquely identify a unique record using particular values for the quasi-identifiers, one always ends up identifying a set of k records with any particular values for the quasi-identifiers (e.g., you pull up 10 records at least for every year of birth and zip code combination).

Important note for the paper that follows later in this post: You may be wondering, what happens when dealing with all these sparse data sets with long-tail distributions? For example, a particular attribute may have a large number of values that are unique or at least very minimally spread across records, which could mean huge impacts on the data set if these values were generalized or removed for k-anonymity purposes. Which means there may be a big trade-off between anonymity and/or privacy, and the usefulness of the data here, which will minimize k if not render k-anonymity infeasible in order to keep the data at all useful. And, it is notable that most transactions databases fall into this category (e.g., credit card records, amazon purchases).

But, say k-anonymity is reasonable for the data without too much loss of usefulness of the data. Great, that covers establishing the exact identity of records, but there is an issue here – we may still be able to tie sensitive attributes to individuals. For example, if the sensitive attributes you are trying to unlink from an individual are applicable to all of the k records pulled up by the quasi-identifiers (e.g., all 10 records have the same medical condition), then one can assume that the individual possesses this attribute even if they can’t uniquely identify which of the k records is that individual. Or, if one knows particular sensitive values do or do not apply to the individual, they can rule out those records, perhaps pinpointing the applicable value of the sensitive attribute for the individual concerned (e.g., the medical condition in the group is either a broken arm or severe depression, and one knows the individual does not have a broken arm).

Which leads to the concept of l-diversity, wherein a set of k records should have l number of values for sensitive attributes contained within it. Now, when one pulls up that set of k records, all of those records do not have the same sensitive values and, even if I know certain sensitive values to do/do not apply to the individual in question, there is potentially still a range of records with other values for sensitive attributes that are applicable, making it hard for me to establish exactly which values for the sensitive attributes apply to a particular individual.

l-diversity can result in the data set undergoing significant modifications. Like with k-anonymity, judgment calls must be made on privacy versus the usefulness of the data set.

But, say l-diversity is applicable to the data set without too much loss of usefulness of the data. Nice, but we may still have some semantic ties or non-uniformity that can be exploited. For example, if all the k records have some related values for the sensitive values (e.g., all the medical conditions were mental problems) while the overall data set covered a larger variety of values, then information is learned about the members of k. Or, if the sensitive attributes in k records have one of set of odds of having particular values (e.g., 75% of members have cancer), while in the overall data set the sensitive attributes have some set of odds (e.g., 10% of members have cancer), this can be used to reveal information about the set of k records distinguishable for the overall data set.

Which leads to t-closeness, wherein the distribution of values of sensitive attributes in a set of k records will mimic the distribution of those values across the whole data set within a range t.

t-closeness can result in the data set undergoing even major changes. Like with k-anonymity and l-diversity, judgment calls must be made on privacy versus the usefulness of the data set.

And so forth.

I typed out that summary because I just noted this paper interesting. [via what is left of the cypherpunks list]

We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge.
We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.

A few paragraphs of note…

Micro-data are characterized by high dimensionality and sparsity. Informally, micro-data records contain many attributes, each of which can be viewed as a dimension (an attribute can be thought of as a column in a database schema). Sparsity means that a pair of random records are located far apart in the multi-dimensional space defined by the attributes. This sparsity is empirically well-established [6, 4, 16] and related to the “fat tail” phenomenon: individual transaction and preference records tend to include statistically rare attributes.

This applies to most of those real-world databases out there containing information about us, which is something to note when policies say that information that could identify you has been removed from a data set before that data set is distributed.

Our de-anonymization algorithms are designed to work against databases that have been anonymized and “sanitized” by their publishers. The three main sanitization methods are perturbation, generalization, and suppression [23, 8]. Furthermore, the data publisher may only release a (possibly non-uniform) sample of the database. For example, he may attempt to k-anonymize the records, and then release only one record out of each cluster of k or more records.
If the database is released for collaborative filtering or similar data mining purposes (as in the case of the Netflix Prize dataset), the “error” introduced by sanitization cannot be large, otherwise its utility will be lost. We make this precise in our analysis. Our definition of privacy breach (see below) allows the adversary to identify not just his target’s record, but any record as long as it is sufficiently similar (via Sim) to the target and can thus be used to determine its attributes with high probability.

The tradeoff between privacy and/or anonymity, and usefulness comes into play, and the authors make sure to take advantage of it. The real-world is a fun place.

Moreover, the linkage between an individual and her movie viewing history has implications for her future privacy. In network security, “forward secrecy” is important: even if the attacker manages to compromise a session key, this should not help him much in compromising the keys of future sessions. Similarly, one may state the “forward privacy” property: if someone’s privacy is breached (e.g., her anonymous online records have been linked to her real identity), future privacy breaches should not become easier. Now consider a Netflix subscriber Alice whose entire movie viewing history has been revealed. Even if in the future Alice creates a brand-new virtual identity (call her Ecila), Ecila will never be able to disclose any non-trivial information about the movies that she had rated within Netflix because any such information can be traced back to her real identity via the Netflix Prize dataset. In general, once any piece of data has been linked to a person’s real identity, any association between this data and a virtual identity breaks anonymity of the latter.

Anonymity tends to be a one way street. This can be particularly dangerous when it comes to persistent pseudonyms.

We have presented a de-anonymization methodology for multi-dimensional micro-data, and demonstrated its practical applicability by showing how to de-anonymize movie viewing records released in the Netflix Prize dataset. Our de-anonymization algorithm works under very general assumptions about the distribution from which the data are drawn, and is robust to perturbation and sanitization. Therefore, we expect that it can be successfully used against any large dataset containing anonymous multi-dimensional records such as individual transactions, preferences, and so on.
An interesting topic for future research is extracting social relationships, networks and clusters from the anonymous records. This knowledge can be a source of information for further de-anonymization [13]. In the case of the Netflix Prize dataset, de-anonymization of individual records may also have interesting implications for winning the Netflix Prize. We discuss this briefly in appendix B.

Data is recorded, filtered, and mined. You, your actions are not random. There will be non-uniformity, patterns. Identities and attributes will always appear.

I tend to think pseudonymity is the best you get in practice, and even that is quite difficult.

-

So, a while back I wrote this.

Side note, this is even more interesting in combination with things like the reality it generally takes couples trying to have a baby on the average of a few months to achieve pregnancy. Such drawn out periods seems to protect against a number of attacks, such as misrepresentation and quick judgement – for example, it exposes potential mates that seem good at first glance but aren’t quite so good once a closer look is provided.

Fitting right in here, I noted this.

However, Provost and her colleagues say there is in fact no contradiction between this research and other studies, as they are investigating two different kinds of signal. The previous research investigating men’s response to fertile women focused on signals such as smells and facial expressions, which can only be detected at close range. That makes evolutionary sense, as it would benefit a woman to advertise her fertility to a man that she has decided is worth having children with and has therefore allowed to get close to her.

In contrast, men can pick up on the attractiveness of a woman’s walk from long distance, and it can therefore act as an unwitting signal to less appealing males who she might not want to choose. So the advantage of having a less sexy walk around the time of ovulation becomes clear: it allows a woman to hide her fertile period from undesirable men who might take advantage of her at that time.

-

I noticed this request.

As many of you know, NIST has specified two standard KDFs for use with key agreement algorithms (e.g., Diffie-Hellman or MQV) in NIST SP 800-56A. NIST is considering supplementing the 800-56A KDFs with a more broadly applicable KDF. In particular, NIST is considering a proposal for an HMAC-based KDF. Before committing resources to this effort, we would like to get a better handle on the requirements seen by protocol developers and evaluate the level of support for such a standard. We would also like to identify alternative designs that should be considered.

PBKDF2 (something you may already use possibly without knowing it) and S2V were pointed to in replies.

-

Look at that, Banksy in New York.

VANINA HOLASEK GALLERY·502 West 27TH Street New York, NY 10001 T: 212-367-9093

FOR IMMEDIATE RELEASE:
BANKSY DOES NEW YORK
DECEMBER 2ND – DECEMBER 29TH, 2007
OPENING RECEPTION: SUNDAY, DECEMBER 2ND, 1 PM -5 PM

BANKROBBER GALLERY, London, in collaboration with VANINA HOLASEK GALLERY, are pleased to present for the first time in New York, an exhibition of works by Banksy, on view from December 2nd through December 29th, 2007.

There may even be a comment on security in here.

He’s the maniac who got on the news for managing to smuggle one of his pieces of art into Tate Britain and embarrassed everyone because nobody seemed to notice…He’s the wit behind the stencilled “Mind the Crap” writing that appeared overnight on the steps to Tate Modern. He is the prankster who smuggled 500 alternative copies of the Paris Hilton CD into record stores. He is the subversive who placed a life-size replica of a Guantanamo Bay detainee in Disneyland. He’s the jester who gave LA a painted elephant. He is the trickster whose hoax cave painting of a man pushing a supermarket trolley sat in the British Museum unnoticed for three days. He is the infiltrator who disguised as a pensioner hung his perfectly framed pieces in the Metropolitan, MOMA, Brooklyn Museum and his “dead beetle with glued on sidewinder missiles and satellite dish” had pride of place in the Museum of Natural History NYC. Get the picture, get this. Banksy images are even being used to sell 900k condos in Williamsburg.

Anyway, there is a party Saturday (2007-12-01) night from 6-9pm at the gallery celebrating the opening. [via thisheartsonfire]

-

Finally, here are some generic odds.

The National Safety Council has been compiling and reporting on injury data every year since the 1920s. The table below was prepared in response to frequent inquiries to the Council concerning the odds of dying from or being killed by a specific incident or occurrence such as a lightning strike or a plane crash.

The odds given below are statistical averages over the whole U.S. population and do not necessarily reflect the chances of death for a particular person from a particular external cause. Any individual’s odds of dying from various external causes are affected by the activities in which they participate, where they live and drive, what kind of work they do, and other factors.

Quick notes on Ubuntu Gutsy with Xen and djbdns, phishing lesson, misc

Thursday, November 8th, 2007

Just when you started thinking I was beginning to clean up my messy posting habits, I went and did this…

-

So, I decided to migrate my Tor server, etc. and thought it would be nice to upgrade to Ubuntu Gutsy in the process. (I also took this as an opportunity to setup disk encryption, which was quick and easy.)

As part of the effort, I rebuilt the Xen guest domains I was using on this server. This part turned out to have some quirkiness, as my Xen dom-0 and dom-Us running Ubuntu Gutsy (7.10) would not place nice under the Xen 3.1 that was installed from the latest Ubuntu Gutsy Xen packages (2.6.22-14-xen kernel). By not place nice, I mean the guests would hang during the boot process and/or not provide a usable console.

So, for my reference and yours, I figured it good to point out where I found fixes/workarounds for these issues with Xen 3.1 (2.6.22-14-xen kernel) and Ubuntu Gutsy (7.10) (used for both the host and guests) – this link is where I found some guidance to help fix the issue.

In particular, I found this, which led me to copy “etc/event.d/tty1″ to “etc/event.d/xvc0″ and then replace all occurences of “tty1″ with “xvc0″ within “etc/event.d/xvc0″, useful and it worked across a couple of running dom-U’s. Alternatively, this seems like it might be a workaround, although I did not use it myself.

I found this, which led me to remove offending “hwclock” entries, took care of some hang time.

Also, I did this, which led me to replace “sda” with “xvda” in my guest’s Xen configuration and its “etc/fstab”, just to continue following the general direction Xen is going, although it did not fix any issues.

-

I decided to use djbdns on a dns cache server. As I was setting this up in one of my newly create Xen virtual machines, I found these instructions useful (minus the small part about tinydns, as I wanted a dns cache service – dnscache was my concern, e.g., dnscache); however, I did note one issue on my Ubuntu Gutsy install with regards to the contents of “etc/event.d/svscan” conveyed in those instructions – the use of “runlevel-” was incorrect.

In other words, under Ubuntu Gutsy (7.10), the “etc/event.d/svscan” contents should be something like what follows.

# svscan – daemontools — http://www.froyn.net/blosxom/blosxom.cgi/2007/1/12
#

# This service maintains an svscan process from the point the system is
# started until it is shut down again.

start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5

respawn
exec /command/svscanboot

-

I briefly noted this paper on training user’s about phishing.

Our embedded training system works roughly as follows.
People are periodically sent training emails, perhaps from
their system administrator or from a training company.
These training emails look just like phishing emails, urging
people to go to some website and log in. If people fall for
the training email and click on a link in that email, we
provide an intervention that explains that they are at risk for
phishing attacks and gives some tips for protecting
themselves.

Any manager that has ever had to train (or discipline) an employee can probably relate to some of these lessons learned.

· Embed the training into users’ regular activities so they do not have to go to a separate website to learn about phishing attacks.
· Make it clear why users are being warned—for example, what the risks are and what caused the warning.
· Do not delay the warnings; present them immediately after the user clicks on the link.
· Use training messages with the same content that users have just seen, as this helps them concretely relate to what is being discussed in the training message.
· Supplement training text with story-based graphics and annotations.
· Keep the training messages simple and short. One reason the security notices did not work well was too much text.
· Give clear actionable items that participants can easily do to protect themselves.

The “Embed the training into users’ regular activities” reminded me of what I discussed here in quite some length.

Now, lets look at a much better example. In preparation for security training, you have someone sit outside your organization auditing whether people were taking off their ID badges before leaving work, as was mandatory. As part of this audit, the auditor photographs the ID badge of someone leaving the offices still still wearing their ID badge. The next day, that person comes to work to find you sitting at their desk wearing an ID badge with their name but your face. Now, that would make an impact.

However, the actual implementation used in the paper seemed to lack the impact of my example, in that, the people in the study were asked to play someone else, which separated those people from the actions being taken and the consequences of those actions. There was no real world experience here to drive the lessons home.

Oh, and here is our old phishing lesson.

And, what techniques did we employ?

  • Wariness – Email should not be trusted by default. Examine email messages closely, especially if they request sensitive information, contain attachments, or provide links, and be sure to establish trust before performing any actions requested by the messages. This is just how I think, and it helps in avoiding scams.
  • Research – When in doubt, do some research. (An easy way to do this for messages claiming to be from a company or person you deal with is just call the company or person.) By looking at the raw message, I could see weird characteristics in the headers and the message body that indicated this was a fraud. I then used information taken from the headers and message body to identify proper abuse contacts at the relevant ISPs.
  • And the easiest one…

  • Multiple email accounts with dedicated purposes – By having specialized email accounts that are used only for certain purposes, many scam messages can be quickly identified just by being out of place. (You can also track the dissemination of your email addresses quite well by doing this.) That is how I knew this message was a scam before I even read it its contents.

-

Finally, UAVs have been getting a little attention in my circles for quite a while. So, this article might interest some of you.

Having evolved from military use, drones, or unmanned aerial vehicles (UAVs), are taking to the air in increasing numbers for public-service and civilian roles. They are being operated by groups as diverse as police, surveyors and archaeologists. A UAV helped firemen track the blaze that recently ravaged southern California. [...]

Reminded me of some of the things we did with model airplanes back in the day.

Road to next SHA continues – Contest announcement

Monday, November 5th, 2007

So, the search for SHA-3 continues with the publishing of the official rules and requests for submissions for the SHA-3 competition.

NIST has opened a public competition to develop a new cryptographic hash algorithm, which converts a variable length message into a short “message digest” that can be used for digital signatures, message authentication and other applications.  The competition is NIST’s response to recent advances in the cryptanalysis of hash functions. The new hash algorithm will be called “SHA-3” and will augment the hash algorithms currently specified in FIPS 180-2, Secure Hash Standard. Entries for the competition must be received by October 31, 2008. The competition is announced in the Federal Register Notice published on November 2, 2007; further details of the competition will be available at the specific sites indicated in the menu on the left.

The timeline can now be found here.

Tentative Timeline (subject to change) for the Hash Algorithm Competition: [pdf version of timeline]

FIPS 180-2 (the Secure Hash Standard) is scheduled for a review in 2007 and again in 2012. A reasonable goal is to complete the hash function competition process, and publish the augmented and revised Hash Function Standard by 2012.

As I said before,

And so begins the long and winding road to the next SHA. This will be fun.

(Politics aside, I thought the AES process was quite effective and necessarily open, a big change from the government’s view of crypto in the past. I am glad to see them following a similar path for the next hashing algorithms.)