Over the past couple of weeks there has been a little website making the rounds. On the site the user is asked just 25 questions — each question about word choice. It’s a multiple choice quiz for U.S. residents that seeks to identify where your language most likely comes from — and the questions are innocuous enough: “What do you call that strip of grass between the street and the sidewalk?”
Nothing about the quiz makes you think the results are actually going to be accurate. And yet, the site is stunningly accurate. It reminds me of yet another quiz: a little electronic toy that plays the 20 questions game with you. All you need to do is think of something (hell, tell the crowd in the room what you are thinking of, the toy can’t hear) and then truthfully answer its questions. I’ve only seen it fail to get the answer once, and that’s because the answer was an esoteric Japanese tradition.
Both of these tools are amazing bits of engineering, but they also both show how powerful computation can be. If you have a large enough database to query, you only need so many search parameters before you get the answer you were looking for. It seems logical, but in practice it feels a bit magical.
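To put a rough number on "only so many search parameters": each yes/no question can, at best, cut the remaining possibilities in half, so 20 perfect questions can tell apart about a million things. Here is a minimal sketch of that arithmetic. The function name `questions_needed` and the population figures are my own assumptions for illustration, and real questions are never perfectly even splits, so treat these as best-case lower bounds.

```python
def questions_needed(candidates: int, answers_per_question: int = 2) -> int:
    """Best-case number of questions needed to single out one of `candidates` items.

    Assumes every question splits the remaining items evenly among its
    possible answers, which real quizzes only approximate.
    """
    if candidates < 1 or answers_per_question < 2:
        raise ValueError("need at least 1 candidate and 2 answers per question")
    count = 0
    distinguishable = 1  # how many items we can tell apart with `count` questions
    while distinguishable < candidates:
        distinguishable *= answers_per_question
        count += 1
    return count


# Rough, assumed population figures, used only for scale.
print(questions_needed(1_000_000))      # about a million possible answers
print(questions_needed(330_000_000))    # roughly the U.S. population
print(questions_needed(8_000_000_000))  # roughly everyone on Earth
```

In the best case that is 20, 29, and 33 yes/no questions respectively, which is why a short quiz can feel like mind reading.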
So, if a website can narrow down where you likely live, or grew up, by only asking 25 questions about your word choices — then I think you have to seriously wonder how close someone can get to actually identifying you if they are given the “anonymized” data that Google holds on users.
The question: can we truly anonymize data?
It seems like it would be a trivial task for Google/NSA to go from an anonymous user ID to ‘Ben Brooks’ if they properly mined my data — and if we can accept that as a given (I think it is hard not to believe that is possible), then the question really becomes: is assigning a user ID truly a means of making something anonymous?
I’m not sure it is.
Let’s just take what Google knows about you and strip it down to the bare minimum data that I (not having any targeted advertising knowledge) would guess a marketer might want to know to better target their ads at me:
- Keywords from email
- Keywords from Social Posting
- Keywords from Searches
- Sexual Orientation (gleaned from correspondence and searches)
- Marital Status
That’s a fairly intrusive list, but I highly doubt it is exhaustive of all the data points Google tracks for every user they have, and each item seems innocuous when looked at point by point. I’d wager that given that data set you could match my data with my name, and I don’t think it would take long or be hard to do, even if the data is only shown to belong to user #110923849108234098.
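A toy sketch of what that matching might look like: start with a pile of records that carry only a user ID, then filter by the things you already know about a person and watch the candidate pool shrink. Every record, field name, and value below is invented for illustration. This says nothing about how Google or anyone else actually stores or processes data; it only shows the narrowing idea.

```python
# Invented, "anonymized" records: no names, just user IDs and attributes.
records = [
    {"user_id": "u1", "marital_status": "married", "search_keyword": "hiking", "email_keyword": "mortgage"},
    {"user_id": "u2", "marital_status": "married", "search_keyword": "hiking", "email_keyword": "tuition"},
    {"user_id": "u3", "marital_status": "single", "search_keyword": "hiking", "email_keyword": "mortgage"},
    {"user_id": "u4", "marital_status": "married", "search_keyword": "sailing", "email_keyword": "mortgage"},
]


def narrowing_trace(records, known):
    """Apply known attributes one at a time and count the candidates left after each."""
    remaining = list(records)
    trace = []
    for key, value in known.items():
        remaining = [r for r in remaining if r.get(key) == value]
        trace.append((key, len(remaining)))
    return remaining, trace


# What an outside observer might already know about one specific person (also invented).
known = {"marital_status": "married", "search_keyword": "hiking", "email_keyword": "mortgage"}

remaining, trace = narrowing_trace(records, known)
for key, count in trace:
    print(f"after {key}: {count} candidate(s) left")
print([r["user_id"] for r in remaining])
```

With this made-up data, three ordinary-looking facts take the pool from four users to one. The user ID never had to be a name for that to work.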
Again, a $20 toy can “read your mind” asking only 20 questions. A website can effectively know where you developed your language patterns from asking just 25 questions.
How hard do you really think it would be for the computing power of any large company to reverse all the thousands (millions?) of data points they have and find you?
That’s part of the problem with data collection: no matter what, it’s not a truly anonymous data set, just a slightly less identifiable one. You are effectively throwing a blanket over an object you want to hide instead of actually hiding the object. You can still see the size and shape, so educated guesses are fairly easy.