Throwaway account. My company created an analytics product around the ability to track which sites your visitors have visited. It used a different and (at the time) more reliable technique.
About a year after the product launch we were contacted by a powerful Washington-based lobby group, and they wanted to chat. They felt it violated a site visitor's "reasonable expectation of privacy". I agreed. So we pulled the feature and dodged a bullet, as this "browser bug" hit the mainstream press a few months later. The feature wasn't a major part of our product's value prop; few of our customers used it and none missed it.
So if you're thinking about basing a startup on this, don't. You will get a call very quickly from organizations much larger than you are asking awkward questions.
I'm sure it's the trick where you make a bunch of links to various sites, set a :visited color, and then interrogate all of the links with javascript to see what color they are. It's no longer possible on most browsers.
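For reference, a minimal sketch of that classic `:visited` sniff. The sentinel color and the split between a testable comparison function and the browser-only DOM part are my own choices; modern browsers defeat this by reporting the unvisited color from `getComputedStyle` regardless of history.

```javascript
// Classic :visited color sniff (now blocked by modern browsers).
// Assumes a stylesheet rule like `a:visited { color: red }`.
const VISITED_COLOR = "rgb(255, 0, 0)"; // illustrative sentinel color

function looksVisited(computedColor) {
  // Pure comparison, kept separate so the decision logic is testable.
  return computedColor === VISITED_COLOR;
}

function sniffHistory(urls) {
  // Browser-only part: create links and read back their rendered color.
  return urls.filter((url) => {
    const a = document.createElement("a");
    a.href = url;
    document.body.appendChild(a);
    const color = getComputedStyle(a).color;
    document.body.removeChild(a);
    return looksVisited(color);
  });
}
```

In a browser that hadn't patched this, `sniffHistory(["https://news.ycombinator.com", ...])` would return the subset of URLs in the user's history.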
I get what you're saying, but if the technique really worked, I'd expect it to list the sites I frequently visit on the first run and score 100% on the second run, but it only managed about 80% of that.
Same here. Funny thing is that all the correct matches were for sites actually open in another tab, and it even missed Facebook, which was also open in some tab.
Same here. Although I had visited multiple listed sites in the very session this browser has open, the tool only reported one as visited, and gave a "Whoops" for Google.
It should have reported at least 6 more!
Note that it doesn't need to be 100% accurate to be effective. If it guesses better than 50% (i.e. coin flip), then it could be used to give guesses with at least some confidence. No different than analyzing any other noisy dataset. Because this all works client-side, it can also be done quite invisibly.
Right, but those applications will be for things like online advertising, which is already tracking your visits and/or just assuming you use the popular sites anyway. Can this be used to violate privacy in a meaningful manner if it misses 50% or more of the time? You'd have complete plausible deniability if you were pinned to have accessed some site you don't want people to know you accessed. It can't be used as evidence in anything. What would the threat look like?
I oppose user tracking not necessarily because there's something to hide, but because a sense of privacy is a fundamental component of a sense of independence, and for me, independence is one of the critical components of happiness.
If the question is "Has this user visited site X?" then I'd hope that any kind of modelling is better than 50%, as simulating a coin toss would be at least as good.
I was under the impression that attempts had been made to hide any effects of :visited styles on the accessible DOM to stop this from working. There's a particularly good article on Mozilla's attempts [1], and the relevant bug on Bugzilla. [2]
Apparently the key to people not knowing where you visited is to use IE. It missed sites like Twitter and Facebook that are open for me all the time. It did get one site correct, HN ;<).
What would be a possible use of this attack? I can't think of anything useful you'd do with knowing that you've visited Facebook. And so many people use sites like Facebook you might get a better success rate just always returning "visited" rather than measuring this way!
Maybe you could use it to only show those social sharing widgets that are for services the visitor actually uses. Though I guess WebIntents will eventually be a better way to handle that.
Ad retargeting without going through one of the high-reach ad networks. Or any product site instantly knowing which competitors you've researched, and tailoring their pricing to that.
Again, this seems to be inaccurate for a large number of people. Can we take these two attempts as evidence that it is hard for malicious websites to discern our browser history?
It said I hadn't visited any of the sites, except HN (which is easy to guess, since news.ycombinator.com will be in the HTTP_REFERER field...).
If I re-run the test it still gets some sites wrong (says I haven't visited them when in fact I have). It even claims I haven't visited Amazon both times when in fact it's open in another tab.
I just tried it twice, once on a public wi-fi network. And then again when I got home. It worked very well on the public wifi, and had many false positives at home.
It seems to work better on slower internet connections. The script calls a site "visited" if the response time of the potentially cached image is less than 1/20 the time of the certainly uncached image.
On slow connections the cached fetch is much faster than the uncached one; on fast connections it's only slightly faster. However, the known-uncached images sometimes show a "10x increase in latency", so based on my (and others') experience, this is a major problem.
One could attempt to normalize this for the sites where appending a random query string causes higher latency. Simply precalculate the added latency from images with the random query string on a per-site basis, then subtract it from "uncachedTime."
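The timing test and the normalization step described above could be sketched roughly like this. The 1/20 ratio is taken from the comments; the helper names and the `queryStringPenalty` parameter are illustrative assumptions, not the actual script.

```javascript
// Sketch of the cache-timing test described above. Timings come from
// loading the same image with and without a cache-busting query string.
const RATIO = 1 / 20; // call it "visited" if < 1/20 of the uncached time

function timeImage(url) {
  // Browser-only: resolve with the load time of `url` in milliseconds.
  return new Promise((resolve) => {
    const img = new Image();
    const start = performance.now();
    img.onload = img.onerror = () => resolve(performance.now() - start);
    img.src = url;
  });
}

function classify(cachedTime, uncachedTime, queryStringPenalty = 0) {
  // Subtract the extra latency some sites add for unknown query strings
  // (the per-site precalculated penalty) before applying the ratio test.
  const normalized = uncachedTime - queryStringPenalty;
  return cachedTime < normalized * RATIO ? "visited" : "not visited";
}

async function testSite(imageUrl) {
  const uncached = await timeImage(imageUrl + "?r=" + Math.random());
  const maybeCached = await timeImage(imageUrl);
  return classify(maybeCached, uncached);
}
```

With the penalty applied, a site whose random-query fetches are artificially slow no longer inflates `uncachedTime` and produces a false "visited".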
Doesn't appear to guess correctly in Chrome 15 on OS X (10.7.2). I'm not sure exactly what the 'whoops' means for google - but I've obviously visited HN and have visited a few of the others as well.
Really interesting concept. This one wasn't as accurate for me as the original Firefox-specific proof of concept, though. It only picked up on YouTube and Wikipedia. What's with the "whoops" on Google?
I do use NoScript and Ghostery, though, and I could see how that might cause some false negatives.
When the script is run a second time, it shows every site as visited. After the first run, I guess everything is cached and it can no longer tell a hit from a miss :)
Running in Chrome's incognito mode is a bit different, though: only 7 show up as cached the first time it's run.
Same results here, and I'm not running any of the things you mentioned; also been to many of those sites, within the past few days in many cases. I'm basically on stock Chrome (though I do have Adblock).
In my case, the ones it got wrong were the images that returned a 304 (not changed) header since they returned significantly faster than fetching the full image.
It said no to sites I had visited. Unless that is what it was programmed to do, I'd say it did not work. You can message me for any other info about the test.
From the 5 sites I visited, it correctly flagged HN, WP and YT as visited, and gave a "whoops" for FB and Google (what does that mean?), which I both visited.
Yet another reason this test is useless. If site A uses it, it may get partially correct data, but when you browse to site B, it will return 100% positive, most of these being false positives.
I just don't see any practical application for this method with such high error rates. The methods mentioned above are only valuable if you can guarantee at least relative reliability. By and large the results have been seemingly random, with only one or two people reporting 100% correctness. So what's the difference between running a test with wildly unreliable results and just guessing randomly?
Even so, even without doing any work to ameliorate these flaws, it could still be (ab)used. Don't assume that it's only useful if everyone can scan which of the top 100 websites you've visited.
Any site could use this to check which competitors' sites have been visited. It's unlikely anyone else has an interest in checking that information, so the cache is not going to be poisoned by anyone else. With knowledge of which competitors a potential customer has checked out, you could do some effective price discrimination -- the guy looking at the $10 solutions sees your lowest price, while the guy looking at some competing Microsoft Dynamics package enters a more enterprisey sales funnel.
It's also useful for retargeting. Throw the code up on an ad network and you only test for cache hits against domains of current advertisers. If there's a hit, store it in a cookie so you don't need to check the (now filled) cache again. You can now show ads for companies a person has already had an interaction with, without having to cookie every visitor to the advertisers' sites first.
It doesn't take much to come up with (mostly nefarious) uses for this, even without perfect accuracy and even without the ability to have multiple parties check the same URLs.
It also doesn't take much to come up with ways to improve the process. You can ameliorate the problem of overlapping testers by having a large pool of URLs from each site to check. The average top 1000 site probably has dozens and dozens of images and other resources per page, each of which can be used for a cache test.
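That pooling idea could look something like the sketch below: probe several resources per site and take a majority vote, so one polluted or mistimed cache entry doesn't flip the verdict. The `probe` function (the single-resource cache test) is left abstract here; this is an assumed shape, not anyone's actual implementation.

```javascript
// Majority vote over several cache probes for one site.
// `probe(url)` should resolve to true if the resource looked cached.
async function siteVisited(resourceUrls, probe) {
  const results = await Promise.all(resourceUrls.map(probe));
  const hits = results.filter(Boolean).length;
  return hits > resourceUrls.length / 2;
}
```

With dozens of candidate resources per site, each tester can also draw a different random subset, which reduces the overlap problem between multiple parties probing the same cache.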
First of all, I was explaining my experience.
Second, I know it is a proof of concept. Still, I don't think it really proves the concept, since the results are not reliable at all. A proof of concept needs to produce reliable results; otherwise it hasn't proved anything.
Not even close. Instead of measuring load times, you can create "<a>" elements and verify that their rendered color is the one you defined for visited links. It's a trick from the old days...
As mentioned elsewhere in this thread, that loophole has been closed by all modern browsers. Not to say there aren't other ways to get at that information, but it's not as simple as checking the color of a link anymore.