I had a client once who had something similar, although unintentionally. She approached me because her website "kept getting hacked" and she didn't trust the original developers to solve the security problems... And rightly so!
There were two factors that, together, made this happen. First, the admin login form was implemented in JS, and if you visited it with JS disabled, it wouldn't verify your credentials at all; it also submitted via a GET request. Second, once you were in the admin interface, you could delete content from the site by clicking an X in the CMS, which, as was the pattern, presented you with a JS alert() prompt before deleting the content... via a GET request.
Looking at the server logs around the time it got "hacked", you could see Googlebot happily following all the delete links in the admin interface.
> I had a client once who had something similar, although unintentionally.
I did that too. I was aware of the problem, but at the time (1996) I did not know how to fix it.
So I just documented it and warned that they should keep the site away from AltaVista.
This was back before cookies had wide support, so login state was in the URL. If a search spider ever learned that URL, it would have deleted the entire site just by spidering it.
I did eventually fix it by switching to forms, and strengthening the URL token to expire if unused for a while. And then eventually switching to cookies (at one point it supported both url tokens and cookies).
I have not thought about those days in such a long time.
Obviously that is the solution. I know that now, I didn't then. (As I wrote: "I did eventually fix it by switching to forms.")
The POST-vs-GET distinction that everyone knows today, read-only versus write, was not that well known back then.
Back then you used GET for things with a small number of variables, and POST when you expected enough data that it wouldn't fit in the URL. It was all about the URL, not about the effect of the request.
I guess there was no Wikipedia article on HTTP back then; Wikipedia has since been an invaluable resource for me in understanding some of the intricacies in my work.
This site ran on IIS 1.0 on Windows NT 3.51. For scripting we used a prerelease ColdFusion version (i.e., the version before 1.0, which was released as we were developing the site, partially based on feedback we provided as we tested it).
> How did you prevent any visitor from deleting the site?
A secret security token in the URL. The worry was that some admin would try to submit the site to AltaVista for indexing without removing the token from the URL first.
Probably not. This happens more than you might think. I got called in to consult on a project where something similar was happening. Client would add products to their web store and the next day the products were missing.
Unsecured access and 'GET' based deletes were everywhere.
I accidentally deleted about half of the database at a startup where I’d recently started working by approximately the same method. I was running a copy of the web interface on my laptop, connecting over the internet to our MySQL server, and also running ht://dig’s spider on localhost from cron. It started spidering the delete links. Fortunately, I’d also started running daily MySQL backups from cron (there were no backups before I started working there), so we only lost a few hours of everyone’s work. As you can imagine, though, they weren’t super happy with me that day.
Fundamentally, authentication when someone tries to delete a thing needs to happen in server-side logic, not on the client side. The rest is flavouring.
Authentication should happen server side, but it need not happen at the moment of the delete. At delete time you should be authorizing and validating; doing that client side as well is fine for responsiveness, but if you are doing something server side (e.g., a delete), you must also authorize and validate it server side.
Basically do the exact opposite of everything they did. Doing authentication in client-side JavaScript is an absolute no-no. Using GET requests for things that have side effects (like deleting content) is another.
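To make that concrete, here's a minimal sketch of a delete endpoint done the opposite way. Flask is my choice of framework here; delete_item() and the session key are illustrative stand-ins, not anything from the original story:

```python
# A delete that only accepts POST, with the auth check done server
# side on every request.
from flask import Flask, session, abort

app = Flask(__name__)
app.secret_key = 'change-me'  # required for signed session cookies

def delete_item(item_id):
    """Stand-in for the real data layer."""
    pass

@app.route('/admin/delete/<int:item_id>', methods=['POST'])
def delete(item_id):
    # GET is rejected automatically with a 405, so a link-following
    # crawler can never trigger this, and the session check means a
    # JS-disabled visitor can't waltz in either.
    if not session.get('is_admin'):
        abort(403)
    delete_item(item_id)
    return 'deleted'
```

A real app would add a CSRF token on top of this, since POST alone doesn't stop a malicious page from forging the request, but at least the Googlebot failure mode is gone.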
I agree that there are lots of reasons that someone would make a site like this, but I think people are curious as to the maker's specific reason. From the github:
Why would you do such a thing? My full explanation was in the content of the site. (edit: ...which is now gone)
I'm curious as to what the website said originally.
I think the word "hacker" has more and more lost its original meaning, at least in this community. If I were reading a similar story on a Tor hidden service, say, I would not be asking why, but here I do.
It's totally twisted, because "business types" got involved and everyone started confusing "hacking" with "working". This story is a cool but simple hack, in the original meaning of the word. I find that a good definition of a hack is "a project or a trick that you can tell to your tech friends over a beer and have some good laugh from it".
An alternative would be to check the user agent, delete the website right at that point, and return a 404 page to the Google crawler. Then Google won't have a static copy of the website.
Your approach is "a website that irrevocably deletes itself once indexed by Google".
What OP has done is "a website that irrevocably deletes itself once Google decided to publicly reveal the fact that it indexed said website".
OP's approach has no way of knowing when the site was indexed. It's conceivable that Google indexed it on the very first day and decided not to share it publicly until 21 days later.
If you really want to get "technical", the former is when the site is "crawled" and the latter is when it's "served". "Indexing" happens in between the two.
Even if a request that claims to be from Googlebot is actually from Googlebot (which it might not be), that doesn't guarantee the site is indexed. It's impossible to know when the site is indexed without direct access to Google's index.
Actually, you could do a reverse DNS lookup on the IP of any request claiming to be Googlebot, followed by a forward DNS lookup on the hostname you get back. Legitimate Googlebots will be in the *.googlebot.com space.
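That round trip is straightforward to script; here's a minimal Python sketch (the function name and the simplistic error handling are mine):

```python
# Reverse-then-forward DNS check for a claimed Googlebot IP.
import socket

def is_real_googlebot(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]  # reverse lookup
    except socket.herror:
        return False
    # Google also publishes *.google.com names for some fetchers
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    try:
        # forward lookup must round-trip to the original IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
```

You'd want to cache the verdict per IP rather than do two DNS lookups on every request.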
That meta tag prevents Google from publicly showing their cached version of the page. In practice this means the "Cached" link, within the results, doesn't appear when a given page asks Google to NOARCHIVE -- which I believe can be 'asked for' via either the meta tag or via a special response header.
Edit:
Yeah, 'noarchive' can be specified via the meta tag or via the X-Robots-Tag HTTP header. Also available to you are a handful of other directives such as NoIndex, NoFollow, NoArchive, NoSnippet, NoTranslate, etc...
See these links for more in-depth info about the directives & which search engines support what:
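To make the two forms concrete, here's a minimal sketch using Python's built-in http.server (the page content and port are placeholders):

```python
# Serving the same 'noindex, noarchive' directives both ways: as a
# <meta> tag in the HTML and as an X-Robots-Tag response header (the
# header form also works for non-HTML resources like PDFs and images).
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<!doctype html>
<html><head>
  <meta name="robots" content="noindex, noarchive">
</head><body>Nothing to see here.</body></html>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('X-Robots-Tag', 'noindex, noarchive')
        self.end_headers()
        self.wfile.write(PAGE)

HTTPServer(('', 8000), Handler).serve_forever()
```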
What about the opposite? A website that is created when it is indexed? It starts with nothing, and content is added each time the site is visited by Googlebot, shared on Facebook, tweeted, posted on Reddit, etc. The website exists only so that it can be shared, and the act of sharing it defines what the website is.
This is an uber cool idea. Especially if, when this website is shared by someone, it would attempt to scan the sharer's public feed, last submissions, last comments, last tweets, etc. (depending on where it got shared), and generate additional content based on what it found.
Postmodernism is a lot more relevant to the digital age than anything, imo. It emphasizes pointing out ways of thinking and doing, which I think is especially relevant when we are actually automating most of our ways of thinking and doing.
I know it gets a bad rap because of the ridiculous examples, but the real point of it is to draw the viewer into a serious kind of contemplation of the massive infrastructure that exists and of how that infrastructure shapes our culture, thoughts, understanding, and action.
We have the expectation that the generations to come will accept this infrastructure and what it says about how the human mind functions. But much of it is founded on belief systems of how thought and action operate in the real world. Most of these systems are baseless, the idea of a base obfuscated only by the sheer complexity involved in understanding each layer.
I really look forward to when we, as academics, historically document and seriously examine the various phases of the internet, from a variety of alternative perspectives.
It's interesting while it's being built, but it's also interesting to look back and reflect on the bigger picture, outside of the buzzwords and technical terminology used to pull the creation through, and make it actualized.
I look forward to when critics and theorists start thinking about the goal of the internet from a social perspective, as a collective cultural subconscious directive. I look forward to the kinds of art-historical methodology used to explain the significance of Picasso or Manet in their respective periods being applied to reason about the relation between the internet and everything that is not the internet.
It's interesting when some information gets washed away and other information is retained through time, and it isn't always the stuff that is indexed that is retained. The idea that art critics can even agree to call the same collection of works "cubism" or "impressionism" fascinates me, and I look forward to the same kinds of invented vocabularies being used to describe various processes, movements, and patterns throughout internet culture (way beyond studying memes and tropes - there are so many layers to the collective psyche of the internet, it is dumbfounding).
I don't know what GeoCities represents. I'd have to define its 'kind' and compare and contrast it with other 'kinds' throughout time. I know this was meant to be a humorous comment, but I love to weave theories, and some of them even turn out to be descriptive of the nature of things.
The reason laconic wit is normally frowned upon is that it's almost disruptively lazy. In the event that almost everyone agrees with you, that's okay.
But should anyone disagree with you, they now have to do the heavy lifting for YOUR side. That saps someone's willingness even to converse with you, and if someone retorts with similarly laconic wit, the conversation breaks down really fast, because nobody is willing to put in the extra effort to flesh out someone else's opinion when there's no reciprocity or show of effort.
As far as I can tell, you just posted part of a random screengrab from your web browser for no obvious reason. Striking's response suggests that this is actually a reference to a site which, per the OP, is gone forever, along with any chance of getting your joke. So...I'm not really sure what you were expecting.
If anything, that's a much deeper comment than the website itself. No matter how hard you try, it's impossible to really destroy something once it's been on the web. Resistance is futile.
Of course, this won't prevent crawlers which do not honor these headers/meta tags from caching your site, but if you're not in Google's index you're likely not getting traffic from said crawlers.
I see some potential use for this: for example, as soon as Google's crawlers reach the site, I know that it's accessible from the outside, and I destroy it.
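A crude tripwire along those lines, sketched in Python; the log path and the destroy() hook are illustrative assumptions:

```python
# Poll the access log for the first hit claiming to be Googlebot,
# then fire whatever "destroy" action you have in mind.
import time

LOG = '/var/log/nginx/access.log'  # placeholder path

def destroy():
    print('Googlebot reached us: the site is visible from outside.')

while True:
    with open(LOG, errors='replace') as log:
        if any('Googlebot' in line for line in log):
            destroy()
            break
    time.sleep(60)
```

Per the sub-thread above, a raw user-agent match only tells you that someone *claiming* to be Googlebot visited; you'd want the reverse/forward DNS check before pulling the trigger.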