
yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i could really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index...

the article is pretty much correct (although strangely worded in places). the stuff about "communicating with google via robots.txt comments" is of course not true; the examples he gives are developer jokes, nothing more.

still, you should not use comments in robots.txt. why?

you can group user agents, e.g.:

    User-agent: Googlebot
    User-agent: bingbot
    User-Agent: Yandex
    Disallow: /
Congrats, you have just disallowed googlebot, bingbot and yandex from crawling (not indexing, just crawling).
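to make the grouping concrete, here is a minimal sketch of a group parser (my own hypothetical code, not what the npm package actually does): consecutive User-agent lines with no rule lines between them collect into one group.

```javascript
// hypothetical minimal robots.txt group parser:
// consecutive User-agent lines form a single group
function parseGroups(txt) {
  const groups = [];
  let current = null;
  for (const raw of txt.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === "user-agent") {
      // start a new group only if the previous group already has rules
      if (current === null || current.rules.length > 0) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value);
    } else if (current !== null) {
      current.rules.push({ key, value });
    }
  }
  return groups;
}

const groups = parseGroups(
  "User-agent: Googlebot\nUser-agent: bingbot\nUser-Agent: Yandex\nDisallow: /"
);
console.log(groups.length);           // 1 (one group for all three bots)
console.log(groups[0].agents.length); // 3
```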

ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-Agent: Yandex
    Disallow: /
so, you have definitely blocked yandex, and you do not care about bingbot (commented out), but what about googlebot? are googlebot and yandex part of the same user-agent group? or is googlebot its own group and yandex its own group? if the commented line is interpreted as a blank line, then googlebot and yandex are different groups; if it's interpreted as nonexistent, they belong together.

the way i read the spec https://developers.google.com/webmasters/control-crawl-index..., this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt: it has this undefined behaviour, it does not follow HTTP 301 redirects (which were unspecified when i coded it), and it tries to do too much (fetching and analysing; it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml
and return HTTP 200

why: if you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some response body that can be misleading. also, it's quite common that the staging robots.txt file spills over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.

also, read the spec: https://developers.google.com/webmasters/control-crawl-index...


