Hacker News

I've crawled a popular social network at a large scale, and am currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.

Here are some tricks which may or may not work today:

- Have an app where the user logs in through said website, then scrape their friends using that user's token. That way you get exponential leverage on the number of API calls you can make with just a handful of users.

- Call their API through IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.

- Scrape the mobile website. Even Facebook still has a non-JS mobile version. This single WAP/mobile site defeats every anti-scraping measure they may have.

- From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.

- Don't be too kind to the big websites. They can afford to keep all their data in hot pages, and as a one-man operation you will never exhaust them.
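The IPv6 trick above can be sketched roughly like this in Python. Everything here is an assumption for illustration: the `2001:db8::/64` block is the IPv6 documentation prefix, not a routable one, and `fetch` only works if the prefix you substitute is actually routed to your host (e.g. added to an interface).

```python
import http.client
import ipaddress
import random

# Hypothetical /64 assigned to your VM; 2001:db8::/64 is the reserved
# documentation prefix -- substitute the block your provider routes you.
PREFIX = ipaddress.ip_network("2001:db8::/64")

def random_source_address(prefix: ipaddress.IPv6Network) -> ipaddress.IPv6Address:
    """Pick a random address inside the routed prefix.

    A rate limiter that buckets per-address sees 2**64 distinct clients;
    only one that buckets by /64 (or coarser) catches this.
    """
    offset = random.randrange(prefix.num_addresses)
    return prefix.network_address + offset

def fetch(host: str, path: str = "/") -> int:
    """Issue one request bound to a fresh source address in the prefix."""
    addr = str(random_source_address(PREFIX))
    conn = http.client.HTTPSConnection(host, source_address=(addr, 0))
    conn.request("GET", path)
    return conn.getresponse().status
```

The interesting part is `source_address`: the kernel will happily bind an outgoing connection to any address in a prefix that is routed to the machine, so each request can look like a different client.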



> - Call their API through IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.

Nice tip!!

> - From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.

Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).


> Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).

Sure, it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's "CQL binary protocol" [1]; it's simple and to the point.

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_pr...


You forgot the part about exposing your finished database through an unprotected Elasticsearch HTTP endpoint ;)

In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP without credentials? It seems to be the default. What is the use case for this?


Tbh I'm still selling that data.

For a while I've had recurring nightmares that my DB had been stolen and published, together with an article on how stupid and incompetent I am.


If I've understood you right, you break the ToS of other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?


> You forgot the part about exposing your finished database through an unprotected Elasticsearch HTTP endpoint ;)

I'll cut straight to the chase and post it on HN. The intermediate step of waiting for someone to discover it takes too long.


The use case is a local datacenter, with a NAT-ed IP not exposed to the public internet.


A firewalled IP would be much more appropriate; NAT is not a firewall or a security mechanism.


Same thing, more-or-less. And NAT is effectively a firewall for inbound traffic, even if a lot of people say it isn't.


> Have an app where the user logs in through said website, then scrape their friends using that user's token.

That's an extremely shady thing to do.


Welcome to the internet!


> Don't be too kind to the big websites.

I usually recommend latency-based dynamic load control for that. Once the website starts replying 500-1000 ms slower than your baseline single-thread latency, it's time to back off a bit. It's also a cooperative strategy among fellow scrapers, even if none of them know about the others pushing load on the same servers.
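A minimal sketch of that idea, with my own names and thresholds (the parent only gives the 500-1000 ms rule of thumb): measure a single-thread baseline first, then grow concurrency additively while latency stays near it and halve it when the site slows down.

```python
class LatencyLoadController:
    """Additively grow concurrency; back off when the site slows down.

    `baseline_s` should be measured with a single thread before ramping
    up; `slack_s` is the tolerated slowdown before backing off.
    """

    def __init__(self, baseline_s: float, slack_s: float = 0.5,
                 min_workers: int = 1, max_workers: int = 64):
        self.baseline_s = baseline_s
        self.slack_s = slack_s
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.workers = min_workers
        self._ewma = baseline_s  # smoothed observed latency

    def observe(self, latency_s: float) -> int:
        """Feed one request's latency; returns the new worker budget."""
        # Exponentially weighted moving average smooths out jitter.
        self._ewma = 0.9 * self._ewma + 0.1 * latency_s
        if self._ewma > self.baseline_s + self.slack_s:
            # Site is struggling: halve our load (multiplicative decrease).
            self.workers = max(self.min_workers, self.workers // 2)
        else:
            # Headroom available: add one worker (additive increase).
            self.workers = min(self.max_workers, self.workers + 1)
        return self.workers
```

This is the same additive-increase/multiplicative-decrease shape TCP congestion control uses, which is what makes it cooperative: every scraper running it backs off at roughly the same signal.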


1000 ms is a massive slowdown when revenue-noticeable impacts are far, far smaller. I don't know the legality, but hitting a site hard enough to cause 1000 ms slowdowns seems to be approaching denial-of-service territory.


Don't you consider this unethical -- if not against the site itself, then at least against the other users of the site whose data you're scraping?


Wow these are some hot tips!

YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then spin up another in Singapore for the next hour.

Pretending to be Googlebot also helps.
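The hourly rotation above boils down to a trivial scheduler. The region names below are illustrative, and the actual provision/teardown calls are provider-specific (so they are left as stubs in the comments):

```python
# Hypothetical region list; any provider with hourly billing works.
REGIONS = ["eu-central-1", "ap-southeast-1", "us-west-2", "sa-east-1"]

def region_for_hour(hour: int, regions=REGIONS) -> str:
    """Round-robin region for the given elapsed hour, so each instance
    lives roughly one billing hour before being replaced elsewhere.

    A driver loop would do, per hour:
      instance = provision(region_for_hour(h))   # provider-specific stub
      ... run the scraper from that instance ...
      destroy(instance)                          # provider-specific stub
    """
    return regions[hour % len(regions)]
```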


> - Call their API through IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.

Clever. VMs with IPv6 are cheap, as a bonus :)

Same for the non-JS mobile site. Thanks for the tips!


> - Call their API through IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.

How would someone do that using node.js? Asking for a friend.



