I've crawled a popular social network on a large scale, and I'm currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.
Here are some tricks which may or may not work today:
- Have an app where a user logs in through said website, then scrape their friends using that user's token. That way you get huge leverage on the number of API calls you can make, with just a handful of users.
- Call their API over IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.
- Scrape the mobile website. Even Facebook still has a non-JS mobile version. That single WAP-era mobile site sidesteps most of the JS-based anti-scraping measures they have.
- From a purely practical perspective, start with a bare-metal, transaction-isolation-free database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based cluster with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
- Don't be too gentle with the big websites. They can afford to keep all their data in hot pages, and as a one-man operation you will never exhaust them.
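The IPv6 tip pays off when the limiter counts individual /128 addresses rather than /64 subnets. A minimal sketch of rotating source addresses within a routed /64, using only the standard library (the prefix here is the IPv6 documentation range, not a real allocation):

```python
import ipaddress
import random

def random_addr_in_prefix(prefix: str) -> str:
    """Pick a random host address inside a routed IPv6 prefix.

    If the target rate-limits per /128 (single address) instead of
    per /64, every request can come from a fresh source address in
    a subnet you already control.
    """
    net = ipaddress.IPv6Network(prefix)
    # random interface identifier within the prefix's host bits
    offset = random.getrandbits(net.max_prefixlen - net.prefixlen)
    return str(net.network_address + offset)

addr = random_addr_in_prefix("2001:db8:1234:5678::/64")
# bind the outgoing socket to `addr` before each request
```

Whether this works depends entirely on how the target keys its rate limiter; a per-/64 limiter makes the whole subnet count as one client.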
> - Call their API over IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.
Nice tip!!
> - From a purely practical perspective, start with a bare-metal, transaction-isolation-free database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based cluster with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).
> Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).
Sure, it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's CQL binary protocol: it's simple and always to the point.
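To make the overhead point concrete, here is a toy comparison (not the actual CQL wire framing, just the shape of the difference): the same row serialized as a JSON HTTP body versus a packed binary frame where the schema is agreed up front.

```python
import json
import struct

user_id, score = 123456789, 0.875

# Elasticsearch-style transport: HTTP + JSON, field names repeated
# in every single document you index
http_body = json.dumps({"user_id": user_id, "score": score}).encode()

# CQL-binary-style transport: values packed into a fixed-width frame
# (">qd" = big-endian int64 + float64, 16 bytes total)
binary_frame = struct.pack(">qd", user_id, score)

print(len(http_body), len(binary_frame))
```

On top of the per-row byte savings, a binary protocol skips HTTP header parsing and JSON decoding on every request, which is where the CPU goes at high ingest rates.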
You forgot the part about exposing your finished database through an unprotected Elasticsearch HTTP endpoint ;)
In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP without credentials? It seems to be the default. What is the use case for this?
If I've understood you right, you break the TOS of other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?
I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000 ms slower than the average single-thread latency, it is time to back off a bit. It is also a cooperative strategy between fellow scrapers, even if they don't know about the others pushing load onto the servers.
A 1000 ms slowdown is massive when revenue-noticeable impacts are far, far smaller. I don't know the legal details, but hitting a site hard enough to cause 1000 ms slowdowns seems like it's approaching DoS territory.
YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.
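A rough sketch of that rotation loop. `spin_up`, `tear_down`, and the region slugs are stand-ins for whatever your provider's SDK actually exposes, not a real API:

```python
import itertools

REGIONS = ["fra1", "sgp1", "nyc3", "syd1"]  # made-up region slugs

def spin_up(region):
    """Placeholder: create an hourly-billed instance via your provider's
    SDK and return its public IP. The address below is from the
    TEST-NET-3 documentation range, not a real host."""
    return f"203.0.113.{REGIONS.index(region)}"

def tear_down(region):
    """Placeholder: destroy the instance before the next billing hour ticks."""

def rotate(hours):
    """Cycle through regions, using each fresh instance's IP for ~one hour."""
    ips = []
    for region in itertools.islice(itertools.cycle(REGIONS), hours):
        ip = spin_up(region)
        ips.append(ip)
        # ... route the scraper's traffic through `ip` for just under an hour ...
        tear_down(region)
    return ips
```

Tearing down before the billing hour ticks over keeps the cost at one instance-hour per IP, which is usually cheaper than a commercial proxy pool.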