I've crawled a popular social network on a large scale, and I'm currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.
Here are some tricks which may or may not work today:
- Have an app where a user logs in through said website, then scrape their friends using that user's token. That way you get huge leverage on the number of API calls you can make, with just a handful of users.
- Call their API over IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.
- Scrape the mobile website. Even Facebook still has a non-JS mobile version. That single WAP-era mobile site sidesteps most of the JS-based anti-scraping measures they have.
- From a purely practical perspective, start with a bare-metal, transaction-isolation-free database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based cluster with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
- Don't be too gentle with the big websites. They can afford to keep all their data in hot pages, and as a one-man operation you will never exhaust them.
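The IPv6 tip pays off when the limiter counts individual /128 addresses rather than /64 subnets. A minimal sketch of rotating source addresses within a routed /64, using only the standard library (the prefix here is the IPv6 documentation range, not a real allocation):

```python
import ipaddress
import random

def random_addr_in_prefix(prefix: str) -> str:
    """Pick a random host address inside a routed IPv6 prefix.

    If the target rate-limits per /128 (single address) instead of
    per /64, every request can come from a fresh source address in
    a subnet you already control.
    """
    net = ipaddress.IPv6Network(prefix)
    # random interface identifier within the prefix's host bits
    offset = random.getrandbits(net.max_prefixlen - net.prefixlen)
    return str(net.network_address + offset)

addr = random_addr_in_prefix("2001:db8:1234:5678::/64")
# bind the outgoing socket to `addr` before each request
```

Whether this works depends entirely on how the target keys its rate limiter; a per-/64 limiter makes the whole subnet count as one client.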
> - Call their API over IPv6, because they may not yet have a proper IPv6 subnet-based rate limiter.
Nice tip!!
> - From a purely practical perspective, start with a bare-metal, transaction-isolation-free database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based cluster with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).
> Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch instance isn't open to the world on the internet!).
Sure, it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's CQL binary protocol: it's simple and always to the point.
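To make the overhead point concrete, here is a toy comparison (not the actual CQL wire framing, just the shape of the difference): the same row serialized as a JSON HTTP body versus a packed binary frame where the schema is agreed up front.

```python
import json
import struct

user_id, score = 123456789, 0.875

# Elasticsearch-style transport: HTTP + JSON, field names repeated
# in every single document you index
http_body = json.dumps({"user_id": user_id, "score": score}).encode()

# CQL-binary-style transport: values packed into a fixed-width frame
# (">qd" = big-endian int64 + float64, 16 bytes total)
binary_frame = struct.pack(">qd", user_id, score)

print(len(http_body), len(binary_frame))
```

On top of the per-row byte savings, a binary protocol skips HTTP header parsing and JSON decoding on every request, which is where the CPU goes at high ingest rates.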
You forgot the part about exposing your finished database through an unprotected Elasticsearch HTTP endpoint ;)
In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP without credentials? It seems to be the default. What is the use case for this?
If I've understood you right, you break the TOS of other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?
I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000 ms slower than the average single-thread latency, it is time to back off a bit. It is also a cooperative strategy between fellow scrapers, even if they don't know about the others pushing load onto the servers.
A 1000 ms slowdown is massive when revenue-noticeable impacts are far, far smaller. I don't know the legal details, but hitting a site hard enough to cause 1000 ms slowdowns seems like it's approaching DoS territory.
YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.
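A rough sketch of that rotation loop. `spin_up`, `tear_down`, and the region slugs are stand-ins for whatever your provider's SDK actually exposes, not a real API:

```python
import itertools

REGIONS = ["fra1", "sgp1", "nyc3", "syd1"]  # made-up region slugs

def spin_up(region):
    """Placeholder: create an hourly-billed instance via your provider's
    SDK and return its public IP. The address below is from the
    TEST-NET-3 documentation range, not a real host."""
    return f"203.0.113.{REGIONS.index(region)}"

def tear_down(region):
    """Placeholder: destroy the instance before the next billing hour ticks."""

def rotate(hours):
    """Cycle through regions, using each fresh instance's IP for ~one hour."""
    ips = []
    for region in itertools.islice(itertools.cycle(REGIONS), hours):
        ip = spin_up(region)
        ips.append(ip)
        # ... route the scraper's traffic through `ip` for just under an hour ...
        tear_down(region)
    return ips
```

Tearing down before the billing hour ticks over keeps the cost at one instance-hour per IP, which is usually cheaper than a commercial proxy pool.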