Hacker News

That's a separate project:

- https://github.com/fake-name/ExHentai-Archival

- https://github.com/fake-name/PatreonArchiver

- https://github.com/fake-name/xA-Scraper

- https://github.com/fake-name/DanbooruScraper

Or... well, 4 separate projects. Whoops?

At one point, a friend and I were looking at trying to basically replicate the Google DeepDream neural-net thing, only with a training set of porn. It turns out that getting a well-tagged dataset for training is somewhat challenging.

Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.

Next up, automate the consumption too!



At least Ex supports torrents, and it also has some custom P2P software you can run (it serves content) from which data can be siphoned off.

And what is served through their website is resized, so web scraping is an inferior approach.


You seem to be assuming that:

1. I'm scraping the resized galleries.

2. I don't have the Hath perk that makes the galleries full sized.

3. I don't have a phash-based fuzzy image deduplication system on top of all this (see https://github.com/fake-name/IntraArchiveDeduplicator). Its main purpose is to deduplicate manga (https://github.com/fake-name/MangaCMS).
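The core idea of phash-based fuzzy deduplication is: compute a compact perceptual hash for each image, then treat two images as duplicates when their hashes differ by only a few bits (Hamming distance). Here's a minimal, hypothetical sketch of that idea using a simplified average-hash over raw pixel grids, rather than the DCT-based pHash a real system like IntraArchiveDeduplicator uses; the image names and the `dedupe` helper are invented for illustration:

```python
def average_hash(pixels):
    """Hash a grayscale pixel grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe(images, threshold=2):
    """Keep an image only if its hash differs from every kept hash by more
    than `threshold` bits; anything closer is treated as a fuzzy duplicate."""
    kept, hashes = [], []
    for name, pixels in images:
        h = average_hash(pixels)
        if all(hamming(h, k) > threshold for k in hashes):
            kept.append(name)
            hashes.append(h)
    return kept

# Two near-identical 4x4 "images" (one pixel of noise) and one distinct image:
img_a = ("a.png", [[10, 200, 10, 200]] * 4)
img_b = ("b.png", [[10, 200, 10, 201]] * 4)  # near-duplicate of a
img_c = ("c.png", [[200, 10, 200, 10]] * 4)  # inverted pattern, kept
print(dedupe([img_a, img_b, img_c]))  # b.png is dropped as a fuzzy dupe
```

The exact threshold is a tuning knob: too low and recompressed or resized copies slip through, too high and visually distinct images collapse together.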


Jesus, your projects are massive. Does your job involve working on these or are these just side things?


It's all entirely hobby things.


Oh my god. Can you share any results?


The project never went anywhere, unfortunately, and I haven't had time to look at it recently.

I have huge, uh, "datasets" around still, though.


You're doing god's work.



