I know it's possible to download the entire Wikipedia database, but does anyone know of a way to download every article in this list? Preferably as a torrent.


I can offer one option that's in the ballpark of what you're looking for:

https://wiki.kiwix.org/wiki/Content_in_all_languages

Kiwix is a standalone viewer (available for desktop, phone, or as a web server, including a Sandstorm.io package I put together) for archived websites of various sorts, Wikipedia being the flagship. That link lists everything available, and it's a lot. If you Ctrl-F for "physics" (and keep searching until you hit the language you want), you'll see they have subsets of Wikipedia catering to many interests: physics, basketball, "for schools", history, etc.

All content packages are indeed available as torrents.


Please use the API. Do not scrape Wikipedia via the website.

What you're looking for is:

https://en.wikipedia.org/wiki/Special:Export

You can start with the index page, collect all the page titles you're interested in, and then use the Special:Export API to download XML (and probably other formats too) for all those pages.
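As a sketch of that workflow (the title and helper names are illustrative, not part of the API), fetching one page's XML via Special:Export might look like:

```python
import urllib.parse
import urllib.request

def export_url(title):
    """Build the Special:Export URL for a single page title."""
    return ("https://en.wikipedia.org/wiki/Special:Export/"
            + urllib.parse.quote(title.replace(" ", "_")))

def fetch_export_xml(title):
    """Download the export XML for one article (makes a network call)."""
    with urllib.request.urlopen(export_url(title)) as resp:
        return resp.read().decode("utf-8")

print(export_url("Jane Austen"))
# https://en.wikipedia.org/wiki/Special:Export/Jane_Austen
```

Loop that over your collected titles (with a polite delay between requests) instead of scraping the rendered pages.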


I was about to say that a torrent is hardly necessary. How big could 1,000 mostly-text files get? Pretty big, as it turns out. Downloading a dozen random entries from that list, the sizes seem to average around 2 MB, and that's including only the small images on each page (not the full-size picture you get when you click an image). So 1,000 entries at 2 MB each would be 2 GB.

Picking apart just one page (the Jane Austen entry), the plain ASCII text with no markup is only 88 KB. The 19 small images, plus some tiny buttons and logos, are 536 KB, and the markup (HTML, CSS, and whatnot) is 497 KB. I was surprised that Wikipedia, in terms of page weight, is mostly images and markup. (Not complaining, of course. Wikipedia is one of the few big sites on the web that doesn't throw in gratuitous and irrelevant images and videos.)
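A quick sanity check of those numbers (all figures taken from the measurements above):

```python
# Per-page breakdown for the Jane Austen entry, in KB (figures quoted above).
text_kb, images_kb, markup_kb = 88, 536, 497
total_kb = text_kb + images_kb + markup_kb
print(total_kb)                      # 1121 KB; images + markup dominate
print(f"{1000 * 2 / 1024:.1f} GB")   # 1,000 entries at 2 MB each ~ 2.0 GB
```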


Are you including page history? Are you allowed to re-distribute without page history?

Also, you probably know this, but there is a prize for compressing bits of Wikipedia. Here's the Wikipedia page about it: https://en.wikipedia.org/wiki/Hutter_Prize

http://prize.hutter1.net/

About the test data: http://mattmahoney.net/dc/textdata.html

And discussion on HN: https://news.ycombinator.com/item?id=7405129


Not a torrent or a full solution, but applying the regex /wiki/(?![A-Za-z]+:)[A-Za-z0-9%()_]+ to the page source should select all the article links (along with some generic Wikipedia links near the bottom); the lookahead skips namespaced pages like File: or Talk:. Batch-prepending "https://en.wikipedia.org" to the beginning of each line then gives full URLs.
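In Python, a sketch of that extraction (the exact character class is an assumption about which title characters to allow):

```python
import re

# Match /wiki/ links; the negative lookahead skips namespaced
# pages such as File: or Talk:. Character class is a guess at intent.
PATTERN = re.compile(r'/wiki/(?![A-Za-z]+:)[A-Za-z0-9%()_,.\-]+')

def article_urls(html):
    """Extract unique /wiki/ article paths and prefix the domain."""
    return sorted("https://en.wikipedia.org" + path
                  for path in set(PATTERN.findall(html)))

sample = '<a href="/wiki/Jane_Austen">JA</a> <a href="/wiki/File:Portrait.jpg">img</a>'
print(article_urls(sample))
# ['https://en.wikipedia.org/wiki/Jane_Austen']
```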

Here's one such list: https://hastebin.com/terugezeda

wget has an option (-i) to read URLs line by line from a text file, but it sadly makes a mess of the images. Using

  wget --span-hosts --convert-links --adjust-extension --page-requisites --no-host-directories --no-parent --wait=1 --reject="robots.txt" -i wget.txt 
or

  wget -H -k -E -p -nH -np -w 1 -R "robots.txt" -i wget.txt
for short.

Maybe someone has a better idea for the last step.

edit: shorthand version


I'd recommend inliner for the last step: https://www.npmjs.com/package/inliner



