Visualizing Tolkien (5013.es)
70 points by josem on Oct 4, 2012 | 54 comments


>One of the reviewers argued that it was the hardest book to read because 'and' was the most used word in the book.

I think the author took this way too literally. Granted, it inspired some fun (albeit odd and basic) data analysis. But the reviewer's point isn't about word counts; it's about style and pace. Narratively, the Silmarillion just felt like it continued on and on without clear sections of rising and falling action or significant breaks. Sentences were long and atmospheric, rather than short, quick and active.


I haven't gotten through the whole book, but it's also written in a style which makes it hard to immerse yourself in it. Characters aren't described in detail, major events are described in a few words, and entire lifetimes are glossed over... Certainly epic and at times quite beautiful, but completely different to LoTR.


The Silmarillion is just meant to be read in pieces, I think. It's very similar to the Old Testament, and I don't know a single person who read the Old Testament in one sitting.


> Sentences were long and atmospheric, rather than short, quick and active.

Interestingly, I think the larger frequency of 'of' actually suggests this. I suspect the Silmarillion (I haven't read it in ... a decade maybe?) has a far more complicated web of relationships–not just between characters, but between places; at the very least, it concerns itself with genealogies more.


Where are the sentence length statistics? Average word length? Commas per sentence? All are surely much more indicative of reading difficulty than that "originality index" he came up with.

Why didn't the author try the most obvious test of reading difficulty of all: Flesch-Kincaid? [1]

[1] http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...


Hi, author here.

First, it's she, not he.

Second, because I hadn't heard about that obvious test at all back then. I never claimed to be doing a super-serious scientific analysis; I just wanted to answer the questions that came to my mind, using a computer to validate hypotheses.

But many thanks for the pointers & suggestions, though! I was thinking about rebuilding this to make it realtime+interactive so that more than one text could be analysed/visualised, so I'll make sure to introduce the new tests.


Hi author, I really liked the visualisations, especially the black hole thing.

I think I may have misunderstood the purpose of your article: I first read it as a serious attempt to use textual analysis to compare the comprehensibility of Tolkien's best-known works, in which case you fell way short of the mark by going no further than word counting. I think it started off like that, but on re-reading it I see that you say you hit a brick wall (paraphrasing) and decided to have some fun visualising your results so far; something you did very nicely. So perhaps I got the wrong end of the stick.

I thought about what you've taken on here. Doing things at the word level is pretty easy. Taking into account grammatical structure to get things like sentence lengths and clauses per sentence, or breaking words up into syllables (needed for F-K, for example) is considerably harder. I'm interested to hear what you come up with. Flesch & Flesch-Kincaid are US inventions so perhaps not obvious to everyone.

On he/she: I agonised over that for a good ten minutes over my breakfast. I read around your blog and your twitter page this morning for clues as to the appropriate pronoun. In the end I couldn't tell, so I went with 'he' because I stereotyped you. I nearly changed my wording to use constructions like "the author," "they," and various other mealy-mouthed alternatives but they were too ugly. So I did try, but I got it wrong. I apologise.


Isn't FK determined almost completely by sentence length? That is what I recall from messing with MS Word docs in high school.


FK combines average sentence length (total words/total sentences), average syllables per word and some fixed coefficients to come up with an equivalent school grade.

Flesch Reading Ease score, which is what I actually meant, does the same but with different coefficients to come up with a more granular difficulty score, usually in the range of 30-100.

They're both pretty arbitrary. The more I read up on this subject the more respect I have for the author's own attempts at an originality score. It's all subjective ultimately.
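
For anyone who wants to try this, here is a minimal Python sketch of the two formulas described above, using the standard published coefficients. The syllable counter is a crude vowel-group heuristic, and the names `count_syllables` and `readability` are made up for illustration; a real analysis would probably want a pronunciation dictionary and a proper sentence splitter.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a real syllabifier):
    count runs of vowels and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid grade for `text`."""
    # Naive sentence split on ., ! and ? (abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }
```

Both scores are built from the same two averages, which is why long sentences move them so much; only the coefficients differ.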


> On he/she

While I consider defaulting to 'he' to be legitimate and acceptable, I actually prefer the zie/zir gender-neutral pronouns when I think about it. http://santiago.mapache.org/nonfiction/essays/zie.html If I ever begin to agonize about the gender of the person I'm talking about, that's enough to kick me over into using GNPs.


Have fun with it! This seems like a great example to encourage further analysis because you still don't have a testable quantitative hypothesis for why it was harder to read. Makes me want to start coming up with metrics, too! Of course there's a big list of readability metrics, but it's way more fun to discover them on (y)our own. That said, I too was a little surprised when you had a heading titled "The classic graphs" without actually having perspective in the field for what would be classic.


The "originality" index bothered me, because as the length of a work grows, you'd generically expect fewer new words to be introduced -- exactly what the results show.

The idea makes sense, but I'm unclear on how to actually measure it in a way that's normalized by page count.


There's the concept of a vocabulary growth curve. It shows the number of words occurring once as a function of the amount of text, e.g., how many new words in the first 1000 words? How many in the first 2000 words? Etc.
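
A rough Python sketch of such a curve, in case it helps. It records two numbers at each checkpoint: distinct words seen so far, and words seen exactly once so far (the reading given above). The function name, the `step` parameter and the simple regex tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def vocabulary_growth(text: str, step: int = 1000) -> list:
    """Return (tokens_read, distinct_words, words_seen_once) every `step` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    seen_once = 0
    curve = []
    for i, token in enumerate(tokens, start=1):
        counts[token] += 1
        if counts[token] == 1:
            seen_once += 1  # first occurrence: currently occurs once
        elif counts[token] == 2:
            seen_once -= 1  # second occurrence: no longer occurs once
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(counts), seen_once))
    return curve
```

Comparing two books at the same number of tokens read sidesteps the problem that a longer book naturally introduces fewer new words per page.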


But is there a good way to boil that down to a single metric? (i.e., can we parametrize these curves with a single value?)


I wouldn't know off the top of my head, but the concept was introduced by Harold Baayen. There are free PDFs online of his with lots of interesting quantitative techniques.


Binomial distribution mumble mumble logarithm of page count mumble.


Would it be possible for you to generate a HD version of the "Graphical representation of words frequency" image? It would look great as a poster.


Nice work! So my question is where did you get the text to analyze?


You can find the text for pretty much any popular work if you look in the proper places :-)


The painstaking lengths people go to in honor of Tolkien are amazing. Along those lines I think http://3rin.gs/ deserves a mention, especially as maps are a "visualization" of sorts. I met the author this summer. The attention to detail is staggering.


Great post, I love the thinking out loud thought process portions.

Some suggestions: try removing stopwords (the, and, etc.); it'll bring out more variation in the analysis, particularly in the circular graphs.

Try unique n-gram analysis, I suspect that will show something interesting. At the very least 2-gram (bigram/digram) analysis might show something cool.

Another interesting comparative measure: since the author already has all of the tokens, try the Jaccard index between book pairs to look for similarity (you may even want to break the books down into major sections, e.g. LOTR can be viewed as both 3 and 6 books).

Some others to try: sentence length, word length, distribution of tf-idf scores, etc.

fun fun!
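
To make the first few suggestions concrete, a small Python sketch (the stopword list is a tiny illustrative sample, not a standard list, and all function names are hypothetical):

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "was", "he", "it"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(tokens: list) -> list:
    """Drop stopwords so the remaining words show what is specific to the text."""
    return [t for t in tokens if t not in STOPWORDS]


def bigrams(tokens: list) -> list:
    """Pair each token with its successor (2-grams)."""
    return list(zip(tokens, tokens[1:]))


def jaccard(a, b) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For two books, `jaccard(content_words(tokenize(book1)), content_words(tokenize(book2)))` gives a vocabulary overlap between 0 and 1; the same function works on sets of bigrams.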


Yes--maybe I am thinking out 'too loudly' sometimes, but I think it's interesting since that explains how one can come to a conclusion, and maybe if it is wrong, it can be corrected as you know how the conclusion was reached. It's like you can "debug" a thought process, in a way, as you have a "trace" :-)

And thanks for all the suggestions! I've made a note of them all. Good to hear from people who know more about the topic than me.


I think the author should have filtered out stopwords. The big "The" in the middle of the visualization seems kind of obvious. It would be much more interesting to see the words that make this text different from other English texts.


I did some work while still in college doing analysis of texts from Project Gutenberg. Removing the stop words made the analysis far more interesting.


Why does everyone find the Silmarillion hard to read? Am I the only one who had no problems with that style?


I've heard many people say that it "reads like the Bible", or like a history book: there aren't any "viewpoint characters", or really any recurring characters at all (especially if inscrutable angelic beings don't count). Certainly if you're expecting another story in the style of The Lord of the Rings, you'll be disappointed.

Personally, I loved the book, but I was hungry for all the "lore" about Middle-earth and its history that I could find. I think most people would really prefer a self-contained story.

Oh, and for the record, an underlying reason for many people disliking The Silmarillion may be that Tolkien never really finished it. I have a little bit of a writeup of the story behind that on my Tolkien FAQ site: http://tolkien.slimy.com/faq/External.html#SilmChanges


I suspect that the author may have had more luck explaining the difficulty of the Silmarillion if he had compared the number of unique proper nouns. I enjoyed it, but ISTR when I read it (over 30 years ago now) that it did have the feel of a telephone directory.
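
One cheap way to approximate that count, sketched in Python: treat a word as a likely proper noun if it appears capitalized and never appears in all-lowercase form anywhere in the text. This is only a heuristic (the function name is made up, and it will wrongly keep things like "I" or rare sentence-initial words), but it needs no NLP library.

```python
import re


def likely_proper_nouns(text: str) -> set:
    """Heuristic: capitalized words that never occur in lowercase form."""
    words = re.findall(r"[A-Za-z]+", text)
    lowercase_seen = {w for w in words if w.islower()}
    return {
        w for w in words
        if w[0].isupper() and w.lower() not in lowercase_seen
    }
```

Comparing `len(likely_proper_nouns(text))` per thousand words across the books would give a rough "telephone directory" score.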


Apologies for the gender snafu in that post.


I was thinking the same. I never had problems with Silmarillion. Couldn't read Messages from Middle Earth however, that was just not enough connection. But Silmarillion, man, that's an epic collection of stories. I think it might be a cool material for a TV series. Say, one season for every major story with the same characters, plus some creative freedom for the screenwriters of course.

Just think about it - battles with hordes of orcs, hero elves AND 30 or so Balrogs. And losing / winning a battle doesn't just blow up a tower, it creates spasms in damn Middle-earth. The ring wars got nothing on that.


Agreed. Compared to the Silmarillion, the war in The Lord of the Rings seems kind of cute. Imagine a thousand balrogs riding on dragons against the elven city Gondolin.


Later parts of the book might be good (or even great) in movie format, but there's no way one can turn Ainulindalë into a movie... Or can they? For years I was passionately against the idea, but now that I think about it again, maybe I'm just being over-pessimistic? I mean, LOTR turned out to be quite good...


IMO "quite good" is an understatement; it has become the best any LotR fan could have ever hoped for. And about the Silmarillion: think about how good some TV screenwriters have become in recent years. IMO TV turns out more high-quality scripts than movies do nowadays, mostly because the big-budget movie industry always needs to take the safe bets (and so you get stupid flicks like Battleship et al.)


It reads like a history book to me, or a collection of legends, rather than a traditional story arc. I enjoyed the Silmarillion, but more like one would enjoy reading about the rise and fall of the Roman empire than reading fiction. It was more of a learning experience. It also made LOTR a lot richer when I went back and read it again.


From what I remember, the first couple of chapters contain very close to 0 dialogue; most people simply aren't used to reading books like that.


That depends. Those who like reading folklore and legends actually appreciate such style.


I tried reading it a few times and never made it more than 1/3 of the way through. I kept losing my place because it all sounded the same.


Then try the audiobook: http://www.randomhouse.com/book/179221/the-silmarillion-boxe...

It's great. And I don't say that about any audiobook (and I listen to audiobooks quite often). The narrator does a great, great job.


I like it too. So not everyone finds it hard to read. But it's not some casual fantasy for sure.


I do research in this topic. While word-level features are readily available and provide some interesting insights, I'm more interested in syntactic phenomena (e.g., recurring phrases). But in the case of The Silmarillion the "reading difficulty" could be beyond even that; I suspect it might be because it doesn't build up suspense the way the traditional narrative of a novel does. Unfortunately, that might be hard to capture in a programmable test.


"Visualizing English Text" seems more accurate a title. Cute process, but the result is entirely generic.


Agreed. It's just applying some standard (albeit amusing) tools to a particular text (amounting to 1 book in 5 parts) and getting results no different from any other text. Word clouds etc have been done, but all still result in pretty much the same sort of textual haze, with nothing inspiring a unique view of a unique tome.

Mapping semantic associations would be more interesting. Something akin to http://xkcd.com/657/ generated by the in-text juxtaposition of names & related verbs.


"I wrote a simple program who counted how many times did each word appear in the [The Silmarillion]."

Out of curiosity, where did the author get a copy of the text to analyze?

"Turin" is the only name in The Silmarillion graph? Interesting.


Author here.

I got them in txt files, online. I own the original books too. I would have typed them in if I had all the time in the world, of course.


While I find this somewhat fascinating, I think this kind of analysis would be similar to trying to figure out how to make great food by examining the molecular contents of the final product.


I almost closed the tab right away because the text only started below my fold. The 4 graphics fit the screen perfectly so I assumed that was all. Make sure you scroll!


You need to remove the stop words before visualizing them. Words such as 'the', 'and', etc. don't need to be in the analysis.


They are included because they were the reason people gave when asked why "The Silmarillion" was so unreadable.

(I'm the author)


The problem with stop words is that they tend to be the most common words in every piece of text[1], regardless of genre.

So, in order to test the hypothesis "Silmarillion is harder to read because it has lots of stop words", you need to calculate the relative frequencies for lots of other texts and see if there's something special about Silmarillion's top 10 versus all the others'.

Surely, you have already done that using LOTR and The_Hobbit, but a much bigger sample is needed. At the very least, you may want to use 10-15 other works of fantasy from different authors, and even that will be just a back-of-the-envelope test to see if it is worth pursuing this experiment with a statistically significant sample.

[edit] 1. Provided it is sufficiently large.
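
A minimal Python sketch of the per-text calculation (the function name and the regex tokenizer are assumptions): it returns the top N words of a text with each word's share of all tokens, so texts of different lengths can be compared directly.

```python
import re
from collections import Counter


def relative_frequencies(text: str, top_n: int = 10) -> list:
    """Return the top_n most common words with their share of all tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    return [(word, count / total) for word, count in counts.most_common(top_n)]
```

Running this over each book in the sample and lining up the shares of 'and', 'of', 'the', etc. would show whether the Silmarillion's numbers actually stand out.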


ok great. but will you read the damn book already?!?!

;)


rather: Wasting time and attention on useless statistics and pointless visualizations.


It quickly becomes apparent that the author has no experience with NLP.

Still, I have mixed feelings regarding your comment. First, the author shows a true hacker spirit by trying to use the computer to solve problems. Second, the author shows great courage in publishing his/her thoughts on the Internet.

I do not want to live in a world where people are afraid of publishing their research (as insignificant as it might be) because of other people calling it "a waste of time" or "useless and pointless".


Says the one with a nickname from mythology/fantasy (Darkover-Cycle)... ;-)


I'm not sure what you mean. I was criticizing the viz, not the literature.

Incidentally, I never read anything by Lovecraft, but am a big fan of Tolkien. I came up with this nickname on the spot.


Err---that would be Marion Zimmer Bradley (see: http://en.wikipedia.org/wiki/Marion_Zimmer_Bradley), not Lovecraft. It's possible she got it from "The King in Yellow", usually referred to as 'The Yellow Book', but I don't know for sure. See: http://en.wikipedia.org/wiki/The_King_in_Yellow



