Hacker News

You need to remove the stop words before visualizing the word frequencies. Words such as 'the' and 'and' don't need to be in the analysis.
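For what it's worth, here is a minimal sketch of what filtering would look like. The stop word list and the `top_words` helper are made up for illustration (real lists, such as the ones shipped with NLP libraries, are much longer), and the tokenizer is a deliberately crude regex.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "he", "was", "it"}


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most common words in text, skipping stop words."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)


sample = "The elves and the dwarves went to the mountain, and the elves sang."
print(top_words(sample, n=3))
```

Passing `stop_words=set()` gives the unfiltered counts, so both views can be plotted side by side.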


They are included because they were the reason people gave when asked why "The Silmarillion" was so unreadable.

(I'm the author)


The problem with stop words is that they tend to be the most common words in every piece of text[1], regardless of genre.

So, in order to test the hypothesis "Silmarillion is harder to read because it has lots of stop words", you need to calculate the relative word frequencies for lots of other texts and see if there's something special about Silmarillion's top 10 versus all the others'.

Granted, you have already done that using LOTR and The_Hobbit, but a much bigger sample is needed. At the very least, you may want to use 10-15 other works of fantasy from different authors, and even that will only be a back-of-the-envelope test to see if it is worth pursuing this experiment with a statistically significant sample.

[edit] 1. Provided it is sufficiently large.
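A rough sketch of the back-of-the-envelope comparison, in case it helps. Everything here is an assumption for illustration: the short stop word list, the regex tokenizer, and the helper names (`top_relative_frequencies`, `stop_word_share`, `compare_to_baseline`). The last one only reports how many standard deviations the target text sits from the baseline mean, which is a sanity check, not a significance test.

```python
import re
import statistics
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "he", "was", "it"}


def tokenize(text):
    """Crude tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_relative_frequencies(text, n=10):
    """Top n words, each with its share of all tokens in the text."""
    words = tokenize(text)
    total = len(words)
    return [(w, c / total) for w, c in Counter(words).most_common(n)]


def stop_word_share(text, stop_words=STOP_WORDS):
    """Fraction of all tokens in the text that are stop words."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(w in stop_words for w in words) / len(words)


def compare_to_baseline(target_text, baseline_texts):
    """Distance of the target's stop word share from the baseline mean,
    measured in baseline standard deviations."""
    shares = [stop_word_share(t) for t in baseline_texts]
    mean = statistics.mean(shares)
    spread = statistics.stdev(shares)  # needs at least two baseline texts
    if spread == 0:
        raise ValueError("baseline texts all have the same stop word share")
    return (stop_word_share(target_text) - mean) / spread
```

With the Silmarillion as `target_text` and the 10-15 other fantasy works as `baseline_texts`, a result close to zero would suggest its stop word usage is nothing special.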



