Hacker News
Hacker News' Reading Level (google.com)
57 points by aeurielesn on Dec 13, 2010 | 55 comments


Can someone briefly explain how google determines reading level? I'm assuming it's using something like the Flesch–Kincaid test:

http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...

A brief search on wikipedia reveals a few readability tests, but they all seem to be based on sentence/syllable ratios, not content complexity.

http://en.wikipedia.org/wiki/Category:Readability_tests

And in general they all rank multi-syllable (longer) words higher, which would mean a conversation between two Java API writers would be ranked higher than a Ruby conversation :)

Java vs Ruby vs Lisp: http://i.imgur.com/tq3pA.png
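For reference, Flesch–Kincaid grade level is just a linear combination of words-per-sentence and syllables-per-word. A rough sketch (the syllable counter is a crude vowel-group approximation, and this is certainly not Google's actual implementation) shows why long Java identifiers would score "harder":

```python
import re

def count_syllables(word):
    # Crude approximation: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# Long Java-style identifiers score "harder" than short Ruby-style words:
java = "AbstractSingletonProxyFactoryBean implements InitializingBean."
ruby = "Each item maps to a new list."
```

Running this, `flesch_kincaid_grade(java)` comes out far above `flesch_kincaid_grade(ruby)` purely because of syllable count, with no notion of actual content complexity.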


I figured that, since Google must read every word of every page to spider it, they must thereby have, as a byproduct, the world's most accurate database of word usage frequencies. "Reading level" would then just be a measure of the average frequency of all the words on a page (thus making words learned in a first year ESL class simple, and technical jargon advanced—quite the same as the measure of difficulty used by language proficiency exams.) The fact that the average of Simple English Wikipedia articles seems to be more intermediate than basic, though (29/52/17), would argue against that—unless the calculations are being biased by all the very infrequent proper nouns.
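If that's right, the score could be sketched as average word rarity, i.e. mean negative log frequency. A toy version (the frequency table here is invented for illustration; Google would have real crawl counts):

```python
import math

# Hypothetical frequencies per million words; a real system would use crawl counts.
FREQ = {"the": 50000, "a": 40000, "of": 30000, "cat": 1000, "sat": 800, "eigenvalue": 2}

def rarity_score(text, default_freq=1):
    # Rarer words contribute larger -log(frequency) values; unseen words
    # fall back to default_freq, i.e. are treated as maximally rare.
    words = text.lower().split()
    return sum(math.log(1_000_000 / FREQ.get(w, default_freq))
               for w in words) / len(words)
```

Under this scheme `rarity_score("the cat sat")` comes out below `rarity_score("eigenvalue of a")`, which is the behavior described: ESL-class vocabulary reads as simple, jargon as advanced.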


I'd be lying if I said I don't doubt you are not incorrect ;)

What you are proposing is a statistically generated version of the Gunning Fog Index (http://en.wikipedia.org/wiki/Gunning_fog_index) or the Flesch–Kincaid test (http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...).

If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl. A method like that would show that my first sentence is pretty unreadable, while an algorithm based on word complexity would see it as pretty simple.
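The "deviation from a Markov chain" idea is roughly per-word surprisal under an n-gram model. A toy bigram sketch (tiny made-up corpus, naive smoothing; nothing like what Google would actually run):

```python
import math
from collections import defaultdict

def train_bigrams(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def surprise(text, counts, smoothing=1e-3):
    # Average negative log-probability of each bigram transition.
    words = text.lower().split()
    total = 0.0
    for a, b in zip(words, words[1:]):
        following = counts[a]
        prob = (following[b] + smoothing) / (sum(following.values()) + smoothing)
        total += -math.log(prob)
    return total / max(1, len(words) - 1)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigrams(corpus)
```

Here `surprise("the cat sat", model)` is much lower than `surprise("sat the cat", model)`: same words, same syllables, but the scrambled ordering deviates from the chain, which is exactly what a word-complexity metric would miss.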


> I'd be lying if I said I don't doubt you are not incorrect

Holy Shnikes, that's a tough one to parse! I wasn't sure what you were saying here, so I'm gonna break it down, working from the end of the sentence:

1. I'd be lying if I said I don't doubt you are not incorrect

2. I'd be lying if I said I don't doubt you are CORRECT

3. I'd be lying if I said I don't think you are INCORRECT

4. I'd be lying if I said I think you are CORRECT

5. I think you are INCORRECT

The idea is that each of the previous statements is saying basically the same thing; I'm just cancelling negatives each time. Anyway, am I correct to assume that you think the GP is incorrect?
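The cancellation above is just negation parity; each pair of negatives cancels, and an odd count flips the claim. As a throwaway sketch:

```python
# The sentence "I'd be lying if I said I don't doubt you are not incorrect"
# applies five negations to the base claim "you are correct":
negations = ["lying", "don't", "doubt", "not", "incorrect"]

claim = True  # "you are correct"
for _ in negations:
    claim = not claim  # each negation flips the truth value

# An odd number of flips leaves the claim negated: "you are incorrect".
```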


Your transition 2 -> 3 is not justifiable: "doubt" is not the opposite of "think". A better parsing leaves "doubt" alone and would end with: "I doubt you are correct."


My dear, you just lost the "don't".


I'd be lying if I said I don't doubt you are not incorrect

I'd be lying if I said I don't doubt you are correct

I'd be lying if I said I think you are correct


I had to triple check to make sure I got it right, and I wrote it.

Yes, I think the GP is incorrect.


> If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl.

Indeed, that was my second thought, but I wonder if the gains are really all that large over a raw statistical analysis of the word bag, and whether they're worth the extra analysis space/time. It really depends on what Google is planning on doing with this metadata, internally; if an order-of-ten precision is fine (to pick out decisive categorizations), the raw analysis may be all that's needed.


Well we could always try it out. Here's 24GB (compressed) of ngram data from Google: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-ar...


> I'm assuming it's using something like the Flesch–Kincaid test

Yes, it seems to be, since it's only available for content in English.

Although, as far as I know, they haven't said anything about the algorithm yet.

You can see what seems to be the official announcement here: http://www.google.com/support/forum/p/Web%20Search/thread?ti...


Pity us old C programmers. Every for loop was indexed by i, with j for inner loops. Every string was indexed by sp, or cp if you were a purist and considered strings a figment of the imagination. You never used names longer than 8 characters, because even if the compiler allowed it, the linker surely wouldn't.


Well, it's official: 4chan is smarter than us.

http://www.google.com/search?q=site:news.ycombinator.com&...


Well, 4chan.org is simply the homepage; if you link to the actual boards (specifically /b/): https://encrypted.google.com/search?hl=en&tbs=rl:1&q...


After dealing with my sick wife, I very much needed that laugh. Thank you.


You, sir, have made my day.


There's hope for us yet!


You guys are not really helping.


LOL XD


That's ok, arxiv.com is smarter than everyone.

https://encrypted.google.com/search?q=site:arxiv.com&hl=...


To get that, they have to be discarding nearly every word that is common to most web pages and grading only on uncommon words. There is no other way to get that low a basic and intermediate score.
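If so, the filtering step would look something like this (the stopword list is invented for illustration; presumably Google would derive it from crawl frequencies):

```python
# Hypothetical stopword list; a real system would build it from corpus counts.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "we", "for", "on"}

def content_words(text):
    # Drop words common to most pages and grade only what's left.
    return [w for w in text.lower().split() if w not in STOPWORDS]
```

On a phrase like "we prove a bound on the entropy", only "prove", "bound", and "entropy" survive, so a page full of such sentences would be graded almost entirely on its jargon.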


Wow, 99% advanced. That actually gives me a lot more confidence in Google's algorithms.


Maybe Google's algorithm categorizes made up words as intermediate? :D


I know it's comparing apples to oranges, but I contribute a lot on Stack Overflow so I took a look at the results.

https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

I was expecting a higher proportion in the "advanced" category.

For comparison:

Math Overflow: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

OnStartups: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

English Language & Usage: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

CS Theory: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

Seasoned Advice (Cooking): https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

Physics: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

It's no surprise really, but it seems like the more technical jargon used on a site, the higher the reading level Google assigns.



https://encrypted.google.com/search?hl=en&safe=off&t...;

So much for Simple English wikipedia!


Interesting. Perhaps in their effort to avoid short scientific terms they have substituted longer, more common words: for example, describing a tumor as "an abnormal new mass of tissue that serves no purpose" (http://wordnetweb.princeton.edu/perl/webwn?s=tumor). The seemingly paradoxical result is an increase in understandability but a decrease in usability. Think of it as reading a passage where almost every noun has been replaced by a dictionary definition.


^ is comparing Reddit with Wikipedia. Results are not shocking.


At the risk of stating the obvious, you can get the results for the reading level of your own HN comments by adding your user name. It's kind of interesting going through the leader board and looking up different people's scores.


More accurately, it gives the reading level of all discussions in which you commented.


Unfortunately, by drawing attention to this information I am now obligated to substitute concise words with superfluous ones.



I suspect this is because the algorithm measures word use and not grammar. HN tends to have people who follow the "simple is better" mantra and who write in a tone similar to either an executive summary or an online dialog, thus an accessible reading level.


According to Google, an HN thread about vomiting is "advanced".

http://news.ycombinator.com/item?id=315490


I think it's worth noting that most of the articles submitted to HN have no comments on them. If those comment-free pages were indexed by Google and included in its measure of our reading level, there would be many pages consisting of only the words "flag, 1 point by xxxx, no comments, Hacker News, new, threads..." etc. So maybe that has some influence on Google's calculation.


I got to thinking, how would America's universities compare? It's an interesting thought: The best universities should have the most advanced reading level.

I've published my findings, a complete ranking, and source code: http://log.largevoid.com/2010/12/ranking-colleges-by-reading...


It would be cool to be able to compare two sites on one page, as with Google Trends.

A few examples that I thought were neat:

- msnbc (44/55/1) vs bbc.co.uk (15/82/2)

- facebook (40/37/22) vs linkedin (2/90/6) (I wonder why facebook has so much "advanced" content according to Google)

- wordpress (35/47/16) vs xanga (76/23/1)

- boston college (5/41/53) vs harvard (2/6/91)


I like how the result for "noobcomments" is marked as a Basic Reading Level.


The first thing I noticed was "noobstories" being marked Advanced. It is probably because noobs are trying their best to get karma by submitting some advanced stories.


yeah, but "best" is 100% intermediate


As simple as possible, but not simpler :-)


It would be equally interesting to only see the reading level of the articles, rather than the comments.


An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.


> An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.

Put simply, high reading level doesn't mean good content. (trying to prove your point :D)


"contrived banality disguised as conceptual depth"

Is it possible this phrase is an example of itself? :-)


Words that describe themselves: short, convoluted, awkward, multilingual.


Is the word (assuming it to be one) "non-self-describing" self-describing, or non-self-describing?

http://en.wikipedia.org/wiki/Grelling-Nelson%20paradox


"Pentasyllabic"


      I am sure it was indented as such.


Irony: (n) - incongruity between what might be expected and what actually occurs


I like words, so I know a lot of words. But I try to take the wordiness way down. Why lose people who don't know the same words?

I mean, even if you don't care about losing people who don't have time to learn a more precise-but-obscure synonym, there's always people who learned English as a third language. Why alienate them just to look a little clever?

Even this comment has too many big words. If I had more time, I'd edit it down more. "alienate" would become "turn off" and "precise-but-obscure synonym" would become... I dunno, something simpler. Big obscure words that lock people out are really missing the forest for the trees.


You're only considering the negative aspect of "big words". Sure, there are many instances in which people unnecessarily use lesser-known synonyms, but it is helpful when people utilize a vocabulary in order to be concise. I think that while the current mantra of "don't use big words so everyone can understand" is called for because of the number of people abusing vocabulary so others will consider them intelligent, we should remain cognizant of the benefits succinct communication provides.


And it is even more important for precision, when a less common word is closer to what you want to say than a more popular one. This is one point Robert Heinlein made in Expanded Universe when he was defending Doc Smith's "space operas" and the vocabulary that was used in them. (Note that since "the map is not the territory", nothing, not your thoughts or even a physical description, can be exactly captured in words, but that doesn't mean it is not sometimes important or at least useful to try to be as precise as possible.)


This is a great comment. I've always thought of "reading level" as a bit bogus, looking at the wrong thing. When determining the quality of content I look at it from two angles: does the content convey the idea the writer intended, and does the reader comprehend the idea the writer intended?

If I'm reading something I would prefer it be written as simply as possible. From a writer's standpoint, though, this can be much harder than just explaining an idea in complex terms.

To paraphrase Pascal, "I apologize for the length of this letter, but I did not have time to make it shorter." I like to think length in this case meant complexity.


go to ltu for advanced reading, here for startups.



