Hacker News
Hacker News' Reading Level (google.com)
57 points by aeurielesn on Dec 13, 2010 | 55 comments


Can someone briefly explain how google determines reading level? I'm assuming it's using something like the Flesch–Kincaid test:

http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...

A brief search on wikipedia reveals a few readability tests, but they all seem to be based on sentence/syllable ratios, not content complexity.

http://en.wikipedia.org/wiki/Category:Readability_tests

And in general they all rank multi-syllable (longer) words higher, which would mean a conversation between two Java API writers would be ranked higher than a Ruby conversation :)

Java vs Ruby vs Lisp: http://i.imgur.com/tq3pA.png
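For reference, Flesch–Kincaid grade level is just a linear combination of words-per-sentence and syllables-per-word. A rough sketch (the syllable counter is a crude vowel-group approximation, and this is certainly not Google's actual implementation) shows why long Java identifiers would score "harder":

```python
import re

def count_syllables(word):
    # Crude approximation: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# Long Java-style identifiers score "harder" than short Ruby-style words:
java = "AbstractSingletonProxyFactoryBean implements InitializingBean."
ruby = "Each item maps to a new list."
```

Running this, `flesch_kincaid_grade(java)` comes out far above `flesch_kincaid_grade(ruby)` purely because of syllable count, with no notion of actual content complexity.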


I figured that, since Google must read every word of every page to spider it, they must thereby have, as a byproduct, the world's most accurate database of word usage frequencies. "Reading level" would then just be a measure of the average frequency of all the words on a page (thus making words learned in a first year ESL class simple, and technical jargon advanced—quite the same as the measure of difficulty used by language proficiency exams.) The fact that the average of Simple English Wikipedia articles seems to be more intermediate than basic, though (29/52/17), would argue against that—unless the calculations are being biased by all the very infrequent proper nouns.
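If that's right, the score could be sketched as average word rarity, i.e. mean negative log frequency. A toy version (the frequency table here is invented for illustration; Google would have real crawl counts):

```python
import math

# Hypothetical frequencies per million words; a real system would use crawl counts.
FREQ = {"the": 50000, "a": 40000, "of": 30000, "cat": 1000, "sat": 800, "eigenvalue": 2}

def rarity_score(text, default_freq=1):
    # Rarer words contribute larger -log(frequency) values; unseen words
    # fall back to default_freq, i.e. are treated as maximally rare.
    words = text.lower().split()
    return sum(math.log(1_000_000 / FREQ.get(w, default_freq))
               for w in words) / len(words)
```

Under this scheme `rarity_score("the cat sat")` comes out below `rarity_score("eigenvalue of a")`, which is the behavior described: ESL-class vocabulary reads as simple, jargon as advanced.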


I'd be lying if I said I don't doubt you are not incorrect ;)

What you are proposing is a statistically generated version of the Gunning Fog Index (http://en.wikipedia.org/wiki/Gunning_fog_index) or the Flesch–Kincaid test (http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...).

If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl. A method like that would show that my first sentence is pretty unreadable, while an algorithm based on word complexity would see it as pretty simple.
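The "deviation from a Markov chain" idea is roughly per-word surprisal under an n-gram model. A toy bigram sketch (tiny made-up corpus, naive smoothing; nothing like what Google would actually run):

```python
import math
from collections import defaultdict

def train_bigrams(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def surprise(text, counts, smoothing=1e-3):
    # Average negative log-probability of each bigram transition.
    words = text.lower().split()
    total = 0.0
    for a, b in zip(words, words[1:]):
        following = counts[a]
        prob = (following[b] + smoothing) / (sum(following.values()) + smoothing)
        total += -math.log(prob)
    return total / max(1, len(words) - 1)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigrams(corpus)
```

Here `surprise("the cat sat", model)` is much lower than `surprise("sat the cat", model)`: same words, same syllables, but the scrambled ordering deviates from the chain, which is exactly what a word-complexity metric would miss.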


> I'd be lying if I said I don't doubt you are not incorrect

Holy Shnikes, that's a tough one to parse! I wasn't sure what you were saying here, so I'm gonna break it down, working from the end of the sentence:

1. I'd be lying if I said I don't doubt you are not incorrect

2. I'd be lying if I said I don't doubt you are CORRECT

3. I'd be lying if I said I don't think you are INCORRECT

4. I'd be lying if I said I think you are CORRECT

5. I think you are INCORRECT

The idea is that each of the previous statements is saying basically the same thing; I'm just cancelling negatives each time. Anyway, am I correct to assume that you think the GP is incorrect?
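The cancellation above is just negation parity; each pair of negatives cancels, and an odd count flips the claim. As a throwaway sketch:

```python
# The sentence "I'd be lying if I said I don't doubt you are not incorrect"
# applies five negations to the base claim "you are correct":
negations = ["lying", "don't", "doubt", "not", "incorrect"]

claim = True  # "you are correct"
for _ in negations:
    claim = not claim  # each negation flips the truth value

# An odd number of flips leaves the claim negated: "you are incorrect".
```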


Your transition 2 -> 3 is not justifiable: "doubt" is not the opposite of "think". A better parsing leaves "doubt" alone and would end with: "I doubt you are correct."


My dear, you just lost the "don't".


I'd be lying if I said I don't doubt you are not incorrect

I'd be lying if I said I don't doubt you are correct

I'd be lying if I said I think you are correct


I had to triple check to make sure I got it right, and I wrote it.

Yes, I think the GP is incorrect.


> If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl.

Indeed, that was my second thought, but I wonder if the gains are really all that large over a raw statistical analysis of the word bag, and whether they're worth the extra analysis space/time. It really depends on what Google is planning on doing with this metadata, internally; if an order-of-ten precision is fine (to pick out decisive categorizations), the raw analysis may be all that's needed.


Well we could always try it out. Here's 24GB (compressed) of ngram data from Google: http://googleresearch.blogspot.com/2006/08/all-our-n-gram-ar...


> I'm assuming it's using something like the Flesch–Kincaid test

Yes, it seems to be, since it's only available for content in English.

Although, as far as I know, they haven't said anything about the algorithm yet.

You can see what seems to be the official announcement here: http://www.google.com/support/forum/p/Web%20Search/thread?ti...


Pity us old C programmers. Every for loop was indexed by i, with j for inner loops. Every string was indexed by sp, or cp if you were a purist and considered strings a figment of the imagination. You never used names longer than 8 characters, because even if the compiler allowed it, the linker surely wouldn't.


Well, it's official: 4chan is smarter than us.

http://www.google.com/search?q=site:news.ycombinator.com&...


Well, 4chan.org is simply the homepage; if you link to the actual boards (specifically /b/): https://encrypted.google.com/search?hl=en&tbs=rl:1&q...


After dealing with my sick wife, I very much needed that laugh. Thank you.


You, sir, have made my day.


There's hope for us yet!


You guys are not really helping.


LOL XD


That's ok, arxiv.com is smarter than everyone.

https://encrypted.google.com/search?q=site:arxiv.com&hl=...


To get that, they have to be discarding nearly every word that is common to most web pages and grading only on uncommon words. There is no other way to get that low a basic and intermediate score.
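If so, the filtering step would look something like this (the stopword list is invented for illustration; presumably Google would derive it from crawl frequencies):

```python
# Hypothetical stopword list; a real system would build it from corpus counts.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "we", "for", "on"}

def content_words(text):
    # Drop words common to most pages and grade only what's left.
    return [w for w in text.lower().split() if w not in STOPWORDS]
```

On a phrase like "we prove a bound on the entropy", only "prove", "bound", and "entropy" survive, so a page full of such sentences would be graded almost entirely on its jargon.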


Wow, 99% advanced. That actually gives me a lot more confidence in Google's algorithms.


Maybe Google's algorithm categorizes made up words as intermediate? :D


I know it's comparing apples to oranges, but I contribute a lot on Stack Overflow so I took a look at the results.

https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

I was expecting a higher proportion in the "advanced" category.

For comparison:

Math Overflow: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

OnStartups: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

English Language & Usage: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

CS Theory: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

Seasoned Advice (Cooking): https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

Physics: https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...

It's no surprise really, but it seems like the more technical jargon used on a site, the higher the reading level Google assigns.



https://encrypted.google.com/search?hl=en&safe=off&t...;

So much for Simple English wikipedia!


Interesting. Perhaps in their effort to avoid short scientific terms they have substituted longer, more common words: for example, describing a tumor as "an abnormal new mass of tissue that serves no purpose" (http://wordnetweb.princeton.edu/perl/webwn?s=tumor). The seemingly paradoxical result is an increase in understandability but a decrease in usability. Think of it as reading a passage where almost every noun has been replaced by a dictionary definition.


^ is comparing Reddit with Wikipedia. Results are not shocking.


At the risk of stating the obvious, you can get the results for the reading level of your own HN comments by adding your user name. It's kind of interesting going through the leader board and looking up different people's scores.


More accurately, it gives the reading level of all discussions in which you commented.


Unfortunately, by drawing attention to this information I am now obligated to substitute concise words with superfluous ones.



I suspect this is because the algorithm measures word use and not grammar. HN tends to have people who follow the "simple is better" mantra and who write in a tone similar to either an executive summary or an online dialog, thus an accessible reading level.


According to Google, an HN thread about vomiting is "advanced".

http://news.ycombinator.com/item?id=315490


I think it's worth noting that most of the articles submitted to HN have no comments on them. If those comment-free pages were indexed by Google and included in its measure of our reading level, there would be many pages consisting of only the words "flag, 1 point by xxxx, no comments, Hacker News, new, threads..." etc. So maybe that has some influence on Google's calculation.


I got to thinking, how would America's universities compare? It's an interesting thought: The best universities should have the most advanced reading level.

I've published my findings, a complete ranking, and source code: http://log.largevoid.com/2010/12/ranking-colleges-by-reading...


It would be cool to be able to compare two sites on one page, as with Google Trends.

A few examples that I thought were neat:

- msnbc (44/55/1) vs bbc.co.uk (15/82/2)

- facebook (40/37/22) vs linkedin (2/90/6) (I wonder why facebook has so much "advanced" content according to Google)

- wordpress (35/47/16) vs xanga (76/23/1)

- boston college (5/41/53) vs harvard (2/6/91)


I like how the result for "noobcomments" is marked as a Basic Reading Level.


The first thing I noticed was "noobstories" being marked Advanced. It is probably because noobs are trying their best to get karma by submitting some advanced stories.


yeah, but "best" is 100% intermediate


As simple as possible, but not simpler :-)


It would be equally interesting to only see the reading level of the articles, rather than the comments.


An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.


> An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.

Put simply, high reading level doesn't mean good content. (trying to prove your point :D)


"contrived banality disguised as conceptual depth"

Is it possible this phrase is an example of itself? :-)


Words that describe themselves: short, convoluted, awkward, multilingual.


Is the word (assuming it to be one) "non-self-describing" self-describing, or non-self-describing?

http://en.wikipedia.org/wiki/Grelling-Nelson%20paradox


"Pentasyllabic"


      I am sure it was indented as such.


Irony: (n) - incongruity between what might be expected and what actually occurs


I like words, so I know a lot of words. But I try to take the wordiness way down. Why lose people who don't know the same words?

I mean, even if you don't care about losing people who don't have time to learn a more precise-but-obscure synonym, there's always people who learned English as a third language. Why alienate them just to look a little clever?

Even this comment has too many big words. If I had more time, I'd edit it down more. "alienate" would become "turn off" and "precise-but-obscure synonym" would become... I dunno, something simpler. Big obscure words that lock people out are really missing the forest for the trees.


You're only considering the negative aspect of "big words". Sure, there are many instances in which people unnecessarily use lesser-known synonyms, but it is helpful when people utilize a vocabulary in order to be concise. I think that while the current mantra of "don't use big words so everyone can understand" is called for because of the number of people abusing vocabulary so others will consider them intelligent, we should remain cognizant of the benefits succinct communication provides.


And it is even more important for precision, when a less common word is closer to what you want to say than a more popular one. This is one point Robert Heinlein made in Expanded Universe when he was defending Doc Smith's "space operas" and the vocabulary that was used in them. (Note that since "the map is not the territory", nothing, not your thoughts or even a physical description, can be exactly captured in words, but that doesn't mean it is not sometimes important or at least useful to try to be as precise as possible.)


This is a great comment. I've always thought of "reading level" as a bit bogus, looking at the wrong thing. When determining the quality of content I look at it from two angles: does the content convey the idea the writer intended, and does the reader comprehend the idea the writer intended?

If I'm reading something I would prefer it be written as simply as possible. From a writer's standpoint, though, this can be much harder than just explaining an idea in complex terms.

To paraphrase Pascal, "I apologize for the length of this letter, but I did not have time to make it shorter." I like to think length in this case meant complexity.


go to ltu for advanced reading, here for startups.



