I figured that, since Google must read every word of every page to spider it, they must thereby have, as a byproduct, the world's most accurate database of word-usage frequencies. "Reading level" would then just be a measure of the average frequency of all the words on a page (thus making words learned in a first-year ESL class simple, and technical jargon advanced, much the same as the measure of difficulty used by language proficiency exams). The fact that Simple English Wikipedia articles seem to average out as more intermediate than basic (29/52/17), though, would argue against that, unless the calculations are being biased by all the very infrequent proper nouns.
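That frequency theory is easy to sketch. The counts below are made up purely for illustration; the assumption is that a real system would pull them from crawl-wide statistics:

```python
from collections import Counter

# Toy corpus counts standing in for crawl-wide word frequencies
# (invented numbers, only the relative order matters here).
CORPUS_FREQ = Counter({
    "the": 500_000, "cat": 12_000, "sat": 8_000, "on": 90_000,
    "mat": 3_000, "neoplasm": 40, "histology": 55, "etiology": 70,
})
TOTAL = sum(CORPUS_FREQ.values())

def avg_frequency(text):
    """Average corpus frequency of the words on a page: common,
    first-year-ESL vocabulary scores high, technical jargon low."""
    words = [w for w in text.lower().split() if w.isalpha()]
    if not words:
        return 0.0
    return sum(CORPUS_FREQ[w] / TOTAL for w in words) / len(words)
```

A page would then be binned basic/intermediate/advanced by thresholds on this score; the proper-noun bias mentioned above would show up as lots of zero-frequency words dragging the average down.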
If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl. A method like that would show that my first sentence is pretty unreadable, while an algorithm based on word complexity would see it as pretty simple.
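Something like that deviation measure can be sketched with a bigram chain. The training text here is a toy stand-in for the crawl; the point is that a scrambled sentence of simple words scores as highly "surprising" even though a pure word-complexity measure would call it basic:

```python
import math
from collections import Counter

# Toy training text standing in for the crawl (an assumption for illustration).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def surprise(words):
    """Average bigram surprisal in bits: how far a word sequence
    deviates from the chain's expectations (add-one smoothing)."""
    vocab = len(unigrams)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        total += -math.log2(p)
    return total / max(len(words) - 1, 1)
```

An in-order sentence built from common words scores low, while the same words shuffled score high, which is exactly the case a bag-of-words frequency measure cannot distinguish.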
> I'd be lying if I said I don't doubt you are not incorrect
Holy Shnikes, that's a tough one to parse! I wasn't sure what you were saying here, so I'm gonna break it down, working from the end of the sentence:
1. I'd be lying if I said I don't doubt you are not incorrect
2. I'd be lying if I said I don't doubt you are CORRECT
3. I'd be lying if I said I don't think you are INCORRECT
4. I'd be lying if I said I think you are CORRECT
5. I think you are INCORRECT
The idea is that each of the previous statements says basically the same thing; I'm just cancelling negatives each time. Anyway, am I correct to assume that you think the GP is incorrect?
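The cancellation above can be modelled as stacked boolean NOTs over the proposition "you are correct" (treating "I'd be lying if I said", "don't", "doubt", "not", and the "in-" prefix each as one negation, which is itself an interpretive choice):

```python
# Each clause read as one boolean NOT over "you are correct":
# "I'd be lying if I said" -> not, "don't" -> not, "doubt" -> not,
# "not" -> not, "incorrect" -> not("correct"): five NOTs in total.
def reduce_negations(count, proposition=True):
    """An even stack of negations cancels out; an odd stack flips."""
    return proposition if count % 2 == 0 else not proposition

verdict = reduce_negations(5)  # five negations, odd, so the claim flips
```

Five is odd, so the whole sentence reduces to one surviving negation, matching step 5 above.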
Your transition 2 -> 3 isn't justifiable: "doubt" is not the opposite of "think". A better parsing leaves "doubt" alone and ends with: "I doubt you are correct."
> If I were Google, I'd try that, but I'd also try something like working out percentage deviation from a Markov chain generated from their crawl.
Indeed, that was my second thought, but I wonder if the gains are really all that large over a raw statistical analysis of the word bag, and whether they're worth the extra analysis space/time. It really depends on what Google is planning on doing with this metadata, internally; if an order-of-ten precision is fine (to pick out decisive categorizations), the raw analysis may be all that's needed.
Pity us old C programmers. Every for loop was indexed by i, with j for inner loops. Every string was indexed by sp, or cp if you were a purist and considered strings a figment of the imagination. You never used names longer than 8 characters, because even if the compiler allowed it, the linker surely wouldn't.
To get that, they have to be discarding nearly every word that is common to most web pages and grading only on uncommon words. There is no other way to get that low a basic and intermediate score.
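That discard-the-common-words theory would look something like this; the stopword set is invented for illustration, standing in for whatever Google counts as "common to most web pages":

```python
# Hypothetical set of words "common to most web pages".
COMMON = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "it", "for"}

def graded_words(text):
    """Drop the ubiquitous words first; only what survives gets graded."""
    return [w for w in text.lower().split() if w.isalpha() and w not in COMMON]
```

So `graded_words("the mitochondria is the powerhouse of the cell")` keeps only the three content words, and the page's score rests entirely on those.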
Interesting. Perhaps in their effort to avoid short scientific terms they have substituted longer, more common words. For example, describing a tumor as "an abnormal new mass of tissue that serves no purpose".
http://wordnetweb.princeton.edu/perl/webwn?s=tumor
The seemingly paradoxical result is an increase in understandability but a decrease in usability. Think of it as reading a passage where almost every noun has been replaced by a dictionary definition.
At the risk of stating the obvious, you can get the results for the reading level of your own HN comments by adding your user name. It's kind of interesting going through the leaderboard and looking up different people's scores.
I suspect this is because the algorithm measures word use and not grammar. HN tends to have people who follow the "simple is better" mantra and who write in a tone similar to either an executive summary or an online dialog, thus an accessible reading level.
I think it's worth noting that most of the articles submitted to HN have no comments on them. If those comment-less pages were indexed by Google and included in its measure of our reading level, there would be many pages consisting of only the words "flag, 1 point by xxxx, no comments, Hacker News, new, threads..." etc. So maybe that has some influence on Google's calculation.
I got to thinking, how would America's universities compare? It's an interesting thought: The best universities should have the most advanced reading level.
The first thing I noticed was "noobstories" being marked Advanced. It is probably because noobs are trying their best to get karma by submitting some advanced stories.
An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.
> An elevated reading level is not necessarily a sign of more thoughtful or insightful comments. It could just be contrived banality disguised as conceptual depth.
Put simply, high reading level doesn't mean good content. (trying to prove your point :D)
I like words, so I know a lot of words. But I try to take the wordiness way down. Why lose people who don't know the same words?
I mean, even if you don't care about losing people who don't have time to learn a more precise-but-obscure synonym, there's always people who learned English as a third language. Why alienate them just to look a little clever?
Even this comment has too many big words. If I had more time, I'd edit it down more. "alienate" would become "turn off" and "precise-but-obscure synonym" would become... I dunno, something simpler. Using big, obscure words that lock people out is really missing the forest for the trees.
You're only considering the negative aspect of "big words". Sure, there are many instances in which people unnecessarily use lesser-known synonyms, but it is helpful when people use a rich vocabulary in order to be concise. I think that while the current mantra of "don't use big words so everyone can understand" is called for because of the number of people abusing vocabulary so others will consider them intelligent, we should remain cognizant of the benefits succinct communication provides.
And it is even more important for precision, when a less common word is closer to what you want to say than a more popular one. This is one point Robert Heinlein made in Expanded Universe when he was defending Doc Smith's "space operas" and the vocabulary that was used in them. (Note that since "the map is not the territory", nothing, not your thoughts or even a physical description, can be exactly captured in words, but that doesn't mean it is not sometimes important or at least useful to try to be as precise as possible.)
This is a great comment. I've always thought of 'reading level' as a bit bogus, looking at the wrong thing. When determining the quality of content I look at it in two ways: does the content convey the idea the writer intended, and does the reader comprehend the idea the writer intended?
If I'm reading something, I would prefer it be written as simply as possible. From a writer's standpoint, though, this can be much harder than just explaining an idea in complex terms.
To paraphrase Pascal, "I apologize for the length of this letter, but I did not have time to make it shorter." I like to think length in this case meant complexity.
http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...
A brief search on wikipedia reveals a few readability tests, but they all seem to be based on sentence/syllable ratios, not content complexity.
http://en.wikipedia.org/wiki/Category:Readability_tests
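For reference, the Flesch-Kincaid grade level really is just those two ratios. The syllable counter below is a crude vowel-group approximation rather than the dictionary lookup a real implementation would use, but the grading formula itself is the standard one:

```python
def syllables(word):
    """Crude syllable estimate: count runs of consecutive vowels
    (a rough stand-in for a dictionary-based counter)."""
    word = word.lower()
    count, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def flesch_kincaid_grade(words, sentences):
    """Flesch-Kincaid grade: purely sentence-length and syllable
    ratios, completely blind to how common the words are."""
    n_words = len(words)
    n_syll = sum(syllables(w) for w in words)
    return 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
```

Note there is no frequency table anywhere in it, which is exactly the "not content complexity" point: rare-but-short words grade as easy, common-but-long words grade as hard.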
And in general they all rank multi-syllable (longer) words higher. Which would mean a conversation between two Java API writers would be ranked higher than a Ruby conversation :)
Java vs Ruby vs Lisp: http://i.imgur.com/tq3pA.png