1. Someone named "mistercow" is obviously biased on this question.
2. Another item to factor in, beyond lookup table size, is that questions are written in English, and the medium (spelled words) should not be snuck in as part of the content, except when it is done intentionally :)
Also, in classification problems like these, one should consider not just how efficiently a solution chooses an answer from the set, but also how cleanly it isolates the cluster of items in question from the unnamed items. That is, since [cow,hen,pig,sheep] are all animals, more so than a random word is, animality should be part of the rule used to choose among them.
3. As blauwbilgorgel notes, Google's very successful solution to model picking is to slurp up everything published publicly online (and with Books, also many things published offline) and sample over the combined output of humanity. This is still biased towards written text, published text, and loquaciousness, but it's pretty good.
>Another item to factor in, beyond lookup table size, is that questions are written in English, and the medium (spelled words) should not be snuck in as part of the content, except when it is done intentionally :)
That just strengthens my point. The word-length hypothesis doesn't require any information about English. If we change our assumptions to say that the program is being fed the raw visual stimuli (as a human is), then the word-length hypothesis gets even stronger, since it merely involves comparing the widths of the stimuli.
But most of the information about the medium can be ignored when comparing hypotheses because it is constant across them.
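To make the word-length hypothesis concrete, here is a minimal sketch (not from the original discussion; the function name and word list are illustrative) of a rule that picks the odd item out using only the medium, i.e. the spelled length of each word, with no knowledge of English or of meaning:

```python
from collections import Counter

def odd_one_out_by_length(words):
    """Pick the word whose spelled length differs from the majority.

    This implements the 'word-length hypothesis': the rule inspects
    only the medium (letter counts), never the content or meaning.
    """
    lengths = Counter(len(w) for w in words)
    # The most common length among the candidates; any word that
    # deviates from it is the "odd" one under this hypothesis.
    common_len, _ = lengths.most_common(1)[0]
    for w in words:
        if len(w) != common_len:
            return w
    return None

print(odd_one_out_by_length(["cow", "hen", "pig", "sheep"]))  # sheep
```

Note that the rule happens to give the "right" answer here without any notion of animality, which is exactly the concern raised in point 2 above.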