The article makes some strong claims about the statistical validity of its results. However, 1,000,000 observations are not always better than 3,000. If the data are not representative, then a Gallup poll (which is) with a far smaller sample is much more powerful.
Counterintuitively, if the sample is truly random, you gain very little additional information once you go beyond roughly 1,000 samples (300 already gets you to within about +/- 6%). This is why most political polls, which typically survey about 1,000 people, report an error margin of about +/- 3%.
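To put rough numbers on that, here is a minimal sketch of the usual 95% margin-of-error formula for a proportion. It assumes a simple random sample and the worst case p = 0.5; the function name `margin_of_error` is just mine for illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n (an idealization).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare the sample sizes mentioned in this thread.
for n in (300, 1_000, 3_000, 1_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n) * 100:.2f}%")
```

Under those assumptions this comes out to roughly 5.7%, 3.1%, 1.8% and 0.1%, so the returns diminish quickly: each halving of the margin costs four times the sample.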
Right, but that doesn't mean that 300 (or 3000) samples total is enough. You can't make the detailed map about burning the national flag with 3000 samples. More data is helpful until you have 300 samples per pixel.
The real problem is that most samples are not random. So you are bound by the bias of your methods and can't really get all that accurate. In theory, doubling your sample size shrinks your margin of error by a factor of about 1.4 (the square root of 2), but reality does not match the theory until you start sampling a large percentage of the population.
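A rough way to see the bias floor: total error is approximately the square root of (bias squared plus sampling variance), and only the second term shrinks with n. The sketch below assumes a made-up 5-point selection bias and p = 0.5; both numbers and the name `rmse` are illustrative, not taken from the article.

```python
import math

def rmse(n, bias, p=0.5):
    """Approximate root-mean-square error of a sample proportion.

    bias is a fixed selection bias that does not depend on n;
    the sampling variance p * (1 - p) / n is the only part that shrinks.
    """
    sampling_var = p * (1 - p) / n
    return math.sqrt(bias ** 2 + sampling_var)

BIAS = 0.05  # hypothetical 5-point selection bias, purely illustrative

for n in (1_000, 10_000, 1_000_000):
    print(f"n={n:>9,}: biased {rmse(n, BIAS):.4f}  unbiased {rmse(n, 0.0):.4f}")

# A small unbiased sample versus a huge biased one.
print(f"unbiased n=3,000:   {rmse(3_000, 0.0):.4f}")
print(f"biased n=1,000,000: {rmse(1_000_000, BIAS):.4f}")
```

With these assumed numbers the biased million-sample estimate stays stuck near 0.05, while the unbiased 3,000-sample estimate is under 0.01.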
Think of it like a coin that has a 1% bias. You want the percentage to some accuracy (say 4 digits); how many flips do you need? Now, what if the problem is not the coin but the person doing the flipping? At some point, more testers help more than more flips.
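For a ballpark answer to the flips question, here is a sketch that inverts the normal-approximation margin of error. I am reading "4 digits" as a 95% interval of +/- 0.0001 on the proportion and "1% bias" as p = 0.51; both readings are assumptions, and `flips_needed` is a hypothetical helper name.

```python
import math

def flips_needed(half_width, p=0.51, z=1.96):
    """Flips needed so the 95% interval on the heads proportion is +/- half_width.

    Uses the normal approximation n = z^2 * p * (1 - p) / half_width^2
    and assumes independent flips with no flipper-induced bias.
    """
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

print(f"+/- 0.01:   {flips_needed(0.01):,} flips")
print(f"+/- 0.0001: {flips_needed(1e-4):,} flips")
```

Under those assumptions four-digit accuracy takes on the order of 96 million flips, and that only covers the coin. It says nothing about the person doing the flipping.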
And a word about statistical validity: the best questions on OkCupid have been answered over a million times. Therefore we have unique insights into the American mindset.
Yeah, so OKCupid users aren't representative of the average American, but somehow I don't think a post titled "Rape Fantasies and Hygiene By State" is meant to be a serious exercise in statistics.
Whether or not the data are statistically valid for the general populace may not matter to people who only care about the population that is actually sampled.
Arguing that OKC data is better than Gallup's (as the article implies) isn't a strong claim of statistical validity, it's ignorance of the basic principles of statistics.