I'm going to get flamed for this, but I think, even for a data startup, worrying about scaling before you have users or traction is premature, especially with how cheap hardware is becoming.
Case in point: I'm currently hacking together an inefficient, unoptimized prototype analyzing pretty large datasets on probably the worst architecture for this kind of thing known to man, and the whole thing still runs pretty well on a single $50 VPS.
Do you have full control over the amount of data your system is taking in?
The startup I founded had analytics code in a ton of iPhone applications and was handling the load just fine right up until the day it suddenly wasn't. By that point we had customers who relied on us, and we had to deal with it very quickly. Not fun. And there's certainly more to scaling than just cheap architecture. We thought EC2 would handle the overflow until we unexpectedly became completely I/O bound. Firing up a few more instances can't fix that.
If you're just running some scraper and can control what you're taking in, that's a completely different story.
You're absolutely right; I hadn't considered analytics as an example.
Some data startups I've seen, as well as my own project, take in existing data sets and simply generate reports from them for customers. That makes it a lot easier to scale.
I think there's a happy middle ground between premature optimization and naive development.
While the former shouldn't be allowed to impede one's progress toward an MVP, real customer feedback, and the potential need to adapt or pivot, neither should one ignore early optimization decisions that are inexpensive and would only minimally impede (if at all) that progress.
Being able to recognize the difference is a talent that comes with experience.