I see that the data sets that interest me the most aren't as useful as I would like, because they are out of date. The Genbank release is "Last Updated: December 9, 2009 2:49 AM GMT", and PubChem is "Last Modified: Jun 4, 2009 20:21 PM GMT". Both of these data sets are modified continuously upstream.
This is great. I hope some novel apps, or at the very least the germs of some cross-field research, spring from this. I wonder how you can go about encouraging data providers to update the data, however? Obviously they were convinced once...
You are correct: to use a public data set, you have to create an EBS volume from the corresponding snapshot and attach it to an instance. There is currently no option for read-only access.
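For anyone who wants to see those two steps spelled out, here is a rough sketch, assuming the boto3 EC2 client (my choice of library, not something mentioned in this thread). The function name `attach_public_dataset`, the default device name, and all the IDs in the usage comment are hypothetical placeholders; look up the real snapshot ID on the data set's page.

```python
def attach_public_dataset(ec2, snapshot_id, instance_id, availability_zone,
                          device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is expected to behave like a boto3 EC2 client. The volume must be
    created in the same availability zone as the instance it is attached to.
    """
    # Step 1: create your own volume from the public snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    volume_id = volume["VolumeId"]

    # Wait until the new volume is ready before trying to attach it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 2: attach the volume to the running instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Hypothetical usage (placeholder IDs, requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   attach_public_dataset(ec2, "snap-xxxxxxxx", "i-xxxxxxxx", "us-east-1a")
```

After attaching, you still have to mount the device inside the instance. Keep in mind that the volume is yours, so it is billed as ordinary EBS storage until you delete it.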
The data is stored (free of charge) via EBS (look at the EC2 instance), which persists to S3 but is not visible in or directly usable from your own S3 storage. If you decide to transfer the data or run computations (e.g. via EMR), you'll then pay for the resources used.
I didn't find the documentation clear enough to use the public data sets efficiently, which had financial consequences.
If anyone is adept at using the public data sets, I'd love to speak with you.
WTF? I had assumed that it was a simple sort of file access which allowed anyone in EC2 to read the data without having to import all of the storage. Then again, PubChem is only about 25 GB and inbound data transfer is free, so this is only about US$4/month.
"AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications."
Oh. I searched for "open street map" and didn't get any results. My bad. Though as others have pointed out, the data is quite old. OSM has probably at least doubled in size since then.
On the plus side, Ensembl is up-to-date.