Public datasets on AWS (amazon.com)
99 points by fs111 on Jan 14, 2012 | 17 comments


I see that the data sets which interest me the most aren't as useful as I would like, because they are out of date. The Genbank release is "Last Updated: December 9, 2009 2:49 AM GMT", and PubChem is "Last Modified: Jun 4, 2009 20:21 PM GMT". Both of these datasets are modified continuously.

On the plus side, Ensembl is up-to-date.


This is great! Public datasets can do such amazing things for people.

Do not want to derail this, but for more datasets (and a very easy way to use them) check out Yahoo's YQL - http://developer.yahoo.com/yql/console/


Some of these are just notional and don't exist; e.g. "petroleum dataset" https://aws.amazon.com/datasets/2900


This is great. I hope some novel apps, or at the very least the germs of some cross-field research, spring from this. I wonder how you can go about encouraging data providers to update the data, though? Obviously they were convinced once.


Geocoding without a restrictive API, thanks to Twilio/Wigle.net street vector data set: https://aws.amazon.com/datasets/2408


Definitely useful. In the past, I have fetched DBPedia manually and put it on an EBS volume to process - now I can save a little money.


If you use the data, how will you be billed? PostGIS data can be huge.


Quoting from a forum thread:

https://forums.aws.amazon.com/thread.jspa?threadID=59896&...

So as I understand it in order to be able to access a public dataset I create a new EBS volume and attach it to a given instance for use.

So if a dataset were 200GB in size I'd be charged against my storage usage for 200GB / month.

Is this correct?

Is there a way to do a read-only access to the data from within an instance that doesn't count against my storage usage?

------------------------------------------------------------

Here is an answer from Amazon support, April 2011:

------------------------------------------------------------

Hello,

You are correct, in order to use a public data set you will have to create an EBS volume with the corresponding snapshot and attach it to an instance. There is currently no option for read-only access.

Sincerely,

------------------------------------------------------------

I've played around with some of the datasets and it ended up being fairly costly....
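
For anyone who wants to script the step the support reply describes (create an EBS volume from the dataset's snapshot, then attach it to an instance), here is a minimal sketch. It assumes the boto3 library and an EC2 client object; the function name, the default device path, and every ID in the usage comment are placeholders I made up for illustration, not values from the thread or from any dataset page.

```python
def attach_public_dataset(ec2, snapshot_id, availability_zone, instance_id,
                          device="/dev/sdf"):
    """Create an EBS volume from a public snapshot and attach it to an instance.

    `ec2` is assumed to behave like a boto3 EC2 client. Returns the ID of the
    new volume. The volume is billed as ordinary EBS storage while it exists.
    """
    # The volume has to be created in the same Availability Zone as the instance.
    volume = ec2.create_volume(SnapshotId=snapshot_id,
                               AvailabilityZone=availability_zone)
    volume_id = volume["VolumeId"]

    # Block until the volume has left the "creating" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume; it still has to be mounted from inside the instance.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Hypothetical usage (all IDs are placeholders, not real resources):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   attach_public_dataset(ec2, "snap-xxxxxxxx", "us-east-1a", "i-xxxxxxxx")
```

Deleting the volume once the processing is done stops the storage charge, which is probably the simplest way to keep the bill down.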


I did some work with the public data sets.

The data is stored (free of charge) via EBS (look at the EC2 instance), which persists to S3 but is not visible in or directly usable from your S3 directory. If you decide to transfer the data or run computations (e.g. via EMR), you'll then pay for the resources used.

I didn't find the documentation clear enough to use the public data sets efficiently, which had financial consequences.

If anyone is adept at using the public data sets, I'd love to speak with you.


WTF? I had assumed that it was a simple sort of file access which allowed anyone in EC2 to read the data without having to import all of the storage. Then again, PubChem is only about 25 GB and inbound data transfer is free, so this is only about US$4/month.
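
A rough way to estimate the storage side of the bill for a dataset copied onto your own EBS volume is size times the per-GB monthly rate. The sketch below is only that arithmetic; the function name is made up, and the rate in the example call is a placeholder assumption, not a quoted AWS price, so plug in whatever the current pricing page says. It ignores I/O and compute charges.

```python
def monthly_ebs_storage_cost(size_gb, price_per_gb_month):
    """Return the monthly storage charge for a volume of `size_gb` gigabytes.

    `price_per_gb_month` is the per-GB monthly storage rate in your currency.
    Only storage is counted; I/O requests and instance hours are not.
    """
    if size_gb < 0 or price_per_gb_month < 0:
        raise ValueError("size and price must be non-negative")
    return size_gb * price_per_gb_month


# Example with an ASSUMED placeholder rate of 0.10 per GB-month:
ASSUMED_RATE = 0.10
for name, size_gb in [("200 GB dataset", 200), ("25 GB dataset", 25)]:
    cost = monthly_ebs_storage_cost(size_gb, ASSUMED_RATE)
    print(f"{name}: about {cost:.2f} per month at the assumed rate")
```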


From the link:

"AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications."


Shame they haven't added OpenStreetMap data dumps.



Oh. I searched for "open street map" and didn't get any results. My bad. Though as others have pointed out, the data is quite old. OSM has probably at least doubled in size since then.


Last updated in October 2009 ...


Maybe you can contact the person that submitted it: https://forums.aws.amazon.com/profile.jspa?userID=89792


Why is this advertising spam link upvoted in the news list? Can't you use a search engine? Any admins here?



