
I've often wondered how a database such as this could be used in other areas of programming, say a text-to-speech engine[1] where the algorithm uses subtitles to guess the context of the conversation and produce better results.

[1] http://www.slate.com/articles/technology/technology/2009/03/...



I actually worked on this exact problem as an intern at our university. We used a huge corpus of communication (for example, we had access to every email ever sent internally at Enron).

We used this as the basis to train a speech-to-text engine by automatically correcting likely-wrong interpretations. "I go loo school" would be corrected to "I go to school", for example. It worked remarkably well.
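A common way to do this kind of correction (not necessarily what that project used) is to score candidate transcriptions with a language model trained on the corpus and keep the most probable one. Here's a minimal sketch using a smoothed bigram model; the tiny training corpus and candidate sentences are made up for illustration:

```python
# Sketch: pick the most likely transcription using an add-alpha
# smoothed bigram language model. Toy corpus for illustration only.
from collections import Counter
import math

corpus = [
    "i go to school every day",
    "we go to the office",
    "i walk to school",
    "they go to work",
]

# Count unigrams and bigrams over the training sentences.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence, alpha=0.1):
    """Log-probability of a sentence under the smoothed bigram model."""
    words = sentence.split()
    vocab = len(unigrams)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * vocab
        total += math.log(num / den)
    return total

def best_candidate(candidates):
    """Return the candidate transcription the model finds most likely."""
    return max(candidates, key=score)

print(best_candidate(["i go loo school", "i go to school"]))
# prints "i go to school"
```

In a real system the corpus would be millions of sentences and the candidates would come from the recognizer's lattice, but the principle is the same: "go to" is far more frequent than "go loo", so the correct reading wins.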

A subtitle corpus like this one could be used the same way, but there are far bigger (and better?) collections of data available for training these machine learning engines.


Could you recommend any of these data collections that are open to the public?


This is very likely the Enron corpus that was used: https://www.cs.cmu.edu/~./enron/


I can confirm that this is the corpus. I can also confirm that, even though the emails are all from mid-to-senior management, the writing style is very sloppy.



