
I've often wondered how a database such as this could be used in other areas of programming, say a text-to-speech engine[1] where the algorithm uses subtitles to guess the context of the conversation and produce better results.

[1] http://www.slate.com/articles/technology/technology/2009/03/...



I actually worked on this exact problem as an intern at our university. We used a huge corpus of communication (for example, we had access to every email ever sent internally at Enron).

We used this as the basis to train a speech-to-text engine by automatically correcting likely-wrong interpretations. "I go loo school" would be corrected to "I go to school", for example. It worked remarkably well.
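A common way to do this kind of correction (not necessarily what that project used) is to score candidate transcriptions with a language model trained on the corpus and keep the most probable one. Here's a minimal sketch using a smoothed bigram model; the tiny training corpus and candidate sentences are made up for illustration:

```python
# Sketch: pick the most likely transcription using an add-alpha
# smoothed bigram language model. Toy corpus for illustration only.
from collections import Counter
import math

corpus = [
    "i go to school every day",
    "we go to the office",
    "i walk to school",
    "they go to work",
]

# Count unigrams and bigrams over the training sentences.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence, alpha=0.1):
    """Log-probability of a sentence under the smoothed bigram model."""
    words = sentence.split()
    vocab = len(unigrams)
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * vocab
        total += math.log(num / den)
    return total

def best_candidate(candidates):
    """Return the candidate transcription the model finds most likely."""
    return max(candidates, key=score)

print(best_candidate(["i go loo school", "i go to school"]))
# prints "i go to school"
```

In a real system the corpus would be millions of sentences and the candidates would come from the recognizer's lattice, but the principle is the same: "go to" is far more frequent than "go loo", so the correct reading wins.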

A subtitle corpus like this one could be used the same way, but there are far bigger (and better?) collections of data available for training these machine learning engines.


Could you recommend any of these data collections that are open to the public?


This is very likely the Enron corpus that was used: https://www.cs.cmu.edu/~./enron/


I can confirm that this is the corpus. I can also confirm that, even though the emails are all from mid-to-senior management, the writing style is very sloppy.



