A TensorFlow implementation of Baidu's DeepSpeech architecture (github.com/mozilla)
156 points by rhakmi on Dec 3, 2017 | hide | past | favorite | 38 comments


By the way, there's currently a speech recognition competition for TensorFlow on Kaggle [1].

[1] https://www.kaggle.com/c/tensorflow-speech-recognition-chall...


That challenge seems to be more about speech command recognition (isolated words). They supply 1 second long recordings of 30 short words.

> There are only 12 possible labels for the Test set: yes, no, up, down, left, right, on, off, stop, go, silence, unknown.

This won’t work for large vocabulary continuous speech recognition, which is what you want if you want to transcribe podcasts, phone calls, or generally human-to-human, spoken interactions.


It would be awesome to see open source speech-to-text becoming a viable option.

Has anyone tested this out? Impressions of its usability?


What about Kaldi? Do you not consider that a viable option?


I made an embarrassing amount of money back when I was a consultant, just helping people get Kaldi working in their production environments or making small tweaks to the models. I really don't consider it viable unless you're in the field, and it's especially not viable in a production environment.


That’s a shame to hear, I’ve always had high hopes for it.

What makes it less suitable for production compared to TF?

Is it just the Kafkaesque build process (admittedly I last tried a year ago), or is model-fitting or prediction especially buggy?


Totally agree on that. For example, imho it is getting a bit ridiculous that Linux has a driver for just about any text-input device out there, except speech.


Speech input is not text input. I don't understand what you mean.


He means speaking commands (to a shell, I presume) rather than typing them.


I haven't used it, but I think I read that they have 6-10% error rate?


Thanks so much for this! The effort to package this model must have been significant. As awesome as it is to see source code implementations, being able to pip install a pretrained model is even better. I hope others emulate this!
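
One gotcha when feeding it your own recordings: if I recall the README correctly, the pretrained model expects 16 kHz, 16-bit mono WAV. A stdlib-only sketch for sanity-checking and reading that format (the helper name is my own, not part of the package):

```python
import struct
import wave

def read_pcm16(path: str) -> list[int]:
    """Read a WAV file as a list of samples, checking it matches the
    format the pretrained model expects (16 kHz, 16-bit, mono; per my
    reading of the README). Resample elsewhere if it doesn't match."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "resample to 16 kHz first"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        assert w.getnchannels() == 1, "expected mono audio"
        raw = w.readframes(w.getnframes())
    # 16-bit little-endian signed samples
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```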


I wrote a VAMP plugin that can be used in Audacity to run DeepSpeech on selected ranges of audio: https://github.com/Mortal/vampdeepspeech


1997: Nobody believes the NSA is listening to phone calls. Netscape is rich. China barely has the internet.

2017: Kids can run real time E1+ voice transcription systems made exclusively of free software on commodity gaming hardware. Bizarrely, the dominant implementation is based upon the "free" browser community Mozilla, based upon work released by a "don't be evil" global megacorporation, but they are reduced to imitating China to get there.


I’m tempted to just ignore this troll but this is highly uninformed. There’s nothing “dominant” about this implementation or the DeepSpeech architecture in general. And just to address the Sinophobia at the end of your post: the Deep Speech papers were published by Baidu’s Silicon Valley lab, not “China.”


Haha, what the hell? How did you interpret the comment like that? It was a jovial reference to how times change. For the record, I like China, I live here, I've spent most of my adult life here! Chill out :)

PS. With respect to "There’s nothing “dominant” about this implementation or the DeepSpeech architecture in general." the use of dominant was really poetic license in support of the line of amusement, but in fact I'm not aware of a more popular open source transcription system by Github stars ... are you? As for China != Baidu SV, I actually find it even more bizarre that they would develop such algorithms in such a foreign environment.


I don't understand this "China == Evil" attitude on this website.


It might have to do with their one party political system, heavy censorship, thousands of executions each year, and persecution of reporters and political opposition.


This happens on a smaller scale in the US as well, not to mention the flagrant wars waged on other countries as well as brutalizing their own black and Hispanic population through the justice system and underpaid labor, yet people here don't generally paint a broad brush to claim the US is evil. They understand it's a complex place with nuance. The same nuance should be afforded to China.


I don't live in the US and I don't pretend that it's a perfect place either, but there's still a big difference between the two countries.


> there's still a big difference between the two countries

yes indeed, like a few thousand kilometers.


on the weekend, HN is overrun with nationalist/conservative trolls.


This doesn't help. Please follow the guidelines and flag comments you think break them, and don't comment that you did.

https://news.ycombinator.com/newsguidelines.html


Was I supposed to flag the parent comment? Perhaps instead of attacking well-meaning people for observing problems, you should focus on solving the problem.


If you go to a comment's page by clicking on the timestamp (e.g. https://news.ycombinator.com/item?id=15837946) you should see a “flag” link.


They are more and more all over. Look at Reddit.


1997: Apple was nearly bankrupt:

https://www.wired.com/2009/08/dayintech_0806/

Not everything changes. 1997 - 2017: Microsoft has 90% desktop market share, but the Year of Linux on the desktop was fast approaching.

Linux not gaining at least some traction was a huge disappointment.


I do not really see how a system kernel has any bearing on consumer desktop environments. Meanwhile, open-source web browsers surpassed IE, Android is nominally open-source and so far runs on a heavily modified Linux ...


I ran into some minor glitches trying to install and use DeepSpeech a couple of days ago. I’m sure they’ll be fixed soon enough but meanwhile hope this helps: https://www.phpied.com/taking-mozillas-deepspeech-for-a-spin...


It only works on "short" audio clips, about 5 seconds or so. (We should have documented this better, but I just put in a PR adding this to the documentation.)

However, you can use voice activity detection (VAD), for example webrtcvad from PyPI, to chop long audio into smaller bits that it can digest.

Maybe we should just put VAD in the client and have this occur automatically?
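
If anyone wants to try the chopping approach in the meantime without pulling in webrtcvad, here's a crude energy-based splitter I sketched (the function name and the threshold are made up, and a real VAD will do much better):

```python
import struct

def split_on_silence(pcm: bytes, sample_rate: int = 16000,
                     frame_ms: int = 30, threshold: int = 500) -> list[bytes]:
    """Split 16-bit mono PCM into chunks at low-energy (silent) frames.

    A crude stand-in for a real VAD such as webrtcvad; `threshold`
    is an arbitrary peak-amplitude cutoff you'd tune for your audio.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    chunks, current = [], bytearray()
    for i in range(0, len(pcm), frame_bytes):
        frame = pcm[i:i + frame_bytes]
        n = len(frame) // 2
        samples = struct.unpack(f"<{n}h", frame[:n * 2])
        peak = max((abs(s) for s in samples), default=0)
        if peak < threshold:  # silent frame: close the current chunk
            if current:
                chunks.append(bytes(current))
                current = bytearray()
        else:
            current.extend(frame)
    if current:
        chunks.append(bytes(current))
    return chunks
```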


Personally, I'd love to see that as part of the client.

Just this week I started looking into how I could generate transcripts for a bunch of videos. Even if the transcripts aren't perfect, it helps with tagging and searching through large video collections that include certain keywords.

Sadly, I didn't have any luck with local solutions. I managed to generate a few transcripts using GCP's Cloud Speech API with minimal hassle, but I'd much prefer to do it locally.

I was planning on trying this out later today, and had already downloaded the Common Voice corpus. Having to add another step to break up the input into smaller chunks probably isn't a huge deal, but I wouldn't have known what tool to use in order to achieve that.

Do you know of any comparisons between various speech-to-text tools? I've avoided commercial tools so far because I'm hesitant to drop $250+ just for playing around, but I'd be interested in seeing if they're truly superior to existing open alternatives.


Added an enhancement request, issue 1064[1], to GitHub, asking for the clients to support longer audio clips.

I can't promise when we'll get to it, as from now until the new year is a bit of a wash.

I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article[2] does a comparison of several engines as of circa 2015.

[1] https://github.com/mozilla/DeepSpeech/issues/1064

[2] https://arxiv.org/abs/1412.5567
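
For anyone comparing engines themselves: word error rate is just word-level edit distance divided by reference length, so it's easy to compute locally. A quick sketch (nothing DeepSpeech-specific, and no normalization such as lowercasing or punctuation stripping, which real benchmarks usually apply):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, using a single rolling DP row.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (old d[j-1])
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution / match
    return d[len(hyp)] / len(ref)
```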


Thanks, didn’t know about the 5 seconds. If this chopping tool can generate a map to the original audio, that means TED-style subtitles are going to be possible.

And thanks for your hard work, Mozilla peeps!
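
Right, if the splitter records each chunk's start/end offset in the original file, emitting SRT cues is trivial. A hypothetical helper (the timestamp syntax is standard SRT; the function names are made up):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue: index line, time range line, transcript text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```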


out of interest, do you also work on a reverse solution, text-to-speech? Most open source engines sadly still can't compete with commercial alternatives.


Maybe Tacotron will interest you? It's an end-to-end model that's reasonably close to the state of the art:

https://google.github.io/tacotron/publications/tacotron/inde...

There are some open source implementations.


Thank you so much for this link, that is the best text-to-speech with an open architecture I've ever heard 'til now. Under https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it isn't matching the quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.

Edit: Another interesting one: http://research.baidu.com/deep-voice-3-2000-speaker-neural-t...


that does look pretty good, thanks!


Any plans on making it work with longer audio?


We're working on it! :-)

I don't have an ETA, but it's in the works.



