That challenge seems to be more about speech command recognition (isolated words). They supply 1 second long recordings of 30 short words.
> There are only 12 possible labels for the Test set: yes, no, up, down, left, right, on, off, stop, go, silence, unknown.
This won’t work for large-vocabulary continuous speech recognition, which is what you need if you want to transcribe podcasts, phone calls, or human-to-human spoken interaction in general.
I made an embarrassing amount of money back when I was a consultant just helping people get Kaldi working in their production environments or making small tweaks to their models. I really don't consider it viable unless you're in the field, and it's especially not viable in a production environment.
Totally agree on that. For example, imho it is getting a bit ridiculous that Linux has a driver for just about any text-input device out there, except speech.
Thanks so much for this! The effort to package this model must have been significant. As awesome as it is to see source code implementations, being able to pip install a pretrained model is even better. I hope others emulate this!
1997: Nobody believes the NSA is listening to phone calls. Netscape is rich. China barely has the internet.
2017: Kids can run real time E1+ voice transcription systems made exclusively of free software on commodity gaming hardware. Bizarrely, the dominant implementation comes from the "free" browser community Mozilla, building upon work released by a "don't be evil" global megacorporation, but they are reduced to imitating China to get there.
I’m tempted to just ignore this troll but this is highly uninformed. There’s nothing “dominant” about this implementation or the DeepSpeech architecture in general. And just to address the Sinophobia at the end of your post: the Deep Speech papers were published by Baidu’s Silicon Valley lab, not “China.”
Haha, what the hell? How did you interpret the comment like that? It was a jovial reference to how times change. For the record, I like China, I live here, I've spent most of my adult life here! Chill out :)
PS. With respect to "There’s nothing “dominant” about this implementation or the DeepSpeech architecture in general." the use of dominant was really poetic license in support of the line of amusement, but in fact I'm not aware of a more popular open source transcription system by Github stars ... are you? As for China != Baidu SV, I actually find it even more bizarre that they would develop such algorithms in such a foreign environment.
It might have to do with their one party political system, heavy censorship, thousands of executions each year, and persecution of reporters and political opposition.
This happens on a smaller scale in the US as well, not to mention the flagrant wars waged on other countries and the brutalizing of its own black and Hispanic population through the justice system and underpaid labor, yet people here don't generally paint with a broad brush to claim the US is evil. They understand it's a complex place with nuance. The same nuance should be afforded to China.
Was I supposed to flag the parent comment? Perhaps instead of attacking well-meaning people for observing problems, you should focus on solving the problem.
I do not really see how a system kernel has any bearing on consumer desktop environments. Meanwhile, open-source web browsers surpassed IE, and Android is nominally open-source and so far runs on a heavily modified Linux ...
It only works on "short", about 5 seconds or so, audio clips. (We should have documented this better, but I just put in a PR adding this to the documentation.)
However, you can use voice activity detection (VAD), for example webrtcvad from PyPI, to chop long audio into smaller bits that it can digest.
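To illustrate the idea (this is a crude energy-based sketch, not the actual webrtcvad API; the frame size, threshold, and PCM assumptions are all mine), a chunker over raw 16-bit mono PCM could look something like:

```python
import struct

def frames(pcm, frame_bytes):
    # Split raw PCM bytes into fixed-size frames (e.g. 30 ms each).
    return [pcm[i:i + frame_bytes]
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]

def is_speech(frame, threshold=500):
    # Crude voice-activity test: mean absolute 16-bit sample amplitude.
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold

def chop(pcm, sample_rate=16000, frame_ms=30, threshold=500):
    """Group consecutive speech frames into chunks, dropping silence."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    chunks, current = [], b""
    for f in frames(pcm, frame_bytes):
        if is_speech(f, threshold):
            current += f
        elif current:
            chunks.append(current)
            current = b""
    if current:
        chunks.append(current)
    return chunks
```

A real VAD like webrtcvad is far more robust than an energy threshold (it handles noise floors, etc.), but the chunk-grouping logic around it would look much the same.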
Maybe we should just put VAD in the client and have this occur automatically?
Personally, I'd love to see that as part of the client.
Just this week I started looking into how I could generate transcripts for a bunch of videos. Even if the transcripts aren't perfect, it helps with tagging and searching through large video collections that include certain keywords.
Sadly, I didn't have any luck with local solutions. I managed to generate a few transcripts using GCP's Cloud Speech API with minimal hassle, but I'd much prefer to do it locally.
I was planning on trying this out later today, and had already downloaded the Common Voice corpus. Having to add another step to break up the input into smaller chunks probably isn't a huge deal, but I wouldn't have known what tool to use in order to achieve that.
Do you know of any comparisons between various speech-to-text tools? I've avoided commercial tools so far because I'm hesitant to drop $250+ just for playing around, but I'd be interested in seeing if they're truly superior to existing open alternatives.
Added an enhancement request, issue 1064[1], to github, asking for the clients to support longer audio clips.
I can't promise when we'll get to it, as from now until new year is a bit of a wash.
I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article[2] does a comparison of several engines as of circa 2015.
Thanks, didn’t know about the 5 seconds. If this chopping tool can generate a map back to the original audio, TED-style subtitles should be possible.
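Right — if the chopper records each chunk's offset, start/end times fall out directly (byte offset divided by bytes per second), and formatting them as SRT subtitles is trivial. A hypothetical sketch (function names and the tuple format are my own):

```python
def srt_timestamp(seconds):
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def to_srt(segments):
    # segments: list of (start_sec, end_sec, transcript) tuples,
    # e.g. one tuple per VAD chunk after transcription.
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append("%d\n%s --> %s\n%s\n"
                      % (i, srt_timestamp(start), srt_timestamp(end), text))
    return "\n".join(blocks)
```

Feed each chunk through the recognizer, pair the transcript with the chunk's time range, and the output drops straight into any player that reads .srt files.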
Out of interest, do you also work on a reverse solution, text-to-speech? Most open-source engines sadly still can't compete with commercial alternatives.
Thank you so much for this link; that is the best open-architecture text-to-speech I've heard so far. Under https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it doesn't match that quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.
[1] https://www.kaggle.com/c/tensorflow-speech-recognition-chall...