Full-Blown Open Source Speech Processing Server Available on GitHub


Florian Treml

Botium Speech Processing combines the best Open Source speech processing tools in a single service and makes them accessible through an HTTP/JSON API.

First release includes:

Botium Speech Processing Architecture
> git clone https://github.com/codeforequity-at/botium-speech-processing.git
> cd botium-speech-processing
> docker-compose up -d

A simple HTTP/JSON API is available in Swagger UI.

Swagger UI for Botium Speech Processing

Some time ago we were asked to design a test automation solution for a voice app published as an Alexa Skill. Our first approach was to make audio recordings of typical user queries and use them to trigger the voice app. It soon became clear that recording audio by hand is not feasible for a non-trivial app, and that we would never reach satisfying test coverage this way. In addition, we had to find a way to transcribe the voice app's responses to text for our automated assertions.

Cloud services such as Amazon Polly and the Google Cloud Speech API could have helped us auto-generate the user utterances, but we were under strict orders not to use any cloud-based solutions.

Totally new to voice processing, we looked for an Open Source stack that would help us convert speech to text and text to speech. We expected to find tools that we could easily integrate into our test scripts: command line tools, or backend services with standards-based APIs.

For speech synthesis we quickly found that the Open Source software MaryTTS would do the job, and it took us several days to pack it into a Docker image ready for deployment in our systems.

For speech recognition we were pointed to Kaldi, as some benchmarks rate it as the best freely available tool for this purpose. It was a real pain to compile and set up, and when it finally worked we realized that we first had to train our own model with it. Kaldi is not a ready-to-use system: it contains the algorithms, not the data, a concept that was new to us at the time. Eventually we found a German recipe for Kaldi (kaldi-tuda-de), but first we had to clarify with the legal department whether we were allowed to use the data for our purposes. Then we realized that it would take ages to train the model on in-house servers without GPU support. When we finally managed to complete training our own model with Kaldi after several weeks, the next question was: how can we actually use it from our test scripts? Kaldi itself doesn't include an API, just an SDK. So we found another project (Kaldi GStreamer Server) that makes Kaldi available as an HTTP/JSON service, and it again took us several days to set it up. Then the question was: how can we just talk to Kaldi for testing purposes? We had no user interface for doing this, so we had to install an additional tool …

You see where this is heading: it took us several months to get all of this up and running, and a project like Botium Speech Processing would have saved us tons of work. After all, we just wanted to convert text to speech and speech to text with medium quality… there was no need to become experts in speech synthesis and speech recognition software.

All the knowledge we gathered on this journey is now available on GitHub. It gives you a quick start if you need Speech-To-Text or Text-To-Speech in your applications, available through a unified, clear and simple HTTP/JSON API:

  • HTTP POST to /api/stt/{language} for Speech-To-Text
> curl -X POST "http://127.0.0.1/api/stt/en" -H "Content-Type: audio/wav" -T sample.wav
  • HTTP GET to /api/tts/{language}?text=… for Text-To-Speech
> curl -X GET "http://127.0.0.1/api/tts/en?text=hello%20world" -o tts.wav
  • HTTP POST to /api/convert/{profile} for audio file conversion
> curl -X POST "http://127.0.0.1/api/convert/mp3tomonowav" -H "Content-Type: audio/mp3" -T sample.mp3 -o sample.wav
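
For use from a test script, the three calls above could be wrapped roughly as follows. This is only a minimal sketch using the Python standard library: the helper names (`tts_request`, `stt_request`, `convert_request`, `send`, `round_trip`) and the `BASE_URL` constant are made up for illustration, the host is assumed to be the same as in the curl examples, and since the exact response format of the Speech-To-Text endpoint is not described here, the response body is returned as raw text.

```python
import urllib.parse
import urllib.request

# Assumption: the service is reachable at the same address as in the curl examples.
BASE_URL = "http://127.0.0.1"


def tts_request(text, language="en", base_url=BASE_URL):
    """Build the GET request for /api/tts/{language}?text=..."""
    # quote_via=quote encodes spaces as %20, matching the curl example.
    query = urllib.parse.urlencode({"text": text}, quote_via=urllib.parse.quote)
    return urllib.request.Request(f"{base_url}/api/tts/{language}?{query}", method="GET")


def stt_request(wav_bytes, language="en", base_url=BASE_URL):
    """Build the POST request for /api/stt/{language} with a WAV body."""
    return urllib.request.Request(
        f"{base_url}/api/stt/{language}",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def convert_request(audio_bytes, profile, content_type, base_url=BASE_URL):
    """Build the POST request for /api/convert/{profile}."""
    return urllib.request.Request(
        f"{base_url}/api/convert/{profile}",
        data=audio_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def round_trip(text, language="en"):
    """Synthesize text to audio, then transcribe the audio again.

    Requires a running server; the transcription is returned undecoded
    apart from UTF-8, because its structure is not assumed here.
    """
    wav_bytes = send(tts_request(text, language))
    return send(stt_request(wav_bytes, language)).decode("utf-8")
```

With the Docker containers running, calling `round_trip("hello world")` should send the synthesized audio straight back into the recognizer, which is a quick way to check that both directions work before wiring the helpers into real test cases.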

Botium Speech Processing is a pre-configured Speech-To-Text and Text-To-Speech service with a simple, clean and beautiful API.

Botium Speech Processing powers the Botium connector for testing Alexa Skills with Botium, the Selenium for Chatbots.

It is part of Botium Box as well — find instructions in the Botium Wiki.

Please give it a try and tell us what you think. We are always open to suggestions and contributions.


