Continuous Speech to Text with Microsoft Azure Cognitive Services Speech SDK and Angular 8



Recently, a requirement came up to add speech-to-text conversion to our Angular application. Our analysts visit client sites, and they wanted to dictate their review(s) of a client meeting directly into an input field, rather than having to log in to the app and then upload a .doc/.pdf file. The requirement sounds pretty simple on the surface. Easier said than done!

The requirement boiled down to two things:

1. Have a rich text box input field with a ‘mic’ icon that the user clicks to start dictation

2. Use the Microsoft Speech SDK to convert the speech to text and write the text into the rich text box as the user speaks (dictates) the review into the microphone

The application architecture that we have is roughly as follows:

1. Angular 8 UI

2. Microservices API layer with microservices for purposes like Cognitive services, Elastic search services etc.

So, logically, the speech-to-text functionality would go into the Cognitive microservice API if implemented on the server side.

Microsoft offers different flavors of speech-to-text conversion. Please go through the GitHub project for details. Coming back to the problem at hand: to baseline the implementation, a POC was in order. As we were still very new to the Speech services, it made sense to start with the ‘Quick start’ samples provided on the official Microsoft website.

Please note that to start using the Speech Cognitive Services, you need an Azure account and a Speech resource set up under your Azure subscription. Please see How to Create Speech service resource in Azure.

So we implemented a REST API call as follows:

// Requires: using Microsoft.CognitiveServices.Speech;
[HttpGet]
public async Task<string> RecognizeSpeechAsync()
{
    var message = string.Empty;
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");

    // Creates a speech recognizer.
    using (var recognizer = new SpeechRecognizer(config))
    {
        // Listens for a single utterance and returns once it ends.
        var result = await recognizer.RecognizeOnceAsync();

        // Checks result.
        if (result.Reason == ResultReason.RecognizedSpeech)
        {
            message = result.Text;
        }
        else if (result.Reason == ResultReason.NoMatch)
        {
            message = "No Match";
        }
        else if (result.Reason == ResultReason.Canceled)
        {
            var cancellation = CancellationDetails.FromResult(result);
            if (cancellation.Reason == CancellationReason.Error)
            {
                message = $"CANCELED: ErrorCode={cancellation.ErrorCode}";
                message += $"CANCELED: ErrorDetails={cancellation.ErrorDetails}";
                message += $"CANCELED: Did you update the subscription info?";
            }
        }
    }
    return message;
}

The Angular 8 UI would then call this API. So far so good.

The only caveat is that the above works well only for one-shot recognition: when the speaker says a sentence into the microphone, the API starts speech recognition and returns after a single utterance is recognized. The end of a single utterance is determined by listening for silence at the end, or until a maximum of 15 seconds of audio is processed. The task returns the recognized text as its result.
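To make the two stop conditions concrete, here is a small TypeScript sketch. It is purely illustrative: the real decision is made inside the Speech service, the function and option names are invented for this example, and the silence threshold is an assumed value (only the 15-second cap comes from the description above).

```typescript
// Illustrative only: not the service's actual algorithm.
interface EndpointingOptions {
  frameMs: number;    // duration of one audio frame
  silenceMs: number;  // trailing silence that closes an utterance (assumed value)
  maxAudioMs: number; // hard cap on processed audio, e.g. 15000
}

// Returns the index of the frame at which single-shot recognition would stop,
// or -1 if neither stop condition is reached within the given frames.
function findUtteranceEnd(isSpeech: boolean[], opts: EndpointingOptions): number {
  let heardSpeech = false;
  let silentRunMs = 0;
  for (let i = 0; i < isSpeech.length; i++) {
    const elapsedMs = (i + 1) * opts.frameMs;
    if (isSpeech[i]) {
      heardSpeech = true;
      silentRunMs = 0;
    } else if (heardSpeech) {
      // Only silence that follows speech counts towards ending the utterance.
      silentRunMs += opts.frameMs;
    }
    if (heardSpeech && silentRunMs >= opts.silenceMs) return i;
    if (elapsedMs >= opts.maxAudioMs) return i;
  }
  return -1;
}
```

Anything the speaker says after the returned frame is simply not part of the result, which is why a dictated review with natural pauses gets cut off after the first sentence.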

Since RecognizeOnceAsync() returns only a single utterance, it is suitable only for single-shot recognition such as a command or a query. For long-running, multi-utterance recognition, we need to use StartContinuousRecognitionAsync() instead. Server-side code with continuous recognition makes it hard to return results from the API to the client, since the recognizer listens continuously and emits intermediate recognition results along the way.

As our requirement was continuous speech-to-text, we decided to move away from server-side API calls and instead use the Microsoft Cognitive Services Speech SDK for JavaScript.

While the usage is straightforward, making the latest version of this speech npm package work with Angular 8 (which supports TypeScript 3.5.2 at most) requires a bit of config-file acrobatics!

As the latest Speech SDK requires TypeScript 3.7+, I had to install TypeScript 3.7.2 and make the following changes to the tsconfig.json file:

{
  "compileOnSave": false,
  "compilerOptions": {
    …
    "lib": ["es2018", "dom"],
    "skipLibCheck": true,
    "paths": {…}
  },
  "angularCompilerOptions": {
    "fullTemplateTypeCheck": true,
    "strictInjectionParameters": true,
    "disableTypeScriptVersionCheck": true
  }
}

And in the index.html file, in the head section, add the following script tag:

<script>var __importDefault = (this && this.__importDefault) || function (mod) { return (mod && mod.__esModule) ? mod : { "default": mod }; }</script>

(Don’t ask me why the script tag is necessary!! Comment out the script tag and find out for yourself the errors thrown at runtime!)
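For what it is worth, the function in that script tag has the same shape as the `__importDefault` helper that the TypeScript compiler emits for default imports when `esModuleInterop` is enabled. Why it is missing at runtime in this particular setup is not something I can state with certainty, but what the helper does is easy to show. The sketch below restates it as a typed function; the name `importDefault` and the sample modules are made up for illustration.

```typescript
// Same logic as the helper in the script tag, written as a typed function.
// ES modules (flagged with __esModule) pass through unchanged; anything else
// is wrapped so that `.default` points at the whole module object.
function importDefault(mod: Record<string, unknown>): Record<string, unknown> {
  return mod && mod.__esModule ? mod : { default: mod };
}

// A CommonJS-style module has no __esModule flag, so it gets wrapped.
const commonJsModule = { greet: () => "hello" };
const wrapped = importDefault(commonJsModule);

// An ES-module-style object is returned as-is.
const esModule = { __esModule: true, default: { greet: () => "hi" } };
const passedThrough = importDefault(esModule);

console.log(wrapped.default === commonJsModule); // true
console.log(passedThrough === esModule);         // true
```

Code compiled against this helper calls it by name, so if nothing defines `__importDefault` in the page, those calls fail at runtime. Defining it globally in index.html is a workaround rather than a fix for the underlying build configuration.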

After doing all this, our Angular 8 project starts compiling without a whimper.

And now we come to the crux of the problem: continuous speech-to-text. This is achieved by the following code snippet in the Angular project (assume we have a button called “start” which the user clicks to start speaking into the microphone):

// At the top of the component file:
import { AudioConfig, ResultReason, SpeechConfig, SpeechRecognizer } from 'microsoft-cognitiveservices-speech-sdk';

// Inside the component class:
startButton(event) {
  if (this.recognizing) {
    this.stop();
    this.recognizing = false;
  } else {
    this.recognizing = true;
    console.log("record");
    const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
    const speechConfig = SpeechConfig.fromSubscription("Yourkey", "your region");
    speechConfig.speechRecognitionLanguage = 'en-US';
    speechConfig.enableDictation();
    this._recognizer = new SpeechRecognizer(speechConfig, audioConfig);
    // One callback handles both interim (recognizing) and final (recognized) results.
    this._recognizer.recognizing = this._recognizer.recognized = this.recognizerCallback.bind(this);
    this._recognizer.startContinuousRecognitionAsync();
  }
}

recognizerCallback(s, e) {
  console.log(e.result.text);
  const reason = ResultReason[e.result.reason];
  console.log(reason);
  if (reason == "RecognizingSpeech") {
    // Interim hypothesis: show it after the confirmed text without committing it.
    this.innerHtml = this.lastRecognized + e.result.text;
  }
  if (reason == "RecognizedSpeech") {
    // Final result for this utterance: commit it and start a new line.
    this.lastRecognized += e.result.text + "\r\n";
    this.innerHtml = this.lastRecognized;
  }
}

stop() {
  const stopRecognizer = () => {
    this._recognizer.close();
    this._recognizer = undefined;
    console.log('stopped');
  };
  this._recognizer.stopContinuousRecognitionAsync(
    stopRecognizer,
    (err) => {
      // Release the recognizer even if stopping reports an error.
      stopRecognizer();
      console.error(err);
    }
  );
}
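The part of the callback that is easy to get wrong is the difference between interim and final results: an interim hypothesis replaces the previous one, while a final result is appended to the confirmed text. The sketch below isolates that logic so it can be tried without a microphone or the SDK. `SpeechEvent` and `TranscriptBuffer` are names invented for this example, not SDK types.

```typescript
// Minimal stand-ins for the two kinds of result the callback receives.
type SpeechEvent =
  | { kind: "recognizing"; text: string } // interim hypothesis, may still change
  | { kind: "recognized"; text: string }; // final text for one utterance

class TranscriptBuffer {
  private finalized = ""; // text confirmed so far
  private interim = "";   // latest unconfirmed hypothesis

  // Applies one event and returns the text that should be displayed.
  apply(event: SpeechEvent): string {
    if (event.kind === "recognizing") {
      // Each interim result replaces the previous one; it is never appended.
      this.interim = event.text;
    } else {
      // A final result is committed on its own line and the interim text is cleared.
      this.finalized += event.text + "\n";
      this.interim = "";
    }
    return this.finalized + this.interim;
  }
}

// Example: two interim updates, one final result, then the next utterance begins.
const demoBuffer = new TranscriptBuffer();
demoBuffer.apply({ kind: "recognizing", text: "the client" });
demoBuffer.apply({ kind: "recognizing", text: "the client meeting went" });
demoBuffer.apply({ kind: "recognized", text: "The client meeting went well." });
console.log(demoBuffer.apply({ kind: "recognizing", text: "next steps" }));
```

Keeping this state in a small class of its own also makes it straightforward to unit test the display logic separately from the Angular component.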

And voilà! We are done! The end result is real-time continuous speech-to-text with good accuracy. Here is a screen recording of the final output:

Speech to text demo: Continuous Speech Recognition

Well, to be honest, there are a few areas where more accuracy is needed. For example, specific abbreviations like “UAT” (user acceptance testing) are rendered as ‘U 80’, and sometimes words like “before”, depending on the accent and intonation, are rendered as ‘b 4’.

To improve the accuracy of our speech-to-text service, we can use the “Custom Speech” tools.

Speech To Text with Azure Cognitive services

According to the official Microsoft docs, Custom Speech is a set of online tools that allow you to evaluate and improve Microsoft’s speech-to-text accuracy for your applications, tools, and products. Before you can do anything with Custom Speech, you’ll need an Azure account and a Speech service subscription (remember the speech resource we created at the start of our speech-to-text odyssey?).

Once you’ve got an account, you can prep your data, train and test your models, inspect recognition quality, evaluate accuracy, and ultimately deploy and use the custom speech-to-text model. The Custom Speech portal is roughly outlined as follows:

Custom Speech Studio

Follow the steps below (these steps are sourced directly from the Microsoft Custom Speech documentation):

1. Subscribe and create a project — Create an Azure account and subscribe to the Speech service. This unified subscription gives you access to speech-to-text, text-to-speech, speech translation, and the Custom Speech portal. Then, using your Speech service subscription, create your first Custom Speech project.

Custom Speech : Creating a new project

2. Upload test data — Upload test data (audio files) to evaluate Microsoft’s speech-to-text offering for your applications, tools, and products. The data I uploaded consists of audio files with written transcripts of the recordings. We can also provide a pronunciation file that spells out how user- or domain-specific words are spoken, for example:
3CPO    three c p o
CNTK    c n t k
IEEE    i triple e
UAT     you a t

Upload Data for Training

3. Inspect recognition quality — Use the Custom Speech portal to play back uploaded audio and inspect the speech recognition quality of your test data. For quantitative measurements, see Inspect data.

4. Evaluate accuracy — Evaluate the accuracy of the speech-to-text model. The Custom Speech portal will provide a Word Error Rate, which can be used to determine if additional training is required. If you’re satisfied with the accuracy, you can use the Speech service APIs directly. If you’d like to improve accuracy by a relative average of 5% — 20%, use the Training tab in the portal to upload additional training data, such as human-labeled transcripts and related text.

Test Uploaded Data

5. Train the model — Improve the accuracy of your speech-to-text model by providing written transcripts (10–1,000 hours) and related text (<200 MB) along with your audio test data. This data helps to train the speech-to-text model. After training, retest, and if you’re satisfied with the result, you can deploy your model.

Train the Custom Speech Model

6. Deploy the model — Create a custom endpoint for your speech-to-text model and use it in your applications, tools, or products.
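The Word Error Rate from step 4 is the number of word substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The portal computes it for you; the TypeScript sketch below only shows the arithmetic. The function names are my own, and the tokenizer is deliberately naive (lowercase, split on whitespace, punctuation left in place).

```typescript
// Lowercases the text and splits it into words on whitespace.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((word) => word.length > 0);
}

// WER = (substitutions + deletions + insertions) / number of reference words,
// computed with a word-level edit distance (dynamic programming, two rows).
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference transcript must contain at least one word");
  }
  // prev[j] holds the distance between the reference prefix processed so far
  // and the first j hypothesis words.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitutionCost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,                   // deletion
        curr[j - 1] + 1,               // insertion
        prev[j - 1] + substitutionCost // match or substitution
      ));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// "UAT" heard as "U 80": one substitution plus one insertion over five words.
console.log(wordErrorRate("we start UAT before release", "we start U 80 before release")); // 0.4
```

A lower value is better, and 0 means the recognized text matches the reference word for word under this tokenization.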

Deploy the Custom Speech model and consume it using the endpoints.

Deploy the custom model
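One practical note on the pronunciation file from step 2: if the list of domain terms lives in code or in a spreadsheet export, the file can be generated instead of typed by hand. The sketch below assumes one entry per line, with the written form and the spoken form separated by a tab. That layout is my reading of the Custom Speech documentation at the time of writing, so please verify it against the current docs; the type and function names are invented for this example.

```typescript
interface PronunciationEntry {
  display: string; // how the term should appear in the transcript
  spoken: string;  // how it is said aloud, spelled out in plain words
}

// Builds the file content: one "display<TAB>spoken" pair per line (assumed layout).
function buildPronunciationFile(entries: PronunciationEntry[]): string {
  return entries
    .map(({ display, spoken }) => {
      // Tabs and line breaks inside a field would corrupt the layout.
      if (/[\t\r\n]/.test(display) || /[\t\r\n]/.test(spoken)) {
        throw new Error(`Invalid characters in entry for "${display}"`);
      }
      return `${display.trim()}\t${spoken.trim().toLowerCase()}`;
    })
    .join("\n");
}

const pronunciationFile = buildPronunciationFile([
  { display: "UAT", spoken: "you a t" },
  { display: "IEEE", spoken: "i triple e" },
]);
console.log(pronunciationFile);
```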

Once the model is deployed, we can consume it using the code:

var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
config.EndpointId = "YourEndpointId";
var reco = new SpeechRecognizer(config);

The details of the endpoint, Keys etc. will be available once the model deployment is successful. Please see ways to use different endpoints in the application.

REST API, short audio, long audio (wss://): how to use different endpoints in your applications

In our case, we just need to replace the speech config subscription in our Angular code with the new keys and endpoints and we are ready to start using our Custom Speech to text model 🙂

It is really easy to integrate custom speech models using the Custom Speech portal, and we can train models for various languages such as German and French.

Please find the full code here on GitHub.

Happy coding, and do let me know how it worked out for you 🙂
