Top Speech Processing APIs
A voice bot is nothing without its speech. Its ability to turn responses coming from an AI engine into a voice response is what lets it communicate with users in their own languages. This requires recognition in two steps: the bot must first identify the language the user is speaking and then understand the content of what the user said, before responding in the user's preferred language.
Many public cloud service providers support both types of interaction: speech-to-text and text-to-speech conversion. Even if you are not an expert in AI or natural language processing, you can integrate these providers with your existing services by implementing simple web service calls. You can use them in any application where you want to support voice-based interaction, or when you want to deploy a voice bot in your contact center.
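As a rough illustration of what such a web service call can look like, the sketch below builds (but does not send) a JSON request body in the shape used by the synchronous `speech:recognize` REST method of Google Cloud Speech-to-Text. The helper name `build_recognize_request`, the 16 kHz linear PCM input format, and the placeholder audio bytes are assumptions made for this example; authentication and the actual HTTP POST are left out.

```python
import base64
import json

# Endpoint of the synchronous recognize method in Google's REST API.
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_request(audio_bytes: bytes, language_code: str = "en-US") -> str:
    """Return the JSON body for a speech-to-text request (nothing is sent)."""
    body = {
        "config": {
            # Assumed input format: 16 kHz, 16-bit linear PCM audio.
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": language_code,
        },
        "audio": {
            # Binary audio travels inside JSON as base64 text.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }
    return json.dumps(body)


if __name__ == "__main__":
    # Placeholder bytes stand in for real audio data.
    print(RECOGNIZE_URL)
    print(build_recognize_request(b"\x00\x01" * 4))
```

The other providers follow the same general pattern: audio or text goes in with a few configuration fields, and a transcript or an audio stream comes back.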
Here is a list of some popular APIs for speech processing:
- Google Cloud Speech API
- IBM Watson Speech to Text
- Amazon Polly
- IBM Watson Text to Speech
- Amazon Transcribe
We will describe the general aspects of each API.
Google Cloud Speech API
Google Cloud Speech API is part of Google Cloud and converts human speech to text. Google arguably has one of the strongest and broadest natural language processing engines so far, with support for more than 100 languages. It also allows you to identify sentiment or different entities in the speech input. The API can work in both batch and real-time modes. Pricing is usage-based. Up to 60 minutes of processed audio are free for each user. Beyond 60 minutes, you pay 0.006 USD per 15 seconds. Note that the total monthly capacity is limited to 1 million minutes of audio.
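To make the pricing concrete, here is a minimal cost estimate based only on the figures quoted above. It assumes the 60 free minutes apply per month and that usage is rounded up to whole 15-second increments; both are assumptions of this sketch, as is the function name `google_speech_monthly_cost`.

```python
import math

FREE_SECONDS = 60 * 60                 # 60 free minutes (assumed to be per month)
BLOCK_SECONDS = 15                     # billing increment quoted above
PRICE_PER_BLOCK_USD = 0.006            # price per 15-second increment
MONTHLY_CAP_SECONDS = 1_000_000 * 60   # 1 million minutes of audio


def google_speech_monthly_cost(audio_seconds: int) -> float:
    """Estimate the monthly cost in USD for the given amount of audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    if audio_seconds > MONTHLY_CAP_SECONDS:
        raise ValueError("exceeds the monthly capacity of 1 million minutes")
    billable = max(0, audio_seconds - FREE_SECONDS)
    # Assumption: partial increments are rounded up to a full 15 seconds.
    blocks = math.ceil(billable / BLOCK_SECONDS)
    return round(blocks * PRICE_PER_BLOCK_USD, 3)


if __name__ == "__main__":
    ten_hours = 10 * 60 * 60
    print(google_speech_monthly_cost(ten_hours))
```

Under these assumptions, ten hours of audio leaves nine billable hours, which is 2160 increments of 15 seconds.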
IBM Watson Speech to Text
IBM Watson Speech to Text is a service provided by IBM Watson that converts human speech into text. Its language coverage is quite limited. On the other hand, besides supporting customization for specific words, it also lets you customize the service for particular acoustic conditions.
There are three levels of access to the service. The Standard level provides free access for the first 1000 minutes of processed audio per month. After that, per-minute prices apply, and they depend on the number of minutes you want to process. If you are going to use customization models, you will have to pay 0.03 USD in addition to the Standard level prices. To use the Premium level, you have to contact IBM.
Amazon Polly
Amazon Polly is the part of the Amazon Web Services offering that lets customers convert text into speech.
Amazon Polly has good support for SSML, which enables its users to add various touches to the interactions, such as inserting pauses or adding emphasis to certain words.
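The sketch below shows what such SSML might look like, built in Python with the standard library so that special characters are escaped correctly. It uses the standard SSML `break` and `emphasis` elements; the helper name `build_ssml` and the sample sentence are made up for this example, and sending the result to Polly is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_ssml(before_pause: str, emphasized: str, after: str, pause_ms: int = 500) -> str:
    """Build an SSML document with one pause and one emphasized phrase."""
    speak = ET.Element("speak")
    speak.text = before_pause + " "
    # A pause of the requested length, in milliseconds.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = " "
    # Extra weight on a single word or phrase.
    strong = ET.SubElement(speak, "emphasis", {"level": "strong"})
    strong.text = emphasized
    strong.tail = " " + after
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Your order is ready.", "Please", "collect it today."))
```

Which SSML elements are honored can depend on the voice being used, so it is worth checking the result by ear.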
Pricing is usage-based. The Free Tier is available during the first 12 months and covers up to 5 million characters per month. The alternative is the Pay-As-You-Go model, under which you pay 4 USD per 1 million characters processed.
IBM Watson Text to Speech
IBM Watson Text to Speech is IBM's service for text-to-speech conversion.
The system produces high-quality audio files from input text. It can recognize some abbreviations and numbers. For example, it can pronounce “United States Dollars” when it encounters the abbreviation “USD” in the text. The API can detect the tone of a sentence (a question, for example). You can choose the expressiveness of the voice (GoodNews, Apology, Uncertainty). Voice types such as Young, Soft, Male, and Female are also available. However, expressiveness and the different voice types are currently available only for English. The word timing feature allows you to synchronize streamed text with the accompanying voice. The service can produce audio files in different formats. You can read more about the supported formats in the documentation.
Pricing depends on the level of usage. If you want the Premium level, you should contact IBM to agree on the details of price and usage. If the Standard level is sufficient, the conditions are as follows. The first 1 million characters of processed text per month are free. Beyond that, you pay 0.02 USD per 1000 characters. All languages and voices are available at the Standard level.
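Because the two text-to-speech services quote prices in different units (per 1 million characters for Amazon Polly, per 1000 characters for IBM Watson), a small calculation helps to compare them. This is only a sketch using the figures quoted in this article; the function names are made up here, and it assumes that characters are billed proportionally rather than in whole units.

```python
def polly_cost_usd(characters: int, in_free_tier: bool) -> float:
    """Monthly Amazon Polly cost: 4 USD per 1 million characters."""
    # The Free Tier covers 5 million characters per month in the first 12 months.
    free = 5_000_000 if in_free_tier else 0
    billable = max(0, characters - free)
    return billable / 1_000_000 * 4.0


def watson_tts_cost_usd(characters: int) -> float:
    """Monthly IBM Watson Text to Speech cost at the Standard level."""
    # The first 1 million characters per month are free, then 0.02 USD per 1000.
    billable = max(0, characters - 1_000_000)
    return billable / 1_000 * 0.02


if __name__ == "__main__":
    monthly_characters = 6_000_000
    print("Polly (Free Tier):", polly_cost_usd(monthly_characters, True))
    print("Polly (Pay-As-You-Go):", polly_cost_usd(monthly_characters, False))
    print("Watson (Standard):", watson_tts_cost_usd(monthly_characters))
```

Expressed in the same unit, the quoted Watson rate of 0.02 USD per 1000 characters corresponds to 20 USD per 1 million characters, compared with 4 USD for Polly.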
Amazon Transcribe
Amazon Transcribe is another speech recognition service provided by Amazon Web Services. As its name suggests, this service enables users to generate a transcript of audio files.
The main benefit we see in Amazon Transcribe is converting contact center conversations into text transcripts, which gives the contact center better insight into its calls. Amazon Transcribe supports telephony audio, which usually has lower audio quality. Other features include adding timestamps to the transcript. Beyond that, Amazon Transcribe has a fairly good roadmap for bringing many more features to the product.
This service is also part of the AWS Free Tier: a user can process up to 60 minutes of audio per month for 12 months. After that, it charges 0.0004 USD per second of processed audio.
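The speech-to-text prices in this article are quoted in different units (per 15 seconds for Google, per second for Amazon Transcribe), so it can help to normalize them before comparing. The snippet below is a small sketch of that conversion using only the figures quoted here; the helper name `usd_per_hour` is made up for this example, and free allowances are ignored.

```python
def usd_per_hour(price_usd: float, per_seconds: float) -> float:
    """Convert a price quoted per N seconds of audio into USD per hour."""
    return price_usd / per_seconds * 3600


# Rates as quoted in this article.
GOOGLE_SPEECH_HOURLY = usd_per_hour(0.006, 15)      # 0.006 USD per 15 seconds
AMAZON_TRANSCRIBE_HOURLY = usd_per_hour(0.0004, 1)  # 0.0004 USD per second

if __name__ == "__main__":
    print(f"Google Cloud Speech API: {GOOGLE_SPEECH_HOURLY:.2f} USD per hour")
    print(f"Amazon Transcribe:       {AMAZON_TRANSCRIBE_HOURLY:.2f} USD per hour")
```

With the quoted figures, both rates work out to about 1.44 USD per hour of audio, so the practical difference lies in the free allowances and the features rather than in the base price.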