Azure AI
Speech Services
Driven by my immense enthusiasm and fascination for AI technology, including cloud-based services from Microsoft, AWS, and Google, I have developed a small proof of concept on both AWS and Azure. This project highlights the possibilities AI has created for real-time communication between people who do not speak the same language. Through the use of AI for speech recognition and transcription, text translation, and ultimately the production of synthetic speech in the desired language, the distance between people suddenly seems much shorter.
To cut straight to the chase: having compared AWS with Azure, I see no reason to spend time on anything other than Azure's services. This applies to the range of supported languages for both speech recognition and speech synthesis, as well as natural language understanding (NLU) of phone numbers, dates, time expressions, and other language-specific constructs needed for clear and accurate communication.
This page provides various examples, along with illustrations of the technical setup and links to the corresponding code.
In both Azure setups, all traffic is directed through an Apache reverse proxy. The proxy acts as a gateway through which clients connect to the WebSockets and send a 16 kHz mono PCM WAV audio stream. RabbitMQ then takes over, managing the flow of audio events, the transcription and translation of speech to text, and the delivery of synthetic audio in the translated language. You can experience this workflow firsthand by trying out the echo service.
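As a minimal sketch of the audio format the WebSocket endpoint expects (16-bit signed PCM, 16 kHz, mono), the snippet below builds a short test-tone WAV buffer using only the Python standard library. The function name and tone parameters are illustrative, not part of the actual service code:

```python
import io
import math
import struct
import wave

def make_pcm16_mono_wav(duration_s: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Build a small WAV buffer containing a 440 Hz test tone.

    The format matches what a client would stream to the WebSocket:
    16-bit signed PCM, 16 kHz sample rate, a single (mono) channel.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        n_samples = int(duration_s * sample_rate)
        frames = b"".join(
            struct.pack(
                "<h",
                int(32767 * 0.3 * math.sin(2 * math.pi * 440 * t / sample_rate)),
            )
            for t in range(n_samples)
        )
        wav.writeframes(frames)
    return buf.getvalue()

wav_bytes = make_pcm16_mono_wav()
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getsampwidth(), check.getframerate())  # 1 2 16000
```

A real client would strip or skip the WAV header and push the raw PCM frames over the socket in small chunks, but the channel count, sample width, and sample rate are the parts that must match.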
Echo solution for transcription, translation, and playback service
It’s also possible to engage in real-time communication with another individual using this platform. However, this demo requires two devices to be connected in order to initiate a dialogue. Through the following link, you can start this proof of concept. Once both machines are connected, the conversation can begin. You’ll find a brief description and a shareable link for the other participant on the site.
This setup showcases real-time transcription, translation, speech generation, and dialogue capabilities, allowing for seamless interaction between speakers of different languages.
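The routing step in the dialogue case can be sketched in a few lines: when a speech event arrives from one participant, the pipeline decides which peers should receive a translated, synthesized copy and in which language. The names (`Participant`, `route_translation`, the session IDs) are illustrative stand-ins, not the actual service code; in the real setup each resulting pair would correspond to a RabbitMQ message:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    session_id: str
    language: str  # BCP-47 style tag, e.g. "en-US" or "nb-NO"

def route_translation(sender: Participant, peers: list[Participant]) -> list[tuple[str, str]]:
    """For a speech event from `sender`, return (session_id, target_language)
    pairs for every other connected participant."""
    return [
        (peer.session_id, peer.language)
        for peer in peers
        if peer.session_id != sender.session_id
    ]

alice = Participant("a1", "en-US")
bjorn = Participant("b2", "nb-NO")
print(route_translation(alice, [alice, bjorn]))  # [('b2', 'nb-NO')]
```

The echo service is the degenerate case of the same flow: the sender is its own target, and the translated audio is played back to the originating connection.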
The proof of concept is currently unavailable as it is undergoing refinements.