How do I make a custom smart speaker?

The vendors listed under Development Kits for AVS (Alexa Voice Service) supply reference designs for far-field microphone technology. These reference designs have two modules:

  1. A digital signal processor (DSP), which handles the microphone array, noise reduction, echo cancellation, beamforming, and the custom wake word.

  2. A microcontroller (such as an Android device, a Raspberry Pi, or another board), which handles communication with an endpoint. The microcontroller controls whether the speech audio is streamed to Speech-to-Text client libraries or to AVS, as sketched below.
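
The split between the two modules can be pictured with a short, purely illustrative Python sketch. The function names here are hypothetical stand-ins for whichever SDK calls you actually use:

```python
USE_AVS = False  # endpoint choice for this deployment (see the sections below)

def stream_to_speech_to_text(audio_chunks):
    """Placeholder: forward audio via a Speech-to-Text client library."""

def stream_to_avs(audio_chunks):
    """Placeholder: forward audio to the Alexa Voice Service."""

def route_audio(audio_chunks):
    # The DSP has already done beamforming, echo cancellation, and wake-word
    # detection; the microcontroller only decides where the stream goes.
    if USE_AVS:
        stream_to_avs(audio_chunks)
    else:
        stream_to_speech_to_text(audio_chunks)
```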

If using Alexa Voice Service (AVS)

If you are using Alexa Voice Service (AVS) with Orbita (AVS is not HIPAA-compliant today), Alexa notifies Orbita of intents and utterances, just as if the device were an Echo Dot.

Log in to developer.amazon.com and see the documentation: https://developer.amazon.com/avs/home.html#/avs/welcome

When you use Alexa Voice Service (AVS):

  • You can control your wake word.

  • You can launch directly into your skill (that is, you don't need to say "launch" or "open").

If using Google Speech-to-Text

To be HIPAA-compliant today, you need to stream to Google Speech-to-Text. If you do not need to be HIPAA-compliant, you can save money by connecting to the Alexa Voice Service (AVS).

When you stream the speech audio (MP3) to Google Speech-to-Text, the service detects when there is a pause in the speech and returns the text to the microcontroller, which then sends it to Orbita to process for intents. The microcontroller turns the microphone on and off. For example, the microphone is off while the speaker is playing, and it also turns off after 8 seconds if it does not detect a response.
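
The streaming step can be sketched with Google's google-cloud-speech Python client. This is a minimal sketch, not Orbita's implementation: it assumes 16 kHz LINEAR16 audio chunks from the microphone (the audio_chunks generator is a placeholder for your capture loop), and it sets single_utterance=True so the service ends the stream at the first detected pause.

```python
from google.cloud import speech  # pip install google-cloud-speech

def transcribe_until_pause(audio_chunks):
    """Stream audio to Google Speech-to-Text and return the transcript
    once the service detects a pause in the speech."""
    client = speech.SpeechClient()
    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        single_utterance=True,  # end the stream at the first pause
    )
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in audio_chunks
    )
    for response in client.streaming_recognize(
        config=streaming_config, requests=requests
    ):
        for result in response.results:
            if result.is_final:
                return result.alternatives[0].transcript  # text for Orbita
    return None
```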

In Orbita, you configure the Bot Channel Listener with a URL that the device can send the utterance text to (a device-side sketch follows this procedure). The following diagram and procedure show the flow of the configuration.

  1. The Bot Channel Listener pushes the output to the Bot-In-Parser.

  2. The Bot-In-Parser formats the utterance text into Orbita JSON.

  3. Orbita de-identifies the user and sends the text to a natural language processor (NLP), Google Dialogflow.

  4. The result, with the intent and slot information, is returned to the NLP Connector node that called it.

  5. Then, internally, the NLP Connector calls the matching Intent node on the Experience Manager to start that flow. The intent flow concludes with a Response node.

  6. The NLP Connector node passes the response, which now includes the SSML, text, display, and button information, to the Bot-Out-Formatter node.

  7. Google Cloud Text-to-Speech (TTS) processes the SSML and transforms the text to speech (see the TTS sketch after this procedure).

  8. The speech audio is then sent back to the device along with the other information.
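
As a rough illustration of the device side of this flow, here is a sketch of posting the utterance text to the Bot Channel Listener. The URL, payload shape, and response format below are hypothetical placeholders; the actual values come from your Orbita project configuration.

```python
import requests  # pip install requests

# Hypothetical endpoint; use the URL you configured on the Bot Channel Listener.
BOT_CHANNEL_URL = "https://example.orbita.cloud/bot-channel/listener"

def send_utterance(device_id: str, text: str) -> dict:
    """POST the transcribed utterance to Orbita and return its response,
    which (per steps 6-8) carries the audio plus display and button info."""
    # The field names below are illustrative; match them to your listener config.
    resp = requests.post(
        BOT_CHANNEL_URL,
        json={"deviceId": device_id, "utterance": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```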
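
Step 7 can be reproduced with Google's google-cloud-texttospeech Python client. This is a minimal sketch, assuming English SSML and MP3 output; the voice and encoding choices are illustrative, not Orbita's configuration.

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

def synthesize_ssml(ssml: str) -> bytes:
    """Render SSML from the Bot-Out-Formatter into speech audio (step 7)."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(ssml=ssml),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US",
            ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3,
        ),
    )
    return response.audio_content  # audio bytes sent back to the device (step 8)
```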

See the attached PDF, Orbita Security Device Flow.

The following is a partial list of companies that make far-field voice hardware:

  • Qualcomm

  • Amlogic

  • Allwinner

  • NXP

  • Intel

  • XMOS

  • Synaptics

  • Cirrus Logic

  • Microsemi AcuEdge
