V-Blaze and V-Cloud Online Help (May 2020)

Real-Time Streaming Transcription

Real-time streaming transcription enables use cases such as in-call monitoring and alerting a supervisor to intervene in an active call. When real-time mode is activated, a transcript of each utterance is returned as soon as that utterance has been transcribed.

V‑Blaze can be configured to transcribe live streaming audio at a rate between 1X (real time) and 5X (five times faster than real time). In most cases 1X is sufficient. Higher speeds are offered for demanding use cases where milliseconds count. Delivering 5X real time requires five times more hardware resources than does 1X, all other factors being equal.

How does real-time transcription work?

Utterance transcripts are HTTP POSTed to a client-side callback server. Real-time transcription resembles the standard callback mechanism with one major difference. Instead of POSTing the entire transcript to the callback server, the transcript of each utterance is POSTed as soon as it is ready.
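The callback server is client-side code, so its implementation is up to you. Below is a minimal sketch of such a server using only the Python standard library; the payload is assumed to be JSON here (adjust the parsing to match the actual utterance format your deployment delivers), and port 5556 is just an illustrative choice.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_utterance(body: bytes):
    """Decode one POSTed utterance payload (assumed to be JSON here)."""
    return json.loads(body.decode("utf-8"))

class UtteranceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client declared
        length = int(self.headers.get("Content-Length", 0))
        utterance = parse_utterance(self.rfile.read(length))
        print(utterance)  # hook for in-call monitoring / alerting logic
        self.send_response(200)
        self.end_headers()

def serve(port=5556):
    """Run the callback server (blocks until interrupted)."""
    HTTPServer(("", port), UtteranceHandler).serve_forever()
```

Because each utterance arrives in its own POST, the handler fires once per utterance rather than once per call, which is what makes live alerting possible.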

  1. Utterances are transcribed based on two events:

    • Break(s) in speech

    • Max utterance length

  2. The max utterance length setting can be as high as 80 seconds (15 seconds is typical), but it is a variable that requires tuning per solution and use case. Note that setting max utterance length too low will most likely degrade transcription accuracy: a lower setting reduces the amount of context available to support recognition, which the ASR engine relies on.

  3. Latency is measured from the end of the utterance being transcribed to the time its transcription result is POSTed. Load impacts this latency:

    • Light load: 0.2x latency should be expected

    • Medium load: 1x latency should be expected

    • Heavy load: > 1x latency should be expected
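The load multipliers above can be turned into rough wall-clock estimates. The sketch below assumes the multiplier applies to the duration of the utterance (so a 15-second utterance under light load would return in roughly 3 seconds); `expected_latency` is a hypothetical helper, not part of the product API.

```python
# Latency multipliers from the list above; under heavy load the
# multiplier exceeds 1x and is not bounded, so it is excluded here.
LOAD_MULTIPLIERS = {"light": 0.2, "medium": 1.0}

def expected_latency(utterance_seconds: float, load: str) -> float:
    """Rough expected seconds from utterance end to callback POST,
    assuming the multiplier scales with utterance duration."""
    if load == "heavy":
        raise ValueError("heavy load: latency exceeds 1x and is unbounded")
    return utterance_seconds * LOAD_MULTIPLIERS[load]
```

Estimates like this are useful when sizing an alerting pipeline: they bound how stale a "live" transcript can be under a given load.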

The three phases of the transcription of an utterance are described below to illustrate the precise timing of real-time ASR engine transcription:

  1. The ASR engine receives audio data packets as fast as the sender can provide them. For example, during a live two-channel telephone call sampled at 8 kHz and encoded as PCM with a two-byte sample size, each ASR engine stream will receive 8000 samples/second * 2 bytes * 2 channels = 32,000 bytes per second.

    The ASR engine will buffer this audio data until it detects a sufficiently long silence or until the maximum utterance duration has been exceeded. For example, for a 15-second utterance, the ASR engine will spend 15 seconds buffering audio.

  2. Once the ASR engine has buffered a complete utterance, it will transcribe the utterance. If the ASR engine has been configured to transcribe at 1X, it can take up to 15 seconds to complete the transcription process of a 15-second utterance. If it has been configured to transcribe at 5X, it can take up to three seconds (15/5 = 3).

  3. As soon as the utterance transcription process has completed, the result is POSTed to the utterance callback server.
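Putting the three phases together gives an estimate of the wall-clock time from the start of an utterance to its callback POST. The sketch below uses the figures from the phases above (8 kHz, two-byte PCM samples, two channels; buffering proceeds at real time) and ignores the POST itself, which is normally negligible.

```python
def stream_bytes_per_second(sample_rate_hz=8000, bytes_per_sample=2,
                            channels=2):
    # Phase 1: raw audio rate the engine must ingest per stream
    return sample_rate_hz * bytes_per_sample * channels

def utterance_turnaround(utterance_seconds: float,
                         speed_factor: float) -> float:
    # Phase 1 buffers for the full utterance duration; phase 2
    # transcribes at speed_factor times real time; phase 3 (the POST)
    # is ignored here.
    return utterance_seconds + utterance_seconds / speed_factor
```

For a 15-second utterance this gives 30 seconds start-to-POST at 1X and 18 seconds at 5X, which is why higher speed factors matter when supervisors must react within an active call.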

For example, suppose a server (the "sender") is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the "receiver") is configured to receive utterance transcript data on port 5556. Note that sender and receiver can be running on the same machine, and can even be different threads of the same program, or they can be two entirely different, geographically distributed systems. The following request will initiate real-time transcription:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F datahdr=WAVE \
     -F socket=sender:5555 \