V-Blaze and V-Cloud Online Help (May 2020)

Real-Time Streaming Transcription

Real-time streaming transcription supports use cases such as in-call monitoring and alerting, for example notifying a supervisor to intervene in an active call. When operating in real-time mode, V‑Blaze returns a transcript of each utterance as soon as that utterance has been transcribed.

Utterance transcripts are HTTP POSTed to a client-side callback server. This works the same way as described in Receiving Results via Callback, except that instead of the entire transcript being POSTed to the callback server at once, the transcript of each utterance is POSTed as soon as it is ready.
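The callback server can be any HTTP endpoint that accepts POST requests. The following minimal Python receiver is an illustrative sketch (the port and payload handling are assumptions; the exact payload format depends on your V‑Blaze output configuration) that simply prints each utterance transcript as it arrives:

```python
# Minimal utterance-callback receiver (illustrative sketch).
# Assumes each utterance transcript arrives as the body of an HTTP POST;
# the exact payload format depends on V-Blaze output configuration.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class UtteranceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)            # one utterance transcript
        print(body.decode("utf-8", errors="replace"))
        self.send_response(200)                   # acknowledge receipt
        self.end_headers()

def serve(port: int = 5556) -> HTTPServer:
    """Start the receiver in a background thread; returns the server."""
    server = HTTPServer(("", port), UtteranceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling serve(5556) starts a receiver matching the utterance_callback port used in the curl examples in this section.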

The transcription of an utterance proceeds in three phases, described below to illustrate the precise timing of real-time V‑Blaze transcription:

  1. V‑Blaze receives audio data packets as fast as the sender can provide them. For a live 2-channel telephone call sampled at 8 kHz and encoded as PCM with a 2-byte sample size, each V‑Blaze stream will receive 8000 samples/s * 2 bytes * 2 channels = 32,000 bytes per second. V‑Blaze buffers this audio data until it detects a sufficiently long silence or the maximum utterance duration is exceeded. For example, for an utterance of duration 15 seconds, V‑Blaze will spend 15 seconds buffering audio.

  2. Once V‑Blaze has buffered a complete utterance, it transcribes the utterance. If V‑Blaze has been configured to transcribe at 1x (real time), transcribing a 15-second utterance can take up to 15 seconds. If it has been configured to transcribe at 5x, it can take up to 15/5 = 3 seconds.

  3. As soon as the transcription of an utterance has completed, the utterance transcript is POSTed to the utterance callback server.
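The timing figures in the steps above can be checked with a little arithmetic; the following sketch reproduces the byte rate from step 1 and the transcription latency from step 2:

```python
# Worked example of the real-time pipeline timing described above.
SAMPLE_RATE = 8000      # samples per second per channel (8 kHz)
SAMPLE_WIDTH = 2        # bytes per sample (16-bit PCM)
CHANNELS = 2            # 2-channel telephone call

# Step 1: bytes of audio V-Blaze receives per second of call time.
bytes_per_second = SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS
print(bytes_per_second)        # 32000

# Step 2: worst-case transcription time for a buffered utterance.
utterance_seconds = 15
speed_factor = 5               # V-Blaze configured to transcribe at 5x
transcribe_seconds = utterance_seconds / speed_factor
print(transcribe_seconds)      # 3.0
```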

For example, suppose a server (the "sender") is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the "receiver") is configured to receive utterance transcript data on port 5556. Note that sender and receiver can be running on the same machine, and can even be different threads of the same program, or they can be two entirely different, geographically distributed systems. The following request will initiate real-time transcription:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F datahdr=WAVE \
     -F socket=sender:5555 \
     http://vblaze_name:17171/transcribe

Real-time streaming audio often does not include a WAV header. When transcribing raw or headerless audio, omit the datahdr field and instead explicitly supply the information that the header would normally provide. At a minimum this includes the sample rate, sample width, and encoding. The byte endianness can also be specified, though the default value of LITTLE is usually correct. For example:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F socket=sender:5555 \
     -F samprate=8000 \
     -F sampwidth=2 \
     -F encoding=spcm \
     http://vblaze_name:17171/transcribe
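In socket mode, V‑Blaze connects to the sender's address and reads audio bytes from the socket. The following Python sketch shows one possible sender: it listens on the port named in the socket field and streams headerless PCM chunks to whichever client connects (audio_source is a hypothetical stand-in for a live capture; a real sender would stream audio from an active call):

```python
# Minimal raw-audio sender sketch (illustrative assumption: V-Blaze
# connects to this port and reads headerless PCM until the socket closes).
import socket

def stream_audio(audio_source, port: int = 5555) -> None:
    """Serve raw PCM chunks from audio_source to one connecting client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen(1)
        conn, _addr = srv.accept()          # V-Blaze connects here
        with conn:
            for chunk in audio_source:      # raw PCM bytes, no WAV header
                conn.sendall(chunk)
        # Closing the connection signals end-of-audio to the reader.
```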