V-Blaze and V-Cloud Online Help

Real-Time Streaming Transcription with V‑Blaze

The three phases of utterance transcription are described below to illustrate the precise timing of real-time V‑Blaze transcription:

  1. V‑Blaze receives audio data packets as fast as the sender can provide them. For a live 2-channel telephone call sampled at 8 kHz and encoded as PCM with a 2-byte sample size, each V‑Blaze stream will receive 8,000 samples/second * 2 bytes/sample * 2 channels = 32,000 bytes per second. V‑Blaze buffers this audio data until it detects a sufficiently long silence or until the maximum utterance duration has been exceeded. For example, for a 15-second utterance, V‑Blaze will spend 15 seconds buffering audio.

  2. Once V‑Blaze has buffered a complete utterance, it transcribes the utterance. If V‑Blaze has been configured to transcribe at 1x, it can take up to 15 seconds to transcribe a 15-second utterance. If it has been configured to transcribe at 5x, it can take up to 15/5 = 3 seconds. (The sketch after this list illustrates this arithmetic.)

  3. As soon as transcription of the utterance has completed, the resulting transcript is POSTed to the utterance callback server.
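
The following Python sketch illustrates the arithmetic in steps 1 and 2 only; it is not a V‑Blaze API and the utterance length and speed factor are illustrative values.

# Illustrative arithmetic for steps 1 and 2; not a V-Blaze API.
SAMPLE_RATE = 8000   # samples per second (8 kHz telephone audio)
SAMPLE_WIDTH = 2     # bytes per sample (16-bit PCM)
CHANNELS = 2         # 2-channel call

def bytes_per_second():
    """Raw data rate for the stream described in step 1."""
    return SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS   # 32,000 bytes per second

def estimated_delay(utterance_seconds, speed_factor):
    """Approximate delay from the start of an utterance to the callback POST:
    buffering time (step 1) plus worst-case transcription time (step 2)."""
    return utterance_seconds + utterance_seconds / speed_factor

print(bytes_per_second())        # 32000
print(estimated_delay(15, 1))    # 30.0 seconds at 1x
print(estimated_delay(15, 5))    # 18.0 seconds at 5x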

For example, suppose a server (the "sender") is configured to broadcast a telephone call on port 5555, using the WAV container format and a supported audio encoding method such as PCM. Likewise, a server (the "receiver") is configured to receive utterance transcript data on port 5556. Note that the sender and receiver can run on the same machine, even as different threads of the same program, or they can be two entirely different, geographically distributed systems. The following request will initiate real-time transcription:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F datahdr=WAVE \
     -F socket=sender:5555 \
     http://vblaze_name:17171/transcribe
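
As a minimal sketch of the "receiver" side, the Python program below accepts POSTs on port 5556, matching the utterance_callback URL in the request above. It makes no assumption about the payload format and simply prints the raw body of each callback; the handler and server names are illustrative.

# Minimal sketch of an utterance callback receiver.
# Assumption: V-Blaze POSTs each utterance transcript to the URL given
# in utterance_callback; the payload is printed as-is, uninterpreted.
from http.server import BaseHTTPRequestHandler, HTTPServer

class UtteranceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("Received utterance callback:", body.decode("utf-8", "replace"))
        self.send_response(200)   # acknowledge receipt
        self.end_headers()

if __name__ == "__main__":
    # Listen on port 5556, matching the utterance_callback URL above.
    HTTPServer(("", 5556), UtteranceHandler).serve_forever()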

It is often the case that real-time streaming audio will not include a WAV header. When transcribing raw or headerless audio, the datahdr field is not used; instead, the information normally carried in the file header must be supplied explicitly. This includes, at a minimum, the sample rate, sample width, and encoding. Byte endianness can also be specified; however, the default value of LITTLE is usually correct. The following is an example:

curl -F utterance_callback=http://receiver:5556/utterance \
     -F socket=sender:5555 \
     -F samprate=8000 \
     -F sampwidth=2 \
     -F encoding=spcm \
     http://vblaze_name:17171/transcribe
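
For the "sender" side of this raw-audio example, the Python sketch below listens on port 5555 and writes headerless 16-bit little-endian PCM to whatever connects. It assumes that V‑Blaze connects to the host:port given in the socket parameter and reads the audio stream from that connection, and that the audio is single-channel; the file name call.raw, the chunk pacing, and those assumptions are illustrative only.

# Sketch of a "sender" serving headerless PCM audio on port 5555.
# Assumption: V-Blaze connects to the host:port from the socket
# parameter and reads raw audio bytes from that connection.
import socket
import time

SAMPLE_RATE = 8000   # must match the samprate parameter
SAMPLE_WIDTH = 2     # must match the sampwidth parameter (bytes)
CHUNK_SECONDS = 0.5  # send half-second chunks to pace the stream like a live call

def serve_pcm(path="call.raw", port=5555):
    chunk_bytes = int(SAMPLE_RATE * SAMPLE_WIDTH * CHUNK_SECONDS)
    with socket.create_server(("", port)) as server:
        conn, _ = server.accept()          # wait for V-Blaze to connect
        with conn, open(path, "rb") as audio:
            while chunk := audio.read(chunk_bytes):
                conn.sendall(chunk)        # raw little-endian signed PCM
                time.sleep(CHUNK_SECONDS)

if __name__ == "__main__":
    serve_pcm()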

Refer to Voice Activity Detection Controls for more information on parameters used with real-time transcription.