V-Blaze and V-Cloud Online Help

Voice Activity Detection and Utterance Controls

The following parameters are most often used in real-time transcription scenarios using V‑Blaze.

Table 1. Voice Activity Detection controls






activitylevel

default is 175

Specifies the volume threshold that separates active from inactive audio. The value should be high enough to screen out noise but low enough to trigger reliably on speech. The range is 0 to 32768, corresponding to the average magnitude of a signed 16-bit LPCM frame.
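As an illustration of what "average magnitude of a signed 16-bit LPCM frame" means, the following Python sketch computes that value for a raw frame and compares it to the threshold. The function names, the native byte order, and the use of "greater than or equal" for the comparison are assumptions made for this example; they are not taken from the engine.

```python
import array


def average_magnitude(frame: bytes) -> float:
    """Mean absolute sample value of a signed 16-bit LPCM frame.

    Assumes native byte order and a frame length that is a multiple of 2 bytes.
    """
    samples = array.array("h")  # "h" = signed 16-bit integer
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return sum(abs(s) for s in samples) / len(samples)


def is_active(frame: bytes, activitylevel: int = 175) -> bool:
    """Hypothetical check: treat the frame as active at or above the threshold."""
    return average_magnitude(frame) >= activitylevel


# Example frames built from known sample values.
quiet = array.array("h", [10, -12, 8, -9]).tobytes()
loud = array.array("h", [1000, -1200, 900, -1100]).tobytes()
print(is_active(quiet), is_active(loud))
```

With the default of 175, the quiet frame (average magnitude 9.75) falls below the threshold and the loud frame (average magnitude 1050) falls above it.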



uttmaxgap

Specifies the maximum gap in seconds that can separate utterances for them to still be combined. During text processing, each utterance is buffered for at most uttmaxgap seconds, which determines whether subsequent utterances are considered for combination by text processing modifications such as numtrans and substitutions.


During real-time speech processing, uttmaxgap must be set to 0. Otherwise, utterances may be held back for possible modification, which increases utterance latency. With uttmaxgap set to 0, utterances are not buffered and therefore are not combined during text processing.
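The effect of the gap setting can be pictured with a small Python model. This is a simplified sketch, not the engine's logic: it merges every pair of neighbouring utterances whose gap is within uttmaxgap, whereas the engine only considers such utterances for possible combination. The Utterance tuple layout and the function name are hypothetical.

```python
from typing import List, Tuple

Utterance = Tuple[float, float, str]  # (start_s, end_s, text), hypothetical layout


def combine_utterances(utts: List[Utterance], uttmaxgap: float) -> List[Utterance]:
    """Merge neighbouring utterances separated by at most uttmaxgap seconds."""
    if uttmaxgap <= 0:
        # Modelled as "no buffering": every utterance is passed through unchanged.
        return list(utts)
    combined: List[Utterance] = []
    for start, end, text in utts:
        if combined and start - combined[-1][1] <= uttmaxgap:
            prev_start, _prev_end, prev_text = combined[-1]
            combined[-1] = (prev_start, end, prev_text + " " + text)
        else:
            combined.append((start, end, text))
    return combined


utts = [(0.0, 1.0, "call"), (1.5, 2.0, "five five five"), (5.0, 6.0, "thanks")]
print(combine_utterances(utts, uttmaxgap=1.0))
print(combine_utterances(utts, uttmaxgap=0))
```

In this model a gap of 1.0 second joins the first two utterances, while a gap of 0 leaves all three separate.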



uttmaxsilence

default is 800 ms

Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.

Refer to uttmaxsilence for more information on this parameter.



uttmaxtime

default is 150 seconds

Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.
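The interaction between silence-based cuts and the forced time limit can be sketched in Python as follows. This is an illustrative model under stated assumptions: audio is represented as a list of per-frame activity flags, the frame length is a made-up parameter, and the cut is placed at the start of the silent region, which is only one possible position "within" it.

```python
from typing import List, Optional, Tuple


def segment(active: List[bool], frame_ms: int = 10,
            uttmaxsilence_ms: int = 800,
            uttmaxtime_s: float = 150.0) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) utterances from per-frame activity flags."""
    max_len_ms = int(uttmaxtime_s * 1000)
    utterances: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the current utterance, in ms
    last_active_end = 0          # end of the most recent active frame, in ms
    for i, frame_active in enumerate(active):
        t0, t1 = i * frame_ms, (i + 1) * frame_ms
        if start is None:
            if frame_active:
                start, last_active_end = t0, t1
            continue
        if frame_active:
            last_active_end = t1
        elif t1 - last_active_end > uttmaxsilence_ms:
            # Silence exceeded the limit: cut at the start of the silent region.
            utterances.append((start, last_active_end))
            start = None
            continue
        if t1 - start >= max_len_ms:
            # No long enough silence before the time limit: terminate forcibly.
            utterances.append((start, t1))
            start = None
    if start is not None:
        utterances.append((start, last_active_end))
    return utterances


T, F = True, False
print(segment([T, T, F, F, T, F, F, F, F, T],
              frame_ms=100, uttmaxsilence_ms=300, uttmaxtime_s=1.0))
print(segment([T] * 25, frame_ms=100, uttmaxsilence_ms=300, uttmaxtime_s=1.0))
```

The first call shows a cut caused by a silence longer than 300 ms; the second shows continuous speech being split every second because no qualifying silence occurs.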



uttminactivity

default is 500 ms

Specifies how much activity, not counting uttpadding, is needed for audio to be classified as an utterance. This value is usually set lower when activitylevel or uttpadding is high, and higher when they are low.



uttpadding

default is 300 ms

Specifies how much padding around the active area is also treated as active. Typically, the higher the activitylevel, the more padding is needed; lower activity levels require less padding.
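One way to picture how the minimum-activity and padding settings work together is the Python sketch below. It assumes that the minimum-activity test is applied to each detected active region before padding, and that padded regions which overlap are merged; both are assumptions for illustration, as are the function name, the min_activity_ms parameter name, and the audio_len_ms clamp.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # (start_ms, end_ms) of detected activity


def pad_regions(regions: List[Region], min_activity_ms: int = 500,
                uttpadding_ms: int = 300,
                audio_len_ms: Optional[int] = None) -> List[Region]:
    """Drop regions with too little activity, then pad and merge the rest."""
    padded: List[Region] = []
    for start, end in sorted(regions):
        if end - start < min_activity_ms:
            continue  # not enough unpadded activity to count as an utterance
        new_start = max(0, start - uttpadding_ms)
        new_end = end + uttpadding_ms
        if audio_len_ms is not None:
            new_end = min(audio_len_ms, new_end)
        if padded and new_start <= padded[-1][1]:
            # Padded regions overlap: treat them as one active area.
            padded[-1] = (padded[-1][0], max(padded[-1][1], new_end))
        else:
            padded.append((new_start, new_end))
    return padded


print(pad_regions([(1000, 1200), (2000, 2800), (3200, 3900)], audio_len_ms=5000))
```

Here the 200 ms burst is discarded, and the two remaining regions grow by 300 ms on each side and merge into a single active area.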


vadtype

energy (default), level

The two types of Voice Activity Detection (VAD) available during transcription are energy and level. The energy setting instructs the engine to use the amount of energy in the audio signal to determine if speech might be present. This is the best setting to use when transcribing audio files (for post-call or batch transcription).

The level setting instructs the engine to use the simple amplitude level of the audio signal for VAD. This is the best setting to use when transcribing live audio streams (for in-call or real-time transcription) because it operates instantaneously, without the need for buffering.
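The guidance in this table for batch versus real-time use can be summarized as a small Python helper. This is a sketch only: the helper name is hypothetical, the key name vadtype is assumed for the VAD type option, and how these values are passed to the engine is not shown here.

```python
from typing import Dict, Union


def vad_settings(real_time: bool) -> Dict[str, Union[str, int]]:
    """Return VAD-related settings suggested by the descriptions above."""
    if real_time:
        # Live streams: amplitude-based VAD needs no buffering, and
        # uttmaxgap must be 0 so utterances are not held back.
        return {"vadtype": "level", "uttmaxgap": 0}
    # Audio files (post-call or batch): energy-based VAD, the default.
    return {"vadtype": "energy"}


print(vad_settings(real_time=True))
print(vad_settings(real_time=False))
```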