V-Blaze and V-Cloud Online Help

Voice Activity Detection Controls

The following parameters are most often used in real-time transcription scenarios using V‑Blaze.

Table 1. Voice Activity Detection controls

Name

Values

Description

activitylevel

integer

default is 175

Specifies the volume threshold that separates active audio from inactive audio. This value should be high enough to screen out noise, but low enough to trigger reliably on speech. The range is 0 to 32768, which corresponds to the average magnitude of a signed 16-bit LPCM frame.

uttmaxgap

integer

Specifies the maximum gap, in seconds, that can occur between two utterances for them to be combined into one. During text processing, each utterance is buffered for up to uttmaxgap seconds for possible combination with the following utterance before it is released for further processing.

Tip

During real-time speech processing, uttmaxgap must be set to 0.

uttmaxsilence

integer

default is 800 ms

Specifies the maximum amount of silence, in milliseconds, that can occur between speech sounds without terminating the current utterance. When a silence exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.

Refer to uttmaxsilence for more information on this parameter.

uttminactivity

integer

default is 500 ms

Specifies how much activity, in milliseconds and not counting uttpadding, is needed for audio to be classified as an utterance. This value is usually set lower when activitylevel or uttpadding is high, and higher when they are low.

uttpadding

integer

default is 300 ms

Specifies how much padding, in milliseconds, around the active area to treat as active. Typically, the higher the activitylevel, the more padding is needed; lower activitylevel values require less padding.

vadtype

energy (default), level

The two types of Voice Activity Detection (VAD) available during transcription are energy and level. The energy setting instructs the engine to use the amount of energy in the audio signal to determine if speech might be present. This is the best setting to use when transcribing audio files (for post-call or batch transcription).

The level setting instructs the engine to use the simple amplitude level of the audio signal for VAD. This is the best setting to use when transcribing live audio streams (for in-call or real-time transcription) because it operates instantaneously, without the need for buffering.
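To make the interaction between these controls more concrete, the following Python sketch shows one way a simple level-based detector could apply activitylevel, uttmaxsilence, uttminactivity, and uttpadding. It is an illustration only and is not the V‑Blaze engine's implementation. The function names frame_level and find_utterances, the 10 ms frame length, the use of a greater-than-or-equal comparison against activitylevel, and the choice to end each utterance at the last active frame plus padding are all assumptions made for the example.

```python
import struct
from typing import List, Sequence, Tuple

FRAME_MS = 10  # assumed frame length; not taken from the product documentation


def frame_level(frame: bytes) -> float:
    """Average magnitude of a frame of signed 16-bit little-endian LPCM samples."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return sum(abs(s) for s in samples) / count


def find_utterances(
    levels: Sequence[float],
    activitylevel: int = 175,
    uttmaxsilence: int = 800,
    uttminactivity: int = 500,
    uttpadding: int = 300,
    frame_ms: int = FRAME_MS,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans for a sequence of per-frame levels."""
    total_ms = len(levels) * frame_ms
    utterances: List[Tuple[int, int]] = []
    start = None      # start time of the first active frame in the candidate
    last_end = 0      # end time of the most recent active frame
    active_ms = 0     # accumulated active audio, excluding padding

    def emit() -> None:
        # Keep the candidate only if it holds enough unpadded activity,
        # then widen it by uttpadding on both sides.
        if active_ms >= uttminactivity:
            utterances.append(
                (max(0, start - uttpadding), min(total_ms, last_end + uttpadding))
            )

    for i, level in enumerate(levels):
        t = i * frame_ms
        if level >= activitylevel:
            if start is None:
                start, active_ms = t, 0
            active_ms += frame_ms
            last_end = t + frame_ms
        elif start is not None and (t + frame_ms) - last_end > uttmaxsilence:
            # Silence has exceeded uttmaxsilence: cut the utterance here.
            emit()
            start = None

    if start is not None:
        emit()  # flush the candidate still open at the end of the audio
    return utterances


if __name__ == "__main__":
    # 200 ms silence, 600 ms speech, 1000 ms silence, 300 ms speech, 200 ms silence
    demo = [0] * 20 + [1000] * 60 + [0] * 100 + [1000] * 30 + [0] * 20
    print(find_utterances(demo))
```

In this sketch, the 300 ms burst at the end of the demo input is discarded because it falls below the default uttminactivity of 500 ms, while the 600 ms burst is kept and widened by uttpadding. Raising activitylevel in such a scheme shortens the regions counted as active, which is consistent with the guidance above to pair a higher activitylevel with more padding or a lower uttminactivity.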