V-Blaze and V-Cloud Online Help (May 2020)

Voice Activity Detection Controls

The following parameters are most often used in real-time transcription scenarios using V‑Blaze.

Table 1. Voice Activity Detection controls

Name

Values

Description

activitylevel

integer

default is 175

Specifies the volume threshold that separates active from inactive audio. This value should be high enough to screen out noise, but low enough to trigger clearly on speech. The range is 0-32768, corresponding to the average magnitude of a signed 16-bit LPCM frame.
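The following Python sketch illustrates what "average magnitude of a signed 16-bit LPCM frame" can mean in practice. It is not the ASR engine's implementation: the function names, the little-endian byte order, and the use of a greater-than-or-equal comparison against activitylevel are all assumptions made for illustration.

```python
import struct


def average_magnitude(frame: bytes) -> float:
    """Mean absolute sample value of a signed 16-bit LPCM frame.

    Assumes little-endian samples; any trailing odd byte is ignored.
    """
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return sum(abs(s) for s in samples) / count


def is_active(frame: bytes, activitylevel: int = 175) -> bool:
    """Hypothetical check: treat a frame as active when its average
    magnitude reaches the activitylevel threshold."""
    return average_magnitude(frame) >= activitylevel
```

With the default threshold of 175, a frame whose samples average a magnitude of 15 would count as inactive in this sketch, while one averaging 750 would count as active.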

insecure

false (default), true

This option explicitly allows curl to perform "insecure" SSL connections and transfers. By default, curl attempts to secure every SSL connection by verifying it against the installed CA certificate bundle, so any connection considered "insecure" fails unless -k/--insecure is used.

This option is only relevant when HTTPS URLs are provided for callback or utterance_callback.

Refer to http://curl.haxx.se/docs/sslcerts.html for more details on this parameter.

realtime

false (default), true

Controls whether the ASR engine processes incoming audio in real-time mode. Real-time mode is enabled by a license setting and cannot be enabled with this parameter if the license does not enable it. This parameter is useful only for specifying that the ASR engine should not process incoming audio in real time even though real-time mode is enabled in the license.

utterance_callback

URL

Specifies the URL of a callback server to which each utterance in a transcription result is POSTed as it is transcribed. This option is mandatory for real-time speech processing. As used in the ASR engine, a callback is the address and (optionally) the method name and parameters of a web application that can receive data via HTTP or HTTPS. Callbacks are usually used to enable another application to receive and directly interact with the transcripts produced by the ASR engine.

Refer to utterance_callback for more information on this parameter.

uttmaxgap

integer

Specifies the maximum gap, in seconds, that can separate two utterances for them to be combined. During text processing, each utterance is buffered for a maximum of uttmaxgap seconds for possible combination with a subsequent utterance before being released for subsequent processing.

Tip

During real-time speech processing, uttmaxgap must be set to 0.
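The real-time requirements described above (utterance_callback is mandatory, and uttmaxgap must be set to 0) can be checked before a request is submitted. The following Python sketch is one way to do that. The helper name check_realtime_params and the example callback URL are hypothetical, and because no default for uttmaxgap is stated here, the sketch assumes the value must be supplied explicitly.

```python
def check_realtime_params(params: dict) -> list:
    """Return a list of problems with a set of real-time request parameters.

    Hypothetical helper: it encodes only the two rules stated in this table.
    """
    problems = []
    # Nothing to check unless real-time processing is requested.
    if str(params.get("realtime", "false")).lower() != "true":
        return problems
    if not params.get("utterance_callback"):
        problems.append("utterance_callback is required for real-time processing")
    # Assumption: uttmaxgap has to be given explicitly as 0.
    if str(params.get("uttmaxgap")) != "0":
        problems.append("uttmaxgap must be set to 0 for real-time processing")
    return problems


# Example parameter set; the callback URL is a placeholder.
example = {
    "realtime": "true",
    "utterance_callback": "https://example.com/utterances",
    "uttmaxgap": "0",
}
```

Calling check_realtime_params(example) returns an empty list, meaning neither rule is violated.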

uttmaxsilence

integer

default is 800 ms

Specifies the maximum amount of silence in milliseconds that can occur between speech sounds without terminating the current utterance. Once a silence occurs that exceeds uttmaxsilence milliseconds, an utterance “cut” is made within the detected silent region.

Refer to uttmaxsilence for more information on this parameter.

uttmaxtime

integer

default is 80 seconds

Specifies the maximum amount of time in seconds that is allotted for a spoken utterance. Normally an utterance is terminated by a sufficient duration of silence, but if no such period of silence is encountered prior to reaching uttmaxtime, the utterance is terminated forcibly.

uttminactivity

integer

default is 500 ms

Specifies how much activity, not counting uttpadding, is needed for a span of audio to be classified as an utterance. This value is usually lower when activitylevel and uttpadding are high, and higher when they are low.

uttpadding

integer

default is 300 ms

Specifies how much padding around the active area to treat as active. Typically, the higher the activitylevel, the more padding is needed; lower activity levels require less padding.
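To see how uttpadding, uttmaxsilence, uttmaxtime, and uttminactivity might interact, consider the simplified Python model below. It is only an illustration built from the descriptions in this table, not the ASR engine's actual algorithm. The function name segment, the 10 ms frame size, the per-frame activity flags as input, and the choice to cut at the start of the silent region are all assumptions.

```python
def segment(active, frame_ms=10, uttpadding=300, uttmaxsilence=800,
            uttmaxtime=80, uttminactivity=500):
    """Toy utterance segmentation over per-frame activity flags.

    active: list of booleans, one per frame of frame_ms milliseconds.
    Returns a list of (start_ms, end_ms) tuples.
    """
    n = len(active)
    pad = uttpadding // frame_ms

    # Step 1: treat frames within uttpadding of an active frame as active.
    padded = [False] * n
    for i, flag in enumerate(active):
        if flag:
            for j in range(max(0, i - pad), min(n, i + pad + 1)):
                padded[j] = True

    max_silence = uttmaxsilence // frame_ms
    max_frames = uttmaxtime * 1000 // frame_ms
    utterances = []

    def emit(start, end):
        # Step 3: keep the span only if it holds enough unpadded activity.
        if sum(active[start:end]) * frame_ms >= uttminactivity:
            utterances.append((start * frame_ms, end * frame_ms))

    # Step 2: walk the frames, cutting on long silence or on uttmaxtime.
    start = None
    silence = 0
    for i in range(n):
        if start is None:
            if padded[i]:
                start, silence = i, 0
            continue
        silence = 0 if padded[i] else silence + 1
        if silence > max_silence:
            emit(start, i - silence + 1)  # cut where the silence began
            start = None
        elif i - start + 1 >= max_frames:
            emit(start, i + 1)            # forced cut at uttmaxtime
            start = None
    if start is not None:
        emit(start, n - silence)          # close the final open utterance
    return utterances


# 600 ms of speech, a 2 s pause, then a 100 ms burst.
flags = [False] * 50 + [True] * 60 + [False] * 200 + [True] * 10 + [False] * 100
```

With the default values, segment(flags) returns [(200, 1400)]: the 600 ms of speech is kept along with 300 ms of padding on each side, while the 100 ms burst is discarded because it falls below uttminactivity.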