V-Blaze and V-Cloud Online Help (May 2020)

Receiving Language Identification Information

V‑Blaze’s Language Identification is a licensed feature that enables V‑Blaze to identify the languages used by the speakers in an audio file. Language Identification is activated by setting the lid tag to one of the following values:

  • lid=true - automatically selects the language identification model based on the LID and language models that are available in the V‑Blaze installation.

  • lid=language_model - uses the named language model as the alternate to the one specified by the model tag. Specify this option only if you need to choose between multiple alternate language models.

The following parameters provide additional options when using the lid tag:

  • lidmaxtime (default 20.0 seconds) - maximum audio duration (seconds) to analyze. For example, if lidmaxtime=20, the ASR engine will analyze 20 seconds of audio at most.

  • lidthreshold (default 0.7) - specifies the required confidence level before lid will stop analyzing audio.

Note

Language identification stops analyzing audio once its confidence exceeds the value specified by lidthreshold, or once the amount of audio analyzed reaches the limit set by lidmaxtime.
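The note above amounts to a simple stopping rule. The following Python sketch illustrates that logic only; the function name and signature are illustrative and are not part of the V‑Blaze API:

```python
def should_stop_lid(confidence, analyzed_seconds,
                    lidthreshold=0.7, lidmaxtime=20.0):
    """Illustrative stopping rule for language identification:
    stop once confidence exceeds lidthreshold, or once the amount
    of analyzed audio reaches lidmaxtime seconds."""
    return confidence > lidthreshold or analyzed_seconds >= lidmaxtime
```

With the defaults, analysis would stop at 10 seconds in if confidence reached 0.8, but would run the full 20 seconds if confidence stayed at 0.5.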

When Language Identification is activated, the transcript contains a section like the following example from stereo audio:

"lidinfo": {
    "0": {
        "conf": 1.0,
        "lang": "spa",
        "speech": 8.4499999999999993
    },
    "1": {
        "conf": 0.92000000000000004,
        "lang": "spa",
        "speech": 0.98999999999999999
    }
},

The lidinfo section is a global, top-level dictionary. If the source audio is multi-channel, it contains one dictionary per audio channel; if the source audio is mono, it contains no channel subdivisions. Each dictionary contains three fields:

  • lang - the three-letter language code specifying the language that was identified for the stream

  • speech - the number of seconds of automatically detected speech that were used to determine the language used in the stream

  • conf - the confidence score of the language identification decision
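A transcript consumer can read these fields per channel. The following is a minimal Python sketch; summarize_lidinfo is a hypothetical helper, and the mono branch assumes the three fields appear directly under lidinfo when there are no channel subdivisions:

```python
def summarize_lidinfo(lidinfo):
    """Map each channel to its identified language and confidence.
    Multi-channel transcripts key one dict per channel ("0", "1", ...);
    for mono audio, lang/speech/conf are assumed to sit directly
    under lidinfo with no channel subdivision."""
    if "lang" in lidinfo:  # mono: no per-channel dictionaries
        return {"0": (lidinfo["lang"], lidinfo["conf"])}
    return {ch: (info["lang"], info["conf"])
            for ch, info in lidinfo.items()}

stereo = {
    "0": {"conf": 1.0, "lang": "spa", "speech": 8.45},
    "1": {"conf": 0.92, "lang": "spa", "speech": 0.99},
}
print(summarize_lidinfo(stereo))  # both channels identified as Spanish
```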

In addition, the JSON transcript includes the following fields:

  • model - as a top-level field that reports the language model that was specified by the model tag. For example:

    "model": "eng1:survey"

  • model - as a field in the metadata dictionary for each element of the top-level utterances array. The model field for each utterances element identifies the language model that language identification selected for transcribing that utterance.
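To see which models were actually applied, the two fields can be compared directly. The following is a minimal Python sketch; models_used is a hypothetical helper name, and transcript here is one element of the top-level array shown in the sample JSON file:

```python
def models_used(transcript):
    """Return the model requested via the model tag (top-level field)
    and the model that language identification selected for each
    utterance (from the metadata dictionary of each utterances element)."""
    requested = transcript["model"]
    selected = {u["metadata"]["uttid"]: u["metadata"]["model"]
                for u in transcript["utterances"]}
    return requested, selected
```

For the sample below, this would report "eng1:survey" as requested and "spa1:survey" as selected for utterance 0.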

A sample JSON file that shows examples of all of these fields in context is the following:

[
    {
        "confidence": 0.94999999999999996,
        "donedate": "2017-07-19 17:49:17.395902",
        "lidinfo": {
            "0": {
                "conf": 1.0,
                "lang": "spa",
                "speech": 8.4499999999999993
            },
            "1": {
                "conf": 0.92000000000000004,
                "lang": "spa",
                "speech": 0.98999999999999999
            }
        },
        "model": "eng1:survey",
        "recvdate": "2017-07-19 17:49:16.013155",
        "recvtz": [
            "EDT",
            -14400
        ],
        "source": "ispeech_usspanishfemale_survey_example_025.wav",
        "utterances": [
            {
                "confidence": 0.94999999999999996,
                "donedate": "2017-07-19 17:49:17.395902",
                "end": 11.67,
                "events": [
                    {
                        "confidence": 0.94999999999999996,
                        "end": 0.25,
                        "start": 0.0,
                        "word": "Yo"
                    },
                    {
                        "confidence": 0.94999999999999996,
                        "end": 0.66999999999999995,
                        "start": 0.25,
                        "word": "habla"
                    },...
                ],
                "metadata": {
                    "channel": 0,
                    "model": "spa1:survey",
                    "source": "vocab7@host1",
                    "uttid": 0
                },
                "recvdate": "2017-07-19 17:49:16.013155",
                "recvtz": [
                    "EDT",
                    -14400
                ],
                "start": 0
            }
        ]
    }
]

As an example of calling this API using the cURL command, you would use a command like the following to submit the file sample1.wav for transcription with language identification activated:

curl -F lid=true \
     -F file=@sample1.wav \
     http://vblaze_name:17171/transcribe

The response to this POST command is a JSON transcript, sample1.json, that includes a lidinfo section.
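Once the response is saved, the lidinfo section can be read with any JSON library. The following is a minimal Python sketch; the abridged inline string stands in for the real response body, which carries the full structure shown above:

```python
import json

# Abridged stand-in for the JSON transcript returned by /transcribe;
# a real response would include utterances, model, and timing fields.
response_body = '{"model": "eng1:survey", "lidinfo": {"0": {"conf": 1.0, "lang": "spa", "speech": 8.45}}}'

transcript = json.loads(response_body)
for channel, info in transcript["lidinfo"].items():
    print(channel, info["lang"], info["conf"])
```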