V-Blaze and V-Cloud Online Help

Receiving Language Identification Information

Language identification (LID) information is written in a lidinfo section of the JSON transcript of an audio file. The JSON transcript also contains information about the language model specified during transcription and the model selected by the language identification module to transcribe each utterance.

The lid parameter enables the ASR engine's language identification module, which identifies the language spoken in the input audio. When lid identifies the language, the matching language model is automatically selected to transcribe the audio. For example, if Spanish is detected, the resulting transcript is in Spanish. If you require an alternate model, specify it using lid=language_model.

Refer to the lid parameter reference for more detail on additional lid options and how to use them.

The following cURL example submits the file sample1.wav for transcription with language identification enabled:

curl -F lid=true \
     -F file=@sample1.wav \
     http://vblaze_name:17171/transcribe
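
The same request can be submitted programmatically. The following is a minimal Python sketch, assuming the third-party requests library and the same placeholder host name used above:

import requests

# Minimal sketch: submit sample1.wav with LID enabled. Pass a model name
# instead of "true" (lid=language_model) to request an alternate model.
response = requests.post(
    "http://vblaze_name:17171/transcribe",
    data={"lid": "true"},
    files={"file": open("sample1.wav", "rb")},
)
response.raise_for_status()
transcript = response.json()  # the JSON transcript, e.g. sample1.json
print(transcript["lidinfo"])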

The response to this POST command is a JSON transcript, sample1.json, that includes a lidinfo section, as in the following stereo audio example:

"lidinfo": {
    "0": {
        "conf": 1.0,
        "lang": "spa",
        "speech": 8.4499999999999993
    },
    "1": {
        "conf": 0.92000000000000004,
        "lang": spa",
        "speech": 0.98999999999999999
    }
},

The lidinfo section is a global, top-level dictionary. If the source audio is multi-channel, it contains one dictionary per audio channel, keyed by channel number; if the source audio is mono, the fields appear directly in lidinfo with no channel subdivisions. Each dictionary contains three fields:

  • conf - the confidence score of the language identification decision

  • lang - the three-letter language code specifying the language that was identified for the stream

  • speech - the number of seconds of automatically detected speech that were used to determine the language used in the stream
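
As a minimal sketch of how these fields might be read (assuming the transcript above has been saved as sample1.json), the following Python handles both the mono and multi-channel layouts:

import json

with open("sample1.json") as f:
    transcript = json.load(f)

lidinfo = transcript["lidinfo"]
# Mono audio: the fields appear directly in lidinfo.
# Multi-channel audio: one sub-dictionary per channel, keyed by channel number.
channels = {"0": lidinfo} if "lang" in lidinfo else lidinfo

for channel, info in channels.items():
    print(f"channel {channel}: lang={info['lang']} "
          f"conf={info['conf']:.2f} speech={info['speech']:.2f}s")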

The JSON transcript also includes the following additional fields:

  • model - as a top-level field that reports the language model specified by the model parameter. For example:

    "model": "eng1:survey"

  • model - as an additional field in the metadata dictionary for each element of the utterances array. The model field identifies the language model that was selected by LID to transcribe that utterance (see the sketch following this list). For example:

     "metadata": {
                    "source": "spa-eng-sample.wav",
                    "model": "eng1:callcenter",
                    "uttid": 0,
                    "channel": 0
                }

  • langinfo - a breakdown of language information that is added when more than one language was detected. For example:

     "langinfo": {
                "spa": {
                    "utts": 1,
                    "speech": 17.46,
                    "conf": 1.0,
                    "time": 21.56
                },
                "eng": {
                    "utts": 1,
                    "speech": 1.35,
                    "conf": 0.81,
                    "time": 0.93
                }
     }

  • langfinal - added to the lidinfo object when a language has been detected in the audio channel, but with a confidence below the lidthreshold value. In these cases, the default model (or the model specified by the model parameter) is used to transcribe the audio instead. langfinal indicates the language of the model that was actually used to transcribe the audio. For example:

      "lidinfo": {
                    "lang": "spa",
                    "speech": 1.35,
                    "langfinal": "eng",
                    "conf": 0.81
                 }
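
Putting these fields together, the following sketch (again assuming a transcript saved as sample1.json, with the field layout shown in the examples above) reports the model selected for each utterance and flags any channel where LID fell back to a model other than the detected language:

import json

with open("sample1.json") as f:
    transcript = json.load(f)

print("requested model:", transcript.get("model"))

# Model selected by LID for each utterance, from its metadata dictionary.
for utt in transcript.get("utterances", []):
    meta = utt["metadata"]
    print(f"utterance {meta['uttid']} (channel {meta['channel']}): "
          f"model={meta['model']}")

# langfinal appears when LID confidence was below lidthreshold, so a
# fallback model was used instead of the detected language.
lidinfo = transcript["lidinfo"]
channels = {"0": lidinfo} if "lang" in lidinfo else lidinfo
for channel, info in channels.items():
    if "langfinal" in info and info["langfinal"] != info["lang"]:
        print(f"channel {channel}: detected {info['lang']} "
              f"(conf {info['conf']}) but transcribed as {info['langfinal']}")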
    

Refer to Language Support for information on supported languages.