V-Spark Online Help

The utterances Array

The top-level utterances element contains an array of speech segment information. Each entry in the utterances array consists of the following elements.
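As a rough sketch of how this structure is typically consumed, the following Python fragment loads a transcript and walks its utterances array; the file name transcript.json and the fields printed are illustrative, not required by the format.

import json

# Minimal sketch: load a JSON transcript and walk the top-level "utterances"
# array described in Table 1. The file name is a placeholder.
with open("transcript.json", encoding="utf-8") as f:
    transcript = json.load(f)

for utt in transcript.get("utterances", []):
    # Each entry is one speech segment; see the field descriptions below.
    print(utt["start"], utt["end"], utt.get("sentiment"), utt.get("confidence"))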

Table 1. Data contained in the utterances Array

Each entry below gives the object name together with its availability and type, followed by its definition.

emotion (Availability: All; Type: value)

Emotional intelligence value that combines both acoustic and linguistic information. Each utterance can be given one of the following values:

  • Positive

  • Mostly Positive

  • Neutral

  • Mostly Negative

  • Negative

confidence (Availability: All; Type: value)

A measure of how confident the speech recognition system is in its utterance transcription results.

  • Range between 0 and 1

  • 1 is most confident
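As a small illustration of how this value might be used downstream, the following sketch flags utterances whose confidence falls below a chosen cutoff; the 0.5 threshold and the function name are arbitrary choices, not part of the format.

def low_confidence_utterances(utterances, threshold=0.5):
    # Illustrative helper: return (start, end, confidence) for utterances
    # whose transcription confidence is below an arbitrary threshold.
    return [
        (utt["start"], utt["end"], utt["confidence"])
        for utt in utterances
        if utt["confidence"] < threshold
    ]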

end (Availability: All; Type: value)

End time of the utterance in seconds

recvtz (Availability: All; Type: array)

An array containing two values:

  • the abbreviation of the time zone in which the ASR engine is running

  • the offset in seconds from UTC of the time on the ASR engine
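For instance, a hypothetical recvtz value of ["EST", -18000] describes an engine clock five hours behind UTC; the sketch below turns such a value into a fixed offset.

from datetime import timedelta, timezone

recvtz = ["EST", -18000]  # hypothetical value: EST, 18000 seconds behind UTC

tz_name, utc_offset_seconds = recvtz
engine_tz = timezone(timedelta(seconds=utc_offset_seconds), tz_name)
print(engine_tz)  # EST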

sentiment (Availability: All; Type: value)

Utterance-level linguistic sentiment value:

  • Positive

  • Mostly Positive

  • Neutral

  • Mostly Negative

  • Negative

  • Mixed (contains both Positive and Negative)

gender (Availability: All; Type: value)

Gender prediction of the speaker

rawemotion (Availability: All; Type: value)

Acoustic emotion values (version 7.1+):

  • ANGRY

  • NEUTRAL

  • HAPPY

Acoustic emotion values (prior to version 7.1):

  • NONANGRY

  • ANGRY

lidinfo (Availability: V‑Blaze version 7.1+; Type: object)

The lidinfo section is a global, top-level dictionary. It contains one dictionary per audio channel if the source audio is multi-channel, or no channel subdivisions if the source audio is mono. Each dictionary contains the following fields:

  • lang - the three-letter language code of the language identified for the stream

  • speech - the number of seconds of automatically detected speech used to determine the language of the stream

  • conf - the confidence score of the language identification decision

  • langfinal - added only when the language identified by LID is below the confidence threshold and is not the default language

For example:

"lidinfo": {
                "lang": "spa",
                "speech": 17.46,
                "conf": 1.0
            }
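The sketch below normalizes a lidinfo value into a per-channel mapping, following the mono/multi-channel behavior described above; the assumption that multi-channel keys are numeric strings is illustrative and should be checked against real output.

def lidinfo_by_channel(lidinfo):
    # Mono audio: the fields appear directly, with no channel subdivisions.
    if "lang" in lidinfo:
        return {0: lidinfo}
    # Multi-channel audio: one dictionary per channel. Numeric string keys
    # are an assumption here, not confirmed by this page.
    return {int(channel): info for channel, info in lidinfo.items()}

print(lidinfo_by_channel({"lang": "spa", "speech": 17.46, "conf": 1.0}))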

sentimentex (Availability: All; Type: array)

Contains sentiment information for each utterance:

  • [0][0] = the count of Positive phrases in the utterance

  • [0][1] = the count of Negative phrases in the utterance

  • [1] = an array of sentiment segments; in each segment, element [0] is '+' or '-' (Positive or Negative), and element [1] is the position range of the phrase, where [0] is the beginning position and [1] is the end position
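The index layout is easier to see in code. The sketch below unpacks a made-up sentimentex value into labelled spans; the assumption that the positions index into the utterance's events array is illustrative, not stated by this page.

def decode_sentimentex(sentimentex):
    # [0] holds the phrase counts, [1] holds the sentiment segments.
    (positive_count, negative_count), segments = sentimentex
    spans = [
        {"label": "Positive" if sign == "+" else "Negative",
         "begin": begin, "end": end}
        for sign, (begin, end) in segments
    ]
    return positive_count, negative_count, spans

# Made-up value: one Positive and one Negative phrase with their positions.
print(decode_sentimentex([[1, 1], [["+", [0, 2]], ["-", [5, 7]]]]))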

start (Availability: All; Type: value)

Start time of the utterance in seconds

donedate (Availability: All; Type: value)

Date and time the utterance transcription was completed by the speech-to-text engine

recvdate (Availability: All; Type: value)

Date and time the utterance was received by the speech-to-text engine

events (Availability: All; Type: array)

Contains information about individual words. Each element is a word object that contains the following values:

  • confidence: word-level transcription confidence value between 0 and 1

  • end: end time of the word in seconds

  • start: start time of the word in seconds

  • word: the normalized word

  • wordex: the raw dictionary word. This value is often used to disambiguate different pronunciations that have the same spelling.

For example:

"events": [
                {
                    "confidence": 0.69,
                    "end": 2.32,
                    "start": 1.81,
                    "word": "Stephanie"
                },
                {
                    "confidence": 0.76,
                    "end": 2.74,
                    "start": 2.32,
                    "word": "so"
                }
            ]
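A common use of events is to rebuild the utterance text with word-level confidence. A minimal sketch using the two-word fragment above:

events = [
    {"confidence": 0.69, "end": 2.32, "start": 1.81, "word": "Stephanie"},
    {"confidence": 0.76, "end": 2.74, "start": 2.32, "word": "so"},
]

text = " ".join(event["word"] for event in events)
average_confidence = sum(event["confidence"] for event in events) / len(events)
print(text, round(average_confidence, 3))  # Stephanie so 0.725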

metadata (Availability: All; Type: object)

Information about the utterance, such as its audio source, channel, and the model that decoded it. Each metadata object contains the following values:

  • channel: channel number

  • model: model that decoded the utterance

  • source: audio file name

  • nsubs: (V‑Blaze 7.1+) count of substitutions applied for the utterance, not including numtrans counts.

  • uttid: utterance segment number

  • substinfo: (V‑Blaze 7.1+) detailed substitution information, included only when substinfo=true is specified

    • nsubs: (V‑Blaze 7.1+) count of substitutions applied for the utterance, including numtrans counts.

For example, with substitution details enabled (substinfo=true):

   "metadata": {
                "uttid": 1,
                "substinfo": {
                    "subs": [
                        [
                            19.56,
                            20.14,
                            [
                                {
                                    "source": "subst_rules",
                                    "end": 20.14,
                                    "sub": "persona => /Persona/",
                                    "rule": "0",
                                    "start": 19.56
                                }
                            ]
                        ]
                    ],
                    "nsubs": 1
                },
                "source": "spa-eng-sample.wav",
                "channel": 0,
                "model": "spa3:travel",
                "nsubs": 1
            } "metadata": {
                "source": "spa-eng-sample.wav",
                "model": "eng1:callcenter",
                "uttid": 0,
                "channel": 0
            }
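As a final illustration, the sketch below groups utterances by audio channel using their metadata and records which model decoded each one; the grouping and the fields read are illustrative.

from collections import defaultdict

def utterances_by_channel(utterances):
    # Illustrative helper: bucket utterances by metadata channel and keep
    # the segment number and decoding model for each one.
    channels = defaultdict(list)
    for utt in utterances:
        meta = utt.get("metadata", {})
        channels[meta.get("channel", 0)].append(
            {"uttid": meta.get("uttid"), "model": meta.get("model")}
        )
    return dict(channels)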