V-Spark Online Help


Automatic Speech Recognition Glossary

Accuracy (ACC)

A metric that indicates the level of correctness of a given transcript. Accuracy is calculated with the following formula:

Equation 1.
\[ ACC = 1 - \frac{S + I}{N - D + I} \]

S = number of substitution errors

I = number of insertion errors

D = number of deletion errors

N = number of words in reference transcript
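
As an illustration, Equation 1 can be evaluated directly from the four counts. The following is a minimal sketch; the function name `accuracy` and its parameter names are assumptions made for this example and are not part of any Voci product.

```python
def accuracy(substitutions: int, insertions: int, deletions: int, reference_words: int) -> float:
    """Return ACC = 1 - (S + I) / (N - D + I), as given in Equation 1."""
    denominator = reference_words - deletions + insertions
    if denominator <= 0:
        raise ValueError("N - D + I must be positive")
    return 1 - (substitutions + insertions) / denominator


# 100 reference words with 5 substitutions, 3 insertions, and 2 deletions.
print(round(accuracy(5, 3, 2, 100), 4))  # 0.9208
```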

Acoustic Model

A component of the model used by the ASR engine for transcription that converts audio into a stream of sound symbols specific to a language, such as English or Spanish.

After Call Work (ACW)

The period of time immediately after contact with the customer ends, during which the Agent completes any supplementary work related to that interaction.

Agent/Client Voice Clarity

A measure of the speech recognizer’s confidence in its transcription accuracy. Higher clarity scores indicate that the call was transcribed with higher accuracy.


A-law

Audio compressed with the A-law companding algorithm, primarily used in digital telecom systems in Europe.

American Standard Code for Information Interchange (ASCII)

A character encoding in which each character is represented by a number. Because computers store only numbers, an ASCII code is the numerical representation of a character.


AMI

Amazon Machine Image.


Application

A feature of V‑Spark that allows users to create custom analyses of calls by entering sets of search queries and filters to group calls into upper-level categories, lower-level categories, and leaf-level categories. Leaf-level categories have no lower-level categories of their own.


ARPA

A commonly used file format for language models.

Audio Source

Refers to the number of channels in the original audio file. Mono indicates that the audio file has only one channel, while Stereo indicates that it has two channels.

Automatic Call Distributor (ACD)

A telephony system that recognizes, answers, and routes incoming calls.

Automatic Number Identification (ANI)

A feature of the telephony network that captures a caller's identifying telephone number.

Automatic Speech Recognition (ASR)

Technologies that perform independent, computer-driven transcription of spoken language into readable text (often known as Speech to Text or STT), enabling users to use speech rather than a keyboard as the source of input data for a computer system.

Average Call Time (ACT)

The total amount of time that a customer is engaged with an agent, including call transfers, hold time, talk time, and so on.

Average Handle Time (AHT)

A call center metric that represents the total amount of work related to customer/agent transactions, including the Average Talk Time (ATT), hold time, and any administrative tasks that must be done for per-call record-keeping (often referred to as After Call Work, ACW), divided by the number of calls that were handled. This metric is used to summarize and analyze agent performance.
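
The arithmetic behind this metric can be sketched as follows. This is an illustrative example only: the function name and the assumption that talk, hold, and After Call Work times are available as totals in seconds are hypothetical and do not describe how V‑Spark stores or reports these values.

```python
def average_handle_time(talk_seconds: float, hold_seconds: float,
                        acw_seconds: float, calls_handled: int) -> float:
    """Return the average handle time in seconds per call.

    Inputs are totals across all handled calls (an assumption for this sketch).
    """
    if calls_handled <= 0:
        raise ValueError("calls_handled must be positive")
    return (talk_seconds + hold_seconds + acw_seconds) / calls_handled


# 20 calls with 3600 s of talk time, 600 s of hold time, and 1200 s of After Call Work.
print(average_handle_time(3600, 600, 1200, 20))  # 270.0
```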

Benchmark WER

Benchmark word error rate (internal only).

Calling Line Identity (CLI)

A feature of the telephony network in the UK that captures a caller's identifying telephone number.

Call Silence Time

The percentage of the call that was silent, meaning no speakers were talking.


CCV

Credit Card Validation.


Churn

Refers to the act of cancelling a subscription or service, or the overall rate at which customers cancel their accounts or otherwise stop doing business with an entity; sometimes short for "churn rate."


Clarity

Quantifies how clear the audio sounds with respect to signal strength, background noise, or a strong accent.

Compatible Versions

ASR engine versions compatible with the language model.


Confidence

An estimate of the probability that the correct words were selected during decoding.

Customer Service Representative (CSR)

A person employed in a call center to answer the phone. Also called an Agent or Adviser.


Decoding

The process by which the ASR engine converts speech to text.


Diarization

The process of splitting mono audio into two channels, according to speaker. Higher diarization scores indicate that the speech recognizer separated speaker phrases more accurately.


Domain

The industry that the language model was specifically tuned to transcribe.


EMS

Element Management System. Consists of systems for managing network elements.

Features Supported

Features that are supported with the language model.

First Call (Contact) Resolution (FCR)

A measure of relative success for an individual interaction. Usually defined in terms of a single customer or account, a single issue or order, and a predefined time range for a response to have taken place.


FLAC

An audio coding format for lossless compression of digital audio.

Integrated Services Digital Network (ISDN)

A digital network providing 64 kbit/s and 2 Mbit/s voice and data circuits.

Interactive Voice Response (IVR)

A telephone system that lets callers interact with your company through either touch tone or speech recognition.


JSON

Short for JavaScript Object Notation. Allows for information to be stored in a hierarchical fashion within a single document.
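
The hierarchical structure described above can be illustrated with a short sketch that parses a JSON document and reads a nested value. The keys used here (`call`, `agent`, `duration_seconds`, `tags`) are invented for this example and are not the Voci JSON format.

```python
import json

# A hypothetical document; objects and arrays nest inside one another.
document = """
{
  "call": {
    "agent": "A. Smith",
    "duration_seconds": 182,
    "tags": ["billing", "refund"]
  }
}
"""

data = json.loads(document)

# Walk down the hierarchy: object -> object -> array element.
print(data["call"]["tags"][1])  # refund
```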

JSON Transcript

A transcript and associated metadata produced from an audio file, in JSON format. The Voci JSON format is documented in the Voci JSON Output Format Guide.

Language Model

A component of the model used by the ASR engine for transcription that converts the stream of sound symbols from an acoustic model into text. The term "language model" may also be used to refer to the combination of an acoustic and language model.

Last Updated

The month and year that the language model was last updated.


Latency

The period of time that elapses between submitting input to a system and receiving its output.

Lightweight Directory Access Protocol (LDAP)

An open industry standard set of protocols for accessing information directories.


Maturity

The level of robustness of the language model. This refers to the amount of data that has been used to train the model. Refer to Language Model Maturity for more information.

Model ID

The identifier recognized by the ASR engine and passed in when requesting the model directly in transcription requests.

MP3

A standard format for compressing audio files. This format is lossy and optimized for music, so using audio in this format for speech transcription may result in lower accuracy from the ASR engine.

Out-of-Vocabulary (OOV)

A word is considered out-of-vocabulary if it is not included in the language model.
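
A minimal sketch of an out-of-vocabulary check is shown below. It assumes that a model's vocabulary can be represented as a simple set of lowercase words; the `vocabulary` set and the function name are hypothetical and do not reflect how the ASR engine stores its language model.

```python
def out_of_vocabulary(words, vocabulary):
    """Return the words that do not appear in the vocabulary (case-insensitive)."""
    return [word for word in words if word.lower() not in vocabulary]


# Hypothetical vocabulary; a real language model contains far more words.
vocabulary = {"please", "cancel", "my", "account"}
print(out_of_vocabulary("Please cancel my Zyntrax account".split(), vocabulary))  # ['Zyntrax']
```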


Overtalk

Overtalk occurs when speakers talk over one another. A recording's overtalk percentage is the count of Agent-initiated overtalk turns as a percentage of the total number of Agent-speaking turns. In other words, out of all of the Agent’s turns, it measures how many turns interrupted a Client’s turn.
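
The percentage described above can be sketched as follows. The input representation, a list of `(speaker, interrupted_other_speaker)` pairs, is an assumption made for this example; it is not the structure of V‑Spark transcripts.

```python
def overtalk_percentage(turns):
    """Return Agent-initiated overtalk turns as a percentage of all Agent turns.

    `turns` is a list of (speaker, interrupted_other_speaker) pairs,
    a hypothetical representation chosen for this sketch.
    """
    agent_turns = [interrupted for speaker, interrupted in turns if speaker == "agent"]
    if not agent_turns:
        return 0.0
    return 100.0 * sum(agent_turns) / len(agent_turns)


turns = [
    ("agent", False),
    ("client", False),
    ("agent", True),    # the Agent started talking over the Client
    ("client", True),   # Client-initiated overtalk is not counted
    ("agent", False),
    ("agent", False),
]
print(overtalk_percentage(turns))  # 25.0
```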

Package Name

The install package filename for the language model.


PCM

The Pulse-Code Modulation audio format.

Percent Words Correct (PWC)

Substitution Error Count

Private Branch Exchange (PBX)

A type of multiline telephone system commonly used within businesses and call centers.


Redaction

Refers to the ability to automatically redact sensitive content from audio files and their transcriptions. This information can be redacted from the audio file, the transcription, or both.


Regex

Short for regular expression. Allows users to perform more complex searches using variables to stand in for characters, words, and character patterns.
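
As an illustration of a pattern standing in for several word forms, the sketch below uses Python's `re` module. The pattern and sample text are invented for this example, and the regular expression syntax accepted by V‑Spark search may differ from Python's.

```python
import re

# Matches "refund", "refunds", "refunded", or "refunding" as whole words.
pattern = re.compile(r"\brefund(s|ed|ing)?\b", re.IGNORECASE)

text = "I asked for a refund, but it was not refunded."
matches = [match.group(0) for match in pattern.finditer(text)]
print(matches)  # ['refund', 'refunded']
```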

Region or Dialect

The region or dialect that the language model was developed for.

Resource Interchange File Format (RIFF)

A generic file container format for storing data in tagged chunks. It is primarily used to store multimedia such as sound and video.

Side Classification

A process available in V‑Spark that examines text generated from two different speakers on the same call and designates each speaker as either agent or client.

Side classification is automatically applied to transcribed text coming into a folder that is configured for single-channel, two-speaker operation.

Speech-to-Text (STT)

The process of transcribing audio files to text using Automatic Speech Recognition.


Open source speech recognition software toolkit


Stream

A sequence of computer data that is made available over time. This term is commonly used in the context of streaming audio or streaming video, which you can listen to or watch as the data is continually delivered to the device that you are using.

Stream Tag

A name/value pair that, when specified, modifies the way in which audio is transcribed by V‑Blaze, V‑Cloud, or the Voci ASR module. These parameters can also be used to tag output with user-level metadata.

Talk Time

The amount of time an agent spends handling a customer call from start to finish.


Throughput

A measure of the number of units that can be processed in a given amount of time.

Turn-Around Time (TAT)

The amount of time between submitting an audio file for transcription and receiving the transcribed text.


µ-law

Audio compressed with the µ-law companding algorithm, primarily used in digital telecom systems in North America and Japan.


Utterance

An uninterrupted chain of spoken language by a single speaker. An utterance is a region of speech audio that ends either when a period of silence exceeds a threshold duration or when the region reaches the maximum utterance duration threshold.


Version

The version of the language model.


Voci JSON

JSON output produced by one of Voci's products that typically contains all or some of a transcript produced from an audio file, plus metadata. The Voci JSON format is documented in the JSON Output Format Guide.

Voice Activity Detection (VAD)

An early portion of the Automatic Speech Recognition process, voice activity detection identifies the portions of an audio file that correspond to speech. This improves performance by not spending processing time attempting to decode audio that is not relevant to speech, such as music or background noise.

Voice over Internet Protocol (VoIP)

A technology that enables use of the internet as the transmission medium for phone calls.


WAV

A file format for digitized audio with a RIFF header at the beginning. It is not an audio encoding, and the audio after the RIFF header can be highly compressed or not compressed at all.
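
The RIFF header can be observed with Python's standard `wave` module. The sketch below writes one second of silent, uncompressed 16-bit mono audio to memory and then inspects the first bytes; the sample rate and content are arbitrary choices made for this example. Note that the `wave` module handles only uncompressed PCM data, which is one of several encodings a WAV file can carry.

```python
import io
import wave

# Write one second of silent 16-bit mono PCM audio at 8 kHz into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 8000)

data = buffer.getvalue()

# The file begins with a RIFF chunk identifier, followed by the WAVE form type.
print(data[0:4], data[8:12])  # b'RIFF' b'WAVE'

# Reading the header back recovers the channel count and sample rate.
with wave.open(io.BytesIO(data), "rb") as wav_file:
    print(wav_file.getnchannels(), wav_file.getframerate())  # 1 8000
```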

Word Error Rate (WER)

The total number of errors (insertions, deletions, and substitutions) divided by the number of words in the reference transcript.
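
One common way to count these errors is a word-level edit distance between a reference transcript and the transcript being scored. The following is a minimal sketch under that assumption; it divides by the number of reference words, splits on whitespace, and compares words exactly, none of which is confirmed here as the scoring method used by Voci tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing from hypothesis
                current[j - 1] + 1,      # insertion: extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("my" -> "the") and one deletion ("today") over 5 reference words.
print(word_error_rate("please cancel my account today", "please cancel the account"))  # 0.4
```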