ASR Engine Capabilities & Features
Voci's automatic speech recognition (ASR) engine powers accurate and scalable Speech-to-Text (STT) solutions. Whether your call volume is measured in hundreds or millions of hours per month, the VociASR engine enables you to automatically generate high-quality transcripts from 100% of your speech audio assets.
Voci uses deep neural networks and deep belief networks in a proprietary configuration to convert speech to intelligent data. Voci speech recognition uses a combination of assisted and unassisted machine learning and is based on Large Vocabulary Continuous Speech Recognition (LVCSR) technology. LVCSR recognizes phonemes like a phonetic system, then applies a dictionary or language model to produce a full transcript. Accuracy is much higher than the single-word lookup of a phonetic approach, and the resulting transcript is much easier and faster for contact centers to search and use.
The ASR engine uses language models tuned for telephony-based communications such as customer service call center interactions, voicemail, phone sales, and similar audio. The system caters to continuous, spontaneous, uncooperative speech. Speech of this type typically occurs during a phone call between an agent and a caller, or in a voicemail, where callers typically leave spontaneous messages.
The following table describes real-time and post-call features of the ASR engine.
Feature | Real-time* | Post-call | Description
---|---|---|---
Transcription | ✓ | ✓ | Transcribes digitized audio to text.
Punctuation | ✓ | ✓ | Adds punctuation and capitalization. Fully punctuated transcripts significantly improve speech analysis by increasing the understanding of the caller's intended meaning.
Word count | ✓ | ✓ | Provides the total number of words for each call, along with other related counts, depending on parameters.
Substitutions | ✓ | ✓ | Rules-based approach to substituting commonly misidentified words in the transcription. Voci's AutoSubs process can automatically identify words that are typically transcribed incorrectly.
Emotion | ✓ | ✓ | Classifies emotion based on combined acoustic features and word sentiment scores. Values include strongly positive, positive, neutral, negative, and strongly negative. Scoring is available at the call and individual-utterance level. Raw emotion scoring is also available.
Sentiment | ✓ | ✓ | Classifies sentiment based on word usage at the call and utterance level. Values include negative, mostly negative, neutral, mostly positive, and positive.
Confidence | ✓ | ✓ | Scores words, utterances, and calls for the system's confidence in the transcription results.
Language identification | | ✓ | If a LID-supported language is detected, the ASR engine switches to the corresponding model for the detected language.
Numeric redaction | ✓ | ✓ | Redacts numbers from a transcript. Automated numeric redaction reduces PCI/PII risk by automatically finding and eliminating credit card and other sensitive numbers from audio and text.
Audio redaction | | ✓ | Replaces sensitive segments of an audio file with silence, reducing PCI/PII risk.
Gender identification | ✓ | ✓ | Identifies speakers as male or female.
Speaker separation | | ✓ | Automatically separates customer and agent voices when both are recorded on one channel, enabling their utterances to be analyzed independently. This is referred to as diarization.
Music detection | ✓ | ✓ | Acoustic-based classification model that identifies when music occurs. Each utterance is scored from -1 to +1, corresponding to the probability that it is music. Music utterances are not transcribed.
Agent identification | ✓ | ✓ | Identifies which channel is the agent versus the customer.
Luhn detection | ✓ | ✓ | Identifies which numbers are likely credit cards (n-16 digits) by adding a tag to the transcript metadata file (even if the number was redacted). Luhn numbers are not redacted when detected, and there is no "scrub only Luhn numbers" functionality.
Overtalk | ✓ | ✓ | Overtalk occurs when speakers talk over one another. A recording's overtalk percentage is the count of agent-initiated overtalk turns as a percentage of the total number of agent-speaking turns. In other words, out of all of the agent's turns, it measures how many interrupted a client's turn.

\* V‑Cloud implementations do not currently support real-time transcription.
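The Luhn check used for credit card detection is a standard public checksum algorithm. A minimal implementation (illustrative only, not Voci's code) looks like this:

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum,
    the standard validity test for credit card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:       # a two-digit product contributes its digit sum
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True  (a well-known test card number)
print(luhn_valid("4111111111111112"))  # False (checksum fails)
```

Because most card numbers satisfy this checksum while arbitrary digit strings usually do not, tagging Luhn-valid numbers is a useful signal that a redacted or transcribed number was likely a credit card.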
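The overtalk percentage defined above (agent-initiated interruptions as a share of all agent turns) can be computed from a sequence of speaker turns. The turn representation below is a hypothetical one chosen for illustration, not a Voci data format:

```python
def overtalk_percentage(turns):
    """Compute overtalk percentage as defined above.

    turns: chronological list of (speaker, interrupted) pairs, where
    `speaker` is "agent" or "client" and `interrupted` is True when the
    turn began while the other party was still speaking.
    (This turn structure is an assumption for illustration.)
    """
    agent_turns = [t for t in turns if t[0] == "agent"]
    if not agent_turns:
        return 0.0
    interruptions = sum(1 for _, interrupted in agent_turns if interrupted)
    return 100.0 * interruptions / len(agent_turns)

calls = [
    ("agent", False),
    ("client", False),
    ("agent", True),    # agent cut in while the client was speaking
    ("agent", False),
    ("client", True),   # client interruptions do not count here
]
print(round(overtalk_percentage(calls), 1))  # 33.3 (1 of 3 agent turns)
```

Note that only agent-initiated overtalk enters the numerator; a client interrupting the agent does not raise the recording's overtalk percentage.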