V-Blaze and V-Cloud Online Help

Language Model Maturity

Language models capture the knowledge that a machine-learning algorithm has extracted from examples, much as a person learns from observation. A model has a finite capacity for knowledge: it can learn a great deal about a single subject, or small amounts about many subjects. This is a spectrum; the more a model knows about any given subject, the less capacity remains to learn about other subjects. This spectrum is referred to as the "maturity" of the language model.

Our language models are classified into four levels of maturity:

Level 1 - Language model has been trained across multiple data sets, using a significant amount of data. Level 1 offers very high accuracy out of the box within the applicable vertical.

Level 2 - Language model has been trained with a moderate amount of data. Level 2 models can generally be used out of the box with good accuracy as long as the language, vertical, and audio are well matched.

Level 3 - Language model has been developed with limited data. It can be deployed to production in some circumstances, but should be evaluated closely to determine how well it performs and what tuning may be required.

Level 4 - Language model has been developed with minimal data and will likely need to be tuned.

For models that convert speech to text, the variety of topics people speak about and the variety of languages they speak are the primary domains of interest. Within a single language there are many dialects. Within a dialect there can be many sub-dialects. Within a sub-dialect, each individual speaker has a unique vocal tract that produces unique sounds, which themselves change over time and under different conditions and stresses.

An acoustic model interprets the sounds a person produces, trying to identify the units of sound that convey meaning within the target language. The accuracy of this interpretation depends upon how well the input speech matches the speech used to train the acoustic model, and the average accuracy across a population depends on how well the statistical properties of that population match the training data.
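
As a rough illustration of that last point, population-level accuracy behaves like each group's accuracy weighted by that group's share of the population. The short Python sketch below works through the arithmetic; the dialect names and accuracy figures are invented for illustration and do not describe any actual model.

    # Toy calculation: expected accuracy over a population is each group's
    # accuracy weighted by that group's share of the population.
    # All names and numbers below are hypothetical.

    population_share = {"dialect_a": 0.60, "dialect_b": 0.30, "dialect_c": 0.10}

    # Per-dialect accuracy of a hypothetical acoustic model trained
    # mostly on dialect_a, so it matches that group best.
    model_accuracy = {"dialect_a": 0.92, "dialect_b": 0.85, "dialect_c": 0.70}

    expected = sum(share * model_accuracy[d] for d, share in population_share.items())
    print(f"Expected population accuracy: {expected:.3f}")  # 0.877

Shifting the population toward dialect_c (the worst match) lowers the expected accuracy even though the model itself is unchanged, which is why the same model can score very differently for different populations of speakers.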

Vocabulary and phrasing choices are collectively well correlated with the "topic" of conversation. For example, the words and phrases used when discussing cell phones are distinct from those used when discussing mattresses. If the variety of topics discussed is small, then a narrow and deep language model will perform best; consider, for example, customer care calls about issues that can arise with a single brand of mattress.

If the variety of topics discussed is large, then a broad and shallow language model will perform best. Voicemail messages, which can address any topic imaginable, are a good example.
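
To make the narrow-versus-broad trade-off concrete, the sketch below builds two Laplace-smoothed unigram models in Python (far simpler than a production language model) and compares their perplexity on an in-domain phrase. The corpora, vocabulary, and numbers are invented for illustration.

    import math
    from collections import Counter

    def unigram_model(corpus_words, vocab):
        # Laplace-smoothed unigram probabilities over a shared vocabulary.
        counts = Counter(corpus_words)
        total = len(corpus_words) + len(vocab)
        return {w: (counts[w] + 1) / total for w in vocab}

    def perplexity(model, words):
        # Lower perplexity means the model finds the text less surprising.
        log_prob = sum(math.log(model[w]) for w in words)
        return math.exp(-log_prob / len(words))

    # Hypothetical training text: the narrow model sees only mattress calls,
    # while the broad model sees a little of everything.
    narrow_corpus = "mattress sagging warranty mattress firm mattress return warranty".split()
    broad_corpus = ("mattress warranty phone battery screen voicemail "
                    "message delivery refund account password").split()
    vocab = set(narrow_corpus) | set(broad_corpus)

    narrow = unigram_model(narrow_corpus, vocab)
    broad = unigram_model(broad_corpus, vocab)

    in_domain = "mattress warranty return".split()
    print(perplexity(narrow, in_domain))  # ~7.6: deep knowledge of its one topic
    print(perplexity(broad, in_domain))   # ~15.7: probability mass spread thin

The narrow model concentrates its probability mass on mattress vocabulary, so it scores in-domain text better; the broad model spreads the same capacity across many topics and loses depth on each one. Production speech-to-text models face the same trade-off at a much larger scale.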

Ideally, we would have one all-knowing acoustic model that understands all languages at the "subject matter expert" (SME) level. This would require a knowledge capacity that is orders of magnitude beyond what current technology can accomplish, and the same is true of language models. That may one day be achievable, but for now we take the approach of creating multiple models. These models can be thought of as a "team" of SMEs that we apply strategically to provide the best possible accuracy for the largest possible overall population.