V-Blaze and V-Cloud Online Help

Substitution Syntax

Substitution files contain text, with one pattern-replacement pair on each line, separated by a colon. Blank lines and comment lines (beginning with #) are ignored.
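The file format described above can be sketched in Python. This is only an illustration of the stated rules, not the V‑Blaze parser: the function name parse_substitution_text is hypothetical, and splitting on the first colon of each line is an assumption.

```python
def parse_substitution_text(text):
    """Return (pattern, replacement) pairs from substitution-file text."""
    rules = []
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and comment lines (beginning with #) are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Assumption: the first colon separates pattern from replacement.
        pattern, sep, replacement = stripped.partition(":")
        if not sep:
            raise ValueError(f"no colon separator in rule: {line!r}")
        rules.append((pattern.strip(), replacement.strip()))
    return rules


sample = """\
# example rules
TALKING TO : SPEAKING WITH

READ (A|AN|ONE) BOOK : READ A BOOK
"""
rules = parse_substitution_text(sample)
for pattern, replacement in rules:
    print(f"{pattern!r} -> {replacement!r}")
```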

In its basic form, a pattern and its replacement are simply words separated by spaces. For example:

TALKING TO : SPEAKING WITH

would change talking to to speaking with. The pattern, TALKING TO, is just the list of words to be matched. Character case is ignored. When these words are found in the transcribed text, they are replaced with the words SPEAKING WITH.

Word substitution is followed by other processing that determines the character case of the translated text. If you want to specify the character case for a replacement word explicitly, just enclose it with slash characters (/). For example,

laugh out loud : /LOL/

Each line in a substitution file must contain exactly one substitution rule. Matches are not case sensitive, so the letter case (capitalization) of the pattern does not matter. By default, the capitalization of the replacement phrase matches the capitalization of the original transcript. To control letter case, enclose a word between forward slashes. The example below shows substitution rules that correct transcription errors and capitalization.

n d a : /NDA/
pc and number : /PCN/ number 
it's vance physical therapy : /Advanced/ /Physical/ /Therapy/

Each word in the replacement text must be delimited individually. For example, the following replacement text is incorrect and will cause the substitution to fail.

it's vance physical therapy : /Advanced Physical Therapy/
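A quick check for this mistake can be sketched in Python before a file is deployed. The helper name undelimited_words is hypothetical, and the check assumes that a case-controlled word is one that both starts and ends with a slash; it is not part of V‑Blaze.

```python
def undelimited_words(replacement):
    """Return replacement words whose slash delimiters are unbalanced."""
    bad = []
    for word in replacement.split():
        starts = word.startswith("/")
        ends = word.endswith("/") and len(word) > 1
        # A case-controlled word must both start and end with a slash.
        if starts != ends:
            bad.append(word)
    return bad


good = undelimited_words("/Advanced/ /Physical/ /Therapy/")
flagged = undelimited_words("/Advanced Physical Therapy/")
print(good)     # []
print(flagged)  # ['/Advanced', 'Therapy/']
```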

Patterns allow additional options, as described in Patterns.

Substitution File Priority

Substitution is performed by locating patterns and replacing the matched text. The patterns are ordered, which affects the way substitutions are performed. For example, suppose the transcribed text is:

Hello, my name is Justin. Who am I talking to?

If the first pattern changes talking to to speaking with, the result would be:

Hello, my name is Justin. Who am I speaking with?

Any later pattern searching for the word talking would not find it, since it has been changed to speaking.

Once the last pattern has acted, no further changes will be made.
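The effect of rule order can be imitated with a short Python sketch. It handles only plain word sequences and uses Python's re module as a stand-in for the real engine; the second rule (talking to chatting) is a hypothetical example added to show that a later pattern no longer finds text that an earlier rule has changed.

```python
import re


def apply_in_order(text, rules):
    """Apply literal word-sequence rules one after another, ignoring case."""
    for pattern, replacement in rules:
        # \b keeps the match on whole words; re.escape treats the
        # pattern as literal text rather than as a regular expression.
        regex = re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE)
        text = regex.sub(lambda m: replacement, text)
    return text


transcript = "Hello, my name is Justin. Who am I talking to?"
first = apply_in_order(transcript, [("talking to", "speaking with"),
                                    ("talking", "chatting")])
second = apply_in_order(transcript, [("talking", "chatting"),
                                     ("talking to", "speaking with")])
print(first)   # Hello, my name is Justin. Who am I speaking with?
print(second)  # Hello, my name is Justin. Who am I chatting to?
```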

Substitution files and rules are applied to transcripts in a certain order. All the rules in the hierarchy apply and do not override the others. The following list is the order of application for substitution files and rules:

  1. Built-in language model substitutions:

    1. Packaged by Voci — DO NOT MODIFY!

    2. Rules contained in /opt/voci/models/[LANG]/[MODEL]/substitutions.

  2. Built-in language substitutions:

    1. Packaged by Voci — DO NOT MODIFY!

    2. Rules contained in /opt/voci/models/[LANG]/substitutions.

  3. Language group substitutions:

    1. Automatically applied substitutions to all transcriptions which use a model from that language group.

    2. Match the name of the substitution file with the name of the language group.

    3. Place substitution rules in /opt/voci/state/substitutions/[GLOBAL_LANG].

  4. Language substitutions:

    1. Automatically applied language substitutions.

    2. Match the name of your substitution file to the name of the language.

    3. Place substitution rules in /opt/voci/state/substitutions/[LANG].

  5. Language model substitutions:

    1. Automatically applied language model substitutions.

    2. Match the name of your substitution file to the name of the language model.

    3. Place substitution rules in /opt/voci/state/substitutions/[LANG]:[MODEL].

  6. subst_list:

    1. V‑Blaze API Request using the subst_list parameter.

    2. Substitution files used with subst_list can have any name.

    3. Place substitution files in /opt/voci/state/substitutions/.

    4. Substitution files apply in list order if multiple files are specified.

  7. subst_rules:

    1. V‑Blaze API Request using the subst_rules parameter.

    2. Specify the substitution rules as the value of subst_rules.

    3. Use the < operator in curl to automatically populate the value with the contents of a substitution file.
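The order of application listed above can be summarized as a small Python sketch that builds the sequence of sources for one request. The function name and the sample values (xx, xx1, demo, custom_a, custom_b) are hypothetical placeholders; only the directory layout is taken from the list.

```python
def substitution_sources(lang_group, lang, model, subst_list=(), subst_rules=None):
    """Return substitution sources in the order they are applied."""
    state = "/opt/voci/state/substitutions"
    sources = [
        f"/opt/voci/models/{lang}/{model}/substitutions",  # 1. built-in, language model
        f"/opt/voci/models/{lang}/substitutions",          # 2. built-in, language
        f"{state}/{lang_group}",                           # 3. language group
        f"{state}/{lang}",                                 # 4. language
        f"{state}/{lang}:{model}",                         # 5. language model
    ]
    # 6. Files named in subst_list apply in list order.
    sources += [f"{state}/{name}" for name in subst_list]
    # 7. Rules passed inline through subst_rules come last.
    if subst_rules is not None:
        sources.append("<inline subst_rules>")
    return sources


order = substitution_sources("xx", "xx1", "demo",
                             subst_list=["custom_a", "custom_b"],
                             subst_rules="TALKING TO : SPEAKING WITH")
for position, source in enumerate(order, start=1):
    print(position, source)
```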

Patterns

A pattern may simply consist of a sequence of words separated by spaces. However, each word could instead specify a set of possible words or a regular expression.

Set of words

If you would like the pattern to accept one of a few words, place the words in parentheses and separate them by vertical bars. For example,

READ (A|AN|ONE) BOOK : READ A BOOK

would change read a book, read an book, or read one book to read a book. There should be no spaces in the list of alternative words.
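As a rough analogy, the same rule can be written as an ordinary Python regular expression. This only imitates the behavior described above on plain text; it is not how V‑Blaze evaluates substitution rules, and the fourth input is a hypothetical non-matching case.

```python
import re

# The set (A|AN|ONE) accepts exactly one of the listed words.
rule = re.compile(r"\bREAD (A|AN|ONE) BOOK\b", re.IGNORECASE)

inputs = ["read a book", "read an book", "read one book", "read two book"]
outputs = [rule.sub("read a book", text) for text in inputs]
print(outputs)
# ['read a book', 'read a book', 'read a book', 'read two book']
```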

Regular expression

Any word in the pattern may be a regular expression when enclosed in single or double quotes.

Regular expressions are implemented in V‑Blaze using the Python library re. See https://docs.python.org/2/library/re.html for details. Some simple example patterns using regular expressions are:

  • "X.*Y" X followed by any characters followed by Y

  • 'AB+C' A followed by one or more B characters followed by C

Important

A regular expression is expected to match an entire single word in the transcribed text. Therefore, any regular expression containing a space will not match anything.
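The whole-word requirement can be illustrated with re.fullmatch applied to one word at a time. This is a sketch that assumes words are separated by whitespace and that case is ignored as it is for plain words; the helper name matching_words is hypothetical.

```python
import re


def matching_words(regex, text):
    """Return the words of text that the regex matches in their entirety."""
    return [word for word in text.split()
            if re.fullmatch(regex, word, re.IGNORECASE)]


print(matching_words("AB+C", "abc abbbc ac abcd"))  # ['abc', 'abbbc']
print(matching_words("X.*Y", "xy xray x y"))        # ['xy', 'xray']
# A regular expression containing a space can never equal a single word.
print(matching_words("A B", "a b"))                 # []
```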

You may specify groups within a regular expression using parentheses, as described in the URL referenced above. For example:

'(.*)-(.*)' Any characters followed by - followed by any characters. Each parenthesized sequence of characters may be referenced separately, as described in Replacements under Copying text from the input.

Pattern suffixes

Any word, set of words, or regular expression may take a suffix as described below. The word and the suffix should not have any spaces between or in them.

Not: !

A word followed by ! will match anything other than that word. It will also match the beginning or end of the transcribed text.

Zero or one: ?

A word followed by ? makes the word optional. That is, it will match zero or one occurrence of the word in the transcribed text.

Zero or more: *

A word followed by * will match zero or more occurrences of the word in the transcribed text.

One or more: +

A word followed by + will match one or more occurrences of the word in the transcribed text.

Specified number of occurrences: {number}

If you want to match a specific number of occurrences of a word, place the number in { } after the word.

Range of occurrences: {min,max}

If you want to limit the number of repetitions of a word to a specific range, place the range in { } after the word. The range is inclusive, so X{2,3} would match either X X or X X X.
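The repetition suffixes behave much like regular-expression quantifiers applied to whole words. The following Python sketch translates a pattern of plain words with ?, *, +, {number}, or {min,max} suffixes into a Python regular expression and tests it against an entire text. It is an approximation for experimenting with the rules above, not the V‑Blaze implementation: the ! suffix, sets of words, and quoted regular expressions are not handled, and the words VERY and GOOD are hypothetical.

```python
import re

# Split an item into its word and an optional repetition suffix.
SUFFIX = re.compile(r"^(?P<word>.+?)(?P<suffix>[?*+]|\{\d+(?:,\d+)?\})?$")


def to_regex(pattern):
    """Translate a word pattern with suffixes into a Python regex."""
    parts = []
    for item in pattern.split():
        m = SUFFIX.match(item)
        word, suffix = m.group("word"), m.group("suffix") or ""
        # One occurrence is the word plus a trailing space; the suffix
        # then repeats that unit exactly as a regex quantifier would.
        parts.append(f"(?:{re.escape(word)} ){suffix}")
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern, text):
    """True if the whole text, word for word, fits the pattern."""
    # A trailing space is added so every word, including the last, ends in " ".
    normalized = " ".join(text.split()) + " "
    return to_regex(pattern).fullmatch(normalized) is not None


print(matches("X{2,3}", "x x"))                  # True
print(matches("X{2,3}", "x x x x"))              # False
print(matches("VERY* GOOD", "very very good"))   # True
print(matches("VERY+ GOOD", "good"))             # False
```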

Replacements

When a pattern is matched, the words that were matched are replaced by the words specified as replacements on the same line of the file.

Controlling character case

Replaced words are subject to character case translation unless they are enclosed in slash characters (/). Each word where you want to specify the character case should individually be enclosed in slashes. For example, to translate H AND R BLOCK to H&R Block, use the following rule:

H AND R BLOCK : /H&R/ /Block/

Again, note that each word is enclosed in slashes, not the entire replacement.

Copying text from the input

The items in the pattern are numbered automatically, beginning at 1. You can use the matched word in your replacement pattern using \ followed by the number of the item. For example,

READ (A|AN|ONE) (BOOK|MAGAZINE) : READ A \3

would change read one book to read a book, and read an magazine to read a magazine.
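The numbering of items can be made concrete with a small Python sketch that fills \N references from a list of matched words. The function name fill_replacement is hypothetical, and the list of matched words is supplied by hand here rather than produced by a real pattern match.

```python
import re


def fill_replacement(replacement, matched_items):
    """Substitute each \\N reference with the text matched by item N."""
    def lookup(m):
        # Items are numbered from 1, Python lists from 0.
        return matched_items[int(m.group(1)) - 1]
    return re.sub(r"\\(\d+)", lookup, replacement)


# Pattern items: 1 = READ, 2 = (A|AN|ONE), 3 = (BOOK|MAGAZINE)
result = fill_replacement(r"READ A \3", ["read", "an", "magazine"])
print(result)  # READ A magazine
```

The output keeps the replacement's literal words in upper case only because this sketch does no case processing; in V‑Blaze the character case is determined by later processing.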

A more complex example is

(AT|ON|FROM) ".*"{1,5} DOT (COM|GOV) : \1 \2.\4

This rule would change

learn more about x at my health plan dot gov or by calling

to

learn more about x at myhealthplan.gov or by calling

The text matched by the first item, (AT|ON|FROM), replaces \1 in the replacement. The second item, ".*"{1,5}, will match any sequence of one to five words. All of the words that are matched are concatenated without spaces and used in place of \2 in the replacement. The third item, DOT, is matched but not referenced. Finally, the text matched by the last item, (COM|GOV), replaces \4 in the replacement.

Note that the same rule would change

traffic information at nys dot dot gov

to

traffic information at nysdot.gov

because the first occurrence of dot is not followed by com or gov. That dot is therefore treated as one of the words matched by the second item, and the search continues to the next dot.
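Both results can be reproduced with an ordinary Python regular expression that imitates this rule on plain text. This is a sketch for checking the expected output, not the V‑Blaze engine. Note that the Python group numbers differ from the item numbers in the rule, because DOT is not a group in the Python expression.

```python
import re

# Group 1 = (AT|ON|FROM), group 2 = one to five words, group 3 = (COM|GOV).
rule = re.compile(r"\b(at|on|from) ((?:\S+ ){1,5}?)dot (com|gov)\b",
                  re.IGNORECASE)


def rewrite(text):
    """Join the words before 'dot com' or 'dot gov' into one address."""
    def repl(m):
        joined = "".join(m.group(2).split())  # matched words, no spaces
        return f"{m.group(1)} {joined}.{m.group(3)}"
    return rule.sub(repl, text)


print(rewrite("learn more about x at my health plan dot gov or by calling"))
# learn more about x at myhealthplan.gov or by calling
print(rewrite("traffic information at nys dot dot gov"))
# traffic information at nysdot.gov
```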

Finally, consider this example:

ZIPCODE '(.*)-(.*)' : ZIP \2.1 PLUS \2.2

This rule would change

ZIPCODE 12345-6789

to

ZIP 12345 PLUS 6789

The reference \2.1 indicates the first group from the second word. Similarly, the reference \2.2 is replaced by the second group from the second word.

If you wanted the entire second word followed by .1, simply place a backslash in front of the dot, like this: \2\.1
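The three reference forms (\N, \N.G, and \N\.G) can be sketched in Python as follows. The helper name fill and the way matches are passed in are assumptions made for illustration; each entry of items is the Python match object for one pattern item.

```python
import re

# \N with an optional .G group number, or an escaped dot (\.).
REF = re.compile(r"\\(\d+)(?:\.(\d+))?|\\\.")


def fill(replacement, items):
    """items[n-1] is the re.Match object for pattern item n."""
    def lookup(m):
        if m.group(0) == "\\.":
            return "."  # escaped dot: emit a literal "."
        match = items[int(m.group(1)) - 1]
        # Group 0 is the whole matched word; \N.G selects group G.
        group = int(m.group(2)) if m.group(2) else 0
        return match.group(group)
    return REF.sub(lookup, replacement)


items = [
    re.fullmatch("ZIPCODE", "ZIPCODE"),
    re.fullmatch("(.*)-(.*)", "12345-6789"),
]
print(fill(r"ZIP \2.1 PLUS \2.2", items))  # ZIP 12345 PLUS 6789
print(fill(r"\2\.1", items))               # 12345-6789.1
```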