By: Gene Quinn (IPWatchdog.com)
On February 14, 2013, an interesting Apple patent application published. The application, titled “Method for disambiguating multiple readings in language conversion,” deals with the difficulty of accurate automatic translation when the meaning of the word being translated depends on the context in which it is used. An interesting invention in its own right perhaps, but the disclosure specifically relates to addressing this problem with respect to translating from Chinese into English.
The Heteronym Problem
Statistical language models are commonly used to convert or translate one language to another by assigning a probability to a sequence of words using a probability distribution. These language models are typically trained on a large body of text and generally capture how frequently each word, and each sequence of two or more words, occurs in that body of text. This works well enough, at least to some basic extent, for many purposes. But a real problem is encountered when identically written words have different meanings. In linguistics, identically written words that have different pronunciations and meanings are referred to as heteronyms.
An example of a heteronym in the English language is “desert,” which in one context means “to abandon,” and in another context means “a dry, barren area of land.” Thus, by accounting for the frequency of the word “desert” without regard to the context of its use, any distinctions of frequencies of use of the word in the first sense (“to abandon”) and a second sense (“a dry, barren area of land”) are most likely overlooked by conventional statistical language translation models.
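To make the limitation concrete, here is a minimal sketch of a conventional frequency-based model, assuming a simple bigram model and a three-sentence toy corpus. The function names and the corpus are illustrative assumptions and do not come from the patent application.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in a toy corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence boundaries.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sequence_probability(words, unigrams, bigrams):
    """Probability of a sentence prefix as a product of P(word | previous word)."""
    tokens = ["<s>"] + list(words)
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate available in this sketch
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

# Hypothetical corpus: "desert" appears once as a verb and twice as a noun.
corpus = [
    "soldiers desert the army",
    "camels cross the desert",
    "the desert is dry",
]
unigrams, bigrams = train_bigram_model(corpus)

# The model sees a single symbol "desert" with one combined count.
desert_count = unigrams["desert"]
prefix_prob = sequence_probability(["the", "desert"], unigrams, bigrams)
```

In this sketch the verb and the noun are pooled under one count, which is exactly the loss of information the paragraph above describes.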
Translating from Chinese into English is not an easy task in and of itself. Pinyin is a standard method for transcribing Mandarin Chinese using the Roman alphabet. In a pinyin transliteration, the phonetic pronunciations of Chinese characters are mapped to syllables composed of Roman letters. While this can improve conversion accuracy, certain Chinese characters, known as heteronymous characters, have multiple pronunciations. Conventional language models that do not distinguish between the different pronunciations of a heteronym can sometimes produce undesirable conversion results for those characters.
The Apple Solution
The Apple innovation is an improvement on the statistical translation models previously described. A corpus of text is obtained and statistically analyzed. One of the cornerstones of the advance is the annotation of the corpus, which makes it possible to distinguish between a use of a heteronym that indicates one meaning and a use of the same heteronym that indicates a different meaning.
A language model training engine is configured to receive manual annotations to characters in the corpus. A heteronymous character with one meaning will be associated with a first symbol in the corpus, and the same heteronymous character with a second meaning will be associated with a second symbol. So, when the heteronymous character is used in a context associated with the first meaning, that instance of the character will be stored as the first symbol, and when it is used in a context associated with the second meaning, that instance will be stored as the second symbol. As a consequence of this annotation, a heteronymous character will no longer appear throughout the corpus as various instances of the same symbol. Instead, each different reading of a heteronymous character will be replaced by a distinct symbol. Therefore, a heteronymous character that is associated with three possible readings could appear throughout the annotated corpus as various instances of three different symbols.
To illustrate how the Apple innovation works, let’s return to the illustration of “desert” previously discussed. There are two possible meanings for the word “desert”: a verb meaning “to abandon,” and a noun meaning “a dry, barren area of land.” Prior to the annotation of the corpus, an appearance of “desert” in the text of the corpus would be associated with a single symbol for “desert.” But after the annotation, an appearance of “desert” in the corpus would be associated with either the symbol for the verb or the symbol for the noun, depending on which of the two meanings is appropriate for the context in which that instance of “desert” appears in the text.
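The “desert” illustration can be sketched in a few lines, assuming the manual annotations arrive as a lookup from token position to a reading label. The `#verb` and `#noun` suffixes, the `annotate` helper, and the toy corpus are hypothetical choices made for this sketch, not the format used in the application.

```python
from collections import Counter

# Toy corpus, already split into tokens (illustrative only).
raw_corpus = [
    ["soldiers", "desert", "the", "army"],
    ["camels", "cross", "the", "desert"],
    ["the", "desert", "is", "dry"],
]

# Hypothetical manual annotations: (sentence index, token index) -> reading.
annotations = {
    (0, 1): "verb",  # "soldiers desert the army"
    (1, 3): "noun",  # "camels cross the desert"
    (2, 1): "noun",  # "the desert is dry"
}

def annotate(corpus, annotations):
    """Replace each annotated token with a reading-specific symbol."""
    annotated = []
    for i, sentence in enumerate(corpus):
        new_sentence = []
        for j, token in enumerate(sentence):
            label = annotations.get((i, j))
            new_sentence.append(f"{token}#{label}" if label else token)
        annotated.append(new_sentence)
    return annotated

annotated_corpus = annotate(raw_corpus, annotations)

# Frequencies before and after annotation.
raw_counts = Counter(tok for sent in raw_corpus for tok in sent)
annotated_counts = Counter(tok for sent in annotated_corpus for tok in sent)
```

Before annotation the single symbol “desert” has a count of three. Afterward that count is split between `desert#verb` and `desert#noun`, so a model trained on the annotated corpus can learn separate frequencies for each reading.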
After annotation, the symbol set of the corpus is expanded. Prior to the disclosed annotation, a heteronymous character resolves to one value that represents the character in a machine-intelligible manner. Subsequent to the disclosed annotation, a heteronymous character maps to more than one machine-readable value in the annotated corpus, where each machine-readable value associated with a heteronymous character represents a particular reading of that character.
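One way to picture the expanded symbol set is a table that assigns a distinct integer to each (character, reading) pair. The sketch below assumes integer IDs as the machine-readable values and uses the character 长, which can be read as “chang2” (long) or “zhang3” (to grow), purely as an illustration. The application does not specify this encoding, and `build_symbol_table` is a hypothetical helper.

```python
def build_symbol_table(characters, readings):
    """Assign one integer ID per (character, reading) pair.

    Characters with no listed readings get a single entry whose
    reading is None, mirroring the pre-annotation one-value case.
    """
    table = {}
    for ch in characters:
        for reading in readings.get(ch, [None]):
            table[(ch, reading)] = len(table)
    return table

characters = ["大", "长"]

# Before annotation: every character resolves to exactly one value.
plain_table = build_symbol_table(characters, {})

# After annotation: the heteronymous character maps to one value per reading.
readings = {"长": ["chang2", "zhang3"]}
annotated_table = build_symbol_table(characters, readings)
```

Here the plain table holds two symbols while the annotated table holds three, because the one heteronymous character now contributes a separate value for each reading.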
As you might expect, the USPTO has taken no substantive action on this application at this point, although copious amounts of non-patent literature have already been disclosed by the applicant by and through its attorneys.
Representing Apple in this matter is the Palo Alto, California, office of Morgan, Lewis & Bockius, LLP.