Finding the correct metaphor for text-to-speech


A recent Associated Press report on the Authors Guild’s concerns about the Kindle 2’s text-to-speech feature left many computer programmers guffawing. It occurs to me, though, that for those not familiar with text-to-speech technology the humorous implications may not be self-evident, so I will attempt to parse it:

“NEW YORK (AP) — The guild that represents authors is urging writers to be wary of a text-to-speech feature on Amazon.com Inc.’s updated Kindle electronic reading device.

 

“In a memo sent to members Thursday, the guild says the Kindle 2’s “Read to Me” feature “presents a significant challenge to the publishing industry.”

 

“The Kindle can read text in a somewhat stilted electronic voice. But the Authors Guild says the quality figures to “improve rapidly.” And the guild worries that could undermine the market for audio books.”

The quality of text-to-speech depends on the library of phonemes available on the reading device and the algorithms used to put them all together. A simple example is when you call the operator and an automated voice reads back a phone number to you with a completely unnatural intonation, and you realize that the pronunciation of each digit has been clipped and then taped back together without any sort of context. That is a case, moreover, where the relationship between vocalization and semantics is one-to-one. The semantic meaning of the number “1” is always mapped to the sound of someone pronouncing the word “one”. In the case of text-to-speech, no one has been sitting with the OED and carefully pronouncing every word for a similar one-to-one mapping. Instead, the software program on the reading device must use an algorithm to guess at the set of phonemes that are intended by a collection of letters and generate the sounds it associates with those phonemes.
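
To make the contrast concrete, here is a minimal sketch in Python. It is not how the Kindle or any real speech engine works; the clip names, the phoneme labels, and the handful of letter-to-sound rules are all invented for illustration. The first function does the phone-operator trick of looking up one recording per digit. The second guesses at phonemes from spelling, and one rule that happens to work for “tough” promptly mangles “though”.

```python
# Case 1: a fixed vocabulary. Each digit maps to exactly one recorded clip
# (the clip names are hypothetical stand-ins for audio files).
DIGIT_CLIPS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def read_phone_number(number: str) -> list[str]:
    """Tape the clips together in order, with no regard for context."""
    return [DIGIT_CLIPS[ch] for ch in number if ch in DIGIT_CLIPS]


# Case 2: an open vocabulary. These letter-to-sound rules are invented for
# illustration; real engines rely on far larger rule sets and dictionaries.
LETTER_RULES = {
    "ough": ["AH", "F"],  # right for "tough", wrong for "though"
    "th": ["TH"],
    "t": ["T"],
    "r": ["R"],
}


def guess_phonemes(word: str) -> list[str]:
    """Guess phonemes by matching the longest known spelling pattern first."""
    patterns = sorted(LETTER_RULES, key=len, reverse=True)
    word = word.lower()
    phonemes: list[str] = []
    i = 0
    while i < len(word):
        for pattern in patterns:
            if word.startswith(pattern, i):
                phonemes.extend(LETTER_RULES[pattern])
                i += len(pattern)
                break
        else:
            phonemes.append("?")  # no rule covers this letter
            i += 1
    return phonemes


print(read_phone_number("555-0100"))
print(guess_phonemes("tough"))   # plausible
print(guess_phonemes("though"))  # same rule fires, so it comes out as "thuff"
```

The lookup table never gets a digit wrong, only the intonation. The rule-based guesser can get the word itself wrong, which is the extra problem a book reader faces.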

 

The problem of intonation is still there, along with the additional issue of the peculiarities of English spelling. If you have a GPS system in your car, then you are familiar with the results. Bear in mind that your GPS system, in turn, is bungling what is actually a very particularized vocabulary. The books that the Kindle’s “Read to Me” feature will be dealing with have more in common with Borges’s labyrinth than Rand McNally’s road atlas.

 

While text-to-speech technology will indeed improve over time, it won’t be improving in the Kindle 2, which comes with one software bundle that reads in just one way.  I worked on a text-to-speech program a while back (if you have Vista, you can download it here) that combines an Eliza engine with the Vista operating system’s text-to-speech functionality.  One of the things I immediately wanted to do was to be able to switch out voices, and what I quickly found out was that I couldn’t get any new voices.  Vista came with a feminine voice with an American accent, and that was about it unless one wanted to use a feminine voice with a Pidgin-English accent that is included with the Chinese speech pack.  The only masculine voice Microsoft provided was available for Windows XP, and it wasn’t forward compatible. 

 

It simply isn’t easy to switch out voices, much less switch out speech engines on a given platform. Seeing that we aren’t paying for a software package when we buy the Kindle but rather only the device (with much less power than a Microsoft operating system), it can be said with some confidence that the Kindle 2 is never going to be able to read like Morgan Freeman.

 

The Kindle 2’s text-to-speech capability, or lack of it, is not going to undermine the market for audio books any more than public lectures by Stephen Hawking will undermine sales of his books. They are simply different things.

“It is telling authors and publishers to consider asking Amazon to disable the audio function on e-books it licenses.”

This is what is commonly referred to as the business requirement from hell. It assumes that something is easy out of a serious misunderstanding of how a given technology actually works. Text-to-speech technology is not based on anything inherent to the books Amazon is trying to peddle. It isn’t, for what it is worth, even associated with metadata about the books Amazon is trying to peddle. Instead, it is a free-roaming program that will attempt to read any text you feed it. Rather than a CD that is sold with the book, it has a greater similarity to a homunculus living inside your computer and reading everything out loud to you.

 

The proposal from the Authors Guild assumes that something must be taken off of the e-books in order to disable the text-to-speech feature. In fact, instructions not to read certain e-books must be added to the e-book metadata, and each Kindle 2 homunculus must in turn be taught to look for those instructions and act accordingly, in order to fulfill this requirement. This is a non-trivial rewrite of the underlying Kindle software as well as of the thousands of e-book images that Amazon will be selling. Nor can the files already living on people’s devices be recalled to add the additional metadata.
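
A rough sketch of what that requirement implies, again in Python. Nothing here reflects Amazon’s actual file format or firmware: the `Ebook` class, the `tts_allowed` key, and the `reader_understands_flag` switch are all hypothetical names made up for the example. The point is only that the flag has to be added to each book, and that a reader which has not been taught to look for it will carry on reading.

```python
from dataclasses import dataclass, field

# Hypothetical metadata key; not taken from any real e-book format.
TTS_FLAG = "tts_allowed"


@dataclass
class Ebook:
    title: str
    text: str
    metadata: dict = field(default_factory=dict)


def may_read_aloud(book: Ebook, reader_understands_flag: bool) -> bool:
    """Decide whether the reading program will speak this book."""
    if not reader_understands_flag:
        # Software written before the flag existed ignores it entirely.
        return True
    # Books published before the flag existed carry no instruction,
    # so the only sensible default is to keep reading them.
    return book.metadata.get(TTS_FLAG, True)


old_book = Ebook("Already on your device", "Call me Ishmael.")
new_book = Ebook("Reissued with the flag", "Call me Ishmael.", {TTS_FLAG: False})

print(may_read_aloud(old_book, reader_understands_flag=True))   # True
print(may_read_aloud(new_book, reader_understands_flag=True))   # False
print(may_read_aloud(new_book, reader_understands_flag=False))  # True
```

Only one of the three cases actually silences the homunculus, and it requires both a re-issued book and updated reader software.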

“Amazon spokesman Drew Herdener said the company has the proper license for the text-to-speech function, which comes from Nuance Communications Inc.”

This is just legalese on Amazon’s part that intentionally misunderstands the Authors Guild’s concerns as well as the legal issues involved. The Authors Guild isn’t accusing Amazon of not having rights to the text-to-speech software. They are asking whether using text-to-speech on their works violates pre-existing law.

 

The answer to that, in turn, concerns metaphors, as many legal matters ultimately do.  What metaphor does text-to-speech fall under?  Is it like a CD of a reading of a book, which generates additional income from an author’s labor?  Or is it like hiring Morgan Freeman to read Dianetics to you?  In which case, beyond the price of the physical book, Mr. Freeman should certainly be paid, but the Church of Scientology should not.