Category Archives: Artificial Intelligence

The Great AI Awakening


This is a crazy long but nicely comprehensive article by the New York Times on the current state of AI: The Great AI Awakening.

While lately I’ve been buried in 3D interfaces, I’m always faintly aware that 1D interfaces (Cortana Skills, Speech as a service, etc.) are another fruit of our recent machine learning breakthroughs (or, more accurately, refocus), and that the future success of holographic displays ultimately depends on making them work with our 1D interfaces to create personal assistants. This article helps connect the dots between these technologies, which at first appear unrelated.

It also nicely complements Memo Akten’s Medium posts on Deep Learning and Art, which Microsoft resident genius Rick Barraza pointed me to a while back:

Part 1: The Dawn of Deep Learning

Part 2: Algorithmic Decision Making, Machine Bias, Creativity and Diversity

There’s also a nice throwaway reference in the Times article to the relationship between VR and Machine Learning, which is a little less obscure if you already know Baudrillard’s Simulacra and Simulation, which in turn depends on Jorge Luis Borges’s very short story On Exactitude In Science.

If you really haven’t the time though, which I suspect may be the case, here are some quick excerpts starting with Google’s AI efforts:

Google’s decision to reorganize itself around A.I. was the first major manifestation of what has become an industrywide machine-learning delirium. Over the past four years, six companies in particular — Google, Facebook, Apple, Amazon, Microsoft and the Chinese firm Baidu — have touched off an arms race for A.I. talent, particularly within universities. Corporate promises of resources and freedom have thinned out top academic departments. It has become widely known in Silicon Valley that Mark Zuckerberg, chief executive of Facebook, personally oversees, with phone calls and video-chat blandishments, his company’s overtures to the most desirable graduate students. Starting salaries of seven figures are not unheard-of. Attendance at the field’s most important academic conference has nearly quadrupled. What is at stake is not just one more piecemeal innovation but control over what very well could represent an entirely new computational platform: pervasive, ambient artificial intelligence.

 

When he has an opportunity to make careful distinctions, Pichai differentiates between the current applications of A.I. and the ultimate goal of “artificial general intelligence.” Artificial general intelligence will not involve dutiful adherence to explicit instructions, but instead will demonstrate a facility with the implicit, the interpretive. It will be a general tool, designed for general purposes in a general context. Pichai believes his company’s future depends on something like this. Imagine if you could tell Google Maps, “I’d like to go to the airport, but I need to stop off on the way to buy a present for my nephew.” A more generally intelligent version of that service — a ubiquitous assistant, of the sort that Scarlett Johansson memorably disembodied three years ago in the Spike Jonze film “Her”— would know all sorts of things that, say, a close friend or an earnest intern might know: your nephew’s age, and how much you ordinarily like to spend on gifts for children, and where to find an open store. But a truly intelligent Maps could also conceivably know all sorts of things a close friend wouldn’t, like what has only recently come into fashion among preschoolers in your nephew’s school — or more important, what its users actually want. If an intelligent machine were able to discern some intricate if murky regularity in data about what we have done in the past, it might be able to extrapolate about our subsequent desires, even if we don’t entirely know them ourselves.

 

The new wave of A.I.-enhanced assistants — Apple’s Siri, Facebook’s M, Amazon’s Echo — are all creatures of machine learning, built with similar intentions. The corporate dreams for machine learning, however, aren’t exhausted by the goal of consumer clairvoyance. A medical-imaging subsidiary of Samsung announced this year that its new ultrasound devices could detect breast cancer. Management consultants are falling all over themselves to prep executives for the widening industrial applications of computers that program themselves. DeepMind, a 2014 Google acquisition, defeated the reigning human grandmaster of the ancient board game Go, despite predictions that such an achievement would take another 10 years.

Jabberwocky

 

Download SAPISophiaDemo.zip – 2,867.5 KB

 

Following on the heels of the project I have been working on for the past month, a chatterbox (also called a chatbot) with speech recognition and text-to-speech functionality, I came across the following excerpted article in The Economist, available here if you happen to be a subscriber, and here if you are not:

 

Chatbots have already been used by some companies to provide customer support online via typed conversations. Their understanding of natural language is somewhat limited, but they can answer basic queries. Mr Carpenter wants to combine the flexibility of chatbots with the voice-driven “interactive voice-response” systems used in many call centres to create a chatbot that can hold spoken conversations with callers, at least within a limited field of expertise such as car insurance.

This is an ambitious goal, but Mr Carpenter has the right credentials: he is the winner of the two most recent Loebner prizes, awarded in an annual competition in which human judges try to distinguish between other humans and chatbots in a series of typed conversations. His chatbot, called Jabberwacky, has been trained by analysing over 10m typed conversations held online with visitors to its website (see jabberwacky.com). But for a chatbot to pass itself off as a human agent, more than ten times this number of conversations will be needed, says Mr Carpenter. And where better to get a large volume of conversations to analyse than from a call centre?

Mr Carpenter is now working with a large Japanese call-centre company to develop a chatbot operator. Initially he is using transcripts of conversations to train his software, but once it is able to handle queries reliably, he plans to add speech-recognition and speech-synthesis systems to handle the input and output. Since call-centre conversations tend to be about very specific subjects, this is a far less daunting task than creating a system able to hold arbitrary conversations.

 

Jabberwacky is a slightly different beast than the AIML infrastructure I used in my project. Jabberwacky is a heuristics-based technology, whereas AIML is a design-based one that requires somebody to actually anticipate user interactions and try to script them.

All the same, it is a pleasant experience to find that one is serendipitously au courant, when one’s intent was to be merely affably retro.

SophiaBot: What I’ve been working on for the past month…

I have been busy in my basement constructing a robot with which I can have conversations and play games. Except that the robot is more of a program, and I didn’t build the whole thing up from scratch, but instead cobbled together pieces that other people have created. I took an Eliza-style interpreter written by Nicholas H. Tollervey (this is the conversation part) along with some scripted dialogs by Dr. Richard S. Wallace and threw them together with a Z-machine program written by Jason Follas, which allows my bot to play old Infocom games like Zork and The Hitchhiker’s Guide to the Galaxy. I then wrapped these up in a simple workflow and added some new Vista/.NET 3.0 speech recognition and speech synthesis code so the robot can understand me.

I wrote an article about it for CodeProject, a very nice resource that allows developers from around the world to share their code and network. The site requires registration to download code, however, so if you want to play with the demo or look at the source code, you can also download them from this site.

Mr. Tollervey has a succinct article about the relationship between chatterboxes and John Searle’s Chinese Room argument, which relieves me of the responsibility for discussing the same.

Instead, I’ll just add some quick instructions:

 

The application is made up of a text output screen, a text entry field, and a default enter button. The initial look and feel is that of an IBM XT theme (the first computer I ever played on). This can be changed using voice commands, which I will cover later. There are three menus initially available. The File menu allows the user to save a log of the conversation as a text file. The Select Voice menu allows the user to select from any of the synthetic voices installed on her machine. Vista initially comes with “Anna”. Windows XP comes with “Sam”. Other XP voices are available depending on which versions of Office have been installed over the lifetime of that particular instance of the OS. If the user is running Vista, then the Speech menu will allow him to toggle speech synthesis, dictation, and the context-free grammars. By doing so, the user will have the ability to speak to the application, as well as have the application speak back to him. If the user is running XP, then only speech synthesis is available, since some of the features provided by .NET 3.0 and consumed by this application do not work on XP.

The Appearance menu will let you change the look and feel of the text screen. I’ve also added some pre-made themes at the bottom of the Appearance menu. If, after chatting with SophiaBot for a while, you want to play a game, just type or say “Play game.” SophiaBot will present you with a list of the games available. You can add more, actually, simply by dropping additional game files you find on the internet into the Program Files\Imaginative Universal\SophiaBot\Game Data\DATA folder. (Jason’s Z-Machine implementation plays games that use version 3 and below of the game engine. I’m looking, rather lazily, into how to support later versions.) You can go here to download more Zork-type games. During a game, type or say “Quit” to end your session. “Save” and “Restore” keep track of your current position in the game, so you can come back later and pick up where you left off.

Speech recognition in Vista has two modes: dictation and context-free recognition. Dictation uses context, that is, an analysis of preceding words and words following a given target of speech recognition, in order to determine what word was intended by the speaker. Context-free speech recognition, by way of contrast, uses exact matches and some simple patterns in order to determine if certain words or phrases have been uttered. This makes context-free recognition particularly suited to command and control scenarios, while dictation is particularly suited to situations where we are simply attempting to translate the user’s utterances into text.

You should begin by trying to start up a conversation with Sophia using the textbox, just to see how it works, as well as her limitations as a conversationalist. Sophia uses certain tricks to appear more lifelike. She throws out random typos, for one thing. She also is a bit slower than a computer should really be. This is because one of the things that distinguish computers from people is the way they process information — computers do it quickly, and people do it at a more leisurely pace. By typing slowly, Sophia helps the user maintain his suspension of disbelief. Finally, if a text-to-speech engine is installed on your computer, Sophia reads along as she types out her responses. I’m not certain why this is effective, but it is how computer terminals are shown to communicate in the movies, and it seems to work well here, also. I will go over how this illusion is created below.
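The typing illusion described above can be sketched roughly as follows. This is not SophiaBot's actual code; the TypingIllusion class, its method names, and the typoRate parameter are all hypothetical, and the pause between keystrokes is only indicated by a comment. The idea is to expand a reply into keystrokes, now and then insert a wrong letter followed by a backspace, and replay them one at a time.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch, not the SophiaBot source.
static class TypingIllusion
{
    // Expands a reply into keystrokes. With probability typoRate, a letter
    // is preceded by a random wrong letter and a backspace ('\b').
    public static List<char> ToKeystrokes(string reply, Random random, double typoRate)
    {
        List<char> keys = new List<char>();
        foreach (char c in reply)
        {
            if (char.IsLetter(c) && random.NextDouble() < typoRate)
            {
                keys.Add((char)('a' + random.Next(26))); // the "typo"
                keys.Add('\b');                          // the correction
            }
            keys.Add(c);
        }
        return keys;
    }

    // Replays keystrokes into a buffer and returns what ends up on screen.
    public static string Replay(IEnumerable<char> keys)
    {
        StringBuilder screen = new StringBuilder();
        foreach (char key in keys)
        {
            if (key == '\b')
            {
                if (screen.Length > 0)
                {
                    screen.Length--; // erase the last character
                }
            }
            else
            {
                screen.Append(key);
            }
            // A real form would pause here (a Timer tick, for example)
            // so that the text appears at a human pace.
        }
        return screen.ToString();
    }
}
```

Because every typo is immediately followed by its own backspace, the text left on screen after the replay is always the original reply.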

In Command\AIML\Game Lexicon mode, the application generates several grammar rules that help direct speech recognition toward certain expected results. Be forewarned: initially loading the AIML grammars takes about two minutes, and occurs in the background. You can continue to touch type conversations with Sophia until the speech recognition engine has finished loading the grammars and speech recognition is available. Using the command grammar, the user can make the computer do the following things: LIST COLORS, LIST GAMES, LIST FONTS, CHANGE FONT TO…, CHANGE FONT COLOR TO…, CHANGE BACKGROUND COLOR TO…. Besides the IBM XT color scheme, a black papyrus font on a linen background also looks very nice. To see a complete list of keywords used by the text-adventure game you have chosen, say “LIST GAME KEYWORDS.” When the game is initially selected, a new set of rules is created based on different two word combinations of the keywords recognized by the game, in order to help speech recognition by narrowing down the total number of phrases it must look for.
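As a rough illustration of the two-word rules mentioned above, here is a minimal sketch that builds every ordered pair of game keywords. The GameGrammarSketch class and TwoWordPhrases method are hypothetical names, and whether SophiaBot generates ordered or unordered pairs is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the SophiaBot source.
static class GameGrammarSketch
{
    // Builds every ordered two-word phrase from the game's keywords,
    // for example "open door" and "door open".
    public static List<string> TwoWordPhrases(IList<string> keywords)
    {
        List<string> phrases = new List<string>();
        foreach (string first in keywords)
        {
            foreach (string second in keywords)
            {
                if (first != second)
                {
                    phrases.Add(first + " " + second);
                }
            }
        }
        return phrases;
    }
}
```

The resulting list could then be handed to the speech API, for instance as new Choices(phrases.ToArray()), so that the engine only has to listen for n × (n − 1) phrases rather than arbitrary speech.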

In dictation mode, the underlying speech engine simply converts your speech into words and has the core SophiaBot code process it in the same manner that it processes text that is typed in. Dictation mode is sometimes better than context-free mode for non-game speech recognition, depending on how well the speech recognition engine installed on your OS has been trained to understand your speech patterns. Context-free mode is typically better for game mode. Command and control only works in context-free mode.

Do Computers Read Electric Books?

In the comments section of a blog I like to frequent, I have been pointed to an article in the International Herald Tribune about Pierre Bayard’s new book, How to Talk About Books You Haven’t Read.

Bayard recommends strategies such as abstractly praising the book, offering silent empathy regarding someone else’s love for the book, discussing other books related to the book in question, and finally simply talking about oneself.  Additionally, one can usually glean enough information from reviews, book jackets and gossip to sustain the discussion for quite a while.

Students, he noted from experience, are skilled at opining about books they have not read, building on elements he may have provided them in a lecture. This approach can also work in the more exposed arena of social gatherings: the book’s cover, reviews and other public reaction to it, gossip about the author and even the ongoing conversation can all provide food for sounding informed.

I’ve recently been looking through some AI experiments built on language scripts, based on the 1966 software program Eliza, which used a small script of canned questions to maintain a conversation with computer users. You can play a web version of Eliza here, if you wish. It should be pointed out that the principles behind Eliza are the same as those that underpin the famous Turing Test. Turing proposed answering the question “Can machines think?” by staging an ongoing experiment to see if machines can imitate thinking. The proposal was made in his 1950 paper Computing Machinery and Intelligence:

The new form of the problem can be described in terms of a game which we call the “imitation game.” It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either “X is A and Y is B” or “X is B and Y is A.” The interrogator is allowed to put questions to A and B thus:

C: Will X please tell me the length of his or her hair?

Now suppose X is actually A, then A must answer. It is A’s object in the game to try and cause C to make the wrong identification. His answer might therefore be:

“My hair is shingled, and the longest strands are about nine inches long.”

In order that tones of voice may not help the interrogator the answers should be written, or better still, typewritten. The ideal arrangement is to have a teleprinter communicating between the two rooms. Alternatively the question and answers can be repeated by an intermediary. The object of the game for the third player (B) is to help the interrogator. The best strategy for her is probably to give truthful answers. She can add such things as “I am the woman, don’t listen to him!” to her answers, but it will avail nothing as the man can make similar remarks.

We now ask the question, “What will happen when a machine takes the part of A in this game?” Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, “Can machines think?”

The standard form of the current Turing experiments is something called a chatterbox application.  Chatterboxes abstract the mechanism for generating dialog from the dialog scripts themselves by utilizing a set of rules written in a common format.  The most popular format happens to be an XML standard called AIML (Artificial Intelligence Markup Language).
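To make the separation between the dialog mechanism and the dialog script concrete, here is a deliberately tiny sketch. It is not AIML and not the interpreter used in SophiaBot; the Rule and ChatterSketch types, the trailing "*" wildcard, and the "{star}" placeholder are simplifications invented for illustration. The engine only walks a list of rules, and all of the "personality" lives in the rule data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, much-simplified stand-in for a scripted rule:
// a pattern (optionally ending in "*") and a reply template.
class Rule
{
    public readonly string Pattern;
    public readonly string Template;

    public Rule(string pattern, string template)
    {
        Pattern = pattern;
        Template = template;
    }
}

static class ChatterSketch
{
    // The engine knows nothing about the dialog itself;
    // it simply applies the first rule that matches.
    public static string Respond(string input, IList<Rule> rules)
    {
        string normalized = input.Trim().ToUpperInvariant();
        foreach (Rule rule in rules)
        {
            if (rule.Pattern.EndsWith("*", StringComparison.Ordinal))
            {
                string prefix = rule.Pattern.Substring(0, rule.Pattern.Length - 1);
                if (normalized.StartsWith(prefix, StringComparison.Ordinal))
                {
                    // Whatever the wildcard matched is echoed into the reply.
                    string star = normalized.Substring(prefix.Length).Trim();
                    return rule.Template.Replace("{star}", star.ToLowerInvariant());
                }
            }
            else if (normalized == rule.Pattern)
            {
                return rule.Template;
            }
        }
        return "Tell me more."; // fallback when no rule matches
    }
}
```

With the rules ("HELLO", "Hi there.") and ("I LIKE *", "Why do you like {star}?"), the input "I like Zork" produces "Why do you like zork?". Swapping in a different rule list changes the conversation without touching the engine.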

What I’m interested in, at the moment, is not so much whether I can write a script that will fool people into thinking they are talking with a real person, but rather whether I can write a script that makes small talk by discussing the latest book.  If I can do this, it should validate Pierre Bayard’s proposal, if not Alan Turing’s.

Speech Recognition And Synthesis Managed APIs in Windows Vista: Part III

Voice command technology, as exemplified in Part II, is probably the most useful and easiest to implement aspect of the Speech Recognition functionality provided by Vista. In a few days of work, any current application can be enabled to use it, and the potential for streamlining workflow and making it more efficient is truly breathtaking. The cool factor, of course, is also very high.

Having grown up watching Star Trek reruns, however, I can’t help but feel that the dictation functionality is much more interesting than the voice command functionality.  Computers are meant to be talked to and told what to do, as in that venerable TV series, not cajoled into doing tricks for us based on finger motions over a typewriter.  My long-term goal is to be able to code by talking into my IDE in order to build UML diagrams and then, at a word, turn that into an application.  What a brave new world that will be.  Toward that end, the SR managed API provides the DictationGrammar class.

Whereas the Grammar class works as a gatekeeper, restricting the phrases that get through to the speech recognized handler down to a select set of rules, the DictationGrammar class, by default, kicks out the jams and lets all phrases through to the recognized handler.

In order to make Speechpad a dictation application, we will add the default DictationGrammar object to the list of grammars used by our speech recognition engine. We will also add a toggle menu item to turn dictation on and off. Finally, we will alter the SpeechToAction() method in order to insert any phrases that are not voice commands into the current Speechpad document as text. Create a local instance of DictationGrammar for our Main form, and then instantiate it in the Main constructor. Your code should look like this:

        #region Local Members
		
        private SpeechSynthesizer synthesizer = null;
        private string selectedVoice = string.Empty;
        private SpeechRecognitionEngine recognizer = null;
        private DictationGrammar dictationGrammar = null;
        
        #endregion
        
        public Main()
        {
            InitializeComponent();
            synthesizer = new SpeechSynthesizer();
            LoadSelectVoiceMenu();
            recognizer = new SpeechRecognitionEngine();
            InitializeSpeechRecognitionEngine();
            dictationGrammar = new DictationGrammar();
        }
        

Create a new menu item under the Speech menu and label it “Take Dictation”. Name it takeDictationMenuItem for convenience. Add a handler for the click event of the new menu item, and stub out TurnDictationOn() and TurnDictationOff() methods. TurnDictationOn() works by loading the local dictationGrammar object into the speech recognition engine. It also needs to turn speech recognition on if it is currently off, since dictation will not work if the speech recognition engine is disabled. TurnDictationOff() simply removes the local dictationGrammar object from the speech recognition engine’s list of grammars.

		
        private void takeDictationMenuItem_Click(object sender, EventArgs e)
        {
            if (this.takeDictationMenuItem.Checked)
            {
                TurnDictationOff();
            }
            else
            {
                TurnDictationOn();
            }
        }

        private void TurnDictationOn()
        {
            if (!speechRecognitionMenuItem.Checked)
            {
                TurnSpeechRecognitionOn();
            }
            recognizer.LoadGrammar(dictationGrammar);
            takeDictationMenuItem.Checked = true;
        }

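        private void TurnDictationOff()
        {
            // UnloadGrammar throws if the grammar is not currently loaded,
            // so check Loaded before unloading.
            if (dictationGrammar != null && dictationGrammar.Loaded)
            {
                recognizer.UnloadGrammar(dictationGrammar);
            }
            takeDictationMenuItem.Checked = false;
        }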
        

For an extra touch of elegance, alter the TurnSpeechRecognitionOff() method by adding a line of code to turn dictation off when speech recognition is disabled:

        TurnDictationOff();

Finally, we need to update the SpeechToAction() method so it will insert any text that is not a voice command into the current Speechpad document.  Use the default statement of the switch control block to call the InsertText() method of the current document.

        
        private void SpeechToAction(string text)
        {
            TextDocument document = ActiveMdiChild as TextDocument;
            if (document != null)
            {
                DetermineText(text);
                switch (text)
                {
                    case "cut":
                        document.Cut();
                        break;
                    case "copy":
                        document.Copy();
                        break;
                    case "paste":
                        document.Paste();
                        break;
                    case "delete":
                        document.Delete();
                        break;
                    default:
                        document.InsertText(text);
                        break;
                }
            }
        }

        

With that, we complete the speech recognition functionality for Speechpad. Now try it out. Open a new Speechpad document and type “Hello World.”  Turn on speech recognition.  Select “Hello” and say delete.  Turn on dictation.  Say brave new.

This tutorial has demonstrated the essential code required to use speech synthesis, voice commands, and dictation in your .Net 2.0 Vista applications. It can serve as the basis for building speech recognition tools that take advantage of default as well as custom grammar rules to build advanced application interfaces. Besides the strange compatibility issues between Vista and Visual Studio, at the moment the greatest hurdle to using the Vista managed speech recognition API is the remarkable dearth of documentation and samples. This tutorial is intended to help alleviate that problem by providing a hands-on introduction to this fascinating technology.

Speech Recognition And Synthesis Managed APIs In Windows Vista: Part II


Playing with the speech synthesizer is a lot of fun for about five minutes (ten if you have both Microsoft Anna and Microsoft Lila to work with)  — but after typing “Hello World” into your Speechpad document for the umpteenth time, you may want to do something a bit more challenging.  If you do, then it is time to plug in your expensive microphone, since speech recognition really works best with a good expensive microphone.  If you don’t have one, however, then go ahead and plug in a cheap microphone.  My cheap microphone seems to work fine.  If you don’t have a cheap microphone, either, I have heard that you can take a speaker and plug it into the mic jack of your computer, and if that doesn’t cause an explosion, you can try talking into it.


While speech synthesis may be useful for certain specialized applications, voice commands, by contrast, are a feature that can be used to enrich any current WinForms application. With the SR Managed API, it is also easy to implement once you understand certain concepts such as the Grammar class and the SpeechRecognitionEngine.


We will begin by declaring a local instance of the speech engine and initializing it. 

        #region Local Members

        private SpeechSynthesizer synthesizer = null;
        private string selectedVoice = string.Empty;
        private SpeechRecognitionEngine recognizer = null;

        #endregion

        public Main()
        {
            InitializeComponent();
            synthesizer = new SpeechSynthesizer();
            LoadSelectVoiceMenu();
            recognizer = new SpeechRecognitionEngine();
            InitializeSpeechRecognitionEngine();
        }

        private void InitializeSpeechRecognitionEngine()
        {
            recognizer.SetInputToDefaultAudioDevice();
            Grammar customGrammar = CreateCustomGrammar();
            recognizer.UnloadAllGrammars();
            recognizer.LoadGrammar(customGrammar);
            recognizer.SpeechRecognized +=
                new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
            recognizer.SpeechHypothesized +=
                new EventHandler<SpeechHypothesizedEventArgs>(recognizer_SpeechHypothesized);
        }

        private Grammar CreateCustomGrammar()
        {
            GrammarBuilder grammarBuilder = new GrammarBuilder();
            grammarBuilder.Append(new Choices("cut", "copy", "paste", "delete"));
            return new Grammar(grammarBuilder);
        }


The speech recognition engine is the main workhorse of the speech recognition functionality.  At one end, we configure the input device that the engine will listen on.  In this case, we use the default device (whatever you have plugged in), though we can also select other inputs, such as specific wave files.  At the other end, we capture two events thrown by our speech recognition engine.  As the engine attempts to interpret the incoming sound stream, it will throw various “hypotheses” about what it thinks is the correct rendering of the speech input.  When it finally determines the correct value, and matches it to a value in the associated grammar objects, it throws a speech recognized event, rather than a speech hypothesized event.  If the determined word or phrase does not have a match in any associated grammar, a speech recognition rejected event (which we do not use in the present project) will be thrown instead.


In between, we set up rules to determine which words and phrases will throw a speech recognized event by configuring a Grammar object and associating it with our instance of the speech recognition engine. In the sample code above, we configure a very simple rule which states that a speech recognized event will be thrown if any of the words “cut”, “copy”, “paste”, or “delete” is uttered. Note that we use a GrammarBuilder class to construct our custom grammar, and that the syntax of the GrammarBuilder class closely resembles the syntax of the StringBuilder class.
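Although Speechpad does not use the speech recognition rejected event, a small console sketch may help show where it fits alongside the recognized event. This is a standalone illustration rather than part of the Speechpad source: the RejectedEventSketch class name is made up, and the program assumes a reference to System.Speech.dll, a working microphone, and the same four-word grammar as above.

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical standalone sketch, not part of Speechpad.
class RejectedEventSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine())
        {
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder(
                new Choices("cut", "copy", "paste", "delete"))));

            // Raised when the input matches a phrase in a loaded grammar.
            recognizer.SpeechRecognized +=
                delegate(object sender, SpeechRecognizedEventArgs e)
                {
                    Console.WriteLine("Recognized: " + e.Result.Text);
                };

            // Raised when speech was heard but matched no loaded grammar.
            recognizer.SpeechRecognitionRejected +=
                delegate(object sender, SpeechRecognitionRejectedEventArgs e)
                {
                    Console.WriteLine("Rejected: no grammar match.");
                };

            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            Console.WriteLine("Speak, then press Enter to quit.");
            Console.ReadLine();
            recognizer.RecognizeAsyncCancel();
        }
    }
}
```

Saying “paste” should print the recognized line, while saying something outside the grammar should print the rejected line instead.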


This is the basic code for enabling voice commands for a WinForms application.  We will now enhance the Speechpad application by adding a menu item to turn speech recognition on and off,  a status bar so we can watch as the speech recognition engine interprets our words, and a function that will determine what action to take if one of our key words is captured by the engine.


Add a new menu item labeled “Speech Recognition” under the “Speech” menu item, below “Read Selected Text” and “Read Document”.  For convenience, name it speechRecognitionMenuItem.  Add a handler to the new menu item, and use the following code to turn speech recognition on and off, as well as toggle the speech recognition menu item.  Besides the RecognizeAsync() method that we use here, it is also possible to start the engine synchronously or, by passing it a RecognizeMode.Single parameter, cause the engine to stop after the first phrase it recognizes. The method we use to stop the engine, RecognizeAsyncStop(), is basically a polite way to stop the engine, since it will wait for the engine to finish any phrases it is currently processing before quitting. An impolite method, RecognizeAsyncCancel(), is also available — to be used in emergency situations, perhaps.

        private void speechRecognitionMenuItem_Click(object sender, EventArgs e)
        {
            if (this.speechRecognitionMenuItem.Checked)
            {
                TurnSpeechRecognitionOff();
            }
            else
            {
                TurnSpeechRecognitionOn();
            }
        }

        private void TurnSpeechRecognitionOn()
        {
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            this.speechRecognitionMenuItem.Checked = true;
        }

        private void TurnSpeechRecognitionOff()
        {
            if (recognizer != null)
            {
                recognizer.RecognizeAsyncStop();
                this.speechRecognitionMenuItem.Checked = false;
            }
        }
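For comparison with the asynchronous calls in the handler above, here is a minimal console sketch of the synchronous alternative. It is a standalone illustration, not part of Speechpad: the class name is hypothetical, and it assumes a reference to System.Speech.dll and a working microphone. Recognize() blocks the calling thread, which is why the WinForms application prefers RecognizeAsync().

```csharp
using System;
using System.Speech.Recognition;

// Hypothetical standalone sketch, not part of Speechpad.
class SingleRecognitionSketch
{
    static void Main()
    {
        using (SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine())
        {
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.LoadGrammar(new Grammar(new GrammarBuilder(
                new Choices("cut", "copy", "paste", "delete"))));

            Console.WriteLine("Say cut, copy, paste, or delete.");

            // Blocks until one phrase is recognized, or gives up after
            // five seconds of initial silence and returns null.
            RecognitionResult result = recognizer.Recognize(TimeSpan.FromSeconds(5));

            Console.WriteLine(result == null
                ? "Nothing recognized."
                : "Recognized: " + result.Text);
        }
    }
}
```

Calling RecognizeAsync(RecognizeMode.Single) instead behaves similarly in that the engine stops after one phrase, but the result arrives through the SpeechRecognized event rather than as a return value.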


We are actually going to use the RecognizeAsyncCancel() method now, since there is an emergency situation. The speech synthesizer, it turns out, cannot operate if the speech recognizer is still running. To get around this, we will need to disable the speech recognizer at the last possible moment, and then reactivate it once the synthesizer has completed its tasks. We will modify the ReadAloud() method to handle this.


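        private void ReadAloud(string speakText)
        {
            try
            {
                SetVoice();
                // The synthesizer cannot speak while the recognizer is running,
                // so pause recognition and resume it only if it was on.
                bool wasListening = this.speechRecognitionMenuItem.Checked;
                if (wasListening)
                {
                    recognizer.RecognizeAsyncCancel();
                }
                synthesizer.Speak(speakText);
                if (wasListening)
                {
                    recognizer.RecognizeAsync(RecognizeMode.Multiple);
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }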

The user now has the ability to turn speech recognition on and off. We can make the application more interesting by capturing the speech hypothesized event and displaying the results in a status bar on the Main form. Add a StatusStrip control to the Main form, and a ToolStripStatusLabel to the StatusStrip with its Spring property set to true. For convenience, call this label toolStripStatusLabel1. Use the following code to handle the speech hypothesized event and display the results:

        private void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
        {
            GuessText(e.Result.Text);
        }

        private void GuessText(string guess)
        {
            toolStripStatusLabel1.Text = guess;
            this.toolStripStatusLabel1.ForeColor = Color.DarkSalmon;
        }


Now that we can turn speech recognition on and off, as well as capture misinterpretations of the input stream, it is time to capture the speech recognized event and do something with it.  The SpeechToAction() method will evaluate the recognized text and then call the appropriate method in the child form (these methods are accessible because we scoped them internal in the Textpad code above).  In addition, we display the recognized text in the status bar, just as we did with hypothesized text, but in a different color in order to distinguish the two events.


        private void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string text = e.Result.Text;
            SpeechToAction(text);
        }

        private void SpeechToAction(string text)
        {
            TextDocument document = ActiveMdiChild as TextDocument;
            if (document != null)
            {
                DetermineText(text);

                switch (text)
                {
                    case "cut":
                        document.Cut();
                        break;
                    case "copy":
                        document.Copy();
                        break;
                    case "paste":
                        document.Paste();
                        break;
                    case "delete":
                        document.Delete();
                        break;
                }
            }
        }

        private void DetermineText(string text)
        {
            this.toolStripStatusLabel1.Text = text;
            this.toolStripStatusLabel1.ForeColor = Color.SteelBlue;
        }


Now let’s take Speechpad for a spin.  Fire up the application and, if it compiles, create a new document.  Type “Hello world.”  So far, so good.  Turn on speech recognition by selecting the Speech Recognition item under the Speech menu.  Highlight “Hello” and say the following phrase into your expensive microphone, inexpensive microphone, or speaker: delete.  Now type “Save the cheerleader, save the”.  Not bad at all.

Speech Recognition And Synthesis Managed APIs In Windows Vista: Part I




VistaSpeechAPIDemo.zip – 45.7 Kb


VistaSpeechAPISource.zip – 405 Kb


Introduction


One of the coolest features to be introduced with Windows Vista is the new built-in speech recognition facility. To be fair, it has been there in previous versions of Windows, but not in the useful form in which it is now available. Best of all, Microsoft provides a managed API with which developers can start digging into this rich technology. For a fuller explanation of the underlying technology, I highly recommend the Microsoft whitepaper. This tutorial will walk you through building a common text pad application, which we will then trick out with a speech synthesizer and a speech recognizer using the .NET managed API wrapper for SAPI 5.3. By the end of this tutorial, you will have a working application that reads your text back to you, obeys your voice commands, and takes dictation. But first, a word of caution: this code will only work for Visual Studio 2005 installed on Windows Vista. It does not work on XP, even with .NET 3.0 installed.

Background


Because Windows Vista has only recently been released, there are, as of this writing, several outstanding problems with developing on the platform. The biggest hurdle is that there are known compatibility problems between Visual Studio and Vista. Visual Studio .NET 2003 is not supported on Vista, and there are currently no plans to resolve any compatibility issues there. Visual Studio 2005 is supported, but in order to get it working well, you will need to make sure you also install Service Pack 1 for Visual Studio 2005. After this, you will also need to install a beta update for Vista called, somewhat confusingly, “Visual Studio 2005 Service Pack 1 Update for Windows Vista Beta”. Even after doing all this, you will find that all the cool new assemblies that come with Vista, such as the System.Speech assembly, still do not show up in your Add References dialog in Visual Studio. If you want them to show up, you will finally need to add a registry entry indicating where the Vista DLLs are to be found. Open the registry editor by running regedit.exe from the Vista search bar. Add the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\AssemblyFolders\v3.0 Assemblies and set its (Default) value to C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0. (You can also add it under HKEY_CURRENT_USER, if you prefer.) Now we are ready to start programming in Windows Vista.

Before working with the speech recognition and synthesis functionality, we need to prepare the ground with a decent text pad application to which we will add on our cool new toys. Since this does not involve Vista, you do not really have to follow through this step in order to learn the speech recognition API.  If you already have a good base application, you can skip ahead to the next section, Speechpad, and use the code there to trick out your app.  If you do not have a suitable application at hand, but also have no interest in walking through the construction of a text pad application, you can just unzip the source code linked above and pull out the included Textpad project.  The source code contains two Visual Studio 2005 projects, the Textpad project, which is the base application for the SR functionality, and Speechpad, which includes the final code.


All the same, for those with the time to do so, I feel there is much to gain from building an application from the ground up. The best way to learn a new technology is to use it oneself and to get one’s hands dirty, as it were, since knowledge is always more than simply knowing that something is possible; it also involves knowing how to put that knowledge to work. We know by doing, or as Giambattista Vico put it, verum et factum convertuntur.


Textpad


Textpad is an MDI application containing two forms: a container, called Main.cs, and a child form, called TextDocument.cs. TextDocument.cs, in turn, contains a RichTextBox control.


Create a new project called Textpad.  Add the “Main” and “TextDocument” forms to your project.  Set the IsMdiContainer property of Main to true.  Add a MainMenu control and an OpenFileDialog control (name it “openFileDialog1”) to Main.  Set the Filter property of the OpenFileDialog to “Text Files | *.txt”, since we will only be working with text files in this project.  Add a RichTextBox control to “TextDocument”, name it “richTextBox1”; set its Dock property to “Fill” and its Modifiers property to “Internal”.


Add a MenuItem control to MainMenu called “File” by clicking on the MainMenu control in Designer mode and typing “File” where the control prompts you to “type here”. Set the File item’s MergeType property to “MergeItems”. Add a second MenuItem called “Window”. Under the “File” menu item, add three more items: “New”, “Open”, and “Exit”. Set the MergeOrder property of the “Exit” item to 2. When we start building the “TextDocument” form, these merge properties will allow us to insert menu items from child forms between “Open” and “Exit”.


Set the MdiList property of the Window menu item to true. This makes the menu automatically keep track of your open child documents at runtime.


Next, we need some operations for our menu commands to trigger. The NewMDIChild() method creates a new instance of the TextDocument form as a child of the Main container. OpenFile() uses the OpenFileDialog control to retrieve the path to a text file selected by the user, and then uses a StreamReader to extract the text of the file (make sure you add a using directive for System.IO at the top of your form). It then calls an overloaded version of NewMDIChild() that takes the file name and displays it as the current document name, and then injects the text from the source file into the RichTextBox control of the current TextDocument. The Exit() method closes our Main form. Add handlers for the File menu items (by double-clicking on them) and then have each handler call the appropriate operation: NewMDIChild(), OpenFile(), or Exit(). That takes care of your Main form.

#region Main File Operations

private void NewMDIChild()
{
    NewMDIChild("Untitled");
}

private void NewMDIChild(string filename)
{
    TextDocument newMDIChild = new TextDocument();
    newMDIChild.MdiParent = this;
    newMDIChild.Text = filename;
    newMDIChild.WindowState = FormWindowState.Maximized;
    newMDIChild.Show();
}

private void OpenFile()
{
    try
    {
        openFileDialog1.FileName = "";
        DialogResult dr = openFileDialog1.ShowDialog();
        if (dr == DialogResult.Cancel)
        {
            return;
        }
        string fileName = openFileDialog1.FileName;
        using (StreamReader sr = new StreamReader(fileName))
        {
            string text = sr.ReadToEnd();
            NewMDIChild(fileName, text);
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void NewMDIChild(string filename, string text)
{
    NewMDIChild(filename);
    LoadTextToActiveDocument(text);
}

private void LoadTextToActiveDocument(string text)
{
    TextDocument doc = (TextDocument)ActiveMdiChild;
    doc.richTextBox1.Text = text;
}

private void Exit()
{
    Dispose();
}

#endregion


To the TextDocument form, add a SaveFileDialog control, a MainMenu control, and a ContextMenuStrip control (set the ContextMenuStrip property of richTextBox1 to this new ContextMenuStrip). Set the SaveFileDialog’s DefaultExt property to “txt” and its Filter property to “Text File | *.txt”. Add “Cut”, “Copy”, “Paste”, and “Delete” items to your ContextMenuStrip. Add a “File” menu item to your MainMenu, and then “Save”, “Save As”, and “Close” menu items to the “File” menu item. Set the MergeType for “File” to “MergeItems”. Set the MergeType properties of “Save”, “Save As” and “Close” to “Add”, and their MergeOrder properties to 1. This creates a nice effect in which the File menu of the child MDI form merges with the parent File menu.
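
To see why these MergeOrder values put the child items between “Open” and “Exit”, it can help to model the merge as a stable ordering by MergeOrder. The sketch below is a simplified model written for this explanation, not the actual Windows Forms merging code; MenuEntry and MenuMergeModel are hypothetical names, and the model assumes that “New” and “Open” keep the default MergeOrder of 0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a menu item: just a caption and a merge order.
public class MenuEntry
{
    public readonly string Text;
    public readonly int MergeOrder;

    public MenuEntry(string text, int mergeOrder)
    {
        Text = text;
        MergeOrder = mergeOrder;
    }
}

// Simplified model of menu merging, for illustration only.
public static class MenuMergeModel
{
    // Combines parent and child entries, ordering them by MergeOrder and
    // keeping the original relative order of entries that tie.
    public static List<string> Merge(IList<MenuEntry> parent, IList<MenuEntry> child)
    {
        List<MenuEntry> merged = new List<MenuEntry>(parent);
        merged.AddRange(child);

        // Insertion sort is stable, unlike List<T>.Sort.
        for (int i = 1; i < merged.Count; i++)
        {
            MenuEntry current = merged[i];
            int j = i - 1;
            while (j >= 0 && merged[j].MergeOrder > current.MergeOrder)
            {
                merged[j + 1] = merged[j];
                j--;
            }
            merged[j + 1] = current;
        }

        List<string> captions = new List<string>();
        foreach (MenuEntry entry in merged)
        {
            captions.Add(entry.Text);
        }
        return captions;
    }
}
```

Feeding this model the parent items New (0), Open (0), Exit (2) and the child items Save (1), Save As (1), Close (1) yields New, Open, Save, Save As, Close, Exit, which is the layout described above.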


The following methods will be called by the handlers for each of these menu items: Save(), SaveAs(), CloseDocument(), Cut(), Copy(), Paste(), Delete(), and InsertText(). Please note that the last five methods are scoped as internal, so they can be called by the parent form. This will be particularly important as we move on to the Speechpad project.


#region Document File Operations

private void SaveAs(string fileName)
{
    try
    {
        saveFileDialog1.FileName = fileName;
        DialogResult dr = saveFileDialog1.ShowDialog();
        if (dr == DialogResult.Cancel)
        {
            return;
        }
        string saveFileName = saveFileDialog1.FileName;
        Save(saveFileName);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void SaveAs()
{
    string fileName = this.Text;
    SaveAs(fileName);
}

internal void Save()
{
    string fileName = this.Text;
    Save(fileName);
}

private void Save(string fileName)
{
    string text = this.richTextBox1.Text;
    Save(fileName, text);
}

private void Save(string fileName, string text)
{
    try
    {
        using (StreamWriter sw = new StreamWriter(fileName, false))
        {
            sw.Write(text);
            sw.Flush();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void CloseDocument()
{
    Dispose();
}

internal void Paste()
{
    try
    {
        IDataObject data = Clipboard.GetDataObject();
        //GetDataObject() returns null when the clipboard is empty
        if (data != null && data.GetDataPresent(DataFormats.Text))
        {
            InsertText(data.GetData(DataFormats.Text).ToString());
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

internal void InsertText(string text)
{
    RichTextBox theBox = richTextBox1;
    theBox.SelectedText = text;
}

internal void Copy()
{
    try
    {
        RichTextBox theBox = richTextBox1;
        Clipboard.Clear();
        Clipboard.SetDataObject(theBox.SelectedText);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

internal void Cut()
{
    Copy();
    Delete();
}

internal void Delete()
{
    richTextBox1.SelectedText = string.Empty;
}

#endregion


Once you hook up your menu item event handlers to the methods listed above, you should have a rather nice text pad application. With our base prepared, we are now in a position to start building some SR features.


Speechpad


Add a reference to the System.Speech assembly to your project.  You should be able to find it in C:\Program Files\Reference Assemblies\Microsoft\Framework\v3.0\.  Add using declarations for System.Speech, System.Speech.Recognition, and System.Speech.Synthesis to your Main form. The top of your Main.cs file should now look something like this:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;
using System.Speech;
using System.Speech.Synthesis;
using System.Speech.Recognition;

In design view, add two new menu items to the main menu in your Main form, labeled “Select Voice” and “Speech”. For easy reference, name the first item selectVoiceMenuItem. We will use the “Select Voice” menu to programmatically list the synthetic voices that are available for reading Speechpad documents, using the three methods in the code sample below. LoadSelectVoiceMenu() loops through all voices that are installed on the operating system and creates a new menu item for each. voiceMenuItem_Click() is simply a handler that passes the click event on to the SelectVoice() method. SelectVoice() handles the toggling of the voices we have added to the “Select Voice” menu. Whenever a voice is selected, all others are deselected. If all voices are deselected, then we default to the first one.


Now that we have gotten this far, I should mention that all this trouble is a little silly if there is only one synthetic voice available, as there is when you first install Vista. Her name is Microsoft Anna, by the way. If you have Vista Ultimate or Vista Enterprise, you can use the Vista Updater to download an additional voice, named Microsoft Lili, which is contained in the Simplified Chinese MUI. She has a bit of an accent, but I am coming to find it rather charming. If you don’t have one of the high-end flavors of Vista, however, you might consider leaving the voice selection code out of your project.


private void LoadSelectVoiceMenu()
{
    foreach (InstalledVoice voice in synthesizer.GetInstalledVoices())
    {
        MenuItem voiceMenuItem = new MenuItem(voice.VoiceInfo.Name);
        voiceMenuItem.RadioCheck = true;
        voiceMenuItem.Click += new EventHandler(voiceMenuItem_Click);
        this.selectVoiceMenuItem.MenuItems.Add(voiceMenuItem);
    }
    if (this.selectVoiceMenuItem.MenuItems.Count > 0)
    {
        this.selectVoiceMenuItem.MenuItems[0].Checked = true;
        selectedVoice = this.selectVoiceMenuItem.MenuItems[0].Text;
    }
}

private void voiceMenuItem_Click(object sender, EventArgs e)
{
    SelectVoice(sender);
}

private void SelectVoice(object sender)
{
    MenuItem mi = sender as MenuItem;
    if (mi != null)
    {
        //toggle checked value
        mi.Checked = !mi.Checked;

        if (mi.Checked)
        {
            //set selectedVoice variable
            selectedVoice = mi.Text;
            //clear all other checked items
            foreach (MenuItem voiceMi in this.selectVoiceMenuItem.MenuItems)
            {
                if (!voiceMi.Equals(mi))
                {
                    voiceMi.Checked = false;
                }
            }
        }
        else
        {
            //if deselecting, make first value checked,
            //so there is always a default value
            this.selectVoiceMenuItem.MenuItems[0].Checked = true;
            //keep selectedVoice in step with the checked item
            selectedVoice = this.selectVoiceMenuItem.MenuItems[0].Text;
        }
    }
}
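
The selection rule described earlier (selecting a voice deselects all the others, and deselecting the current voice falls back to the first one) can also be expressed without any menu classes, which makes it easier to reason about. The following is only a sketch of that rule; VoiceSelection and Toggle are hypothetical names that do not appear in the Speechpad source.

```csharp
using System;

// Hypothetical helper that models the "Select Voice" rule without any UI classes.
public static class VoiceSelection
{
    // Given the current check states and the index that was clicked,
    // updates the states in place and returns the index left checked.
    public static int Toggle(bool[] isChecked, int clickedIndex)
    {
        if (isChecked == null || isChecked.Length == 0)
        {
            throw new ArgumentException("At least one voice is required.", "isChecked");
        }
        if (clickedIndex < 0 || clickedIndex >= isChecked.Length)
        {
            throw new ArgumentOutOfRangeException("clickedIndex");
        }

        // Clicking an unchecked item selects it. Clicking the checked item
        // deselects it, in which case the first voice becomes the default.
        int selected = isChecked[clickedIndex] ? 0 : clickedIndex;

        for (int i = 0; i < isChecked.Length; i++)
        {
            isChecked[i] = (i == selected);
        }
        return selected;
    }
}
```

Whatever index comes back is the one whose name should be stored in selectedVoice, so that the checked menu item and the voice actually used for reading never drift apart.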


We have not declared the selectedVoice class-level variable yet (IntelliSense may have complained about it), so the next step is to do just that. While we are at it, we will also declare a private instance of the System.Speech.Synthesis.SpeechSynthesizer class and initialize it in the constructor, along with a call to the LoadSelectVoiceMenu() method from above:


#region Local Members

private SpeechSynthesizer synthesizer = null;
private string selectedVoice = string.Empty;

#endregion

public Main()
{
    InitializeComponent();
    synthesizer = new SpeechSynthesizer();
    LoadSelectVoiceMenu();
}


To let the user try out the speech synthesizer, we will add two new menu items under the “Speech” menu labeled “Read Selected Text” and “Read Document”. In truth, there isn’t really much to using the Vista speech synthesizer. All we do is pass a text string to our local SpeechSynthesizer object and let the operating system do the rest. Hook up event handlers for the click events of these two menu items to the following methods and you will be up and running with a speech-enabled application:


#region Speech Synthesizer Commands

private void ReadSelectedText()
{
    TextDocument doc = ActiveMdiChild as TextDocument;
    if (doc != null)
    {
        RichTextBox textBox = doc.richTextBox1;
        if (textBox != null)
        {
            string speakText = textBox.SelectedText;
            ReadAloud(speakText);
        }
    }
}

private void ReadDocument()
{
    TextDocument doc = ActiveMdiChild as TextDocument;
    if (doc != null)
    {
        RichTextBox textBox = doc.richTextBox1;
        if (textBox != null)
        {
            string speakText = textBox.Text;
            ReadAloud(speakText);
        }
    }
}

private void ReadAloud(string speakText)
{
    try
    {
        SetVoice();
        synthesizer.Speak(speakText);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

private void SetVoice()
{
    try
    {
        synthesizer.SelectVoice(selectedVoice);
    }
    catch (Exception)
    {
        MessageBox.Show("\"" + selectedVoice + "\" is not available.");
    }
}

#endregion