Transferring chat to a human agent using Microsoft Bot Framework

Source Code: Human Handover Bot

One of the questions I am asked most frequently is how to transfer a chat from the bot to a human. This is especially necessary if your bot operates in the customer service space. Chatbots are not meant to (or at least, are not currently mature enough to) completely replace humans. Many a time a chatbot will fail to answer satisfactorily, or the user will simply want to talk to a human from the start. When this happens, the chatbot should transfer the chat to a human agent or a customer care representative. But how can we achieve that?

In this article, I will give an overview of how we can integrate a live chat into our bot using Microsoft Bot Framework. Microsoft Bot Framework is highly extensible and lets us do this easily. The source code is available over at my GitHub repo.

High Level Overview

Our bot is the central piece of the whole solution. Apart from performing all its normal functionality, our bot also acts as a proxy between user and agent. So, what is required to create this feature and who are the actors involved?

Actors Involved

  • Bot: Well, we have our bot (duh!).
  • Users: Users are our customers who will be using our bot. They can be on any channel supported by Bot Framework.
  • Agent: Agents are the humans who chat with our users. An agent also needs a chat window; for this we will use Bot Framework Web Chat as the agents’ dashboard.

Running the bot

Let us first run the bot and see how it works. Nothing helps understanding more than running the code and seeing it first hand.

To run the bot, follow these steps -

  1. Create a LUIS app by importing LuisModel\AgentTransfer.json.
    The bot uses LUIS to understand whether the user wants to talk to an agent. If you don’t want to use LUIS, I have included an EchoDialog class which also works. You will need to modify the code to start EchoDialog instead of TransferLuisDialog when a message arrives (left as an exercise to the reader). If you do this, skip to step 3.

  2. Get the ModelId and SubscriptionKey of the LUIS app and paste them into the LuisModel attribute in TransferLuisDialog.cs.

  3. Run the solution by pressing F5. By default, it should start at port 3979. If not, note the port it runs on and stop debugging.

  4. We will use ngrok to debug the bot locally. Download and run ngrok using the command ngrok.exe http --host-header=rewrite <port number>. Copy the Forwarding URL (https) which is generated by ngrok.

  5. Register a new bot in the Bot Framework Portal using the URL generated by ngrok. Copy the Microsoft App Id and App Password and paste them into web.config.

  6. The Agent Dashboard uses Direct Line as a channel, so enable Direct Line and keep its Secret Key handy.

  7. Run the solution once again.

To open the Agent Dashboard, go to http://localhost:3979/agentdashboard/index.html?s=DIRECTLINE_SECRET_KEY. Change the port number accordingly if it is not 3979. Notice the query string ?s=; enter the Direct Line secret key as its value.

You will see a page similar to the one below.

Agent Dashboard

Click the Connect button to register yourself as an agent. If registration succeeds, the heading on the page changes to show the “Connected” status. This makes the agent available for chat.

Use any other channel (Skype or the web chat at the bot portal) to simulate a user. Currently there is only one LUIS intent, AgentTransfer, which is triggered by typing “Connect me with customer care”. Enter it to start talking with the agent.
Using the emulator will not work.

If you are using EchoDialog, agent transfer can be triggered by typing anything starting with the letter ‘a’. Anything else is simply echoed back.

Once the user has initiated a conversation with an agent, any message the user sends is delivered to the agent (instead of being handled by the bot), and vice versa.

To stop the conversation with the user, click the Stop Conversation with User button on the agent dashboard. Click Disconnect to remove the agent from the availability pool.

We will see how each of these works shortly, but first let us understand some of the concepts involved.

Building Blocks

Let us understand what we did while running the code. We will divide the flow into logical groups -

  • Initiating Transfer: A user can initiate a transfer to an agent at any time. Initiation succeeds if an agent is available for chat. A user can only talk to one agent at a time. Once initiated, all messages from the user are routed to the agent instead of being handled by the current Dialog.

  • Agent Availability: An agent is termed available if they are not already in an existing conversation with a user. This effectively means that an agent can only talk to one user at a time. In other words, Agent and User have a 1:1 mapping.

  • Agent User Mapping: We established that an agent and a user have a 1:1 mapping. Since we have to route messages back and forth, we must maintain this mapping somewhere.

  • Message Routing: A message is routed by fetching the Agent User Mapping and sending the current message to its counterpart. For example, if a user sends a message, we fetch the agent associated with that user and send the message text to the agent. The same applies the other way around.

  • Stopping Conversation: Stopping the conversation should prevent the bot from routing any further messages between agent and user. This effectively means that we remove the Agent User Mapping. Stopping the conversation also makes the agent available once again.

  • Disconnecting Agent: Disconnecting an agent means we remove the agent from the availability pool. No further initiation can happen with this agent.

Solution Structure

This is a pretty lengthy explanation, so I suggest you keep the solution open and follow along as I explain each class.

Solution Structure

The most important pieces of code which I want to highlight are

  • Agent folder contains everything related to agent management and routing implementation.
  • AgentDashboard folder contains index.html, which has the Web Chat control embedded. We will use this page for the agent to chat; we will see how it works later.
  • Scorable folder contains two IScorable implementations which serve as middleware to route messages. We will get into the details later.
  • AgentModule class contains Autofac registrations for our project.

There are five key interfaces in our solution, all lying in Agent folder. They are -

  • IAgentProvider: Contains methods for adding, removing, and getting the next available agent. When an agent connects, we add the agent to the availability pool using the AddAgent method. Similarly, the RemoveAgent method is used to remove the agent. The GetNextAvailableAgent method should get the next available agent from the availability pool and remove the agent from the pool in an atomic way, so that the same agent is not returned twice.

  • IUserToAgent: As the name suggests, is used to send messages from user to agent. Its method SendToAgentAsync does exactly that. It contains two other methods - IntitiateConversationWithAgentAsync to initiate a transfer for the first time, and AgentTransferRequiredAsync to check if routing is required.

  • IAgentToUser: Contains a single method SendToUserAsync to send the message from agent to user.

  • IAgentUserMapping: Contains methods for adding, removing and fetching the Agent User Mapping.

  • IAgentService: Acts as a business class, mainly for registering an agent, unregistering an agent, and stopping a conversation. In addition, it contains methods to check whether an agent is in an existing conversation and whether a message is from an actual/valid agent.

Apart from the interfaces, there are two scorables in the Scorable folder. A Scorable in Bot Framework acts as middleware for incoming messages. Using a Scorable, we can intercept a message and take decisions before it is sent to the waiting dialog. We have the following scorables in place -

  • AgentToUserScorable: Intercepts messages coming from the agent and routes them to the user if the agent is in a conversation with a user.

  • UserToAgentScorable: Intercepts messages coming from the user and routes them to the agent if the user is in a conversation with an agent.

Availability Pool

When an agent connects, we add them to the availability pool. InMemoryAgentStore, which implements IAgentProvider, maintains this pool using an in-memory ConcurrentDictionary. The implementation details are not important; what matters is that it mimics a “queue”, guaranteeing that an agent is only fetched once.
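To make the queue-like behaviour concrete, here is a minimal sketch of such an atomic fetch-and-remove. The exact method signatures and the Agent.Id property are assumptions for illustration; see InMemoryAgentStore in the repo for the real code.

using System.Collections.Concurrent;

public class InMemoryAgentStore : IAgentProvider
{
    // Keyed by agent id; ConcurrentDictionary gives us atomic add/remove.
    private readonly ConcurrentDictionary<string, Agent> pool =
        new ConcurrentDictionary<string, Agent>();

    public bool AddAgent(Agent agent) => pool.TryAdd(agent.Id, agent);

    public bool RemoveAgent(Agent agent)
    {
        Agent removed;
        return pool.TryRemove(agent.Id, out removed);
    }

    public Agent GetNextAvailableAgent()
    {
        foreach (var key in pool.Keys)
        {
            Agent agent;
            // TryRemove succeeds for exactly one caller per key, so the
            // same agent can never be handed out twice.
            if (pool.TryRemove(key, out agent))
                return agent;
        }
        return null; // nobody available right now
    }
}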

In an actual production scenario, you would maintain this pool in an out-of-process store or queue such as RabbitMQ, and implement IAgentProvider to interface with it.

Agent Registration

Agent registration is done through RegisterAgentAsync(IActivity activity, CancellationToken cancellationToken) of AgentService. This method is called when the agent clicks the “Connect” button in the dashboard. AgentService is a concrete implementation of IAgentService.

The RegisterAgentAsync method first adds a new instance of Agent to the availability pool using IAgentProvider’s AddAgent(Agent agent) method.

Once this succeeds, it adds metadata to the agent’s UserData store. We use this metadata to identify whether an incoming message is from an agent or a user.
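A rough sketch of what storing such a flag looks like with Bot Builder’s IBotData (the key name "IsAgent" is hypothetical, and botData is assumed to be an IBotData instance resolved for the agent’s activity):

await botData.LoadAsync(cancellationToken);
botData.UserData.SetValue("IsAgent", true);  // hypothetical key marking this sender as an agent
await botData.FlushAsync(cancellationToken); // persist the change to the Bot State Service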

In a production use case this is not important, as your agent would most likely be required to log in, and you could therefore identify them by a valid token (or in any other way depending upon your requirements). I added the metadata to keep things simple for this sample.

Disconnecting Agent

When the agent clicks the Disconnect button on the dashboard, we simply remove the agent from the availability pool by calling the UnregisterAgentAsync(IActivity activity ...) method in IAgentService. The same method also clears the metadata stored in the agent’s store.

In a production scenario, you would not allow an agent to disconnect while already in a conversation with a user. However, I have not implemented this in the sample.

Agent User Mapping

Agent User Mapping is a crucial piece of our overall design. Methods for setting and getting this mapping are present in the IAgentUserMapping interface. BotStateMappingStorage implements this interface and provides proper storage of these mappings. The mapping is not stored in memory; instead, it is stored in the Bot State Service.

To give a brief background, Microsoft Bot Framework provides three stores for user state. These are -

  • User Data store: state for a user on a channel, shared across all of that user’s conversations.

  • Conversation Data store: state for a single conversation, shared by all of its members.

  • Private Conversation Data store: state for a single user within a single conversation.

We utilize these to store the Agent User Mapping in the following ways -

  • The Agent which the user is talking to is stored in User’s Private Conversation Store.

  • The User which the agent is talking to is stored in Agent’s Private Conversation Store.

This clever design (😄) makes the Agent User Mapping storage distributed and moves the responsibility of maintaining it to Bot State Service.

But what do we save in the states?
We save the address of the agent and the user respectively - more specifically, the ConversationReference of each. A ConversationReference can then be used to route a message to its receiver on the proper channel.
We have two classes named Agent and User, each having a property of type ConversationReference. We store Agent and User class instances into the User and Agent stores respectively.
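A minimal sketch of persisting that mapping through IBotData (the key names and variables are hypothetical; BotStateMappingStorage holds the real implementation):

// User side: remember which agent this user is talking to.
await userBotData.LoadAsync(cancellationToken);
userBotData.PrivateConversationData.SetValue("Agent",
    new Agent { ConversationReference = agentReference });
await userBotData.FlushAsync(cancellationToken);

// Agent side: remember which user this agent is talking to.
await agentBotData.LoadAsync(cancellationToken);
agentBotData.PrivateConversationData.SetValue("User",
    new User { ConversationReference = userReference });
await agentBotData.FlushAsync(cancellationToken);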

Initiating Conversation

When a user wants to connect to an agent, we call IntitiateConversationWithAgentAsync of IUserToAgent. The method first checks whether any agent is available and fetches the next agent from the availability pool. Once we get an agent, we create the Agent User Mapping and store it in the states as described in the previous section.

Message Routing

A message is routed by fetching the Agent User Mapping. When a message arrives, we retrieve the state associated with the sender. Since our Agent User Mapping is maintained in the state, we get that information too.

User to Agent Route

When a user sends a message, UserToAgentScorable checks whether the message needs to be routed to an agent. This check is done by calling the AgentTransferRequiredAsync method in IUserToAgent.

AgentTransferRequiredAsync simply checks whether the user has an agent mapping in their store. If we find an Agent instance there, it means the user is in a conversation with an agent. The scorable then routes the message to the agent by calling the SendToAgentAsync method in IUserToAgent.

SendToAgentAsync uses the agent’s ConversationReference to create a new Activity and sends it to the agent through the Bot Connector.
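This is the standard Bot Builder v3 proactive-messaging pattern; roughly like the sketch below, where agent is the mapping fetched from state and text is the user’s message text (both assumptions for illustration):

// Build an outgoing activity addressed to the agent from the stored reference.
var message = agent.ConversationReference.GetPostToUserMessage();
message.Text = text;

// Deliver it on the agent's own channel through the Bot Connector.
using (var connector = new ConnectorClient(new Uri(message.ServiceUrl)))
{
    await connector.Conversations.SendToConversationAsync((Activity)message);
}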

Because of our implementation as a Scorable, we are not modifying the user’s DialogStack. This means that when the conversation with the agent is stopped, the user returns to the same state (Dialog) he was in with the bot before the transfer.

Agent to User Route

The agent-to-user flow is very similar. When a message arrives, AgentToUserScorable first checks whether the sender is an agent. It does so by checking the metadata we stored when registering the agent.
Depending upon your requirements, you would have your own logic for checking whether the Activity is from an actual agent.

Once we get a message from a valid agent, we check whether the agent is already in an existing conversation. This is done in a similar way as described in the last section. If we get a valid User in the agent’s store, AgentToUserScorable routes the message to the user by calling the SendToUserAsync method in IAgentToUser.

Stopping Conversation

Stopping a conversation simply means removing the Agent User Mapping. Once this mapping is removed, no further messages are routed from either user or agent. The implementation is in the AgentService class. In this sample, only an agent can stop the conversation, by clicking the “Stop Conversation with User” button in the dashboard.

Agent Dashboard

As mentioned before, we use Bot Framework Web Chat as the channel for the agent. But instead of just using an <iframe>, we reference the JavaScript directly. I have included botchat.js and botchat.css in the solution by building the Bot Framework Web Chat project. Directly referencing the web chat allows us to use its advanced features, which we will see below.

To give a short introduction, Web Chat uses the Redux architecture to manage its state. It uses DirectLineJs to connect to the Direct Line API. DirectLineJs uses RxJs to create an Observable which we can subscribe to in order to receive events.

First, we create a DirectLine instance. Notice the Direct Line secret is passed through a query string parameter.

var botConnection = new BotChat.DirectLine({
    secret: params['s'],
    token: params['t'],
    domain: params['domain'],
    webSocket: params['webSocket'] && params['webSocket'] === "true"
});

Next, we call BotChat.App, passing the DirectLine instance we created above.

BotChat.App({
    botConnection: botConnection,
    user: user,
    bot: bot
}, document.getElementById("BotChatGoesHere"));

Tip: You can specify an id and a name in the user object. These values will be reflected in the From field of the Activity received by our bot.

Now comes the interesting part. The two buttons on the page do not make Ajax calls to any controller explicitly. Instead, they use DirectLineJs to send a message to our bot. These messages are different from the messages sent when a user types something in the chat window: they have a different type.

If you have noticed, our Activity class has a Type field. A normal chat message Activity has Type = "message". You might be aware that there are messages with different types such as conversationUpdate, typing, ping etc. Messages of some of these types are sent by Bot Framework itself; for example, conversationUpdate is sent when a member is added to or removed from a conversation.

There is another type called event, which represents an external event. As of now, Bot Framework does not send any messages of type event by default. This is left for us developers to use depending upon our requirements. We can create a message of type event and send it to our bot. The bot receives it as a normal Activity with all the relevant fields populated, such as From, Recipient, Conversation etc.

In this example, we send messages of type event on button clicks. For the Connect button we send a message of type event with name="connect". Similarly, for disconnect we send a message with name="disconnect".

const connect = () => {
    var name;
    if(!connected)
        name = "connect"
    else
        name = "disconnect"
    botConnection
        .postActivity({type: "event", value: "", from: user, name: name})
        .subscribe(connectionSuccess);
};

To send messages we use the postActivity method of botConnection. We then subscribe to it to get back the status of whether it was successful or not.

The Stop Conversation button works in exactly the same way.

const stopConversation = () => {
    botConnection
        .postActivity({type: "event", value: "", from: user, name: "stopConversation"})
        .subscribe(id => console.log("success"));
};

In our bot, we handle these messages in the HandleSystemMessage method of the MessageController class.

else if (message.Type == ActivityTypes.Event)
{
    using (var scope = DialogModule.BeginLifetimeScope(Conversation.Container, message))
    {
        var cancellationToken = default(CancellationToken);
        var agentService = scope.Resolve<IAgentService>();
        switch (message.AsEventActivity().Name)
        {
            case "connect":
                await agentService.RegisterAgentAsync(message, cancellationToken);
                break;
            case "disconnect":
                await agentService.UnregisterAgentAsync(message, cancellationToken);
                break;
            case "stopConversation":
                await StopConversation(agentService, message, cancellationToken);
                await agentService.RegisterAgentAsync(message, cancellationToken);
                break;
            default:
                break;
        }
    }
}

This is how the agent connects to and disconnects from our bot. By using Direct Line to send these “event” messages, we get the full context of who raised the event. It also eliminates the need to create any “supporting” endpoints just to send events to the bot.

Almost Done

This completes my tutorial on creating a bot that transfers chats to a human. I have explained all the major concepts involved. I have tried to make the code as extensible as possible, so it could serve as a reference if you want to achieve the same thing.

I cannot close this article without mentioning one major shortcoming of this approach. The Web Chat channel and Direct Line do not send any event if the client disconnects. If an agent loses connectivity or closes the dashboard, we do not receive any event about it. This means there is no way of knowing whether the agent is online. It is especially a problem if the agent is in a conversation with a user and there is a sudden network failure at the agent’s end. Ideally, in this scenario I would want to re-route the user to the next available agent. But since we don’t receive any connectivity information from Direct Line by default, it is left to us to implement a solution.

Let me know your thoughts in the comments below, and ask any questions you may have.
By the way, do share the blog if you liked it.

Skype for Business bot using UCWA

I recently wrote a post on how to create a Skype for Business chatbot. In it, I used the Lync 2013 SDK to intercept messages and pass them to the bot. However, I mentioned that there is a better way to achieve the same using the Unified Communications Web API 2.0 (UCWA). Since then I have received a lot of requests to write a post on how to do that. Though I had the code available with me (thanks to Om Shrivastava, my colleague), I did not post it because it was in very bad shape (you will see). But since there was a lot of demand for it, and after discussing with my readers (thank you Dan Williams and Hitesh), I finally got down to doing some cleaning up. You can find the source code here. The source code is based on Tam Huynh’s UCWA sample, a really well written sample which I then made a mess of.

What is UCWA and why should I care?

From Microsoft’s own words:

Microsoft Unified Communications Web API 2.0 is a REST API that exposes Skype for Business Server 2015 instant messaging (IM) and presence capabilities.

OK, so it is a set of APIs, but why can’t I just keep on using the Lync 2013 SDK for my bot as I did previously?

Well, using the Lync 2013 SDK has one major demerit: it requires the bot to run on a system where Skype For Business (SFB)/Lync 2013 is installed and running. That means you are tied down to a machine, which also creates problems with scaling. Plus, you are dependent upon a 4-year-old SDK which is no longer recommended by Microsoft.

UCWA solves all these problems. Using UCWA, we no longer need an SFB client running on the system. The bot can be deployed anywhere and scaled independently.

UCWA has a lot of capabilities; however, what interests us most is how to send and receive messages. Each of these tasks requires us to send a series of HTTP requests, in order, to UCWA endpoints. I recommend you read through the above links to understand how it works.

Getting Started

Before even delving into code, you need to set up a lot of things. When developing UCWA applications you need to target either Skype For Business Online or Skype for Business Server (on-premise). Both have different setup procedures. I recommend you read through the MSDN documentation to understand the differences. In this article and the accompanying code, I will only work with Skype For Business Online, primarily because I don’t have an on-premise installation.

The prerequisite is that you must have an Office 365 subscription and access to the Azure Active Directory to which O365 is linked. For setup, you will also need to grant permissions to our app in Azure Active Directory, which only an AD admin can do. Also create two users in Active Directory: one for the bot to sign in as, and another to test sending messages to the bot.

Once you have these things ready, the next step is to create an application in Azure Active Directory. I recommend you follow Tam Huynh’s excellent guide.

Once you have the app in Azure AD properly set up, keep the TenantID, the app’s ClientID and the app’s RedirectURI handy, as we will need them in the code.

Understanding the solution structure

The code itself is just a Console Application. The solution contains 4 folders:

  • The Bot folder contains the Dialog class and an implementation of IBotToUser. All these classes are related to Bot Framework. I have used EchoDialog from the Bot Framework samples, which echoes back the message with a count.
  • The UcwaSfbo folder mostly contains classes as-is from Tam Huynh’s sample, except for UcwaReciveMessage.cs and UcwaSendMessage.cs. As the names suggest, these two classes are used to receive messages from and send messages to SFB.
  • Utilities.cs contains some convenience methods.
  • The Data folder contains auto-generated classes that represent UCWA JSON responses.

Code Smell

The Data folder is a mess. All these classes were auto-generated based on responses from the UCWA APIs, so there are a lot of duplicate classes. I tried to clean some of it up but couldn’t find the time to see it through. UcwaReciveMessage.cs and UcwaSendMessage.cs are also not very well written; they were hastily written as a first attempt at a PoC on UCWA. Once you understand what is happening, I suggest you rewrite them for your own applications.

Setting up the code

Open Program.cs and you will see some static strings. Replace the values of tenantId, clientId and redirectUri with what you copied before. hardcodedUsername and hardcodedPassword are the credentials for the user that you want the bot to sign in as. If you don’t want to hardcode the credentials, that is fine too, as we will see later. destinationAddress is not used, so you can leave it as it is.

Go to App.config and enter a valid MicrosoftAppId and MicrosoftAppPassword for an existing bot registration in the Bot Framework portal. This is required because we will be using the Bot State Service to store the conversation state.

Once done, run the sample and you will be greeted by a console message asking you to choose a login style. If you are running the project for the first time after creating the AD app, choose the dialog option. This is needed because you will be asked to provide consent, which requires a web page, so the console login doesn’t work. This only needs to be done once. Next time you can just use the console option, and the bot will sign in using the hardcoded credentials we defined before. If you don’t want to hardcode credentials, your only choice is the dialog option.

If the program started successfully, you will see JSON responses in the console. Log in to Skype for Business as the other user and send a message to the bot. The bot should echo back what you typed. Great! Our bot is working as expected.

Under the hood

Once the bot signs in using the credentials provided, it polls UCWA for incoming messages. As mentioned before, you need to send a series of requests to UCWA in a specific order for this to work. All of this is handled in the UcwaReciveMessage.cs class. When you send a message to the bot, the message is actually received in the GetIM_Step03_Events method. Once I get the message, I create the Activity object with the minimum information required.

string SendMessageUrl = item1._embedded.message._links.messaging.href;
var conversationId = item.href.Split('/').Last();
var fromId = item1._embedded.message._links.contact.href.Split('/').Last();

Activity activity = new Activity()
{
    From = new ChannelAccount { Id = fromId, Name = fromId },
    Conversation = new ConversationAccount { Id = conversationId },
    Recipient = new ChannelAccount { Id = "Bot" },
    ServiceUrl = "https://skype.botframework.com",
    ChannelId = "skype",
    ChannelData = SendMessageUrl
};

activity.Text = message;

The conversationId and fromId are extracted from the JSON response. These are required because the conversation state is stored using these keys. SendMessageUrl is required to reply to the user; we store it in the ChannelData property of the Activity.

Once the activity object is properly initialized, we jump-start the bot and pass the activity object in as the incoming message. Instead of starting the bot here, or if you have an existing bot, you could use the Direct Line API to send the message to the bot.

using (var scope = Microsoft.Bot.Builder.Dialogs.Conversation
    .Container.BeginLifetimeScope(DialogModule.LifetimeScopeTag, 
    builder => Configure(builder)))
{
    scope.Resolve<IMessageActivity>
        (TypedParameter.From((IMessageActivity)activity));
    DialogModule_MakeRoot.Register
        (scope, () => new EchoDialog());
    var postToBot = scope.Resolve<IPostToBot>();
    await postToBot.PostAsync(activity, CancellationToken.None);
}

In the Configure method I register my custom implementation of IBotToUser.

private static void Configure(ContainerBuilder builder)
{
    builder.RegisterType<BotToUserLync>()
       .As<IBotToUser>()
       .InstancePerLifetimeScope();
}

BotToUserLync reads the ChannelData property of the Activity and calls the SendIM_Step05 method of UcwaSendMessage, which sends a request to UCWA to reply to the user.
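Conceptually, that reply boils down to an authenticated HTTP POST of plain text to the messaging link we saved in ChannelData. The sketch below is illustrative only - the exact resource path and headers are dictated by UCWA (see UcwaSendMessage.cs for the real request sequence), and ucwaHost, accessToken, sendMessageUrl and replyText are assumed inputs.

// Illustrative only: POST the reply text to the UCWA messaging link.
using (var client = new HttpClient())
{
    // accessToken: the bearer token obtained during the AAD sign-in (assumption).
    client.DefaultRequestHeaders.Authorization =
        new AuthenticationHeaderValue("Bearer", accessToken);

    // sendMessageUrl: the messaging href we stored in Activity.ChannelData.
    var content = new StringContent(replyText, Encoding.UTF8, "text/plain");
    var response = await client.PostAsync(new Uri(new Uri(ucwaHost), sendMessageUrl), content);
    response.EnsureSuccessStatusCode();
}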

Conclusion

Using UCWA is easy once you understand how it works. However, it is a tiring process to write code against it. There are a lot of steps to follow in a particular order, and the lack of an SDK makes it more difficult. The present sample is not at all production ready. I recommend you use this sample to understand what happens inside, and implement a better (and cleaner) solution if you are developing for a live environment.

I hope this article was helpful. If you have any questions, please post a comment below.

Integrating CRIS with Microsoft Bot Framework

A couple of months ago I wrote an article on how to Skype-call a bot. Behind the scenes, the bot used the Bing Speech API to perform Speech-To-Text (STT) conversion to get a plaintext output of what the user spoke. It was all cool, but I was fairly disappointed with the accuracy of Bing Speech. It failed miserably with domain-specific terminology and also did not perform well with my (Indian) accent. It also did not fare well in a noisy environment.

All of these issues go away with a new service called Custom Speech Service (CRIS), which Microsoft made available as a public preview earlier this month. You may be wondering what the letters R and I stand for in CRIS. Well, CRIS was earlier known as Custom Recognition Intelligent Service, but Microsoft renamed it to Custom Speech Service (though I believe the former sounded much cooler).

CRIS lets us create customized language and acoustic models. What are these models?

  • Language Model: The language model decides which sequences of words are more likely to occur in a sentence. It does this by creating a probability distribution over sequences of words (see the formula after this list). You train the language model by providing a plaintext file containing a list of sentences similar to what users would speak.

  • Acoustic Model: The acoustic model breaks down short fragments of audio and classifies them into phonemes. This helps the system recognize domain-specific words in a sentence such as “When did Neanderthals become extinct”. Acoustic models are trained by providing audio files of speech data and a text file of the corresponding transcripts. The audio data should be as close as possible to the environment where you expect your app/bot to be used most. For example, if you expect your users to use your bot on the road, you should provide audio files of people speaking on the road. The acoustic model can then learn the environment and work much better.
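For instance, under a simple bigram formulation (a common way such language models are built; CRIS does not document its exact internals), the probability of a sentence factors into conditional word probabilities:

P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})

so word sequences that appear in your training sentences receive higher probability and win during recognition.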

More details about the models are available in the documentation. In this sample, we would only train the Language model and use the base acoustic model.

Getting Started

To use CRIS you need a subscription from Azure. Don’t worry, CRIS is free up to 5,000 requests/month, so you can try it out. Once you get your subscription key, you need to add it to the CRIS portal. Follow the guide to get it done.

We will be using the same source code that I created for the Skype-call-a-bot post; we will just modify it to support CRIS. I recommend you go through that post first before continuing. Since we are using the same code base, we inherit all the bad design decisions I described in the previous post, especially how the response is sent. I absolutely dislike the way I did it; you are better off using some other way (preferably reactive programming) to achieve the same. In any case, the source code can be found here.

Training Language Model

As mentioned above, the training data for the language model is just a plaintext file. The file should contain a list of utterances, one sentence per line. The sentences need not be complete or grammatically correct, as long as they accurately reflect what users would speak. There are some hard requirements, such as encoding and size limits, which you can read about in the documentation.

I have created a simple file for the sample, which you can find in the CRIS folder in the code. Note that I have added just a few sentences for example purposes. Feel free to extend it by adding more sentences. You can also add partial sentences or words which you think users would most likely speak, such as city names.
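For illustration, such a file is simply lines like the following (hypothetical content in the spirit of this rent-a-car bot, not the actual file in the repo):

rent a car from seattle airport
i want to pick up a car tomorrow at nine am
book a sedan for next monday
the pickup location is downtown bellevue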

Once we have the training data ready, we need to import it into CRIS. Go to the Language Data tab by clicking Menu -> Language Data and click Import New. Enter the data name and select the text file to upload.

Language Data Upload

Once the training data is uploaded, it is queued for processing. You can check its status by going to the Language Data tab (it should redirect automatically). Wait till its status is shown as Complete.

Next we need to create a Language Model. Go to the Language Model page and click Create New. Give a name and description to your model. There are two types of base model -

  • Microsoft Conversational Model is optimized for recognizing speech spoken in conversational style.

  • Microsoft Search and Dictation Model is appropriate for short commands, search queries etc.

We will use the Microsoft Conversational Model as the base model, since we expect our users to talk to our bot rather than give commands. Select the Language Data that we uploaded in the previous step. Once the form is filled in, click Submit.

Language Model Create

Similar to the previous step, the language model training takes some time. Wait till the status is Complete.

Once the model is successfully created, go to the Deployments page and create a new deployment. In the form presented, select Microsoft Conversational Model as the base model and select our trained language model. For the acoustic model, select the default base model that is shown.

Deployment Create

Once the deployment is complete, you will be redirected to the Deployment Information page, where you need to copy the URL specified under WebSocket for ShortPhrase Mode. We will need this URL later in the code.

Deployment Complete

Integrating CRIS with Bing Speech SDK

We will continue using the Bing Speech client library for STT, but instead of calling the Bing Speech API we will send the speech to our CRIS deployment. We only need to change how our DataRecognitionClient is created, as shown below.

 public string SubscriptionKey { get; } = "CRIS SubscriptionKey";
 public string CrisUri { get; } = "CRIS ShortPhrase Mode URL";
 public string AuthenticationUri { get; } = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";

 public void CreateDataRecoClient()
 {
     dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
         SpeechRecognitionMode.ShortPhrase,
         DefaultLocale,
         SubscriptionKey,
         SubscriptionKey,
         CrisUri);

     dataClient.AuthenticationUri = AuthenticationUri;

     dataClient.OnResponseReceived += OnDataShortPhraseResponseReceivedHandler;
 }

Paste the deployment URL for ShortPhrase mode, which we copied before, into CrisUri, and enter the CRIS subscription key in SubscriptionKey. We continue to use ShortPhrase mode as we did before; the only difference is that the speech is now sent to CRIS instead of the Bing Speech API. Leave the AuthenticationUri as it is; it does not need to be changed.

Apart from this, there are no other changes required. You can run the bot as-is and it will work. Check my previous post on how to run and test this bot. Do not forget to add a valid LUIS subscription key in RentLuisDialog and to change MicrosoftAppId, MicrosoftAppPassword and CallbackUrl appropriately.

Conclusion

I have been waiting for CRIS for a long time and it is finally available to use. It works much better than the Bing Speech API and looks really promising. However, I don’t think the Calling Bot is mature enough yet. It looks a little sketchy and the entire flow is not smooth, but it is still in preview, so let’s wait and watch. Meanwhile, try training the acoustic model and let me know in the comments how it worked out.

BusyBot - Chat Bot for Skype for Business

We use Skype for Business in our organization; it is a fairly common IM application in enterprises. The most common distraction while working is a Skype message popping up. And then it takes even more time to reply and finish the conversation, because not replying to colleagues is just rude. So I thought, why not create a bot that replies to the messages for me? Unfortunately, Microsoft Bot Framework does not support Skype for Business as one of its channels, so I had to find another way to make it work.

Skype for Business has a set of APIs called the Unified Communications Web API, which can enable us to integrate it with a bot; however, it is unnecessarily complicated (it requires 5 HTTP calls just to send 1 message). So after searching a bit, I found that the Lync 2013 SDK still works with Skype for Business (courtesy of my friend Om), and found an excellent starter code at Taha Amin’s GitHub repo BotConnectorSkypeForBusiness.

The Lync SDK is fairly straightforward to use. It is event-based and integrates easily with Bot Framework. The only limitation is that Lync 2013/Skype for Business must already be running. Using this, I created a simple bot that lets me work in peace. The source code is over here.

Dependency

The bot has an optional dependency on a Redis server. Since the bot will not be talking to the Microsoft Bot Connector in any way, we need to store the bot’s context somewhere ourselves. I had earlier used a locally running instance of Redis; however, I have now commented out RedisStore and used InMemoryStore. To use the Redis store, uncomment that region in Program.cs and comment out the InMemory region.
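For reference, swapping the backing store in Bot Builder v3 is just an Autofac registration; a sketch along these lines (the actual region in Program.cs may differ):

// Keep bot state in memory instead of calling out to the Bot State Service.
// To persist state, register a Redis-backed IBotDataStore<BotData> here instead.
Conversation.UpdateContainer(builder =>
{
    builder.Register(c => new InMemoryDataStore())
        .As<IBotDataStore<BotData>>()
        .SingleInstance();
});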

You would also need Skype For Business running and signed in.

Features

So what does the bot do as of now? It accepts the incoming IM and -

  • Responds to greetings - Hi, Hello, Good Morning etc.
  • If the person wants to call me or asks whether I am free - responds that I am busy and will talk later, and sets my status to Busy.
  • Ignores any other message - pretends I am busy.
  • Exception Filter - the bot does not reply at all if the sender is present in the Exception List. I don’t want to tell my manager that I am busy if he pings me. :)

How to use

The bot is just a console application. The bot service is not hosted as a Web API but runs within the console application. First create a new LUIS application by importing the model JSON from the LuisModel directory. Copy your LUIS model id and subscription key and paste them into the LuisModel attribute in LyncLuisDialog.cs.

The exception list is located in App.config in the console project. Values are ;-separated.

<add key="ManagerList" value="sip:name1@domain.com;sip:name2@domain.com"/>

Make sure your Skype for Business client is running and you are signed in, then just start the console project. Ask a friend to ping you and see what happens.

How it works

The Lync 2013 SDK is based on event-driven programming. We just subscribe to the right event - instantMessageModality.InstantMessageReceived += InstantMessageReceived; - and any message will come to our InstantMessageReceived method.

private void InstantMessageReceived(object sender, MessageSentEventArgs e)
{
    var text = e.Text.Replace(Environment.NewLine, string.Empty);
    var conversationService = new ConversationService((InstantMessageModality)sender);
    SendToBot(conversationService, text);
}

Once we get the message text, we bootstrap our bot and pass the text in as a properly formatted Activity message.

private async void SendToBot(ConversationService conversationService, string text)
{
    Activity activity = new Activity()
    {
        From = new ChannelAccount { Id = conversationService.ParticipantId, Name = conversationService.ParticipantName },
        Conversation = new ConversationAccount { Id = conversationService.ConversationId },
        Recipient = new ChannelAccount { Id = "Bot" },
        ServiceUrl = "https://skype.botframework.com",
        ChannelId = "skype",
    };

    activity.Text = text;

    using (var scope = Microsoft.Bot.Builder.Dialogs.Conversation
        .Container.BeginLifetimeScope(DialogModule.LifetimeScopeTag, builder => Configure(builder, conversationService)))
    {
        scope.Resolve<IMessageActivity>
            (TypedParameter.From((IMessageActivity)activity));
        DialogModule_MakeRoot.Register
            (scope, () => new Dialogs.LyncLuisDialog(scope.Resolve<PresenceService>()));
        var postToBot = scope.Resolve<IPostToBot>();
        await postToBot.PostAsync(activity, CancellationToken.None);
    }
}

The bot then follows the usual flow of sending the text to LUIS and determining the intent. Based on the context, it then sends its response to the BotToUserLync class, which implements IBotToUser. This allows us to catch the bot’s response and, instead of sending it to the Bot Connector, use the Lync SDK once again to reply to our counterpart.

The Exception Filter is managed in ManagerScorable, which implements IScorable<IActivity, double>. Scorables are a way to intercept the bot pipeline and branch off to other logic based on requirements. In our case, we check whether the incoming message was sent by anyone on the filter list, and if so, we simply do nothing. I may write another post discussing Scorables in more detail later; a sketch of such a scorable follows below.
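Here is a minimal sketch of how such a filter can be written against Bot Builder v3’s IScorable<Item, Score> contract. The real ManagerScorable in the repo may differ in its details; this just illustrates the five-stage scorable lifecycle.

public class ManagerScorable : IScorable<IActivity, double>
{
    private readonly HashSet<string> exceptionList;

    public ManagerScorable(IEnumerable<string> managerSipUris)
    {
        exceptionList = new HashSet<string>(managerSipUris, StringComparer.OrdinalIgnoreCase);
    }

    // Inspect the activity; returning non-null state means we want to handle it.
    public Task<object> PrepareAsync(IActivity activity, CancellationToken token)
    {
        var fromManager = exceptionList.Contains(activity.From?.Id ?? string.Empty);
        return Task.FromResult<object>(fromManager ? activity : null);
    }

    public bool HasScore(IActivity activity, object state) => state != null;

    // Max score outranks the dialog stack, so the normal reply is suppressed.
    public double GetScore(IActivity activity, object state) => 1.0;

    // Deliberately do nothing: ignoring the message is the whole point of the filter.
    public Task PostAsync(IActivity activity, object state, CancellationToken token)
        => Task.CompletedTask;

    public Task DoneAsync(IActivity activity, object state, CancellationToken token)
        => Task.CompletedTask;
}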

Conclusion

That’s it. It took me a day to get it all done. The bot is very rudimentary but gets the job done. I now no longer have to reply to every conversation while I am working. In any case, Skype for Business already saves all conversation history, so I can go over it once I am free. One day of work and a lifetime of peace. :)

I hope this article was helpful. If you have any questions, please post a comment below.

Skype Call your bot - Microsoft Bot Framework with Bing Speech

So over this past weekend, I was dead bored when I got the idea of calling a bot from Skype. The Skype bot calling feature does exist (in preview), but the available samples are only for a simple IVR bot. So I thought, why not integrate it with the Bot Builder SDK, so that the same bot can text and answer calls at the same time? The basic idea is that the bot should follow the same flow regardless of whether the user texts or calls. A great idea to pass the time; after some initial trouble, I did manage to get it done (not neatly though). So why not write a blog about it.
The source code is available over at my GitHub repo.

I developed this sample over a weekend just to find out whether it could be done. It is not a very cleanly written sample, and there is a design flaw due to which the bot cannot be scaled out. I will address this design issue and also explain how we can make it scalable. Nonetheless, I decided to write this blog because it provides a nice insight into Skype Calling Bots, and into how to intercept the responses from the bot by implementing the IBotToUser interface.

For this fun project, we will use the Bot Builder Calling SDK, which is now part of Bot Framework, to make a call to our bot. Once we get the audio stream of the caller, we will use the Bing Speech API for Speech-to-Text conversion. After we receive the text, we will just pass it to our bot, and our bot will behave the same way it does for text input. The trick here is to not let our bot reply to the user through the Bot Connector, but to intercept the reply messages and pass them on to the Skype Calling APIs, which handle Text-to-Speech conversion and speak back to the user. The plan is to utilize the feature-rich Bot Builder SDK to build our bot and plug the voice calling functionality in on top of it, without having to modify any of the bot logic.

Fair warning: this is going to be a long post. Before getting into details, I will give a brief overview of the Bot Builder Calling SDK and the Bing Speech SDK.

Bot Builder Calling SDK

The Bot Builder Calling SDK is just a nice client package for the Skype Calling APIs. I recommend you read through the API documentation to understand how calling works through Skype. I will explain it briefly here.

When a call arrives at the Skype Bot Platform, it notifies our bot by calling our bot’s HTTPS endpoint. Once we receive this notification, we (our bot) reply with a list of actions, called a workflow, to execute. So for the initial call notification, we can reply with either an Answer or a Reject action. The Skype Bot Platform then executes the workflow and returns us the result of the last action executed. The SDK raises an appropriate event which we can subscribe to, and we handle the action outcome in our code. For example, on the initial call notification, the SDK raises the OnIncomingCallReceived event. On each event, we have the opportunity to send another workflow to the user. In fact, it is mandatory to return a list of actions, otherwise the SDK throws an error and the call gets disconnected.
The image below, which I shamelessly copied from the official documentation, explains how Skype calling works.

Skype Call Flow

Bing Speech API

We will use the Bing Speech API for Speech-to-Text (STT) conversion. The Microsoft team has released a thin C# client library for STT conversion using the Bing Speech API. The samples are available on GitHub and the library is available over NuGet. For some reason they have different libraries for x64 and x86; make sure you use the correct one for your system.

The first thing to do is get a Speech API subscription key (free) by signing up for Cognitive Services. The STT SDK also works on an event-based model. In simple terms, the audio stream is sent to the Speech API, which returns the recognized text in the arguments of the OnResponseReceived event it raises. The Speech API also returns partially converted text, but in our case we ignore it.

Putting it all together

But before that, we need to develop a bot which works with text input. I have gone ahead and created a very simple bot using FormFlow which allows a user to rent a car. LUIS is used for text classification; it returns the intent as Rent and entities such as PickLocation and Pickup Date and Time. I then pass the entities to the FormFlow, which takes care of asking the appropriate questions and filling out the rest of the form. Simple.

Skype Call Flow

To integrate Skype calling, there are a few complications -

  • There are 2 layers of event-based models which need to be wired together - the Skype Calling SDK and the Bing Speech SDK.
  • The STT output needs to be supplied to the Bot Builder in the correct format. Our bot expects an IMessageActivity instance with properly filled-in IDs, such as Conversation ID, Channel ID etc., so that it can fetch the correct state from the State Service.
  • The response from the bot needs to be intercepted and somehow returned to the event-based model of the Skype SDK.

We will address each of them one by one.

To start with, we create a project from the Calling Bot Template. The template creates a very basic IVR bot with a CallingController, which receives requests from the Skype Bot Platform, and a SimpleCallingBot class implementing ICallingBot, which handles all the events raised by the Calling SDK. The template also has a MessageController with a default bot implementation.

Next, create a new class RentCarCallingBot implementing ICallingBot. I have used SimpleCallingBot as a reference, so the basic structure and a few methods are the same. In the constructor, we subscribe to the events that are raised when a workflow is completed.

public RentCarCallingBot(ICallingBotService callingBotService)
{
    if (callingBotService == null)
        throw new ArgumentNullException(nameof(callingBotService));

    this.CallingBotService = callingBotService;

    CallingBotService.OnIncomingCallReceived += OnIncomingCallReceived;
    CallingBotService.OnPlayPromptCompleted += OnPlayPromptCompleted;
    CallingBotService.OnRecordCompleted += OnRecordCompleted;
    CallingBotService.OnHangupCompleted += OnHangupCompleted;
}

We subscribe to only 4 events -

  • OnIncomingCallReceived: Fired when a call arrives at the Skype Bot Platform. This is the same event I explained earlier; here we can either accept or reject the call.
  • OnPlayPromptCompleted: Fired when a PlayPrompt action completes. The PlayPrompt action performs Text-to-Speech (TTS) conversion and plays the supplied text back to the caller. Once playback is complete, and if it was the last action in the workflow, this event is raised.
  • OnRecordCompleted: Similarly, raised when a Record action completes. The Record action lets us record the caller’s voice and gives us an audio stream. This is the primary way to receive the caller’s audio.
  • OnHangupCompleted: As the name suggests, raised when we hang up.

OnIncomingCallReceived

private Task OnIncomingCallReceived(IncomingCallEvent incomingCallEvent)
{
    var id = Guid.NewGuid().ToString();
    incomingCallEvent.ResultingWorkflow.Actions = new List<ActionBase>
        {
            new Answer { OperationId = id },
            GetRecordForText("Welcome! How can I help you?")
        };
    return Task.FromResult(true);
}

Upon receiving a call, we get an IncomingCallEvent object as the argument. To it, we add the next actions to be executed in the workflow. We add 2 actions - Answer and Record. We first answer the call and then start a Record action to get the caller’s input. Skype starts recording after speaking the welcome message. The recorded stream will be available to us in the OnRecordCompleted event.
A thing to note is that we must specify an OperationId for each action; it is used to correlate the outcome of the event.

OnRecordCompleted

private async Task OnRecordCompleted(RecordOutcomeEvent recordOutcomeEvent)
{
    if (recordOutcomeEvent.RecordOutcome.Outcome == Outcome.Success)
    {
        var record = await recordOutcomeEvent.RecordedContent;
        BingSpeech bs = 
            new BingSpeech(recordOutcomeEvent.ConversationResult, t => response.Add(t), s => sttFailed = s);
        bs.CreateDataRecoClient();
        bs.SendAudioHelper(record);
        recordOutcomeEvent.ResultingWorkflow.Actions = 
            new List<ActionBase>
            {
                GetSilencePrompt()
            };
    }
    else
    {
        if (silenceTimes > 1)
        {
            recordOutcomeEvent.ResultingWorkflow.Actions = 
                new List<ActionBase>
                {
                    GetPromptForText("Thank you for calling"),
                    new Hangup() 
                    { 
                        OperationId = Guid.NewGuid().ToString() 
                    }
                };
            recordOutcomeEvent.ResultingWorkflow.Links = null;
            silenceTimes = 0;
        }
        else
        {
            silenceTimes++;
            recordOutcomeEvent.ResultingWorkflow.Actions = 
                new List<ActionBase>
                {
                    GetRecordForText("I didn't catch that, would you kinly repeat?")
                };
        }
    }
}

There are three sections in this method. The first if block is executed when we have successfully recorded the caller’s voice. We get the recorded content and pass it to the BingSpeech class. Its constructor accepts 3 arguments: the ConversationResult and 2 delegates. The first delegate is used to add a string to the response property, which is a List<string>; the response list holds the messages the bot will send to the caller. The second delegate sets a flag if the STT conversion fails. In short, this class calls the Bing Speech API and, upon receiving the STT output, passes it on to our bot.
We then add a PlayPrompt action which just keeps silent for a specified period of time. This is required because we do not have the result from the bot immediately, as we will see later.

If we do not receive a successful recording, we give the caller a chance to speak once more. If the recording fails again, we disconnect the call gracefully. The silenceTimes counter is used for this purpose.

OnPlayPromptCompleted

private Task OnPlayPromptCompleted(PlayPromptOutcomeEvent playPromptOutcomeEvent)
{
    if (response.Count > 0)
    {
        silenceTimes = 0;
        var actionList = new List<ActionBase>();
        actionList.Add(GetPromptForText(response));
        actionList.Add(GetRecordForText(string.Empty));
        playPromptOutcomeEvent.ResultingWorkflow.Actions = actionList;
        response.Clear();
    }
    else
    {
        if (sttFailed)
        {
            playPromptOutcomeEvent.ResultingWorkflow.Actions = 
                new List<ActionBase>
                {
                    GetRecordForText("I didn't catch that, would you kindly repeat?")
                };
            sttFailed = false;
            silenceTimes = 0;
        }
        else if (silenceTimes > 2)
        {
            playPromptOutcomeEvent.ResultingWorkflow.Actions = 
                new List<ActionBase>
                {
                    GetPromptForText("Something went wrong. Call again later."),
                    new Hangup() 
                    { 
                        OperationId = Guid.NewGuid().ToString() 
                    }
                };
            playPromptOutcomeEvent.ResultingWorkflow.Links = null;
            silenceTimes = 0;
        }
        else
        {
            silenceTimes++;
            playPromptOutcomeEvent.ResultingWorkflow.Actions = 
                new List<ActionBase>
                {
                    GetSilencePrompt(2000)
                };
        }
    }
    return Task.CompletedTask;
}

The first time this event is raised is when we have recorded the user’s input and passed it to the BingSpeech class. At this point we may or may not have output from the bot itself. Any output (reply) from the bot is added to the response list; the response field contains the List<string> of messages the bot has returned for the user. If response is not empty, we build a PlayPrompt action for the responses and add it to the workflow, followed by a Record action to capture the next input from the caller.
If the response is empty, it can mean one of two things: either the STT conversion failed, or the bot has not yet finished processing the earlier input. If the STT conversion failed, we play a prompt asking the caller to repeat, and start recording again. If the bot has not yet processed the previous input, we start another silence prompt. We maintain a counter of how many times we ended up waiting for the bot to finish; if it exceeds a threshold, we gracefully hang up.

OnHangupCompleted

private Task OnHangupCompleted(HangupOutcomeEvent hangupOutcomeEvent)
{
    hangupOutcomeEvent.ResultingWorkflow = null;
    return Task.FromResult(true);
}

Self-explanatory: just set the workflow to null and return.


Intercepting Bot response

Microsoft Bot Framework does not return the reply in-line with the HTTP request. Instead, it sends a separate HTTP request to the Bot Connector with the reply message. We can intercept this flow by implementing the IBotToUser interface. The default implementation, which sends the message to the Bot Connector, is called AlwaysSendDirect_BotToUser. We create a class BotToUserSpeech implementing this interface.

public BotToUserSpeech(IMessageActivity toBot, Action<string> _callback)
{
    SetField.NotNull(out this.toBot, nameof(toBot), toBot);
    this._callback = _callback;
}

public IMessageActivity MakeMessage()
{
    return this.toBot;
}

public async Task PostAsync(IMessageActivity message, CancellationToken cancellationToken = default(CancellationToken))
{
    _callback(message.Text);
    if (message.Attachments?.Count > 0)
        _callback(ButtonsToText(message.Attachments));
}

The constructor takes two parameters: an IMessageActivity, and a delegate to return the response to. This is the same delegate that was passed to the BingSpeech class constructor in the OnRecordCompleted event; it just adds the string to the response field. We need to implement just two methods: MakeMessage and PostAsync. In MakeMessage we simply return the IMessageActivity object we received in the constructor. In PostAsync, we call the _callback delegate with the message’s Text field. If the message has any attachments, we convert the buttons and cards in the attachments to a plain string, which is then passed to _callback. This ensures that buttons normally displayed in a chat window are converted to simple text, so the caller still hears all the options.

Once we have this class ready, we just need to wire it up in the dependency container, which we do in the BingSpeech class.

BingSpeech

This class performs three tasks (talk about SRP!). First, it receives the audio stream and sends it in chunks to the Bing Speech API. Second, it receives the event raised once Bing Speech completes the STT conversion. Third, it takes the STT output and sends it to our RentACar bot; for this step it must set up the required dependencies and get instances through the container to pass the message in the correct format. Let’s step through each of these one by one. But before that, put the Bing Speech API subscription key in the SubscriptionKey property.

string SubscriptionKey { get; } = "Bing Speech subscription key";

Perform Speech-To-Text

public void CreateDataRecoClient()
{
    this.dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
        SpeechRecognitionMode.ShortPhrase,
        this.DefaultLocale,
        this.SubscriptionKey);

    this.dataClient.OnResponseReceived += this.OnDataShortPhraseResponseReceivedHandler;
}

First we ask SpeechRecognitionServiceFactory to give us a DataClient instance. SpeechRecognitionServiceFactory can give us 4 types of clients -

  • MicrophoneClient: Used to get audio stream by using device’s microphone and then perform STT conversion.
  • DataClient: No microphone support. You can use it to pass audio from Stream.
  • MicrophoneClientWithIntent: Same functionality as MicrophoneClient. Additionally it will also send the text to LUIS and return LUIS entities and intents along with the text.
  • DataClientWithIntent: Same as DataClient. Additionally it too will send the STT result to LUIS to perform intent and entity detection.

In our scenario we already receive the voice stream from Skype, and the NLP part will be done by our bot, so DataClient works for us. Next we subscribe to the OnResponseReceived event, which is raised once Bing Speech has finished STT processing for the complete stream.

public void SendAudioHelper(Stream recordedStream)
{
    int bytesRead = 0;
    byte[] buffer = new byte[1024];
    try
    {
        do
        {
            // Read more audio data into the byte buffer.
            bytesRead = recordedStream.Read(buffer, 0, buffer.Length);

            // Send the audio data to the service.
            this.dataClient.SendAudio(buffer, bytesRead);
        }
        while (bytesRead > 0);
    }
    catch (Exception ex)
    {
        WriteLine("Exception ------------ " + ex.Message);
    }
    finally
    {
        // We are done sending audio.  Final recognition results will arrive in OnResponseReceived event call.
        this.dataClient.EndAudio();
    }
}

SendAudioHelper will use dataClient to send the audio stream to Bing Speech Service. Once the entire stream is processed, the result will be available in OnResponseReceived event.

private async void OnDataShortPhraseResponseReceivedHandler(object sender, SpeechResponseEventArgs e)
{
    if (e.PhraseResponse.RecognitionStatus == RecognitionStatus.RecognitionSuccess)
    {
        await SendToBot(e.PhraseResponse.Results
                    .OrderByDescending(k => k.Confidence)
                    .FirstOrDefault());
    }
    else
    {
        _failedCallback(true);
    }
}

If the STT conversion is successful, we pick the result with the highest confidence score and send it to our bot. Otherwise, we invoke the failure callback, which sets a flag to true. This flag is then checked back in OnPlayPromptCompleted to either proceed or ask the caller to speak again.
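
For context, here is a sketch of how the two delegates might be wired up when the recording completes. The BingSpeech constructor shape shown here is an assumption based on the description above, not the repo’s exact signature.

// In OnRecordCompleted (sketch): hand the recorded audio and both delegates to BingSpeech.
var speech = new BingSpeech(
    conversationResult,
    text => response.Add(text),    // bot replies accumulate in the response list
    failed => sttFailed = failed); // checked later in OnPlayPromptCompleted
speech.CreateDataRecoClient();
speech.SendAudioHelper(await recordOutcomeEvent.RecordedContent);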

Send to bot

The next challenge is to take the STT result and construct a valid Activity instance. Why? Because everything in Bot Builder depends on a proper instance of IMessageActivity. Moreover, we need to wire in our BotToUserSpeech class. We do this by registering it in the Autofac ContainerBuilder while starting a new LifetimeScope.

private async Task SendToBot(RecognizedPhrase recognizedPhrase)
{
    Activity activity = new Activity()
    {
        From = new ChannelAccount { Id = conversationResult.Id },
        Conversation = new ConversationAccount 
        { 
            Id = conversationResult.Id 
        },
        Recipient = new ChannelAccount { Id = "Bot" },
        ServiceUrl = "https://skype.botframework.com",
        ChannelId = "skype",
    };

    activity.Text = recognizedPhrase.DisplayText;

    using (var scope = Microsoft.Bot.Builder.Dialogs.Conversation
      .Container.BeginLifetimeScope(DialogModule.LifetimeScopeTag, Configure))
    {
        // Resolving with TypedParameter caches the activity in this scope,
        // so BotToUserSpeech can later get it via c.Resolve<IMessageActivity>().
        scope.Resolve<IMessageActivity>
            (TypedParameter.From((IMessageActivity)activity));
        DialogModule_MakeRoot.Register
            (scope, () => new Dialogs.RentLuisDialog());
        var postToBot = scope.Resolve<IPostToBot>();
        await postToBot.PostAsync(activity, CancellationToken.None);
    }
}

private void Configure(ContainerBuilder builder)
{
    builder.Register(c => 
        new BotToUserSpeech(c.Resolve<IMessageActivity>(), _callback))
        .As<IBotToUser>()
        .InstancePerLifetimeScope();
}

The Activity instance must be valid. At least the three IDs need to be specified to get and set context from the state service. For the IDs, we can use ConversationResult.Id, which is unique for each conversation. There is also ConversationResult.AppId, which is the AppId of the caller, but for some reason it was always null for me. Along with these, ServiceUrl and ChannelId must also be correct, otherwise the bot will throw an exception. Once we have a valid Activity instance, we assign our STT output to its Text property.

To send this Activity instance to the bot, we resolve an instance of IPostToBot and call its PostAsync method, passing the Activity. This kick-starts our bot, deserializes the dialog state, and resumes (or starts) the conversation. This is the exact flow which happens when we call Conversation.SendAsync from MessagesController.
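
For comparison, this is the familiar entry point that triggers the same pipeline for regular text channels (assuming the same RentLuisDialog root dialog):

// Inside MessagesController.Post (standard Bot Builder v3 template code).
await Conversation.SendAsync(activity, () => new Dialogs.RentLuisDialog());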

CallingController

Finally, in CallingController, register the calling bot by passing an instance of RentCarCallingBot in the constructor.

public CallingController()
    : base()
{
    CallingConversation.RegisterCallingBot(c => new RentCarCallingBot(c));
}

Scalability Problems

The response from the bot will eventually arrive at our BotToUserSpeech class, which passes the response text to our delegate, which in turn adds it to a list maintained in RentCarCallingBot. The list is then monitored when the workflow finishes and the Skype API sends us a callback with the result. We have put everything together in such a way that once the bot finishes recording the user, it plays a silent prompt and monitors the response list.

This is where the problem lies. Our BotToUserSpeech class captures the response and adds it to an in-memory list. However, in a scenario where we have scaled out and have multiple bot services running behind a load balancer, there is no telling which instance the next callback from the Skype API is going to land on. Our current implementation locks us to a single bot service and prevents us from scaling.

We can resolve this issue by changing our implementation of the BotToUserSpeech class. For example, instead of passing the response to a delegate, we can push it into a queue such as RabbitMQ. In the OnPlayPromptCompleted method, we can then check whether there are any messages in the queue to play to the user. We must also take care of posting a message to the queue when STT fails. In short, both of our delegates need to be replaced by an out-of-process store which can be accessed by multiple running services. Since RabbitMQ (or any other queue) can be monitored by multiple bot services, it solves our scalability issue.
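
As an illustration, here is a minimal sketch of the queue-based variant using the RabbitMQ .NET client. The BotToUserQueue class and the per-conversation queue naming are hypothetical:

using System.Text;
using RabbitMQ.Client;

public class BotToUserQueue
{
    private readonly string conversationId;

    public BotToUserQueue(string conversationId)
    {
        this.conversationId = conversationId;
    }

    // Publish the bot's reply to a per-conversation queue instead of an
    // in-memory list, so any bot instance behind the load balancer can read it.
    public void Publish(string text)
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            var queue = "bot-replies-" + conversationId; // hypothetical naming
            channel.QueueDeclare(queue: queue, durable: false, exclusive: false,
                                 autoDelete: false, arguments: null);
            channel.BasicPublish(exchange: "", routingKey: queue,
                                 basicProperties: null,
                                 body: Encoding.UTF8.GetBytes(text));
        }
    }
}

OnPlayPromptCompleted (and the STT failure path) would then do a BasicGet on the same queue instead of inspecting the in-memory list, so whichever instance receives the Skype callback can pick up the reply.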

Testing our bot

We can test our bot locally using ngrok. Ngrok creates a secure tunnel to localhost and gives us a public URL; any call to the public URL is forwarded to a service running on localhost on our system.
We create a tunnel with ngrok http --host-header=localhost 3999 to forward requests to localhost:3999, where we will host our bot. Ngrok will then generate a random public URL. Note the https forwarding URL.

ngrok

The first place we need to change is the web.config: replace the value of the CallbackUrl key with the ngrok URL. It should look like https://<ngrokURL>/api/calling/callback.
Once we have registered our bot at the Bot Framework Portal, click Edit on the Skype channel. We must enable the 1:1 audio calls feature. In the Calling Webhook text box, enter the ngrok URL in the format https://<ngrokURL>/api/calling/call.

registration

That’s it. Run the bot locally on the same port that ngrok is tunneling, and we can start making calls to our bot. Try saying “Rent a car from London”.

Conclusion

With the evolution of UI and the advent of bots, it is only natural that the next logical step is to voice-call the bot. Imagine just calling your self-driving car to pick you up from your current location. I can’t wait to live in such a world. The bot ecosystem is pretty new, and the voice-call-to-bot feature itself is just emerging. The current platform is nowhere near production-ready: it’s messy, and STT is not accurate, especially for non-US accents. We may see drastic improvements in the Bing Speech service, and the new CRIS service looks promising. And with Microsoft achieving human parity in speech recognition, this dream may not be too far away.