Sam Charrington: [00:00:00] Welcome to the TWIML AI podcast. I’m your host, Sam Charrington. Hey, what’s up, everyone. Before we jump into today’s interview, I’d like to give a huge thanks to our friends at Microsoft for their continued support of the podcast. Microsoft’s mission is to empower every single person on the planet to achieve more, to inspire customers to reimagine their businesses and the world. Learn more at [inaudible]. And now, onto the show. All right, everyone. I am here with David Carmona. David is the general manager of artificial intelligence and innovation at Microsoft. David, welcome to the TWIML AI podcast.

David Carmona: [00:01:01] Thank you, Sam. Pleasure to be here with you.

Sam Charrington: [00:01:04] It is great to have you on the show. And I’m looking forward to digging into our conversation, which will focus on AI at scale and large scale language models, and a bunch of really interesting things you’re doing there. Before we jump into the topic, though, I’d love to have you share a little bit about your background and how you came to work on all this cool stuff.

David Carmona: [00:01:25] Yeah. Well, I’ve been in Microsoft for almost 20 years, 19 and a half.

Sam Charrington: [00:01:30] Wow.

David Carmona: [00:01:30] So, almost getting to that magical [laughs], magical moment. And it’s funny because my beginning with Microsoft, I was [inaudible 00:01:37] to Microsoft. That was 20 years ago. So, that was the big Windows moment. Right? But actually, I didn’t come to Microsoft because of Windows. I came to Microsoft because of, … At that time, my favorite product, which was Visual Studio. So, I was a developer. I still am a developer. I will always be a developer no matter what I am.

Sam Charrington: [00:01:57] [laughs].

David Carmona: [00:01:58] And for me, working in Visual Studio has been like my entire career. So, [inaudible 00:02:04] I started with AI and, and VR probably way too early [laughs]. That didn’t end well. So, I ended up in traditional development. And I had a ton of fun with that. And I, when I moved … I’m originally from Spain. When I moved here to the US [inaudible 00:02:17], I worked in, in, in Visual Studio. So, I ended up managing the business for Visual Studio and all our tools like .NET and, and all of that. It was a super fun time because it was that big transition in Microsoft to open development. So, I was lucky to do things like launching TypeScript. Right? Or-

Sam Charrington: [00:02:36] Oh, wow.

David Carmona: [00:02:36] … open-sourcing .NET or making it cross-platform, or releasing Visual Studio code. Right? So, super fun stuff. But then like five years ago, this AI thing started to become super real. So, [laughs] I was, I was offered to lead a new team in Microsoft, focused on the business, on creating a new business for AI. And I, I didn’t think about it twice. So, yeah, that’s where I am. So, it’s interesting … So, as you can see, my career is always like, between technology and businesses. I think … I, I mean, knock on wood, but I think I’m in, in that great balance right now [laughs]. So, I have both. I’m super fortunate to have both because I work, connecting with Microsoft research and, and the entire organization of technology and research in, Microsoft. My goal, my team’s goal is really to connect that with the business. So, we work on … We define it as themes, like bigger themes of innovation in Microsoft.

And then we connect those themes to actual real products and technologies that we can take to market. It’s super cool. And one of those things … We have many, but one of them … I think like, probably the start of the themes is, is AI at scale.

Sam Charrington: [00:03:46] Okay. And so is the role primarily focused on taking innovations that are happening in research to existing Microsoft products? Or is it more focused on creating new business opportunities? Or is there some balance between the two?

David Carmona: [00:04:01] Yeah. It’s a balance. So, we have … The way that we work in Microsoft, our framework for innovation, is based on Horizons. So, we have … We refer to them as the three [inaudible 00:04:10] Horizons. Right? So, we have Horizons 1, 2, and 3. Horizon 3 are the, like, the moonshots, right? Like, longer-term new business creation, new category creation for Microsoft. A lot of that is driven by curiosity, in most cases, in research. So, we leave a lot of room for researchers to work on those themes. But then we go all the way to Horizon 2, which are things that are really about opening new opportunities or creating new opportunities for existing products. And you can go to Horizon 1 even, which is extending existing products. Right? So, making them better. So, we work in that, in that balance between the three.

Sam Charrington: [00:04:52] Nice. And so you mentioned AI at scale as being one of your big focus areas. What exactly does that mean at Microsoft?

David Carmona: [00:05:00] Yeah. So, AI at scale, I mean, we, we named that as a new category. So, it’s not that it’s a product or anything like that. So, it’s how we refer to what we believe is a huge change in the way that we are going to see people developing AI. And it’s driven by m- many different things, many different trends and technology breakthroughs. But I think the most important one is this concept of massive models and, and what they mean. Right? So, this, this ability to create now, like, this huge [laughs], massive models with billions of, of parameters. And beyond the technical achievement, the reality is that those massive models are opening new opportunities that go beyond the technology and get into the business. Right? So, we can discuss it today. So, [inaudible 00:05:47] … So, we can spend a lot of time on the technology behind it. And then-

Sam Charrington: [00:05:47] Mm-hmm [affirmative].

David Carmona: [00:05:47] … we can, we can focus a little bit on, “Hey, but what does it really mean?” So, how is this going to change the way that any company can develop AI? Right? And, and [inaudible 00:05:59] it’s really interesting. And then there’s a whole ecosystem around this concept like, that, that you need to, for example, train these models, you need an AI supercomputer. So, that’s another piece of the puzzle, right, for AI at scale.

Sam Charrington: [00:06:14] So, we talk a lot about the increasing size of models and, you know, particularly in the context of NLP and language models. But help us contextualize that. You know, we throw around, you know, millions of parameters and, you know, hundreds of layers, and things like that. How is it shaking out? Or how do you think of this progression towards larger-size models?

David Carmona: [00:06:41] Yeah. I think in, in a sense, you probably remember [laughs] [inaudible 00:06:45] ImageNet moment for, [laughs]-

Sam Charrington: [00:06:46] [laughs].

David Carmona: [00:06:47] … for [inaudible 00:06:48] learning. Right? So eh-

Sam Charrington: [00:06:49] Uh-huh [affirmative].

David Carmona: [00:06:49] That was, … I mean, [inaudible 00:06:51] many people referring to this moment, like the ImageNet moment for NLP. Right? So, because we get to a point that there’s something that allows us to increase the size of the model. So, we go for it. And then we see, “Hey, wait a second. This is getting better. So, the more parameters that I add, the better that this is getting.” Right? So, that was the moment in ImageNet with ResNet, for example. Right? That we added so many layers, and, “Hey, this, this image classifier is, is working so much better.” So, we are kind of in the same place, but at a totally different scale, right, or order of magnitude. Right? For example, that model, the ResNet model for ImageNet, I think had like 60 million parameters. I mean, a completely different domain.

That was computer vision. Now, we’re talking about billions of parameters. And, and, and when we see the progression, it’s been like, very [laughs], very quick. So, [crosstalk 00:07:44]-

Sam Charrington: [00:07:46] Mm-hmm [affirmative].

David Carmona: [00:07:46] I don’t know. GPT. So, the first version was like 100 million parameters. Then, I think BERT was like 300 million. Then you have Turing NLR. I think it, at that time, was like 1.2 billion. Then you have GPT-2, 1.5. Then you have Turing NLG. That was 17 billion parameters. That was last year [laughs]. We’re not talking months ago. That, … We’re not talking about, about years ago. And then we had just, just a couple of months after that, GPT-3 with 175 billion [laughs] parameters. Right? So-

Sam Charrington: [00:08:18] Yeah.

David Carmona: [00:08:18] Every step is 10 times [laughs] [inaudible 00:08:21]. It’s a new order of magnitude [crosstalk 00:08:22]-

Sam Charrington: [00:08:22] Mm-hmm [affirmative].

David Carmona: [00:08:22] … which is super impressive [laughs].

Sam Charrington: [00:08:24] So, we’ve kind of transitioned from … In the domain of Vision, you know, we would always talk about the number of layers as an indication of the size and complexity of the model. And now, when we talk about these language models, we tend to talk about parameters. What is that? And how does that tie to the architecture of these models?

David Carmona: [00:08:45] Yeah. I mean, behind … It’s not that we didn’t want to build these massive models before. It’s that we couldn’t [laughs]. That’s the reality.

Sam Charrington: [00:08:52] Mm-hmm [affirmative].

David Carmona: [00:08:52] And I think the big breakthrough to really enable these, these sizes of model is the transformer architecture. And yeah, there’s definitely a lot to say about that. But, yeah, the transformer architecture … I mean, it’s also based on layers. In this case, they are symmetric. So, it scales very well because it always has the same number of inputs and outputs. So, you can stack up all the layers. And, and it was a huge change because it removed the blocker that we had before with scaling these NLP models, which is that we were using techniques, as, as you know, like recurrent neural networks. Right? Like, LSTMs and things like those. And those things are great because they allow you to connect, for example, in a text, the words with other words. You can have some kind of memory.

So, a word right now can be impacted by words in the text before. Right? And, and you keep that memory. The problem is that the way that we were doing that was very sequential. So, and I mean, by definition, a recurrent neural network takes the previous step as an input. So, you need to finish that step to go to the next one. So, that impacted the scalability of the models. So, I think with the transformer architecture, we kind of broke that ceiling because now, suddenly, we don’t have an architecture that is [inaudible 00:10:05]. So now, in this case, it’s all in parallel. We take the, all the inputs in parallel, and with some techniques in particular … I think the most important ones [inaudible 00:10:16] I would highlight two. But definitely, for that to work, two things have to happen. One, it’s the concept of the positional embedding, so how every word needs to get, as an input in the, in the model, the position somehow, a flag or an indication of where that word is, because that’s [laughs], of course, important [laughs]. It’s very important-

Sam Charrington: [00:10:36] Mm-hmm [affirmative].

David Carmona: [00:10:37] … Where a word is in a sentence to understand the sentence. But then the second thing is this concept of attention or, in this case, self attention, which is a way to kind of replicate that concept of connecting or changing the meaning of words, depending on the words that were happening before, or even in the case of bidirectional [inaudible 00:10:56] words are happening after that. Right? And that’s, that’s a whole new construct applied to NLP that is proving to be, not only super scalable, but even, performing even better [inaudible 00:11:08] the traditional approach to NLP.
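The positional embedding idea David describes, giving each word a signal for where it sits in the sequence, can be sketched with the sinusoidal scheme from the original transformer paper. This is the standard textbook formulation, not necessarily the exact one the Turing models use:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings ("Attention Is All You Need").

    Each position gets a unique vector of sines and cosines at
    different frequencies; adding it to the token embedding tells
    the model where each word is, since self-attention alone has
    no notion of order.
    """
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])          # even dims: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])          # odd dims: cosine
    return enc

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): one distinct vector per position
```

Because every position maps to a distinct vector, two sentences with the same words in different orders produce different inputs to the model.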

Sam Charrington: [00:10:43] Hmm. And so how should we think about how attention works in these kinds of models?

David Carmona: [00:10:43] So, I, I, I mean, it’s a very simplistic view, but I like to think of it … Because attention is not new. So, we’ve been using attention-

Sam Charrington: [00:10:44] Mm-hmm [affirmative].

David Carmona: [00:11:23] … in, in others … Even in other domains. Right? Like, vision or i- image generation, or … I mean, the most simple example that I use all the time is movie recommendation. Right? So, how do you know if, if a user is gonna like a movie or not? So, the way that you do that is that you take a vector defining the movie in, you know, in any dimensional space. And then you take another vector defining the taste of the user. And then you multiply those vectors, right, to get the distance, the, like, the cosine distance or similarity between those two vectors. And that’s an indication of how much the user will like the movie. That’s, that’s attention, but in that case, between two different entities. Right? My taste and the movie. In this case, self attention is like doing something similar, but with a sentence with itself or with a text with itself. Right? So, but in this case, the w- the attention that we want to measure is the connection between the words. So, how one word is related or connected to the rest of the words.

And at the end, you’re gonna have like, a heat map, right, so, where every word is connected in some manner with other words. So, if you’re saying, “The kid hit the ball, and he was happy,” “he” will be super connected with “the kid.” Right? So, I mean, super simple because at the end, you have multi [inaudible 00:12:42] attention blocks. And, and then you have all these different layers. It’s like trying to understand [inaudible 00:12:49] networks. After three layers, you’re lost [laughs]. You are completely lost on [crosstalk 00:12:53].

Sam Charrington: [00:12:53] [laughs].

David Carmona: [00:12:53] But I mean, that’s the core principle of it.
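David’s dot-product intuition translates almost directly into code. Here is a minimal single-head self-attention sketch; for clarity it omits the learned query/key/value projections a real transformer layer applies before the dot products:

```python
import numpy as np

def self_attention(X):
    """Toy single-head self-attention over word vectors X (seq_len, d).

    Scores every word against every other word via dot products
    (the "heat map" of word-to-word connections), softmaxes each
    row into weights, and returns a weighted mix of the vectors.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # word-vs-word similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row softmax
    return weights @ X, weights

X = np.random.default_rng(0).normal(size=(5, 8))   # 5 "words", 8-dim each
out, attn = self_attention(X)
print(attn.shape)   # (5, 5): each row is one word's attention over all words
```

The `attn` matrix is exactly the heat map David mentions: row i shows how strongly word i attends to every word in the sentence, and each row sums to 1.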

Sam Charrington: [00:12:56] Mm-hmm [affirmative]. Part of what’s interesting here is that, you know, we’ve transitioned from an approach to NLP that was, like you mentioned … Prior to capturing positionality, you know, we’d take a bag of words at the document level, which didn’t capture where those words were, didn’t really do a good job of capturing the relationships, but was just looking at the statistical properties of a document or sentence or-

David Carmona: [00:13:22] Yeah.

Sam Charrington: [00:13:23] … corpus to now looking at the relationships between all of these entities that make up language. Is that part of the power of this [crosstalk 00:13:31]?

David Carmona: [00:13:32] Yeah. Yeah. E- exactly. I would say that and then the concept of, of training these models with self supervised algorithms. Right? So-

Sam Charrington: [00:13:42] Mm-hmm [affirmative].

David Carmona: [00:13:42] [inaudible 00:13:43] supervised training. I think that’s the other thing that, that … that was the explosion in all these models, is how now … Because this scales amazingly well, now, you can afford training these things with huge amounts of data. Like, for example, the entire internet [inaudible 00:14:00] kind of. Right? Which is kind of what we’re doing with this model. So, we take the text on the internet. And then depending on the model, we can go in, in a little more detail in there, if it’s a [inaudible 00:14:10] model or representation model. With smart techniques, you take that. You take … You mask that text, so the, so the model can try to guess either the missing words or the words that are happening after a given text. And by training that with that input, that you are almost not touching at all. Right? So, it’s all self supervised, [inaudible 00:14:31] and, and all of that. The model can actually learn very complex concepts and relationships.
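The masking step David describes, where the raw text supplies its own training labels, can be sketched like this. It is a simplified version of BERT-style masking (real pipelines also replace some masked positions with random or unchanged tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Build a self-supervised training example by hiding words.

    Randomly replaces a fraction of tokens with [MASK]; the model's
    objective is to recover the originals, so no human labeling is
    needed -- the text itself is the supervision.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)      # target: the hidden word
        else:
            masked.append(tok)
            labels.append(None)     # nothing to predict here
    return masked, labels

text = "the model can learn complex concepts from raw text".split()
masked, labels = mask_tokens(text, mask_prob=0.3)
print(masked)
```

Every document on the internet can be turned into training pairs this way, which is why the approach scales to such enormous corpora.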

Sam Charrington: [00:14:37] Mm-hmm [affirmative]. You mentioned different types of models. Elaborate on that a bit.

David Carmona: [00:14:41] Yeah. So, I think, the way that … And, and we can talk more about that because at the end, these same concepts can apply beyond NLP. But if we focus just on NLP, these are the main families of models. One, I think people are super excited about also because of Turing NLG and because of GPT-3. Those models are generation models. So, they are natural language generation models, so NLG. And in that case, the way that that model is trained … They are called autoregressive models because you train the model with a, a lot of text. But then you train it to guess what is gonna happen, what text goes after a particular text. Right? So, they generate … They are super good generating text, like guessing the end of a sentence or guessing an entire document, or guessing how a movie will, will end, or whatever [laughs] we want to, to guess or [inaudible 00:15:37] text, things, things like those. And that’s one big family of models.

You have em … Again, like, GPT-3 is an example of that. Turing NLG is an example of that. And then you have another family, which is more about representation, so natural language representation models. And the goal of those is more like, representing the text. So, in that case, the architecture that is, that is used … Or the way that it’s trained. Instead of trying to guess what’s next, what we do is that you mask some words in the text. And then the model will try to guess them. And they are called bidirectional because in that case, not only do they look at what happened before a certain moment, but also after that. So, they will look at the words before and after a particular word to understand the context there. Right? So, those are really good to map like, text to a representation, then I fine tune to do whatever I want. Right? So, from super basic sentiment analysis to question answering, or whatever I want to fine tune the model for. So, those are like, the two big blocks.
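The difference between the two families comes down to what each position is allowed to attend to, which can be shown as two attention masks. The function names here are illustrative, not any particular library’s API:

```python
import numpy as np

def causal_mask(n):
    """Autoregressive (GPT-style) mask: position i may attend only
    to positions <= i, so the model never peeks at future words and
    can be used to generate text left to right."""
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    """Representation (BERT-style) mask: every position attends to
    every other, before and after, which is what makes the learned
    representations bidirectional."""
    return np.ones((n, n), dtype=bool)

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Generation models like GPT-3 and Turing NLG train under the triangular mask; representation models like BERT and Turing NLR train with the full mask plus the [MASK]-and-guess objective.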

Then I like to go a little bit deeper ’cause for each of them, there are two other families that I think are very relevant to understand, which is how … So, there’s more than one language in the world [laughs]. Right? So-

Sam Charrington: [00:16:58] [crosstalk 00:16:59].

David Carmona: [00:16:59] You need to address that. Right? So, in particular, where you are creating real products. So, we are using these models in, in Office, for example. Office is working [inaudible 00:17:07], I feel like, 100 languages. So, imagine doing this for every language would be very [crosstalk 00:17:13].

Sam Charrington: [00:17:13] Mm-hmm [affirmative].

David Carmona: [00:17:13] And that would be the traditional approach of, of doing this. So, we … And, and Microsoft has been a big believer in the need of doing this thing in a universal way. So, that creates a new family of models that are universal models, right, universal language models. And in the case of Turing, for example, we have both. We have a regular model. And then we have the universal language representation, ULR, so T, Turing ULR, universal language representation. And that is super powerful ’cause what it allows us, for example, in, in Microsoft, is to implement features in Word using this, like, … I don’t know. Em, semantic search. We don’t need to train that feature or that model for every language. We just need to fine tune it for one language. And then you have the feature for free in 100 languages. Right?

Sam Charrington: [00:18:03] [crosstalk 00:18:04].

David Carmona: [00:18:03] Which is super cool. So, I very, very much recommend using those models for that. Th- this was, by the way, for people who want to go deeper … There’s a paper that I like a lot, from [inaudible 00:18:14] 2017, where it explains this, this concept. And the example that it uses is how you learn math. Right? So, you look at … Well, not me. I wouldn’t consider me bilingual. I speak Spanish and a little bit of English, but [laughs] my kids are truly bilingual. And when they learn math, they don’t need to learn that two plus two equals four in English, but then [Spanish 00:18:39] in Spanish. Right? So, they just need to learn math once. And then-

Sam Charrington: [00:18:43] [crosstalk 00:18:44].

David Carmona: [00:18:43] … they can apply that in different languages. So-

Sam Charrington: [00:18:46] Mm.

David Carmona: [00:18:46] It’s the same thing for models. So you can focus on teaching or training the core concepts, fine tuning for the concept. And then you have it for free in all the languages.

Sam Charrington: [00:18:56] Mm-hmm [affirmative]. Yeah. [inaudible 00:18:57] I wanna dig into transfer learning and multitask. These are all things that are coming to mind as you’re explaining this. But before we do that, we started out talking about language models as an example of these massive models that require a new way of thinking about, you know, AI at scale. And you mentioned, you know, the progression of the sizes of these models … And you know, it’s 10X each time. GPT-3 is, you know, 10X Turing. And one question that occurs to me is, you know, is size the, you know, the most important or the only factor? You know, does it mean that each time we jump a generation, you know, “Let’s just forget about the, you know … We shouldn’t be using Turing anymore. Let’s just use GPT-3 because it’s 10X better.” I think, you know, there are some obvious reasons why that might be the case, like if they’re trained on, on different corpuses.

Like, we know that GPT-3 has kind of a very broad public internet. And at least with GPT-2, like, there was a lot of critique about, you know, Reddit, you know, and, and the biases that get introduced there. So, the training set is going to be an obvious differentiator that separates from the size. But I’m wondering if there are other things that we need to be thinking about beyond just the size of the model.

David Carmona: [00:20:24] Yeah. Yeah. No, you are right. And I think … So, it’s a very simplistic thing to just discuss the models of … Or the parameters of a, of a model. [crosstalk 00:20:35].

Sam Charrington: [00:20:32] Mm-hmm [affirmative].

David Carmona: [00:20:33] There’s way more. I have to say, though, that the one thing that we are, we are seeing is that the more parameters that you add … Right now, we are not seeing the ceiling of this. So, we keep improving the accuracy and the generality of the, of the model. So, hey, parameters are important. But then at the same time, it is true that it really … So, there’s not one model for everything. So, different models are good for different things. Right? And in our case, for example, we, we … Turing, our family of models. It’s actually a family because of that. So, we don’t believe that one model will … At least right now, will be useful for every single scenario that you are targeting. Right? So, in, in our case, we created that, that family of models, which are inclusive of, of many things, including many different languages, like, this basic [inaudible 00:21:27] that I was providing before or, or this, these metrics-

Sam Charrington: [00:21:30] Mm-hmm [affirmative].

David Carmona: [00:21:30] … of, of different models. You’re gonna need a model for each of them, depending on what you want to accomplish. But then even beyond that, ’cause not everything that you do is NLP. So, in the family of Turing in Microsoft, we have models that are even multi-modal, that include image and text or that are focused on image. And that thing will keep growing. So, that’s something important to keep in mind. The other thing is, of course, the eternal debate on the importance of the architectures, right, that, that you’re using. So, I think there’s a … And I don’t have a super strong opinion. I think it’s like everything. It will go through phases. It will get to a moment that just by adding brute force parameters, the thing will be very difficult to improve. And we’ll need to be a little bit smarter on how we can improve those models. We can optimize those models in, in another different way.

But again, I don’t want to diminish the fact that we keep seeing that we add more parameters and, and we get more power. Right? One thing that you said, though, Sam, I, I want to, I want to double click on that ’cause it’s super important. So, it’s the responsible AI implications of the model. I think that will be an area for models to differentiate and to keep in, in mind when you’re using a model ’cause the reality is that, right now, these models, they have a lot of challenges from the bias, transparency, and, and, and others that, that we need to keep in mind. So, just as we innovate on the power, accuracy and, you know, multitask aspect of generality of these models, we also need to innovate on the responsible side of them. And eh-

Sam Charrington: [00:23:08] [crosstalk 00:23:09].

David Carmona: [00:23:09] As, as you said, the training corpus, that’s important. I think right now, we are probably way too late in the pipeline to apply responsible AI principles to these models, meaning that we create things with these models. And then, just then, we apply those things like … I don’t know. Like, you know, filtering or many, many other techniques that you can use there. I think we need to go earlier in the process, even at the point of the training, so we can make those models responsible by design.

Sam Charrington: [00:23:41] Do you have a sense for how we can do that? A lot of the power of these models comes from, essentially, taking the entire internet and building a language model based on it or, you know, large parts of the internet. How do you apply the, you know, how … What are the techniques that we can use to build responsibility in earlier at that scale?

David Carmona: [00:24:08] So just as an example, but one example in Microsoft could be the Office or the Outlook auto reply. Right? So, what is … So, that is the typical example of a massive NLP model that is taking as an input an email and, as an output, is creating a likely reply that you want to, that you want to send. Right? So-

Sam Charrington: [00:24:28] Mm-hmm [affirmative].

David Carmona: [00:24:28] That scenario, on paper, it looks so simple [laughs], extremely simple. But when you get into the responsible side of it, [inaudible 00:24:37] extremely complex. And you need to, you need to pay a lot of attention. And it’s not like a one-shot thing that you do, and done, you are, you are, you are golden. The reality is that you need to apply that across the entire lifecycle of the model from, as you said … So, you mentioned one that is important, which is the training data. So yes, of course, we need to get a subset of the training data to make sure that there’s no toxic data that is training the model. But that is not, that is not enough. So, we need to keep in mind things like the privacy of the user. Right? So, think of, “How can we … ” So, actually, for this feature, we use differential privacy to make sure that the instances that we use [inaudible 00:25:20] surface, they are not … They cannot identify a user or things like those.

And you can also think of the input as something that we also manage, that we make sure that they are short answers, that they are not like, long emails [laughs], of course, things like those. So, it’s something that you need to do at every stage. There’s a ton of research, active research happening right now to really tackle this super complex challenge that we have with these models.
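Differential privacy in a production pipeline like the one David describes is far more involved, but the core idea, releasing results with noise calibrated to a privacy budget, can be illustrated with the classic Laplace mechanism on a simple count:

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Laplace mechanism: add noise scaled to sensitivity/epsilon.

    One user joining or leaving changes a count by at most 1 (the
    sensitivity), so noise at scale 1/epsilon makes the released
    number epsilon-differentially private -- no single user's
    presence can be confidently inferred from the output.
    """
    sensitivity = 1.0
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
noisy = laplace_count(true_count=1000, epsilon=0.5, rng=rng)
print(noisy)  # close to 1000, but randomized
```

Smaller epsilon means more noise and stronger privacy; the systems David mentions apply these ideas much earlier in training, not just to released counts.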

Sam Charrington: [00:25:47] Mm-hmm [affirmative]. So, before we jump into how we achieve this kind of scale, you mentioned something in our pre-call that really stuck with me, is this idea that models are becoming a platform. And you know, transfer is a piece of that. Fine tuning is a piece of that. I’d love to hear you riff on, on that idea. I think it’s a really interesting way to think about models.

David Carmona: [00:26:14] Yeah, yeah. It’s not a new concept. So definitely, we’ve been, seeing … So, you see our services [inaudible 00:26:23] services in Azure. And they support the concept of transfer learning. So, you don’t need to train a model from scratch. Right? So, it’s … But the reality is that a lot of what we do in AI is training models from scratch for your particular scenario. So, we’re doing everything that we can to try to simplify that process because if we don’t simplify that process, it’s gonna be very difficult to really scale AI in an organization, in a, in a company. So, there are definitely many techniques to do that. I think in the area of NLP, fine tuning is the most relevant now. And then we can talk about some emerging ones that are super interesting and cool. But with the fine tuning process, the idea is that you pre-train … You can use a model that is pre-trained, like our Turing model, pre-train on that [inaudible 00:27:10] information from the internet, multi domain, totally general. And then you fine tune that model.

So, fine tuning, meaning adding something to it. Like, for example, you want to fine tune the model to do sentiment analysis. So, you would add then like, a classifier or something like that, a binary classifier. And then you use labeled data. In this case, you use like, sentences that are, you know, positive, negative sentiment. And then you fine tune. So, you train additionally. It’s like extra steps of training that entire thing with your added classifier, in this case, for example, which is gonna update the weights. But it’s not starting from scratch, meaning that you don’t need that massive data and the skills, because you don’t need to change the architecture. You don’t need the compute, because it’s not that much compute needed. So, that is certainly a huge step into democratizing these models. Right? So, that’s, that’s super important. And not only can you do that, fine tuning for specific tasks, you can also fine tune it for your domain.
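A toy numeric sketch of this process, with a fake frozen “encoder” standing in for the pretrained model and only the new classifier head being trained. All names here are illustrative, and full fine-tuning would also update the pretrained weights for a few epochs rather than keeping them frozen:

```python
import hashlib
import numpy as np

def pretrained_encoder(text):
    """Stand-in for a pretrained model's sentence embedding.
    (A real setup would run the text through e.g. a BERT or Turing
    checkpoint; here we just derive a deterministic 16-dim vector.)
    """
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).normal(size=16)

# Tiny labeled set for the downstream task: 1 = positive sentiment.
texts = ["great movie", "loved it", "terrible film", "awful plot"]
labels = np.array([1.0, 1.0, 0.0, 0.0])

X = np.stack([pretrained_encoder(t) for t in texts])
w, b = np.zeros(16), 0.0  # the new binary classifier head

# Train only the small head with a few gradient steps on the
# logistic loss; the big "pretrained" encoder stays frozen.
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # sigmoid predictions
    w -= 0.5 * X.T @ (p - labels) / len(labels)   # gradient step on w
    b -= 0.5 * (p - labels).mean()                # gradient step on b

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds.tolist())
```

The point is the asymmetry: the head has 17 parameters and trains in milliseconds, while the pretrained representation it sits on took an AI supercomputer to build.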

So, if you work in finance, or you work in health, or you are in any industry … Say you are a law firm. You want to fine tune that model for the domain of your vertical. So, you don’t need to train the whole thing. You just need to train for that particular domain. So, super, super important. But then what we’re seeing is these models can go even beyond that. And that’s a super interesting area. Right now, it’s still in the beginnings. But what is the big difference with that approach? So, in this first approach, with fine tuning, you are training the model at some point. I mean-

Sam Charrington: [00:28:51] Mm-hmm [affirmative].

David Carmona: [00:28:52] Not from scratch, but you’re training it. You are changing the weight of, of the model. You’re-

Sam Charrington: [00:28:56] Mm-hmm [affirmative].

David Carmona: [00:28:56] You’re updating that model. You need [inaudible 00:28:58] to train it. But then we have these other techniques. They are called like, zero-shot or few-shot, where you don’t do that. So, the model can learn in [inaudible 00:29:08] time. So, you don’t need to change the [inaudible 00:29:11] of the model. You have only a model. You don’t change that model. Now, in [inaudible 00:29:15] time, where you are doing the inference of the model, you can … If you are doing a few-shot, then what you do is just provide a few examples of the task that you want to do, and then directly, the one that you want to solve. And the model will do it, which is mind blowing [laughs] that it can do that. But then you have zero-shot, which is like, the mind blowing times three [laughs], which is that you don’t even need to provide examples. So, you can ask one of these models, “Hey, I want to translate this to French.” And you provide the sentence. And the model will know how to do that. It will identify patterns in the corpus data that it was trained on.

And it will know what it means to be, to do a translation. And it will do that translation. So, those techniques, what they are really doing, from fine tuning to few-shot to zero-shot, is making it much easier to really use these models in your particular scenarios for your particular domain, your particular task, or your particular modality. Super cool.
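Since few-shot and zero-shot learning happen entirely at inference time, the “programming” is just prompt construction. A sketch of both styles (the translation pairs echo the example from the GPT-3 paper; no real model is called here):

```python
def few_shot_prompt(examples, query):
    """Few-shot: show the model a handful of input -> output pairs
    in the prompt itself, then the new input. No weights change;
    the model infers the task pattern at inference time."""
    lines = [f"{x} -> {y}" for x, y in examples]
    lines.append(f"{query} ->")
    return "\n".join(lines)

def zero_shot_prompt(instruction, query):
    """Zero-shot: just describe the task in plain language and let
    the model rely on patterns from its training corpus."""
    return f"{instruction}\n{query}"

print(few_shot_prompt([("sea otter", "loutre de mer"),
                       ("cheese", "fromage")],
                      "mint"))
print(zero_shot_prompt("Translate English to French:", "cheese"))
```

Either string would then be sent to the model’s inference endpoint, and the completion it generates is the answer, which is what makes these techniques so much cheaper to adopt than any form of training.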

Sam Charrington: [00:30:18] Mm. Awesome, awesome. We’ve talked about different kinds of models. Uh, just a few quick words on applications. Like, you know, what do you think are the most exciting applications of language models generally or, or Turing in particular, you know, within and outside of Microsoft?

David Carmona: [00:30:38] Yeah. So, what I can do, because it's a [laughs], it's a big one, we could talk for a long time, is give you an overview of how we are using it in Microsoft. And then you can get a sense of the usages that it can have. So, in Microsoft, the way we look at this is like … We always look at any technology as a stack. So, our goal always is to deliver a full stack. And that's our approach to any technology. So, we do the research. But then we want to make sure that that research is available for others to use. And then we want to make sure that we keep adding layers. For example, the first one would be releasing that as open source. Right? So, we add another layer. We want that to be part of Azure, so you can train those models yourselves, with the AI supercomputer that we are providing in Azure to train those models.

But then we keep building on that. On top of that, we have things like Azure Machine Learning. So, you have another abstraction layer that can improve your productivity, fine-tuning those models, like I mentioned before. But then we put another layer on top of that, which is Cognitive Services, which are end-to-end, out-of-the-box services that you can use as endpoints. And you can infuse them directly into your application without worrying about doing anything with those models. And then on top of that, we build applications. So, we make them part of our products, like Office and Dynamics. Or we create new products that were impossible before. So, that's the full-stack approach. I think if we focus on the application side, just to give you some examples of things that are already available, that people can use, that are powered by these massive models: we use them a lot in Office. A lot of things in Office are powered by these models. So, you can think of, for example, semantic search in Office. So, you open a Word document, you search for something in that Word document.

And that is not the traditional find and replace [laughs] that we had before. This is semantic search. So, you can even ask questions to the document. And [laughs] the document will answer those questions. That is all powered by Turing. You have things like document summarization. So, you go to SharePoint, and you hover on a document. And you will see a summary of the document in there. And that summary is abstractive. So, it's not just taking parts of the document. That is generated with Turing. Things in Outlook, like the Outlook auto-reply that I was mentioning before. Or there's something called Meeting Insights that, before a meeting, will give you all the relevant information about that meeting. So, those are like … In the taxonomy that we were talking about before, those would be Horizon 1. It's about making those applications better. But then we have these Horizon 2 things that are about the new opportunities that these models can open.
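The idea behind semantic search, matching by meaning rather than by exact strings, can be sketched with a toy example. The vectors below are made-up illustrations; a real system like the one described here would get embeddings from a large language model rather than hard-coding them.

```python
import math

# Toy sketch of semantic search: rank passages by embedding similarity
# instead of exact keyword match. The vectors are fabricated for
# illustration only.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

passages = {
    "The quarterly revenue grew by 12 percent.": [0.9, 0.1, 0.0],
    "The office picnic is on Friday.":           [0.1, 0.8, 0.3],
}
query_vec = [0.8, 0.2, 0.1]  # e.g. an embedding of "How did sales do?"

best = max(passages, key=lambda p: cosine(passages[p], query_vec))
print(best)  # the revenue sentence, even though "sales" never appears
```

The design point is that the query and the matching passage share no keywords; they match because their embeddings are close, which is what lets you "ask questions to the document."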

And I think a good example of that would be Project Cortex. So, Project Cortex is part of the Microsoft 365 family. And the goal of that project is super cool. So, what it does is that it's able to get all your internal knowledge in your organization by looking at both the structured and the unstructured data in your organization. So, think of documents, meetings, PowerPoints, anything that you have in there, even images, 'cause it's able to scan and do OCR on images. So, it's able to crawl all that information for your company, and then to extract knowledge out of that. So, what we do is that we create this concept of a knowledge entity. Like, imagine that … I don't know. You are in a law firm. Imagine international commerce, whatever. I don't know. I have no idea of law. But it's like a topic-

Sam Charrington: [00:34:23] [crosstalk 00:34:24].

David Carmona: [00:34:23] … that the AI system was able to extract from your information. And it can help you a lot. So, it can give you … It can provide you with a summary. It can give you what are the most relevant documents for that particular subject in the company, and what are the experts, so, who you should talk with about those topics. So, it's mind-blowing, these knowledge bases. Right? So that you can get … It's extracting the DNA of your company. So, you can really make it available for the rest of the employees. And like, those, I mean, I could keep going. So, any product that you can mention … And then there's Bing. So, it's another, of course, super important one. Things like question and answer in Bing, and even the universal search. So, we use this trick of universal language representation in Bing. And those are all available in there as well. Yeah. So, we use it [inaudible 00:35:16].

But more on the business side, I would mention, in Dynamics 365, we use these models for a lot of different things. Very obvious one, of course, is anything that has to do with customer service understanding or, you know, sentiment analysis. All of that in customer service that is-

Sam Charrington: [00:35:33] Mm-hmm [affirmative].

David Carmona: [00:35:33] … powered by these models. But then there are things that are more visionary. So, think of, for example … In Dynamics 365, one of the things that we can provide is suggestions to sellers in your company, by looking at any interaction with that customer before, like emails or documents, phone calls, whatever. Right? So, it's able to understand that unstructured information and give you … It's like language generation, but in this case, to suggest the next steps with your customers.

Sam Charrington: [00:36:01] Hmm.

David Carmona: [00:36:02] So, yeah. Super, super broad. We could talk for a while. Yeah [laughs].

Sam Charrington: [00:36:04] [laughs]. So, you know, let's maybe jump into what's happening that's enabling all of this to take place now. One of the things that … You know, when we think about kind of the scale and size of these models … You know, we've talked about the scale of the compute that has been required to enable it. You know, how do you thi- … And you mentioned AI supercomputers. Like, what's that all about? How do you think about, you know, building out the infrastructure to scale and train these models?

David Carmona: [00:36:36] Yeah. Let's say that training a model like this on your laptop would take probably thousands of centuries [laughs]. So, definitely, you need a lot of scale to train [crosstalk 00:36:48].

Sam Charrington: [00:36:48] Yeah.

David Carmona: [00:36:48] And you need … I mean, it’s amazing, the kind of challenges that you get when you grow a model like this. Like, fundamental challenges like, “Hey, the model doesn’t fit in your GPU.” [laughs] That’s-

Sam Charrington: [00:37:02] Mm-hmm [affirmative].

David Carmona: [00:37:03] Something that we wouldn't have seen before. Right? So, I think it is like … If you pass 1.3 billion parameters, something like that, then the model is not gonna fit. So, you better find new ways. But then it's not just the memory. So, the time-

Sam Charrington: [00:37:15] [crosstalk 00:37:16].

David Carmona: [00:37:16] … required to train one of these models, you need, like, ultra scale. And I think … So, that's the main reason why we focus on … And like always, like I was saying in the beginning, we try to have a platform approach to it. So, not thinking of fixing this problem just for Turing, for our models, but fixing this problem for our customers, so they can use this infrastructure as well.

Sam Charrington: [00:37:38] Mm-hmm [affirmative].

David Carmona: [00:37:38] So, the approach that we took was building this massive infrastructure in Azure. So, these are massive clusters that you can spin up directly in Azure. And not only can you spin them up; then, of course, you have the complexity when you have … I mean, imagine … For example, the one that we announced a year ago is a massive cluster of like, 10,000 GPUs. You have more than 200,000 CPU cores. So, it's massive scale. So, how do you manage that? You need things that allow you to manage that in a distributed way. And then what is even more challenging is, “Okay. So, I have my infrastructure completely managed. I can use it.” It is integrated with Azure Machine Learning. So, you can launch jobs in that massive infrastructure. But then how would you actually do it? So, you have a model that is, by definition, huge. So, how do you train that thing? How do you divide this task, this super complex task, into individual pieces in your massive cluster?

And that’s that’s the other side of the coin, which is our work on these like, software systems that are meant to help you in that process. So, this was … At the same time that we announced the AI supercomputer, we also announced … It’s called DeepSpeed. It’s open source. So you can use it on, on top of anything. And it will help you do that for you. So, what it will do is that it will take this training. And it will distribute that training across a massive infrastructure. So, it will know how to do that in an efficient way. And it does it basically … It’s like a three … We call a 3D distribution because it takes like three different [inaudible 00:39:18] to, let’s say, chunk this task. Right? One, which is the most basic one, is the data distribution. So, you just [inaudible 00:39:27] your data in smaller chunks. And then you have [inaudible 00:39:30] each node is gonna take one of those chunks. But that is not enough. You need to go further than that.

So, the other level of distribution that we use is the pipeline distribution, which is possible because of the transformer architecture; that layer symmetry makes it natural to split the model by layers. So, each node will take a different layer, and there's communication and optimization going on there that you need to take care of. And then the last one is tensor slicing, in which, even for each of those layers, we can divide it into smaller chunks, each on a different GPU. So, what that allows you … There was a lot of research involved in this framework. With it, you almost get, like, a linear distribution, like, a linear growth in your model. So, you can keep growing the number of parameters … And by the way, DeepSpeed is able to scale to more than one trillion parameters. So, you can train models that are not even existing today. And you see the line, and it's almost linear. So, it's exactly what you are looking for in these systems.
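The three levels of splitting described here (batches of data, layers of the model, and slices within a layer) can be sketched as simple partitioning bookkeeping. This toy code only illustrates how the work gets divided across a grid of GPUs; it is not how DeepSpeed is actually implemented.

```python
# Toy illustration of 3D distribution: split the batch (data parallel),
# the layers (pipeline parallel), and each layer's parameters (tensor
# slicing) across a grid of workers.

def chunk(items, n):
    """Split a list into n nearly equal contiguous chunks."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        size = k + (1 if i < r else 0)
        out.append(items[start:start + size])
        start += size
    return out

batch = list(range(8))             # 8 training examples
layers = ["L0", "L1", "L2", "L3"]  # 4 transformer layers

data_shards = chunk(batch, 2)      # 2-way data parallelism
pipe_stages = chunk(layers, 2)     # 2-way pipeline parallelism
tensor_ways = 2                    # each layer split across 2 GPUs

# Every (data, pipeline, tensor) coordinate is one GPU: 2 * 2 * 2 = 8.
grid = [(d, p, t)
        for d in range(len(data_shards))
        for p in range(len(pipe_stages))
        for t in range(tensor_ways)]
print(len(grid))  # 8 GPUs in the grid
```

Because the three dimensions multiply, adding capacity along any one of them grows the cluster without any single GPU ever having to hold the whole model, which is what makes the near-linear scaling possible.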

Sam Charrington: [00:40:35] Oh, wow. Wow. And what about on the hardware side? Microsoft announced Project Brainwave some time ago, to bring new hardware architectures to bear on this problem. Can you share a little bit about that?

David Carmona: [00:40:50] Yeah. So, yeah. We announced that maybe a little bit more than that ago. But it's fully available now. So, you go to Azure, and you go to Azure Machine Learning, and one of the options that you have to deploy your model is FPGAs. And what that is gonna give you, especially at inference time, is very low latency and a lot of, you know, efficiency in cost. Right? So, it's perfect for massive … I mean, I always use the same example. So, this feature in Word, one of the features powered in Word by Turing, is called predictive text. So, that means that, when you type, it's gonna give you suggestions for how the text will continue. Right? So, think of that kind of intelligence, but for Word. 300 million users of Word. Imagine doing the inference of that model on every keystroke [laughs]. So, that's the-

Sam Charrington: [00:41:39] Mm-hmm [affirmative].

David Carmona: [00:41:40] That's the scale that we're talking about here. It's huge. So, you better optimize that a lot if you want to scale it to that number. And we do that … I mean, you have to do it in … Again, it's like a game where you have to tweak every single step. Of course, we don't go with these multi-billion-parameter models at inference time. So, there's a lot of optimization to do there to reduce the number of parameters, and to use techniques to make it more efficient. And then there's the hardware. Right? So, we use the ONNX Runtime in Microsoft. That can optimize not only for the CPU … So, it has optimizations for CPUs, but also for FPGAs. So, it's a way of abstracting from the hardware that you have underneath. And it really allows you to bring all these things that are great to talk about from the research point of view. But then putting them in action requires all this level of detail, which is a new level of complexity.
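A back-of-the-envelope calculation gives a feel for why serving these models on every keystroke forces so much optimization. All numbers below are illustrative assumptions, not Microsoft's actual figures.

```python
# Back-of-the-envelope: why multi-billion-parameter models need heavy
# optimization before inference at Word's scale. All numbers here are
# illustrative assumptions.

params = 1.3e9                  # parameters in a "small" large model
bytes_fp32 = params * 4         # 32-bit floats: 4 bytes per parameter
bytes_int8 = params * 1         # 8-bit quantization: 4x smaller

gb = 1024 ** 3
print(round(bytes_fp32 / gb, 1))  # ~4.8 GB of weights in fp32
print(round(bytes_int8 / gb, 1))  # ~1.2 GB after int8 quantization

# Serving load: suppose 1% of 300M users are typing at 3 keystrokes per
# second, and every keystroke triggers one inference request.
requests_per_sec = 300e6 * 0.01 * 3
print(int(requests_per_sec))      # 9,000,000 requests per second
```

Even under these generous assumptions, the raw model is gigabytes of weights facing millions of requests per second, which is why the pipeline combines parameter reduction, quantization-style tricks, and hardware acceleration rather than relying on any one of them.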

Sam Charrington: [00:42:38] Mm. So, this is primarily focused on the inference side. Do you see any … Are there any particular innovations you’re excited about on the hardware side for training? Or you, do you see it primarily being evolutions of today’s GPUs?

David Carmona: [00:42:55] I mean, when we see … I mean, it's all evolving super fast. So, we'll see … The reality right now is that you have to be flexible. So, we are not-

Sam Charrington: [00:43:02] Mm-hmm [affirmative].

David Carmona: [00:43:02] … discarding any approach, any at all. Right? So, the reality is that the FPGA for inference was super efficient because it allows you to change it. Right? So, it's programmable. So, that was very, very efficient and very agile. The combination of agility and efficiency was the right thing. But that may change at any moment. And as these things get more stable, then an ASIC may be the way to go. And, yeah, of course, we are not discarding any of those approaches.

Sam Charrington: [00:43:32] So, how do you see this level of scale that we’re dealing with today impacting the world for kind of users of AI? What, what changes?

David Carmona: [00:43:43] I think that the main thing, maybe bringing all of this together, is how this will change the way that you develop AI. So, how this will open new ways of developing AI that we can use right now. So, that whole concept of creating more general multitask, multi-domain, multi-modality models that you can then customize for your particular task, that has huge implications on how you can … One, how you can scale AI in your organization, and how AI can scale to other organizations, like smaller organizations. Right? So, that for us is a huge aspect of all of this. And the way that I see it is that it's kind of what we experienced in the last 20 years for software. And this is very similar. So-

Sam Charrington: [00:44:38] Mm-hmm [affirmative].

David Carmona: [00:44:38] With software, at some moment, we had the hard lesson that software has to be super connected to [laughs] the business. So, if you have a team of software developers in a basement [laughs] not connected to the-

Sam Charrington: [00:44:51] [laughs].

David Carmona: [00:44:51] … business, that is not gonna work. I think we are ki- … AI is in a basement right now, kind of. Right? So, it’s-

Sam Charrington: [00:44:57] [laughs].

David Carmona: [00:44:57] We are not fully connected to the business yet, because it requires so many skills and so much expertise that it's a very technical domain right now. We need to change that. So, we need to make sure that the business and AI come together. And we learned that with software. It's called DevOps. It's about bringing the two together, and then doing small iterations. It's coming to AI. We are all talking about MLOps now. It's a huge area. It's our focus, definitely, in Microsoft, to provide the platform to empower that collaboration and that continuous iteration, and traceability of everything that you do in your AI development cycle. And that will be massively empowered by AI at scale. So, you have models that can really empower, like, a more dynamic way, so you don't have to create these models from scratch. You can iterate on them with the business and just focus on teaching your domain to the model instead of starting from scratch. That goes in that direction. We do think that there's one step beyond that. We are also seeing … We also saw it with software. That also needs to happen with AI, which is really going beyond the technology and the businesses, and getting to every employee.

So, how every employee in an organization should be empowered with AI, just like they can use Excel right now to work with numbers. We need that for AI. So, every employee can apply AI, and not only apply it, but also create, consume, mix and match, with some level of freedom to really apply AI to what they do. That's another huge area, like the augmented intelligence area.

Sam Charrington: [00:46:41] Mm-hmm [affirmative].

David Carmona: [00:46:41] With these models, we may see it happening sooner rather than later.

Sam Charrington: [00:46:45] Awesome. Well, David, it’s been wonderful to catch up with you and to dig into some of the work you’re doing around AI at scale. Thanks so much for taking the time to chat with us.

David Carmona: [00:46:58] Thank you so much, Sam. It was a pleasure.

Sam Charrington: [00:47:00] My pleasure.

David Carmona: [00:47:01] Thank you.

Sam Charrington: [00:47:02] All right, everyone. That's our show for today. To learn more about today's guest or the topics mentioned in this interview, visit twimlai.com. Of course, if you like what you hear on the podcast, please subscribe, rate, and review the show on your favorite podcatcher. Thank you so much for listening, and catch you next time.