Inside AI Tokenomics: How to Profitably Turn Tokens Into Business Value | NVIDIA AI Podcast Ep. 299

May 21, 2026 · 33 min · 5,263 words


Show notes

As AI factories scale and token costs become a defining competitive variable, the way businesses measure infrastructure ROI needs to change. In this episode, Shruti Koparkar from NVIDIA's Accelerated Computing team breaks down tokenomics (the four-pillar framework of token utility, supply, demand, and monetization) and reveals why NVIDIA Blackwell's architecture delivers 50x more tokens per watt than NVIDIA Hopper, translating to a 35x reduction in token cost.

🔬 Topics covered:
The four pillars of tokenomics: utility, supply, demand, and monetization
Why cost per token beats FLOPS per dollar as an infrastructure metric
NVIDIA Blackwell vs. Hopper: 50x more tokens per watt, 35x lower token cost
How extreme co-design turns spec-sheet numbers into real-world output
Jevons paradox: why lower token cost always drives more GPU demand, not less
The four business models for turning tokens into revenue

Chapters:
00:00 – Introduction and the four pillars of tokenomics
02:09 – Token value: intelligence, interactivity, and use case mapping
06:32 – Estimating token demand: users, reasoning, and agentic multipliers
10:00 – Token supply and why cost per token is the right infrastructure metric
13:12 – NVIDIA Blackwell vs. Hopper: 50x more tokens, 35x lower cost
14:52 – Extreme co-design for lowest token cost and the NVIDIA Vera Rubin platform
21:10 – How software multiplies hardware performance (8x gains in six months)
23:56 – Token monetization: pricing and business models
26:52 – Jevons paradox and the future of GPU demand

Highlighted moments

“There are two key factors that impact token value. One is the intelligence embedded in the token, or how much intelligence does the token carry. And the other is how fast does it arrive, which is essentially the interactivity.”

Jump to 2:31 in the transcript

“Whereas co-design is about designing from the ground up simultaneously multiple parts of the same system, knowing that they are all optimized towards the same outcome, that of lowest token cost.”

Jump to 15:26 in the transcript

Transcript

Token Value Introduction

0:00Not all tokens are created equal, and there is a way to look at token value. There are two key factors that impact token value. One is the intelligence embedded in the token, or how much intelligence does the token carry? And the other is how fast does it arrive?

0:22Welcome to the NVIDIA AI Podcast. I'm Noah Kravitz. I'm here with Shruti Koparkar. Shruti is a member of the Accelerated Computing team here at NVIDIA, and she focuses on inference.

Tokenomics Basics

0:33And we're here to talk about tokenomics. As data centers become AI factories and produce intelligence for the new industrial revolution, this word tokenomics has been fluttered about. It's a useful term, but maybe we can break it down with your help, Shruti, so that it's really something that business leaders can understand and take into practice. Yes, absolutely. Well, first of all, thanks a lot for having me, Noah. Thank you for coming. I am very excited to dig into the economics of AI, or tokenomics. And as you said, it is a term that gets used quite a bit,

1:06and I welcome the opportunity to help define it, so to speak. So the way to think about tokenomics is, it's about how tokens are valued, supplied, consumed, and monetized. And what that essentially maps to is, first, token utility, which is all about token value. Then there's token supply, and this is where your AI infrastructure decisions are, right? Thinking about what infrastructure to invest in that will maximize your token output while minimizing cost. Then there's token demand.

1:38This is where customers and organizations think through what is their number of users, how many use cases, what types of use cases. So really sort of mapping out the volume and velocity of the tokens that they need. And then finally, there's token monetization, which is taking the tokens and turning it into business value. So those are sort of the four pillars for tokenomics. And it's super important to understand all four of those and how they relate to each other to be able to deploy AI successfully.

Token Utility

2:09So let's start at the top then with utility or value. How do you define the value of a token? Are all the tokens worth the same? Do they have differing values? Is there a better way to look at it? How do you approach that? That's a really great question. And you're right, actually, that not all tokens are created equal. And there is a way to look at token value. There are two key factors that impact token value. One is the intelligence embedded in the token, or how much intelligence does the token carry.

2:40And the other is how fast does it arrive, which is essentially the interactivity. So to unpack that a little bit, the intelligence of the token is dependent on the model that produced the token. So more complex, more intelligent models will produce tokens that, in general, have much more intelligence. A higher value, okay. And then it also depends on the context that the models looked at. And generally speaking, the longer the context that you sort of allow the model to look at, the better the accuracy, the better the intelligence of the tokens.

3:16Now, I say generally because there are cases where if the context increases too much, then the model quality, the output quality can degrade. But I don't want to rabbit hole into that. Generally, more context does kind of equate to better intelligence. So that's one aspect. And then, as I mentioned, how fast the token arrives is the token interactivity, which is essentially tokens per second per user. So it's the rate of token generation.

3:47And so if you look at token value as a spectrum, on the one hand, you have these basic models with shorter context, generating tokens at not that fast a speed. And then on the other extreme is these more complex, more intelligent models with much larger context and generating tokens that are really fast. And across that entire spectrum is your different use cases and how you map those use cases to the token value.

4:19So, of course, there is an absolute way in which to think about the token value, but there is also a relative way with respect to the use cases in how to think about it. And we can unpack that a little bit if you want. So is it fair to say that the value of the token is tied to the task at hand as well? Yes. And that's exactly sort of what I was trying to get at when I said that, like, you have to think about mapping the right use case to the token value.

4:50So as an example, we said, like I said earlier, right, that tokens generated by more complex intelligent models are more valuable. But that's in an absolute sense. Relatively speaking, your use case may not require that more complex, more intelligent model. And then that additional value is completely useless to you. One example of this is domain-specific applications, where in a very narrow context, a post-trained, which is fine-tuned, small language model, so a much smaller model, can give you just the value you need.

5:26In fact, in some cases, even better accuracy for that given task. So you don't need always the big, large models. And so relatively speaking, you need to map where on the spectrum of token value does your use case sit. Same thing is true for the interactivity piece, which is agentic applications absolutely need the highly interactive tokens. But you may have applications like chat interfaces or enterprise search, which don't need that level of interactivity. And so that is very critical when you are thinking through your sort of AI deployment decisions of where to map your use case to what token value.

Token Demand

6:06So when a business leader is thinking about demand and thinking about mapping out tokens to use cases and the different use cases have different values associated with them, what's a good approach for someone who's looking at what their org is doing, what their different team members need? How do they start to get a handle on, well, how many tokens are we going to need to produce and how many of each kind? Yeah. So use cases are extremely important when thinking through token demand.

6:37And there are three layers in which you can think about this with improving levels of forecasting accuracy, if you will. So the basic sort of, you know, back of napkin math is look at how many users you have, how many requests or sessions the user is going to initiate in a given day or month, and then how many tokens you need per request or per session. Right.

7:07And those three numbers put together will give you your base demand for a single day or a month or, you know, whatever is your time period of analysis. Now, that is the base and very, very sort of simplistic look at it. Right. There are multipliers that you do need to account for that will dramatically change your understanding of your token requirement. And a couple of those are, number one, are you using reasoning models? As we know, like reasoning models use thinking tokens, which never get seen by the end user.

7:40And oftentimes when AI is deployed, you can actually set thresholds on how many thinking tokens are allowed per interaction. And so when you are estimating demand, you do need to think through, are we using reasoning models? What are our thresholds? What do we expect the peak and average to be on those, you know, on that use? So that's one. Second is agentic. Agentic is a huge multiplier because any use case, if you are deploying it in this sort of agentic workflow context, then there are multiple sort of turns and loops that might happen that can increase your token demand significantly.

8:14And then finally, the last factor is something called cache hit rate or the KV cache hit rate. And for those listeners for whom this term might be new, KV cache is sort of like the short-term memory of a model. And so anytime an input request comes into a model, it needs to process it. But if it's already seen that input request before, then many times it actually gets stored in the cache. And then when it comes in again, it doesn't need to recompute it.

8:45It can just use those cached values. So those are some key factors to kind of look at to get to a higher degree of accuracy when thinking about token demand. And then the final one is demand variability, which is how is your demand changing in a day? Like sometimes you may have products that get used quite a lot in the morning hours, but not so much in the evening or vice versa. Same thing with seasonal variability. For example, you know, retail providers or e-commerce will see a surge during the holidays when they're trying to push a lot of products out, right?

9:22So you do need to think through those. And then, of course, there is the user growth. So you started with a base number of users, but you as a business are trying to constantly drive up user growth. So you need to factor in how much you expect that to grow as you think through your token demand.
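The back-of-napkin demand estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the split into input and output tokens, and all example values are hypothetical and do not come from the episode. Here the KV cache hit rate is assumed to reduce only input tokens that need recomputation, the reasoning multiplier is assumed to inflate only output tokens, and the agentic multiplier is modeled as the number of LLM calls per user request. Daily and seasonal variability is not modeled.

```python
from dataclasses import dataclass


@dataclass
class DemandAssumptions:
    """Hypothetical inputs for a token demand estimate over one period."""
    users: int                          # users at the start of the period
    requests_per_user: float            # requests or sessions per user
    input_tokens_per_call: float        # prompt/context tokens per LLM call
    output_tokens_per_call: float       # visible output tokens per LLM call
    llm_calls_per_request: float = 1.0  # agentic multiplier: turns and loops
    reasoning_multiplier: float = 1.0   # hidden thinking tokens inflate output
    cache_hit_rate: float = 0.0         # share of input served from KV cache
    user_growth: float = 0.0            # expected user growth, e.g. 0.2 = 20%


def estimate_tokens(a: DemandAssumptions) -> float:
    """Return the estimated number of tokens to compute in the period."""
    requests = a.users * (1.0 + a.user_growth) * a.requests_per_user
    llm_calls = requests * a.llm_calls_per_request
    # Cached input does not need to be recomputed.
    input_tokens = llm_calls * a.input_tokens_per_call * (1.0 - a.cache_hit_rate)
    # Thinking tokens are generated even though the user never sees them.
    output_tokens = llm_calls * a.output_tokens_per_call * a.reasoning_multiplier
    return input_tokens + output_tokens


# Base case: users x requests x tokens per request.
base = estimate_tokens(DemandAssumptions(1000, 10, 500, 200))

# Same users, but agentic (5 calls per request), reasoning (3x output),
# and a 50% cache hit rate on input.
agentic = estimate_tokens(
    DemandAssumptions(1000, 10, 500, 200,
                      llm_calls_per_request=5,
                      reasoning_multiplier=3,
                      cache_hit_rate=0.5)
)
print(f"base: {base:,.0f} tokens, agentic + reasoning: {agentic:,.0f} tokens")
```

Even with half of the input served from cache, the multipliers in this made-up scenario raise demand several times over the base estimate, which is the effect the speaker is pointing at.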

Token Supply

9:41So demand, of course, leads us to supply. How do you start thinking about supply? And you've mapped out, you know, sort of your baseline and the conditions that you just outlined, Shruti. How do you go about then translating that into creating the supply necessary to get all these tasks done? So when it comes to token supply, that's where a lot of the AI infrastructure decisions lie. And when you're making that decision, what you want is maximum token availability, token output while minimizing your token cost.

10:14Now, when you think about cost or total cost of ownership, oftentimes organizations and decision makers can gravitate towards the easily available metrics and what I like to call input metrics, such as the cost per GPU hour or the flops per dollar, which is essentially how many floating point operations are you getting per dollar. And these are input metrics because they don't tell you anything about the actual delivered token output, which is a function of much more than just flops or just, you know, the memory you have.

10:51It is a function of extreme co-design. And so the metric that represents both your input, but also the output is cost per token. It's a very simple metric that tells you what is the cost that you're paying for generating one token. And it's essentially, you know, the cost of GPU divided by how many tokens does the GPU produce.
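As a minimal sketch of that definition, cost per token can be computed from an hourly GPU cost and a measured token throughput. The function name and the example numbers below are hypothetical, chosen only to show the arithmetic. The result is expressed per million tokens because per-token figures are inconveniently small.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float) -> float:
    """Cost of GPU time divided by tokens produced, scaled to 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical example: a GPU that costs $3 per hour and delivers
# 1,000 tokens per second across all users it serves.
print(f"${cost_per_million_tokens(3.0, 1000.0):.4f} per million tokens")
```

The key point is that the denominator is delivered throughput, which depends on the whole system and software stack, not just on the spec sheet.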

11:21So in a way it incorporates both the input and the output and gives you a sense of your true ROI from the AI infrastructure. It's interesting to hear you explain it because it sounds so simple and I can understand coming from the other point of view, right, of what we're outlaying for the GPUs and the server racks and all the interconnectivity. And so we, you know, we count those costs. But looking at it from the other end, as you said, the output cost just makes so much sense because that's what you're trying to get at the end is the intelligence, the token.

11:55And so putting the price on that seems like a really kind of clear way to think about it. Does the cost per token metric vary at all or do you have to think about it differently depending on the use cases as you were talking about before?

12:09Cost per token is sort of the base metric. It's the base metric. Of course, it will vary depending on all the other things, like the model, the context, the intelligence, basically, and then the interactivity. So any tokens that are generated by a more complex model or are more interactive are going to be costly. Of course, yeah. They are just, that's just physics, right? So, yes, it definitely does depend on the models, the context, as well as the interactivity. But, you know, you said it really well earlier that ultimately, if the business runs on the output, which is the tokens, right, it is kind of a fundamental mismatch.

12:50If you are evaluating infrastructure based on the inputs, but your business runs on the output. And that's why cost per token starts to get at sort of the real ROI because it measures both in many ways. So, Shruti, as we think through input metrics and cost per token, is there an example that comes to mind that can really kind of bring this idea to life? Yeah, absolutely. In fact, if you look at NVIDIA Blackwell compared to NVIDIA Hopper, and if you look at just merely the input metrics, which is the hourly GPU cost, that's 2x.

13:27And so that's Blackwell being maybe 2x more expensive than Hopper. If you just look at flops per dollar, that's also 2x. So Blackwell does deliver 2x more flops per dollar. And that sounds like a huge advantage, which it is, but it also doesn't even scratch the surface of the true sort of benefit and value of Blackwell. And that's because Blackwell, when it comes to delivered output, delivers 50x more tokens per watt compared to Hopper.

14:0350x. 50x. Fantastic. So with the same infrastructure footprint, the Blackwell NVL72 system delivers 50x more tokens than Hopper. And that translates to a 35x lower token cost. Amazing. And so that really, I think, brings the point home on why not just look at the input metrics, but look at a metric like cost per token, which represents both what you're paying, but also what you're truly getting.
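The shape of this comparison can be sketched as a simple ratio: how much more you pay per hour against how much more output you get. The function below is a hypothetical illustration, not NVIDIA's accounting. Dividing a 50x throughput gain by a 2x hourly price gives 25x, not the 35x quoted in the episode; the quoted figure presumably rests on a different basis (the 50x is stated per watt, and the cost side is not broken down in the conversation), so the sketch shows only why an output-based metric moves far more than the input metrics suggest.

```python
def cost_reduction_factor(price_ratio: float, throughput_ratio: float) -> float:
    """How many times cheaper each token becomes on the new system.

    price_ratio: new hourly cost divided by old hourly cost.
    throughput_ratio: new token output divided by old token output.
    """
    if price_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return throughput_ratio / price_ratio


# Input metric only: 2x the hourly price looks like a 2x cost increase.
# Output metric: the same 2x price with 50x the tokens is a large net gain.
print(cost_reduction_factor(price_ratio=2.0, throughput_ratio=50.0))
```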

14:37So I'm glad you mentioned I was going to ask you to go back. You mentioned extreme co-design. We've talked about it before. Obviously, anyone familiar with the space has heard the term. But maybe you can dig in a little bit to what it means, particularly in this context. Yes. I actually welcome the opportunity to talk about extreme co-design. Fantastic. Because we get asked this question quite a lot. And so, you know, often we get asked, why extreme co-design? What does co-design even mean? Like, is it just integration?

15:08And, you know, people may think that this is just splitting hairs or just semantics. But I do think that the distinction is important. Okay. Because when you think about integration, you think about different parts, different sort of, you know, independent units that are then integrated post facto. Whereas co-design is about designing from the ground up simultaneously multiple parts of the same system, knowing that they are all optimized towards the same outcome, that of lowest token cost.

15:42And so that's why the word co-design is extremely important. And the reason it is called, or rather the reason we call what NVIDIA does extreme co-design, is the depth and breadth to which it extends. So it's co-design across just compute? No. It's compute, memory, storage, networking. Everything. Right? Everything. I mean, the Vera Rubin platform has seven chips. But it goes even beyond that. There's all the software that sits on top.

16:13So everything from the CUDA kernels to the runtimes to the serving software, as well as all the way out to the ecosystem. Everyone from our, you know, silicon partners, our OEMs, our cloud providers that we work with, the various OSS frameworks that we work with. The co-design extends beyond just sort of, you know, what's in a system, what's in an AI factory, all the way out to ecosystem. And that's one of the reasons why it's extreme.

16:44Yes. And so, anyway, but you asked a question, a more specific question, about what are some of the extreme co-designs that help with the cost per token. And I think one important one, which I think you've actually discussed in the previous podcast, is the mixture of experts models. Sure. How the Blackwell NVL72 is such a great fit for them because it kind of helps with the inter-GPU communication. And then all the software in terms of Dynamo's disaggregated serving, coupled with any of the runtimes that we support, whether it is TensorRT, vLLM, SGLang, doing a technique called Wide Expert Parallel that greatly optimizes the inference performance and then thereby reduces the cost per token for those mixture of experts models.

17:31So that's one great example. In a conversational setting, the user prompts something, say you prompt something, and then the LLM says something.

18:07Right. And then you say something else, and then it says it back. So you are taking, as a human, turns with the LLM, with the AI. In agentic, it's actually AI taking turns with AI as well as with software because a main agent can sort of, based on the user input, decide to do some reasoning, then decide, oh, I need to do a tool call. So call some software, then it might decide, oh, I need a sub-agent or a specialized agent to go do some work.

18:40So it's going to take a turn with the specialized agent till the specialized agent, you know, does its computation, comes back with a result. And this just keeps going. Right. And we love agentic for that. That's right. And it's multi-turn in a way that has no user involvement other than the prompt that the user gave in terms of, like, maybe, say, book a ticket to Miami. And then it goes through all of this, several turns, to then finally produce an outcome.

19:10And the number of turns involved in agentic is significantly higher than conversational. So the number of LLM calls, like the number of times the large language model is called is also higher. And in general, the token demand for that reason is also higher. Of course, yeah. And that's why extreme co-design is so critical because you are using up so many tokens, so you have to lower the cost per token. Latency is really important because on every turn, even a couple of milliseconds more add up to, you know, several potential seconds of delay for the end result.
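To make the multi-turn effect concrete, here is a small sketch of how token count and added delay grow with the number of LLM calls. The function name and all numbers are hypothetical; real agentic workloads have variable tokens per call and calls that may run in parallel, which this deliberately ignores.

```python
def agentic_overhead(llm_calls: int,
                     tokens_per_call: int,
                     extra_latency_ms_per_call: float) -> tuple[int, float]:
    """Return (total tokens, added delay in seconds) for a sequential workflow."""
    total_tokens = llm_calls * tokens_per_call
    added_delay_s = llm_calls * extra_latency_ms_per_call / 1000.0
    return total_tokens, added_delay_s


# One conversational turn versus a hypothetical 300-call agentic workflow,
# each call carrying 800 tokens and 5 ms of avoidable extra latency.
chat_tokens, chat_delay = agentic_overhead(1, 800, 5.0)
agent_tokens, agent_delay = agentic_overhead(300, 800, 5.0)

print(f"conversational: {chat_tokens} tokens, +{chat_delay:.3f} s")
print(f"agentic: {agent_tokens} tokens, +{agent_delay:.3f} s")
```

A few milliseconds that would go unnoticed in a single chat turn become a visible delay once they are paid on every one of hundreds of sequential calls.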

19:48And then finally, coming back to the Vera Rubin platform, now that we've sort of described the agentic workload, we can clearly see why the co-design is required. To accelerate sort of the LLM itself, or the reasoning, the AI itself, you need the Rubin GPU, and you need the Groq 3 LPX solution to deliver that ultra-low latency. You need Vera CPU because it's going to do all this tool calling or, you know, sandboxing for code generation and code testing.

20:22You need the CMX platform that we've talked about, which is the BlueField DPUs together with Spectrum-X, which allow for the KV cache or the short-term memory, as we discussed, to be offloaded when needed so that it can be retrieved when required for a match with an incoming request. And so that's sort of another example of co-design where being able to develop all of these from the ground up helps a lot.

20:54Right, right. So we talked about extreme co-design and you mentioned all the different pieces that go into it, building, designing and building from the ground up. Software is a part of that, but maybe you could double-click, Shruti, into how software plays a role and how important software really is. Yes, absolutely. So software actually is the difference between what you get in the real world, the delivered token output and the actual token cost, versus what you see on a spec sheet.

21:25Software makes all the difference, right? All the things on the spec sheet, the system design, cannot be fully realized unless you have software that makes use of it and delivers really good output. And the other important thing about software is that it cannot be piecemeal optimization. You need to have a robust software stack that can turn on, enable every single optimization.

21:55So that can do, say, NVFP4 quantization. It can also do MTP or speculative decoding. It can also do disaggregated serving. It can do the wide expert parallel, the KV cache offloading, the KV-aware routing, and on and on and on, right? To be able to stack all of those optimizations together is really important because that is what gets you the 50x. Right. The 50x more throughput that we see with Blackwell and the 35x lower token cost. And so software is a huge, huge piece of that story for sure.

22:30The other thing about software is that it never stops. Open source software especially, it never stops. And it's not just the NVIDIA team that is building the software, it's the entire ecosystem, right? It's all the OSS frameworks, all of our partners, customers, the developer community, and every small optimization that they do, that's a drop that just keeps adding and adding to this massive ocean of advantage that is the NVIDIA ecosystem. And so just as an example, on both vLLM and SGLang, which are these inference runtimes,

23:06we've seen 8x more performance in just about six months. Wow. And that's huge. Yeah. Because from the same infrastructure footprint, you're getting so much more token output, and that's driving down your token cost as well. So absolutely, software is a huge, huge piece of the puzzle. So we're through three of the four pillars.

Token Monetization

23:28The fourth one, perhaps the big one, monetization. How do you talk about monetization? How does a business leader think about, okay, so I understand the importance of extreme co-design. I understand the different value of tokens in different situations and different tasks. Agentic is wonderful, requires more tokens, all the things you've elucidated here. How does a business leader think about monetization of the tokens? Right. So when it comes to monetizing tokens, there are various different ways in which, you know, you can go to market.

24:04But one of the best proxies is to just think through it as you're generating tokens, and then you're selling the tokens. And so when you think about selling the tokens, how much do you sell them for? And it's sort of a classic exercise in figuring out your pricing, which is, A, you need to think about what is the cost to produce the token, which is this lowest token cost that NVIDIA is helping with, right? But you do need to understand what is your token utility. And given that token utility and token value, what is your cost to produce the token?

24:38And you obviously want to charge more than that, right? So there's that. Okay. So that's cost-based pricing. And then you also obviously have to think through value-based pricing, which is essentially how much is the willingness to pay? What is this sort of token utility? How valuable is it to the people who are going to pay for it? And so you do need to take that into account. And then finally, before you think through the pricing, you also need to think through what is the demand distribution?

25:08Because ultimately, there are revenue goals and profit margin goals that you are working towards. And so to land at a place that you like, you do need to think through where will your sort of bulk demand be? And how will the demand taper off? Say, for tokens that do not have as much utility, there may not be many takers. But in the same way, for tokens that are highly valuable, there will be fewer people who are willing to pay the premium for that.

25:42You do need to account for that sort of demand distribution. And then with those three things, you can figure out what the pricing for each token can be and then deploy AI successfully. So the key thing here, though, is that, you know, pricing the tokens, again, is obviously just one proxy. There will be customers who are building value-added services on top of those tokens. Sure. So like customers building AI-native products or something like that.
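The three pricing inputs described above (cost to produce, willingness to pay, and demand distribution) can be combined in a simple sketch. Everything here is a hypothetical illustration: the tier names, the markup, and the numbers are made up, and demand is treated as fixed per tier rather than as a function of price, which a real pricing exercise would need to model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TokenTier:
    """Hypothetical token offering, with prices per million tokens."""
    name: str
    cost_per_million: float          # cost-based input: what it costs to produce
    willingness_to_pay: float        # value-based input: what buyers will pay
    expected_demand_millions: float  # demand distribution: volume at this tier


def price_band(tier: TokenTier, min_markup: float) -> Optional[Tuple[float, float]]:
    """Return (floor, ceiling) for a viable price, or None if there is no room."""
    floor = tier.cost_per_million * (1.0 + min_markup)  # cost-based floor
    ceiling = tier.willingness_to_pay                   # value-based ceiling
    return (floor, ceiling) if floor <= ceiling else None


def expected_profit(tier: TokenTier, price: float) -> float:
    """Profit at a given price, assuming demand does not react to price."""
    return (price - tier.cost_per_million) * tier.expected_demand_millions


tiers = [
    TokenTier("basic", cost_per_million=1.0, willingness_to_pay=3.0,
              expected_demand_millions=100.0),
    TokenTier("premium", cost_per_million=2.0, willingness_to_pay=2.5,
              expected_demand_millions=10.0),
]
for tier in tiers:
    print(tier.name, price_band(tier, min_markup=0.5))
```

In this made-up example the premium tier has no viable price at a 50% markup, which is the kind of signal that would send you back to lowering the cost per token or rethinking the offering.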

26:14And then in that case, the process is similar, but you do need to think through then what is the additional value you are adding on top of just generating those tokens. So going back, Shruti, to something I was thinking about when you were explaining extreme co-design, when you get to a point, kind of a sweet spot where you've, you know, your infrastructure is humming and the cost per token is low, does that mean that ultimately you won't need as many GPUs to produce the number of tokens that you really need?

26:48Or what happens in that kind of a scenario? Right. This is a great question. And what we see here is the classic Jevons paradox, which is essentially, you know, you would think that, okay, the GPUs are way more productive. They're, you know, generating so many more tokens. Do you need less of them? And the answer is absolutely no. And the reason is, as you see the efficiency, new use cases get unlocked. And people just figure out, we have all this, you know, thriving research community, data scientists, ML engineers, they just figure out how to just use up that efficiency, how to absorb that efficiency and do more with it.

27:33Right. People aren't going to run away from intelligence. They want to use it. That's right. And if you look at the sort of macro pattern that we've seen so far, it's very telling. So when generative AI became a thing and people were, you know, sort of generating summaries and images, that was great. Then we lowered the cost per token and instead of, you know, needing less GPUs, they needed more GPUs and more tokens. Why? Because test time scaling and reasoning. And so our researchers figured out that by scaling at test time, we can generate better, accurate, more intelligent responses.

28:11And that was valuable for the use cases. And so that happened. And it just, it didn't just happen once. We are seeing that again now with agentic. Now that we've, you know, sort of figured out how to deploy these mixture of experts models, reasoning models efficiently and lowered the cost per token for those significantly. Now here comes another inflection point where it's like, hey, we've got more tokens. Let's do more with them. Of course. And so that's where the agentic revolution is happening. And so definitely it's Jevons paradox in action at the macro level.

28:45And I've also seen this play out at sort of, you know, individual customers. Right. So that's a great question. So Shruti, can we kind of ground this in some examples of how businesses, organizations that you've been working with are putting all of this into action and really, you know, extracting value from the tokens and using them to build? Yeah, absolutely. So when you think about taking tokens and turning it into business value, there's four primary, I would say, business models, so to speak.

29:17Number one is what we just discussed, which is selling tokens directly. And a lot of NVIDIA customers and partners are doing this. And the examples are Fireworks and Baseten, Together AI, DeepInfra. There's just so many of them. And all of them are helping their end customers build valuable services on top of the tokens that they are selling. So that's number one. Number two is AI native companies who are building products from the ground up with AI in it, you know, sort of permeating through it from day one.

29:53And those are customers like Perplexity or Cursor who have a coding engine and, you know, many, many, many, many others. So that's sort of the second model. Third is you might use AI to enhance your existing products and infuse AI through your existing products. And again, lots of different examples. We have Shopify, we have Airbnb, there is Adobe. In fact, a lot of them are doing both. They are building AI native capabilities, but they're also using AI to improve their existing products.

30:28For example, Adobe is, you know, they've built their Firefly family of models. And then they're using those models to infuse new capabilities into Photoshop, for example. And then the final bucket is pretty much every organization today, which is trying to improve their internal operations, their internal processes, improve employee productivity by deploying AI. So they are not necessarily sort of deploying external customer-facing products or services, but these are internal to their own operations.

31:01And again, NVIDIA is working pretty much with everyone across the board on something like that as well. So those are the four key ways. I'm sure that there are others that are more nuanced that I missed, but that's a useful framework to think about, how to take the tokens and turn them into business value. Right. So for the business leader who's listening and comes away from this with a better understanding of the pillars and how they relate and really what tokenomics, right?

31:33What the cost of a token is and how you ascribe value and everything. How do they get started putting this into practice? What advice would you leave them with for thinking about how to put this into action in their own organizations? I think the best place to start is to first just think through what is the final outcome. And usually that starts with your customers, whether they are external customers or your own internal employees and, you know, internal processes that's immaterial. You have to start back from the customer need, from the use case, because as we discussed, the use case actually dictates a whole lot.

32:10The user and the use case dictate what type of model you will use. They dictate what type of context lengths you might need to support. They dictate what type of interactivity you will need. So the intelligence and interactivity. And then those factors are what dictate what type of infrastructure you need. Right, right. And then, of course, we walk through the key metrics such as cost per token when making those infrastructure decisions. And then, you know, that's the supply, you know, so you essentially walk back from token utility and token demand.

32:44Think through token supply. And then once you have a handle on all three of those, you think through your sort of monetization strategy and then go to market and then you fly. Customer first. Work back from that. Yeah. Easy enough. Shruti, thank you so much for taking the time to join the podcast and really break down tokenomics in a way that I think listeners, viewers can extract so much value from. And to talk about extracting value, it's a really comprehensive, but yet really easy to follow and understand, you know, start to finish how this all comes together.

33:18So thank you again. Yeah. Thank you for having me. Thank you.
