The AI Podcast with Fexingo: Artificial Intelligence, Machine Learning, and Modern AI Models

Why AI Inference Costs Are Crushing Software Margins

June 7, 2026 · 7 min · 1,167 words

Show notes

Lucas and Luna dig into the growing gap between training and inference economics in AI. With NVIDIA down 8.5% in a week and Broadcom off 16%, they unpack why inference costs, not training, are now the real bottleneck for software companies. Using Notion's recent Anthropic outage and OpenAI's Lockdown Mode as jumping-off points, they explore how rising compute spend is reshaping product margins, enterprise pricing, and stock valuations. A fresh look at the hidden cost of running AI at scale.

Keep every episode free: buymeacoffee.com/fexingo

Highlighted moments

“Software has traditionally had amazing unit economics—near-zero marginal cost per user. AI inverts that. Every successful interaction costs you compute.”

Transcript

0:00 Lucas: So NVIDIA is down eight and a half percent this week. Broadcom off sixteen. And this is after a year where everyone was convinced that AI hardware was the safest bet on the board.

Luna: Right, and it's not just them. AMD down eight and a half, Super Micro off eleven, Arm Holdings down sixteen. The entire semiconductor complex is getting hammered.

Lucas: But here's what's interesting. A lot of the coverage frames this as 'AI hype fading' or 'capex slowdown.' I think that misses the real story entirely.

Luna: What's the real story?

Lucas: It's about inference costs. Training a model is expensive; everyone knows that. But deploying that model, running it millions of times a day for customers, that's where the economics get brutal. And I think the market is starting to price that in.

Luna: You mean the cost of actually using AI in production, not just building it.

Lucas: Exactly. There's a great example from this weekend. Notion had a service disruption that cut off access to Anthropic's models for a few hours. And the immediate reaction was, oh, another outage. But look at what it says about dependency. Notion is a note-taking app. They're using Anthropic to power some AI features. If that goes down, their product is partially broken. And they're paying per token, so every query eats into their margin.

Luna: Yeah, and Notion isn't alone. Every SaaS product that's bolted on an AI chatbot or a copilot feature is facing the same math. The more users engage with the AI, the higher the variable cost.

Lucas: That's the core tension. Software has traditionally had amazing unit economics: near-zero marginal cost per user. AI inverts that. Every successful interaction costs you compute. And if your product is good and people use it a lot, your costs go up, not down.

Luna: Which is why we're seeing this weird dynamic where some companies are actually discouraging heavy AI usage. Or capping it. Or charging extra.

Lucas: Right. And it's not just startups. Microsoft spent billions on AI infrastructure, and their stock is down nine and a half percent this week. The market is looking at these companies and asking, where are the margins?

Luna: Let's talk about the hardware side. NVIDIA is still the default choice for inference, but their GPUs are expensive. If you're a company running inference at scale, your GPU bill is your biggest line item after headcount.

Luna: So what does this mean for investors? The stocks getting crushed this week, NVIDIA, Broadcom, AMD, they're all levered to that compute spend. But maybe the sell-off is rational if the addressable market for inference isn't as profitable as people thought.

Lucas: I think there's a nuance. Training spend is lumpy: one big order, then nothing for a while. Inference is recurring. So in theory, it should be more valuable. But the problem is that inference pricing is under constant pressure. OpenAI keeps cutting API prices. Google follows. And if you're a hyperscaler selling compute, you're competing with your own customers who are trying to build cheaper inferencing solutions.

Luna: There's also the security angle. OpenAI just announced Lockdown Mode to protect against prompt injection attacks. That's interesting because it shows that even as they push for more usage, they're worried about the risks of running models in production.

Lucas: And that adds another layer of cost. You need guardrails, monitoring, red-teaming. All of that eats margin. So you have this triple squeeze: compute costs, security costs, and pricing pressure.

Luna: Which brings us back to the stock sell-off. If the market is correctly pricing in lower margins across the AI stack, then maybe these multiples deserve to compress.

Lucas: I think that's exactly what's happening. And it's not just a hardware story. Software companies that built their AI features on thin margins are going to have to rethink pricing. Or they'll get squeezed.
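Editor's sketch: the "costs go up with usage" point from the first half of the conversation can be made concrete with a little arithmetic. The Python below is a minimal illustration, not anything the hosts presented. The function name and every number (seat price, tokens per query, price per million tokens) are hypothetical assumptions chosen only to show the shape of the curve.

```python
def monthly_gross_margin(price_per_seat, queries_per_user, tokens_per_query,
                         cost_per_million_tokens, other_cost_per_user=0.0):
    """Return (inference_cost, gross_margin_fraction) for one user in one month.

    All inputs are hypothetical; the model assumes the vendor pays a flat
    per-token price and charges the user a flat seat price.
    """
    tokens = queries_per_user * tokens_per_query
    inference_cost = tokens * cost_per_million_tokens / 1_000_000
    margin = (price_per_seat - other_cost_per_user - inference_cost) / price_per_seat
    return inference_cost, margin


# Hypothetical inputs: $10 seat, 2,000 tokens per query, $5 per million tokens.
for queries in (50, 500, 2000):
    cost, margin = monthly_gross_margin(10.0, queries, 2_000, 5.0)
    print(f"{queries:>5} queries/month: inference ${cost:.2f}, gross margin {margin:.0%}")
```

Under these assumed numbers the light user is highly profitable, the moderate user halves the margin, and the heavy user costs more than the seat brings in, which is the inversion of traditional software unit economics described above.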
Luna: Quick honest thing: we do this show ad-free, and it stays that way because a handful of listeners chip in through buy me a coffee dot com slash fexingo. That support literally funds the research and production time for episodes like this.

Lucas: Yeah, it's a small group, but it makes a huge difference. Keeps us independent and lets us dig into topics like this without worrying about ad reads.

Luna: And we really appreciate it. Okay, back to inference economics. Lucas, you were talking about the triple squeeze. Let's get into what companies can actually do about it.

Lucas: One lever is model optimization. Quantization, pruning, distillation: making the model smaller so it runs faster and cheaper. We did an episode on distillation a few weeks back. That's becoming a core competency for any serious AI team.

Luna: But distillation has its own costs. You need a good teacher model, which is expensive to run, and you need compute for the student training. It's not free.

Lucas: No, but it's a one-time cost versus a recurring one. If you can cut your inference cost by fifty percent with distillation, that pays for itself pretty fast if your usage is high.

Luna: Another approach is batching. Instead of running one query at a time, you batch multiple requests together and process them as a group. That improves GPU utilization and lowers cost per query.

Lucas: Right, but batching adds latency. If you're building a real-time chatbot, users expect instant responses. So there's a trade-off between cost and user experience.

Luna: And that's why we're seeing a split in the market. Some companies are building low-latency, expensive inference for premium products. Others are optimizing for cost and accepting slower responses.

Lucas: The interesting thing is that the hyperscalers are trying to solve this with hardware. Google's TPUs, Amazon's Trainium, Microsoft's custom chips with OpenAI. They want to own the inference stack end to end.

Luna: But that's a huge capital bet. And if inference margins are compressed, it's not clear those investments will pay off the way training investments did.

Lucas: Exactly. So you have this paradox: everyone wants AI to be ubiquitous, but ubiquitous AI requires cheap inference, and cheap inference requires massive scale and optimization that might not be profitable.

Luna: Unless you're the one selling the picks and shovels. NVIDIA makes money whether you train or infer, as long as you buy their chips. But even they're not immune to the sell-off.

Lucas: Because the market is forward-looking. If inference demand grows but prices fall faster, the revenue pie might not expand as much as expected. And that's what we're seeing priced in this week.

Luna: So the takeaway for listeners: if you're evaluating AI companies, look at their inference cost structure. Ask how their costs scale with usage. And be skeptical of any company that claims AI will be a margin expansion story without showing the math.

Lucas: That's the real bottom line. The AI revolution isn't cancelled; it's just hitting the messy reality of unit economics. And for investors, the companies that figure out inference efficiency will be the long-term winners.
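Editor's sketch: two of the levers discussed in the transcript, distillation as a one-time cost against a recurring saving, and batching as a cost versus latency trade-off, reduce to simple arithmetic. The Python below is a rough illustration under stated assumptions, not a description of any vendor's system. The function names, the linear batch-time model (a fixed overhead plus a per-request cost), and all dollar and timing figures are hypothetical.

```python
def distillation_breakeven_queries(one_time_cost, cost_per_query, savings_fraction):
    """Queries needed before a one-time distillation cost is repaid.

    savings_fraction is the assumed cut in per-query inference cost,
    e.g. 0.5 for the "fifty percent" case mentioned in the transcript.
    """
    saved_per_query = cost_per_query * savings_fraction
    if saved_per_query <= 0:
        raise ValueError("cost_per_query and savings_fraction must be positive")
    return one_time_cost / saved_per_query


def batched_cost_and_latency(gpu_cost_per_hour, batch_size, arrival_rate_qps,
                             overhead_s=0.05, per_request_s=0.005):
    """Return (cost_per_query, worst_case_latency_s) for a fixed batch size.

    Assumes batch time = overhead_s + per_request_s * batch_size, and that the
    first request in a batch waits for the remaining requests to arrive.
    """
    batch_time_s = overhead_s + per_request_s * batch_size
    cost_per_query = gpu_cost_per_hour / 3600 * batch_time_s / batch_size
    fill_wait_s = (batch_size - 1) / arrival_rate_qps
    return cost_per_query, fill_wait_s + batch_time_s


# Hypothetical: $200k distillation project, $0.01 per query, 50% saving.
print(f"break-even: {distillation_breakeven_queries(200_000, 0.01, 0.5):,.0f} queries")

# Hypothetical: $4/hour GPU, 20 requests per second arriving.
for size in (1, 4, 16):
    cost, latency = batched_cost_and_latency(4.0, size, 20.0)
    print(f"batch {size:>2}: ${cost:.6f} per query, up to {latency:.2f}s latency")
```

In this toy model, larger batches spread the fixed overhead over more requests, so cost per query falls while worst-case latency rises. That is the split the hosts describe between low-latency premium products and cost-optimized ones.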
