Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

May 13, 2026 · 23 min · 3,767 words


Show notes

Snap processes more than 10 petabytes of experimentation data every single morning, and with NVIDIA GPU-accelerated Apache Spark on Google Cloud, Snap cut job costs by 76%, reduced memory usage by 80%, and eliminated 120 terabytes of disk spill from its pipelines. Prudhvi Vatala, head of engineering platforms at Snap, joins the NVIDIA AI Podcast to break down how he and his team completely modernized data infrastructure for a social platform serving nearly a billion monthly active users, using the NVIDIA cuDF plugin (formerly referred to as the NVIDIA RAPIDS plugin) for Apache Spark on Google Kubernetes Engine, with zero application code changes.

🔬 Topics covered:
How Snap runs A/B tests at planetary scale using rigorous statistical methods like heterogeneous treatment effect detection and variance reduction
Why Snap reuses idle inference GPUs between 1–5 a.m. for batch data processing, and how it built a Kubernetes-based platform to do it
How NVIDIA cuDF delivered 3x+ speedups on join-heavy Spark jobs with no code rewrites
The full business impact: 76% cost reduction, 62% fewer cores, 80% less memory, 120 TB of spill eliminated
How a three-way partnership between Snap, NVIDIA, and Google Cloud made it possible in just 8–9 months

Chapters:
0:00 Introduction and Snap overview
3:35 What is Snap’s experimentation platform?
4:05 Why experimentation, safety, and privacy are core at Snap
4:52 How A/B testing works at billion-user scale
8:14 Discovering NVIDIA cuDF plugin
9:06 Benchmarking results: join, union, and aggregation jobs
12:00 Reusing idle GPUs overnight via GKE
13:24 Building a bottom-up GPU data platform at Snap
17:48 Results: 76% cost reduction and partnership impact
20:56 Snap’s evolution and what’s next

Learn more:
NVIDIA cuDF: https://developer.nvidia.com/topics/ai/data-science/cuda-x-data-science-libraries/cudf#accel-apache

Highlighted moments

“We were able to cut almost about 76% of our job costs as a result of this migration. 76? 76.”

Jump to 0:00 in the transcript

“with Spark RAPIDS, I want to mention it, we didn't have to change a single thing about how we ran the jobs.”

Jump to 10:47 in the transcript

“when some of our biggest markets went to bed, a lot of our online inference GPU capacity was sitting idle.”

Jump to 12:28 in the transcript

“we had to figure out how to gracefully fall back from GPUs to CPUs, right? And then, if the shared GKE resources itself was the constraint, then we had to gracefully fall back from CPUs to Dataproc clusters.”

Jump to 15:24 in the transcript

Transcript

Introduction to Snap

0:00 We were able to cut almost about 76% of our job costs as a result of this migration. 76? 76. It's phenomenal. I mean, for the engineers out there, we were able to cut down the number of cores required by like 62%. Amazing. The memory footprint, we could drop it by like 80%. So phenomenal results. The results speak for themselves.

0:25 Welcome to the NVIDIA AI Podcast. I'm Noah Kravitz. I'm here with Prudhvi Vatala. Prudhvi is the head of engineering platforms at Snap, and we're here to talk about data processing, and in particular, how a social platform with more than 940 million active users accelerated their data pipeline. Prudhvi, welcome to the NVIDIA AI Podcast. Thanks so much for taking the time to join us. Yeah, thanks for having me here, Noah. So maybe we can start with the basics.

Snap Overview

0:53 Tell us a little bit about, well, about what Snap is now. I'm old, but I still think of it, you know, the Snap glasses and everything, but Snapchat, obviously, a huge social platform. So maybe tell us a little bit about Snap and then your role there. Absolutely. Yeah. I mean, Snapchat at this point is pretty much a household name. Snap as a company, it's interesting that you bring up the Spectacles because Snap as a company believes that the camera is at the center of, you know, improving how people communicate and improve their lives, you know, in the digital world, so to speak.

1:27 So we've been steadfast on that belief, and, you know, Snap right now is at the intersection of augmented reality, AI, and visual communication. Like you said, serving close to a billion monthly active users. I've been at Snap for a while now, and I lead a multifaceted organization. A little bit of it has to do with big data infrastructure, a little bit of it with developer productivity, and a little bit of it with enterprise AI and whatnot.

2:04 So, yeah.

Accelerating Data Processing

2:05 And so when we talk about accelerating data processing, what does that mean to you? What does that mean for Snap? And thinking about the scale that you operate on, just talk a little bit about what it means to accelerate data at that level. Absolutely. That's a great question. Like, as you can imagine, with as many users as we have, and Snap, Snapchat in particular, is a very complex application. So you can imagine the scale at which we operate, especially on the data processing side.

2:36 My team's experimentation platform is dealing with 10-plus petabytes each day. It's a massive scale, right? It's a huge scale, yeah. And then we have a strict SLA in the morning because experimentation results need to be ready for developers, product managers, data scientists to act on as early as possible so that, you know, they can take appropriate action. So for us, accelerating data processing basically means instead of throwing more and more CPUs at the problem, figuring out a way to flatten that scale curve, you know?

3:09 So in this particular scenario, it was about figuring out how to leverage GPUs for improving our workloads, making sure they run faster, cheaper, and scale, you know, linearly or sublinearly, unlike, you know, right now, it's definitely superlinear with feature areas. So that's what accelerating means. So you mentioned experimentation. What does that mean? When you're conducting experiments at Snap, what does that look like?

3:40 And then maybe how does that fit into, is that where the 10 petabytes of data each morning comes from? Or we can talk about that. Yeah, absolutely. So this 10 petabytes of data is only about the experimentation platform. The big data across Snap is far wider. Sure. So experimentation, it's a little bit about Snap's product philosophy. We believe that experimentation, safety, and privacy are core pillars for our product development and iteration.

4:14 Like, when we are thinking about new product areas, when we are shipping new product features to our, you know, half a billion daily active users across the globe, we need to think about how the users are receiving it, how they're responding to it, how they're using it, whether or not this is adding value to their, you know, daily lives. And also guardrailing things, like, is it regressing their performance, you know, is it causing their devices to slow down or, you know, we need to be very particular about protecting their experiences as well.

4:50 And so, Prudhvi, along those lines with the experimentation, can you talk a little bit about the importance of A/B testing? So, A/B testing is, you know, the concept of randomized controlled trials has been around for a long time, you know, especially in the clinical fields and whatnot. But with the digital revolution, it has become the mode of bringing statistical rigor to decision making at scale, right? So, that's what A/B testing adds to us, like, you know, when we are dealing with this massive user base that is diverse by nature, you know, from all walks of life, across the globe, you know, and we are trying to delight them, we are trying to bring experiences to them.

5:35 We need to make sure, you know, what we are delivering is buttoned down, like it's actually really adding value the way we think it is, right? And at this scale, a lot of things, you know, can happen. And that's where having the statistical rigor grounded in, you know, holdouts and, you know, well-defined controls and statistical methods comes in. Like, over the years, my team has added a bunch of statistical methods to our platform, you know, heterogeneous treatment effects detection.

6:12 For example, you know, you may think that a feature is performing well for the global audience, but it may not perform so well for, like, a subset. Right. So, figuring out those heterogeneous effects is one thing that we focus on. And, you know, at this scale, no matter how you slice your experiments, you are still allowing some bias to seep in. As in, you know, some power users may end up on one side of the experiment rather than the other.
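The heterogeneous-effects point can be illustrated with a small sketch. This is not Snap's method, which the episode does not describe; it is a minimal per-segment difference in means, and the segment names, the `effect_by_segment` and `overall_effect` helpers, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def effect_by_segment(rows):
    """Difference in means (treatment - control) for each segment.

    rows: iterable of (segment, arm, metric), arm is "control" or "treatment".
    """
    buckets = defaultdict(lambda: {"control": [], "treatment": []})
    for segment, arm, value in rows:
        buckets[segment][arm].append(value)
    return {
        segment: mean(arms["treatment"]) - mean(arms["control"])
        for segment, arms in buckets.items()
        if arms["treatment"] and arms["control"]  # skip one-sided segments
    }


def overall_effect(rows):
    """Difference in means across all users, ignoring segments."""
    rows = list(rows)
    treated = [v for _, arm, v in rows if arm == "treatment"]
    control = [v for _, arm, v in rows if arm == "control"]
    return mean(treated) - mean(control)


# Hypothetical data: the feature helps new users but hurts power users.
rows = [
    ("new_users", "control", 10.0), ("new_users", "treatment", 14.0),
    ("new_users", "control", 12.0), ("new_users", "treatment", 16.0),
    ("power_users", "control", 30.0), ("power_users", "treatment", 28.0),
    ("power_users", "control", 32.0), ("power_users", "treatment", 30.0),
]

print(overall_effect(rows))     # positive overall
print(effect_by_segment(rows))  # negative for power_users
```

The global number looks like a win while one segment regresses, which is the situation described above. A production system would also need confidence intervals and multiple-comparison control, which this sketch leaves out.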

6:42 So, how do we make sure the distributions are evened out when the experiment results are read? That's the variance reduction aspect. So, that's something my team built over time. And then, you know, sometimes when we ship a feature, if people don't like it, they might even just stop showing up, you know? Right. Right. That's the sample size mismatch problem. So, we also do a bunch of that rigorously. So, that's what A/B testing brings to the table. So, with all of the data processing every day, what made you think that maybe some NVIDIA tech put into the stack might help things out?
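The episode does not say which variance reduction method Snap uses. One widely used technique is CUPED, which adjusts each user's metric with a pre-experiment covariate; the sketch below assumes that technique purely for illustration, and the `cuped_adjust` helper and the data are hypothetical.

```python
from statistics import mean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    metric:    in-experiment values (y), one per user
    covariate: pre-experiment values (x) for the same users
    theta is cov(x, y) / var(x), the regression slope of y on x.
    """
    x_bar, y_bar = mean(covariate), mean(metric)
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


# Hypothetical users whose in-experiment activity tracks their prior activity.
pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
post = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

adjusted = cuped_adjust(post, pre)
print(variance(post), variance(adjusted))  # the adjusted variance is smaller
```

The adjustment leaves the mean unchanged and removes the part of the variance explained by prior behavior, so an uneven split of power users distorts the comparison less. In practice theta is usually estimated on data pooled across both arms.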

7:21 How did that process start? And maybe you can talk about, you know, what you've integrated and what you're using. Absolutely. So, I'm really proud of this. I'm really proud of my team because over the years that I've been seeing our platform, the number of users grew, like Snap, you know, ballooned, right, in terms of footprint. The number of features we shipped, like, you know, Spotlight, you know, AR features, AR lenses, and all of the AI features we shipped in the recent past. So, they've also been adding a lot of additional dimensions to the platform.

7:53 And my team was hard at work making sure we're scaling appropriately, even as all of this scale grows. Of course, yeah. And they've done a very good job of it historically for years now, maintaining the cost flat and, you know, performance predictable, meeting the SLAs and whatnot. And one thing we came across, you know, we came across NVIDIA Spark RAPIDS on one of the blog posts. And we saw NVIDIA is shipping this, you know, solution to speed up our PySpark workloads by anywhere from 3.6x performance versus 50%, you know, runtime, you know.

8:36 Okay. It was phenomenal. Yeah. Right. So, that's what drew us to it. I know, I'm waiting to hear them. The numbers sound good. I'm waiting to hear the rest. Yeah. Yeah. So, we read those and we got super excited. Sure. And then our stack was, it still is, entirely Google Cloud for the experimentation platform. We loved working with them. Google Cloud Dataproc was phenomenal. They've been a fantastic partner to us throughout the scaling journey. So, when this. It's great to hear. Yeah.

9:07 And then when this news came out with Spark RAPIDS, we wanted to try it out. We did a bunch of benchmarking. We tried, obviously, or like I said, we do a lot of things. So, there is a lot of complexity to the nature of the jobs we run. So, we had to benchmark each kind of job as well. Like, you know, taking jobs that are heavy with joins and repartitions and, you know, shuffling of data that moves data around versus, you know, jobs that are purely unioning data from

9:39 various places versus, you know, jobs that are purely aggregating, like running sums and whatnot. So, we had to benchmark across all of them. And we noticed that even on Google Dataproc with Spark RAPIDS, we got about, you know, I want to say 3x plus, you know, improvement for the, you know, join jobs and about close to 2x for, you know, the union jobs and a little over 1.5x for aggregations.

10:13 That's largely because CPUs are already good at aggregation. Right. Right, right. So, and then the other thing is GPUs by nature support parallelism and high bandwidth memory on the hardware itself. So, that made it like a very good candidate for us to pursue. And so, you're running your GPU accelerated pipelines on Google Kubernetes, is that right? Yes, yes. That has been a very interesting journey from, you know, testing out our pipelines with Dataproc GPUs to today.

10:47 And one other thing, like with Spark RAPIDS, I want to mention it, we didn't have to change a single thing about how we ran the jobs. That was the beauty of it. Not at all. Zero code changes. Oh, that's amazing. Zero code changes. So, I'm into developer productivity and developer enablement. So, for me, that was music to my ears. Sure. Of course. So, that was very impressive. So, with Dataproc, which abstracts out the Spark runtime for us, and Spark RAPIDS, which didn't require us to change the jobs, it was phenomenal. Yeah. Amazing. So, it went very well.
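"Zero code changes" is plausible because the RAPIDS Accelerator is switched on through Spark configuration instead of through edits to the job. The sketch below builds such a configuration. The `spark.plugins`, `spark.rapids.sql.enabled`, and GPU resource keys are real Spark and RAPIDS Accelerator settings, but the helper names and every value are illustrative assumptions, not Snap's setup. The accelerator jar also has to be on the classpath, which is omitted here.

```python
def with_gpu_acceleration(base_conf, gpus_per_executor=1, tasks_per_gpu=4):
    """Return a copy of base_conf with the RAPIDS Accelerator enabled.

    The job's own code is untouched; only launch-time settings change.
    Values are placeholders, not tuned recommendations.
    """
    conf = dict(base_conf)
    conf.update({
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fractional amount lets several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    })
    return conf


def to_submit_args(conf, app):
    """Render a conf dict as a spark-submit argument list."""
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args + [app]


cpu_conf = {"spark.executor.cores": "8"}  # hypothetical existing settings
gpu_conf = with_gpu_acceleration(cpu_conf)
print(" ".join(to_submit_args(gpu_conf, "job.py")))  # job.py is a placeholder
```

Keeping the CPU and GPU settings as two dictionaries built from the same base also makes it straightforward to launch the same job either way.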

11:17 So, we wanted to productionize this. At our scale, pipelines aren't just monolithic, right? We do a bunch of sharding and then, you know, batching of work. So, we were able to migrate one shard to production on Google Dataproc using 300 GPUs. The results were phenomenal. Yeah. And then, in the next phase, we wanted to migrate 10 shards of a total, you know, 50-plus shard architecture. And then, it needed about 3,000 GPUs, which was still doable with Dataproc on-demand GPUs.
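The shard figures quoted here imply a rough ceiling for the full migration. The sketch below extrapolates linearly from 300 GPUs per shard; the linear scaling and the use of exactly 50 shards (the episode says "50 plus") are assumptions made for illustration, and the full-pipeline figure is not stated in the episode.

```python
def gpus_needed(shards, gpus_per_shard=300):
    """Linear extrapolation from the one-shard figure quoted in the episode."""
    return shards * gpus_per_shard


one_shard = gpus_needed(1)
ten_shards = gpus_needed(10)   # matches the "about 3,000 GPUs" quoted
all_shards = gpus_needed(50)   # assumption: exactly 50 shards, linear scaling
print(one_shard, ten_shards, all_shards)
```

Under those assumptions the full pipeline would need roughly five times the GPUs of the ten-shard phase.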

11:50 Okay. Because, you know, GPU capacity is on everybody's minds these days, right? Yeah. So, that was well and good. But then, we didn't have a path forward. Okay. After that, right? So, we kind of hit a roadblock with, you know, on-demand GPU capacity. So, we had to get creative. So, we started looking around. We were like, where at Snap do we have GPU capacity that we can borrow, right? And, you know, that's where the real insight came for us. Like, Snap has a global audience and Snapchatters' behavior is cyclical during the day, right?

12:25 People wake up, they use Snapchat. When they go to bed, they don't, right? So, what that meant was when some of our biggest markets went to bed, a lot of our online inference GPU capacity was sitting idle. Yeah, yeah, yeah. Somewhere between 1 a.m. and 5 a.m., so to speak, you know? Yeah. So, that was our opening, our opportunity to go tackle. And that brought about its own set of complexity, right? Because the online serving stack is not built for batch data processing.

12:55 They were considered fundamentally different worlds, right? So, all the online GPUs were tied to Kubernetes and GKE. And we were already on Google Cloud. GKE wasn't an issue for us at all. It was actually very welcome. So, we had to migrate our workloads to a Kubernetes-based Spark runtime and host it on GKE so that we can leverage, you know, what the online GPUs had to offer. And for that, we had to actually build a data platform ground up.

13:30 Okay. You know, because it's one thing for my team to just use this idle capacity. But at Snap, we wanted to make sure even as the online need for GPUs increased, as our AI footprint increased, we should still have any team at Snap be able to leverage that capacity for any of their needs. Sure. Yeah. As available. And then we had to also acknowledge that if a user wanted to see fresh Spotlight content, it supersedes the GPU need for experimentation. Yeah. So, you know, preemption had to be built in.

14:01 Yes. Yeah. So, if we had a sudden spike in traffic, we had to give up GPU capacity. So, with all of that in mind, we built out a platform ground up. Okay. And then we started migrating. And that's the, and we had a lot of blockers along the way, and the team got really creative. Yeah. It was a phenomenal journey. Amazing. Yeah, yeah, yeah. And so, you're also running an accelerated Apache Spark pipeline? Yes. Yes. So, a lot of our pipelines, at a high level, our pipelines are split into daily and hourly cadence.
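The preemption rule described above (online serving always outranks batch work) can be sketched as a simple capacity calculation. This is not Snap's scheduler, whose design the episode does not detail; the function names, the 10% headroom, and the pool sizes are hypothetical.

```python
def batch_gpu_allowance(total_gpus, online_demand, headroom_pct=10):
    """GPUs that batch jobs may borrow after reserving online demand plus headroom.

    Integer arithmetic (ceiling division) avoids floating-point rounding.
    """
    reserved = (online_demand * (100 + headroom_pct) + 99) // 100
    return total_gpus - min(total_gpus, reserved)


def gpus_to_preempt(batch_in_use, allowance):
    """How many borrowed GPUs batch jobs must give back right now."""
    return max(0, batch_in_use - allowance)


# Hypothetical pool of 1,000 inference GPUs.
night_allowance = batch_gpu_allowance(1000, online_demand=200)
spike_allowance = batch_gpu_allowance(1000, online_demand=900)

print(night_allowance)                                   # borrowable overnight
print(gpus_to_preempt(night_allowance, spike_allowance)) # returned on a spike
```

When online demand jumps, the allowance shrinks and the batch side has to release the difference, which is why the jobs also need somewhere else to run.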

14:41 So, hourly is mostly for guardrailing, like I said. Like, you know, we don't want to break users' experience no matter what. And having that hourly feedback cycle goes a long way in doing that. And then we also have daily pipelines, which serve as the statistical authority for decision-making. So, our first migration to GKE plus NVIDIA Spark RAPIDS was the hourly pipeline. Because, you know, speed mattered there far more, right? So, we migrated. And then we migrated and operationalized it.

15:14 And during that process, we ran into a few corner cases. You know, if the GPU capacity wasn't available at, like, 11 a.m., when everybody was active on Snap, right? What do we do? So, we had to figure out how to gracefully fall back from GPUs to CPUs, right? And then, if the shared GKE resources itself was the constraint, then we had to gracefully fall back from CPUs to Dataproc clusters. So, building all of that with operational reliability in mind was also great.
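The fallback order described here, GPUs on GKE, then CPUs on GKE, then Dataproc, can be sketched as a small decision function. The `Capacity` type, the target labels, and the thresholds are hypothetical; the episode gives the order of the fallback but not its implementation.

```python
from dataclasses import dataclass


@dataclass
class Capacity:
    """Free resources on the shared GKE cluster (hypothetical snapshot)."""
    gke_gpus_free: int
    gke_cpus_free: int


def choose_target(capacity, gpus_needed, cpus_needed):
    """Pick where to run a job, preferring GPUs, then GKE CPUs, then Dataproc."""
    if capacity.gke_gpus_free >= gpus_needed:
        return "gke-gpu"
    if capacity.gke_cpus_free >= cpus_needed:
        return "gke-cpu"
    return "dataproc"  # last resort, outside the shared cluster


overnight = Capacity(gke_gpus_free=400, gke_cpus_free=5000)
midday = Capacity(gke_gpus_free=0, gke_cpus_free=5000)
crowded = Capacity(gke_gpus_free=0, gke_cpus_free=100)

for snapshot in (overnight, midday, crowded):
    print(choose_target(snapshot, gpus_needed=300, cpus_needed=2000))
```

Each target would normally carry its own Spark settings, since the same job runs on different hardware depending on the branch taken.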

15:47 Yeah. Looking back on it, what learnings would you, you know, if there's a listener out there who's embarking on a similar project or trying to figure out maybe there's a, you know, like you said, kind of a daily cycle of when the GPUs are in use for inference and when they're not. And they're thinking about, you know, borrowing GPUs from other parts of the company. Learnings you would share from this whole process?

Learnings and Takeaways

16:11 Is there a big takeaway? Something that surprised you? Right. Yeah. Right. So, the direction that NVIDIA is headed in is phenomenal for these kinds of needs. You know, NVIDIA Spark RAPIDS, like I said, zero code written. Yeah. Zero code changed. Amazing. To enable it, we had to figure out the image building and environment difference and whatnot. The testing cycles, obviously, any production workload needs to go through that rigorous rollout process. So, everybody needs to pay attention to it.

16:43 But this is a real possibility, you know, the NVIDIA direction. The other thing that NVIDIA offered that really helped us a lot was NVIDIA Aether. It's another solution that gives us Spark tuning out of the box. Because especially when we had this fallback mechanism in place, where we had to go from GPUs to CPUs to Dataproc, the environments are different. The Spark parameters had to be different. So, something like NVIDIA Aether giving us a starting point and making sure the tuning stayed consistent across all of these versions was also very helpful.

17:24 So, you've mentioned, obviously, the work with NVIDIA and Google Cloud as well. Kind of from taking a step back, sort of bigger picture, what are these partnerships and working, you know, hand-in-hand so closely with Google Cloud, with NVIDIA, what is that doing to the way that you and Snap see your roadmaps for both data and AI kind of growing going forward? Yeah. It's, I mean, huge props to the NVIDIA team and the Google Cloud team, honestly. It's been a phenomenal three-way partnership like I've never seen in my career before.

17:57 Oh, amazing. It was phenomenal. And the impact speaks for itself, right? Like, we were able to cut almost about 76% of our job costs as a result of this migration. 76? 76. Wow. It's phenomenal. Yeah. Right? I mean, for the engineers out there, like, we were able to cut down the number of cores required by, like, 62%. Amazing. The memory footprint, we could drop it by, like, 80%. I mean, for the Spark nerds out there, we were able to cut out almost 120 terabytes of disk spill, disk and memory spill from our pipelines.

18:34 Wow. Just vanished once we started doing, you know, all of this. Yeah. So, that is one of the biggest headaches any, you know, data pipeline at scale runs into. So, phenomenal results. The results speak for themselves. So, without the partnership, this would not have been possible in the time scale that it was possible in. Like, migrating a production pipeline with 10 plus petabytes from, you know, prototyping, exploration to full production in a matter of about eight to nine months is phenomenal.

19:08 Right? And without the continuous, you know, back and forth and, you know, knowledge sharing and partnership across these three companies, this wouldn't have been possible. That's great. Yeah, and in terms of the roadmap, it definitely had an impact. Like I said, my team built this bottom-up data platform to enable any team at Snap to leverage the GPU capacity and, you know, what NVIDIA libraries have to offer.

19:39 And we're already seeing movement with it, right? Even my own team started migrating other things that we haven't even tried out so far, experimenting with them, you know, trying out. Because even if we don't have idle capacity to fit all of our workloads all the time, if we can schedule things creatively, if we can move things around, we can maximize the capacity as much as we can. And a lot of other teams are also picking this up. Yeah. That's fantastic. So you've been at Snap for eight years. Is that right? Seven? Close to eight. Okay.

20:09 And Snap's been around for about 15 years. Yes. Give or take? Yes. Working at a huge social media platform over this span of time where social media has just, you know, become such a core part of the fabric of so many people's lives. What's it been like to be at Snap and to see the changes both, you know, I said at the beginning, right, I remember the Spectacles. That's my first thought at Snap. And obviously now Snapchat, you know, same lineage, same philosophy, different product, obviously, right?

20:43 But what's it like to just have seen the evolution of social media and then also so many technological changes that impact, you know, what you're able to do and how you do it, as you were just describing? What's it been like from the inside? Yeah, it's been an unbelievable experience, Noah. Like, that's what gets me up in the morning every day, you know? Like, Snap, I mean, in the visual communication, AR, AI landscape, Snap has had a massive impact on the planet.

21:19 Yes. Honestly. And having a direct role to play in it is a great feeling, right? I've seen the company grow from, you know, the camera messaging, you know, picture messaging to what it is today, AR stories, which is something we invented and the whole world, including some newspapers. So the stories as a format.

21:50 And then to your point about Spectacles, we did it before anybody else was even thinking about it, you know? So the company is innovative. We come up with so many new things and running platforms inside means that I have to, you know, figure out a way to enable all of this, even as the company evolves. And having a front row seat to that evolution and playing a big part of it has been very fulfilling. Fantastic. Prudhvi, for listeners, viewers who, there are some out there who haven't used Snapchat before, for anyone who wants to get the experience, but also to learn more about Snap and maybe about some of the technical work that you're doing, are there obviously the website, their social media, is there a research blog, where can people go?

22:36 Absolutely. So we have an engineering blog that's pretty active. We share a lot of phenomenal work that engineers in the company are working on. And, you know, we are also participating in events like this and sharing our knowledge with the world. So, you know, and Snapchat, if you haven't used it, you should definitely give it a try. It's different from social media. I, this is a true story. I got a Snap from my younger son, maybe 45 minutes before we sat down to do this, and it made my day.

23:13 So, absolutely, if you haven't, yeah. Prudhvi Vatala, thank you so much. This has been a great conversation, and I'm sure the developers, the engineers, and the audience hopefully have taken a lot from it. But thank you so much for taking the time to join us, and all the best to you and everybody at Snap to keep changing the world for the better. Thank you so much, Noah. Thanks for having me. Appreciate it.

23:34 Thank you.
