How Synthetic Data Predicts Real Markets

April 8, 2026 · 21 min · 3,680 words

Show notes

This episode explores how synthetic data, artificial information created to mimic real-world statistical patterns, is transforming investment management. It discusses a paper by James Tait published by the CFA Institute Research & Policy Center. While traditional methods like Monte Carlo simulations remain useful, Tait highlights Generative AI techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) for their ability to model complex financial datasets. These technologies help firms overcome obstacles related to data privacy, historical scarcity, and dataset imbalances found in areas like fraud detection. By integrating synthetic information into their workflows, practitioners can improve model training, backtesting, and risk analysis while reducing costs. The referenced paper emphasizes that maintaining data quality through rigorous evaluation is essential as the industry moves toward these sophisticated, AI-driven simulations.

References

Tait, James (July 2025) “Synthetic Data in Investment Management,” CFA Institute Research & Policy Center. https://rpc.cfainstitute.org/sites/default/files/docs/research-reports/tait_syntheticdataininvestmentmanagement_online.pdf

Podcast Disclaimer

This podcast is an independent production and is not affiliated with or endorsed by any third-party entities unless explicitly stated. The content is for educational and informational purposes only and does not constitute financial, investment, legal, or professional advice. Listeners should consult qualified professionals before making any decisions based on this content. This episode is based on the references listed above and was generated using Notebook LM and other AI tools. While I have reviewed the content for accuracy, it may still contain errors, inaccuracies, or omissions. Neither the producers nor any affiliates accept liability for any damages or losses arising from the use or interpretation of this content.

Highlighted moments

“You aren't slicing existing data and you aren't filling a missing cell in a spreadsheet. You're using artificial intelligence to generate completely new data points from scratch.”

“They use algorithms to completely mask the underlying asset names and shift the numerical scales. But they perfectly preserve the mathematical covariance”

“If you only give it historical records, the AI just optimizes for the highest overall accuracy by guessing healthy or safe every single time.”

Transcript

Introduction to Synthetic Data

0:00Imagine betting, like, $1 billion on a market crash that never actually happened. Yeah, just trading on a ghost, basically. A completely fabricated, artificially hallucinated Tuesday where tech stocks just plummet and volatility goes through the roof. Right, but on modern Wall Street, that isn't some glitch or a catastrophic error. Yeah. It's actually the most cutting-edge risk strategy on the planet. Which completely upends everything, right? I mean, we've always treated quantitative investment as this ultimate empirical discipline.

0:31You measure the real market, you model the real market, and you deploy capital based on those hard facts. You do, or, well, you used to. But the reality is the financial industry is increasingly relying on completely fabricated data to make real-world decisions. Yeah, so if you're listening to this and you've ever looked at your portfolio during a sudden market dip and wondered if the algorithm saw it coming, the answer is increasingly, yeah, they didn't just see it coming, they rehearsed it a million times using synthetic data. Which is wild to think about. It really is.

1:02So welcome to today's Deep Dive. We are exploring this really fascinating report from the CFA Institute by James Tait. It's called Synthetic Data in Investment Management.

Synthetic Data in Investment Management

1:12And our mission today is to pull back the curtain on how and why these multi-billion dollar quantitative firms are programming machines to dream up fake markets to predict the real one. Right. And to do that, we first need to separate synthetic data from, like, standard data manipulation. Because in the data science world, firms have always used something called data augmentation. Okay, let's unpack this. What exactly is the difference? Well, data augmentation is simply taking a real data set and rearranging it.

1:43So, say you have a five-year history of a stock's price, you might chop that up into multiple overlapping 30-day rolling windows. So you aren't creating anything new there. Exactly. You're just slicing the dough into different shapes to give a model more angles to look at. Right. Or, like, they use data imputation, which is basically just patching holes in reality. Yeah, imputation is just if a sensor goes offline and you miss an hour of trading volume, you use mathematical averages to estimate and fill in that blank. But synthetic data is different. It's an entirely different beast, yeah.
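A minimal sketch of the two non-synthetic techniques described above, using only the Python standard library. The price and volume figures are made up for illustration, and the helper names `rolling_windows` and `impute_mean` are hypothetical. Note that neither function creates information that was not already in the input.

```python
from statistics import mean


def rolling_windows(prices, window):
    """Augmentation: return every overlapping slice of length `window`."""
    return [prices[i:i + window] for i in range(len(prices) - window + 1)]


def impute_mean(series):
    """Imputation: replace None gaps with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in series]


# Illustrative data only (not from the report).
prices = [100.0, 101.5, 99.8, 102.2, 103.0, 101.1]
windows = rolling_windows(prices, window=3)

volume = [1200, None, 1100, 1300]  # one missing hour of trading volume
filled = impute_mean(volume)

print(len(windows), "windows from", len(prices), "prices")
print(filled)
```

Mean filling is only one simple imputation choice; interpolation or model-based estimates are common alternatives.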

2:14You aren't slicing existing data and you aren't filling a missing cell in a spreadsheet. You're using artificial intelligence to generate completely new data points from scratch. Out of thin air, basically. Basically, yeah. Yeah. But these artificial points perfectly mimic the statistical properties, the deep correlations, and the underlying structure of the original real world data. Without ever actually copying a single line of it. Right, exactly. And the stakes here for you, the listener, are almost difficult to overstate. The CFA report predicts that by 2030, a staggering 60% of all generative AI training data will be synthetic.

2:49Yeah, that's a massive number. AI models are literally running out of human-generated data to ingest. Exactly. So understanding the shift is basically a shortcut to understanding the future of how financial markets will be modeled and understood. But there is an obvious paradox here. Oh, for sure. If you're a massive quant fund that can afford to buy the highest quality real-world data in existence, why deliberately choose to train your models on a hallucination? Well, it solves two massive bottlenecks that have plagued quants for decades.

3:22Privacy and scarcity. Let's start with privacy. Right. So financial data is heavily regulated, obviously. And proprietary trading signals are fiercely guarded secrets. You can't just, you know, open source your client order flow or your alpha-generating algorithms to see if the broader data science community can improve them. Which creates a massive wall against crowdsourced innovation. Unless you do what Jane Street does. Oh, right. The Kaggle competition. Exactly. They are one of the most elite quantitative trading firms in the world. And they regularly host these massive public data science competitions on Kaggle.

3:56Right, because they want independent geniuses building forecasting models for them, but they obviously can't just hand over their secret trading data. They'd lose their entire competitive edge. So they use synthetic data to build an obfuscated data set. They use algorithms to completely mask the underlying asset names and shift the numerical scales. But they perfectly preserve the mathematical covariance, right? Like how the assets move in relation to one another. Exactly that. So independent data scientists can then hunt for mathematical inefficiencies and build predictive algorithms on this fake data.
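As a rough illustration of why masking names and shifting scales can leave the co-movement between assets intact, the sketch below rescales two made-up return series and compares their Pearson correlation before and after. This is not Jane Street's actual procedure, which the episode does not describe in detail; `obfuscate` and `pearson` are hypothetical helpers. Strictly speaking, the covariance itself changes when scales change, while the correlation is what a positive rescaling preserves.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def obfuscate(series, scale, shift):
    """Hide the original units with a positive scale and an arbitrary shift."""
    return [scale * x + shift for x in series]


# Illustrative daily returns for two unnamed assets (made-up numbers).
asset_a = [0.010, -0.020, 0.015, 0.003, -0.007]
asset_b = [0.012, -0.018, 0.010, 0.001, -0.010]

before = pearson(asset_a, asset_b)
after = pearson(obfuscate(asset_a, 3.7, -12.0), obfuscate(asset_b, 0.05, 4.0))

print(round(before, 6), round(after, 6))
```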

4:27That's so smart. It is. And when someone builds a winning model, Jane Street takes that exact mathematical logic and applies it to their real internal data. The external scientists never see the proprietary details, but their algorithms still work perfectly.

Advantages of Synthetic Data

4:41So privacy is the first massive advantage, but the scarcity issue, which the report calls the imbalance problem, is arguably even more critical. Yeah, real world data is just deeply skewed. The CFA report highlights this study on European credit card fraud. Oh, this was crazy. Yeah, over a two-day period, researchers analyzed something like 284,807 transactions. And out of that massive pool, only 492 were actually fraudulent. Wow. It's like trying to teach a machine to find a needle in a haystack, but there are so few needles it forgets what they even look like.

5:15That's a perfect way to put it. If you only give it historical records, the AI just optimizes for the highest overall accuracy by guessing healthy or safe every single time. Exactly. It achieves 99.9% accuracy and totally fails its actual objective because it never gets enough examples of the fraud to learn its underlying pattern. Synthetic data lets us just generate a haystack made mostly of needles, right? Yes. If we connect this to the bigger picture, instead of waiting years to accumulate enough rare fraud data, you use an AI to generate thousands of realistic, highly complex, but entirely fake fraudulent transactions.
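The arithmetic behind that accuracy trap can be checked directly from the two figures quoted above (284,807 transactions, 492 fraudulent). The sketch below scores a hypothetical "model" that labels every transaction as legitimate. It comes out at roughly 99.8% accuracy, close to the figure mentioned in the conversation, while catching zero fraud. The 50/50 rebalancing target is an assumption chosen for illustration.

```python
TOTAL = 284_807  # transactions in the study cited in the episode
FRAUD = 492      # fraudulent transactions among them


def always_legit_metrics(total, fraud):
    """Score a classifier that predicts 'legitimate' for every transaction."""
    legit = total - fraud
    accuracy = legit / total   # every legitimate row is counted as correct
    fraud_recall = 0 / fraud   # no fraudulent row is ever flagged
    return accuracy, fraud_recall


accuracy, fraud_recall = always_legit_metrics(TOTAL, FRAUD)
fraud_share = FRAUD / TOTAL

# Synthetic fraud rows needed to reach an (assumed) 50/50 class balance.
synthetic_needed = (TOTAL - FRAUD) - FRAUD

print(f"fraud share:  {fraud_share:.4%}")
print(f"accuracy:     {accuracy:.4%}")
print(f"fraud recall: {fraud_recall:.0%}")
print(f"synthetic fraud rows for 50/50: {synthetic_needed}")
```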

5:54You balance the data set. Exactly. And that fundamentally changes how we handle macroeconomic risk. The most dangerous events in finance, corporate defaults, massive fraud, extreme market crashes like 2008. They're rare tail risks. Right. By synthesizing these rare events, risk officers can train their models to recognize tail risks that historical data barely covers. You're basically vaccinating a portfolio against a future black swan. That's exactly what it is. So the necessity is clear, but the mechanism, like how we actually generate this data, is where the real revolution is happening.

6:26Because traditionally, we just used math, right? Yeah. Historically, the industry relied on traditional statistical methods. You had Monte Carlo simulations, which actually date back to the 1940s. Right. Running thousands of random scenarios based on predefined formulas. Yeah. Or quants would use GARCH models to simulate volatility clustering. But the friction there is that traditional models force chaotic, nonlinear markets into rigid, human-defined mathematical assumptions, don't they? That is the exact problem.

6:57If the quant assumes a normal distribution, the model spits out a normal distribution. But financial markets have incredibly fat tails. They don't obey clean, human-defined rules. No, they really don't. And that's why GenAI models are so disruptive. They bypass that entirely. They don't need explicit rules. They just ingest raw, complex market data and learn the deep, hidden rules of the data without being explicitly told what to look for. Okay. So if these GenAI models are so smart, how exactly are they dreaming up new financial data?
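To make the "normal in, normal out" point concrete, here is a small Monte Carlo sketch that draws returns from a normal distribution and from a Student-t distribution with 3 degrees of freedom, both rescaled to unit variance, and counts moves larger than four standard deviations. The Student-t is used purely as a stand-in for a fat-tailed market; the episode does not name a specific distribution, and the sample size and threshold are arbitrary choices.

```python
import math
import random


def normal_returns(n, rng):
    """Simulated returns under the normal assumption (unit variance)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def student_t_returns(n, rng, df=3):
    """Fat-tailed stand-in: Student-t draws rescaled to unit variance."""
    scale = math.sqrt((df - 2) / df)  # Var(t_df) = df / (df - 2) for df > 2
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        out.append(scale * z / math.sqrt(chi2 / df))
    return out


def count_extreme(xs, threshold=4.0):
    """Number of moves beyond `threshold` standard deviations."""
    return sum(1 for x in xs if abs(x) > threshold)


rng = random.Random(0)
n = 50_000
normal_extreme = count_extreme(normal_returns(n, rng))
fat_extreme = count_extreme(student_t_returns(n, rng))

print("4-sigma moves, normal assumption:", normal_extreme)
print("4-sigma moves, fat-tailed stand-in:", fat_extreme)
```

Both samples have the same variance, so a risk model calibrated only on volatility would treat them as equivalent even though the extreme-move counts differ sharply.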

7:28Well, let's look at the first major architecture used for this, which is the variational autoencoder, or VAE. VAE. Okay, let's dive into the mechanics of that, because the concept of an autoencoder just sounds like simple file compression to me, like zipping a folder on your computer. It shares that DNA, for sure. But a VAE operates probabilistically. It consists of two parts. First, the encoder. It takes highly complex data, say a 100-day sequence of 50 different stock metrics, and forces it through a mathematical bottleneck.

7:58It strips away all the noise and compresses it down into a tiny, dense summary called a latent space. So it's extracting the core mathematical properties. It figures out that the only variables that actually matter in this massive sequence are, say, momentum and volatility trend. Right. And then it plots the stock on a coordinate map based only on those core variables. Exactly that. But the magic happens with the second part, the decoder. Because to generate synthetic data, the decoder doesn't just pull the original file back out of the latent space.

8:29What does it do? It samples a brand new point mathematically near the original data in that space. Then it decodes it, or reconstructs a full 100-day stock sequence from that new point. Oh, wow. So because it's sampling a nearby coordinate, the new sequence shares the deep DNA of the real market, but the specific daily price movements are a completely novel scenario. Right. It generates a parallel universe that plays by the exact same physics as our universe, like modeling options volatility surfaces, for example.

9:02That's amazing. But as powerful as VAEs are, I know the real heavy hitters in synthetic data for years have been GANs, right? Generative adversarial networks. Yeah, GANs. They function through an evolutionary arms race. You basically have two distinct neural networks locked in a brutal feedback loop. It's like a game of cat and mouse, right? Exactly. The first network is the generator. Think of it as an art counterfeiter. Its goal is to create synthetic stock charts. The second network is the discriminator, which acts as the detective. Okay, I like this analogy. So the counterfeiter hands the detective a mix of real historical stock charts and newly synthesized ones.

9:38And the detective's sole function is to calculate the probability that a chart is fake. Yes. And when the detective inevitably catches the counterfeiter early on, it doesn't just say, you failed. The mathematical penalty travels backward through the network, telling the counterfeiter exactly which mathematical feature gave the fake away. Like maybe the variance of the daily returns was too tight or the correlation to moving averages was slightly off. Exactly. So the counterfeiter mathematically adjusts its internal weights to fix those specific flaws and tries again.

10:10And they just keep fighting until the fake money looks identical to the real thing. They iterate millions of times until the counterfeiter is producing data so flawless that the detective's detection accuracy drops to 50%, meaning it's essentially just flipping a coin. That sounds perfect. But there is a specific challenge with GANs, isn't there? Something called mode collapse? Yeah, mode collapse is the fatal mechanical flaw of GANs. Yeah. Because this is an evolutionary arms race, mode collapse happens when the generator, the counterfeiter, figures out a highly specific cheat code.

10:43Uh-oh. Like what? Like it discovers one single obscure trick, maybe a very specific flat-lined volatility curve that consistently fools the discriminator. Oh, so instead of learning to generate a diverse dynamic market simulation, it just spams that one trick. Exactly. It just keeps producing the exact same fake data over and over. The diversity of your synthetic data collapses. Right. And if you train a trading algorithm on that collapsed data, you will get absolutely crushed in the live market. Because the algorithm only knows how to trade in a market that looks exactly like that one flat-lined chart.

11:17Oh, right. So because GANs can be unstable and suffer from mode collapse, the industry has pushed forward into even newer territory. Yeah. And that brings us to diffusion models. Which is the tech behind image generators, like Midjourney. Yeah. Right. Exactly. Diffusion models are the current state-of-the-art. And they basically mimic the laws of thermodynamics, specifically entropy. In finance, this works in two steps. First is forward diffusion. Okay. How does that work? Imagine a massive matrix representing the correlations between hundreds of global equities.

11:50The model progressively, step-by-step, adds random Gaussian noise to that matrix. Like television static. Yeah, exactly. It corrupts the data until the correlations are completely destroyed, leaving nothing but pure mathematically random static. Wait, if a diffusion model is just adding random static to a matrix and then wiping it away in the reverse step, aren't you just regurgitating the exact same matrix? That sounds like it, right? But no. How does adding and removing noise create a brand new market scenario? Because the neural network isn't memorizing the matrix itself.

12:23It's learning the process of denoising. During reverse diffusion, the model is trained to predict and subtract the noise at each step. Once the model masters the physics of how to clean data, you don't feed it the original ruined matrix. You feed it a brand new block of pure, unconditioned static noise that never came from a real matrix at all. Oh, wow. The model applies its denoising rules to that static, effectively generating a brand new, valid correlation matrix from pure noise.
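The forward (noising) half of that process is simple enough to sketch without a neural network. The code below assumes the standard formulation used in denoising diffusion models, where the noised data at step t is sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise, and a linear noise schedule; both are assumptions, since the episode does not specify which schedule the cited research used. The reverse step requires a trained denoising network and is not shown.

```python
import math
import random


def alpha_bar_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative share of the original signal kept after each step.

    Uses an assumed linear beta schedule; abar_t = product of (1 - beta_s).
    """
    alpha_bar, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noised(x0, alpha_bar_t, rng):
    """Jump straight to a noise level: sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    keep = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(1)
alpha_bar = alpha_bar_schedule()

# Illustrative off-diagonal correlations, flattened into a list.
correlations = [0.82, 0.35, -0.10, 0.57, 0.04, 0.66]

early = noised(correlations, alpha_bar[9], rng)   # barely corrupted
late = noised(correlations, alpha_bar[-1], rng)   # essentially pure static

print("signal kept at step 10:  ", round(math.sqrt(alpha_bar[9]), 4))
print("signal kept at last step:", round(math.sqrt(alpha_bar[-1]), 4))
```

At the last step almost none of the original signal survives, which is why generation can start from fresh random noise rather than from a corrupted real matrix.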

12:53It's hallucinating structure where there was none. Exactly. Yeah. And the CFA report highlights a specific real-world application of this. Researchers used a conditional diffusion model to build a fixed income strategy. Conditional meaning it accounted for specific real-world factors. Yeah, like interest rate volatility and equity volatility. And by filling the gaps in their historical data with this noise-clean synthetic data, they built a strategy that actually outperformed U.S. Treasury bills by a full 1%. Finding a 1% edge purely through synthetic data generation is a massive deal in fixed income.

13:27Oh, it's a huge signal to the industry. But here's where it gets really interesting. The frontier of this technology isn't just generating static spreadsheets or correlation matrices. No, we are now simulating entire markets. Yeah. Researchers are deploying large language models, LLM agents, as autonomous traders within virtual stock markets. But how does an LLM, which is essentially a text predictor, actually execute trades in a simulated market? It relies on prompt engineering and context windows.

13:57Researchers give each LLM a highly specific system prompt that acts as its personality. Like, one agent might be programmed as a highly leveraged, aggressive momentum trader with a short memory. Exactly. And another might be a conservative value investor with strict stop-loss rules. Then you feed them real-time text, like news headlines or simulated price ticks. And the LLM processes the semantic meaning of the news, weighs it against its risk profile, and relies on its massive pre-trained knowledge of human economic history to decide how that specific persona would react.

14:30Right. And then it outputs a highly specific numerical buy or sell order. In the study from the source, researchers introduced a macroeconomic shock. They fed the agents news about a hypothetical Fed interest rate cut. And how did the agents react? Did they just act like cold calculators? Not at all. They exhibited genuine, bullish, herd-like behavior. They panicked, they bought aggressively, they overreacted based on their profiles, completely mimicking human psychology. That is wild. It allows quants to simulate the irrational psychology of a market crash before it happens.

15:02Exactly. But, okay, this brings us to the ultimate friction point. It's great that AI can simulate markets and generate endless data, but if it's bad data, it leads to catastrophic financial decisions. Oh, absolutely. How do quants actually mathematically verify that the synthetic data isn't garbage? How do we evaluate it? You have to deploy a rigid, two-pronged quality check, qualitative and quantitative. Qualitatively, you look at it visually. But human eyes can't process a 12-day sequence of 50 different interacting stock features.

15:33Right, so quants use dimensionality reduction techniques like PCA or t-SNE. These algorithms basically squash incredibly high-dimensional relationships down to a 2D scatter plot. Oh, so you plot the real historical market data as, say, blue dots on a scatter plot and the fake data as yellow dots. Yes. And if the yellow dots perfectly overlap and blend into the blue clusters, your AI has successfully captured the complex structure of reality. But if the yellow dots isolate themselves in their own corner, your model is hallucinating irrelevant patterns.

16:04Exactly. That's the visual test. But quantitatively, you use statistical tests, specifically the Kolmogorov-Smirnov test. The KS test. Right. Think of it as measuring a maximum gap between the shape of the real data's probability curve and the fake data's curve. If that mathematical gap is statistically tiny, you have proof that the synthetic data shares the same underlying reality as the true data. But the absolute gold standard, the ultimate litmus test, is the TSTR methodology, right? Yes, TSTR. Train on synthetic, test on real.
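The "maximum gap" idea behind the two-sample KS test can be sketched in a few lines: build the empirical cumulative distribution of each sample and take the largest vertical distance between them. The samples below are randomly generated stand-ins for real and synthetic returns, and the sketch computes only the statistic, not a p-value. A statistics library would normally be used for the full test (SciPy's `ks_2samp` is one option).

```python
import bisect
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(2000)]
good_synthetic = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # same shape
bad_synthetic = [rng.gauss(0.0, 3.0) for _ in range(2000)]   # wrong volatility

gap_good = ks_statistic(real, good_synthetic)
gap_bad = ks_statistic(real, bad_synthetic)

print("gap vs well-matched synthetic:", round(gap_good, 3))
print("gap vs mis-scaled synthetic:  ", round(gap_bad, 3))
```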

16:35So you train one machine learning model entirely on historical data. Then you train a second model purely on your newly minted synthetic data. And finally, you test both models on a real-world, hold-out data set that neither has ever seen. Exactly. If the model trained on the AI's hallucinations performs just as well or better than the one trained on cold, hard history, you have a winner. You've fundamentally proven the value of your synthetic pipeline.
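A toy version of that train-on-synthetic, test-on-real comparison is sketched below. Everything in it is an assumption made for illustration: the "real" data is simulated, the generator is a simple per-class Gaussian fit standing in for a GAN, VAE, or diffusion model, and the classifier is a nearest-centroid rule on a single feature.

```python
import random
from statistics import mean, stdev


def make_real(n, rng):
    """Simulated 'real' data: one feature, two classes (stand-in only)."""
    data = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        data.append((rng.gauss(3.0 if label else 0.0, 1.0), label))
    return data


def fit_generator(data):
    """Stand-in generator: per-class mean and standard deviation."""
    params = {}
    for c in (0, 1):
        xs = [x for x, y in data if y == c]
        params[c] = (mean(xs), stdev(xs))
    return params


def sample_synthetic(params, n, rng):
    """Draw brand new labelled points from the fitted generator."""
    out = []
    for _ in range(n):
        c = rng.randrange(2)
        mu, sigma = params[c]
        out.append((rng.gauss(mu, sigma), c))
    return out


def fit_centroids(data):
    """Nearest-centroid classifier: remember each class's mean feature."""
    return {c: mean([x for x, y in data if y == c]) for c in (0, 1)}


def accuracy(centroids, data):
    """Share of points whose nearest centroid matches the true label."""
    hits = 0
    for x, y in data:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += int(pred == y)
    return hits / len(data)


rng = random.Random(7)
real = make_real(4000, rng)
train_real, holdout = real[:3000], real[3000:]  # holdout is never trained on
synthetic = sample_synthetic(fit_generator(train_real), 3000, rng)

acc_real = accuracy(fit_centroids(train_real), holdout)  # train real, test real
acc_tstr = accuracy(fit_centroids(synthetic), holdout)   # train synthetic, test real

print("trained on real:     ", round(acc_real, 3))
print("trained on synthetic:", round(acc_tstr, 3))
```

If the two scores are close, the synthetic set has preserved the signal the classifier needs; a large drop would point to a generator that missed part of the real distribution.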

Case Study and Ethical Considerations

17:00And to prove all this theory actually works, we have to look at the specific case study from the CFA report on financial sentiment analysis. Yeah, the Qwen3 case study. So to set the scene for you listening, large language models are great at reading thousands of financial news articles to judge sentiment, right? Right. Judging if a headline is positive, negative, or neutral. But piping proprietary trading data through a massive public model like GPT-4 is a huge security risk. Right. So quants prefer to fine-tune a smaller open-source model locally. It's cheaper and safer.

17:30Like Alibaba's Qwen3 model. Yeah. But the problem is a smaller model isn't as smart out of the box. You have to train it on a highly specific data set. Which brings us back to the imbalance problem. They attempted to train Qwen3 using a data set called FiQA-SA. And the data set was highly imbalanced. It was flooded with positive financial news, but contained almost zero examples of neutral news. So the local model literally wouldn't know what a neutral headline looked like. It would just misclassify neutral events as either bullish or bearish.

18:03Exactly. To fix this, they used a larger model, GPT-4o, to generate 800 purely synthetic financial sentences with perfectly balanced sentiment. Like totally fabricated sentences. Yeah. Things like, Tesla's quarterly report shows an increase in vehicle deliveries by 15%. Just a perfectly neutral synthetic fact. So what does this all mean? Did it actually work? Oh, it worked incredibly well. They tested the performance using the F1 score. And we should clarify, the F1 score isn't just basic accuracy. It balances precision and recall, meaning it severely penalizes models that just blindly guess the most common answer to inflate their score.
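To see how F1 punishes blind guessing, the sketch below scores a model that answers "positive" for every headline in a tiny, made-up, imbalanced sample. It uses macro-averaged F1 (the plain average of per-class F1 scores), which is an assumption; the episode does not say which averaging the report used.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 = harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # class never correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up imbalanced sample: 8 positive, 1 negative, 1 neutral headline.
y_true = ["positive"] * 8 + ["negative"] + ["neutral"]
y_pred = ["positive"] * 10  # the model blindly guesses the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)

print("accuracy:", accuracy)
print("macro F1:", round(macro, 4))
```

The blind guesser keeps 80% accuracy but its macro F1 falls below 0.30, because the two classes it never predicts each contribute a score of zero.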

18:39Right. So model one, which was trained only on the real imbalanced historical data, achieved an F1 score of 75.29%. Which is functional, but entirely insufficient for automated trading. Yeah. But model two was trained on the real data, plus just 200 of those synthetic AI-generated sentences. And the score jumped to 85.17%. An almost 10 percentage point leap. Just by adding a dash of synthetic data. Like, 200 fabricated sentences corrected the model's blind spots entirely. What's fascinating here is just the sheer power of this technology.

19:10It circumvents privacy laws, it solves the scarcity of tail risk events, and it demonstrably improves algorithmic accuracy. But we do have to mention the ethical and practical warnings here. Yeah. Deploying this comes with severe systemic risks. The first one being data drift. Right. Financial markets are not static. The regime is constantly evolving. If you train an LLM to synthesize data based on the macroeconomic environment of 2024, and the market regime shifts entirely in 2026. Your models decay if they aren't updated. Your synthetic data is now mathematically codifying an extinct reality.

19:45Exactly. And then you face the black box problem. If a diffusion model discovers a hidden multidimensional correlation between, say, orange juice futures and semiconductor stocks. And generates synthetic data based on that. Right. The quants don't necessarily know why the AI made that connection. It's a black box. Yeah. And if you cannot interpret the mechanism behind the correlation, betting billions of dollars on it is a massive institutional risk. And there's the bias risk, too. Oh, absolutely. If your original historical data contains systemic human biases, say, biased lending practices, the AI doesn't just copy those biases.

20:21It mathematically codifies them and scales them infinitely in synthetic output. It acts as an algorithmic amplifier. So if we aren't careful, we bake historical flaws into future algorithms. You really have to be ruthlessly precise about the reality you choose to synthesize. You do. The underlying math demands nothing less. Well, this brings us to the end of our deep dive. Going from the raw constraints of historical data to virtual agents trading in hallucinated markets is just a massive paradigm shift. But before we sign off, I want to thank you for joining us on this deep dive and leave you with a final provocative thought to mull over on your own.

20:57It's a big one. It is. Let's return to that staggering prediction that by 2030, 60 percent of all AI training data will be synthetic. If that prediction holds true, we will soon live in a world where AI is primarily learning from other AIs. A completely closed loop ecosystem. Exactly. So if perfectly structured, flawlessly balanced synthetic data becomes a cheap, infinite commodity available to every firm, what happens to the value of human action? Will the messy, unpredictable, real human data suddenly become the most expensive and highly prized asset on Wall Street?

21:29Something to think about. We'll catch you on the next deep dive. We'll catch you on the next deep dive.
