TabPFN: a foundation model for tabular data
Yet another significant landmark on our breathless AI journey
Recently, I wrote about the work GRID is doing to enable AI to work with calculations in your spreadsheets.
I have also written about using LLMs to perform some basic data analysis and visualization.
Hopefully, both of those posts pointed out some basic ways to use our data with the new AI technologies we’re all experimenting with. But every day, businesses rely on data tables – from sales spreadsheets to customer databases – as critical assets, and applying advanced AI to this kind of structured data has been surprisingly difficult.
We have seen great breakthroughs in deep learning when it comes to analyzing and generating images and text, but AI still struggles with tabular data.
In fact, the go-to techniques for working with data tables have remained decision trees and regression models rather than the neural networks that have powered the latest AI developments in other fields.
As Schönberg said, “There’s a lot of good music still to be written in C Major,” and, to be fair, these traditional methods can work well in clearly defined scenarios. However, they also demand extensive trial and error and expert tuning to squeeze out the best performance.
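To see what that trial-and-error looks like in practice, here is a minimal sketch of the kind of hyperparameter search these traditional methods typically require. The dataset and parameter grid are illustrative, not drawn from any published benchmark.

```python
# A sketch of the tuning loop traditional tabular methods demand:
# a grid search over decision-tree hyperparameters on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Expert tuning means sweeping grids like this one -- often much larger ones.
grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

Even this toy grid fits sixty models (twelve combinations, five folds each); real pipelines repeat this across many algorithms and far larger grids.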
It’s also true that organizations using these techniques often end up relying on legacy tools or libraries, which hold back their ability to innovate or refresh their data architecture.
So far, there has been no general purpose AI model that could bring the advantages of flexibility and usability to tabular data analysis.
TabPFN
TabPFN is a new foundation model from the German startup Prior Labs; it is designed specifically for tabular data and takes a very different approach.
When we think of foundation models for language (like OpenAI’s ChatGPT), we should recognize that they come pre-trained on vast amounts of data, giving them a broad base of knowledge. The same is true for an image model like Midjourney.
TabPFN’s name comes from Tabular Prior-data Fitted Network, reflecting that it was trained on 130 million synthetic datasets modeling real-world scenarios. Thanks to this massive prior training, TabPFN can understand patterns in a new dataset and make predictions almost instantly, without any task-specific training or tuning.
In practical terms, you can feed your spreadsheets or data tables into TabPFN and get immediate predictive insights instead of spending hours training a model.
In one published benchmark, TabPFN produced better predictions in 2.8 seconds than the best existing machine learning models could achieve with over 4 hours of careful tuning and training.
Synthesizing
By synthetic data, we mean artificially generated datasets that mimic real-world tabular data structures, including a vast range of types, variables, relationships, and distributions. Instead of collecting real-world data (with all the vulnerabilities that might expose), synthetic data is created using statistical models or simulations. In effect, generative techniques produce the data used to train this new AI.
In addition to avoiding data privacy or data governance issues, there are some advantages to using synthetic data:
Unlimited data can be generated, allowing the model to learn a vast range of patterns.
The model is exposed to many possible data distributions, making it more adaptable to unseen datasets.
Synthetic data can include rare scenarios and edge cases that might be underrepresented in real datasets but that improve the model.
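A toy version of this generation process makes the idea concrete: sample feature distributions at random, wire in a hidden relationship, and emit (X, y) pairs. This recipe is purely illustrative; TabPFN’s actual prior is far richer, drawing on random causal structures, mixed data types, missing values, and more.

```python
# A toy synthetic-table generator: random feature distributions plus a
# hidden, randomly weighted nonlinear relationship.
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic_table(n_rows=200, n_features=5):
    cols = []
    for _ in range(n_features):
        # Each feature gets a randomly chosen distribution.
        dist = rng.choice(["normal", "uniform", "lognormal"])
        if dist == "normal":
            cols.append(rng.normal(0, 1, n_rows))
        elif dist == "uniform":
            cols.append(rng.uniform(-2, 2, n_rows))
        else:
            cols.append(rng.lognormal(0, 0.5, n_rows))
    X = np.column_stack(cols)
    # A hidden nonlinear relationship plus noise defines the target.
    w = rng.normal(0, 1, n_features)
    y = np.tanh(X @ w) + rng.normal(0, 0.1, n_rows)
    return X, y

X, y = make_synthetic_table()
print(X.shape, y.shape)
```

Generate millions of tables like this, each with a different hidden relationship, and a model trained to predict y from X across all of them has effectively learned *how to learn* from a table.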
Nevertheless, the data remains synthetic, so it may not fully reflect the complexity of real-world data. The one thing you can be sure of is that the real world is more complex, contradictory and contrary than any simulation.
There’s also the common problem of bias. To generate synthetic data, we make assumptions about what that data should look like. If those assumptions include hidden biases, the synthetic data will reflect or even exaggerate them. And if you have not created enough variety in your synthetic data, the model may become too optimized for the training data and fail to adapt to real-world nuances: what we call overfitting.
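That overfitting risk can be demonstrated in a few lines: fit a model on data drawn from a narrow, clean simulation, then evaluate it on a shifted, noisier distribution standing in for the real world. The distributions here are invented purely for illustration.

```python
# Sketch of the overfitting risk: a model fit on narrow "synthetic"
# data degrades when the real-world distribution shifts.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# "Synthetic" training data from a narrow slice of feature space.
X_syn = rng.normal(0, 1, (300, 4))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype(int)

# "Real" data: same underlying rule, but shifted and more spread out
# than the simulation assumed.
X_real = rng.normal(0.5, 2.0, (300, 4))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
print(round(tree.score(X_syn, y_syn), 3))   # memorizes its training data
print(round(tree.score(X_real, y_real), 3)) # degrades on the shifted data
```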
Some use cases
So, what is a model like this good for?
TabPFN will be very attractive to financial analysts forecasting market trends or credit risk, to logistics and supply chain planners, and to retailers.
Personally, I think TabPFN offers some far-reaching possibilities in healthcare, especially for analyzing clinical trial data. I hope (and expect) that the complexities of this kind of dataset are well represented in the synthetic training data. Clinical trial analysis is a nightmare of data integration, especially the meta-analysis of many similar trials.
I asked DeepSeek to list some of these complications. I think it’s still listing some in the background as I write this!
Patient Variability: Differences in age, genetics, lifestyle, pre-existing conditions, and responses to treatment.
Heterogeneous Data Types: Combination of numerical (biometrics, test results), categorical (diagnoses, treatment groups), and unstructured data (clinical notes).
Longitudinal Nature: Data collected over time, requiring an understanding of temporal trends and progression of conditions.
Missing Data: Patients drop out, miss check-ups, or have incomplete records.
Placebo and Control Groups: Need to model randomized controlled trials (RCTs) with treatment and placebo comparisons.
Adverse Events and Rare Outcomes: Important but rare side effects must be modeled effectively.
Regulatory Constraints and Bias Mitigation: Handling diverse populations to ensure fairness and regulatory compliance.
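Two of these complications – heterogeneous column types and missing data – are exactly what any tabular pipeline has to confront before a model ever sees the data. Here is a standard preprocessing sketch; the column names and values are invented for illustration.

```python
# Handling mixed column types and missing values with a standard
# scikit-learn preprocessing pipeline. The trial data is invented.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

trial = pd.DataFrame({
    "age":       [54, 61, np.nan, 47],      # numeric, with a missing value
    "biomarker": [1.2, np.nan, 0.8, 2.1],   # numeric lab result
    "arm":       ["treatment", "placebo", "treatment", np.nan],  # categorical
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["age", "biomarker"]),
                          ("cat", categorical, ["arm"])])
X = prep.fit_transform(trial)
print(X.shape)  # 4 rows; 2 scaled numeric columns + one-hot columns
```

Part of TabPFN’s appeal is that it was trained with missing values and mixed types in its synthetic prior, so much of this hand-built plumbing becomes unnecessary.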
Next
Many of my clients have been excited by what LLMs such as Claude, ChatGPT or DeepSeek can do in their business. Sometimes it takes a little digging, but there are great use cases in almost every business.
For TabPFN, there’s no almost. Business runs on tabular data and this is an exciting opportunity to bring the simplicity and insight of AI to a wide range of scenarios.
Of course, others are looking at tabular data; Google’s TabNet, Microsoft’s LightGBM and Amazon’s AutoGluon are all interesting. However, unlike TabPFN, these are not foundation models and they don’t treat a dataset as input context; they must be trained or tuned per dataset, applying essentially traditional techniques at scale. TabPFN is more like (I reluctantly have to say it, but I will) … it’s more like ChatGPT for data.
TabPFN is open source – yay! – it has an API for integration into business workflows, and it will soon have a range of database connectors.
Remember their name. I think this is the start of something very big indeed.