When to use an MLP vs. a Transformer for tabular data
By Jon Jovinsson
If you have tabular data (rows and columns, fixed schema, bounded cardinality) and you are trying to decide between an MLP and a Transformer, the short answer is: start with an MLP or a gradient-boosted tree. Move to a Transformer only when a specific feature of the problem asks for it. We build ML systems for Australian businesses across retail, property, and finance, and the overwhelming majority of tabular workloads are still best served by the older, simpler architectures.
That is not nostalgia. Transformers earned their crown on sequence data: language, audio, protein chains, time series with long dependencies. Tabular data is mostly not that. A row of customer attributes does not have a temporal order. Columns are not tokens. The inductive bias that makes Transformers dominant on language is mostly wasted on a CSV of account records.
When to reach for an MLP
An MLP (multi-layer perceptron) works when each row is an independent observation and the signal you care about lives in that row's own features, not in an ordered history or in its relationship to other rows. Predicting churn from a user profile. Predicting loan default from an applicant record. Classifying a transaction as fraud or not. All of these are problems where an MLP, plus basic feature engineering, will get you most of the way. Compared with a Transformer, an MLP is faster to train, cheaper to serve, easier to interpret, and needs less data.
In practice on Australian datasets, an MLP or a gradient-boosted model like CatBoost or XGBoost is almost always our first model. We only move past them when they visibly plateau and the residual error has a structure a bigger model could exploit.
When to reach for a Transformer
Reach for a Transformer on tabular data when at least one of three things is true. First: your features are themselves sequences (clickstreams, event logs, transaction histories per customer). The Transformer was built for exactly this and it shines. Second: you need to learn complex, pairwise relationships between columns, and you have enough data to justify the parameter count. Attention-based tabular models such as FT-Transformer and SAINT (both Transformers) and TabNet (which uses sequential attention but is not strictly a Transformer) can match or outperform trees here, but typically only with large datasets. Third: you are fusing tabular features with unstructured data (product text, customer reviews, images of listings). A Transformer gives you a natural way to embed and combine everything in one model.
The cost side nobody mentions
Transformers on tabular data are expensive in ways that do not show up on the validation set. Training takes longer. Hyperparameter search is bigger. Serving them on Vertex AI or Cloud Run costs more per prediction. Interpretability is worse, which matters for any compliance-sensitive work (financial services, insurance, health). If the MLP gets you to within a percent or two of the Transformer, the MLP is almost always the right production choice.
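To make the per-prediction cost gap concrete, here is a rough back-of-envelope sketch that counts multiply-accumulate operations (MACs) for one row. It assumes the Transformer treats each column as one token and uses a standard encoder layer; it ignores the feature tokenizer, layer norms, softmax, and the output head, so treat the numbers as an order-of-magnitude guide only. The function names and the layer sizes are illustrative assumptions, not measurements from a real project.

```python
def mlp_macs(n_features, hidden_sizes, n_outputs=1):
    """Multiply-accumulates for one row through a fully connected network."""
    sizes = [n_features, *hidden_sizes, n_outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


def transformer_macs(n_features, d_model, d_ff, n_layers):
    """Multiply-accumulates for one row, treating each column as one token."""
    n = n_features
    projections = 4 * n * d_model * d_model  # Q, K, V and output projections
    attention = 2 * n * n * d_model          # attention scores, then weighted sum of values
    feed_forward = 2 * n * d_model * d_ff    # two linear layers applied to every token
    return n_layers * (projections + attention + feed_forward)


# Illustrative sizes only (assumed for this sketch).
mlp = mlp_macs(n_features=20, hidden_sizes=[256, 256])
tfm = transformer_macs(n_features=20, d_model=64, d_ff=128, n_layers=3)
print(f"MLP:         {mlp:,} MACs per row")
print(f"Transformer: {tfm:,} MACs per row ({tfm / mlp:.1f}x)")
```

The point of the sketch is the shape of the formulas: MLP cost does not depend on the number of tokens, while Transformer cost grows with the number of columns in every term and quadratically in the attention term.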
How we actually decide on client work
Our decision tree on real engagements looks like this. Baseline with logistic regression or a simple tree. Step up to a well-tuned gradient-boosted model (CatBoost is our default). Step up to an MLP if the data is large and the tree has plateaued. Step up to a Transformer only if the feature space actually contains sequence or multi-modal structure. Stopping early is a feature, not a failure. Most Australian businesses do not need a Transformer. They need a reliable, monitored, fast-to-serve model that solves their problem.
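The ladder above can be sketched as a small function. This is one possible reading of the steps, not a tool we ship: the `ProblemTraits` fields and the `models_to_try` name are hypothetical, and in practice each flag is a judgment call made after looking at validation results.

```python
from dataclasses import dataclass


@dataclass
class ProblemTraits:
    """Hypothetical summary of what we know about a tabular problem."""
    large_dataset: bool
    boosted_model_plateaued: bool
    has_sequence_features: bool
    has_multimodal_features: bool


def models_to_try(traits: ProblemTraits) -> list[str]:
    """Return the rungs of the ladder worth climbing, in order."""
    # Every engagement starts with a baseline and a gradient-boosted model.
    steps = ["logistic regression or simple tree", "gradient-boosted model"]
    # An MLP is only worth it on large data once the boosted model has plateaued.
    if traits.large_dataset and traits.boosted_model_plateaued:
        steps.append("MLP")
    # A Transformer needs real sequence or multi-modal structure in the features.
    if traits.has_sequence_features or traits.has_multimodal_features:
        steps.append("Transformer")
    return steps


typical = ProblemTraits(
    large_dataset=False,
    boosted_model_plateaued=False,
    has_sequence_features=False,
    has_multimodal_features=False,
)
print(models_to_try(typical))
```

For the typical case, with no plateau and no sequence or multi-modal structure, the function stops at the gradient-boosted model, which is the "stopping early" outcome described above.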
When it is a close call
If you run both and they are within noise of each other on your validation metric, ship the MLP. It will be cheaper in production, easier to retrain, and less brittle when your data drifts. Save the Transformer for the problem where it genuinely wins by more than the cost difference.
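One simple way to put a number on "within noise" is to score both models on the same folds or seeds and compare the mean paired difference to its standard error. The sketch below assumes a higher-is-better metric and a cutoff of two standard errors; both the cutoff and the helper names are assumptions for illustration, and the scores in the usage example are made-up placeholders.

```python
from math import sqrt
from statistics import mean, stdev


def within_noise(mlp_scores, transformer_scores, z=2.0):
    """True if the mean paired difference is within z standard errors of zero."""
    if len(mlp_scores) != len(transformer_scores) or len(mlp_scores) < 2:
        raise ValueError("need at least two paired scores per model")
    diffs = [t - m for m, t in zip(mlp_scores, transformer_scores)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return abs(mean(diffs)) <= z * standard_error


def pick_model(mlp_scores, transformer_scores, z=2.0):
    """Ship the MLP on a tie; otherwise ship whichever model scores higher."""
    if within_noise(mlp_scores, transformer_scores, z):
        return "mlp"
    diffs = [t - m for m, t in zip(mlp_scores, transformer_scores)]
    return "transformer" if mean(diffs) > 0 else "mlp"


# Placeholder validation scores from five shared folds (not real results).
mlp_scores = [0.810, 0.805, 0.812, 0.808, 0.809]
transformer_scores = [0.812, 0.803, 0.815, 0.806, 0.811]
print(pick_model(mlp_scores, transformer_scores))
```

Cross-validation folds share training data, so this standard error tends to be optimistic. A Transformer that only barely clears the cutoff is still a close call, and it still has to beat the extra serving cost.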