Back in 2021, Ford’s machine learning teams faced a problem that now seems almost embarrassing. Engineers spent more time searching for and rebuilding data than actually building models. Development cycles dragged on for six months. Teams rebuilt the same features independently, not out of ignorance, but because there was no way to tell whether someone else had already done the work. Features were computed one way for training and another way for inference, so models that worked in the lab often failed in production.
We chose to build a feature store to solve this, and the experience taught me just how easy it is to underestimate a problem. I learned more from this project than almost anything else I’ve worked on.

The decision nobody warned us about
Before writing a line of code, we evaluated the existing options: Feast, Tecton, and Hopsworks. All credible. All solving real versions of our problem.
We almost went with Hopsworks. It was the closest fit technically. But when we dug into the authorization model, we hit a wall: access controls were at the feature store level, not the feature level. At Ford, that’s a dealbreaker. You can’t give a marketing team access to vehicle diagnostic data just because they need one adjacent feature. Granular, per-feature access control wasn’t a nice-to-have — it was a compliance requirement.
Feast and Tecton had their own gaps. And all three had a shared problem: data privacy concerns meant we couldn’t push Ford’s data to a cloud-hosted solution. We needed on-premises ownership.
So we built.
That decision added months to the timeline and significant engineering complexity. I think it was the right call. But I want to be honest about what it cost — because the “build vs. buy” framing undersells how much of the real work happens after you’ve made the decision and have to live with it.
The authorization problem we didn’t fully anticipate
The hardest technical problem we faced wasn’t storage architecture or query performance. It was authorization.
Ford uses Apache Ranger for access control — a reasonable enterprise choice. The problem was that the Ranger API had significant rate limits that made real-time permission checks impractical at feature query volume. We couldn’t call Ranger on every feature access without tanking performance.
The solution we landed on was inelegant but effective: nightly exports of Ranger policies as JSON to HDFS, with event notifications to detect policy changes, feeding an authorization service we built that mapped permissions to our feature metadata database. We integrated it with Ford’s internal access request system so users could request feature access without leaving the platform.
It worked. But it meant we were running on a slightly stale view of permissions — updated nightly, not in real time. That tradeoff was acceptable for our use case, but it took weeks of debate to get there, and it introduced operational complexity we hadn’t budgeted for.
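In spirit, the service was simple: flatten the exported policies into an in-memory index, then answer per-feature checks against that snapshot. Here’s a minimal sketch of the idea — the real service read nightly exports from HDFS and mapped into our metadata database; the JSON shape, user names, and feature names below are invented for illustration:

```python
import json

# Stand-in for a nightly Ranger policy export pulled from HDFS.
# The actual export format differed; this is illustrative only.
POLICY_EXPORT = """
{
  "policies": [
    {"user": "marketing_analyst", "features": ["dealer_region", "trim_level"]},
    {"user": "diagnostics_ml",    "features": ["engine_fault_codes", "trim_level"]}
  ]
}
"""

def load_policy_index(raw_json):
    """Flatten exported policies into a user -> allowed-feature-set map."""
    policies = json.loads(raw_json)["policies"]
    return {p["user"]: set(p["features"]) for p in policies}

def can_access(index, user, feature):
    """Per-feature check against the (nightly-stale) policy snapshot.

    No Ranger API call on the hot path — that's the whole point of
    the export: the rate limits never see feature-query volume.
    """
    return feature in index.get(user, set())

index = load_policy_index(POLICY_EXPORT)
allowed = can_access(index, "diagnostics_ml", "engine_fault_codes")
denied = can_access(index, "marketing_analyst", "engine_fault_codes")
```

The design choice worth noting: the check itself is a set lookup, so the cost of granular, per-feature authorization lives entirely in the nightly rebuild, not in query latency.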
The lesson: authorization in enterprise ML infrastructure is never just a technical problem. It’s a political and compliance problem that happens to have a technical component. Treat it that way from day one.
The discovery problem we definitely didn’t anticipate
Here’s the one that stings a little in retrospect.
When we designed the feature catalog — the searchable index of all features in the store — we assumed feature counts would stay manageable. We built basic search. Keyword matching, some filtering. Good enough for a catalog of a few hundred features.
We were wrong about the scale. As adoption grew, the catalog became genuinely hard to navigate. Engineers couldn’t find features that existed. They’d build duplicates. The reusability problem we’d built the whole system to solve was creeping back in through a search interface that couldn’t keep up.
We should have built for Elasticsearch from the start. Semantic search, faceted filtering, feature descriptions that actually surfaced contextually. We knew it intellectually — we’d discussed it early and deprioritized it. The cost of that decision compounded as the catalog grew.
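To make that concrete: the difference between keyword matching and what we needed is visible even at the query level. A faceted Elasticsearch catalog query looks something like the sketch below — full-text relevance over names and descriptions, plus aggregations that let engineers narrow by domain or team. The field names (name, description, domain, owner_team) are illustrative, not our actual schema:

```python
def build_catalog_query(text, filters=None):
    """Build an Elasticsearch query body for a feature catalog search.

    Combines full-text matching (boosting the feature name) with
    term filters and facet aggregations. Field names are invented
    for this example.
    """
    must = [{"multi_match": {
        "query": text,
        "fields": ["name^2", "description", "tags"],  # name matches rank higher
    }}]
    filter_clauses = [{"term": {field: value}}
                      for field, value in (filters or {}).items()]
    return {
        "query": {"bool": {"must": must, "filter": filter_clauses}},
        "aggs": {  # facets surfaced alongside results for drill-down
            "by_domain": {"terms": {"field": "domain"}},
            "by_owner": {"terms": {"field": "owner_team"}},
        },
    }

body = build_catalog_query("vehicle battery health", {"domain": "diagnostics"})
```

None of this is exotic; it just has to be designed in from the start, because backfilling rich descriptions and facet metadata onto thousands of existing features is far harder than capturing them at registration time.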
If I built this again: treat feature discovery as a product problem, not an infrastructure afterthought. The best feature store in the world doesn’t help if engineers can’t find what’s in it.
What we actually shipped
After all of it — the build decision, the authorization workaround, the discovery gaps — the system worked. Model development cycles dropped from six months to one week. Teams stopped rebuilding the same features independently. Training-serving skew nearly disappeared.
The performance targets we’d set held up: feature retrieval under 500ms for exploration, under 50ms for real-time inference, 99.99% uptime for serving infrastructure. The hybrid materialization approach — computing cold features on demand, pre-materializing hot ones — turned out to be the right call both for cost and for latency.
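The hybrid materialization logic reduces to a simple read path. A toy sketch of the idea, with a dict standing in for the online store and invented entity/feature names:

```python
# "Hot" features are pre-materialized in a fast online store (a dict
# here, standing in for something like Redis); "cold" features are
# computed on demand from the offline source. Names are made up.
HOT_STORE = {("vin_123", "avg_daily_miles"): 42.7}

def compute_from_offline(entity_id, feature_name):
    """Stand-in for an on-demand computation over raw offline data."""
    return 0.0  # placeholder result for the sketch

def get_feature(entity_id, feature_name):
    key = (entity_id, feature_name)
    if key in HOT_STORE:
        # Pre-materialized path: this is what keeps real-time
        # inference under the latency budget.
        return HOT_STORE[key]
    # Cold path: compute on demand, optionally promoting the result
    # so repeat reads get the fast path.
    value = compute_from_offline(entity_id, feature_name)
    HOT_STORE[key] = value
    return value
```

The cost argument falls out of the same structure: you only pay to pre-materialize features whose access patterns justify it, and everything else stays as cheap offline compute.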
700,000+ engineering hours saved annually, across Ford’s ML teams. That number still surprises me when I say it out loud.
What I’d tell someone starting this today
Don’t buy a feature store if your authorization requirements are genuinely complex — most off-the-shelf solutions treat access control as an afterthought. Build, but scope it aggressively and add complexity only when the pain of not having it becomes undeniable.
Invest in discovery earlier than feels necessary. The catalog is the product. The infrastructure is just how the catalog stays accurate.
And finally: the training-serving skew problem is real, it’s insidious, and it will undermine everything else you build if you don’t treat consistency as a first-class architectural requirement from the start. We did. It paid off.
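The principle that made consistency enforceable was having exactly one definition of each feature's logic, executed by both the offline and online paths. A minimal sketch — the transform itself is invented for illustration:

```python
def miles_per_day(total_miles, days_owned):
    """Single source of truth for this feature's logic."""
    return total_miles / max(days_owned, 1)  # guard against day-zero records

def build_training_rows(records):
    """Offline path: apply the shared definition over historical rows."""
    return [{"vin": r["vin"],
             "miles_per_day": miles_per_day(r["total_miles"], r["days_owned"])}
            for r in records]

def serve_feature(request):
    """Online path: the exact same function, so skew cannot creep in."""
    return miles_per_day(request["total_miles"], request["days_owned"])
```

Skew appears the moment someone reimplements that function — a different null guard, a different rounding — in the serving path. Making reimplementation structurally impossible is what "first-class architectural requirement" meant for us in practice.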