Lessons From The Trenches: How We Chose Our Experimentation Platform

In every organization striving to deliver products users adore, there comes a turning point where the rallying cry of “we need to experiment more” shifts to “we simply can’t sustain this pace of experimentation.” Picture this: manually configured holdout groups, traffic-routing requests ping-ponging between product managers and engineering teams, and analysts’ schedules fully booked weeks in advance. The ambition to become truly data-driven has essentially outgrown the infrastructure originally built to support it.

That was the exact situation we found ourselves in at ManyChat last year. Yes, we ultimately selected Eppo, but that choice is actually the least significant part of this story—and the hardest element to replicate at your own organization. What I’d rather walk you through is the journey I took to reach that decision, the missteps I made along the way, and the unexpected lessons I discovered once the deal was signed (yes, doctors would disapprove of this particular strategy).

A quick note on context. We chose Eppo during a particularly dynamic period in the industry, with the vendor landscape shifting beneath our feet even as we evaluated options. Eppo had already been acquired by Datadog several months prior. Statsig had been picked up by OpenAI and would later change hands again to Amplitude. I don’t believe the insights I’m sharing here are tied to that specific news cycle, but I want to be transparent that it certainly influenced our mindset throughout the decision-making process.

I’ve organized the rest of this into three phases: what happened before the decision, the decision itself, and what came after.

Before

Let me paint a picture of where we stood before any of this unfolded. When I first joined the company, an engineer shared something telling: if his team had two promising experiment ideas on the table, they’d routinely shelve the second one for a future sprint. Why? Because the technical complexity of managing two simultaneous traffic allocations was simply too daunting. The risk of misconfiguration eventually eclipsed the enthusiasm to test. This is, quite literally, the opposite of velocity at best—and a complete halt to experimentation at worst. And for the single experiment they did manage to set up, copying and pasting boilerplate allocation code was their daily routine.
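To make "two simultaneous traffic allocations" concrete, here is a minimal sketch of deterministic, hash-based assignment. It is a generic illustration, not our old in-house code and not any vendor's SDK; the names `bucket`, `assign`, and `BUCKETS`, along with the experiment keys, are hypothetical. Salting the hash with the experiment key is what lets two experiments split the same users independently of each other.

```python
import hashlib

BUCKETS = 10_000  # hypothetical resolution: 0.01% of traffic per bucket


def bucket(user_id: str, experiment_key: str) -> int:
    """Map a user to a stable bucket, salted by experiment key."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % BUCKETS


def assign(user_id: str, experiment_key: str, variants: dict[str, float]) -> str:
    """Pick a variant from {name: traffic_share}; shares must sum to 1."""
    if abs(sum(variants.values()) - 1.0) > 1e-9:
        raise ValueError("variant shares must sum to 1")
    point = bucket(user_id, experiment_key) / BUCKETS
    cumulative = 0.0
    for name, share in variants.items():
        cumulative += share
        if point < cumulative:
            return name
    return name  # guard against floating-point rounding on the last variant


if __name__ == "__main__":
    split = {"control": 0.5, "treatment": 0.5}
    # The same user gets a stable answer per experiment, and the two
    # experiments are split independently of each other.
    print(assign("user-42", "onboarding_copy", split))
    print(assign("user-42", "pricing_banner", split))
```

The point of the sketch is that the logic itself is small; the risk the engineer described comes from copying it by hand into every experiment and keeping the shares and keys consistent.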

An analyst working downstream from that same pipeline described herself as a “human microservice”—and she meant it. Holdout groups were defined manually, refreshed manually, then handed off to engineers manually, and so on. It was certainly a first-person, immersive experience of the entire workflow. But setting the irony aside, that was the moment the need for a proper platform stopped being a theoretical discussion and became painfully concrete.

I’d encountered versions of this challenge before. At Marktplaats, several years back, I’d built internal Python libraries designed to absorb exactly this kind of friction, and we managed to shrink time-to-insight from days to hours in the trickiest cases.

I watched the same build-versus-buy debate unfold again at Adevinta, this time on a global scale, and the conclusion there was to build in-house. Fortunately for us at ManyChat, by late 2025 the available platform solutions had matured to the point where, for a company of our size and at that particular juncture, purchasing was clearly the smarter path.

We were looking for the tool that would give us the best chance of building a world-class experimentation program: cutting-edge statistical methods, absolutely, but more critically, a platform that naturally guides its users—product managers included—toward running experiments that yield definitive answers.

Two obstacles stood between us and a final choice. The first was straightforward: we had identified the pain, but it was still largely anecdotal. Leadership had a solid intuition about what was broken, and I’d heard developers and product managers voice frustrations about the existing setup during my early days. But none of that constituted a structured vendor requirements document. Until we could lay the two side by side, we had no way to distinguish genuine must-haves from nice-to-have features.

The second challenge was more complex. The decision carried significant weight because, regardless of how you frame it, every platform involves some degree of lock-in—cultural if not technical. And resources are limited: we couldn’t possibly run proof-of-concept trials with every vendor on the market. Not to mention the sunk cost of having to reverse course and start from scratch. Picking one platform in a single shot, with no opportunity to adjust, would have been a recipe for regret. And with the offerings being so comparable in many respects, identifying the best fit for us demanded surgical precision. We needed a way to decompose one high-stakes decision into a series of smaller, lower-stakes steps that built upon each other.

Interviews, and de-risking the decision

I kicked things off with interviews—product managers, product analysts, engineers, marketers. The goal was to transform anecdotal frustrations into something we could systematically compare against a vendor’s capabilities. The engineer’s scheduling nightmare, the analyst’s “human microservice” role, the product manager who had abandoned granular experiments in favor of bundling changes into larger releases and deferring others entirely—these stories became the job specification for the tool we needed. I cannot overstate how valuable this exercise proved later. Every time the process veered off track—and it did—the interviews served as our anchor. They also gave the entire initiative credibility internally: explaining to my CPO why we were launching a proof-of-concept was a completely different conversation when I could quote a specific pain point directly back to her.

To address the single-shot problem, we structured the evaluation in three progressive layers, each probing deeper into the assessment:

Desk research. Review vendor documentation, compile a long list. Most platforms eliminated themselves at this stage, before we ever entered a sales process. I leaned heavily on Claude Code during this phase as well.
Demos. A focused session with each shortlisted vendor. Some sales pitch, certainly, but primarily us interrogating the areas we’d identified as most critical.
POC. Hands-on evaluation with real data and real users, reserved exclusively for the final two contenders.

Each layer winnowed the field and delivered insights at a cost we could justify. By the time we reached the proof-of-concept stage, we were down to two options, and the decision before us had been distilled to something genuinely manageable. Statsig, or Eppo?

There’s one element I’d replicate on day one of any future platform evaluation, regardless of category: the interviews that surfaced those pain points. They were the single most impactful unlock of the entire phase. A close second was sponsorship. And I don’t just mean from my director, who encouraged me to move it forward. I kept peers and stakeholders who would ultimately need to support and adopt the decision informed throughout the entire process. By the time the POC wrapped up, the outcome surprised no one.

At the close of the “before” phase, we had a shortlist of two finalists and a disciplined methodology for how we’d narrowed the field. We understood what mattered to us. The tougher question still loomed: between two platforms that both met our threshold, which was genuinely superior for us? How would we conceptualize “better,” and how would we reach consensus on it in practice?

During

It was the debrief session following the POC, and the analysts on the evaluation panel were sharing their perspectives one by one. Two of them, the most familiar with our existing stack, wrapped up their assessments with nearly identical conclusions:

“As a product analyst, I’d genuinely be delighted to move forward with either of them.”

I let that sink in for a moment. The consolidated scores from the panel told the same story: the two platforms earned 4.36 and 4.47 out of five, evaluated across more than twenty weighted criteria. By any honest assessment, it was a dead heat. I had invested weeks designing a process meant to surface a clear winner, and that process had just delivered a verdict, through the voices of the peers I trusted most to detect a real difference, that there was no meaningful difference from where they sat.
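For readers who want to reproduce the mechanics of a weighted scorecard, here is a minimal sketch. The criteria names, weights, and ratings below are invented for illustration and are not our actual matrix; `weighted_score` is a hypothetical helper. It averages the panel's ratings per criterion, then takes the weight-normalized mean.

```python
from statistics import mean


def weighted_score(weights: dict[str, float], ratings: dict[str, list[float]]) -> float:
    """Average panel ratings per criterion, then combine with normalized weights."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"no ratings for criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(w * mean(ratings[c]) for c, w in weights.items()) / total_weight


if __name__ == "__main__":
    # Hypothetical weights and 1-5 ratings from a two-person panel.
    weights = {"statistical_methods": 3, "pm_workflow": 2, "warehouse_integration": 1}
    platform_a = {
        "statistical_methods": [5, 4],
        "pm_workflow": [4, 4],
        "warehouse_integration": [5, 5],
    }
    print(round(weighted_score(weights, platform_a), 2))
```

A gap of about a tenth of a point on a five-point scale, as in our case, is well within what a single reviewer changing one rating could move, which is why I read the result as a tie.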

The insight I gained in that moment, one I never would have reached without the panel, is that analyst-level rigor is now the bare minimum. The incremental advantage of selecting one modern experimentation platform over another does not show up on your scorecard; it shows up somewhere else. Figuring out exactly where became the new question I had to answer.

So I needed a choice I could defend: first to myself, then to my data director and CPO, then to the teams who would inherit it. Coin flips and gut feelings are shaky ground for a multi-year commitment. And the tie meant the tiebreaker could not be reverse-engineered after the fact; it had to reflect what we genuinely wanted from the next phase of experimentation at ManyChat.

Put simply, we were not comparing two static snapshots; we were comparing two paths forward. Eppo’s vision centered on guided, opinionated, PM-friendly *cough* proof *cough* workflows; Statsig’s leaned into power-user flexibility. Both were reasonable bets. But we had said, remember:

We wanted the tool that would give us the best shot at getting our experimentation program where we want it: cutting-edge statistics, yes, but more importantly a tool that nudges its users toward conclusive experiments by default (…)

I noticed what did not happen. The proof-of-concept plan called for PMs to try both platforms and feed their scores back into the matrix. For the most part, they did not, because of limited bandwidth. One head of marketing operations and one PM shared unsolicited impressions, but the rest of the PM-side evidence and input remained thin. The lack of PM feedback had a paradoxical effect: it raised the importance I placed on PM-facing UX, workflows, and governance in the final decision. The reasoning is asymmetric. Analysts are flexible, power users by nature; they will figure out whatever interface you put in front of them. PM onboarding does not bend the same way. If the platform our analysts rated equally is also the one that lowers the barrier for our PMs, that is a clear decision; the alternative, choosing the analyst-equivalent platform that would have tripped up our PMs, would have been quiet self-sabotage.

In short, we could finally say: with everything else roughly equal, the ease of use for non-technical people is what truly separates the two platforms.

So we chose Eppo. The trajectory question was what tipped the scales: over a longer horizon, Eppo aligned better with where we wanted experimentation to sit, closer to the teams running experiments and not just the analysts. Knowledge management treated as a first-class concern. Reporting that does not require a slide deck rebuilt from scratch. Statsig had its strengths too (CUPED, a variance-reduction method, built into its power calculator; a standalone metrics explorer; a more flexible analysis surface), and we accepted those as Year 1 gaps to work around while Eppo was being rebuilt inside Datadog and catching up on those features.
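For anyone who has not met CUPED, here is a minimal sketch of the idea. It is a textbook version written for illustration, not how Statsig or Eppo implement it; `cuped_adjust` and the sample numbers are hypothetical. Each user's in-experiment metric is adjusted with a pre-experiment covariate (typically the same metric measured before exposure), which keeps the mean unchanged while shrinking the variance.

```python
from statistics import mean, variance


def cuped_adjust(metric: list[float], covariate: list[float]) -> list[float]:
    """Return metric values adjusted by a pre-experiment covariate.

    theta = cov(covariate, metric) / var(covariate)
    adjusted_i = metric_i - theta * (covariate_i - mean(covariate))
    """
    if len(metric) != len(covariate) or len(metric) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(covariate), mean(metric)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, metric)) / (len(metric) - 1)
    var_x = variance(covariate)
    if var_x == 0:
        return list(metric)  # a constant covariate carries no information
    theta = cov_xy / var_x
    return [y - theta * (x - x_bar) for x, y in zip(covariate, metric)]


if __name__ == "__main__":
    # Hypothetical per-user values: pre-period and in-experiment metric.
    pre = [10, 12, 9, 15, 11, 14]
    post = [11, 13, 10, 17, 12, 15]
    adjusted = cuped_adjust(post, pre)
    print(variance(post), variance(adjusted))
```

In a real analysis, theta is usually estimated on data pooled across control and treatment so that the same adjustment applies to both arms. The stronger the correlation between the pre-period covariate and the metric, the larger the variance reduction, which is why having it inside a power calculator matters: it changes how long an experiment needs to run.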

Looking back, the lesson I draw from it cuts both ways. The decision demanded more rigor than instinct wanted, and then less faith in that rigor than I expected. The scorecard mattered because it forced everyone to be precise and built trust and credibility in the outcome. It gave me 360-degree coverage, but the final call came from the pivotal moments within it: the analyst tie, and the vision question. Six months after signing, a curious colleague would ask me how we had chosen, and I could walk them through the panel, the scorecard, the adjustments, and the vision framing. That, to me, is a win.

After

I think I expected, in some part of me I would not admit out loud, that signing the contract was the finish line. I had spent weeks constructing a credible decision-making process and had logged a couple of hours of vendor calls. The week we signed, I had a quiet day. I sat down at my desk and opened a working document about what would come next. Legend has it I am still writing it.

The clean-water metaphor I had used in the proposal kept resurfacing. We had laid the pipes; that was the SDK integration, the data plumbing, the warehouse connections. The platform itself, too, if you want to think of it that way. Pipes give you flow, but not clean water. In the worst case, pipes make things worse instead (more low-quality output, faster). Clean water is what comes out of pipes when the rest of the system (the source, the treatment, the people maintaining it) does its job. Experiments work the same way: a platform gives you the flow, but trustworthy results come from governance and process, from people, and from how seriously the organization treats the gap between testing an idea and shipping a feature.

The tool is ready; the organization is not yet ready for the tool.

Up to that point, I had been focused on the cost of the contract, not the cost of closing the gap between “the tool is here now” and “the organization is ready to use it.”

I had told colleagues, in the weeks before signing, that a portion of the analytics team’s capacity would gradually free up and settle into a new steady state once Eppo was live. As of this writing, I am still hopeful that will happen a quarter or two from now, but not before we get some things in place first. Velocity, the sheer ability to run more experiments in a given period, also has to wait.

Signing did not give us time back, nor did it immediately bring us more experiments. The work that began the day after signing (forming a cross-functional integration group, drafting the experiment lifecycle, configuring Eppo protocols as part of its governance framework, certifying our first success metrics and guardrails, migrating a knowledge base, designing a training curriculum) all had to happen before the platform could deliver the velocity potential we knew it had. In short, what lay ahead was not a tool problem. It was a governance, process, and people problem.

Three legs of a stool

For experiments to actually be trustworthy at ManyChat, three things have to exist at the same time: the tooling and engineering integration so experiments can flow through the platform, process and governance so the experiments flowing through are properly designed and decided on, and people and skills so best practices are followed in practice and not just on paper. Remove any one of the three and the whole thing wobbles.

We had the tool and the integrations now. Process and governance lived mostly on the data science team: a five-stage experiment lifecycle (Propose, Design, Run, Analyze, Decide) and a certified set of success and guardrail metrics, all of it encoded into the platform’s own protocol templates so the guardrails were not a Notion page but a built-in feature of the tool. People and skills would take shape through ad hoc Eppo-delivered quick-start sessions and, longer term, an Experimentation 101 and 102 curriculum. A graduated autonomy model, with PMs paired with analysts at first and given more independence over time, is the distant goal on the horizon, and the push toward it is ongoing.

The other thing

A subtler lesson: signing Eppo was where my role changed. I had entered the project as the Staff member responsible for selecting a tool. I came out of it doing change management: onboarding teams, teaching, holding PMs accountable to lifecycle compliance, spending political capital I had saved for other things. It was completely worth it for me, though.

Closing notes

If I had to distill all of this, these are the lines I would fit it into:

A credible decision is the deliverable, not the platform. The platform is an artifact. The decision is what your organization will live inside for years.

In the same spirit, pipes are not water. A tool is necessary infrastructure for trustworthy experimentation, but not sufficient. The work begins, not ends, on the day the contract is signed.

I am writing all of this knowing the experimentation tools market is shifting; the vendor turnover I noted earlier has not slowed down. Whatever the landscape looks like by the time you read this, the process elements that endured for me are probably the ones worth borrowing: the interviews, the phased discovery, the vision framing, and the honest accounting for what comes after.

If you want to dig into the details over an online cup of coffee, feel free to reach out to me on LinkedIn! I would be happy to exchange ideas with you.

Also check out my personal page for more pieces like this.
