Image courtesy Joel Mier (note: not the Netflix research facilities but you get the idea)

The Difference Between Research & Testing: Netflix Edition

One provides directional feedback, the other a winner.

Barry W. Enderwick
8 min read · Jun 19, 2019


We had just wrapped up a focus group session at Netflix and we were elated. One of us turned to the group and said, “Testing showed that 90% of respondents loved it!” It was all high-fives and happiness behind the glass. The problem was, it wasn’t true.

Why? Because what we had just gone through was an exercise in research, not testing. Unfortunately, this difference is not widely understood. And it needs to be. Knowing the difference helps you understand how to weigh the results and consider the implications, which ultimately leads to better decisions.

Research

At Netflix, we had an on-site, in-house research facility. On a fairly frequent basis, we’d conduct focus groups with members. Usually in groups of 10 or so for about an hour. Our objective was to learn how customers interacted with our marketing materials, new features, new designs, etc.

This research, always led by an experienced moderator, allowed us to gain directional feedback. It was not, however, “testing.”

Let’s look back at that focus group I referenced at the beginning of this. The one where 9 out of 10 people loved what we showed them. As exciting as that seems, it actually tells us nothing in terms of how the rest of our members would behave had they been presented with the same material.

In other words, it is not 90% of our members; it's 90% of a very small (statistically insignificant) group. It's guidance. It indicates that it is something we might want to explore further (by the way, asking trusted advisors what they think is pretty much the same thing: anecdotal). While it's valuable input, it's important not to confuse it with a test result. That said, even with an understanding of these limitations, the guidance often helped us cull down options and reinforce suspicions, leading to much more effective testing.

In focus groups, people provide a myriad of answers. Some address the questions asked. Others try to play the role of consultant. You can tell this is happening when a respondent’s sentence starts off with something like, “Most people…” Still others would say one thing and do another.

For example, one respondent was handed a letter, a direct mail concept we were looking into. And while he was given no instructions, he proceeded to read the entire thing and was able to later recount details from it. Yet, when he finished reading, he said, “Yeah, I would never read this.”

At the end of it all, the directional insights would indicate that we were heading in the right direction for a concept or concepts but it was not a test.

Also, a quick note about moderating focus groups. Moderating focus groups is like marketing — everyone sees it so they think they can do it. Even if a person is social, engaging, and perhaps good at talking to people, it doesn’t mean they can just step in and be a great moderator.

A great moderator won’t just get answers to questions but will engage in dialogue with respondents without biasing or leading them. Proper dialogue gets respondents to “ladder up.” That is, it gets to the real reason behind their initial answers. And that is gold.

Takeaway:
Research provides directional feedback and insights. And while not statistically significant, it allows companies to come up with new ideas for A/B testing that might not have been obvious. It can also inform which ideas not to pursue given finite resources. It is not, however, testing.

Testing

Before we get cracking on testing, it is important to understand that there are different kinds of quantitative testing.

Let’s say we sent a quantitative questionnaire to 1,000 customers, and 90% of respondents said that “Netflix should sell used DVDs.” That is meaningful data supporting used DVD sales. It’s legitimate testing. How important is the result? A little, but not a lot.

Why? Because we have learned that what people report and what they do are frequently disconnected. For instance, people SAY they want more options. But empirical data has shown that when presented with a lot of options they are unhappy, confused, and ultimately often don’t do what they said they’d do.

So while it’s a real test, it too is directional. It indicates that we should further explore the concept. But the best test is based on what people actually do. This brings us to A/B testing.

A/B tests are golden and reasonably well-understood: In essence, they are controlled scientific experiments, where two (or more) presentations of something are given to statistically meaningful numbers of people to see how they behave. If they respond differently, you can confidently determine it was due to the variation.
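How the traffic gets split varies by company, but the mechanical heart of an A/B test is a stable, effectively random assignment of each person to one cell so that the groups are comparable. Here is a minimal sketch in Python of one common approach, hashing a user ID into a cell; the function names and the hashing scheme are illustrative assumptions, not a description of how Netflix actually did it.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to one test cell.

    Hashing the user ID together with the experiment name gives a stable,
    roughly uniform split without having to store any assignment state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same cell, visit after visit.
for uid in ["user-001", "user-002", "user-003"]:
    print(uid, assign_variant(uid, "signup-page-redesign"))
```

The deterministic hash matters because a person who sees version A today and version B tomorrow is no longer a clean data point for either cell.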

But there are a few potential problems.

1/ Variables — the elements that change from piece to piece in the testing stimulus. If you change a lot of variables between what people are seeing, you will have trouble determining which element was causal. If you don’t change enough, you might not see a measurable change at all.

2/ Statistics. If your audience is small, it's very difficult to assemble a meaningful sample. Tests can be run on smaller samples, of course, but then you have to understand the accuracy limits so you can interpret the results with appropriate confidence.

For instance, if you can only sample 100 people in each test group, and one group performs 10% better than the other, you don’t have the statistical confidence to say that group “wins,” because a 10% difference is within the margin of error at that sample size. You might need a 50% difference in order to declare with confidence that one group was better than another. Small samples can only show large differences. Of course, if you’re small, you may only be interested in large improvements, so this can be okay.

Customer retention, for example, is a much more nuanced metric, one where small changes have considerable value, and something we were often trying to measure at Netflix. And since we were looking for very small changes in a positive direction, we needed extraordinarily large sample sizes; very different from the big changes we needed in other parts of the customer experience, where we could get confidence with smaller cohorts.
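To make that arithmetic concrete, here is a rough sketch in Python of both points: a 10% relative lift with only 100 people per cell sits well inside the noise, while chasing a one-point retention improvement requires very large cells. The specific numbers (30 vs. 33 conversions, a 90% baseline retention rate) are invented for illustration, and the formulas are the standard normal approximations, not whatever tooling Netflix actually used.

```python
from math import sqrt, erf, ceil

def z_test_two_proportions(x_a, n_a, x_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf
    return z, p_value

# 100 people per cell, B converts 10% better in relative terms (30% vs. 33%):
# the p-value is ~0.65, nowhere near a confident "win."
print(z_test_two_proportions(30, 100, 33, 100))

def sample_size_per_group(p_base, p_target, z_alpha=1.96, z_beta=0.84):
    """Approximate n per cell to detect p_base -> p_target (95% confidence, 80% power)."""
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_target - p_base) ** 2)

# Detecting a one-point retention lift (90% -> 91%) takes roughly 13,500 people per cell.
print(sample_size_per_group(0.90, 0.91))
```

The same formula also shows why big swings are cheap to detect: moving a 30% rate to 45% needs fewer than two hundred people per cell.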

3/ The need to limit customer bias. A great example of this is presenting a new online experience. If you show a new experience to an existing customer, they are more likely to prefer the previous one they had known. By showing it to new customers only, you can read the data about how they behave and avoid biasing the results.

If a small company makes a change to the online experience, and they’re carefully measuring “the response,” they’ll likely find that a large number of existing customers rebel, complain, or even quit buying from them altogether. The company will likely quickly revert to the previous version.

But it could be possible that while existing customers hate it, new customers love it. That data would be masked because there wasn’t a test cell of isolated new customers being examined.

When your audience is small, you want to be winning more and more new customers and there’s the risk that you’re trying to satisfy existing customers too much. At Netflix, we used to say our features weren’t for the 5M customers we had, but the 10M customers we wanted over the next year. New customer experience was key in most tests.
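One quick way to see the masking effect is to read the same test two ways, blended and split by customer segment. The sketch below uses entirely made-up counts: existing customers dip a few points on a redesign while new customers improve by ten, yet the blended number makes the redesign look like the loser.

```python
# Hypothetical counts for one control/redesign cell pair. Every number here is
# invented purely to illustrate how an aggregate read can hide a segment-level win.
results = {
    "existing": {"control": (720, 800), "redesign": (680, 800)},  # (retained, total)
    "new":      {"control": (80, 200),  "redesign": (100, 200)},  # (converted, total)
}

# Segment view: existing customers drop 90% -> 85%, new customers rise 40% -> 50%.
for segment, cells in results.items():
    for cell, (ok, n) in cells.items():
        print(f"{segment:8s} {cell:9s} {ok / n:.0%}")

# Blended view: the redesign looks worse overall (78% vs. 80%)...
for cell in ("control", "redesign"):
    ok = sum(results[seg][cell][0] for seg in results)
    n = sum(results[seg][cell][1] for seg in results)
    print(f"overall  {cell:9s} {ok / n:.0%}")
# ...even though the customers you are trying to win prefer it by ten points.
```

Nothing about these numbers is special; the point is simply that unless new customers are isolated in their own test cell (or at least their own cut of the data), their behavior gets swamped by the much larger existing base.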

Some companies, especially startups, don’t test but instead trust leaders’ instincts. Some companies overtest — testing everything and getting somewhat paralyzed by the process.

The key, as we found at Netflix, was to master the balance of having good tests, researching big propositions, testing big propositions, and trusting the leaders to gain the wisdom to not test too many things.

When we could generally predict the outcome of tests (and we always tried), we’d gain the confidence to take bigger and bigger guesses without testing, leaving those resources to the “hard-to-impossible” kinds of propositions we needed to sleuth out.

The key is to become exceptionally good at testing. Not only the application of testing to learn but also knowing what a test is and what to test. Contrary to popular belief, at Netflix, we didn’t test everything — we tested big things. And we got very good at it.

Takeaway:
True testing requires scale and the ability to be confident in statistical significance. It is best represented by A/B testing, which allows multiple versions of a concept to be shown to a new audience in order to gauge efficacy. And, much like in strategy, what you choose not to do is as important as what you choose to do. It’s all too easy to get caught up in overtesting.

So which to choose? Both.

At Netflix, we used both research and testing and it was critical to the growth of the company. We used research to gain directional insights and testing to make iterative improvements to marketing, product, customer experience, and more at scale.

A small startup might not be able to engage in testing because it simply doesn’t have the volume. That’s okay. It can still do various kinds of research to generate new ideas and understand customers better, something it should be doing on an ongoing basis.

As the company grows, testing should kick in as well and be used on an ongoing basis. The key here is to not stop conducting research just because the volume for testing is there. After all, if all you do is test, you’ll only get the best result of the subset you’re testing, not necessarily the best result achievable.

Recognizing the difference between research and testing, knowing when to do which, and how to evaluate the results, will help propel any business. Just like Netflix.

This article was co-written by myself and a friend and former Netflix colleague, Michael Rubin, CEO of Neomodern in San Francisco. Also, thanks to friend and former Netflix colleague Bill Scott for the editorial gut-check.

Next up, we’ll take a closer look at optimization.
