
Duck Tales: How DuckDuckGo uses data science to measure marketing effectiveness, privately (Ep.29)
Inside DuckDuckGo
In this episode, Cristina (CMO) and Baran (Data Science) discuss our privacy-first approach to marketing measurement, the role of incrementality, and how the team operates day to day.
Disclaimers: (1) The audio, video (above), and transcript (below) are unedited and may contain minor inaccuracies or transcription errors. (2) This website is operated by Substack. This is their privacy policy.
Cristina: Hi, and welcome to DuckTales, where we go behind the scenes at DuckDuckGo and discuss the stories, technology, and people that help build privacy tools for everyone. In each episode, you’ll hear from employees about our vision, product updates, engineering approach to AI, or how we operate as a company. I’m Cristina on the marketing team, and I’m here today with Baran on the data science team. Would you like to say hi?
Baran: Hi everyone.
Cristina: So for our viewers, if you’ve already listened to episode 9 with me and Chuck, you know that, like our product philosophy, privacy is core to the ethos of our marketing. And most of the common marketing practices we just don’t do: identifying and targeting individual users, retargeting, using behavioral data, using third-party cookies and pixels, all hard nos. We have very thoughtfully developed privacy-respecting measurement that is a bit unique, but includes techniques that I think and hope all marketers can adopt. Thankfully, we’ve come a long way from when I was doing the campaign analysis. And thanks to everyone’s data science background, we have much more sophisticated techniques and models. And I’d really like to dive in on a slice of that today. So first question: most companies measure their marketing through attribution, tracking who clicked on what and then installed. We sometimes do a version of that when it’s possible, in aggregate and anonymously of course, but can you explain how it works for us, and is that even the right question to be asking?
Baran: Yeah, thank you, Cristina. On one hand, we have a self-built, simple and privacy-respecting attribution. And on the other hand, we have a few channels that do their own attribution. But with both of them, we have three issues. One, only a small percentage of our total impact through the campaigns is captured by attribution. Let’s take the most obvious one, TV campaigns. There might be a QR code on the screen, but most people would just go to the App Store or Play Store to download the app instead of scanning the QR code. Second, it doesn’t work on all channels. Many channels only support the industry-standard, more privacy-invasive attribution methods, not necessarily what we built. And third and most importantly, which is also valid for all other companies, attribution is not the same as incrementality. It measures who clicked, but not which channel actually caused, let’s say, a metric like an install. And the real question is actually incrementality.
Cristina: Okay, so then if attribution doesn’t answer the real question, how do we introduce incrementality and how does that answer the question?
Baran: So ideally it would be A/B testing for incrementality within a channel’s advertising platform. But there is no way to do this. Channels don’t provide this. That could be because it would make it very easy for marketing teams to understand the incremental impact. The closest thing we can do is randomized GeoLift tests.
Cristina: Okay, so incrementality is a concept I think a lot of marketers are likely familiar with, but to make sure everyone’s on the same page, can you walk us through what one of these tests actually looks like for us?
Baran: Yeah, we can take all the geos in a country. For example, in the US we have the DMAs, like New York and Los Angeles. We would randomize them into two groups, test and control. While doing this randomization, we optimize for comparability between the groups in terms of the metric that we’re interested in measuring, such as installs. And then we run the test; let’s say the test is launching a campaign in the test geos. We use the pre-period of the campaign to measure the relationship between the test and control groups’ aggregated metrics. Then, for the campaign period, we apply that relationship to the control geos’ actual metrics to estimate what we call the counterfactual: what we expect the metric in the test group to look like in the absence of a campaign. And the actual test results minus this counterfactual give us the causal impact.
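The steps Baran describes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not DuckDuckGo’s actual model: the pre-period "relationship" is reduced to a simple test/control ratio, "optimizing for comparability" is approximated by drawing many random splits and keeping the most balanced one, and all function names and numbers are hypothetical.

```python
import random

def aggregate(series_by_geo, geos):
    """Sum per-period values across a group of geos."""
    n_periods = len(next(iter(series_by_geo.values())))
    return [sum(series_by_geo[g][t] for g in geos) for t in range(n_periods)]

def pick_comparable_split(pre_by_geo, n_draws=200, seed=0):
    """Draw random half/half splits; keep the one whose pre-period totals are closest.

    Assumption: closeness of totals stands in for 'comparability'.
    """
    rng = random.Random(seed)
    geos = sorted(pre_by_geo)
    best = None
    for _ in range(n_draws):
        shuffled = geos[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        test, control = shuffled[:half], shuffled[half:]
        gap = abs(sum(aggregate(pre_by_geo, test)) - sum(aggregate(pre_by_geo, control)))
        if best is None or gap < best[0]:
            best = (gap, test, control)
    return best[1], best[2]

def estimate_lift(test_pre, control_pre, test_campaign, control_campaign):
    """Scale control actuals by the pre-period test/control ratio (the counterfactual),
    then subtract it from the test group's actuals."""
    ratio = sum(test_pre) / sum(control_pre)
    counterfactual = [ratio * c for c in control_campaign]
    lift = sum(test_campaign) - sum(counterfactual)
    return counterfactual, lift

# Made-up aggregated weekly installs, for illustration only.
test_pre, control_pre = [100, 110, 90], [200, 220, 180]
test_campaign, control_campaign = [130, 140], [210, 190]
counterfactual, lift = estimate_lift(test_pre, control_pre, test_campaign, control_campaign)
print(counterfactual, lift)
```

In this toy data the test group ran at half the control group’s level before the campaign, so the counterfactual is half of the control group’s campaign-period installs, and the lift is whatever the test group achieved on top of that.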
Cristina: Okay, I think that’s pretty clear. When the test is done though, what kind of answer do we get? Is it as simple as it worked or it didn’t?
Baran: Yeah, so it’s not necessarily binary, significant or not. In this case, unlike product tests, significant would mean the channel had a significant impact on installs. But maybe it did and it was still too expensive. What matters is, for example, a metric like cost per install, and especially the uncertainty around this metric. If the uncertainty is narrow, we have high confidence in the result. If it’s wide, we have lower confidence, and we make decisions based on this uncertainty. And we do this by using Bayesian methods, which help us make better decisions than a simple yes or no.
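To make the point about uncertainty concrete, here is a hedged sketch of how a cost-per-install range could be read off a set of posterior draws for incremental installs. The draws are simulated from normal distributions purely for illustration; the spend figure, the distributions, and the `cpi_interval` helper are assumptions, not anything described in the episode.

```python
import random

def cpi_interval(spend, lift_samples, lower=0.05, upper=0.95):
    """Summarize cost per incremental install from posterior draws of lift.

    Returns (lower quantile, median, upper quantile) of spend / lift.
    """
    # A draw with no positive lift bought no installs: treat its cost as infinite.
    cpi = sorted(spend / s if s > 0 else float("inf") for s in lift_samples)

    def pick(q):
        return cpi[min(int(q * len(cpi)), len(cpi) - 1)]

    return pick(lower), pick(0.5), pick(upper)

rng = random.Random(1)
# Hypothetical posteriors: same expected lift, very different uncertainty.
narrow = [rng.gauss(1000, 50) for _ in range(5000)]
wide = [rng.gauss(1000, 400) for _ in range(5000)]

print("narrow:", cpi_interval(20000, narrow))
print("wide:  ", cpi_interval(20000, wide))
```

Both hypothetical tests point to roughly the same central cost per install, but the second one leaves a much wider range of plausible costs, which is the situation where a plain "significant or not" answer hides the information that matters for the decision.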
Cristina: Got it. So if we think this is the gold standard, why isn’t everyone doing it?
Baran: I think the main reason is the operational complexity. A clean test ideally requires a full turn-on or turn-off in the test or control geos, and there are multiple parties involved in this. Let’s say you have a company with a paid search team and you’d like to do a test on the paid search channel. It would mean they would have to turn off their channel, and maybe they have goals that they have to hit.
Cristina: Yeah, this is an interesting peek into our culture. So can you help our viewers understand why we typically don’t have these problems?
Baran: Yeah, we have a small and unified marketing team. No one person or team is controlling a single channel. There are no, let’s say, siloed budgets per channel or channel group. There are no consequences if a channel stops. Let’s say we don’t run a certain channel anymore based on test results; the people running that channel would switch to another channel that makes more sense. And also, lacking more sophisticated attribution makes it even more important to run these randomized GeoLift tests. So in this case the necessity actually becomes a strength from the marketing data science point of view.
Cristina: I like the way you position that we turn a bug into a feature. We’ve had to get creative and develop marketing and data science methods that respect user privacy while still giving us the insights we need to make data driven decisions. I think it’s not just about what we do. It’s how we do it. That sets us apart. Our approach to marketing is as unique as our product philosophy. So what was it like the first time you ran one of these at DuckDuckGo?
Baran: The first time, it was also a new experience for me, because from previous work I was used to the operational complexity part. When we first discussed turning off a channel, I was amazed when the team said, okay, let’s just do it, let’s do it this week or next week. There was no conflict; even though it was a high-scale channel, nobody worried about their goals. In the end, the results were not favorable. So we ended up significantly cutting spend and optimizing the campaign based on those results, and since then we have repeated this approach many times.
Cristina: I think that’s a great example. I like to think of us as a somewhat nimble team. And yeah, one of our core values is questioning assumptions. And this marketing experiment is a clear example of that. I really appreciate how we’re not territorial. We very much have an all-on-the-same-team kind of mindset. So you get the results, then what happens?
Baran: So if it’s a new channel, let’s say we ran it for the first time: if the results are favorable, we might scale it up and start running it further at scale; we might keep testing if the results call for some optimizations; or if the results are really hopeless, we will just stop further testing. And if it was an existing channel, let’s say we turned it off in the test geos, then depending on the results: if the results were not favorable, we might adjust spend, change our, let’s say, target cost per 1,000 impressions or another metric that the channel allows, or optimize further; or if the results are more favorable than we thought, we might increase scale.
Cristina: Sounds like a lot of work. Do you have to keep running these tests every time?
Baran: Yeah, that would be very difficult, but the quick answer is no. Let’s say a digital channel has a meaningful attribution rate. Let’s say the channel claims, via attribution, 100 installs. Then we run this test and we see that we actually measured 50. Then we would call this 50% the channel’s attribution rate. And as I mentioned, we use Bayesian methods, so we also measure the uncertainty around this. Let’s say 50% is the point estimate and our full uncertainty is from 40% to 60%. We would apply this ratio to the ongoing attribution data that we get from the channel anyway, and we would monitor without retesting. But of course, this attribution rate could be subject to change. If there’s a material change, let’s say a major shift in the creatives, or a structural change in the channel’s placements, age groups and so on, then we would retest.
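The arithmetic in this answer is simple enough to write down. The sketch below uses Baran’s example numbers (100 attributed installs, 50 measured, an uncertainty range of 40% to 60%); the later figure of 300 attributed installs and both function names are hypothetical, added only to show the ratio being applied to new data.

```python
def attribution_rate(measured_incremental, attributed_by_channel):
    """Share of the channel's claimed installs that the test found to be incremental."""
    return measured_incremental / attributed_by_channel

def adjusted_installs(attributed_by_channel, rate_low, rate_point, rate_high):
    """Apply the tested rate, and its uncertainty range, to ongoing attributed installs."""
    return tuple(attributed_by_channel * r for r in (rate_low, rate_point, rate_high))

# From the example in the conversation: the channel claims 100, the test measures 50.
rate = attribution_rate(50, 100)

# Hypothetical later period: the channel reports 300 attributed installs.
low, point, high = adjusted_installs(300, 0.40, rate, 0.60)
print(rate, low, point, high)
```

With these numbers, a later report of 300 attributed installs would be read as roughly 150 incremental installs, with a plausible range of about 120 to 180, for as long as nothing material changes in the channel.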
Cristina: Interesting. So what about for bigger campaigns like TV, where it’s not possible or cost efficient to turn it on and off in specific regions?
Baran: Yeah, very interesting. If we would like to run a long, big TV campaign, we wouldn’t want to limit our scale by using just 50% of the geos, and even using a lower percentage as a control group could be very difficult due to how these channels operate. We use media mix models to measure the impact of such large channels. Like other companies, we modulate spend from week to week. We have on and off periods, but the challenge there is that TV is not the only channel that is running. We also have other, smaller or digital channels that are too small to model within the media mix model directly. Here is again where the randomized GeoLift tests come into play. Same principle: we use a Bayesian model where we include digital channels using their attributed installs as the inputs. And we don’t need the model to guess everything. We already have our high-confidence measurements from the randomized GeoLift tests, and we can put those attribution rates, along with their uncertainty, into this model as Bayesian priors. So in one unified model, we capture everything.
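The idea of feeding a test result into a model as a prior can be illustrated with the simplest possible case, a normal prior combined with a normal estimate from data. This is a toy stand-in, not the media mix model discussed here: a real model would estimate many channels jointly, and every number below is hypothetical.

```python
def normal_posterior(prior_mean, prior_sd, data_mean, data_sd):
    """Combine a normal prior with a normal estimate from data by precision weighting."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / data_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * data_mean)
    return post_mean, post_var ** 0.5

# Hypothetical inputs: a GeoLift test put the attribution rate near 0.50 with little
# uncertainty, while the mix model's own weekly data suggests 0.80 but is very noisy.
post_mean, post_sd = normal_posterior(prior_mean=0.50, prior_sd=0.05,
                                      data_mean=0.80, data_sd=0.20)
print(round(post_mean, 3), round(post_sd, 3))
```

Because the prior from the test is much tighter than the noisy signal in the weekly data, the combined estimate stays close to the tested rate. That is the sense in which the model does not have to guess everything for channels that are too small to measure on their own.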
Cristina: So then how do all the pieces fit together?
Baran: Yeah, we are doing the same thing that the industry agrees on. Attribution, randomized GeoLift tests and media mix models go hand in hand. One of them alone is not enough. Some channels have attribution, but attribution does not tell the whole story. GeoLift tests are great, but it’s impractical to run them all the time. And media mix models are great, but many channels are too small to be measured there on their own. So we’re combining them all together.
Cristina: Well, thank you very much, Baran, for explaining part of how we measure campaigns. I hope it’s helpful and interesting to folks. And, more importantly, thank you for your ongoing work evolving our techniques to get higher-confidence insights and decisions. We only spend marketing dollars on what actually grows the product, and we do it without compromising privacy. And frankly, we couldn’t do it without you. So thank you again, and we intend to keep you busy for a while. I so appreciate your time today and that of everyone who watched. So stay tuned for more episodes and bye for now.