Brenton McMenamin

The pitfalls of A/B testing in social networks

I'm frequently asked to help run A/B tests at OkCupid to measure what kind of impact a new feature or design change would have on our users. The usual way of doing an A/B test is to randomly divide users into two groups, give each group a different version of the product, then look for differences in behavior between the two groups.

The random assignment in a typical A/B test is done on a per-user basis. For example, if we re-designed our signup page, half our incoming users would get the new page (the "test group") and the rest would get the old page and serve as a baseline measure (the "control group"). Per-user random assignment is a simple, powerful way to test if a new feature changes user behavior (Did the new signup page entice more people to sign up?).
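
For concreteness, here's a minimal sketch of how per-user assignment is often implemented: hash each user's ID (together with an experiment name) so the same user always lands in the same bucket, with no stored state. The function name and the 50/50 split below are illustrative, not our actual implementation.

    import hashlib

    def assign_group(user_id, experiment="new_signup_page"):
        """Deterministically bucket a user into 'test' or 'control'.

        Hashing the user ID together with an experiment name means a user
        gets the same assignment every time they show up, without any
        stored state. (Illustrative sketch, not production code.)
        """
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        return "test" if int(digest, 16) % 100 < 50 else "control"

    print(assign_group("user_12345"))  # roughly half of incoming users land in "test"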

It's the perfect tool for most testing situations. Unfortunately, if you're doing tests for a product that relies heavily on interaction between users -- such as a dating app -- doing random assignment on a per-user basis can lead to unreliable experiments and misleading conclusions.

Why per-user assignment can fail

The whole point of OkCupid is to get users to talk with one another, so we often want to test new features designed to make user-to-user interactions easier or more fun. However, it's difficult to run an A/B test on user-to-user features using random assignment on a per-user basis.

Here's an example: Let's say one of our devs built a new video-chat feature and wanted to test if people liked it before launching it to all of our users. I could do an A/B test that randomly gave video-chat to one half of our users... but who would they use the feature with?

Video chat only works if both users have the feature, so there are two ways to run this experiment: you could allow people in the test group to video chat with everybody (including people in the control group), or you could limit the test group to only use video chat with other people that also happened to be assigned to the test group.
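
In code, the difference between the two designs comes down to the gating rule that decides when the feature shows up. The function names below are hypothetical; the group labels are just the per-user A/B assignments.

    def can_start_video_chat_open(sender_group, receiver_group):
        # Option 1: anyone in the test group can start a video chat,
        # even with people in the control group.
        return sender_group == "test"

    def can_start_video_chat_restricted(sender_group, receiver_group):
        # Option 2: the feature only appears when both people
        # happen to be in the test group.
        return sender_group == "test" and receiver_group == "test"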

Both approaches have major limitations.

If you let the test group use video chat with anybody, the people in the control group wouldn't really be a control group, because they'd be exposed to the new video chat feature. However, they'd get a weird, frustrating half-experience: other people could start video chats with them, but they couldn't initiate video chats with people they liked.

So maybe you decide to limit video chat to conversations where both the sender and receiver are in the test group. This would keep the control group free of video chat, but it would create an uneven experience for the users in the test group, because the video chat option would only appear in conversations with a random subset of the people they talk to. This could change their behavior in a number of ways that bias the experimental results:

  • They might not buy in to a feature that's intermittent ("I'll ignore this until it's out of beta")

  • Conversely, they could love the feature and buy in completely ("I only want to do video-chat"), thereby severing contact between the control and test groups. This would make things worse for everybody -- the test group would restrict themselves to a small corner of the site, and the control group would have a bunch of ignored messages and unreciprocated love.

Either workaround would result in biased and inaccurate experiment results.

Higher-order effects

Another limitation of per-user assignment is that you can't measure "higher-order effects" (also known as network effects or externalities if you're more business-y). These effects occur when the changes induced by a new feature leak out of the test group and affect behavior in the control group as well.

The video chat example above showed some of these effects, but higher-order effects can arise from almost any experimental manipulation -- even simple changes that don't immediately seem like they tap into user-to-user interactions. For example, let's say we wanted to test a new rule that required users to put 500 characters of text into their profile.

We'd expect this new rule to force the test group to write more interesting profiles, which would lead to a better experience on the site -- they'd get more and better messages because other users would know more about them. However, we can also anticipate that this would change the experience for people in the control group -- they'd see a sudden influx of users with interesting essays and also have an improved experience on the site, because they'd find more interesting people that they want to message.

So, this change would theoretically improve the experience for users in the test group as well as the control group -- a clear win that we would want to launch to everybody. However, if we A/B tested it with per-user assignment we may not see this as a clear win because the test looks for improvements to the test group relative to the control group.

In this case, the spill-over effect masks a real change in user behavior: the improvement is real, but it's obscured because the control group echoes it. It's also possible for higher-order effects to create an illusory change that disappears once you roll a feature out to everybody. Either way, you can't really trust the results of a per-user A/B test in a social network.

Using per-community random assignment

One alternative to per-user random assignment is to use per-community random assignment. In this case, a "community" is any group of users whose interactions are primarily directed to other users within the same group. Data teams at LinkedIn and Instagram have discussed their own uses for community-based A/B testing, but the hard part is figuring out how to define a "community" for your specific product.

A common mathematical approach to defining user communities is to model the relationships between users with a social graph, and then apply graph partitioning algorithms to find isolated, non-interacting groups.

For many social websites and apps, it's easy to translate the user interactions (e.g., messaging, friending, connecting, following) into a graph. Each user is a node, and edges are placed between nodes that have had some interaction. Then, you can apply graph partitioning methods -- such as Normalized Cuts -- to partition the nodes into groups with lots of within-group connections and relatively few between-group connections.
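
Here's a toy sketch of that idea in Python. The interaction list is made up, and I'm using scikit-learn's spectral clustering as a stand-in -- it's closely related to Normalized Cuts, but isn't necessarily the exact partitioning method we used.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # Toy interaction log: pairs of user indices that exchanged messages.
    interactions = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    n_users = 6

    # Adjacency matrix: one node per user, edge weight = number of interactions.
    A = np.zeros((n_users, n_users))
    for a, b in interactions:
        A[a, b] += 1
        A[b, a] += 1

    # Split the graph into groups with many within-group edges and few
    # between-group edges; spectral clustering on a precomputed affinity
    # matrix is closely related to the Normalized Cuts approach above.
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(A)
    print(labels)  # e.g., [1 1 1 0 0 0]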

However, the social graphs for dating apps are a bit different from those that arise on other social media platforms. In dating apps, a typical user is focused on finding new people to talk to rather than maintaining contact with existing connections, so the community is really defined by "anybody that's near you" as opposed to "people you have a history of interacting with". Rather than building a social network to describe connections between pairs of users, I created a "geo-social network" by calculating how frequently connections were made between pairs of locations. When graph partitioning was applied to this graph, we got a set of geographic regions that can serve as different test regions for our experiments.

Defining geographic communities

So defining geographic regions for the experiment is easy, right? You just randomly assign each city to a particular experimental condition. But... as anybody who has looked at the myriad ways the census defines boundaries for cities and metro regions knows, it's tough to tell where a city ends.

And it gets even harder when you realize that there isn't a single consensus 'dating market' associated with each city. Every person defines their own unique set of geographic boundaries. Somebody who lives downtown might talk to people in the nearby suburbs, but no further; the people in those suburbs would talk to people in suburbs further out; and the people in those suburbs might talk to people the next town over.

However, I can take a mathematical approach to defining 'optimal' regions: using Shi & Malik's Normalized Cut algorithm for graph partitioning, I can draw boundaries that maximize the number of connections within each region relative to the number of connections that cross between regions.

Here is a general outline of the method, followed by a rough sketch in code (if you want the actual mathematical proofs, you can go to the original pdf):

  1. Build a graph to describe how many messages were sent between pairs of US/Canadian locations as the square adjacency matrix A (size num_Locations-by-num_Locations). The value at A[i,j] is the number of messages sent between users in location i and users in location j. I didn't differentiate between being a message's sender or receiver, so A is symmetric.

  2. Calculate the graph Laplacian, L, from the adjacency matrix, A, and the degree matrix, D, as L = D - A. The degree matrix is just a diagonal matrix with the degree of each node on the diagonal and zeros everywhere else.

  3. Calculate the eigendecomposition of L, and inspect its eigenvalues. Each fully-isolated (i.e., disconnected) set of nodes will be represented by an eigenvector with an associated eigenvalue of 0. Further segmentation of these regions is done by looking at the patterns in the eigenvectors with the k smallest, non-zero eigenvalues. The value of k can be varied depending on how many regions you want to make.
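
Here's a rough sketch of steps 2 and 3 in Python, assuming you already have the adjacency matrix A from step 1. The k-means step at the end is one common way to turn the eigenvector embedding into discrete regions; the outline above doesn't prescribe a specific clustering step, so treat that part as an assumption.

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def partition_locations(A, k, n_regions):
        """Partition a location-by-location message graph into regions.

        A is the symmetric adjacency matrix from step 1. Following steps
        2-3: build the Laplacian L = D - A, take the eigenvectors with the
        k smallest eigenvalues, and group locations by their values in
        that embedding.
        """
        D = np.diag(A.sum(axis=1))           # degree matrix
        L = D - A                            # graph Laplacian
        eigenvalues, eigenvectors = eigh(L)  # eigh returns eigenvalues in ascending order
        embedding = eigenvectors[:, :k]      # eigenvectors for the k smallest eigenvalues
        return KMeans(n_clusters=n_regions, n_init=10,
                      random_state=0).fit_predict(embedding)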

I varied k over a wide range (5 to 250) and tested how well each set of regions did at identifying separable user communities by measuring the ratio of messages sent between regions relative to messages sent within each region. It turns out that optimal separation among US and Canadian cities occurred with a 36-region solution. After defining these regions, I validated their separability on fresh, live data that didn't go into the segmentation analysis and found that ~95% of all interactions (messages, likes, profile views) occurred between pairs of users within the same region.
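
The containment metric itself is simple to compute. Here's one hypothetical way to measure the fraction of interactions that stay within a region, given the location-by-location matrix A and a region label for each location -- not necessarily the exact calculation we ran.

    import numpy as np

    def within_region_fraction(A, labels):
        """Fraction of all interactions that stay inside a region.

        A is the location-by-location interaction matrix; labels holds
        each location's region assignment.
        """
        labels = np.asarray(labels)
        same_region = labels[:, None] == labels[None, :]
        return A[same_region].sum() / A.sum()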

Here's a map that shows what these regions look like.

Conclusion

Now that we have a set of regions that (mostly) don't interact with one another, we can deploy experiments on a per-community basis. This lets us accurately test changes to user-to-user interactions and avoid the pernicious influence of higher-order effects that spread across a social network.

However, there are drawbacks to using per-community assignment. I have had to make changes to our statistical analyses to reflect that we now have a nested experimental design (rather than each user being assigned to a group independently). Unfortunately, this design has considerably less statistical power.
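
For example, one simple (hypothetical) way to respect the nested design is to aggregate each metric up to the region level and compare the two conditions region by region -- which makes the loss of power obvious, since each region becomes a single observation. The numbers below are made up for illustration.

    import numpy as np
    from scipy import stats

    # Made-up per-region summaries (e.g., mean messages sent per user in each region).
    test_regions = np.array([4.1, 3.8, 4.4, 3.9])      # regions given the new feature
    control_regions = np.array([3.6, 3.7, 3.5, 3.9])   # regions kept on the old experience

    # With per-community assignment the region, not the user, is the unit of
    # randomization, so a cluster-level comparison treats each region as a
    # single observation -- far fewer observations, hence less power.
    t_stat, p_value = stats.ttest_ind(test_regions, control_regions)
    print(t_stat, p_value)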

In conclusion: A/B testing is conceptually simple, but can be difficult to execute if your product involves anything you'd consider "social interaction". Due to interactions between users and higher-order effects, a standard off-the-shelf approach to testing can be misleading. There is not a single approach for solving this that'll work across all products. So... I guess my final recommendation is that you should hire some data scientists that like doing experiments.