# The Explore-Exploit Dilemma in Media Consumption

How much should we rewatch our favorite movies (media) vs keep trying new movies? Most spend most viewing time on new movies, which is unlikely to be good. I suggest an explicit Bayesian model of imprecise ratings + enjoyment recovering over time for Thompson sampling over movie watch choices.
topics: statistics, decision theory, psychology, Bayes
created: 24 Dec 2016; modified: 12 Mar 2019; status: notes; confidence: possible;

When you decide to watch a movie, it can be tough to pick. Do you pick a new movie or a classic you watched before & liked? If the former, how do you pick from all the thousands of plausible unwatched candidate movies; and if the former, how soon is too soon to rewatch?

I tend to default to a new movie, reasoning that I might really like it and discover a new classic to add to my library. Once in a while, I rewatch some movie I really liked, and I like it almost as much as the first time, and I think to myself, “why did I wait 15 years to rewatch this, why didn’t I watch this last week instead of movie X which was mediocre, or Y before that which was crap? I’d forgotten most of the details, and it wasn’t boring at all! I should rewatch movies more often.” (Then of course I don’t because I think “I should watch Z to see if I like it…”) Maybe many other people do this too, judging from how often I see people mentioning watching a new movie and how rare it is for someone to mention rewatching a movie; it seems like people predominantly (maybe 80%+ of the time) watch new movies rather than rewatch a favorite. (Some, like Pauline Kael, refuse to ever rewatch movies, and people who rewatch a film more than 2 or 3 times come off as eccentric or true fans.) In other areas of media, we do seem to balance exploration and exploitation more - people often reread a favorite novel like a Harry Potter novel and everyone relistens their favorite music countless times (perhaps too many times) - so perhaps there is something about movies & TV series which biases us away from rewatches which we ought to counteract with a more mindful approach to our choices. In general, I’m not confident I come near the optimal balance, whether it be exploring movies or music or anime or tea.

The tricky thing is that each watch of a movie decreases the value of another watch (diminishing marginal value), but in a time-dependent way: 1 day is usually much too short and the value may even be negative, but 1 decade may be too long - the movie’s entertainment value ‘recovers’ slowly and smoothly over time, like an exponential curve.

This sounds like a classic reinforcement learning (RL) exploration-exploitation tradeoff problem: we don’t want to watch only new movies, because the average new movie is mediocre, but if we watch only known-good movies, then we miss out on all the good movies we haven’t seen and fatigue may make watching the known-good ones downright unpleasant.

In the language of optimal foraging theory (see ch4 of Foraging Theory, Stephens & Krebs 1986), we face a sequentially-dependent sampling patch problem - where the payoff of each patch can be estimated only by sampling each patch (before letting it recover) and where our choices will affect future choices; the usual marginal value theorem is of little help because we exhaust a ‘patch’ (each movie) before we know how we like it (as we can safely assume that no movie is so good that rewatching it twice in a row is superior to watching all other possible movies), and even if we could know, the marginal value theorem is known to over-exploit in situations of uncertainty because it ignores the fact that we are buying information for future decisions and not myopically greedily maximizing the next time-step’s return. Unfortunately, this is one of the hardest and thus least studied foraging problems, and Stephens & Krebs 1986 provides no easy answers (other than to note the applicability of POMDP-solving methods using DP which is, however, usually infeasible).

One could imagine some simple heuristics, such as setting a cutoff for ‘good’ movies and then alternate between watching whatever new movie sounds the best (and adding it to the good list if it is better than the cutoff) and watching the oldest unwatched good movie. This seems suboptimal because in a typical RL problem, exploration will decrease over time as most of the good decisions become known and it becomes more important to benefit from them than to keep trying new options, hoping to find better ones; one might explore using 100% of one’s decisions at the beginning but steadily decrease the exploration rate down to a fraction of a percent towards the end - in few problems is it optimal to keep eternally exploring on, say, 80% of one’s decisions. Eternally exploring on the majority of decisions would only make sense in an extremely unstable environment where the best decision constantly rapidly changes; this, however, doesn’t seem like the movie-watching problem, where typically if one really enjoyed a movie 1 year ago, one will almost always enjoy it now too. At the extreme, one might explore a negligible amount: if someone has accumulated a library of, say, 5000 great movies they enjoy, and they watch one movie every other night, then it would take them 27 years to cycle through their library once, and of course, after 27 years and 4999 other engrossing movies, they will have forgotten almost everything about the first movie…

Better RL algorithms exist, assuming one has a good model of the problem/environment, such as Thompson sampling. This minimizes our regret in the long run, by estimating the probability of being able to find an improvement, and decreasing its exploration as the probability of improvements decreases because the data increasingly nails down the shape of the recovery curve, the true ratings of top movies, and enough top movies have been accumulated The real question is the modelling of ratings over time.

The basic framework here is a longitudinal growth model. Movies are ‘individuals’ who are measured at various times on ratings variables (our personal rating, and perhaps additional ratings from sources like IMDB) and are impacted by events (viewings), and we would like to infer the posterior distributions for each movie of a hypothetical event today (to decide what to watch); movies which have been watched already can be predicted quite precisely based on their rating + recovery curve, but new movies are highly uncertain (and not affected by a recovery curve yet). I would start here with movie ratings. A movie gets rated 1-10, and we want to maximize the sum of ratings over time; we can’t do this simply by picking the highest-ever rated movie, because once we watch it, it suddenly stops being so enjoyable; so we need to model some sort of drop. A simple parametric model would to treat it as something like an exponential curve over time: gradually increasing and approaching the original rating but never reaching it (the magic of the first viewing can never be recaptured). (Why an exponential exactly, instead of a spline or something else? Well, there could be a hyperbolic aspect to the recovery where over the first few hours/days/weeks enjoyment resets faster than later on; but if the recovery curve is monotonic and smooth, then an exponential is going to fit it pretty well regardless of the exact shape of the spline or hyperbola, and one would probably require data from hundreds of people or rewatches to fit a more complex curve which can outpredict an exponential. Indeed, to the extent that enjoyment rests on memory, we might further predict that the recovery curve would be the inverse of the forgetting curve, and our movie selection problem becomes, in part, “anti-spaced repetition” - selecting datapoints to review to maximize forgetting.) So each viewing might drop the rating by a certain number v and then the exponential curve increases by r units per day - intuitively, I would say that on a 10-point scale, a viewing drops an immediate rewatch by at least 2 points, and then it takes ~5 years to almost fully recover within ±0.10 points (I would guess it takes less than 5 years to recover rather than more, so this estimate would bias towards new movies/exploration), so we would initially assign priors centered on v=2 and r= (2-0.10) / (365*5) ~= 0.001

We could consider one simple model: movies have intrinsic ratings 0-10, are uniformly distributed, there is an infinite number of them, each time period one earns the rating of a selected movie, ratings are unknown until first consumption, ratings do not deplete or otherwise change, and the goal is to maximize reward. (This simplifies the problem by avoiding any questions about uncertainty or posterior updating or how much movies decrease in enjoyment based on rewatches at varying time intervals.) The optimal strategy is the simple greedy one: sample movies without replacement until one hits a movie rated at the ceiling of 10, and then select that movie in every time period thereafter. Since the reward is maximal and unchanging, there is never any reason to explore after finding a single perfect 10, so the optimal strategy is to find one as fast as possible, which reduces to pure exploration and then infinite exploitation.

How about a version where movie ratings are normally distributed, say $\Mathcal{N}(5,2.5)$, with no upper or lower bounds? This is more interesting, because the normal distribution is unbounded and so there will always be a chance to find a higher rated movie which will earn a slightly greater reward in future time periods; even if one hits upon a 10 (+2SD) after sampling ~44 movies, there will still be a 2% chance of hitting a >10 movie on the next sample - this is an order statistics question and the maximum of a sample of n normals follows a roughly logarithmic curve, with the probability of a new sample being the maximum always falling but never reaching zero (it is simply ). Regret is problematic for the same reason, as strictly speaking, regret is unboundedly large for all algorithms since there is an arbitrarily larger rating somewhere in the tail. A pure strategy of always exploring performs badly because it receives merely an average reward of 5; it will find the most extreme movies but by definition it never makes use of the knowledge. A pure strategy of exploiting the best known movie after a fixed number of exploratory samples n performs badly because it means sticking with, say, a 10 movie while a more adventurous strategy eventually finds 11 or 13 or 20 rated movies etc; no matter how big n is, there is another strategy which explores for n+1 samples and gets a slightly higher maximum & pays for the extra exploration cost. A mixed 𝛜-greedy strategy of exploring a fixed percentage of the time performs better since it will at least continue exploration indefinitely and gradually discover more and more extreme movies, but the insensitivity to n is odd - why explore the same amount regardless of whether the P of a new maximum is or ? So decreasing the exploration rate as some function of time, or P in this case, is probably optimal in some sense, like in a standard multi-armed bandit problem.

This problem can be made better defined and more realistic by setting a time limit/horizon, analogous to a human lifetime, and defining the goal as being to maximize the cumulative reward by the end; optimal behavior then leads to exploring heavily early on and decreasing exploration to zero by the horizon.

x(t) = 1+1 ^ (t/r) x(365*5) = 0.10

and then our model should finetune those rough estimates based on the data.

• not standard SEM latent growth curve model - varying measurement times
• not Hidden Markov - categorical, stateless
• not simple Kalman filter, equivalent to AR(1)
• state-space model of some sort - dynamic linear model? AR(2)? dlm, TMB, Biips?

“State Space Models in R” https://arxiv.org/pdf/1412.3779v1.pdf https://en.wikipedia.org/wiki/Radioactive_decay#Half-life https://en.wikipedia.org/wiki/Kalman_filter

The forgetting curve is supposed to increase subsequent memory strength when the memory is reviewed/renewed on the cusp of forgetting. But you don’t want to actually forget the item. Does this imply that anti-spaced-repetition is extremely simple as you simply need to estimate how long until it’s probably forgotten, and you don’t need to track any history or expand the repetition interval because the memory doesn’t get stronger?

hm. so let's see. that DC comic about owlamoo, I don't remember any of and that was almost exactly a decade ago. I remembered very little of Doctor McNinja when I reread it in 2017 after starting it in 2008, so <9 years for most of it. I've been rereading and really enjoying yotsuba, and I read that sometime before 2011 according to my IRC logs, so that was at least 8 years. I also reread...
..._Catch-22_, Fourth Mansions, and Discovery of France relatively recently but I don't think I know when exactly I read them originally
so this suggests that the decay period is somewhere on the order of a decade. I could remember a fair amount of the 3 books but I don't know what those intervals are. for the others, the intervals are all around 9-11 years, and retention is almost near-zero: of course I remember the characters in _Yostuba&!_ but I remember almost none of the stories and it's practically brand new

Is there a simple stochastic exploitation strategy like this?

• MEMORIZE (a randomized spaced-repetition review algorithm derived using principles from control theory)

https://www.newyorker.com/magazine/2018/10/08/the-comforting-fictions-of-dementia-care

Some years ago, a company in Boston began marketing Simulated Presence Therapy, which involved making a prerecorded audiotape to simulate one side of a phone conversation. A relative or someone close to the patient would put together an “asset inventory” of the patient’s cherished memories, anecdotes, and subjects of special interest; a chatty script was developed from the inventory, and a tape was recorded according to the script, with pauses every now and then to allow time for replies. When the tape was ready, the patient was given headphones to listen to it and told that they were talking to the person over the phone. Because patients’ memories were short, they could listen to the same tape over and over, even daily, and find it newly comforting each time. There was a séance-like quality to these sessions: they were designed to simulate the presence of someone who was merely not there, but they could, in principle, continue even after that person was dead.

TODO:

• write a simple demo of the 5k films example ignoring uncertainty to see what the exploit pattern looks like
• use the rating resorter to convert my MAL ratings into a more informative uniform distribution
• MAL average ratings for unwatched anime should be standardized based on MAL mean/SD (in part because the averages aren’t discretized, and in part because they are not comparable with my uniformized ratings)