Pong Wins Above Replacement, Part 1

In case you haven’t figured it out, we here at Bobcat Territory really like beer pong. Since we started the site in April 2012, we’ve had 8 blogcats dedicated to the subject. That’s a veritable blogclowder (yes, apparently a group of bobcats is called a clowder)! In any case, this will be Part 1 of an ongoing series dedicated to trying to look at our trove of pong data in new and interesting ways. Before I jump in, you might want to pre-game by revisiting our previous posts on the subject:

The Beirut Blogcat
DKE Pong Rules & Regulations
The Brother Pi Breakdown
Pong II: Advanced Analytics
Don’t Doobie
NogFest 2015 Analysis: Player Performance
NogFest 2015 Analysis: Individual Games
NogFest 2016

If you’ve done your homework, you will notice that a number of metrics have been derived in these various blogcats all aimed at trying to measure player efficiency and impact at different aspects of the game. For example, we have Shot Value, which compares how often a player’s shot results in a point scored with how often that shot results in a point surrendered to the opposing team, and Points Saved vs. Average (PSVA), which tries to evaluate how difficult a player is to score against by comparing points scored against them versus points expected to be scored against an “average” player. Ultimately, we’d like to extend this logic into some sort of single, catch-all measurement for evaluating how good any individual team or player is: analogous to, say, PER or Win Shares in basketball, QBR in football, Standardized Wrist Torque in professional bowling, or Orc Kills per 36 minutes for Gimli in “Lord of the Rings: Return of the King”.

While metrics like PSVA are useful, they still only give us a snapshot of a particular component of the game, without giving us any insight into how these different factors may interact with one another. How do we determine the relative merits of being slightly above average in one area of the game, like saves, versus another, like avoiding UFEs? How can we evaluate the impact on a team’s performance of substituting one or both of its members with an “average” player? We don’t have enough data available to give any reliable answers. There aren’t enough games in our database, and the fact that all of the game data available are from tournament formats biases our data towards those teams that just so happened to win, regardless of whether or not they were a “better” team. That is, how many games would that team have won if the same match-up were played 99 more times?

To spare our livers from HAVING to play each game 100 times just to get robust inference about the quality of any given player or team, we can turn to a common tool in the world of statistical analysis: simulation. Pong games have a pretty simple and relatively strict structure to them (as I’m sure you, dear reader, already know from having read the rules and regulations linked to above). Each possession begins with a serve, which either ends with the occurrence of some event (UFE, sink, or hit) or leads to a rally, in which the players take turns hitting the ball in a specified order until the occurrence of an event ends the possession. Each individual step of this process can be imagined as a coin flip. Let’s start with the serve:

Team A serves the ball. There is some probability that Team A serves the ball off the table (a UFE), resulting in a point for the other team. This is our first coin flip: is the ball kept on the table or not? If it IS kept on the table, there is another coin flip: did the serve, assuming no UFE, then result in an event? If there was no event, then the rally continues. If there WAS an event, we have yet another coin flip: was the event a hit or a sink? If it was a sink, the possession is over with a point for the serving team. If it was a hit, the receiving team gets a chance to save, which results in another coin flip. If the save fails, the serving team gets the point. If the save succeeds, we have ANOTHER coin flip, which is whether the save itself results in a point (for the sake of simplicity, we are ignoring the fact that a save can lead to a hit, and thus another save, and so on). If this coin flip goes one way, the receiving team gets a point for the save, and the possession ends. If this coin flip goes the other way, neither team has scored a point, and the rally continues as normal. Here’s a diagram of this process, with the formula accompanying each line showing how we calculate the probability of that particular coin flip:
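For the code-inclined, the same chain of serve coin flips can also be written out as a short Python sketch. This is only an illustration of the logic described above, not the actual simulation code: the function name, the parameter names, and the example probabilities are all made up for this purpose.

```python
import random

def simulate_serve(p_ufe, p_event, p_sink, p_save, p_save_point, rng=random):
    """Return 'server', 'receiver', or 'rally' for a single serve.

    All five probabilities are hypothetical inputs; in practice they would
    be estimated from the recorded game data.
    """
    # Flip 1: did the serve miss the table (a UFE)?
    if rng.random() < p_ufe:
        return "receiver"
    # Flip 2: given no UFE, did the serve produce an event at all?
    if rng.random() >= p_event:
        return "rally"
    # Flip 3: was the event a sink (rather than a hit)?
    if rng.random() < p_sink:
        return "server"
    # Flip 4: it was a hit; does the receiving team save it?
    if rng.random() >= p_save:
        return "server"
    # Flip 5: does the save itself score a point?
    if rng.random() < p_save_point:
        return "receiver"
    # Saved, but no save point: play goes on as a normal rally.
    return "rally"

# Example with invented probabilities (NOT real NogFest numbers).
outcomes = [simulate_serve(0.10, 0.30, 0.20, 0.40, 0.25) for _ in range(10_000)]
print({k: outcomes.count(k) for k in ("server", "receiver", "rally")})
```

Each early `return` corresponds to one of the three ways the serve can end, and at most five flips are ever used on a single serve.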

So a single serve can end in three different ways: a point for the serving team (team A), a point for the receiving team (team B), or the beginning of the rally, though there are as many as five different coin flips potentially involved in deciding which path through the chart is taken. If either team scores a point, then the serve begins again (with, of course, the usual rules for deciding which team is serving). If nothing happens on the serve, and we progress through to a rally, we then get a series of coin flips for every single shot during that rally to determine whether or not it is a UFE, a hit, a sink, a save, a save point, or nothing, similar to what happens with the serve. The probabilities involved are slightly different (if you look at the data, you see clearly that shooting and UFE %s on serves and on rallies are different for most players/teams), but the basic logic remains the same.
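To make the rally step concrete, here is a rough Python sketch of how the shot-by-shot flips might be chained together. It is a simplified illustration under a few assumptions of my own: teams simply alternate shots, each team's dictionary of rally probabilities (`ufe`, `event`, `sink`, `save`, `save_point`, all hypothetical names) describes what happens when that team shoots, and the `save` entries stand for the defending team's chances against that shot.

```python
import random

def resolve_shot(p, rng):
    """One rally shot. Returns 'for' (point to the shooting team),
    'against' (point to the other team), or None (rally continues)."""
    if rng.random() < p["ufe"]:          # shot misses the table
        return "against"
    if rng.random() >= p["event"]:       # nothing happens, keep rallying
        return None
    if rng.random() < p["sink"]:         # event was a sink
        return "for"
    if rng.random() >= p["save"]:        # event was a hit and the save failed
        return "for"
    if rng.random() < p["save_point"]:   # save succeeded and scored
        return "against"
    return None                          # save succeeded, no point

def simulate_rally(rally_probs, first_shooter, rng=random):
    """Alternate shots between teams 0 and 1 until someone scores.

    Returns (winning_team, number_of_shots). Note that this loops forever
    if every probability is zero, so feed it sensible inputs.
    """
    shooter = first_shooter
    shots = 0
    while True:
        shots += 1
        outcome = resolve_shot(rally_probs[shooter], rng)
        if outcome == "for":
            return shooter, shots
        if outcome == "against":
            return 1 - shooter, shots
        shooter = 1 - shooter  # other team takes the next shot
```

Keeping separate probability dictionaries for serves and for rally shots is one simple way to respect the fact that the two sets of percentages differ.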

So, the “game flow” for a whole pong game goes something like this:

Thanks to exhaustive video review, we have pretty decent measurements on the frequency of the different types of events that can occur during a pong game, and so can estimate the various coin flip probabilities involved. Simulating a pong game, then, is pretty simple: you are essentially just simulating an ordered series of coin flips, with particular probabilities associated with each one, embedded within the rules that govern serves and scoring (i.e. game to at least 21, have to win by two, serves switch every 5 points, etc.). Given this set-up, we can then begin to evaluate the impact of certain teams, or players, by simulating large numbers of games. For now, let’s focus on team play, and ignore individual players until a later blogcat. Say we want to know how many Wins Above Replacement (WAR) a given team is worth: that is, how many more games would you expect this team to win than an average team, if playing against an average opponent? For example, if two completely average teams play each other 1000 times, you expect each of those teams to win about 500 of those games. If we replace one of those average teams with a particular team, say Nog + RJL or Bullets + Kavulich, then we can see how many games that team is expected to win against an average opponent. That number divided by 500 is then the approximate WAR for that team, so a value above 1 means the team wins more often than an average team would, and a value below 1 means it wins less often. Of course, since this simulation is random, the exact number of wins will differ from one batch of 1000 games to the next, but the more total games we simulate, the closer the WAR will get to its hypothetical true value.
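Here is a stripped-down Python sketch of that idea, wrapping the scoring rules (game to at least 21, win by two, serve switches every 5 points) around a stand-in for the possession logic. To keep it short, each possession is collapsed into a single coin flip whose probability depends only on who is serving. That shortcut, the function names, the choice that team 0 always serves first, and the choice to keep the 5-point serve rotation going in overtime are all assumptions for illustration, not the actual simulation.

```python
import random

def simulate_game(p_team0_wins_point, rng, target=21, serve_block=5):
    """Play one game and return (score0, score1).

    p_team0_wins_point[s] is the (hypothetical) chance that team 0 wins a
    possession when team s is serving.
    """
    score = [0, 0]
    server = 0  # assumption: team 0 serves first
    # Keep playing until someone has at least `target` AND leads by two.
    while max(score) < target or abs(score[0] - score[1]) < 2:
        winner = 0 if rng.random() < p_team0_wins_point[server] else 1
        score[winner] += 1
        if sum(score) % serve_block == 0:  # serve switches every 5 points
            server = 1 - server
    return tuple(score)

def wins_above_replacement(p_team0_wins_point, n_games=1000, seed=0):
    """Team 0's wins divided by the wins expected of an average team."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        s0, s1 = simulate_game(p_team0_wins_point, rng)
        wins += s0 > s1
    return wins / (n_games / 2)

# Two evenly matched teams should land near a WAR of 1.
print(wins_above_replacement([0.5, 0.5]))
```

Swapping the single flip inside `simulate_game` for a full serve-plus-rally sequence would bring it closer to the process described above, and raising `n_games` shrinks the run-to-run wobble in the estimate.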

The output from a single run of the simulation I designed to follow these rules looks a little something like this:


In the first two columns, you can see each team’s score as the algorithm progresses through the game (in this case, it ends with team 2 winning 21-12). The next two columns are the service indicators: you can see that team 2 served first. The next two columns are the inverse of the service indicators, i.e. they indicate which team was receiving the serve. Then, we have the “shots” and “rally” columns; the latter is simply a binary indicator of whether that possession progressed to a rally or ended with a point being scored by either team on the serve (you can see that 6 of the 33 “possessions” in this game ended with a serve-point of some kind), while the “shots” column counts the number of shots taken on that rally before a point is scored (that 41-shot rally must have been riveting). After that, there are a number of columns (not all shown) tabulating what events occurred on that possession and which team earned them (i.e. UFEs, sinks, hits, saves, save points). Down 16-10, team 1 had two consecutive serve UFEs … that’s not how you get yourself back into a game.

Now, we can look at the number of events that occurred for each team in this game and compare them to how many we saw, on average, in NogFest 2015 (later I’ll explain why I am only using the inaugural NogFest for these analyses, and holding out the rest of our game data):


You can see that team 1 here had WAY too many UFEs, which really did them in, considering they had more sinks AND saves than team 2. Maybe this one game was an outlier and the coin flips just happened to go this way. Let’s see what happens if we do this simulation 1000 times and look at the average per game stats:


These teams are pretty evenly matched, with team 1 having an edge in saves per game, which is completely offset by having extra UFEs per game. Team 1 won 476 of these 1000 games, while team 2 won the other 524. The average point differential was -0.38 for team 1, indicating that, on average, the games were close, with team 2 getting a slight edge (almost certainly due to the difference in UFEs). 122 out of the 1000 games went to overtime; interestingly, team 1 won 53% of the overtime games (clutch). However, just because the teams were evenly matched on average doesn’t mean there can’t be some lopsided games: in one game, team 1 triumphed by 17 points, while in another they lost by 15 points. In 55 out of the 1000 games, team 1 got blown out by more than 10 points (to be fair, there were also 47 games in which they blew out team 2).
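For anyone who wants to tally this kind of summary themselves, here is a small Python sketch that takes a list of final scores and counts wins, overtime games, and blowouts. The function name and dictionary keys are invented for illustration, and it assumes a game "went to overtime" if the losing team reached at least 20, which may not match exactly how overtime was counted above.

```python
from statistics import mean

def summarize(scores, target=21, blowout_margin=10):
    """Summarize a list of (team1_score, team2_score) final scores."""
    diffs = [a - b for a, b in scores]
    # Assumption: overtime means both teams reached at least target - 1.
    overtime = [(a, b) for a, b in scores if min(a, b) >= target - 1]
    return {
        "team1_wins": sum(d > 0 for d in diffs),
        "avg_point_diff": mean(diffs),
        "overtime_games": len(overtime),
        "team1_overtime_wins": sum(a > b for a, b in overtime),
        "team1_blown_out": sum(d < -blowout_margin for d in diffs),
        "team1_blowout_wins": sum(d > blowout_margin for d in diffs),
        "best_margin": max(diffs),
        "worst_margin": min(diffs),
    }

# Tiny made-up example (not output from the real simulation).
print(summarize([(21, 12), (12, 21), (23, 21), (21, 5), (20, 22)]))
```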

By the way, team 2 here is an average team, while team 1 here is Majowka and Big Bob. So, their Wins Above Replacement is 476/500 = 0.952, meaning against an average opponent they actually win slightly less often than you would expect an average team to. They’ve really got to get their UFEs under control.

Now, these numbers aren’t perfect. As mentioned earlier, the data we have to feed into these simulations are biased; maybe Majowka and Big Bob just aren’t looking good because they happened to have a bad day when we recorded the numbers, and they wouldn’t REALLY commit that many UFEs per game in a larger sample (on the other hand, Majowka sucks, so maybe they’d look even worse in a larger sample). The design of the simulation also makes a few assumptions: a save either leads to a point or a continued rally (with no save-hits leading to saves-of-saves), there are no doobies, and the probability of an event occurring on any given possession is the same across all possessions (and independent of the occurrence of any previous events). This last assumption may be the most important, because it ignores the effects of, say, drinking a lot of beer and getting less accurate, or the fact that returning a save is more difficult than returning a normal shot, and so on. However, as we can see from the previous tables, the numbers are at least approximately recapitulating those that we actually observed, at least for average/near-average teams, which indicates to me that the changes wrought by these factors tend to even out over the course of many games.

In Part 2, we will look at how the other teams in NogFest fared in these simulations, and start looking at interpreting the results more closely.
