Tuesday, 18 August 2015

xG Hexagonal Maps

First popularized by Kirk Goldsberry and then introduced to hockey via War-On-Ice's Hextally plots, hexagonal plots are a great tool for helping to visualize sports. I have created my own versions below in the form of apps. Two quick caveats: my current 2014/2015 data seems to have some bugs in it, so take those seasons with a grain of salt, and the individual attempts map also seems to be buggy for reasons currently unknown. I am working to fix both of those issues, but please keep them in mind.

Here are some of the features of my xG Hexagonal Maps:

  • If you are unfamiliar with xG (Expected Goals) you can read my post detailing the methodology here. Simply, it provides the probability of any given shot resulting in a goal. 
    • A slight change between this xG and the one from that post is that these numbers now also include missed shots
  • The size of each hexagon reflects the frequency of shots from that specific location. The larger the hex, the more often a player shoots from that location
  • Each hex is coloured by the efficiency (xGe) of a player/team/goalie from that specific location (a rough code sketch of this colouring idea follows this list).
    • Efficiency here is measured as the difference between how many goals we expected them to score from that location (their xG) and how many they actually scored from that danger zone.
    • A Blue Hex means that their xG was greater than their actual G, implying that they may have under-performed. 
    • Red Hex means that their xG was less than their actual G, implying that they may have over-performed. 
  • Danger Zones are denoted by the light-pink and light-purple lines, high/medium/low. 
  • Not every red hex means a player over-performed and not every blue hex means a player under-performed. If you play in front of Henrik Lundqvist, your On-Ice Against xG is probably always going to be higher than your actual goals against. 
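If you are curious how the colouring might work under the hood, here is a minimal matplotlib sketch of the idea using made-up shot data. It is not the code behind the apps, and it only handles the colour dimension; the real maps also scale each hex by shot frequency, which takes more manual drawing than plain hexbin allows.

import numpy as np
import matplotlib.pyplot as plt

# Made-up shot locations, per-shot xG values and goal outcomes.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(60, 20, n)                  # hypothetical x-coordinates (ft)
y = rng.normal(0, 18, n)                   # hypothetical y-coordinates (ft)
xg = np.clip(rng.beta(2, 18, n), 0, 1)     # per-shot expected goal probability
goal = (rng.random(n) < xg).astype(float)  # simulated goal outcomes

fig, ax = plt.subplots(figsize=(6, 5))
# Colour each hex by finishing vs. expectation in that bin: positive (red) means
# more goals than expected, negative (blue) means fewer.
hexes = ax.hexbin(x, y, C=goal - xg, gridsize=15,
                  reduce_C_function=np.sum, cmap="coolwarm", vmin=-2, vmax=2)
fig.colorbar(hexes, label="Goals minus xG")
ax.set_title("Sketch: hexes coloured by goals scored vs. expected")
plt.show()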
The links to all the different maps are posted below. Please let me know if you have any thoughts, questions, concerns, suggestions or if you find any more bugs. You can comment below or reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart  

Team Attempts Map

https://dtmaboutheart.shinyapps.io/app-1

Goalie Map

https://dtmaboutheart.shinyapps.io/Tendy

Player On-Ice Attempts Map

https://dtmaboutheart.shinyapps.io/PuckOn

Player Individual Attempts Map

https://dtmaboutheart.shinyapps.io/Single

Friday, 7 August 2015

Team Rankings and Projections

Note: I have since named this projection system MONDO for no reason other than it is also my friend's name.

Projecting hockey isn't an easy task. Just ask SAP, NHL.com's partner in arms, contracted to help this multi-billion dollar industry tackle concepts developed over the past decade by hobbyists. SAP claimed to have developed a model that boasted 85% accuracy, never mind the fact that that is basically impossible to achieve. These past playoffs SAP finished with a record of 9-6 (or 10-5 if you are allowed to change your picks once the series has ended). I will not dive into the world of subjective rankings and predictions either; every outlet on the globe that covers hockey will probably share their gut feelings for next season. Starting at the player level and building up to the team level, I wanted to objectively quantify an individual's impact on their team and then be able to project how each team will perform in the future.

Most of the models we see in hockey tend to operate only at the team level, which isn't all that bad. However, in building my model (based on the basketball version created by 538) I wanted to build something that was truly founded at the player level and malleable to changing circumstances. My model will adapt to injuries, trades and lineup adjustments, both for players and goalies.


Methodology 

Players

The player projections here are based on Corsi Plus-Minus (CPM). If you don't know what CPM is, you would be best off reading about it here. The most basic definition of CPM is that it reflects the impact a given player has on their team's Corsi when said player is on the ice, independent of the strength of that player's teammates. 

The projection system used here is what is known as a Marcel projection system, originally derived by Tom Tango. It involves three basic components. The first step is weighting past seasons based on recency. Here we use a 5/4/3 method, meaning that if we are projecting the 2015/2016 season we would weight 2014/2015 stats at 41.66%, 2013/2014 stats at 33.33% and 2012/2013 stats at 25%. Simple enough. Using three years' worth of data helps ensure we don't put too much stock into one extremely good or bad season. Giving recent seasons more weight helps us account for players that might be trending up or down. 

The second step is to apply regression to the mean based on games played over the past 3 seasons (once again weighted by recency). Reminder: regression doesn't always mean getting worse. Regression to the mean here implies that we are pulling a player's numbers closer to league average (0 for CPM stats). Players who didn't play a single game in a season are given 10 games at league-average play for those missing seasons. Also, if a player did not meet a certain threshold of weighted games played over the past 3 seasons, instead of pulling their numbers towards zero, their numbers are pulled towards -1.5, which is about replacement level. That may sound like a ton of regression, but in actuality most of it is just to handle outlier players who would otherwise have unrealistic numbers. The logic basically goes: the less experience a player has at the NHL level, the more cautious we should be about their abilities. Rookies and other players who did not play a single NHL game last season are given projected values of 0 (league average). 

Finally, we apply an aging curve. I wanted this aging effect to be present yet not overbearing. I designated peak age to be 26, with a 4% increase up until then and a 2% decrease after that. If over 26, Age Adjust = (Age - 26) * .002. If under 26, Age Adjust = (Age - 26) * .004. Not crazy, but it serves its purpose. 
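If it helps to see those three steps written out, here is a rough Python sketch. The 5/4/3 weights, the 10-game fill-in, the -1.5 replacement level and the .004/.002 aging rates come from the description above; the games-played threshold, the amount of regression and the way the age adjustment is applied are my own guesses for illustration, not the actual projection code.

def marcel_cpm_projection(cpm_by_season, gp_by_season, age,
                          gp_threshold=40, replacement_level=-1.5):
    """cpm_by_season / gp_by_season cover the last three seasons, most recent first."""
    weights = [5, 4, 3]  # roughly 41.7% / 33.3% / 25%
    # Seasons with no games played are filled in with 10 games of league-average (0) play.
    gp = [g if g > 0 else 10 for g in gp_by_season]
    cpm = [c if g > 0 else 0.0 for c, g in zip(cpm_by_season, gp_by_season)]

    weighted_cpm = sum(w * c for w, c in zip(weights, cpm)) / sum(weights)
    weighted_gp = sum(w * g for w, g in zip(weights, gp)) / sum(weights)

    # Regress toward league average (0), or toward replacement level (-1.5) if the
    # player falls short of the games-played threshold. The threshold and the
    # 30-game "ballast" controlling how hard we regress are illustrative guesses.
    target = 0.0 if weighted_gp >= gp_threshold else replacement_level
    ballast = 30
    projected = (weighted_cpm * weighted_gp + target * ballast) / (weighted_gp + ballast)

    # Aging curve: peak at 26, using the .004 (under 26) and .002 (over 26) rates
    # quoted above. How the adjustment is applied (boosting younger players) is an
    # assumption on my part.
    age_adjust = (age - 26) * (0.002 if age > 26 else 0.004)
    return projected * (1 - age_adjust)

# Example: a 24-year-old with three seasons of data, most recent first.
print(marcel_cpm_projection([1.2, 0.8, 1.5], [78, 82, 70], age=24))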

Now that we have our projected OCPM (Offense), DCPM (Defence) and CPM (Overall) values, we have to convert them into goals. Based on a recommendation from Steven Burtch, I looked at each team's shot distribution by danger zone. Using these shot distributions and the average Corsi Shooting% from each zone, we can derive an expected On-Ice Shooting% for the players on each team. I then used the same process, but looking at Corsi against, to create an expected On-Ice Sv% for players. 

I then used the Marcel system (from above) but adapted for a player's on-ice shooting percentage. This produces an expected On-Ice Corsi Sh% multiplier for each player (e.g. when Tyler Seguin is on the ice, you should take his team's Sh% and multiply it by 1.18). Forwards can have an impact on their On-Ice Shooting%; defencemen cannot. Therefore, when converting a player's OCPM into goals we incorporate this multiplier for forwards, while defencemen simply receive the expected team average shooting% (calculated via the method above). Neither forwards nor defencemen can consistently influence their On-Ice Sv%, so each player is assigned the team's expected Sv%.

The last piece is to convert our CPM from rate stats to counting stats. Simply, we need to factor in playing time. This is where lineup construction comes in handy. I assign each player the league-average time on ice based on their position (forward vs. defence) and spot in the depth chart (1st vs. 3rd line). First-liners will have more of an impact than fourth-liners, but it is important to remember that what happens when your lesser players are on the ice still counts. Combining all of these factors leaves us with our expected offensive, defensive and overall values:

Goals For Above Average = Projected OCPM * Projected Time on Ice * Expected Team CSh% * On-Ice CSh% Multiplier

Goals Against Above Average = Projected DCPM * Projected Time on Ice * Expected Team CSv%

 Total Goals Above Average = Goals For Above Average + Goals Against Above Average 
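For clarity, here is what that conversion looks like as a small Python function. It follows the three formulas above literally; the function name and the example inputs are made up for illustration.

def skater_goals_above_average(ocpm_per60, dcpm_per60, toi_minutes,
                               team_csh_pct, team_csv_pct, onice_csh_multiplier=1.0):
    """ocpm_per60 / dcpm_per60 are projected Corsi Plus-Minus rates per 60 minutes.
    team_csh_pct / team_csv_pct are the expected team Corsi Sh% and Sv% described above.
    onice_csh_multiplier is applied to forwards only; defencemen are passed 1.0."""
    hours = toi_minutes / 60.0
    goals_for_above_avg = ocpm_per60 * hours * team_csh_pct * onice_csh_multiplier
    goals_against_above_avg = dcpm_per60 * hours * team_csv_pct
    return goals_for_above_avg, goals_against_above_avg, goals_for_above_avg + goals_against_above_avg

# e.g. a first-line forward projected for ~1300 even-strength minutes:
# skater_goals_above_average(2.0, 0.5, 1300, 0.05, 0.95, 1.10)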

Goalies

Goalies are voodoo, I know. Thankfully, we can project them the same way we did with players. Steps one and three are identical, so I won't go over them again. Step two is the same in principle, but I will clear up some of the details.

Goalies with little to no experience in a season are given 10 games at league average play. Also, if a goalie did not meet a certain threshold of weighted games played over the past 3 seasons instead of pulling their numbers towards league average (~0.923), their numbers are pulled towards replacement level (~0.910). If a goalie hasn't played any NHL games in the past 3 seasons they are given a rating of replacement level. 

I then split goalies into either starters or back-ups. Starters are assumed to play 57.5 of their team's games, while back-ups play the other 24.5; this is the average playing-time split in the NHL. Each goalie is always assumed to face the league-average even-strength shots against per game (~23.2). Our final number for each goalie's impact is: 


Goals Saved Above Average = (Projected SV% - League Average Sv%) * Projected GP * League Average Shots Against
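As a quick sketch of that formula (the 57.5/24.5 game split, the ~0.923 league average and the ~23.2 shots per game are the figures quoted above; everything else is illustrative):

def goalie_goals_saved_above_average(projected_sv_pct, is_starter,
                                     league_avg_sv_pct=0.923,
                                     shots_against_per_game=23.2):
    # Starters are assumed to play 57.5 games, back-ups the remaining 24.5.
    games_played = 57.5 if is_starter else 24.5
    return (projected_sv_pct - league_avg_sv_pct) * games_played * shots_against_per_game

print(goalie_goals_saved_above_average(0.930, is_starter=True))  # roughly +9 goals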


Season Simulation

Now that we have our player and goalie ratings, we simply add up the impact of all 18 players and 2 goalies for each team. This gives us Team Goals Above Average; a simple way to think about it is just a team's projected goal differential. Using the Pythagorean expectation (developed by Bill James) we can project a team's winning percentage based on how many goals we expect them to score and allow, using this formula: 

Win % = Goals Scored^1.8 / (Goals Scored^1.8 + Goals Allowed^1.8)
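In code, that is just (the goal totals in the example are made up):

def pythagorean_win_pct(goals_for, goals_against, exponent=1.8):
    return goals_for ** exponent / (goals_for ** exponent + goals_against ** exponent)

print(pythagorean_win_pct(240, 220))  # roughly 0.54 for a team projected at +20 goals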

Assuming that a team playing on home ice has an inherent 55% chance of winning just by the nature of playing at home, we can use this formula (also created by Tom Tango) for predicting the odds of a home team winning a given game:


Home Team Win Probability = [(Home Team Win%) * (1 - Away Team Win%) * 0.55] / ([(Home Team Win%) * (1 - Away Team Win%) * 0.55] + [(1 - Home Team Win%) * (Away Team Win%) * (1 - 0.55)])
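Written out as a function (0.55 is the home-ice edge from above; the sample win percentages are made up):

def home_win_probability(home_win_pct, away_win_pct, home_edge=0.55):
    numerator = home_win_pct * (1 - away_win_pct) * home_edge
    denominator = numerator + (1 - home_win_pct) * away_win_pct * (1 - home_edge)
    return numerator / denominator

print(home_win_probability(0.55, 0.50))  # roughly 0.60 for a slightly better team at home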

I am sorry if that looks convoluted and hard to follow, but we are really just plugging in both teams' ratings. Using that formula we can now, for any combination of two teams, figure out the odds of either team winning. We assume that every game has an equally likely chance of going to overtime; low-event teams are more likely to go to overtime, but it's not a substantial difference. The odds of winning or losing an overtime/shootout are equal for each team.

Hockey is a game of probabilities. If Team A is favoured 60% to 40% over Team B, it is important to keep in mind that 4 times out of 10 the lesser team will win that game. To simulate this randomness, for any given game we roll an imaginary die. This is a basic example, but hopefully it helps explain the process:


This shows the outcomes if we had two perfectly equal teams facing each other. Of course we aren't dealing with perfectly equal teams in real life. So when we apply this to actual games the numbers will not be as simple. 

Using the actual 2015-2016 schedule and the above formulas, we can derive the win/loss/OT probabilities for every game. Then, generating a random number (rolling our imaginary die), we can find out the result of each game. Do that for every game and congrats, you just used the Monte Carlo method to simulate an entire NHL season. Just like a single die roll, doing this simulation only once could generate some weird results, which is why we simulate the season 10,000 times to smooth them out (10,000 might be a tad excessive, but it doesn't hurt).
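For anyone who wants to see the bones of it, here is a stripped-down sketch of that simulation loop. The schedule and team ratings are toy inputs, and the overtime/loser-point bookkeeping described above is left out to keep it short; this is an illustration of the Monte Carlo idea, not the actual model code.

import random

def simulate_seasons(schedule, team_win_pct, n_sims=10_000, seed=42):
    """schedule is a list of (home, away) pairs; team_win_pct maps team -> projected Win%."""
    rng = random.Random(seed)
    wins = {team: 0 for team in team_win_pct}
    for _ in range(n_sims):
        for home, away in schedule:
            h, a = team_win_pct[home], team_win_pct[away]
            # Tango's home-advantage formula from above, with a 55% home edge.
            p_home = h * (1 - a) * 0.55 / (h * (1 - a) * 0.55 + (1 - h) * a * 0.45)
            winner = home if rng.random() < p_home else away
            wins[winner] += 1
    # Average wins per simulated season.
    return {team: round(w / n_sims, 1) for team, w in wins.items()}

# Toy two-team "schedule": 82 head-to-head games.
schedule = [("TBL", "BUF"), ("BUF", "TBL")] * 41
print(simulate_seasons(schedule, {"TBL": 0.58, "BUF": 0.44}))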


Rankings and Projections

Here are the results below. I also divided the results up to show the splits between a team's forwards and defence. To get Team Goals, either add For + Against + Goalies or Forwards + Defense + Goalies. 




Final Notes

  • I will do my best to update these rankings/projections with every trade and injury
  • Once a majority of the 2015-2016 season has been played and I have new CPM data, I will update the model accordingly
  • During the season, the expected point totals will be divided into:
    • Total points gained so far
    • Total expected points for the remainder of the season
    • Total expected points at season's end
  • Despite my best efforts, some players are still really screwy. Most notably Michael Bournival for Montreal, who almost single-handedly gives them one of the best 4th lines in the league
  • No, substituting career AHL player A for career AHL player B will not change your team's projections
  • In looking at the methodology behind the model, you can make your own changes as you see fit. Think your team's better goalie will play 70 games this season (despite the fact that over the past two seasons only 2 goalies have played that many)? Then you can go ahead and add a few goals to your team's projection
  • This isn't the world's most sophisticated model; it is not supposed to be
  • Below is the data used in the model.
    • Role - Position in the lineup
      • 1 - 2 - 3 - 4 : Forwards
      • 10 - 11 - 12 : Defense
      • 1 - 2 : Goalies
    • GFAA - Goals For Above Average
    • GAAA - Goals Against Above Average
    • TGAA - Total Goals Above Average
    • GSAA - Goals Saved Above Average
Please let me know if you have any thoughts, questions, concerns or suggestions. You can comment below or reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart  




Thursday, 23 July 2015

Corsi Plus-Minus: Individual Player Value Accounting for Teammates

***Reminder, if you are just interested in looking at the data/visualizations you can check out the separate page here instead of scrolling through this entire article.

Hockey stats have existed for about as long as the game itself. Simple boxscore stats such as goals and assists can be traced back almost a full century now. These stats have helped inform fans/coaches/managers of the value held by some players. Around the 1950s, the Montreal Canadiens began to track a player's plus-minus. This stemmed from the idea that simple boxscore stats fail to capture many important elements of a game. Plus-minus was a good start to tracking impact that is not realized in traditional boxscore stats. Unfortunately, plus-minus has recently been shown to be quite incomplete and lacking by modern evaluation standards. 

The most famous stat to come out of the modern hockey analytics movement is probably Corsi (or just simply all shot attempts). Corsi has numerous benefits over goal-based metrics: it accumulates faster, leading to more reliable information, and it is actually more predictive of who will be better in the future. Raw Corsi% (or CF%) tracks the share of shots (all attempts, not just shots on goal) directed at the opposition's net versus how many are directed at a player's own net while said player is on the ice; the higher a player's CF%, the better. This metric suffers from a key drawback, however: each player's CF% is heavily dependent upon the quality of his on-ice teammates. This is how a depth player on a good team (e.g. Shawn Horcoff) typically has a higher CF% rating than a star player on a bad team (e.g. Taylor Hall).

Issues with Current Metrics

Above, I have outlined some current issues with standard CF%, but these are not new issues. People have been aware of the effect playing on a good/bad team can have on a player for a while now, so the next progression in the history of hockey stats was Corsi%Relative. The CF%Rel of Player Y on Team X is calculated as follows: Team X's CF% while Player Y is on the ice minus Team X's CF% while Player Y is not on the ice. This still runs into issues when comparing players on vastly different teams. Consider the example below:

Whose stat line is more impressive? It seems as though Malkin makes a pretty good team into a great team when he is on the ice, while Hudler turns his team from awful to just bad. Which is more valuable? Or are they equally valuable? This is an improvement upon raw CF% but still shares many of the same problems. 

dCorsi is a fairly prevalent metric constructed by Stephen Burtch. dCorsi is calculated, for offense, as Player X's actual Corsi For minus Player X's expected Corsi For. The real trick here is calculating what a player's expected Corsi For is. I won't dive too deep into this stat since it is not mine; you can read up more on it here or reach out to Stephen Burtch via Twitter here. Expected Corsi is calculated using a multivariate regression with five independent variables:
  • A dummy variable for each Team/Season
  • Time on Ice per game 
  • Team total time on ice that the player in question wasn’t on the ice
  • Offensive Zone Start%
  • Neutral Zone Start%
dCorsi moves in the right direction here, especially with the dummy variable accounting for which team a player was on (if you don't know what a dummy variable is, just try to hold on until the Methodology section where I do my best to explain it). Unfortunately this doesn't seem to address the largest issue with CF%, which is how heavily dependent an individual player's CF% is on their teammates. 

The next evolution of CF%Rel is CF%RelTM, which stands for Corsi For % Relative to Teammates and is calculated by subtracting a player's average teammate CF% (calculated by weighting each teammate's CF% without him by the TOI they spent with him) from his observed CF%. This sounds good but runs into issues of collinearity. Collinearity occurs in hockey data because of the way coaches use their players. 

The most famous example is probably the Sedin twins in Vancouver. During the 2014-2015 regular season, of the roughly 1100 minutes of 5v5 ice time each of the twins played, 92% were played with each other. This can greatly boost their teammate-relative stats, because each twin's CF% is boosted by playing with his equally talented brother, while the comparison baseline comes from the rare minutes when one twin is instead playing with their lesser teammates. When Daniel wasn't playing with Henrik last season his CF% was 29%, but that was only about 77 minutes of ice time (or 8% of his seasonal total); unfortunately, CF%RelTM will weight this 29% heavily because of how much Daniel and Henrik play together. These rare instances where one player plays without the other can have disproportionately large effects on both players' ratings.

This issue also occurs between players who are never on the ice together, such as Jarret Stoll and Anze Kopitar in Los Angeles. Stoll played 905 minutes at 5v5 last season; how many of those were with Anze Kopitar? 1 minute and 30 seconds, or 0.16% of Stoll's total ice time. When we look at Jarret Stoll's most common teammates (the ones who will be weighted highest by CF%RelTM) we find that when they aren't playing with Stoll they are playing with Kopitar. This unfairly punishes Stoll for playing the same position on the same team as one of the top centres in the game. Kopitar boosts Stoll's teammates' CF% when Stoll is on the bench while never providing the same boost to Stoll's own CF%, simply because they are never on the ice together.

This finally brings me to explain Corsi Plus-Minus, how it is calculated and why I believe it is a better metric for isolating a player’s contributions independent of the strength of that player’s teammates.

Methodology

Using the play-by-play data provided by NHL.com I was able to look at every even-strength shift that took place from 2007-2015. From this data we set up a multivariate regression where our dependent variable is the rate at which Corsi events took place, our independent variables are which players were on the ice during that shift, and each shift is weighted by how long it was. This model does not account for a player's zone starts for a variety of reasons (see here and here) and it also does not account for the strength of a player's opponents (see here). Simply, there has been a lot of research showing that both of those components might not be as relevant in helping determine a player's value. 

For those of you that might not be familiar with multivariate regressions here is a relatively simple and hopefully helpful example to help you grasp the concept. 

y = β0 + β1X1 + β2X2 + β3X3
  • y - Dependent Variable - How well will you do on your test? Measured in points
  • x - Independent Variables
    • X1 - How long did you study for? Measured in hours
    • X2 - How long did you play video games? Measured in hours
    • X3 - Did you go to the study session? Yes/No?
      • This is an example of a dummy variable 
  • β - Coefficients - The value of each independent variable 

So if you had a sheet of data filled out with how well everyone did on their test (dependent variable) along with the information on the 3 independent variables and then ran that data as a multivariate regression you would get something that looks like this (reminder I made all these numbers up): 

y = 75 + 2X1 - 3X2 + 6X3

For the purposes of my methodology, the regression's coefficients are what we will be focusing on. Looking at coefficient β1: for every extra hour you studied (holding all other independent variables equal), you can expect to score 2 points better on the test. Looking at the dummy variable, X3, it can either be a 1 (yes, you did go to the study session) or a 0 (no, you did not go to the study session). If you go to the study session (holding all other independent variables equal), you can expect to do 6 points better on your test.

Now back to how this relates to my methodology. Picture every player as a dummy variable in our data: that player was either on the ice during a shift or they weren't. Running our regression will then give us coefficients that tell us (holding all other players equal) how much of an impact Player X has on his team's CF%, a value I have coined Corsi Plus-Minus (CPM). Two of these regressions are run for each season, one for offence and one for defence. 

However, due to the collinear nature of our data, as was touched upon earlier, we are better served using a ridge regression (also known as Tikhonov regularization) to help account for this collinearity. This method adds a penalty factor to the regression for coefficients that stray far from the mean. This penalty factor, called lambda, is chosen via 10-fold cross-validation. This helps remove a lot of the noise accompanying such a process. Players falling in the bottom 25% of the league's playing time (measured by games played instead of TOI so as not to bias against forwards) are grouped together and treated as a single player, to help reduce the volatility in the results that would be caused by their extreme values. While randomness can still have an effect, the damage is lessened by this regularization. 
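To make that concrete, here is a small scikit-learn sketch of the setup, using a toy shift table with made-up players. In the real model the penalty is chosen by 10-fold cross-validation and offence and defence are fit separately; here the penalty is fixed just to keep the example short.

import numpy as np
from sklearn.linear_model import Ridge

# Toy shift table: one row per shift, one 0/1 column per player indicating
# whether he was on the ice, a Corsi-event rate as the target, and shift
# length (seconds) as the weight. Players and numbers are made up.
players = ["Player_A", "Player_B", "Player_C", "Player_D"]
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, len(players)))           # on-ice dummies
shift_length = rng.uniform(20, 90, size=500)                # seconds
true_impact = np.array([0.8, -0.5, 0.1, 0.3])
y = X @ true_impact + rng.normal(0, 1.5, size=500)          # Corsi events per 60

# Ridge (Tikhonov) regression shrinks coefficients toward zero, which tames
# the collinearity problem described above.
model = Ridge(alpha=50.0)
model.fit(X, y, sample_weight=shift_length)

for name, coefficient in zip(players, model.coef_):
    print(f"{name}: CPM-style coefficient = {coefficient:+.2f}")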

Repeatability and Predictiveness 

I ran some basic correlation tests to see how repeatable a skill Corsi Plus-Minus (CPM), Offence CPM (OCPM) and Defence CPM (DCPM) are from year to year; players must have played at least 20 games in back-to-back years to be included in the correlation. I compared CPM to the other metrics mentioned earlier (CF%RelTM and dCorsi) just as a means of reference. A correlation (Pearson R) ranges from -1 to 1. The closer to either -1 or 1, the stronger the relationship; the closer to 0, the weaker the relationship.
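The check itself is simple; here is roughly how it can be done with pandas, assuming a table with player, season, gp and CPM columns (the column names are illustrative, not my actual files):

import pandas as pd

def year_to_year_correlation(df, stat="CPM", min_gp=20):
    """Pair each player's season-N value with his season-N+1 value and correlate."""
    year1 = df.rename(columns={stat: f"{stat}_y1", "gp": "gp_y1"})
    year2 = df.copy()
    year2["season"] -= 1                                   # align season N+1 onto N
    year2 = year2.rename(columns={stat: f"{stat}_y2", "gp": "gp_y2"})
    pairs = year1.merge(year2, on=["player", "season"])
    pairs = pairs[(pairs["gp_y1"] >= min_gp) & (pairs["gp_y2"] >= min_gp)]
    return pairs[f"{stat}_y1"].corr(pairs[f"{stat}_y2"])   # Pearson R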


I also ran some correlations to see how well CPM does at predicting future GF%. Reminder: negative correlations are a good thing for DCPM/dCA, so focus on the numerical values and not necessarily the positive or negative signs.

  • OCPM -> GF/60
  • DCPM -> GA/60
  • CPM -> GF%




Reliability vs. Validity 

As I said above, having all 3 different metrics here is strictly for comparative purposes. Just because CF%RelTM is higher in most categories doesn't automatically qualify it as a better metric. I will summarize key parts from this Columbia article with regards to reliability and validity. 
“Reliability refers to a condition where a measurement process yields consistent scores (given an unchanged measured phenomenon) over repeat measurements.”  

This is a measure that quantifies how much random fluctuations can interfere with getting consistent results. As we have seen above, the collinearity of CPM samples decreases CPM's reliability. Compared to single-year With-Or-Without-You stats, CPM seems to be quite reliable.
“Validity refers to the extent we are measuring what we hope to measure (and what we think we are measuring).”  

CPM is completely valid, because it directly measures the result. Stats such as CF%, CF%Rel and CF%RelTM are not as valid, because they measure a proxy rather than the subject directly. So even though they may seem reliable and predictive, they are not hitting the target of determining individual contributions.

Final Notes

Corsi Plus-Minus is just the next step in accurately determining a player's true value. There are many different versions of this methodology that could still be applied using metrics other than Corsi, including goals, shots on net, Fenwick and xG. There is also a Bayesian version where, instead of assuming every player starts the season with a rating of zero, we can tell the model what rating a player had the prior season so that it can more accurately estimate that player's true value. This version wouldn't describe what happened during a given season quite so accurately, but it could provide an even better idea of a player's value. I have looked pretty heavily into this version of CPM and hope to release it shortly. In a follow-up post I hope to explore further correlations (ex. Y -> Y+3) as well as how CPM relates to time-on-ice and salary. 

The data is posted below, both as a spreadsheet and in the form of some Tableau visualizations. I also noticed an error with the Comparison tab in my original Tableau that I am unable (for some unknown reason) to update, so I created a new Tableau (posted below the spreadsheet) that now includes player salary data.
  • OCPM/DCPM/CPM are rate stats (per 60 minutes of ice-time)
  • The Impact version of these stats apply a player's ice-time to determine their actual impact in a given season.

Thanks to C. Tompkins for helping me with the Tableaus, as well as War-On-Ice and Puckalytics for the data. If you are interested in reading about topics like this in further depth, I will direct you to previous work done by Brian McDonald (here, here & here) that was an instrumental guide in my work.

Please let me know if you have any thoughts, questions, concerns or suggestions. You can comment below or reach me via email me here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart  




Monday, 22 June 2015

Clustering NHL Forwards (using k-means)


How do you classify hockey players? Many would argue to go by the classical six positions (C, LW, RW, LD, RD, G) while some would argue for a rover (see picture above). I suggest a different distinction. Obviously, goalies are their own identity, so they're excluded from this analysis. That leaves skaters, which I will further break down into forwards and defence. Forwards and defence tend to have very distinct roles, with a few exceptions (D. Byfuglien and B. Burns). In this post I am going to focus on forwards. It isn't easy to decide just which position a forward plays; don't bother asking the PHWA (see the Ovechkin debacle) because they obviously can't tell. NHL.com is no help either, since many of their positional designations are hilariously out of place (ex. Zetterberg is listed as a LW despite taking over 1000 face-offs last season, which places him 48th in the entire NHL). Then there is the issue of 1st/2nd/3rd/4th lines. These roles are usually overstated by most media types, and then there is a designation problem: if a player performs like a 1st-liner but his coach plays him on the 2nd line, what is he really? I know, deep stuff. Long story short, breaking players down into categories is easier said than done.

K-Means Clustering

Therefore, I set out on a fun exercise to reclassify forwards based on their playing characteristics. I used k-means clustering to break the players down into 8 categories based on these characteristics. I want to stress that these measurements are meant to reflect a player's playing style, not how well or poorly they performed. The chart below shows the average of each measurement broken down by cluster. I arbitrarily named the clusters myself; you shouldn't read too much into those names. Come up with your own if you want. (You should click on that picture if you want to look at the cluster characteristics more carefully.)
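For those curious about the mechanics, the clustering step itself is only a few lines with scikit-learn. The features and forwards below are made-up stand-ins for the kind of playing-style measurements described above, not my actual inputs:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Made-up playing-style features for a handful of fictional forwards.
forwards = pd.DataFrame({
    "shots_per60": [9.1, 5.2, 7.4, 4.0, 8.8, 6.1],
    "hits_per60": [2.0, 7.5, 3.1, 8.9, 1.5, 4.2],
    "faceoffs_per_game": [14.0, 0.5, 9.0, 1.0, 0.2, 12.0],
    "penalties_drawn_per60": [1.1, 0.4, 0.9, 0.3, 1.3, 0.6],
}, index=["F1", "F2", "F3", "F4", "F5", "F6"])

# Standardize so no single feature dominates the distance calculation, then
# cluster. The post uses k = 8; a smaller k is used here only because the toy
# sample is tiny.
scaled = StandardScaler().fit_transform(forwards)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(dict(zip(forwards.index, labels)))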


Here is a random sample of ten players and which cluster they belong to. Please don't get too upset if you don't like a certain player's cluster. Remember these clusters group players by "playing style" not skill level. 

Cluster Features

Below are some box-and-whisker plots which break down the clusters by Corsi%, Age, TOI/GM and AAV. Here is a quick run-through of how to read a box-and-whisker plot:
  • The big solid line going down the whole graph is the mean value for the whole sample. Example: the mean Corsi% for all forwards is 50%.
  • Within each box plot is another solid line that marks the median value for that cluster (the middle value of that cluster). The median, not the mean.
  • The box itself encompasses the upper and lower quartiles of values, from the 25% to 75% percentile. 
  • The whiskers mark the top and bottom 25%, excluding outliers. 
  • Dots denote outliers.

Notes

  • All-Around seems like the group a player would want to be in, but it still encompasses a range of players from Sidney Crosby to Manny Malhotra.
  • "Safe" Depth is labelled as such because it is populated by lower-end players (see AAV), yet their Corsi compares favourably to the Depth cluster of players.
  • High Impact players do a bit of everything, including areas that don't involve scoring, e.g. they draw more penalties than they take while dishing out more hits than they receive.
  • Power Forwards are big guys (yes, I subjectively looked at the cluster of players for 30 seconds and thought I saw a bunch of perceived power forwards) who prefer to pass more than they shoot but also take more penalties than they draw, probably due to a lack of foot speed.
  • Depth players, while few in number (only 13), dish out hits like crazy yet clearly trail in the Corsi%, TOI/GM and AAV categories. 
  • Passers create a lot of opportunities for their teammates and are wizards at taking the puck away from their opponents more than they give it up.
  • Depth Scorers are typically young players who have been held down in the lineup by their coach yet can really shoot the lights out.
  • Scoring Wingers are very similar to Depth Scorers yet have been given a larger playing opportunity.
  • I would love to do the same exercise for defencemen, but there doesn't seem to be enough distinction using my current attribute metrics. Maybe I will discover a better way to classify them in the future, who knows.
Please let me know if you have any thoughts or questions. You can comment below or reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart

Wednesday, 10 June 2015

NHL Draft Probability Tool



**UPDATE: The model has since been updated to include four new NHL Draft Rankings (Adam Kimelman, Steven Hoffner, Mike G. Morreale, Craig Button) in addition to the Bob McKenzie rankings. I have also added a second version of the model (v2) which is the same model except it assumes McDavid and Eichel will go 1st and 2nd overall. I will also be live tweeting these probabilities during the NHL Draft this year as they change with each pick; you can follow me on Twitter here if that interests you.


To paraphrase Michael Baumann, the draft is a way for professional hockey teams to enrich themselves at the expense of 18-year-olds by taking away their ability to negotiate with all 30 franchises and imposing a strict rookie salary cap to depress their ability to make money while representing these franchises. If you ignore those larger issues, the draft can actually be a lot of fun. A team deciding which player it will eventually select is the culmination of many different levels of research on the player's on-ice production, off-ice physical condition, psychological assessments and background checks. The process of actually making the selections, however, involves some level of game theory, which is the foundation for the draft tool found below. (This tool isn't a completely original idea, but it is unique in its application to hockey. I have based this project on the work done by Brian Burke (not the hockey GM) for the NFL Draft, which you can find here.)

The draft is slightly more complicated than just choosing which player a team wants when the clock is running and he is still on the board. It shouldn't be a shocking statement that each team values players differently. Let's say you're a GM and you really like Lawson Crouse this year. Do you need to trade up or down to get him? How much should you be giving up or asking for? How far can you trade up or down and still get the player you value highly? This tool helps answer those questions.

Methodology 

*Brian did such a good job of explaining his model that much of this methodology will be a paraphrase of his work

This is not simply an average of projections but instead a model based on Bayesian inference. Bayesian models begin with a 'prior' probability distribution, used as a reasonable first guess. Then that guess is refined as we add new information. It works the same way your brain does (hopefully). As more information is added, your prior belief is either confirmed or revised to some degree. The degree to which it is revised is a function of how reliable the new information is. This draft projection model works the same way.

For a prior estimate of when each prospect might be taken, I created a probability distribution centred around a consensus of best-player rankings. I created my own consensus rankings of the past four drafts using ISS Hockey, McKeen's and The Hockey News. The key at this stage is that this prior distribution is only as confident as these types of projections have been in the past. The distribution is based on the errors of these rankings over the past four years. They don't take into account team needs (which teams should never consider, but whether they do or not is another discussion) or when certain teams have picks. These consensus rankings give us a reasonable first guess as to where a certain player is likely to get picked. 


Player Perspective


Using Bayesian inference, let's add expert draft projections, starting with Bob McKenzie's rankings. Keep in mind that the weight of each addition of new information is only as strong as that information has proved to be accurate in the past. Here is the probability distribution for Lawson Crouse; note how when we add McKenzie's rankings into the mix, our guess of where Crouse will be drafted becomes more confident. 


The good news is that we don't have to be so concerned about exactly which pick is used on a player. From the point of view of a GM, it's valuable to simply know whether a player will still be available at a given pick number, which is a slightly easier question to answer. The last chart takes the final result and turns it into what's called a cumulative probability distribution. It shows the probabilities that Lawson Crouse will still be available at each pick of the draft. This could be immensely useful to a team (especially for picks that aren't so obvious). 
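If you like seeing the arithmetic, here is a tiny numpy sketch of the whole chain for one prospect: a prior over pick numbers, an update from one expert list, and the cumulative availability curve. The bell-curve shapes and the centres/spreads are purely illustrative; the real model builds these distributions from the historical error of each ranking source.

import numpy as np

picks = np.arange(1, 31)

def pick_distribution(center, spread):
    """A simple bell curve over pick numbers, normalized to sum to 1."""
    weights = np.exp(-0.5 * ((picks - center) / spread) ** 2)
    return weights / weights.sum()

prior = pick_distribution(center=10, spread=5)      # consensus has him around 10th
expert = pick_distribution(center=7, spread=3)      # one expert has him around 7th

# Bayes: the posterior is proportional to the prior times the new information.
posterior = prior * expert
posterior /= posterior.sum()

# He is still available at pick k if he was not taken with picks 1..k-1.
available = 1 - np.concatenate(([0.0], np.cumsum(posterior)[:-1]))
for k in (1, 5, 10, 15):
    print(f"Pick {k}: {available[k - 1]:.0%} chance he is still on the board")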


Team Perspective


When looking at the Tableau, do not miss the Team Perspective tab. It shows the probability of availability for the top remaining prospects at a chosen pick number. Teams with multiple first-round picks will have a 1 following their name to designate their extra pick, e.g. Buffalo Sabres 1. This page also has some additional filters; for example, if you wanted to look at the best defenceman and/or OHL player who will probably be available at a specific pick, you can do that.

Improvements

The tool is great so far, but it can get better. The biggest issue I ran into while trying to put this all together was a lack of quality pre-draft prospect rankings from reputable sources. This is where I ask for your help. If anyone reading this has access to, or knows where to find, more extensive NHL Draft rankings, whether from other scouts/insiders or just longer versions of the lists I am already using (I currently only have access to the top 30 rankings for ISS/McKeens/THN and would prefer something like a top 100), please reach out to me. I would prefer any rankings/mock drafts to cover the past four drafts. My email is dtmaboutheart@gmail.com and my twitter is @DTMAboutHeart. Any extra draft data could help refine this model. I still have some ideas for how to add more expert predictions myself, but that would require some tweaking which I would prefer not to do. If you have any data, please reach out.

Draft Tool

Some reminders about this model and the draft in general:

  • The draft is a lot less predictable than people make it out to be.
  • Aggregating expert projections will tend to be more accurate than a single projection, as long as the aggregation is done right.
  • The prior distribution doesn't need to be perfect, just a reasonable starting point.
  • The inputs to the model are weighted only as much as they have proven to be accurate in the past.
  • The final result has all the uncertainty of the system baked in.
  • We don't always need to know exactly when a player will be chosen. Often it's very useful to know whether or not he's still available at a certain pick.
  • Don't freak out about McDavid's/Eichel's probabilities. This model is based on the results of past drafts. Look no further than Seth Jones sliding down to the 4th pick in the 2013 draft to see why this model believes there's a chance McDavid or Eichel won't go first and second. Everyone believes they will be chosen 1st and 2nd, so don't get hung up on it.

Below is a Tableau visualization of the draft tool with separate tabs for the player's and team's perspectives. If you run into any issues just let me know. Once again, big thanks to Brian Burke (not the hockey GM) for his methodology. Everything below will also be stored full-time on a separate Draft Tool page. Enjoy!






Wednesday, 3 June 2015

Updated xSV% - Save Percentage Accounting for Shot Quality


Goalies are voodoo. That should probably be added as the 11th Law of Hockey Analytics. Goaltending analysis is currently one of the most lacking subjects within hockey analytics. Great strides have been made, however, with 5v5 SV%, AdjustedSV% and High Danger SV%. A few months back I revealed a statistic I referred to as xSV%. You can click there to read the article but, more or less, I calculated a goaltender's Expected SV% based on the quality of the shooter for each shot faced, using a rolling average of the shooter's individual shooting percentage. xSV% is a goalie's actual SV% minus their ExpSV%. A higher (positive) xSV% is good, while a lower (negative) xSV% is bad. I have thought more about that specific methodology since posting that article and eventually decided that with some substantial changes I could greatly improve upon xSV%. Based on factors in my ExpG model combined with a regressed shooting percentage for each shooter (the same mindset as, but a different process from, the original xSV%), I basically started from scratch to develop this latest rendition of xSV%.

Methodology

The basis of a goalie's Expected SV% comes from the same model I used in my latest ExpG model. Here is a quick breakdown of the different variables and a brief explanation of why they are included in the model:
  • Adjusted Distance
    • The farther from the net a shot is taken, the lower its likelihood of resulting in a goal 
  • Type of Shot
    • Snap/Slap/Backhand/Wraparound/etc...
    • Different types of shots have different probabilities of resulting in goals
  • Rebound - Yes/No?
    • A rebound is defined here as a shot taking place less than 4 seconds after a previous shot
    • Rebounds are more likely to result in a goal than non-rebounds
  • Score Situation
    • Up a goal/down a goal/tied/etc…
    • It has been proven that Sh% rises when teams are trailing and vice versa
    • This adjustment, while only slight, helps to account for a variety of other aspects that we are currently unable to quantify yet that have an impact on goal scoring 
  • Rush Shot - Yes/No?
    • Shots coming off the rush are more likely to result in a goal than non-rush shots
Now that we have the structure of our ExpSV%, we need to add shooter talent into the mix, since the model currently assumes league-average shooting talent for each shot, which we know is not the case in reality. Generally, a shot from Sidney Crosby is more likely to result in a goal than a shot from George Parros. So I wanted to create a multiplier for each player in each season to get the best estimate of their personal effect on each shot's probability of resulting in a goal.

Using the Kuder-Richardson Formula 21 (KR-21), I found that 5-on-5 Sh% stabilizes for forwards at about 375 shots, while 5-on-5 Sh% for defencemen begins to stabilize around 275 shots. Therefore, for each season I added these shots (375 for forwards, 275 for defencemen) to a player's total shots. I also added a certain number of goals, calculated as the added shots (375 or 275) multiplied by the league-average Sh% for that season (forwards and defencemen have different league averages). This allowed me to calculate a regressed Sh% based on these new shot and goal totals. I then divided rSh% by the league-average Sh% (again, separate averages for forwards and defence) to give a Shot Multiplier, which is then applied to each shot the player took in that season. In case you didn't quite follow that rough explanation, here is an example of how this process played out for Steven Stamkos' 2011-2012 season, the highest rSh% season since 2007-2008:
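Here is the multiplier step as a short function, in case the words above didn't land. The 375/275 shot paddings come from the KR-21 numbers above; the example player and the league-average Sh% are made up:

def shot_multiplier(shots, goals, position, league_avg_sh_pct):
    # Pad with league-average shooting up to the stabilization point.
    padded_shots = 375 if position == "F" else 275
    padded_goals = padded_shots * league_avg_sh_pct
    regressed_sh_pct = (goals + padded_goals) / (shots + padded_shots)
    return regressed_sh_pct / league_avg_sh_pct

# e.g. a forward with 25 goals on 230 shots in a league where forwards shoot ~8.5%:
print(round(shot_multiplier(230, 25, "F", 0.085), 3))   # ~1.11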



Repeatability

The big issue with goalie metrics has always been how well they actually represent true ability. Sample size is a frequent issue with goaltenders, and it has been shown countless times that the only way to truly get a good idea of a goaltender's ability is to take very large shot samples. These samples typically take multiple years to accumulate. Looking at year-to-year correlations, so far it seems as though xSV% is on par with 5v5 SV%. The interesting difference I found, with the vital help of @MannyElk, is shown in the graph below. To paraphrase his earlier work, the experiment was done by drawing random samples of n games for each 40+ GP goalie season since 2007, and each sample was compared with the goalie's xSV% over the whole season. Essentially, the graph below shows that xSV% gives us more signal than noise sooner than 5v5 SV% does.
**Disclaimer** Manny and I are not 100% sure of the results here so if anyone has suggestions please reach out. 

Data 

Quick review of the stats below:

  • xSV% = Actual SV% - Expected SV%
  • dGA (Goals Prevented Above Expectation) = Expected Goals Against - Actual Goals Against
Below is a wonderful Tableau visualization created by @Null_HHockey, as well as a spreadsheet with all of the relevant information. Once again big thanks to the guys at War-On-Ice for all their help (and data). Everything below will also be stored full time on a separate xSV% page. Enjoy!