Tuesday, 18 August 2015

xG Hexagonal Maps

First popularized by Kirk Goldsberry and then introduced to hockey via War-On-Ice's Hextally plots, hexagonal plots are a great tool for visualizing sports. I have created my own versions below in the form of apps. Two quick caveats: my current 2014/2015 data seems to have some bugs in it, so take those seasons with a grain of salt, and the individual attempts map also seems to be buggy for reasons currently unknown. I am working to fix both of those issues but just keep them in mind.

Here are some of the features of my xG Hexagonal Maps:

  • If you are unfamiliar with xG (Expected Goals) you can read my post detailing the methodology here. Simply put, it provides the probability of any given shot resulting in a goal.
    • A slight change between this xG and the one from that post is that these numbers now also include missed shots
  • The size of each hexagon is the frequency of shots from that specific location. The larger the hex, the more often a player shoots from that location
  • Each hex is coloured by the efficiency (xGe) of a player/team/goalie from that specific location.
    • Efficiency here is measured as the difference between how many goals we expected them to score from that location (their xG) and how many they actually scored from that location.
    • A Blue Hex means that their xG was greater than their actual G, implying that they may have under-performed.
    • A Red Hex means that their xG was less than their actual G, implying that they may have over-performed.
  • Danger Zones are denoted by the light-pink and light-purple lines, high/medium/low. 
  • Not every red hex means a player over-performed and not every blue hex means a player under-performed. If you play in front of Henrik Lundqvist, your On-Ice Against xG is probably always going to be higher than your actual goals against.
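
As a rough illustration of the sizing and colouring rules above, here is a minimal Python sketch. The hex names and all of the numbers are made up, and this is not the code behind the apps; it only shows the comparison of actual goals to xG and the shot-frequency share for each hex.

```python
# Hypothetical per-hex totals: (expected goals, actual goals, shot attempts).
hexes = {
    "slot": (4.2, 6, 40),
    "left_point": (1.5, 1, 55),
    "right_circle": (2.0, 2, 25),
}

def hex_colour(xg, goals):
    """Colour a hex by efficiency: actual goals compared with expected goals."""
    if goals > xg:
        return "red"    # xG was less than actual G
    if goals < xg:
        return "blue"   # xG was greater than actual G
    return "neutral"

colours = {name: hex_colour(xg, g) for name, (xg, g, _) in hexes.items()}

# Hex size tracks how often shots come from that location.
total_attempts = sum(n for _, _, n in hexes.values())
sizes = {name: n / total_attempts for name, (_, _, n) in hexes.items()}

print(colours)
print(sizes)
```
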
The links to all the different maps are posted below. Please let me know if you have any thoughts, questions, concerns or suggestions, or if you find any more bugs. You can comment below or reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart

Team Attempts Map


Goalie Map


Player On-Ice Attempts Map


Player Individual Attempts Map


Monday, 22 June 2015

Clustering NHL Forwards (using k-means)

How do you classify hockey players? Many would argue for the classical six positions (C, LW, RW, LD, RD, G), while some would argue for a rover (see picture above). I suggest a different distinction. Obviously, goalies are their own identity, so they're excluded from this analysis. That leaves skaters, which I will further break down into forwards and defence. Forwards and defence tend to have very distinct roles, with a few exceptions (D. Byfuglien and B. Burns). In this post I am going to focus on forwards. It isn't easy to decide just which position a forward plays. Don't bother asking the PHWA (see the Ovechkin debacle) because they obviously can't tell. NHL.com is no help either, since many of their positional declarations are hilariously out of place (e.g. Zetterberg is listed as a LW despite taking over 1000 face-offs last season, which places him 48th in the entire NHL). Then there is the issue of 1st/2nd/3rd/4th line. These roles are usually overstated by most media types, and then there is a designation problem: if a player performs like a 1st liner but his coach plays him on the 2nd line, what really is he? I know, deep stuff. Long story short, breaking players down into categories is easier said than done.

K-Means Clustering

Therefore, I set out on a fun exercise to reclassify forwards based on their playing characteristics. I used k-means clustering to break the players down into 8 categories based on these characteristics. I want to stress that these measurements are meant to reflect a player's playing style, not how well or poorly they performed. The chart below shows the average of each measurement broken down by cluster. I arbitrarily named the clusters myself, so you shouldn't read too much into the names. Come up with your own if you want. (You should click on that picture if you want to look at the cluster characteristics more carefully.)
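
For anyone unfamiliar with k-means, here is a bare-bones Python sketch of the idea. It is not the code I used: the two features (shots and hits per 60 minutes), the six made-up players and the choice of 2 clusters are all placeholders to keep the example small.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recentre."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical (shots per 60, hits per 60) for six forwards.
players = [(9.0, 2.0), (8.5, 2.5), (9.5, 1.5), (3.0, 12.0), (3.5, 11.0), (2.5, 13.0)]
labels, centroids = kmeans(players, k=2)
print(labels)
```

With real data the features sit on very different scales, so they would normally be standardized before clustering so that no single measurement dominates the distance.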

Here is a random sample of ten players and which cluster they belong to. Please don't get too upset if you don't like a certain player's cluster. Remember these clusters group players by "playing style" not skill level. 

Cluster Features

Below are some box-and-whisker plots which break down the clusters by Corsi%, Age, TOI/GM and AAV. Here is a quick run through of how to read a box-and-whisker plot:
  • The big solid line going down the whole graph is the mean value for the whole sample. Example: The mean Corsi% for all forwards is 50%.
  • Within each box plot is another solid line that marks the median value for that cluster (the middle value of that cluster). Median, not the mean.
  • The box itself encompasses the values between the lower and upper quartiles, from the 25th to the 75th percentile.
  • The whiskers mark the top and bottom 25%, excluding outliers. 
  • Dots denote outliers.
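
If you want to reproduce the numbers behind one of these boxes, the sketch below computes them in Python. The Corsi% values are invented, and the 1.5 x IQR rule for deciding what counts as an outlier is the common plotting default that I am assuming here; the plots above may use a different rule.

```python
import statistics

def box_summary(values):
    """Quartiles, whisker ends and outliers for one cluster (1.5 x IQR rule)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outliers
        "outliers": outliers,                    # drawn as dots
    }

# Hypothetical Corsi% values for the players in one cluster.
corsi = [44.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 62.0]
summary = box_summary(corsi)
print(summary)
```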


  • All-Around seems like the group a player would want to be in, but it still encompasses a range of players from Sidney Crosby to Manny Malhotra.
  • "Safe" Depth is labelled as such because it is populated by lower-end players (see AAV), yet their Corsi compares favourably with the Depth cluster of players.
  • High Impact players do a bit of everything, including areas that don't involve scoring, e.g. they draw more penalties than they take while dishing out more hits than they receive.
  • Power Forwards are big guys (yes, I subjectively looked at the cluster of players for 30 seconds and thought I saw a bunch of perceived power forwards) who prefer to pass more than they shoot but also take more penalties than they draw, probably due to a lack of foot speed.
  • Depth players, while few in number (only 13), dish out hits like crazy yet clearly trail in the Corsi%, TOI/GM and AAV categories.
  • Passers create a lot of opportunities for their teammates and are wizards at taking the puck away from their opponents more than they give it up.
  • Depth Scorers are typically young players who have been held down in the lineup by their coach yet can really shoot the lights out.
  • Scoring Wingers are very similar to Depth Scorers yet have been given a larger playing opportunity.
  • I would love to do the same exercise for defencemen but there doesn't seem to be enough distinction using my current attribute metrics. Maybe I will discover a better way to classify them in the future, who knows.
Please let me know if you have any thoughts or questions. You can comment below or reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart

Wednesday, 3 June 2015

Updated xSV% - Save Percentage Accounting for Shot Quality

Goalies are voodoo. That should probably be added as the 11th Law of Hockey Analytics. Goaltending analysis is currently one of the most lacking subjects within hockey analytics. Great strides have been made, however, with 5v5 SV%, AdjustedSV% and High Danger SV%. A few months back I revealed a statistic I referred to as xSV%. You can click there to read the article but, more or less, I calculated a goaltender's Expected SV% from the quality of the shooter on each shot faced, using a rolling average of the shooter's individual shooting percentage. xSV% is a goalie's actual SV% minus their ExpSV%. A higher (positive) xSV% is good while a lower (negative) xSV% is bad. I have thought more about that specific methodology since posting that article and eventually decided that with some substantial changes I could greatly improve upon xSV%. Using the factors in my ExpG model combined with a regressed shooting percentage for each shooter (same mindset as the original xSV% but a different process), I basically started from scratch to develop this latest rendition of xSV%.


The basis of a goalie's Expected SV% comes from the same model I used in my latest ExpG model. Here is a quick breakdown of the different variables and a brief explanation of why they are included in the model:
  • Adjusted Distance
    • The farther from the net a shot is taken, the lower its likelihood of resulting in a goal
  • Type of Shot
    • Snap/Slap/Backhand/Wraparound/etc...
    • Different types of shots have different probabilities of resulting in goals
  • Rebound - Yes/No?
    • A rebound is defined here as a shot taking place less than 4 seconds after a previous shot
    • Rebounds are more likely to result in a goal than non-rebounds
  • Score Situation
    • Up a goal/down a goal/tied/etc…
    • It has been shown that Sh% rises when teams are trailing and vice versa
    • This adjustment, while only slight, helps to account for a variety of other aspects that we are currently unable to quantify yet have an impact on goal scoring
  • Rush Shot - Yes/No?
    • Shots coming off the rush are more likely to result in a goal than non-rush shots
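
To make the rebound definition above concrete, here is a small Python sketch that flags rebounds from a list of shot times. The (game id, seconds elapsed) layout is hypothetical, and I am assuming for this sketch that any previous shot in the same game counts, whichever team took it.

```python
def flag_rebounds(shots, window=4):
    """Mark a shot as a rebound if it comes less than `window` seconds
    after the previous shot in the same game."""
    flagged = []
    previous = {}  # game id -> time of the most recent shot in that game
    for game, seconds in sorted(shots):
        last = previous.get(game)
        is_rebound = last is not None and seconds - last < window
        flagged.append((game, seconds, is_rebound))
        previous[game] = seconds
    return flagged

# Hypothetical shots as (game id, seconds elapsed in the game).
shots = [(1, 100), (1, 102), (1, 110), (2, 101), (2, 104)]
result = flag_rebounds(shots)
print(result)
```
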
Now that we have the structure of our ExpSV% we need to add shooter talent into the mix since the model currently assumes league average shooting talent for each shot, which we know is not the case in reality. Generally, a shot from Sidney Crosby is more likely to result in a goal than a shot from George Parros. So I wanted to make a multiplier for each player in each season to get a best estimation of their personal effect on each shot's probability of resulting in a goal.

Using the Kuder-Richardson Formula 21 (KR-21) I found that 5-on-5 Sh% stabilizes for forwards at about 375 shots, while 5-on-5 Sh% for defencemen begins to stabilize around 275 shots. Therefore, for each season I added these shots (375 for forwards, 275 for defencemen) to a player's total shots. I also added a certain number of goals, calculated as the added shots (375 or 275) multiplied by the league average Sh% for that season (forwards and defence had different league average Sh%). This allowed me to calculate a regressed Sh% (rSh%) based on these new shot and goal totals. I then divided rSh% by the league average Sh% (again, separate for forwards and defence) to give a Shot Multiplier, and multiplied the goal probability of each shot the player took in that season by this Shot Multiplier. In case you didn't quite follow that rough explanation, here is an example of how this process played out for Steven Stamkos' 2011-2012 season, the highest rSh% season since 2007-2008:
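
The same arithmetic can be sketched in a few lines of Python. The inputs below are invented for illustration and are not Stamkos' actual totals; in particular the 9% league-average forward Sh% is just a placeholder.

```python
def shot_multiplier(goals, shots, league_sh, pad_shots):
    """Regress Sh% toward league average by padding with league-average shots.

    pad_shots is 375 for forwards and 275 for defencemen in the post.
    Returns (regressed Sh%, Shot Multiplier).
    """
    padded_goals = goals + pad_shots * league_sh
    padded_shots = shots + pad_shots
    regressed_sh = padded_goals / padded_shots
    return regressed_sh, regressed_sh / league_sh

# Hypothetical forward: 30 goals on 200 shots, placeholder league Sh% of 9%.
r_sh, multiplier = shot_multiplier(goals=30, shots=200, league_sh=0.09, pad_shots=375)
print(round(r_sh, 4), round(multiplier, 3))
```

A shooter who converts at exactly the league rate ends up with a multiplier of 1, so their shots are left at the model's league-average probability.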


The big issue with goalie metrics has always been how well they actually represent true ability. Sample size is a frequent issue with goaltenders and it has been shown countless times that the only way to truly get a good idea of a goaltender's ability is to take very large shot samples. These samples typically take multiple years to accumulate. Looking at year-to-year correlations, so far, it seems as though xSV% is on par with 5v5 SV%. The interesting difference I found, with the vital help of @MannyElk, is shown in the graph below. To paraphrase his earlier work, the experiment was done by drawing random samples of n games for each 40+ GP goalie season since 2007 and each sample was compared with the goalie's xSV% over the whole season. Essentially, the graph below shows that xSV% will show us more signal than noise sooner than 5v5 SV%.
**Disclaimer** Manny and I are not 100% sure of the results here so if anyone has suggestions please reach out. 


Quick review of the stats below:

  • xSV% = Actual SV% - Expected SV%
  • dGA (Goals Prevented Above Expectation) = Expected Goals Against - Actual Goals Against
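
For reference, both stats can be computed from the per-shot goal probabilities, as in the Python sketch below. The probabilities are made up, and I am assuming here that Expected SV% is one minus expected goals against per shot faced.

```python
def goalie_summary(shot_probs, goals_against):
    """shot_probs: model goal probability for every shot the goalie faced."""
    shots = len(shot_probs)
    expected_ga = sum(shot_probs)          # Expected Goals Against
    expected_sv = 1 - expected_ga / shots  # Expected SV% (assumed definition)
    actual_sv = 1 - goals_against / shots  # Actual SV%
    return {
        "xSV%": actual_sv - expected_sv,
        "dGA": expected_ga - goals_against,
    }

# Hypothetical goalie: five shots faced, none of them went in.
summary = goalie_summary([0.05, 0.10, 0.02, 0.08, 0.25], goals_against=0)
print(summary)
```

Under that assumption xSV% is simply dGA divided by shots faced, which is why the two always share the same sign.
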
Below is a wonderful Tableau visualization created by @Null_HHockey, as well as a spreadsheet with all of the relevant information. Once again big thanks to the guys at War-On-Ice for all their help (and data). Everything below will also be stored full time on a separate xSV% page. Enjoy!

Monday, 25 May 2015

Updated NHL Expected Goals Model

Here is the latest rendition of my Expected Goals model. If you haven't read the original post you probably should read it here before continuing. The only substantial change from the previous version is that this one now includes rush shots, as it has previously been shown that rush shots, by the very fact that they are rush shots, result in a higher shooting percentage. My model currently only accounts for 5-on-5 situations and includes a total of five factors:
  • Adjusted Distance
    • The farther away a shot is taken, the lower the likelihood it results in a goal
  • Type of Shot
    • Snap/Slap/Backhand/Wraparound/etc...
  • Rebound - Yes/No?
    • A rebound is defined as a shot taking place less than 4 seconds after a previous shot
  • Score Situation
    • Up a goal/down a goal/tied/etc…
  • Rush Shot - Yes/No?
    • Rush shots have a higher shooting percentage


Same sort of graphs below as in the previous post, along with the correlations for each. The ExpGF correlation jumped slightly from 0.58 to 0.61, yet the ExpGA correlation stayed consistent at 0.60. That isn't to say adding rush shots didn't affect the model. There is definitely some difference, both positive and negative, for certain teams, typically within the 10 goal range.


I still plan on adding some aspect of regressed shooter and goaltender talent somehow into the model. I am close to releasing ExpG at the player level, hopefully within the next week. Around the time I am able to incorporate goaltender talent into the model I should also be able to update my xSV% with the shot quality aspects of this model.

Expected Goals

Here are the updated results below. Note that dGF, dGA and dGF% are calculated as actual minus expected. Therefore, a positive dGF means that a team scored more goals than the model predicted they would. A positive dGA means that a team allowed more goals against than the model would have predicted. I will update this spreadsheet in its own tab at the top of this site too. Please let me know any questions or feedback you might have. Enjoy!

Thursday, 21 May 2015

NHL Expected Goals Model

Did anyone ever consider shot quality? 

UPDATE: This model has since been improved upon and shown here. This post still provides good background on the basics of the model.

Shot quality and possession metrics have always been somewhat of a point of contention. Expected Goals (ExpG) helps to combine these two facets in hopes of providing better information about the game. Expected Goals are not a novel concept: versions have been presented previously by Brian Macdonald for hockey, and the original motivation for my study was Michael Caley's soccer version. I hope to lay out my ExpG model in a way that makes hockey sense, where everyone can understand why each factor was added into the model. The model works by assigning a value to each shot taken over the course of a season based on the model's predicted probability of that shot resulting in a goal. To calculate a team's final ExpG all you have to do is sum up all of these probabilities and there you have it. First I will break down the methodology that goes into this model. If you don't care and just want to see the results, skip down to the Expected Goals section or check out the Expected Goals tab above.


My model uses a logistic regression to arrive at each goal probability. Basically, it uses a bunch of independent variables to produce the probability of a binary outcome occurring; in our case, yes a goal was scored or no a goal wasn't scored. I reran the logistic regression for each season instead of using one big logistic regression. This helps to account for minor changes in the league's style of play, yet the regression coefficients didn't actually change much year-to-year. So far my model only accounts for 5-on-5 situations. Here are the factors taken into account by the model:
  • Adjusted Distance
    • The farther away a shot is taken, the lower the likelihood it results in a goal
  • Type of Shot
    • Snap/Slap/Backhand/Wraparound/etc...
  • Rebound - Yes/No?
    • A rebound is defined as a shot taking place less than 4 seconds after a previous shot
  • Score Situation
    • Up a goal/down a goal/tied/etc…
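
To show how a fitted logistic regression turns these factors into a goal probability, and how the probabilities add up to a team's ExpG, here is a small Python sketch. The coefficients, category names and shots are all invented for illustration; they are not the fitted values from my model.

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT the fitted values).
COEF = {
    "intercept": -1.5,
    "distance": -0.04,  # per foot of adjusted distance
    "rebound": 0.6,
    "shot_type": {"wrist": 0.0, "slap": -0.1, "backhand": -0.2},
    "score_state": {"trailing": 0.1, "tied": 0.0, "leading": -0.1},
}

def goal_probability(shot):
    """Logistic model: linear predictor passed through the sigmoid function."""
    z = (COEF["intercept"]
         + COEF["distance"] * shot["distance"]
         + COEF["rebound"] * shot["rebound"]
         + COEF["shot_type"][shot["shot_type"]]
         + COEF["score_state"][shot["score_state"]])
    return 1 / (1 + math.exp(-z))

# Two hypothetical shots by the same team.
shots = [
    {"distance": 10, "shot_type": "wrist", "rebound": 1, "score_state": "trailing"},
    {"distance": 45, "shot_type": "slap", "rebound": 0, "score_state": "tied"},
]
probs = [goal_probability(s) for s in shots]
team_expg = sum(probs)  # a team's ExpG is the sum of its shot probabilities
print(probs, team_expg)
```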


In the two graphs below you can see how well ExpG, both offensively and defensively, correlates with actual results. Each point represents one team from one season, except 2012-2013 was removed due to the lockout. 

There will always be some outliers in a given season but I think the model does a relatively good job. The chart below shows that ExpG comes out on top when compared to Corsi and Scoring Chances in terms of correlation to real goals for and against in a given season.

                   Goals For    Goals Against
ExpG               0.58         0.60
Corsi              0.493        0.57
Scoring Chances    0.53         0.562

Future Work

In the coming weeks I will be focusing my efforts on two different aspects of this model. Firstly, I will investigate how well it predicts future goals from one season to the next, as well as something similar to what Micah Blake McCurdy did with score-adjusted Corsi. Secondly, I will be looking at other factors to add into the model. I plan on adding rush shots as a factor, though the current state of my data will require some tweaking before I can do that. I also plan on exploring the effects of incorporating shooter talent and goaltender talent, releasing ExpG at the player level and using aspects of this model to improve xSV%.

Expected Goals

I just wanted to thank War-On-Ice and Sam Ventura for the data used in this project. Finally, here are the results below. Note that dGF, dGA and dGF% are calculated as actual minus expected. I will give this spreadsheet its own tab at the top of this site too. Please let me know any questions or feedback you might have. Enjoy!

Tuesday, 31 March 2015

xSV% - Team Data

After my original post looking at xSV% exclusively at the individual goalie level, I received a few requests to look at the same data at the team level. Simple enough: presented below is xSV% team data from 2002-2014. If you didn't read my original post on xSV% you can do so here, but I am also going to follow up and reiterate what exactly xSV% entails.

What xSV% is:

  • Expected Save Percentage based on a 110 shot moving average of the opposing shooter's shooting percentage at the time of each shot faced by a goalie
  • Better players typically have a higher shooting percentage, therefore if a team limits their opponent's best players from shooting the puck, they will raise their own ExpSV%
  • Forwards typically have a higher shooting percentage, so if a team can limit the number of shots taken by an opposing team's forwards and instead force them to rely on their defencemen to generate shots, they will raise their own ExpSV%
  • ExpSV% is highly influenced by era. As shown in the graph below representing the league average Expected Save Percentage for each season, with the lockout lost season shown by the red line, scoring has been down in recent years since the lockout.
  • This era influence is the big reason why the 110 shot moving average is necessary. Simply using a single season's worth of data can sometimes not be enough. Likewise, using a player's career average shooting percentage can provide misleading results.
    • For example, the ever great Jaromir Jagr has a career shooting percentage of 13.7% that is heavily influenced by his earlier playing days. Jagr hasn't had a season of shooting that efficiently since 2005-2006. Therefore, a rolling average helps more accurately depict his current conversion ability.

What xSV% is not:

  • An all encompassing, all-knowing stat that gives the exact Expected Save Percentage for each team
  • A definitive ranking of how well teams manage to play defence


Below is all the team level data. Play around with it and please send me any feedback/questions you might have. Enjoy!

Friday, 13 March 2015

xSV% - Save Percentage Accounting for Shot Quality

While a wave of statistical assessment continues to flow through NHL analytics circles, the majority still cannot agree on how to evaluate goaltenders. Old school statistics that used to be staples in goalie evaluation, such as wins and GAA, will hopefully be put out to pasture. The only stat that is mildly useful and currently accessible to the general public is save percentage. Taking save percentage a slight step further was the discovery that using only even-strength play provides a clearer image of true ability. In this article I will introduce a brand new goalie stat which compares a goalie's current even-strength save percentage to what we would expect an average goalie's save percentage to be given the quality of competition faced by that goalie.

Shot Quality

Shot quality is a hotly debated subject within hockey analytics. Personally, I am of the belief that shot quality definitely exists in small samples, but as the sample is increased the effects will be diminished. Projects such as "The Royal Road" and the "Shot Quality Project" promise to provide unprecedented answers but I think the general public should always be sceptical of such broad claims reached by those with access to proprietary (private) data that hasn't been peer-reviewed. Therefore my version of shot quality presented here is built upon the same NHL play-by-play files available to everyone. 

What constitutes shot quality is another spot of debate. When considering shot quality, the most publicly analyzed forms include distance from the net, type of shot, rebound or not, etc. However, Tom Awad wrote an excellent article for Hockey Abstract 2014 where, in discussing shot quality, he determined that the majority of the difference in player finishing ability can be accounted for by varying levels of talent between players, not simply the factors stated earlier. Based on these findings I set out to create a baseline for how we would expect an average goalie to perform given this quality of competition.


No playoff games were considered in this study because I felt that a shooter facing the same goalie in 4-7 consecutive games might skew their data. Only 5 on 5 play was considered since we already know that it is preferable to all strength conditions. A player's finishing ability was calculated as a 110 shot running average of a player's shooting percentage. Using the same research methods I applied in earlier studies with regards to shooting stabilization, I found that at 5 on 5 a player (using both forwards and defence in this sample) will see their shooting percentage stabilize at around 110 shots. Using this rolling average instead of just a player's career average helps account for aging (players' skill sets do improve/deteriorate during their careers) and changes in league environment (shooting percentage was lower in 2014 than in 2002). If a player never amassed at least 110 shots in their career, they were given the shooting percentage of a replacement level player, set at 6.48% here. (If anyone finds a better number than 110 please let me know, it wouldn't be a real inconvenience to alter it).
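
Here is a rough Python sketch of that running average. Two details are assumptions made only for this sketch: the window covers the 110 shots before the current one, and a shooter is treated as replacement level until their window is full.

```python
from collections import deque

REPLACEMENT_SH = 0.0648  # replacement-level Sh% quoted above
WINDOW = 110             # shots in the running average

def rolling_sh(shot_results, window=WINDOW):
    """For each shot, the shooter's Sh% over their previous `window` shots.

    shot_results is the shooter's chronological record: 1 = goal, 0 = no goal.
    """
    recent = deque(maxlen=window)  # automatically drops the oldest shot
    estimates = []
    for is_goal in shot_results:
        if len(recent) == window:
            estimates.append(sum(recent) / window)
        else:
            estimates.append(REPLACEMENT_SH)
        recent.append(is_goal)
    return estimates

# Hypothetical shooter: 11 goals in their first 110 shots, then one more shot.
history = [1] * 11 + [0] * 99 + [1]
estimates = rolling_sh(history)
print(estimates[0], estimates[-1])
```

A goalie's Expected SV% would then be one minus the average of these shooter estimates over the shots they faced.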


I coined this new metric xSV%, which is simply the difference between the goaltender's actual save percentage and what we would expect an average goaltender to achieve in similar circumstances. Below is a density plot of xSV% compared to a normal distribution. We see that xSV% is fairly normally distributed with a slight right skew, most likely caused by the fact that we are restricting our sample. Limiting this sample to goalies with at least 500 shots faced removes outliers yet also skews the data by leaving us with a slight majority of higher quality goalies.

Below we see that as the number of shots faced increases, a goalie's expected save percentage (independent of their own talent) experiences less variance. The graph doesn't look like a drastic change but it is actually about a 33% drop. This gives evidence that the larger your shot sample, the less influence shot quality has.

Quick Observations

  • Expected Save Percentage is highly influenced by year. Highest ever was Josh Harding in 2013-2014 (.921) while the lowest was Ed Belfour in 2005-2006 (.905), minimum 500 shots faced.
  • Tim Thomas's 2010-2011 season was one for the record books. 
  • Martin Brodeur comes out positive despite a rough last few seasons.
  • Tomas Vokoun might be one of the most under-appreciated goalies in NHL history, though this analysis doesn't account for Nashville's notorious over-counting of shots. 
  • Braden Holtby is one of the top goalies in the league whether he is appreciated or not.
  • Luongo is great, end of story.


Quick review of the stats below:
  • xSV% = Actual SV% - Expected SV%
  • xSV (Goals Saved) = Expected Goals Against - Actual Goals Against
  • xSV%+
    • xSV% rated around 100
    • 100 means Actual SV% = Expected SV%
    • Greater than 100 is good, lower than 100 is bad

Obviously, this work is far from complete but I felt like it was time to share what I have so far. I have a few tweaks in mind that I hope might improve this metric in the future along with some follow up analysis of what I have so far. Let me know any thoughts or questions.

Below is a spreadsheet with all of the relevant information, along with some Tableau visualizations to help provide a greater understanding of the data. Everything below will be stored full time on a separate xSV% page, hopefully with data for the 2014-2015 season being added soon enough. Enjoy!