Monday, 22 June 2015

Clustering NHL Forwards (using k-means)

How do you classify hockey players? Many would argue for the classical six positions (C, LW, RW, LD, RD, G), while some would argue for a rover (see picture above). I suggest a different distinction. Obviously, goalies are their own entity, so they're excluded from this analysis. That leaves skaters, which I will further break down into forwards and defence. Forwards and defence tend to have very distinct roles, with a few exceptions (D. Byfuglien and B. Burns). In this post I am going to focus on forwards. It isn't easy to decide just which position a forward plays. Don't bother asking the PHWA (see the Ovechkin debacle), because they obviously can't tell. Listed positions are no help either, since many positional declarations are hilariously out of place (e.g. Zetterberg is listed as a LW despite taking over 1000 face-offs last season, which places him 48th in the entire NHL). Then there is the issue of 1st/2nd/3rd/4th line. These roles are usually overstated by most media types, and then there is a designation problem: if a player performs like a 1st liner but his coach plays him on the 2nd line, what really is he? I know, deep stuff. Long story short, breaking players down into categories is easier said than done.

K-Means Clustering

Therefore, I set out on a fun exercise to reclassify forwards based on their playing characteristics. I used k-means clustering to break the players down into 8 categories based on these characteristics. I want to stress that these measurements are meant to reflect a player's playing style, not how well or poorly they performed. The chart below shows the average of each measurement broken down by cluster. I named the clusters myself, somewhat arbitrarily, so you shouldn't read too much into the names. Come up with your own if you want. (You should click on that picture if you want to look at the cluster characteristics more carefully.)
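
If you want to try something similar yourself, here is a minimal sketch of k-means in plain Python. Everything in it is an assumption for illustration: the player names, the three features and their values are made up, the features are standardized first (the post does not say how the measurements were scaled), the starting centroids are picked with a deterministic farthest-first rule so the result is reproducible, and it uses k=2 on a toy sample rather than the 8 clusters above.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic init (an assumption): start at the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign to nearest centroid, then recompute centroids."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels, centroids

# Hypothetical per-60 style measurements: [shots, hits, takeaways]
players = {
    "Player A": [11.0, 2.0, 1.0],
    "Player B": [10.0, 3.0, 1.5],
    "Player C": [12.0, 2.5, 1.2],
    "Player D": [5.0, 12.0, 0.8],
    "Player E": [4.5, 14.0, 1.1],
    "Player F": [6.0, 11.0, 0.9],
}

labels, centroids = kmeans(standardize(list(players.values())), k=2)
for name, label in zip(players, labels):
    print(name, "-> cluster", label)
```

Standardizing matters because k-means works on distances: without it, whichever measurement has the largest raw numbers would dominate the clusters.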

Here is a random sample of ten players and which cluster they belong to. Please don't get too upset if you don't like a certain player's cluster. Remember these clusters group players by "playing style" not skill level. 

Cluster Features

Below are some box-and-whisker plots which break down the clusters by Corsi%, Age, TOI/GM and AAV. Here is a quick run-through of how to read a box-and-whisker plot:
  • The big solid line going down the whole graph is the mean value for the whole sample. Example: The mean Corsi% for all forwards is 50%.
  • Within each box is another solid line that marks the median value for that cluster (the middle value of that cluster). This is the median, not the mean.
  • The box itself spans the lower and upper quartiles, from the 25th to the 75th percentile.
  • The whiskers cover the bottom and top 25% of values, excluding outliers.
  • Dots denote outliers.
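
To make the list above concrete, here is a small sketch that computes the numbers a box plot draws. The Corsi% values are made up, and the outlier rule (anything more than 1.5 times the interquartile range beyond the box) is a common convention that I am assuming here; the plots in this post may have used a different rule.

```python
from statistics import quantiles

def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers as drawn by a typical box plot."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),    # lowest value that is not an outlier
        "whisker_high": max(inside),   # highest value that is not an outlier
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
    }

# Hypothetical Corsi% values for one cluster
corsi = [44.0, 47.5, 48.0, 49.0, 50.0, 51.0, 52.0, 53.5, 62.0]
stats = box_stats(corsi)
print(stats)
```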


  • All-Around seems like the group a player would want to be in, but it still encompasses a range of players from Sidney Crosby to Manny Malhotra.
  • "Safe" Depth is labelled as such because it is populated by lower-end players (see AAV), yet their Corsi compares favourably with the Depth cluster.
  • High Impact players do a bit of everything, including in areas that don't involve scoring, e.g. they draw more penalties than they take while dishing out more hits than they receive.
  • Power Forwards are big guys (yes, I subjectively looked at the cluster of players for 30 seconds and thought I saw a bunch of perceived power forwards) who prefer to pass more than they shoot but also take more penalties than they draw, probably due to a lack of foot speed.
  • Depth players, while few in number (only 13), dish out hits like crazy yet clearly trail in the Corsi%, TOI/GM and AAV categories.
  • Passers create a lot of opportunities for their teammates and are wizards at taking the puck away from opponents more often than they give it up.
  • Depth Scorers are typically young players who have been held down in the lineup by their coach yet can really shoot the lights out.
  • Scoring Wingers are very similar to Depth Scorers yet have been given a larger playing opportunity.
I would love to do the same exercise for defencemen, but there doesn't seem to be enough distinction using my current attribute metrics. Maybe I will discover a better way to classify them in the future, who knows.
Please let me know if you have any thoughts or questions. You can comment below or reach me via email or on Twitter: @DTMAboutHeart

Wednesday, 3 June 2015

Updated xSV% - Save Percentage Accounting for Shot Quality

Goalies are voodoo. That should probably be added as the 11th Law of Hockey Analytics. Goaltending analysis is currently one of the most lacking subjects within hockey analytics. Great strides have been made, however, with 5v5 SV%, AdjustedSV% and High Danger SV%. A few months back I revealed a statistic I referred to as xSV%. The original article has the details but, more or less, I calculated a goaltender's Expected SV% (ExpSV%) from the quality of the shooter on each shot faced, using a rolling average of the shooter's individual shooting percentage. xSV% is a goalie's actual SV% minus their ExpSV%. A higher (positive) xSV% is good while a lower (negative) xSV% is bad. I have thought more about that methodology since posting that article and eventually decided that with some substantial changes I could greatly improve upon xSV%. I basically started from scratch to develop this latest rendition of xSV%, which is based on the factors in my ExpG model combined with a regressed shooting percentage for each shooter (same mindset as the original xSV%, but a different process).


The basis of a goalie's Expected SV% comes from the same model I used in my latest ExpG model. Here is a quick breakdown of the different variables and a brief explanation of why they are included in the model:
  • Adjusted Distance
    • The farther from the net a shot is taken, the lower its likelihood of resulting in a goal
  • Type of Shot
    • Snap/Slap/Backhand/Wraparound/etc...
    • Different types of shots have different probabilities of resulting in goals
  • Rebound - Yes/No?
    • A rebound is defined here as a shot taking place less than 4 seconds after a previous shot
    • Rebounds are more likely to result in a goal than non-rebounds
  • Score Situation
    • Up a goal/down a goal/tied/etc…
    • It has been proven that Sh% rises when teams are trailing and vice versa
    • This adjustment, while only slight, helps to account for a variety of other aspects that we are currently unable to quantify yet have an impact on goal scoring
  • Rush Shot - Yes/No?
    • Shots coming off the rush are more likely to result in a goal than non-rush shots
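
As one concrete piece of the list above, the rebound definition can be sketched in a few lines of Python. The `Shot` record and its fields are hypothetical, the shots are assumed to be in chronological order, and requiring both shots to be in the same period is my assumption; only the 4-second window comes from the definition above.

```python
from dataclasses import dataclass

REBOUND_WINDOW_SECONDS = 4  # "less than 4 seconds after a previous shot"

@dataclass
class Shot:
    period: int   # hypothetical field: period number
    seconds: int  # hypothetical field: seconds elapsed in that period

def flag_rebounds(shots):
    """Return one boolean per shot: True if it came less than
    4 seconds after the previous shot in the same period."""
    flags = []
    prev = None
    for shot in shots:
        is_rebound = (
            prev is not None
            and shot.period == prev.period
            and 0 <= shot.seconds - prev.seconds < REBOUND_WINDOW_SECONDS
        )
        flags.append(is_rebound)
        prev = shot
    return flags

shots = [Shot(1, 30), Shot(1, 32), Shot(1, 36), Shot(1, 300), Shot(2, 1)]
rebound_flags = flag_rebounds(shots)
print(rebound_flags)
```
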
Now that we have the structure of our ExpSV%, we need to add shooter talent into the mix, since the model currently assumes league average shooting talent for each shot, which we know is not the case in reality. Generally, a shot from Sidney Crosby is more likely to result in a goal than a shot from George Parros. So I wanted to make a multiplier for each player in each season to get the best estimate of their personal effect on each shot's probability of resulting in a goal.

Using the Kuder-Richardson Formula 21 (KR-21), I found that 5-on-5 Sh% stabilizes for forwards at about 375 shots, while 5-on-5 Sh% for defencemen begins to really stabilize around 275 shots. Therefore, for each season I added these shots (375 for forwards, 275 for defencemen) to a player's total shots. I also added a number of goals equal to the added shots (375 or 275) multiplied by the league average Sh% for that season (forwards and defence had different league average Sh%). This allowed me to calculate a regressed Sh% (rSh%) from these new shot and goal totals. I then divided rSh% by the league average Sh% for the player's position to give a Shot Multiplier, and applied this Shot Multiplier to each shot the player took in that season. In case you didn't quite follow that rough explanation, here is an example of how this process played out for Steven Stamkos' 2011-2012 season, the highest rSh% season since 2007-2008:
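
The arithmetic of the regression step can also be sketched in Python. The 375 and 275 shot paddings come from the paragraph above; the player totals, the 8% league average Sh% and the 6% baseline goal probability are invented round numbers for illustration, not real data, and the function name is my own.

```python
# Padding shots by position, from the stabilization points quoted above
PADDING_SHOTS = {"F": 375, "D": 275}

def shot_multiplier(goals, shots, position, league_sh_pct):
    """Return (regressed Sh%, Shot Multiplier) for one player-season.

    Adds league-average shots and goals to the player's totals, then
    divides the regressed Sh% by the league average for that position.
    """
    pad = PADDING_SHOTS[position]
    reg_goals = goals + pad * league_sh_pct
    reg_shots = shots + pad
    r_sh_pct = reg_goals / reg_shots
    return r_sh_pct, r_sh_pct / league_sh_pct

# Hypothetical forward: 20 goals on 125 shots, league average Sh% of 8%
r_sh, mult = shot_multiplier(goals=20, shots=125, position="F", league_sh_pct=0.08)
print(f"rSh% = {r_sh:.3f}, Shot Multiplier = {mult:.2f}")

# Applying the multiplier to one shot's baseline goal probability
# (0.06 is a made-up model output for a league-average shooter)
base_goal_prob = 0.06
adjusted_goal_prob = base_goal_prob * mult
print(f"adjusted goal probability = {adjusted_goal_prob:.3f}")
```

In this made-up case the raw Sh% is 16%, but after padding with 375 league-average shots the regressed Sh% is 10%, so the player's shots count as 1.25 times as dangerous as an average forward's.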


The big issue with goalie metrics has always been how well they actually represent true ability. Sample size is a frequent issue with goaltenders and it has been shown countless times that the only way to truly get a good idea of a goaltender's ability is to take very large shot samples. These samples typically take multiple years to accumulate. Looking at year-to-year correlations, so far, it seems as though xSV% is on par with 5v5 SV%. The interesting difference I found, with the vital help of @MannyElk, is shown in the graph below. To paraphrase his earlier work, the experiment was done by drawing random samples of n games for each 40+ GP goalie season since 2007 and each sample was compared with the goalie's xSV% over the whole season. Essentially, the graph below shows that xSV% will show us more signal than noise sooner than 5v5 SV%.
**Disclaimer** Manny and I are not 100% sure of the results here so if anyone has suggestions please reach out. 
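
For anyone who wants to poke at this kind of experiment, here is a rough sketch of the sampling idea in Python. It is not the code Manny and I used: the goalie seasons are simulated, the 8% baseline goal rate and the skill range are made up, and drawing games without replacement is an assumption.

```python
import random
from math import sqrt

def xsv_pct(games):
    """xSV% = actual SV% - expected SV% over (shots, goals, exp_goals) games."""
    shots = sum(g[0] for g in games)
    goals = sum(g[1] for g in games)
    exp_goals = sum(g[2] for g in games)
    return (1 - goals / shots) - (1 - exp_goals / shots)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def sample_vs_season(seasons, n, rng):
    """Correlate xSV% in a random n-game sample with full-season xSV%."""
    sampled = [xsv_pct(rng.sample(games, n)) for games in seasons]
    full = [xsv_pct(games) for games in seasons]
    return pearson(sampled, full)

def fake_season(rng, skill, n_games=40):
    """Simulated season: higher skill means fewer goals than expected."""
    games = []
    for _ in range(n_games):
        shots = rng.randint(20, 35)
        exp_goals = shots * 0.08  # made-up league-average goal rate
        goals = sum(rng.random() < 0.08 - skill for _ in range(shots))
        games.append((shots, goals, exp_goals))
    return games

rng = random.Random(42)
seasons = [fake_season(rng, skill=rng.uniform(-0.01, 0.01)) for _ in range(30)]
for n in (5, 10, 20, 40):
    print(n, "games:", round(sample_vs_season(seasons, n, rng), 3))
```

Running the same loop for a second metric (5v5 SV%, for instance) and comparing how quickly each correlation climbs with n is the comparison the graph is making.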


Quick review of the stats below:

  • xSV% = Actual SV% - Expected SV%
  • dGA (Goals Prevented Above Expectation) = Expected Goals Against - Actual Goals Against
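
The two definitions above can be sketched in Python as follows. The shot list is invented for illustration, and the function and field names are my own; each shot is assumed to carry the goal probability from the expected-goals model plus whether it actually went in.

```python
def goalie_summary(shots):
    """shots: list of (expected_goal_probability, was_goal) pairs."""
    n = len(shots)
    exp_ga = sum(p for p, _ in shots)              # Expected Goals Against
    actual_ga = sum(1 for _, goal in shots if goal)  # Actual Goals Against
    exp_sv = 1 - exp_ga / n
    actual_sv = 1 - actual_ga / n
    return {
        "ExpSV%": exp_sv,
        "SV%": actual_sv,
        "xSV%": actual_sv - exp_sv,   # Actual SV% - Expected SV%
        "dGA": exp_ga - actual_ga,    # Expected GA - Actual GA
    }

# Hypothetical game: 10 shots worth 2 expected goals, 1 actual goal allowed
faced = [(0.1, False)] * 5 + [(0.3, False)] * 4 + [(0.3, True)]
summary = goalie_summary(faced)
print(summary)
```

Note that dGA is just xSV% multiplied by the number of shots faced, so the two stats tell the same story on different scales: one per shot, one in total goals.
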
Below is a wonderful Tableau visualization created by @Null_HHockey, as well as a spreadsheet with all of the relevant information. Once again big thanks to the guys at War-On-Ice for all their help (and data). Everything below will also be stored full time on a separate xSV% page. Enjoy!