Thursday, 27 November 2014

Corsi Against Doesn't Correlate with Save Percentage


How does a goalie's workload affect their ability to perform? This question always seems to be bouncing around, and it has recently come up again with regard to whether a goalie's workload (the number of Corsi events they face) has a tangible impact on their save percentage.


Previous Literature 


The first analysis was done by Brodeur Is a Fraud and found little to no evidence of a correlation between the two variables. Another look was done over at Hockey-Graphs and found similar results with a different method:
"For the forty active goaltenders to play at least one hundred NHL games over the past four seasons, there is no substantial relationship in them playing better -in terms of save percentage- when facing more or less shots against."
Chris Boyle, in his own study at SportsNet, did seem to find quite a strong relationship, yet I have some serious doubts about the validity of his methodology. By looking at the raw shot counts and save percentages posted in individual games while removing goalies who didn't play the full game, you end up with a very serious survivorship bias problem. Why do goalies in this study who see a large number of shots against only post high save percentages? Most likely because a goalie who faces a large number of shots and doesn't post a high save percentage will allow a large number of goals, which leads to him being pulled from the game and therefore removed from the study. This removal doesn't happen for goalies who face a low number of shots while posting a low save percentage, because they can still allow only a few goals, giving their coach no incentive to pull them. For example, a goalie faces 20 shots and lets 2 in. That's a .900 save percentage, which in the big picture isn't good, but in an individual game allowing only two goals is just fine. Therein lies my issue with this study.
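
The survivorship effect described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions, not a reconstruction of the SportsNet data: every simulated goalie has the same true save percentage (0.915), shots per game are drawn uniformly from 15 to 45, and a goalie is pulled (and his game dropped) once he allows 5 goals. All of those numbers and the function names are hypothetical.

```python
import random

TRUE_SV = 0.915      # assumed: every goalie has the same true save percentage
PULL_AT_GOALS = 5    # assumed: a goalie is pulled once he allows this many goals


def simulate_games(n_games, seed=27):
    """Return (shots, goals, finished) tuples for simulated starts."""
    rng = random.Random(seed)
    games = []
    for _ in range(n_games):
        shots = rng.randint(15, 45)  # shots he would face over a full game
        goals = sum(rng.random() > TRUE_SV for _ in range(shots))
        finished = goals < PULL_AT_GOALS  # pulled starts leave the sample
        games.append((shots, goals, finished))
    return games


def mean_sv(games, lo, hi, finished_only):
    """Average single-game save percentage for games with lo <= shots <= hi."""
    values = [
        1 - goals / shots
        for shots, goals, finished in games
        if lo <= shots <= hi and (finished or not finished_only)
    ]
    return sum(values) / len(values)


games = simulate_games(20000)
for label, finished_only in (("all starts", False), ("full games only", True)):
    low = mean_sv(games, 15, 25, finished_only)
    high = mean_sv(games, 35, 45, finished_only)
    print(f"{label}: low-shot SV% {low:.3f}, high-shot SV% {high:.3f}")
```

In this sketch the two buckets look about the same when every start is counted, while the full-games-only sample shows a higher save percentage in the high-shot bucket even though every simulated goalie has identical talent.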

Finally, we arrive at the most recent post by David Johnson at Hockey Analysis who can summarize his own methods best:
"In my opinion, the proper way to answer the question of whether shot volume leads to higher save percentages is to look at how individual goalies save percentages have varied from year to year in relation to how their CA60 has varied from year to year. To do this I looked at the past 7 seasons of data and selected all goalie seasons where the goalie played at least 1500 minutes of 5v5 ice time. I then selected all goalies who have had at least 5 such seasons. There were 23 such goalies. I then took their 5-7 years worth of CA60 and save % stats and calculated a correlation between them."
Basically, he found the individual correlations for each goaltender and then averaged them. I noticed a few issues, starting with the fact that correlation coefficients aren't additive, so they shouldn't be averaged directly. You first need to convert them to Fisher z-values, which are additive. This issue is minor: when I ran his test again, the results didn't change much with this adjustment.
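
For readers who want to see the adjustment, here is a minimal sketch of averaging correlations through Fisher z-values: transform each r with the inverse hyperbolic tangent, average the z-values, then transform the mean back. The function name and the example correlations are hypothetical and are not the per-goalie numbers from the Hockey Analysis study.

```python
import math


def fisher_mean(correlations):
    """Average correlations in Fisher z space, then convert back to r."""
    z_values = [math.atanh(r) for r in correlations]  # r -> z
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)                          # z -> r


# Hypothetical per-goalie correlations, for illustration only.
example = [0.80, 0.10, -0.30, 0.55]
print(f"plain mean:  {sum(example) / len(example):.3f}")
print(f"Fisher mean: {fisher_mean(example):.3f}")
```

The two averages differ most when some of the individual correlations are large in magnitude, because the z-transform stretches values near ±1.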

The second issue I have is with the claims made based on this study, starting with the use of the word "boost" in the title. It implies causation, which I am not convinced of (we simply see a correlation via his methodology), and it implies that the correlation is only positive, meaning that an increase in CA/60 results in an increase in SV%. Examine the data more closely and you find that 8 of the 23 goalies saw the inverse effect (more shot-attempts against lowered their SV%), while another two saw essentially zero change in SV% in relation to the shot-attempts they faced. This leaves only 13 goalies with a positive correlation. My issue is that the author treats CA/60 boosting save percentage as a uniform result that applies across the board to all goalies, when he is really only talking about a specific subgroup. Later in this post I will explain my doubts about his methods and why I believe he simply found a false positive for a relationship that doesn't exist.
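
As a rough sanity check (a sketch, not part of the Hockey Analysis study), a two-sided sign test asks how often fair coin flips would produce a split at least as lopsided as 13 positive to 8 negative, setting aside the two goalies at roughly zero. The function name is hypothetical, and treating each goalie's sign as an independent coin flip is an assumption.

```python
from math import comb


def sign_test_p(positives, negatives):
    """Two-sided sign test: chance of a split at least this lopsided if each
    sign were a fair coin flip. Ties (zero correlations) are left out."""
    n = positives + negatives
    k = max(positives, negatives)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


p_value = sign_test_p(positives=13, negatives=8)
print(f"p = {p_value:.3f}")
```

On these counts the p-value comes out around 0.38, so a 13-to-8 split is well within what coin flips alone would produce.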

My Findings


I tweeted this graph out earlier when this question was first raised on Twitter. It is a very basic graph that took me a few minutes to put together, but you can see that the impact of a team allowing more shot-attempts against on its save percentage is essentially zero.


These next few charts look at the individual goalie level. I set a different cutoff in each graph to see if we could weed out some goalie talent, since better goalies tend to play more minutes (unless your team is located in Winnipeg), and we still aren't able to find any strong evidence (the correlation does increase as we narrow the sample, but only from about 0 to 3%).


This graph below is the same as the ones above but only using the data included in the Hockey Analysis study.



Since none of the graphs I managed to produce were able to find any correlation I decided to try my own blind recreation of the method used at Hockey Analysis. Below are two graphs very similar to the graphs first produced at Hockey Analysis that seemed to demonstrate the correlation between CA/60 and SV%. I have removed the titles of these two to add an element of surprise. Take a quick look at both before finding their titles below. 



***

***

***

Surprised? This is my basic way to suggest that the results shown in Hockey Analysis' study could be the result of simple random variation. Pekka Rinne's chart shows how one of these samples can be pretty much out of whack at the individual level, while the Niemi vs. Howard chart shows that even with two variables that we know for a fact should have zero correlation with each other, it can be pretty easy to discover a relationship that doesn't actually exist when dealing with such small samples, in this case only 5 seasons (or data points).
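
To get a feel for how often this happens, here is a small Monte Carlo sketch. It assumes two completely independent, normally distributed series (a stand-in for one goalie's CA/60 and an unrelated goalie's SV%) and counts how often their correlation exceeds 0.5 in absolute value. The function names, the 0.5 threshold, and the trial count are all illustrative choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def share_of_strong_r(n_points, threshold=0.5, trials=5000, seed=5):
    """Fraction of trials where two unrelated series show |r| > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]  # independent of xs
        if abs(pearson_r(xs, ys)) > threshold:
            hits += 1
    return hits / trials


results = {n: share_of_strong_r(n) for n in (5, 7, 30)}
for n, share in results.items():
    print(f"{n} points: {share:.1%} of unrelated pairs have |r| > 0.5")
```

With only five points, a sizeable share of completely unrelated pairs clear |r| > 0.5 (on the order of four in ten in this sketch), and the share shrinks quickly as the number of points grows.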

The chart below shows the data on the correlations found by Hockey Analysis. I took the liberty of converting them to Fisher z-values and then taking the inverse of the averaged z-value, which is the real correlation that he was looking for. So in actuality his correlation was higher than he first reported. To make things simpler I have starred (*) the important column, which holds the true correlation.

Average Correlation    Average Fisher    Average Fisher Inverse*
0.183                  0.215             0.212



The issue, as you may have seen above in the Niemi vs. Howard chart, is that it is very easy with this data set and this method to find correlations that we know for a fact shouldn't exist. Below I calculated 23 correlations and their corresponding Fisher values in my blind test. I simply put the goalies in alphabetical order and compared the CA/60 for goalie A with the SV% of goalie B.



Correlation    Fisher
-0.292         -0.301
 0.098          0.098
-0.098         -0.098
 0.730          0.930
 0.631          0.743
-0.407         -0.432
-0.726         -0.919
 0.116          0.117
 0.536          0.599
 0.117          0.118
 0.126          0.127
-0.230         -0.234
-0.131         -0.132
 0.338          0.351
 0.586          0.671
 0.468          0.507
-0.631         -0.744
-0.383         -0.403
-0.616         -0.718
-0.708         -0.882
-0.213         -0.217
-0.095         -0.095

Average        Average Fisher
-0.784         -0.916

Fisher Inverse*
-0.724
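
For anyone who wants to re-run the arithmetic on this table, the sketch below copies the listed rows and recomputes the column totals, the per-pair means, and the back-transformed mean. It uses only the rounded values printed above, so small differences from the unrounded figures are expected.

```python
import math

# (correlation, Fisher z) pairs copied from the blind-test table above.
rows = [
    (-0.292, -0.301), (0.098, 0.098), (-0.098, -0.098), (0.730, 0.930),
    (0.631, 0.743), (-0.407, -0.432), (-0.726, -0.919), (0.116, 0.117),
    (0.536, 0.599), (0.117, 0.118), (0.126, 0.127), (-0.230, -0.234),
    (-0.131, -0.132), (0.338, 0.351), (0.586, 0.671), (0.468, 0.507),
    (-0.631, -0.744), (-0.383, -0.403), (-0.616, -0.718), (-0.708, -0.882),
    (-0.213, -0.217), (-0.095, -0.095),
]

r_sum = sum(r for r, _ in rows)
z_sum = sum(z for _, z in rows)
mean_r = r_sum / len(rows)
mean_z = z_sum / len(rows)

print(f"rows listed: {len(rows)}")
print(f"sum of r: {r_sum:.3f}, sum of z: {z_sum:.3f}")
print(f"mean r: {mean_r:.3f}, mean z: {mean_z:.3f}")
print(f"mean z converted back to r: {math.tanh(mean_z):.3f}")
print(f"largest |r| in the table: {max(abs(r) for r, _ in rows):.3f}")
```

On the 22 rows as listed, the columns sum to about -0.784 and -0.914, which lines up with the figures in the "Average" row, so that row appears to hold column totals and the per-pair mean works out much closer to zero (roughly -0.04). The individual pairs still reach 0.73 in absolute value, which is the false-positive problem at issue here.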

We know from common sense and logic that the number of shot-attempts faced by Evgeni Nabokov will have no effect on Henrik Lundqvist's save percentage, but the numbers actually show a correlation (.73). This is obviously a false positive, showing a correlation that doesn't truly exist. Simply stated, correlation doesn't prove causation, and with samples this small a correlation may not reflect any real relationship at all. Based on what I have found here and the earlier research done on the subject, I feel confident in stating there is still little to no evidence relating the Corsi Against a goaltender faces to their save percentage.



You can reach me via email here: DTMAboutHeart@gmail.com or via Twitter here: @DTMAboutHeart







3 comments:

  1. "This is my basic way to suggest that the results shown in Hockey Analysis' study could be the result of simple random variation."

    If it were random, we would expect an equal number of goalies with positive correlations as negative ones and they would be equally distributed around zero. We did not see this.

    If we assume that CA60 was correlated to save percentage we would expect to see the exact results that I observed. Specifically that the average correlation should steadily rise as we restricted to only goalies with higher variations in CA60 (essentially removing goalies that might have a low signal to noise ratio) and restricted to goalies that didn't change teams (i.e. eliminate goalies that might have high noise factor due to the complete change in roster and possible style of play). If it were random, one would not expect these outcomes.

    It's great that you are good at applying statistical models, but you still need to apply a little logic and analysis to the results.

    As for your concern about me suggesting "boost" I said 'does it' not 'it does' though I did provide a reasoned hypothesis of why such a relationship exists (which I hope to test further in the future).

  2. I looked into it and it actually is fairly normally distributed. (Mean: 0.183, STD Deviation: 0.483). 8 Negative, 2 Zero, 13 Positive.

    I think logic dictates that if you are looking at flawed data, restricting that flawed data can still give you flawed results. For example, Average (CA60 StdDev>3.5) = 0.256 which is lower than that of >2 (0.264) and >3 (0.292). These are such small samples that slight changes can swing the observed results pretty wildly.

    Replies
    1. "I looked into it and it actually is fairly normally distributed. (Mean: 0.183, STD Deviation: 0.483). 8 Negative, 2 Zero, 13 Positive."

      Except when you restrict to only goalies with StdDev(CA60)>2 to drop out those with lowest signal to noise ratios it now becomes 13 positive, 2 zero and 5 negative and a mean of 0.264. Of those, only Bryzgalov and Howard have correlations below -0.35 while 11 goalies (Brodeur, Vokoun, Backstrom, Niemi, Price, Kiprusoff, Ward, Thomas, Quick, Smith and Lundqvist) are above 0.40.

      There is variability and uncertainty in the data (Why is Jimmy Howard such an outlier? Did the retirements of Rafalski and then Lidstrom play a factor?) but the overall bias is clearly towards there being a relationship.
