I know I haven’t written anything for a week, but I’ve been hard at work. For the last week I’ve been working on improvements to the forecast algorithm, particularly the pitching side. Through December and January, I was able to incorporate the component scores into the hitter forecasts and produce an improvement over the whole stat-line approach I had been using. I’ve been trying to do the same thing for pitchers, and just this morning cracked the ‘prior performance’ barrier. While I’m still working on the improvements, I felt good enough about them to incorporate them into a new model run. While the changes were dramatic for some pitchers, the effects on teams weren’t so large, and the new method does not shake up the standings. But new standings, new depth charts, and new projections are on-line.
Like the old system, the projection is based on a Marcel-like baseline. Where it differs from Marcel is that the different statistics have different weighted averages, and I use the translated data throughout the process. Strikeouts are very heavily weighted towards the most recent season: roughly 5-2-1, rounding off, with a small (~15%) regression-to-the-mean component. Walks and groundball rates are also highly slanted, though not as much as Ks. At the other extreme, hit rates are weighted almost equally across seasons (1.2 – 1.1 – 1), with an 85% regression to the mean, which is why stats like FIP work. Once the baselines are calculated for everybody, I go through a similar-player search, see how those similar players deviated from their baselines in the following year(s), and apply those deviations to the player being projected. Once all this is done, I run the player through the translation routine backwards to get his stats back into an expected-2012 performance baseline.
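To make the weighting concrete, here is a minimal sketch of a baseline of that kind. It is not the actual program: the function name, the sample rates, and the league means are all made up for illustration, and it ignores playing time, which a real Marcel-style system would also weight by. Only the 5-2-1 / 15% and 1.2-1.1-1 / 85% figures come from the description above.

```python
def baseline_rate(rates, weights, league_mean, regression):
    """Weighted average of seasonal rates (most recent season first),
    then pulled toward the league mean by the given regression fraction."""
    weighted = sum(r * w for r, w in zip(rates, weights)) / sum(weights)
    return (1 - regression) * weighted + regression * league_mean

# Hypothetical pitcher, strikeouts per nine innings, most recent season first.
# The league mean of 7.0 is an assumed number, not a real league figure.
k_baseline = baseline_rate([9.0, 7.0, 6.0], weights=(5, 2, 1),
                           league_mean=7.0, regression=0.15)

# Same hypothetical pitcher, hits per ball in play, with an assumed league mean of .300.
h_baseline = baseline_rate([0.320, 0.290, 0.300], weights=(1.2, 1.1, 1.0),
                           league_mean=0.300, regression=0.85)

print(round(k_baseline, 3), round(h_baseline, 4))
```

With these invented inputs the strikeout baseline stays close to the most recent season, while the hit rate lands almost exactly on the league mean no matter what the pitcher did, which is the contrast the weights are meant to capture.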
I’m testing the new projection system against the set of all pitchers who had 50 major league innings in 2011, who pitched for only one team in 2011, and who had a major league appearance in 2010. My note says that is 437 pitchers. I’m only looking at five top-level stats for judgment: hits, walks, strikeouts, home runs, and runs allowed. The projection is normalized to the actual innings pitched in 2011, and I just look at the resulting errors, tabulated.
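For anyone who wants to run the same kind of check, the scoring can be sketched roughly like this. The dictionary layout, function names, and the two sample pitchers are invented for the example; the only things taken from the description above are the five stats, the scaling of each projection to actual innings, and the root-mean-square error.

```python
import math

STATS = ("H", "HR", "BB", "SO", "R")

def scale_to_innings(projection, actual_ip):
    """Rescale a projected stat line to the innings the pitcher actually threw."""
    factor = actual_ip / projection["IP"]
    return {s: projection[s] * factor for s in STATS}

def rmse_by_stat(projections, actuals):
    """Root-mean-square error for each stat across all pitchers."""
    squared = {s: 0.0 for s in STATS}
    for proj, act in zip(projections, actuals):
        scaled = scale_to_innings(proj, act["IP"])
        for s in STATS:
            squared[s] += (scaled[s] - act[s]) ** 2
    return {s: math.sqrt(squared[s] / len(actuals)) for s in STATS}

# Two made-up pitchers, purely to exercise the functions.
projections = [
    {"IP": 100, "H": 100, "HR": 10, "BB": 30, "SO": 80, "R": 50},
    {"IP": 60, "H": 60, "HR": 6, "BB": 24, "SO": 48, "R": 30},
]
actuals = [
    {"IP": 200, "H": 190, "HR": 22, "BB": 60, "SO": 170, "R": 95},
    {"IP": 60, "H": 50, "HR": 8, "BB": 24, "SO": 58, "R": 35},
]

errors = rmse_by_stat(projections, actuals)
total = sum(errors.values())  # same idea as the "Sum=" column in the results
print(errors, total)
```

Because the projection is scaled to actual innings first, the errors measure how well the rates were projected, not how well playing time was guessed.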
Here’s the root-mean-square error you get from just using the player’s 2008-10 (major league) stats as your 2011 projection:
Hits 14.03 HR 4.09 BB 8.74 SO 12.72 R 11.48 Sum= 51.06
Same thing, but using his translated stats for 2008-10 as the projection:
Hits 13.76 HR 3.57 BB 8.01 SO 14.26 R 10.66 Sum= 50.26
Lower is better, so this gives us the not terribly surprising result that using reasonably adjusted minor league data in addition to major league data is better than major league data alone. Incidentally, if I use the luck-free runs allowed instead of actual runs – that would be calculated runs, using a normal number of H/BIP and HR/FB – the run error would drop almost a run, to 9.85.
Here are the results of the program I’d been using to produce projections for the past two months:
Hits 11.66 HR 3.27 BB 7.62 SO 13.03 R 9.88 Sum= 45.46
And here are the results I’m getting from the new version, as of 11:00 Sunday night:
Hits 11.48 HR 3.33 BB 7.59 SO 13.16 R 9.47 Sum= 45.03
I’m more than a little annoyed at seeing the strikeout numbers trend backwards (home runs slipped a hair, too); on the other hand, the improvements in hits, walks, and runs suggest that I’ve got a blind spot (a hole in my swing, as it were), probably a calculation error that should lead to a nifty improvement once I track it down.
In case you were wondering about over-fitting, I am also checking the routines against 2009 and 2010 pitchers, who are not part of the test set. The improvements there are about 3/4 the size of the 2011 improvements, which suggests some mild overfitting, but not enough for me to get worked up over. At least not yet.
While everything on this site is free, a donation through Paypal to help offset costs would be greatly appreciated. -Clay