Thoughts on Pitcher Value
After last week's post that awarded Gold Gloves to Major League Baseball's most outstanding defensive designated hitters, I want to go back to some "basics" this week. Although I (obviously) really enjoy reading and learning about sabermetric approaches to baseball, and use them in what I write, I also have to remind myself, when rereading things I've posted, that not long ago I wouldn't have understood more than half of what I'm writing these days.
I'm hardly an expert in sabermetrics, and this shouldn't be considered an "introduction" in itself. However, I did want to have a basic "reference text" that explains some (but not all!) of the stats and concepts that I like to use to evaluate pitcher performance, one that I can simply link to in future posts. I will also explain which stats I like to use for what and why. Some of this may be old hat to many, some of it may not. I hope it will also serve to explain my reasoning for preferring one stat over another. So read on, and if you're not careful, you might even learn something before we're through.

You Start At the Very Beginning: Replacement Level as a Baseline
There are plenty of places to read about replacement level and the debates about whether replacement-level players "actually exist," whether they are just a helpful idealization, how replacement level should be set, and so on. You can google these sorts of things yourself if you like (Wikipedia!, although it assumes that Keith Woolner's version is canonical). In short, the idea is that some below-average players can still help a team win: they are below average, but still better than a triple-A lifer (a replacement-level player).
In a way, it's easier to understand the "below-average-but-above-replacement-level" notion with pitching than with position players. If we assume that #3 starting pitchers are average starters, and #1s and #2s are above average, then we'd assume that #4 and #5 starters are below average. But those guys are (hopefully) better than your typical scrubs -- they're helping the team even though they aren't average. (For an excellent discussion and analysis of "rotation slots," see this outstanding, must-read post by NYRoyal.)
Theoretically, a team of idealized replacement-level players would have a .300 winning percentage, or be expected to win about 49 games in a 162-game season. As for converting runs into wins above replacement: in the current run environment, something like 10 runs scored or saved above replacement is worth about one win.
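Here's a quick check of the arithmetic in that paragraph, as a minimal Python sketch. The function names are just illustrative, and the 10-runs-per-win figure is the rough rule of thumb from above, passed in as a parameter rather than treated as a fixed constant.

```python
def replacement_team_wins(win_pct: float = 0.300, games: int = 162) -> float:
    """Expected wins for a team of idealized replacement-level players."""
    return win_pct * games


def runs_to_wins(runs: float, runs_per_win: float = 10.0) -> float:
    """Convert runs above replacement to wins with a flat runs-per-win rule of thumb."""
    return runs / runs_per_win


print(round(replacement_team_wins()))  # 0.300 * 162 = 48.6, so about 49 wins
print(runs_to_wins(53.04))             # roughly 5.3 wins
```

A flat runs-per-win divisor is the simple version; it shifts with the run environment, which is why it's a parameter here.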
In any case, I've strayed from the narrow path of getting to the point. For all players, how many runs or wins they contribute above replacement is defined by this quite simple formula:
rate above replacement times playing time = value above replacement level.
Pretty obvious, no?
What is the replacement level for pitchers? With position players, one easy way to think about it is that a replacement-level player is something like 2.25 wins below average over a full season of full-time play. For pitchers, however, things aren't that straightforward, since their playing time works out differently. While more sophisticated people calculate a winning percentage (no, not pitcher wins) for pitchers, for now I use something a bit more straightforward: a percentage of the league-average rate stat as replacement level.
I will follow Tom Tango's suggestion that for starters, replacement rate is 128% (remember, lower is better for pitchers!) of the league average. For relievers, replacement level is 107% of league average.
Starters: ((lgavgRate*1.28)-playerRate)*(IP/9.0)
Relievers: ((lgavgRate*1.07)-playerRate)*(IP/9.0)
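The two formulas boil down to one function, sketched here in Python on the assumption that the rate stat is already on a runs-per-nine scale (ERA, RA, FIP) and with no park adjustment. The function and constant names are mine, not from any library. One caution: plugging in the Derek Lowe inputs given below (4.47 league ERA, 3.24 ERA, 211 IP) returns roughly 58.2, not the 53.04 quoted there; 53.04 is what you get with a league ERA of about 4.30, so one of those inputs appears to be off.

```python
STARTER_REPLACEMENT = 1.28   # replacement rate as a multiple of league average (starters)
RELIEVER_REPLACEMENT = 1.07  # same, for relievers


def runs_above_replacement(lg_rate: float, player_rate: float, innings: float,
                           starter: bool = True) -> float:
    """Runs saved above replacement for a pitcher.

    lg_rate and player_rate must be on the same runs-per-nine scale.
    Lower is better for pitchers, so a positive result means the pitcher
    beat the replacement rate.
    """
    multiplier = STARTER_REPLACEMENT if starter else RELIEVER_REPLACEMENT
    replacement_rate = lg_rate * multiplier
    return (replacement_rate - player_rate) * (innings / 9.0)


# Derek Lowe, 2008, with the inputs quoted below
lowe = runs_above_replacement(lg_rate=4.47, player_rate=3.24, innings=211)
print(round(lowe, 2))
```

The `starter` flag simply picks the 128% or 107% multiplier; everything else is identical between the two formulas.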
Easy enough... but what is this mysterious "Rate" of which I write?
Pitching Stats and Their Discontents
ERA: Earned Run Average is the traditional "rate stat" used for pitchers. It's the number of earned runs that the pitcher gives up every nine innings. So to figure out a player's runs saved above a replacement pitcher (and for the purposes of this post, we're going to be leaving out park adjustments for the most part -- although they are very important), we substitute "ERA" for rate in the formula above.
For example, let's take Derek Lowe in 2008. He pitched in the National League, where the league average ERA was 4.47. He pitched 211 innings with an ERA of 3.24. To figure out his runs (saved) above replacement, then, we simply plug the numbers into the equation above:
(4.47*1.28-3.24)*(211/9) = 53.04 runs saved above replacement (RAR hereafter). That's a bit over 5 wins above replacement -- great year for Derek Lowe, according to ERA, anyway.
However, there are numerous problems with ERA. These points have been made many times -- for a good rundown of them, see Dave Cameron's post on evaluating pitcher talent. In short, ERA doesn't really separate the pitcher's performance from the defense's performance and other factors out of the pitcher's control, and moreover the silly and antiquated distinction between earned runs and runs in general is... well, I'm saving that rant for another time.
One way of improving matters is simply to use all runs allowed, for a Run Average (RA), but that only deals with the "earned run" problem and the subjective judgments of scorers, not the larger problem of the pitcher's (general) lack of control over balls in play.
FIP: The legendary Voros McCracken may or may not have come up with the notion that pitchers only really have control over home runs, strikeouts, and walks, but it is certainly associated with him. McCracken and others came up with various versions of Defense Independent Pitching Statistics, but in the spirit of keeping things simple, I like the FIP (Fielding Independent Pitching) formula invented by, you guessed it, Tom Tango. You can find it at THT or Fangraphs, but for you do-it-yourself types, here is the formula:
((HR*13+(BB+HBP-IBB)*3-K*2)/IP) + 3.2
The "3.2" can be used as a decent generic substitute, but ideally (as is done at both Fangraphs and THT, I believe) it is changed each year so that lgERA = lgFIP. (Thus, you could also scale FIP to the league RA if you prefer using RA to ERA.) It's easy enough if you remember basic algebra (so it was difficult for me), but for the interested, I used my feeble MySQL skills to generate not only lgERA and lgRA, but also the FIP scale for each of the major leagues from 1871 through 2008 on this spreadsheet.
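For the do-it-yourself types, here is the same formula as a small Python sketch, plus the "basic algebra" for the yearly constant: compute FIP without the constant from league totals, then subtract that from the league ERA (or league RA, for an RA-scaled FIP). The helper names are mine, and the pitcher line below is made up just to exercise the code. One gotcha: innings need to be decimal, so 200⅓ innings is 200.333..., not the box-score "200.1".

```python
def _fip_core(hr: int, bb: int, hbp: int, ibb: int, k: int, ip: float) -> float:
    """The part of FIP before the constant is added."""
    return (hr * 13 + (bb + hbp - ibb) * 3 - k * 2) / ip


def fip(hr: int, bb: int, hbp: int, ibb: int, k: int, ip: float,
        constant: float = 3.2) -> float:
    """FIP for one pitcher; ip must be decimal innings (200 1/3 -> 200.333...)."""
    return _fip_core(hr, bb, hbp, ibb, k, ip) + constant


def fip_constant(lg_rate: float, hr: int, bb: int, hbp: int, ibb: int,
                 k: int, ip: float) -> float:
    """Constant that makes league FIP equal lg_rate, given league totals.

    Pass league ERA for an ERA-scaled FIP, or league RA for an RA-scaled one.
    """
    return lg_rate - _fip_core(hr, bb, hbp, ibb, k, ip)


# Hypothetical pitcher line, just to exercise the formula
print(round(fip(hr=20, bb=50, hbp=5, ibb=5, k=150, ip=200), 2))  # 3.75
```

Feeding the league's own totals back through `fip` with the constant from `fip_constant` returns the league rate exactly, which is the whole point of the scaling.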
[Note that the Hardball Times also carries a stat called xFIP, or eXpected Fielding Independent Pitching, which normalizes the HR/FB rate. This is more of a "true talent" stat for each year, showing to what level the pitcher might be expected to regress. It is still classified as an "experimental" stat and might be better used to neutralize luck during a season than to evaluate performance retrospectively.]
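The post doesn't spell out the xFIP formula, so treat this as a sketch under an assumption: my understanding is that xFIP swaps the pitcher's actual home runs for his fly balls allowed times a league-wide HR/FB rate, and otherwise keeps the FIP formula above. The league HR/FB rate is an input you'd supply; the 10% in the example is made up for illustration, not a real league figure.

```python
def xfip(fly_balls: int, lg_hr_per_fb: float, bb: int, hbp: int, ibb: int,
         k: int, ip: float, constant: float = 3.2) -> float:
    """FIP with the pitcher's actual home runs replaced by an expected count.

    ASSUMPTION: expected HR = fly balls allowed * league HR/FB rate. The
    league rate is an input you supply; no particular value is assumed here.
    """
    expected_hr = fly_balls * lg_hr_per_fb
    return (expected_hr * 13 + (bb + hbp - ibb) * 3 - k * 2) / ip + constant


# Hypothetical pitcher who allowed 200 fly balls, with a made-up 10% league HR/FB
print(round(xfip(fly_balls=200, lg_hr_per_fb=0.10, bb=50, hbp=5, ibb=5,
                 k=150, ip=200), 2))
```

That's why it reads as a "luck-neutralized" number: a pitcher whose fly balls happened to leave the park unusually often (or rarely) gets pulled back toward the league rate.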
FIP is a good solution, and if scaled properly, works like ERA and can plug right into our RAR equation. Let's take A. J. Burnett, ex-Blue Jay and Marlin, and recipient of a massive new Yankees contract. The AL lgERA this season was 4.36. Using the formula above, his ERA-RAR is 37.25 -- about 3.7 wins. But if we plug his FIP into the equation, we get 50.46 FIP-RAR -- about 5 wins. (We could also do this same exercise with RA-RAR and FIP-RAR on an RA scale.)
I like FIP and use it a lot. There are some problems, however. The main one is simply that while ERA (and RA) have the problem of not separating out the pitcher from the defense, FIP and similar stats don't take into account that, indeed, some pitchers do seem to be better than others at giving up "fieldable" balls in play. Ideally, if FIP represents a pitcher's "true talent," then over time his ERA should regress to his FIP. But for some pitchers this doesn't seem to be the case. The puzzling case of Javier Vazquez comes to mind.
One compromise is to average ERA and FIP and/or the related RARs. This might give a better general idea of a pitcher's ability, but then park adjustments might be a bit of a problem, and the whole point in the beginning was to separate them...
Despite all these problems, I do (and will probably continue to) use ERA, RA, and their associated FIP versions, simply because people are familiar with that scale and because I can easily generate RARs and other stuff from my database.
Here is another spreadsheet based on some queries I ran with the Baseball Databank. It lists the top 50 pitchers of 2008 in the AL and NL, respectively, based on the average of ERA-RAR and FIP-RAR, which I have cheesily called "FE2RAR" (for (FIP+ERA)/2, yeah, I know). Keep in mind this is not park adjusted, but it does give an idea of how valuable most pitchers were. For wins above replacement, dividing by 10 pretty much does the trick this season. I'm not suggesting this as the best ranking of pitchers this season, just an example of how one would go about doing it.
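The FE2RAR ranking is just an average and a sort. Here's a minimal Python sketch using three 2008 Royals starters' ERA-RAR and FIP-RAR figures from the table further down; the function names are illustrative, and the divide-by-10 wins conversion is the rule of thumb from above.

```python
from typing import List, Tuple

Pitcher = Tuple[str, float, float]  # (name, ERA-RAR, FIP-RAR)


def fe2rar(era_rar: float, fip_rar: float) -> float:
    """The post's "FE2RAR": the plain average of ERA-RAR and FIP-RAR."""
    return (era_rar + fip_rar) / 2.0


def rank_by_fe2rar(pitchers: List[Pitcher]) -> List[Pitcher]:
    """Sort pitchers by FE2RAR, most valuable first."""
    return sorted(pitchers, key=lambda p: fe2rar(p[1], p[2]), reverse=True)


# ERA-RAR and FIP-RAR for three 2008 Royals starters, from the table below
starters = [("G. Meche", 37.41, 46.06),
            ("B. Bannister", -3.64, 11.18),
            ("Z. Greinke", 47.45, 45.43)]
for name, era_rar, fip_rar in rank_by_fe2rar(starters):
    value = fe2rar(era_rar, fip_rar)
    print(f"{name}: {value:.2f} FE2RAR, about {value / 10:.1f} WAR")
```

Like the spreadsheet, this isn't park adjusted; it only shows the mechanics.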
tRA: Don't worry, folks, we're near the end. While there are a number of other very cool pitching stats out there that I want to explore in more detail (Colin Wyers' use of BaseRuns is one I want to explore here in the future), tRA, developed by Graham MacAree, is currently my favorite pitching statistic, and is currently available only at Stat Corner. I will admit that I'm just getting a basic handle on how it works. For a simple explanation without the math, see this; for the more complete explanation, see this.
In short (and hopefully I won't screw this up), tRA comes out of the same general idea as FIP: that ERA and RA are closer to team run-prevention stats than pitcher run-prevention stats, and thus a pitching stat should account only for what a pitcher can control. However, whereas FIP deals only with home runs, strikeouts, and walks (and HBPs, as done above), tRA goes beyond that by using play-by-play data to deal with all sorts of things that a pitcher can control (and thus to overcome the "BABIP blindness" (my silly term) of FIP). So tRA looks at a pitcher's strikeouts swinging, strikeouts looking, walks, hit batters, ground balls, line drives, popups, fly balls, and home runs. Then the math comes in, estimating the run and out expectancies of each of these events (hence xIP instead of IP) to generate (in principle) something closer to the "true talent," and also the true performance, of the pitcher. All of this is park-adjusted, and is on the same scale as RA -- so when you see the higher tRA numbers, don't be surprised. This also means that when using it as a "rate" stat, the lgAVG you should use is the lgRA, not the lgERA.
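To make the "run and out expectancies" idea concrete, here is a rough Python sketch of the general structure as I understand it: multiply each event count by an expected-runs value and an expected-outs value, turn expected outs into xIP, and express expected runs per nine. The per-event values are hypothetical inputs, not tRA's real weights, and this leaves out the park adjustment and everything else Stat Corner actually does.

```python
from typing import Dict

# Event types tRA tracks, per the description above
EVENTS = ("k_swinging", "k_looking", "bb", "hbp", "ground_ball",
          "line_drive", "popup", "fly_ball", "home_run")


def tra_sketch(counts: Dict[str, int],
               run_value: Dict[str, float],
               out_value: Dict[str, float]) -> float:
    """Runs per nine innings from expected runs and expected outs per event.

    run_value / out_value are HYPOTHETICAL inputs (expected runs and outs per
    event). Real tRA derives these from play-by-play data and park-adjusts;
    this sketch does neither. Missing events count as zero.
    """
    x_runs = sum(counts.get(e, 0) * run_value.get(e, 0.0) for e in EVENTS)
    x_outs = sum(counts.get(e, 0) * out_value.get(e, 0.0) for e in EVENTS)
    if x_outs <= 0:
        raise ValueError("need a positive number of expected outs")
    x_ip = x_outs / 3.0          # expected innings pitched (xIP)
    return x_runs / x_ip * 9.0   # runs per nine, i.e. the RA scale


# Toy numbers only: 18 swinging strikeouts and 2 home runs, made-up values
toy = tra_sketch(counts={"k_swinging": 18, "home_run": 2},
                 run_value={"home_run": 1.4},
                 out_value={"k_swinging": 1.0})
print(round(toy, 2))
```

Because the output counts all runs per nine, it lands on the RA scale, which is why the league baseline for tRA-RAR is lgRA rather than lgERA.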
[Stat Corner also has tRA*, which, if I understand it properly, regresses toward averages for the year in question to get at a pitcher's "true talent," and thus is more suitable for projections. It seems to me roughly analogous to xFIP, but I may be wrong about that.]
There are problems with every stat, and the creators of tRA admit this. However, the only reasons I don't use it all of the time as my primary stat are that 1) it's only available (as far as I know) at Stat Corner right now, and there isn't an easily downloadable, updated-each-day version of it, and 2) they only have archives going back a few years, so it isn't currently usable for projects like historical comparisons. At least not for someone with my little brain.
Stat Corner does have its own year-by-year leaderboards (they also have a park-adjusted version of wOBA, and yes, they had it before Fangraphs) and a run-based stat called pitching runs above average (pRAA), but we are looking for Runs (saved) Above Replacement. Since I don't have what it takes to generate a whole league's RAR, here is a brief chart giving the ERA-RAR, FIP-RAR, and tRA-RAR for the 2008 Kansas City Royals starters, sorted by tRA-RAR. (All stats reflect pitchers only in their role as starters; no relief stats are included.)
| Player | IP | ERA | FIP | ERA-RAR | FIP-RAR | tRA | xIP | tRA-RAR |
|---|---|---|---|---|---|---|---|---|
| Z. Greinke | 202.3 | 3.47 | 3.56 | 47.45 | 45.43 | 3.74 | 208.3 | 53.27 |
| G. Meche | 210.33 | 3.98 | 3.61 | 37.41 | 46.06 | 4.06 | 212.8 | 46.85 |
| L. Hochevar | 129 | 5.51 | 4.43 | 1.01 | 16.49 | 5.05 | 130.6 | 14.39 |
| K. Davies | 113 | 4.06 | 4.22 | 19.09 | 17.09 | 4.98 | 112.6 | 13.28 |
| B. Tomko | 54 | 6.17 | 4.11 | -3.54 | 8.82 | 4.29 | 58.1 | 11.31 |
| J. Bale | 15.33 | 7.63 | 3.52 | -3.49 | 3.51 | 4.27 | 16.8 | 3.31 |
| B. Duckworth | 38 | 4.50 | 4.50 | 4.56 | 4.56 | 5.65 | 36.8 | 1.60 |
| R. Tejada | 5 | 1.80 | 4.13 | 2.10 | 0.81 | 3.6 | 4.4 | 1.19 |
| B. Bannister | 182.6 | 5.76 | 5.03 | -3.64 | 11.18 | 6.13 | 193.5 | -1.8 |
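If you want to check the chart, here's a small Python sketch that recomputes a few cells. ERA-RAR and FIP-RAR use the 4.36 AL lgERA from the Burnett example, with IP as the playing time. The post doesn't state the AL lgRA behind the tRA-RAR column, so the sketch back-solves it from Meche's row (it comes out to roughly 4.72); treat that as a derived number, not a published one. Note that tRA-RAR uses xIP, not IP, for playing time. Function names are illustrative.

```python
def rar(lg_rate: float, rate: float, innings: float, multiplier: float = 1.28) -> float:
    """Runs above replacement for a starter (rate stats on a runs-per-nine scale)."""
    return (lg_rate * multiplier - rate) * innings / 9.0


def implied_league_rate(rar_value: float, rate: float, innings: float,
                        multiplier: float = 1.28) -> float:
    """Back-solve the league rate that produces a given RAR."""
    return (rar_value / (innings / 9.0) + rate) / multiplier


AL_ERA_2008 = 4.36  # stated in the post
# Derived, not stated: the league RA implied by Meche's tRA-RAR of 46.85
al_ra = implied_league_rate(46.85, rate=4.06, innings=212.8)
print(round(al_ra, 2))  # 4.72

# Greinke's row: ERA-RAR and FIP-RAR use IP, tRA-RAR uses xIP
print(round(rar(AL_ERA_2008, 3.47, 202.3), 2))
print(round(rar(AL_ERA_2008, 3.56, 202.3), 2))
print(round(rar(al_ra, 3.74, 208.3), 2))
```

Small differences in the last digit against the chart come from rounding in the inputs.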
Not much to add to that other than 1) yeah, Zack Greinke has arrived; and 2) wow, different stats give really different results. So what do you think?
In summary: for now, I'll use tRA whenever possible for smaller projects, but for larger projects and historical comparisons I'll use some combination of ERA and FIP for starters, and probably just FIP for relievers, since fewer innings means a smaller sample size, and thus more chance for ERA to skew a view of their true talent. I reserve the right to change my mind and be generally inconsistent.
[Note: since I did the initial tRA-RAR number crunching that ended up in this post, the amazing Sky Kalkman of Beyond the Box Score did a post that included (at least at one point) a downloadable spreadsheet with the tRA-RAR for all MLB 2008 starters.]
Now, I haven't gotten to quite everything about pitching (and never will), but I'll save the matter of valuing relievers for another time, as well as more specifics on conversion from runs to wins...
[Thanks to Sky Kalkman for helping me get into pitching RAR (I'll try to ask fewer dumb questions from now on, Sky). Also thanks to Royals Review user Gopherballs for turning me on to Stat Corner. Yes, much of this is Gopherballs' fault -- blame him for many of my sabermetric pretensions.]
Comments
A+ Introduction.
Webmaster of Driveline Mechanics
http://www.drivelinemechanics.com - An Unconventional Look at Scouting
by Kyle Boddy on
Jan 13, 2009 2:26 AM PST
Good coverage
If you are going to use tRA, I would suggest throwing tRA+ in there as well. With FIP, since it is scaled to ERA, you can pretty much tell how good an FIP is at a glance (4.50 is average, 4.00 is above average, 5.00 is below average, etc.). But since tRA isn’t scaled to something like ERA, the quality of Wang’s 4.89 tRA doesn’t jump out at you. It turns out that was an average 100 tRA+. So I’d use something like 4.89 tRA (100).
That reminds me of something. Have you run into any articles or studies dealing with pitchers having persistent differences between their FIP or tRA and their ERA? This comes up a lot with guys like Javier Vazquez. Sabermetric guys tend to really like him. His peripherals are usually very good, making for good FIP and tRA. But his ERA is often quite a bit higher. As you rightly pointed out, FIP and tRA are attempting to isolate the pitcher’s performance. But some claim that there are some (perhaps many) pitchers who consistently give up more runs than you’d expect from their peripherals. I’d like to see some study on whether this is really true, the degree to which this is true, and who these pitchers are. Is Javier Vazquez really a #1 SP who has been screwed by his defense, home park and bad luck, as his tRA would suggest? Or is there something about his pitching that leads to good peripherals and poor run prevention?
The immoderate moderator
by NYRoyal on
Jan 13, 2009 12:15 PM PST
thanks
On tRA+: good point in general, but that would apply just to the “rate” part. By comparing with replacement level (which I always base on league average), the comparison with a baseline (as with the “100” in tRA+ and other such stats) is implicit. But for getting a sense of the value of a player’s season, it’s important to account for his playing time; that’s why one multiplies by innings pitched. So for value per se, RAR of whatever sort gives a sense of how good the pitcher was. I would guess the usual WAR benchmarks (at about 10 runs per win, currently) apply to pitchers as they do to hitters:
0 replacement level
1 crappy player
2 around average
3 above average
4 very good
5 great
6 best players in the league
Bringing you more-or-less replacement level analysis and commentary since sometime in 2008.
by devil_fingers on
Jan 13, 2009 7:09 PM PST
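Those benchmarks can be applied mechanically to any of the RAR numbers in the post. A minimal Python sketch, assuming (my reading, not stated above) that each number is the floor of its band, so 3.0 to 3.99 WAR is "above average", and using the rough 10 runs per win; the names are hypothetical.

```python
# Full-season WAR benchmarks from the list above; ASSUMPTION: each number
# is the lower bound of its band.
TIERS = [
    (6, "best players in the league"),
    (5, "great"),
    (4, "very good"),
    (3, "above average"),
    (2, "around average"),
    (1, "crappy player"),
    (0, "replacement level"),
]


def war_from_rar(rar: float, runs_per_win: float = 10.0) -> float:
    """Convert runs above replacement to wins above replacement."""
    return rar / runs_per_win


def tier(war: float) -> str:
    """Label a full-season WAR figure using the benchmarks above."""
    for threshold, label in TIERS:
        if war >= threshold:
            return label
    return "below replacement"


print(tier(war_from_rar(53.27)))  # Greinke's 2008 tRA-RAR from the post
```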
oh
and the FIP/ERA/tRA gap is something I hope to read more on in the future. Vazquez is a classic problem case, of course. I’m thinking of Josh Kalk’s THT article analyzing Javy from the perspective of Pitch f/x.
by devil_fingers on
Jan 13, 2009 7:11 PM PST
I think...
…Tango said at FanGraphs that FIP/tRA had like a .75 correlation. Pretty good.
tRA is scaled to RA, FIP to ERA.
tRA is on a R/9 scale
One letter makes a big difference, I guess. :o
by kensai on
Jan 14, 2009 4:15 AM PST
I understand the scaling
and that would account for, say, an ERA-tRA gap. I think NYRoyal is talking more about the discrepancies between fielding-independent stats like FIP and tRA and “traditional” stats like ERA and RA for pitchers like Vazquez.
by devil_fingers on
Jan 14, 2009 8:09 AM PST
Exactly
I’ve read many fans saying that it doesn’t make any difference how good Vazquez’s defense independent stats are because he consistently underperforms them. He always gives up more runs and earned runs than you’d expect from a guy with his FIP and tRA. Many of these same fans will tell you that a guy like Mark Buehrle consistently overperforms his FIP and tRA. I haven’t done any studies to see how true this is (if at all). But it would be interesting to see if this is true of some pitchers, who they are, and the degree to which it is true.
by NYRoyal on
Jan 15, 2009 12:44 PM PST
Buehrle's FIP is actually decent
Fangraphs added pitcher values and a bunch of posts about it over the weekend (I’m glad, but it sort of makes my post above redundant… I’m sure it was done to screw me!!!111).
I would think by dealing with the run/out expectancies of groundballs, line drives, etc., that in principle tRA is supposed to do better.
It will be interesting to see how Vazquez does in Atlanta. I really don’t know. I would guess better, and not just because of the NL. I think the Sox have a small park, and also during his time have had a dreadful OF defense.
by devil_fingers on
Jan 15, 2009 3:02 PM PST
Intentional walks
by NYRoyal on
Jan 13, 2009 4:22 PM PST
Wow a whole new light
If I’m getting this right, Tom Seaver’s FIP was 4.3, compared to his 2.20 ERA in 1968?
by sdhman11 on
Jan 13, 2009 7:24 PM PST
Not quite
For some reason, Fangraphs and I have different numbers. But in any case, I have 2.47, they have 2.22… Maybe they don’t use HBPs, or do use IBBs for something.
I’m curious, how are you getting 4.3?
by devil_fingers on
Jan 13, 2009 8:37 PM PST
Here’s what I did. Am I doing this right?
Seaver 1968 stats:
HR 15
SO 205
BB 48
HBP 8
IBB 5
IP 277.7
15*13 = 195; 195+48+8-5 = 246; 246*3 = 738; 205 K*2 = 410
738-410 = 328; 328/277.7 IP = 1.18
1.18+3.2 = 4.38
by sdhman11 on
Jan 14, 2009 8:49 AM PST
hmmm..
I think you added the K*2 product rather than subtracted it…
by devil_fingers on
Jan 14, 2009 2:26 PM PST
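For anyone retracing the Seaver numbers, here is the calculation in Python, using the 1968 line exactly as posted in the comment (IP 277.7 treated as decimal innings) and the generic 3.2 constant, so it won't match Fangraphs' season-specific scaling. Running it shows where the 4.38 came from: the strikeout term was subtracted, but the HR*13 term got multiplied by 3 along with the walks. With the grouping from the post's formula, the HR term stands alone. The second function exists only to reproduce that slip.

```python
def fip(hr, bb, hbp, ibb, k, ip, constant=3.2):
    # HR*13 stands alone; only the walk-type events are multiplied by 3.
    return (hr * 13 + (bb + hbp - ibb) * 3 - k * 2) / ip + constant


def misgrouped_fip(hr, bb, hbp, ibb, k, ip, constant=3.2):
    # Reproduces the comment's arithmetic: HR*13 folded into the *3 group.
    return ((hr * 13 + bb + hbp - ibb) * 3 - k * 2) / ip + constant


seaver_1968 = dict(hr=15, bb=48, hbp=8, ibb=5, k=205, ip=277.7)
print(round(fip(**seaver_1968), 2))             # 2.98
print(round(misgrouped_fip(**seaver_1968), 2))  # 4.38
```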
supposed to be a reply to sdhman11
you can always check your work at Fangraphs player pages. Now, your number may not match up exactly (since they solve for the FIP “scaler” each year and stuff), but you’re a fair ways off… Sorry if the equation above isn’t clear.
by devil_fingers on
Jan 14, 2009 2:27 PM PST
no prob
but did you figure things out? I won’t belabor it, but I hope that I’m at least somewhat helpful to someone
by devil_fingers on
Jan 15, 2009 8:36 AM PST
Another look at FIPs
Some discussion I had with Tom Tango on the subject:
It’s close to xFIP but also takes into account ground balls.
by Jeff Zimmerman (TucsonRoyal) on
Jan 15, 2009 12:48 PM PST
thanks trqa
by devil_fingers on
Jan 15, 2009 2:58 PM PST
Sounds like a slimmed down tRA to me
I found this element to be a somewhat limiting shortcut:
You can assume, if you like, that the line drive and popup rates are fairly random across all pitchers, and remove those from the equation, as are HBP.
If the metric doesn’t take into account LD and PO rates, I think it is missing something important. In short, I think Tango’s metric here is, as he says, better than FIP because it includes more information. By that same reasoning, it isn’t as good as tRA because it includes less information.
by NYRoyal on
Jan 16, 2009 8:25 AM PST
Agree ... when I asked the
On it not being as good: when I asked the question, I was hoping to find a pitching stat that is predictable from one stadium to the next. I know that groundballs, flyballs, strikeouts and walks are consistent. Any idea about FD and PO?
by Jeff Zimmerman (TucsonRoyal) on
Jan 16, 2009 9:53 AM PST
I assume you mean LD and PO
I think POs are consistent park-to-park, but LDs are not. The reason why LD% varies from park to park is the topic of much speculation. Unfortunately, most of this speculation centers on the human factors involved in classifying batted balls. One: whether the ball is caught or not may affect whether it is classified as a LD, and of course park size factors into how often LDs are caught. Whether the ball is caught or not shouldn’t affect the classification, but it might for some psychological reason. Another possibility is simply differences from park to park in the person making the classification. In short, the guy(s) in Tampa may have a somewhat different standard for what they call a LD than the guys in Los Angeles. So it isn’t really about park factors; it’s about human factors, more particularly the nature of the measurers.
Thankfully, Hit f/x will eventually make all of these batted-ball classification issues moot, because we’ll know for certain how hard the ball was hit and at what trajectory. Hopefully, this will come soon.
by NYRoyal on
Jan 18, 2009 2:06 PM PST