From Davis Balestracci -- A FINAL Final Farewell to the 2015 Baseball Season

Published: Mon, 12/14/15

Hi, Folks,
This is my last technical newsletter for this year.  I will have a very brief non-technical holiday newsletter next week, then skip a cycle until 19 January. 

Given that time frame, this one is a little longer than usual because the Boston Globe article on which this and my last newsletter are based has been a gold mine for teaching so many useful basic concepts about variation.  I have had a lot of fun writing them and hope it has translated to a high entertainment factor for my readers.

For my non-U.S. readers, I hope you will be able to follow the analysis philosophy and see parallels similar to your favorite sports and newspaper articles.

Once again, for those of you who are not interested in the statistical mechanics, but want to be aware of how this type of analysis can drastically change one’s thinking, just skip to the Bottom Line conclusions when indicated.

Any italics in direct quotes are mine and if I make comments within the quote, I show that by inserting [DB:...]

Continuing from last article...

“The hallmark of bullpens is their inconsistency. The Mariners offer a compelling microcosm, having gone in the last four years, from a 3.39 ERA in 2012 to a 4.58 ERA in 2013 (rise of 1.19 runs) to a 2.59 ERA in 2014 (drop of 1.99 runs) to a 4.15 ERA (rise of 1.56 runs) in 2015.”

He might have a point, since each of these year-to-year differences was greater than the 1.1 calculated in the last newsletter.
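For anyone who wants to check the arithmetic, here is a minimal Python sketch of those year-to-year differences. The ERA values and the 1.1 threshold come from the text; the function and variable names are just illustrative choices.

```python
# Seattle bullpen ERA by season, as quoted in the article.
seattle_era = {2012: 3.39, 2013: 4.58, 2014: 2.59, 2015: 4.15}

# R max from the previous newsletter (based on 2014 / 2015 data only).
PREVIOUS_R_MAX = 1.1


def year_to_year_changes(era_by_year):
    """Return {(year, next_year): signed ERA change}, rounded to 2 decimals."""
    years = sorted(era_by_year)
    return {
        (a, b): round(era_by_year[b] - era_by_year[a], 2)
        for a, b in zip(years, years[1:])
    }


changes = year_to_year_changes(seattle_era)
for (a, b), change in changes.items():
    verdict = "exceeds" if abs(change) > PREVIOUS_R_MAX else "is within"
    print(f"{a} to {b}: {change:+.2f} {verdict} {PREVIOUS_R_MAX}")
```

All three Seattle differences come out larger in size than 1.1, which is why the writer's point looks plausible at first glance.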

He then went on a “fishing expedition” on the last 10 years of data (was his choice of 10 years arbitrary?). For those of you who are old enough, do you remember the “Excedrin headache # [x]” commercials?  I think reading the following qualifies:  a “masterful” job of explaining probable common cause as special, while at times calling it common, and then drawing a special cause conclusion (see what I mean?):
  • “While [the Mariners’ inconsistency] is an extreme snapshot, it’s far from isolated. Over the last 10 years [DB: 30 teams x 9 year-to-year differences = 270 ranges], there are 30 instances of teams whose relief ERAs changed by at least one run [DB:  ERA: lower = better] – with 16 of them representing improvements by at least a run and 14 representing declines of at least one run [DB:  half went up, half went down.  Sounds average to me – as well as common cause (< 1.1)]. On average, teams saw their bullpen ERA change by 0.52 runs on a season-to-season basis over the last 10 years – meaning that a ‘normal’ ERA adjustment [from 4.24 to 3.71] could give the Sox at least an average bullpen [DB: Huh?], and with the possibility that it would be far from an outlier for the team to improve by, say, a full run, which would in turn suggest a bullpen that had gone from a weakness to a strength."  [DB: I’m reaching for the Excedrin]
Actually, Seattle is isolated -- as you will see, it was the only bullpen with more than one special cause.

“...at least an average bullpen”:  The Red Sox already have an average bullpen.

“...far from an outlier”:  He’s right.  Changing by a run would be common cause, i.e., not an outlier.  But look at his conclusion:  he implies that nothing could in essence change, yet it would now be a strength (special cause conclusion)? 

If he can fish, I can fish – but more carefully and statistically.  I didn’t want to rely solely on the differences between 2014 and 2015.  Since he initially looked at the ERA data for 2012 to 2015, I decided to start there.

Optional technicalities.  I did a quick and dirty 2-way analysis of variance (ANOVA) to see whether there were any differences by year and/or league, and there weren’t.   

Bottom line.   I can look at the three year-to-year differences for each team (2012 to 2013, 2013 to 2014, 2014 to 2015), which gives me 90 ranges to work with.

Taking the average of the 90 ranges of two, R avg ~ 0.50.  Note how close this 4-year average is to his 10-year average of 0.52!  (Consistent inconsistency?)

What he didn’t realize was that this average range by itself isn't very useful.  It needs to be converted to the maximum difference between two consecutive years that is due just to common cause:  R max = 3.268 (from theory -- for use with the average of ranges of two) x 0.50 ~ 1.6.  Two ranges were higher than that:  Seattle 2013 to 2014 (-1.99) and Oakland 2014 to 2015 (+1.72). These need to be taken into consideration to get a more accurate answer.
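The conversion is a single multiplication. A minimal Python sketch, using the 3.268 factor, the 0.50 average range, and the two differences quoted above (the names are illustrative only):

```python
# Factor from control-chart theory for ranges of two consecutive values
# (the 3.268 quoted in the text).
AVERAGE_RANGE_FACTOR = 3.268


def r_max_from_average(average_range):
    """Largest range between two consecutive years attributable to common cause."""
    return AVERAGE_RANGE_FACTOR * average_range


r_max = r_max_from_average(0.50)  # average of the 90 ranges, from the text

# The two signed differences singled out in the text.
candidates = {"Seattle 2013 to 2014": -1.99, "Oakland 2014 to 2015": 1.72}
flagged = [name for name, diff in candidates.items() if abs(diff) > r_max]

print(f"R max = {r_max:.2f}")
print("Beyond common cause:", flagged)
```

Only the size of a difference matters for the comparison, so the sign is dropped with abs() before checking it against R max.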

[For those of you not interested in the ensuing -- and what could be perceived at times as gory -- details, skip to the Bottom Line below]

Optional technicalities:  It is standard practice to begin a process of omitting special cause ranges and recalculating until all of the remaining ranges are within common cause.

            2012    2013    2014            2015
Seattle     3.39    4.58    2.59 (-1.99)    4.15
Oakland     2.94    3.22    2.91            4.63 (+1.72)

1. Eliminating these two, R avg now equals 0.465 and R max = 1.52, which then flags:

Seattle     3.39    4.58    2.59 (-1.99)    4.15 (+1.56)
Houston     4.46    4.92    4.80            3.27 (-1.53)

Note the similar pattern to last newsletter when I used just the 2014 / 2015 data:  Oakland, Seattle, and Houston get flagged on their 2014 / 2015 difference.

2. Eliminating these, R avg now equals 0.4398 and R max = 1.44

Milwaukee   4.66    3.19 (-1.47)    3.62    3.40

3. Eliminating Milwaukee, R avg now equals 0.4276 and R max = 1.40

Largest remaining:  
Atlanta     2.76    2.46    3.31    4.69 (+1.38)

According to this analysis, 1.38 is not a special cause;  but a deeper subsequent confirmatory analysis using ANOVA left little doubt that 4.69 was a special cause (just like last newsletter).

So, omitting this range, I get a final R max of 1.36.
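The eliminate-and-recalculate loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ranges below are made up (they are NOT the actual 90 values), the function name is hypothetical, and the sketch covers only the mechanical screening, not the judgment call informed by ANOVA that removed Atlanta.

```python
from statistics import mean

AVERAGE_RANGE_FACTOR = 3.268  # from theory, for ranges of two consecutive values


def screen_ranges(ranges):
    """Drop ranges above R max and recompute until none are flagged.

    Returns (kept_ranges, removed_ranges, final_r_max).
    """
    kept = list(ranges)
    removed = []
    while True:
        r_max = AVERAGE_RANGE_FACTOR * mean(kept)
        outliers = [r for r in kept if r > r_max]
        if not outliers:
            return kept, removed, r_max
        removed.extend(outliers)
        kept = [r for r in kept if r <= r_max]


# Made-up absolute ranges, for illustration only.
example_ranges = [0.1, 0.2, 0.3, 0.4, 0.5, 0.3, 0.2, 0.6, 0.4, 1.6, 2.0]

kept, removed, final_r_max = screen_ranges(example_ranges)
print("Removed:", removed)
print(f"Final R max = {final_r_max:.2f} from {len(kept)} ranges")
```

With these made-up numbers, the 2.0 is flagged on the first pass and the 1.6 only on the second, once the inflated average has come down. That is the same pattern as Seattle's +1.56 and Houston's -1.53 surfacing only after the first two ranges were removed.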

Two anomalies of last newsletter:
  • Given the 1.36, San Diego’s 2014 to 2015 difference of 1.29 seems to have been common cause.
  • The previous R max of 1.1 based on 2014 / 2015 was probably low and quite variable in its estimate due to the use of only 25 ranges.  This 2012 to 2015 analysis ends up using 84 ranges, which makes it more reliable and accurate.
-- Neat trick to avoid all this eliminating and recalculating.   One can alternatively use the median range of the original 90 differences at the outset as a very good initial estimate of what constitutes an outlier.  In this case, R med = 0.375 and, from this, R max =  0.375 x 3.865 (from theory -- for use with the median of ranges of two)  ~ 1.43, which is very close to the final answer using the average range with successive eliminations.  This is oftentimes “one stop shopping.”  Applied to the final data with the six outliers eliminated, the median range matched the R avg result.

-- A box plot analysis of the original 90 actual differences indicates that any range greater than ~ 1.5 is a special cause.
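Both shortcuts are easy to try. The Python sketch below rests on several assumptions: the differences are made up (NOT the actual 90 values), the box plot rule is assumed to be the usual Tukey fence of 1.5 x IQR beyond the quartiles (the text does not say which rule was used), and quartile values vary slightly with the calculation method (this uses the default of statistics.quantiles).

```python
from statistics import median, quantiles

MEDIAN_RANGE_FACTOR = 3.865  # from theory, for ranges of two consecutive values

# Made-up signed year-to-year differences, for illustration only.
example_diffs = [-2.0, -0.6, -0.4, -0.3, -0.2, -0.1,
                 0.1, 0.2, 0.3, 0.4, 0.5, 1.6]


def r_max_from_median(diffs):
    """Median-range rule: R max = 3.865 x median of the absolute differences."""
    return MEDIAN_RANGE_FACTOR * median(abs(d) for d in diffs)


def boxplot_fences(diffs, k=1.5):
    """Tukey fences (Q1 - k*IQR, Q3 + k*IQR); k = 1.5 is an assumption."""
    q1, _, q3 = quantiles(diffs, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


r_max_median = r_max_from_median(example_diffs)
median_flagged = [d for d in example_diffs if abs(d) > r_max_median]

lower, upper = boxplot_fences(example_diffs)
boxplot_flagged = [d for d in example_diffs if d < lower or d > upper]

print(f"Median rule: R max = {r_max_median:.2f}, flagged {median_flagged}")
print(f"Box plot: fences {lower:.2f} to {upper:.2f}, flagged {boxplot_flagged}")
```

On this made-up data both rules flag the same two differences in a single pass, with no eliminating and recalculating.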

Bottom line.   My preferred approach of using several analyses simultaneously to seek convergence was successful:  three different simple approaches (with some slight help from ANOVA) yield a very similar conclusion:

Two consecutive years’ ERA can have a difference of ~1.4 due just to common cause.

“The Sox’ biggest one-year improvement of the last decade came between 2006 and 2007, when the relief ERA dropped by 1.41 runs en route to a championship primarily thanks to a) lightning in a bottle…; b) a breakthrough… in middle relief; and c) drastic defensive improvement that permitted a bullpen group with modest strikeout numbers to record outs.”

Given 10 years, isn’t one of the differences going to be the largest?  This is what is called “cherry picking,” but we can test it.  That difference of 1.41 is a borderline special cause and needs to be examined more closely.
[For those of you not interested in details, skip to the Bottom Line below]

Optional technicalities.  I did an I-chart of the Red Sox bullpen ERA from 2000 to 2015: 

Looking at this graph, I wondered whether 1.41 indicated a distinct shift in overall bullpen performance, i.e., the possibility that the bullpens of 2007 to 2015 have been consistently better than those of 2000 to 2006 (due to new coaching staff, more consistent philosophy or approach?  Was this around the time where bullpen philosophy began to tilt more towards "one inning (or even one batter) specialists"?).  

Using a simple t-test, I got the surprising (to me) p-value of 0.012 (only about a 1% risk of seeing a difference this large if nothing had really changed).

Using this along with the standard deviation estimate from all the data in the previous analysis (~0.37) (same scale as chart above):

It also seems to confirm that the standard deviation estimate of ~0.37 is reasonable (and hence R max of 1.36 or ~1.4).
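The link between a standard deviation of ~0.37 and an R max of ~1.36 can be checked with the standard constants for ranges of two. A minimal Python sketch: the 1.128 constant (d2 for ranges of two) is standard control-chart theory rather than something stated in this newsletter, and this is one common route to such an estimate, not necessarily the exact one used here.

```python
AVERAGE_RANGE_FACTOR = 3.268  # converts the average range of two into R max
D2_FOR_RANGES_OF_TWO = 1.128  # converts the average range of two into sigma


def sigma_from_r_max(r_max):
    """Back out the standard deviation implied by a given R max."""
    average_range = r_max / AVERAGE_RANGE_FACTOR
    return average_range / D2_FOR_RANGES_OF_TWO


sigma = sigma_from_r_max(1.36)  # final R max from the analysis above
print(f"Implied standard deviation = {sigma:.2f}")
```

An R max of 1.36 implies an average range of about 0.42 and a standard deviation of about 0.37, so the two figures agree.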

Another angle.  I was curious and made an assumption that a u-chart ANOM could be used to compare Boston's 2006 and 2007 ERAs.  Considering “runs” as somewhat discrete random events and “innings” as the window of opportunity:
Based on the u-chart being an appropriate analysis, the 2007 bullpen ERA does seem to be significantly lower than 2006 with a risk of less than 1% (because the results are outside the second set of red lines at 1.8 SL).

Bottom line.  The 1.41 drop seems to be a special cause – but for the reasons he cites? 

Aren’t (a) and (b) random luck?

Regarding (c), his alleged “drastic defensive improvement” from 2006 to 2007:  the nature of fielding percentage lends itself beautifully to a p-chart ANOM.  Here is 2006:

The teams are listed in descending order (and the team order will be different for 2007).  Look who’s got the highest fielding percentage as a true special cause:  Boston (0.989)! 

Let's take a look at the 2007 data as an ANOM:
Boston is #3 (0.986) on the horizontal axis – and average.
To review a key point:  Statistically (based on this data only), there is no difference between teams #2 through #29.
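For readers curious about the mechanics, here is a rough Python sketch of the idea behind a p-chart ANOM for fielding percentage. Everything in it is an assumption for illustration: the team names and counts are made up (NOT the 2006 or 2007 data), and the limits use a simple 3 standard deviation band around the overall average rather than exact ANOM critical values.

```python
from math import sqrt

# Hypothetical (errors, total chances) per team, made up for illustration.
teams = {
    "Team A": (60, 6000),
    "Team B": (90, 6000),
    "Team C": (96, 6000),
    "Team D": (100, 6000),
    "Team E": (104, 6000),
    "Team F": (108, 6000),
}


def fielding_anom(data, z=3.0):
    """Compare each team's fielding percentage with the overall average.

    Returns (overall_pct, {team: verdict}), where the verdict is "above",
    "below", or "average" relative to limits at z binomial standard deviations.
    """
    total_errors = sum(errors for errors, _ in data.values())
    total_chances = sum(chances for _, chances in data.values())
    overall = 1 - total_errors / total_chances
    verdicts = {}
    for team, (errors, chances) in data.items():
        pct = 1 - errors / chances
        half_width = z * sqrt(overall * (1 - overall) / chances)
        if pct > overall + half_width:
            verdicts[team] = "above"
        elif pct < overall - half_width:
            verdicts[team] = "below"
        else:
            verdicts[team] = "average"
    return overall, verdicts


overall_pct, verdicts = fielding_anom(teams)
print(f"Overall fielding percentage = {overall_pct:.4f}")
for team, verdict in verdicts.items():
    print(team, verdict)
```

In this made-up example only Team A falls outside the limits. The other five differ in their raw percentages but are statistically indistinguishable from the overall average, which is the same reading as "no difference between teams #2 through #29."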

What planet was he on to conclude (c)?

“The lesson?  Bullpen improvement can happen even without adding a single ‘name’ relief arm. That said, there’s a considerable amount of luck involved in getting the sort of performances from unheralded relievers that allow a bullpen transformation.”

To paraphrase statistically:  Alleged “improvement” due to common cause = luck (and it is!).  Everything that could possibly go right goes right.  Sort of like those rare days you get all the green lights going to work.  Try to reproduce it?  You can’t!  You know it’s going to happen again, but when?  You don’t know!

Common cause "lightning in a bottle":  Given what amounts to usually 10 or so good teams of relatively equal ability, it’s going to happen to someone for an entire season -- or, increasingly, even to some mediocre teams during the playoffs (wildcards) -- but you can’t say just who…until the end.  But what do people do then?  They try to explain it as special cause with opportunistic data torturing!

Final thoughts from the article:
  • Quoting the Red Sox general manager:  “What you really try to do is . . . project some people’s performance taking a step forward, through scouting and analytics, and try to go that way…[T]here’s so much inconsistency in bullpen performances throughout the years [DB:  no kidding!]. So the good arm just doesn’t settle, because you can have a good arm and still get hit… I think sometimes you have to look at the year before.” (DB:  Given two different numbers, one will be larger)
Perhaps a plot of an individual’s performance over more than one year might be better for prediction, especially to predict a special cause drop-off in performance?
  • “…[E]ven in an area where dramatic improvement is possible, the path to achieve it is, for now, obscure.”  (Especially when people keep treating common cause as special cause through opportunistic data torturing)
So why not use some statistical thinking applications to find true special causes that help focus and motivate better questions for prediction?

Common cause inconsistency is, predictably, consistently inconsistent.  And people may not like its level, but it is what it is!  Perhaps in the end, winning the World Series is somewhat of a lottery.

One final thought – a “What if…?” to ponder
What if a green or black belt certification exam consisted of simply passing out this or a similar article with the only instructions being, “Apply any statistics you have learned to statements made in this article”?

I’ve said it before and I will say it again: there is no “app” for critical thinking!

I will send out a brief holiday edition newsletter next week.

Kind regards,

P.S. A unique feature of Data Sanity is the 10 examples in Chapter 2 designed specifically for executives using everyday business scenarios similar to this.  Use them to create dialogue.
As always, I welcome contact from my readers with comments or to answer any questions.
