From Davis Balestracci -- Why "3" Standard Deviations?

Published: Mon, 04/12/10


The "2" vs. "3" Standard Deviation Conundrum

[NOTE:  There is a Word document attachment to this e-mail containing two graphs]
 
[~950 words and a bit technical today.  Take 5-10 minutes to read over a break or lunch]
 
To read this as a web page: http://www.aweber.com/b/g1L1
 
Hi, Folks,
In my 15 March newsletter [If you need to refresh your memory:  http://www.aweber.com/b/hTAP (analysis graph reattached)], some of you no doubt noticed that the
p = 0.05 and p = 0.01 limits of my summed rank scores analysis were approximately 3.1 and 3.5 standard deviations, respectively. You might be wondering how in the world I obtained them.  And some of you might even be wondering, "Why don't you use just '2' standard deviations like we were taught in our college courses?"  By the way, are you asked that...a LOT?
 
The "short answer": You are indeed taught to use "2" standard deviations to declare statistical significance; however, did you ever notice that you were making only ONE decision?  With control charts and analysis of means charts, you are making multiple simultaneous decisions, which necessitates, as a rule of thumb, using "3" standard deviations as an outlier criterion (As many of you know, this is one of THE sore points in explaining this "funny new" SPC way of doing things).  Let's hope the following helps a bit.
 
Generally, given the structure of many data sets, one doesn't have the luxury of calculating exact limits, and "3" standard deviations will have to suffice.  However, because it's possible with the ranking data, I'll demonstrate the calculation and address an additional important point...that "3" standard deviations as a general criterion is pretty darn good -- Hey, if it was good enough for Walter Shewhart and W. Edwards Deming, it's good enough for me!
 
Multiple, Simultaneous Decisions Are the Rule


To review the scenario, 10 sets of rankings for each of 21 counties were summed. However, because of the nature of rankings, you don't have 21 independent observations. Once 20 sums are "known," the 21st is also known by default. The overall average is always 110 (10 x 11) and isn't affected by any individual county's sum. So, statistically, one is making, in essence, only 20 (not 21) comparisons...simultaneously.  Because we're making 20 simultaneous decisions, what's the probability (p-value) needed to ensure that, if there are indeed no true outliers, the risk of creating a false signal is very low? 

Usually, one chooses overall risks of 0.05 (5% risk of declaring common cause as special cause -- the usual "standard") or 0.01 (1% risk of declaring common cause as special cause -- lower risk, but more conservative limits).  In the current case, if you were to naively use "2" standard deviations thinking your overall p = 0.05, consider this:  The probability that ALL 20 points -- IF none of them are outliers -- will behave properly and stay within two standard deviations is (0.95 to the 20th power):  (0.95**20) = 0.358.  So, the probability of AT LEAST ONE data point being declared a special cause when it isn't is:
 
[1-(0.95)**20] = 0.642
 
Are you willing to take a 64% chance of being wrong (i.e., at least one false signal)...especially if these are doctors being compared?
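 
For those who'd like to verify this arithmetic, here is a minimal sketch in Python (the variable names are mine, not part of the original analysis):
 
    # Probability of at least one false signal when making 20
    # simultaneous decisions, each at the single-decision p = 0.05 level
    p_single = 0.95      # per-decision chance a non-outlier stays in limits
    n_decisions = 20     # 21 counties, but only 20 independent comparisons

    p_all_behave = p_single ** n_decisions    # ~0.358
    p_false_signal = 1 - p_all_behave         # ~0.642

    print(f"P(no false signals)          = {p_all_behave:.3f}")
    print(f"P(at least one false signal) = {p_false_signal:.3f}")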
 
So what level of p makes [1 - p**20] = 0.05 (and 0.01)?  The answers are p = 0.997439 (p-value = 0.002561) and p = 0.999498 (p-value = 0.000502), respectively.
 
Further, because these are two-sided tests (a county can be an outlier either high or low), I need to "redistribute" the probability so that half is on each side of the limits, meaning that I need to find the t-values corresponding to p = 0.998719 (and 0.999749), with, as shown in the 15 March newsletter, [(k-1) x (T-1)] degrees of freedom (in this case, 9 x 20 = 180).  These t-values are 3.06 (and 3.54).
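 
If you'd like to reproduce these limits yourself, here is a sketch using Python's scipy package (an assumption on my part that you have it available; scipy.stats.t.ppf is its inverse t-distribution function):
 
    from scipy.stats import t   # assumes scipy is installed

    n_decisions = 20
    df = 9 * 20                 # (k-1) x (T-1) = 180 degrees of freedom

    for overall_risk in (0.05, 0.01):
        # Solve [1 - p**20] = overall_risk for the per-decision level p
        p_each = (1 - overall_risk) ** (1 / n_decisions)
        # Two-sided: put half of the per-decision risk in each tail
        p_cumulative = 1 - (1 - p_each) / 2
        print(f"overall risk {overall_risk}: "
              f"t({df}) = {t.ppf(p_cumulative, df):.2f}")

    # Prints approximately 3.06 (for 0.05) and 3.54 (for 0.01)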
 
So, given all this statistical mumbo jumbo, using "3" is pretty good, eh?  It will also be further confirmed below.

Another Handy Technique 


A five-number summary (a standard descriptive technique some of you may have encountered) can be constructed from the 21 summed rank scores:

--Minimum = 42
--First quartile (Q1) = 95.5
--Median = 107
--Third quartile (Q3) = 124.5
--Maximum = 181
 
A box-and-whisker plot (attached for this data, along with the original analysis of means graph) is a distribution-free graphic -- available in almost all standard statistical computer packages -- that takes the five-number summary one step further to calculate a criterion for detecting potential outliers.

The first and third quartiles form a "box" containing the middle 50 percent of the data. With the median marked within the box, lines are drawn from the sides of the box to the last actual data values within the inner fences (described below), i.e., the "whiskers."  Actual data values outside these fences are plotted as individual asterisks -- possible outliers.

Note also the intuitive calculation of the inner fences:

1. Find the "interquartile range," i.e., (Q3 - Q1),
     [In this case, 124.5 - 95.5 = 29]

2. Multiply (Q3-Q1) by 1.5
     [1.5 x 29 = 43.5]
 
3. Subtract this quantity from Q1
     [95.5 - 43.5 = 52]
 
4. Add this quantity to Q3
     [124.5 + 43.5 = 168]
 
5. Any number <52 or >168 is a possible outlier
     [In our data, 42 and 181]
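 
In code, the five steps above reduce to just a few lines (again a Python sketch of my own; only Q1 and Q3 are needed to compute the fences):
 
    q1, q3 = 95.5, 124.5

    iqr = q3 - q1              # Step 1: interquartile range = 29
    step = 1.5 * iqr           # Step 2: 1.5 x 29 = 43.5
    low_fence = q1 - step      # Step 3: 95.5 - 43.5 = 52
    high_fence = q3 + step     # Step 4: 124.5 + 43.5 = 168

    # Step 5: flag any summed rank score outside the fences
    print(f"Possible outliers: scores < {low_fence:g} or > {high_fence:g}")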

The spread between the inner fence limits -- (168 - 52) = 116 -- is very close to the spread between the overall p = 0.05 limits shown in the original ANOM graph (166 - 54 = 112; see the attachment).

Note:  The standard deviation of all 21 scores is 29.18.  Converting the "116" width of the inner fence to a multiple of this, it encompasses a range of approximately +/- 2 standard deviations.  Of course, this is a moot point, because the presence of the two special causes INVALIDATES (and inflates) this (typical) calculation: the outliers themselves inflate the naively calculated standard deviation, which is why +/- 2 of these "inflated" standard deviations ends up approximating correctly calculated three-standard-deviation limits...and once again demonstrates that three standard deviations -- calculated correctly -- is a very good criterion for declaring outliers.  It is this MISapplication of the "standard" calculation of standard deviation that has led to the rampant incorrect use of "2" or, as I am seeing increasingly, even "ONE"(!) standard deviation as criteria for declaring performance outliers.
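 
And a quick computational check of that conversion (a sketch using only the numbers given above):
 
    sd_naive = 29.18            # SD of all 21 scores, inflated by outliers
    fence_width = 168 - 52      # width of the inner fence = 116

    # Half-width of the fence, expressed in (inflated) standard deviations
    print(f"{(fence_width / 2) / sd_naive:.2f}")   # ~1.99, i.e., +/- 2 SD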
 
Whew...are you as exhausted reading this as I am writing it?  But, it did need to be said.  Now...on to more important things next time (and an easier read...PROMISE!).
 
Kind regards,
Davis
 
===================================================
P.S.  The "Mumbo Jumbo" is indeed in my book...

===================================================
...in Chapter 8.  But, more important, rather than teach you statistics, I try to show you how to solve your problems in this and the other 10 chapters of my book Data Sanity:  A Quantum Leap to Unprecedented Results, which can be ordered via:
 
 
You can also order it through Amazon:
 
 
(Thank you Steve Tarzynski, Dean Spitzer, and Adam Lenox for your very kind reviews)
 
Foreign subscribers are best served by contacting Marilee Aust directly at:  maust@mgma.com.  I promise you excellent service.
 
Attention UK Subscribers:  I shall be giving an all-day Data Sanity seminar on 11 May and delivering the closing keynote for the UK Deming-based Transformation 2010 Forum:  Out of the Crisis...back to the future.  Details at:
 
http://www.transformationforum.org/Annual-Forum-Main.html

=======================================================
Was This Newsletter Forwarded to You?  Would You Like to Sign Up?
=======================================================
If so, please visit my web site --
www.dbharmony.com -- and fill out the box on the home page, then click on the link in the confirmation e-mail you will immediately receive.