## From Davis Balestracci -- Why "3" Standard Deviations?

Published: Mon, 04/12/10


### The "2" vs. "3" Standard Deviation Conundrum

[NOTE: There is a Word document attachment to this e-mail containing two graphs]

[~950 words and a bit technical today. Take 5-10 minutes to read over a break or lunch]

To read this as a web page: http://www.aweber.com/b/g1L1

Hi, Folks,

In my 15 March newsletter [if you need to refresh your memory: http://www.aweber.com/b/hTAP (analysis graph reattached)], some of you no doubt noticed that the p = 0.05 and p = 0.01 limits of my summed rank scores analysis were approximately 3.1 and 3.5 standard deviations, respectively. You might be wondering how in the world I obtained them. And some of you might even be wondering, "Why don't you use just '2' standard deviations like we were taught in our college courses?" By the way, are you asked that...a LOT?

The "short answer": you are indeed taught to use "2" standard deviations to declare statistical significance; however, did you ever notice that you were making only ONE decision? With control charts and analysis of means charts, you are making multiple simultaneous decisions, which necessitates, as a rule of thumb, using "3" standard deviations as an outlier criterion. (As many of you know, this is one of THE sore points in explaining this "funny new" SPC way of doing things.) Let's hope the following helps a bit.

Generally, given the structure of many data sets, one doesn't have the luxury of calculating exact limits, and "3" standard deviations will have to suffice. However, because it IS possible with the ranking data, I'll demonstrate the calculation and address an additional important point: that "3" standard deviations as a general criterion is pretty darn good. Hey, if it was good enough for Walter Shewhart and W. Edwards Deming, it's good enough for me!
### Multiple, Simultaneous Decisions Are the Rule

To review the scenario, 10 sets of rankings for each of 21 counties were summed. However, because of the nature of rankings, you don't have 21 independent observations: once 20 sums are "known," the 21st is also known by default. The overall average is always 110 (10 x 11) and isn't affected by any individual county's sum. So, statistically, one is making, in essence, only 20 (not 21) comparisons...simultaneously.

Because we're making 20 simultaneous decisions, what's the probability (p-value) needed to ensure that, if there are indeed no true outliers, the risk of creating a false signal is very low? Usually, one chooses an overall risk of 0.05 (5% risk of declaring common cause as special cause -- the usual "standard") or 0.01 (1% risk of declaring common cause as special cause -- lower risk, but more conservative limits).

In the current case, if you were to naively use "2" standard deviations thinking your overall p = 0.05, consider this: the probability that ALL 20 points -- IF none of them are outliers -- will behave properly and stay within two standard deviations is 0.95 to the 20th power:

(0.95**20) = 0.358

So, the probability of AT LEAST ONE data point being declared a special cause when it isn't is:

[1 - (0.95**20)] = 0.642

Are you willing to take a 64% chance of being wrong (i.e., at least one false signal)...especially if these are doctors being compared?

So what level of p makes [1 - p**20] = 0.05 (and 0.01)? The answers are p = 0.997439 (p-value = 0.002561) and p = 0.999498 (p-value = 0.000502), respectively.

Further, because these are two-sided tests (a county can be an outlier either high or low), I need to "redistribute" the probability so that half is on each side of the limits, meaning that I need to find the t-values corresponding to p = 0.998719 (and 0.999749), with, as shown in the 15 March newsletter, [(k-1) x (T-1)] degrees of freedom (in this case, 9 x 20 = 180).
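For readers who like to check the arithmetic themselves, here is a minimal sketch of the calculation above; the variable names are mine, not part of the original analysis:

```python
# Familywise false-signal risk when making 20 simultaneous
# comparisons, each at the "2 standard deviation" (p = 0.95) level.

n_comparisons = 20   # 21 counties, but only 20 independent sums
p_single = 0.95      # chance one in-control point stays inside 2 SD limits

p_all_inside = p_single ** n_comparisons        # all 20 behave properly
p_false_signal = 1 - p_all_inside               # at least one false signal

print(f"P(all 20 stay inside)   = {p_all_inside:.3f}")    # ~0.358
print(f"P(>= 1 false signal)    = {p_false_signal:.3f}")  # ~0.642

# To hold the OVERALL risk at 0.05 (or 0.01), each single comparison
# needs p with p**20 = 0.95 (or 0.99), i.e. the 20th root:
p_needed_05 = 0.95 ** (1 / 20)   # ~0.997439 -> per-test p-value ~0.002561
p_needed_01 = 0.99 ** (1 / 20)   # ~0.999498 -> per-test p-value ~0.000502
print(f"per-test p-value (0.05) = {1 - p_needed_05:.6f}")
print(f"per-test p-value (0.01) = {1 - p_needed_01:.6f}")
```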
These t-values are 3.06 (and 3.54). So, given all this statistical mumbo jumbo, using "3" is pretty good, eh? It will also be further confirmed below.
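Those two t-values can be reproduced with a few lines of Python. With SciPy available you would use `scipy.stats.t.ppf(q, 180)` directly; the sketch below is standard-library-only, using the normal quantile plus the usual large-df correction t ~ z + (z**3 + z)/(4*df), which is plenty accurate at 180 degrees of freedom. The helper name is mine:

```python
# Approximate the two-sided t critical values for 20 simultaneous
# comparisons with (k-1) x (T-1) = 9 x 20 = 180 degrees of freedom.
from statistics import NormalDist

def approx_t_quantile(q: float, df: int) -> float:
    """t quantile via normal quantile plus a large-df correction."""
    z = NormalDist().inv_cdf(q)
    return z + (z**3 + z) / (4 * df)

df = 9 * 20  # 180 degrees of freedom

for overall_risk in (0.05, 0.01):
    # per-comparison p-value so that the overall risk is held:
    per_test_p = 1 - (1 - overall_risk) ** (1 / 20)
    # two-sided test: redistribute, half the probability in each tail
    q = 1 - per_test_p / 2
    print(f"overall risk {overall_risk}: t ~ {approx_t_quantile(q, df):.2f}")
    # prints ~3.06 for 0.05 and ~3.54 for 0.01
```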