From Davis Balestracci -- Guaranteed: Two More Serious Lurking Mess-ups for ANY Experiment, Designed or Not

Published: Mon, 06/06/16

A lot of food for thought here for everyone -- some things many of you may have never been taught.

The "Human Variation" Factor is Always Present.  PLAN on it!

Hi, Folks,
I hope you've found Hendrix's "ways to mess up an experiment" helpful in putting your DOE training into a much better perspective. Today, I'm going to add two common mess-ups from my consulting experience.  If you're not careful, it's all too easy to end up with data that is worthless.

Balestracci's Mess-up #1:  Underestimating the unintended ingenuity of human psychology to mess up your experiment -- and this includes the study planners! 

Trust me, there is no way you could make up the things busy people will do to (unintentionally) mess up your design and its data.

"But I just want to run a simple 2 x 2 x 2 factorial"

To refresh your memory:  Suppose you are interested in examining three components of a weight loss intervention:

  • Keeping a food diary (yes or no)

  • Increasing activity (yes or no)

  • Home visit (yes or no)

It's easy to set up the 2 x 2 x 2 factorial design matrix, and you might think the only remaining issue is how many people should participate.
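For readers who want to see just how easy the matrix itself is, here is a minimal sketch. The factor names are illustrative labels for the three components above (they are my shorthand, not part of any standard), with 0 = no and 1 = yes.

```python
from itertools import product

# Hypothetical labels for the three yes/no components of the intervention.
FACTORS = ("food_diary", "increase_activity", "home_visit")


def full_factorial(factors):
    """Return every yes/no combination as a list of dicts (0 = no, 1 = yes)."""
    return [
        dict(zip(factors, levels))
        for levels in product((0, 1), repeat=len(factors))
    ]


design = full_factorial(FACTORS)

# Print one line per treatment combination (8 in total for a 2 x 2 x 2 design).
for run, row in enumerate(design, start=1):
    print(run, row)
```

The matrix only lists the eight treatment combinations. It says nothing about definitions, data collection, or how many people to assign to each combination.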

Remember last newsletter when a client's initial request for just a design template turned into three consults totaling 2-1/2 hours?  Using this weight loss experiment scenario, let's see why.

1. Some things need to be clarified

  • "Weight loss" is pretty vague. What exactly do you mean by that (operational definition)?  Pounds?  Percent of original weight? Meeting a pre-established goal?  Something else?

    • Is a weight loss goal of any kind going to be involved?  For everyone?

    • Imagine you have the data in hand.  What are you going to do with it? Will it allow you to take the action you desire?

    • Are counting tallies of any kind going to be recorded?
      • Is the threshold between any "non-event" ("0") and "event" ("1") clear?

    • If two people were evaluating a participant, would they get the exact same number(s)?

  • What time period is going to be studied? One month? Two months? Three months? Six months? A year?

  • What exactly do you mean by "keeping a food diary," "increasing activity," "home visit"? 

    • Do you mean daily?  Weekly?  Do you want them to formally record the activity?  Does a phone call count as a "visit"? 

  • What characterizes the people you would like to study? 

    • How will you decide who is admitted into your study? 

    • How will you sample to obtain these people?

  • Are you going to make it absolutely clear to the study participants what is expected?

    • How will you know that they understand? 

    • Should assessing this understanding be an ongoing part of the home visit? 

    • How will you get this information from the people not having home visits?  Would the knowledge of a scheduled check-in phone call bias their results?

  • Do not underestimate a very serious "headache factor" that I have often encountered:  Is this study going to collect data from several sites or departments?  If so, how are you going to make sure everyone agrees on the answers to the questions above?  How would you know?

    • Would a simple designed data collection sheet (with definitions on it) help reduce this variation? (Hint: Yes)

  • Should you PLAN a brief initial study whose only objectives are to reduce the human variation in perceptions of (1) the execution of the protocol, (2) how to define and collect the data and (3) the ease of using the designed data collection sheet? (Hint: Yes!  But these data will not be used in the study itself).  How will you choose people for such a study?

    • DO the brief study

    • STUDY the results with everyone's input -- both planners and participants. What mess-ups occurred in both the study's execution and data collection?  How could they be avoided?  Should the data collection sheet be redesigned or simplified for easier recording?

    • ACT on these conclusions and begin your next PLAN -- another process / data study or begin the experiment? 
      • Should you PLAN some type of ongoing data collection assessment during the study to prevent data contamination from human variation? (probably)

2. What threshold of weight loss would result in proceeding with your investment in such a program? Or is it a matter of "met personal goal" ("1") or "did not meet personal goal" ("0")? (the latter measure of "percent of patients who met goal" is not recommended -- see below)

3. How badly do you want to detect this difference?

4. How many people do you need? 

  • Key question:  What is the ratio of this desired result in (2) relative to your standard deviation?

Are any actual data available from similar weight loss research studies (preferably using the same time period) to measure the weight losses (as well as unintended gains)?  If so, what are some of the reported standard deviations for such a group of people?  Are they consistent enough to come up with an approximate value?

Remember the client dialogue from last newsletter regarding the tar scenario?
  • In wanting to detect an effect of "1," its resulting ratio relative to the tar process standard deviation of "4" was (1/4) = 0.25, which required 500 to 680 experiments. If this was your desired ratio, then, like them, you would need 500 to 680 people.

  • Similarly when wanting to detect an effect of "2," (2/4) = 0.5. If this was your desired ratio, then you would need 130 to 170 people.
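A rough sketch of how the ratio drives the sample size, under assumptions that are mine rather than taken from the original calculation: a two-level factorial in which each effect compares the average of half the participants with the average of the other half, a two-sided test at 5 percent, a normal approximation, and a chance of detection of 80 or 90 percent. The function name `total_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(effect, sd, alpha=0.05, power=0.80):
    """Approximate total N for a two-level factorial (normal approximation).

    Each effect is the difference between two half-sample means, so its
    standard error is 2 * sd / sqrt(N). Solving
    effect / SE = z_alpha + z_power for N gives the formula below.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance risk
    z_power = z.inv_cdf(power)          # desired chance of detecting the effect
    ratio = effect / sd
    return ceil(4 * (z_alpha + z_power) ** 2 / ratio ** 2)


# Ratios from the tar scenario: effect of 1 or 2 against a standard deviation of 4.
for effect in (1, 2):
    for power in (0.80, 0.90):
        n = total_sample_size(effect, sd=4, power=power)
        print(f"ratio={effect / 4:.2f} power={power:.0%} total N={n}")
```

Under these assumptions the totals come out at roughly 500 to 670 for a ratio of 0.25 and roughly 125 to 170 for a ratio of 0.5, in the same neighborhood as the ranges quoted above. Halving the ratio quadruples the required sample size.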

If you obtained sample sizes like these, might you consider the possibility of studying two additional variables' effects on your results?  Perhaps consider:  "Person does not set a goal" ("0") or "Person sets goal" ("1")?

Experimental logistics aside:  Where do these sample size numbers come from?

They come from answering three questions. The first is answered pretty much by default:  What risk are you willing to take for declaring an effect significant when it isn't?  Usually, 5 percent. 

(2) and (3) are the others. Initially these concepts can be very confusing (at least they were for me!), but I hope it will become clearer when I address this further next time.  Answering these three questions determines (4). 

Actually, answering any three of the questions above automatically answers the fourth. 

If you have a specific sample size in mind and plan on using p = 0.05 for significance (2 questions answered), you can work backwards to calculate various combinations of the other two, either:

  • what effect you can reasonably detect (2), given your specific answer to (3) (desired probability to detect it) or

  • the probability of detecting a desired effect (3), given your specific answer to (2) (desired effect to detect).
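The two "working backwards" calculations can be sketched as follows. This is an illustration under stated assumptions (two-level factorial, two-sided test, normal approximation, effects expressed as a ratio to the standard deviation), and the function names and the sample size of 200 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def detectable_ratio(n_total, alpha=0.05, power=0.80):
    """Smallest effect/sd ratio detectable with the given N, alpha, and power."""
    return 2 * (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) / sqrt(n_total)


def power_for_ratio(n_total, ratio, alpha=0.05):
    """Approximate chance of detecting a given effect/sd ratio with the given N.

    The tiny probability of a significant result in the wrong direction
    is ignored.
    """
    return Z.cdf(ratio * sqrt(n_total) / 2 - Z.inv_cdf(1 - alpha / 2))


n = 200  # hypothetical sample size you already have in mind
print(f"Detectable ratio at 80% power: {detectable_ratio(n):.2f}")
print(f"Power to detect a ratio of 0.5: {power_for_ratio(n, 0.5):.2f}")
```

Fixing the sample size and the significance level leaves one degree of freedom: choose the probability of detection and the detectable effect follows, or choose the effect and the probability follows.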

Important point:  If all you do is default to p = 0.05 to detect effects, your design will, by default, answer questions (2) to (4) (as in the calculated sample sizes above) -- which could possibly waste your hard work unless you reconsider your objectives.

And for those of you who are wedded to the "rapid cycle PDSA" methodology:  have I brought up some things that might need consideration during your PLANning? (Hint: Yes).

Here is an additional mess-up, which, unlike Mess-up #1 above, isn't necessarily guaranteed -- if you carefully PLAN:

Balestracci's Mess-up #2:  Vague planning on a proposed vague solution to a vague problem usually results in vague data, on which vague analyses are performed -- yielding vague results.

Many times, I see data collection addressed as an afterthought, usually ad hoc. Collecting poorly designed data (or not even collecting data at all!) virtually guarantees non-trivial human variation seeping in -- an open door for introducing the toxic and very human "constant repetition of anecdotal perceptions."  (CRA...)

More discussion of sample size next time.

Kind regards,
Data Sanity's data philosophy will certainly help you improve the quality of your PLANs to test any theories

It will help you avoid DOE "mess ups" -- or make you come to the realization that you don't even have to run a DOE!

Data Sanity: A Quantum Leap to Unprecedented Results is a unique book that synthesizes the sane use of data, culture change, and leadership principles to create a road map for excellence.

Click here for ordering information [Note:  an e-edition is available] or here for a copy of its Preface and chapter summaries (fill out the form on the page).

[UK and other international readers who want a hard copy:  ordering through U.S. Amazon is your best bet]

Listen to a 10-minute podcast or watch a 10-minute video interview at the bottom of my home page where I talk about data sanity: .

Please know that I always have time for you and am never too busy to answer a question, discuss opportunities for a leadership or staff retreat, webinar, mentoring, or public speaking --  or just about any other reason!  Don't give it a second thought to e-mail or phone me.

Was this forwarded to you?  Would you like to sign up?
If so, please visit my web site -- -- and fill out the box in the left margin on the home page, then click on the link in the confirmation e-mail you will immediately receive.