How to Validate your Decisions Using Statistics (Analysis of Variance) – Part II

In the previous post I showed how to solve a simple problem by performing an Analysis of Variance (if you haven’t read it, please click here). The example was about three different methods of performing a task, and we wanted to know whether we were getting different results or the variations in the data were only due to the population dispersion.

Generally speaking, we compared the variance between the three samples and the variance within each sample, and used the Variance Ratio Test to determine, with a specific confidence level, whether the samples come from the same population, which in practical terms (for our example) means that the different methods do not change the outcome. In our example we rejected this hypothesis, so we can say that there is a significant difference between the methods.

To obtain trustworthy results, we need to consider the assumptions of the method, which limit its use. They are:

  1. The subjects should be randomly sampled
  2. The response variable should be normally distributed
  3. The population means may be different between samples, but the population standard deviation should be the same for all samples.

Fortunately, this method is robust enough to deliver trustworthy results even when we don’t respect those assumptions (to some extent), especially the last two. Some ways to check whether we are so far from the assumptions that we might be getting misleading results (a short code sketch follows the list) are:

  • Look at the normal quantile plot for each sample to see if the data are at least close to the normal line (these plots are a way to check whether a sample comes from a normally distributed population).
  • Calculate the standard deviation for each sample and take the ratio between the largest and the smallest. The result should be less than two.
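To make these checks concrete, here is a minimal Python sketch using scipy and matplotlib that draws a normal quantile plot per sample and computes the standard deviation ratio. The sample data below are hypothetical stand-ins, not the data from the first post:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical stand-in samples; replace with your own data.
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=1.2, size=5) for m in (4, 6, 8)]

# Check 1: normal quantile (Q-Q) plots. Points close to the line
# suggest the sample is compatible with a normal population.
fig, axes = plt.subplots(1, len(samples), figsize=(12, 4))
for ax, sample in zip(axes, samples):
    stats.probplot(sample, dist="norm", plot=ax)
plt.show()

# Check 2: ratio between the largest and smallest sample standard
# deviation. It should be less than two.
stds = [np.std(sample, ddof=1) for sample in samples]
print("std ratio:", max(stds) / min(stds))
```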

In our example, if we calculate the variances for the three samples using Eq. 1 from the first post, we get:

\sigma_1^2 = 1.3 \qquad \sigma_2^2 = 1.3 \qquad \sigma_3^2 = 1.7

Since the variance is the square of the standard deviation, we can obtain each standard deviation by taking the square root:

\sigma_1 = \sqrt{\sigma_1^2} = 1.14 \qquad \sigma_2 = \sqrt{\sigma_2^2} = 1.14 \qquad \sigma_3 = \sqrt{\sigma_3^2} = 1.30

The ratio between the largest and the smallest is 1.30/1.14 ≈ 1.14, so we can assume that we fulfill the third condition. Regarding the second condition, we assume that the distribution is close enough to normal to get good results. In a future post I will talk about normality tests in depth.
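Just to double-check the arithmetic, here is the same calculation in Python, starting from the variances obtained above:

```python
import math

variances = [1.3, 1.3, 1.7]        # sample variances from Eq. 1
stds = [math.sqrt(v) for v in variances]
print(stds)                        # [1.14, 1.14, 1.30] (rounded)
print(max(stds) / min(stds))       # ~1.14, below the limit of 2
```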

So, now that we can say that the results we obtained in the last post are trustworthy, we can move on to the analysis of the methodology itself. In Table 6 below we can see the results derived from our analysis in the first post.

Table 6. ANOVA results from the first post.

Source of Variance          Sum of Squares   Degrees of Freedom   Variance   Fcalc   Tabulated F (5%)
Between Samples             26.8             2                    13.4       9.37    3.88
Within Samples (residual)   17.2             12                   1.43
Total                       44               14                   3.14
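As an aside, the whole table can be reproduced programmatically; here is a minimal sketch using scipy.stats.f_oneway, with hypothetical stand-in samples (the real data are in the first post):

```python
from scipy import stats

# Hypothetical stand-in samples (five elements each, as in the example);
# substitute the data from the first post to reproduce Table 6.
method_1 = [5, 6, 7, 5, 6]
method_2 = [8, 7, 9, 8, 7]
method_3 = [6, 8, 7, 9, 8]

f_calc, p_value = stats.f_oneway(method_1, method_2, method_3)
print(f"Fcalc = {f_calc:.2f}, p-value = {p_value:.4f}")
# If Fcalc exceeds the tabulated 5% value (equivalently, p < 0.05), we
# reject the hypothesis that all three methods give the same outcome.
```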

Taking the information highlighted in Table 6, one way to see what we’ve done is that we decomposed the total result (last row) into two parts (which were then compared using the variance ratio). The first part is the data that follows our model (the sample means), and it’s what we called Between Samples. The second part is the extent to which individual elements deviate from our model (subtracting the sample mean from the individual elements), and it’s called the Within Samples analysis. This last term is called the residual error and is the variability that is not explained by the model:

[Eq.7]   \text{Total} = \text{Data} + \text{Residual Error}

Reflecting what I mentioned in the paragraph above, we replace the terms in the previous equation: for the total, we take each element minus the overall mean; for the data (the Between Samples analysis) we take the sample mean minus the overall mean; and for the residual error (Within Samples analysis) we take each element minus its sample mean. That leads us to Eq. 8.

[Eq.8]   (X_i - \overline{X}) = (\hat{X}_i - \overline{X}) + (X_i - \hat{X}_i)

Now we square the values and sum over all elements on both sides:

[Eq.9]   \sum (X_i - \overline{X})^2 = \sum [(\hat{X}_i - \overline{X}) + (X_i - \hat{X}_i)]^2 \\ = \sum (\hat{X}_i - \overline{X})^2 + \sum (X_i - \hat{X}_i)^2 + \underbrace{2 \sum (\hat{X}_i - \overline{X})(X_i - \hat{X}_i)}_{=\,0}

As shown in the previous equation, we expanded the square on the right side (each term squared plus twice the product of the two). The last term equals 0 due to orthogonality. If you believe me, skip the explanation below; otherwise, keep reading:

So we isolate the last term:
[Eq.10]   2 \sum (\hat{X}_i - \overline{X})(X_i - \hat{X}_i)

For each sample, the first factor can be taken out of the summation because it is constant within the sample (it doesn’t depend on the individual elements):
[Eq.11]   2 (\hat{X}_i - \overline{X}) \sum (X_i - \hat{X}_i)

Now, what we have left inside the summation is the sample mean \hat{X}_i, which after performing the summation becomes n times \hat{X}_i, while the other part becomes the sum of all the elements:
[Eq.12]   2 (\hat{X}_i - \overline{X}) \left[ \sum X_i - n \hat{X}_i \right]

If we replace the sample mean with its formula (from Eq. 5), we can see that the result of the term is zero:
[Eq.13]   2 (\hat{X}_i - \overline{X}) \left[ \sum X_i - n \, \frac{\sum X_i}{n} \right] = 2 (\hat{X}_i - \overline{X}) \underbrace{\left[ \sum X_i - \sum X_i \right]}_{=\,0} = 0

So, without the last term, we are left with Eq. 14, which is Eq. 7 written in terms of the formulas we used during the first part of the post.

[Eq.14]  \sum (X_i - \overline{X})^2 = \sum (\hat{X}_i - \overline{X})^2 + \sum (X_i - \hat{X}_i)^2
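Here is a quick numerical check of Eq. 14 (and of the cross term vanishing, as derived in Eq. 13), again with hypothetical stand-in data:

```python
import numpy as np

# Hypothetical stand-in data: three samples of five elements each.
samples = np.array([[5, 6, 7, 5, 6],
                    [8, 7, 9, 8, 7],
                    [6, 8, 7, 9, 8]], dtype=float)

overall_mean = samples.mean()                        # overall mean
sample_means = samples.mean(axis=1, keepdims=True)   # one mean per sample

ss_total = ((samples - overall_mean) ** 2).sum()
ss_between = samples.shape[1] * ((sample_means - overall_mean) ** 2).sum()
ss_within = ((samples - sample_means) ** 2).sum()
cross_term = 2 * ((sample_means - overall_mean) * (samples - sample_means)).sum()

print(ss_total, ss_between + ss_within)  # the two sides of Eq. 14 match
print(cross_term)                        # ~0, as shown in Eq. 13
```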

It’s important to note that what we did in the first part of the article is how we actually solve these problems; this part just shows where the formulas come from. We can see that the Between and Within calculations are the terms of the equation, which expresses the total variability in two parts:

  • Variability Between Samples (BS), using the sample means and their variation around the overall mean.
  • Variability Within Samples (WS), also called residual error: the variation of individual elements around their sample mean.

So, if the variance between samples (BS) is relatively large with respect to the variance within each method (WS), we can infer that the means of the populations from which the samples were taken are different. As we saw in the first post, by comparing this ratio with the tabulated F function we can affirm our deductions with a certain level of confidence.
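The tabulated F value itself doesn’t have to come from a table; here is a small sketch computing it with scipy.stats.f.ppf, using the degrees of freedom from Table 6:

```python
from scipy import stats

alpha = 0.05
df_between, df_within = 2, 12   # degrees of freedom from Table 6
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
print(f_crit)                   # ~3.89, matching the tabulated value

f_calc = 9.37                   # Fcalc from Table 6
if f_calc > f_crit:
    print("Significant difference between the methods at the 5% level.")
```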

Thanks for reading! This is the second post of a series about the practical use of statistics to make better decisions and obtain information from our data. Please let me know if you have any questions!

I look forward to your comments!

 
