How to Validate your Decisions Using Statistics (Analysis of Variance) – Part I

Are we producing the desired effects? In manufacturing environments, as well as in many other settings, we take actions and modify parameters, procedures and processes to obtain a specific result (usually to improve the situation). In these cases we need to know whether the result we obtained is an expected consequence of our changes or whether we are just observing variation inherent to the population that is unrelated to our actions.

One technique we can use is the Analysis of Variance, a powerful tool used in statistical design of experiments, Lean Manufacturing, Reliability Engineering and in situations involving many variables and/or samples from different populations.

It’s true that many software packages can perform this calculation automatically; however, it’s important to know how the method works, at least with a simple example like the one below, to be able to interpret and take advantage of the results the software will give us when solving more complex problems.

In this post I will explain this technique in depth, solving a typical problem step by step. Let’s consider three different training methods for performing a specific task. We trained 15 people, 5 with each method, and then measured how much time each of them spent on the task. The results are:

[Table 1] Time to perform the task

Method 1    15    16    14    15    17
Method 2    14    13    15    16    14
Method 3    13    12    11    14    11

If we look at the results we can probably spot some differences between the methods, but how can we be sure that they are due to our actions and not to the variation of the population itself? And what if we had many more samples? The difference would no longer be evident. So at this point we define the hypothesis: “the samples come from the same population and therefore there aren’t significant differences between them” (in other words, all methods produce the same effect). Statistical methods cannot prove this hypothesis, but they let us accept or reject it at a chosen level of confidence.

General Procedure

To test the hypothesis we are going to use the Analysis of Variance. In this method we estimate the variances of the population in a special way (we’ll see how to do it later) and compare them by applying the F-test or Variance Ratio Test. This test is used to determine if the samples come from the same population based on their variances.

Recall that the variance of a set of data is:

[Eq.1]    {\sigma}^{2} =\frac { 1 }{ n } \sum _{ i=1 }^{ n }{ { ({ X }_{ i }-\overline { X } ) }^{ 2 } }

where n is the sample size,
{X}_{i} is the element i of the sample, and
\overline { X } is the sample mean.

To estimate the variance of the population from a sample, we divide by (n-1) instead of n in Eq.1:

[Eq.2] {\sigma}^{2} =\frac { 1 }{ n-1 } \sum _{ i=1 }^{ n }{ { ({ X }_{ i }-\overline { X } ) }^{ 2 } }

Here (n-1) is called the number of Degrees of Freedom (DF).
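As a quick check of the difference between Eq.1 and Eq.2, here is a minimal Python sketch. It uses the Method 1 times from Table 1 purely as example data; the variable names are my own, and the standard-library `statistics` functions appear only as a cross-check.

```python
import math
import statistics

# Example data: the Method 1 times from Table 1.
sample = [15, 16, 14, 15, 17]

n = len(sample)
mean = sum(sample) / n
sum_of_squares = sum((x - mean) ** 2 for x in sample)

variance_eq1 = sum_of_squares / n        # Eq.1: divide by n
variance_eq2 = sum_of_squares / (n - 1)  # Eq.2: divide by n-1
degrees_of_freedom = n - 1

# Cross-check both results against the standard library.
assert math.isclose(variance_eq1, statistics.pvariance(sample))
assert math.isclose(variance_eq2, statistics.variance(sample))

print(round(variance_eq1, 2), round(variance_eq2, 2), degrees_of_freedom)
```

The Eq.2 estimate is always the larger of the two, and the gap shrinks as the sample grows.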

Once we have the estimated variances we calculate the Variance Ratio F, used to perform the test itself:

[Eq.3]  F=\frac { Greater\quad estimate\quad of\quad population\quad variance }{ Lesser\quad estimate\quad of\quad population\quad variance }

Applying Eq.3 we get a value that can be compared with the critical value of the F distribution, which is widely available in tables and in software such as MS Excel and OpenOffice Calc. In the latter two we use the function FINV() to obtain the critical value for a specific confidence level and for the degrees of freedom of both variances, and then compare our calculated F against it. The function syntax is the same in both programs:

[Eq.4] FINV(P; DF1; DF2)

where P is the probability (in our case 1 minus the confidence level, that is, the significance level), and
DF1 and DF2 are the numbers of degrees of freedom of the numerator and the denominator.

If our calculated F is less than the value returned by the function, we accept the hypothesis and conclude that there is no significant difference between the samples at that confidence level:

Calculated F < FINV value → Accept hypothesis → the effect is the same for all methods

Calculated F > FINV value → Reject hypothesis → the methods affect the results at that significance level
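The decision rule above can be sketched in a few lines of Python. The function name is hypothetical, and the two numbers passed in are the calculated and tabulated F values worked out later in this post. The text does not say what to do when both values are exactly equal, so treating that case as "accept" is my assumption.

```python
def f_test_decision(f_calculated: float, f_critical: float) -> str:
    """Compare the calculated F with the critical F returned by FINV."""
    if f_calculated > f_critical:
        return "reject hypothesis"
    # Assumption: the tie case is not covered in the text; treated as accept.
    return "accept hypothesis"


print(f_test_decision(9.37, 3.88))
```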

Now we’re going to learn how to estimate those two variances to calculate the F value for our example. For the first one, we analyse the variation Between Samples (BS) to find out whether it is caused by the methods or is just the variation of a single population. For the second one, we analyse the variation Within Samples (WS), because the values inside each sample also vary and that variation must be accounted for.

Calculating the overall variance

First of all we calculate the overall mean taking all the data together (without considering the samples division).

[Eq.5]  \overline { X } =\frac { 1 }{ n } \sum _{ i=1 }^{ n }{ { X }_{ i } }

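where n is the total number of elements (here 15, all samples taken together) and
{X}_{i} is the element i of the combined data.

For our data the overall mean is 210/15 = 14. Then we estimate the population variance by applying Eq.2: we take each element, subtract the overall mean and square the result, as in Table 2.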

[Table 2] Sum of Squares { \left( { X }_{ i }-\overline { X } \right) }^{ 2 }

                                       Method sum
Method 1    1    4    0    1    9      15
Method 2    0    1    1    4    0      6
Method 3    1    4    9    0    9      23

From Table 2 we get a total sum of squares of 44. Since the total number of elements is 15, the number of degrees of freedom for this estimate is 14.
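To double-check Table 2, here is a small Python sketch of the same calculation. The dictionary layout and variable names are my own choice; the data are the times from Table 1.

```python
# Times from Table 1, one list per training method.
times = {
    "Method 1": [15, 16, 14, 15, 17],
    "Method 2": [14, 13, 15, 16, 14],
    "Method 3": [13, 12, 11, 14, 11],
}

# Overall mean (Eq.5), ignoring the division into samples.
all_values = [x for sample in times.values() for x in sample]
overall_mean = sum(all_values) / len(all_values)

# Squared deviations from the overall mean, summed per method (Table 2).
method_sums = {
    name: sum((x - overall_mean) ** 2 for x in sample)
    for name, sample in times.items()
}

total_ss = sum(method_sums.values())
total_df = len(all_values) - 1

print(overall_mean, method_sums, total_ss, total_df)
```

The per-method sums come out as 15, 6 and 23, which matches the last column of Table 2.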

Calculating the Between Samples Variance

The next step is to analyse the data between samples (BS). We replace each element with its sample mean \hat { { X }_{ i } }. By doing this we eliminate the variation among the values within each sample. Using Eq.5 we calculate the mean for each method:

Method 1 mean: \hat { { X }_{ 1 } } = 15.4
Method 2 mean: \hat { { X }_{ 2 } } = 14.4
Method 3 mean: \hat { { X }_{ 3 } } = 12.2

Then we estimate the variance again for all the data together, but using the replaced values, as in Table 3.

[Table 3] Elements replaced by their sample mean / calculated term from Eq.2

                                                                                      Method sum
Method 1    15.4 / 1.96    15.4 / 1.96    15.4 / 1.96    15.4 / 1.96    15.4 / 1.96    9.8
Method 2    14.4 / 0.16    14.4 / 0.16    14.4 / 0.16    14.4 / 0.16    14.4 / 0.16    0.8
Method 3    12.2 / 3.24    12.2 / 3.24    12.2 / 3.24    12.2 / 3.24    12.2 / 3.24    16.2

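Since all the values in a sample are now equal to the sample mean, we can calculate each method total directly as n{ \left( { \hat { X } }_{ i }-\overline { X } \right) }^{ 2 }, where n=5 is the number of elements in each sample. The between-samples sum of squares is therefore 9.8 + 0.8 + 16.2 = 26.8. Because we are analysing the data between samples, the number of degrees of freedom is given by the number of samples: with three methods, DF = 3-1 = 2.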
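The Between Samples step can be reproduced with the following Python sketch. Again, the names are mine and the data come from Table 1; the results are rounded only for printing.

```python
# Times from Table 1, one list per training method.
times = {
    "Method 1": [15, 16, 14, 15, 17],
    "Method 2": [14, 13, 15, 16, 14],
    "Method 3": [13, 12, 11, 14, 11],
}

all_values = [x for sample in times.values() for x in sample]
overall_mean = sum(all_values) / len(all_values)

between_ss = 0.0
for name, sample in times.items():
    sample_mean = sum(sample) / len(sample)
    # Every element is replaced by its sample mean, so the method total is
    # (number of elements) * (sample mean - overall mean)^2.
    method_total = len(sample) * (sample_mean - overall_mean) ** 2
    between_ss += method_total
    print(name, round(sample_mean, 1), round(method_total, 1))

between_df = len(times) - 1  # number of samples minus one
between_variance = between_ss / between_df

print(round(between_ss, 1), between_df, round(between_variance, 1))
```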

Calculating the Within Samples Variance

Since we have already accounted for the sample means in the Between Samples (BS) estimate, we now focus on the variation within each sample with respect to its mean. For that we create a new set of data containing only the deviations from the sample mean, as in Table 4.

[Table 4] New set of data obtained by subtracting the sample mean from each element

Method 1    -0.4     0.6    -1.4    -0.4     1.6
Method 2    -0.4    -1.4     0.6     1.6    -0.4
Method 3     0.8    -0.2    -1.2     1.8    -1.2

For each element in Table 1 we subtracted its sample mean, \left( { X }_{ i }-{ \hat { X } }_{ i } \right), so the mean of each sample in the new set of data is 0, and so is the overall mean. Then we calculate the sum of squares for the data in Table 4, which reduces to squaring each element because the mean is zero. The results are in Table 5.

[Table 5] Sum of squares from Table 4

                                                    Method sum
Method 1    0.16    0.36    1.96    0.16    2.56    5.2
Method 2    0.16    1.96    0.36    2.56    0.16    5.2
Method 3    0.64    0.04    1.44    3.24    1.44    6.8

The overall sum of squares is 17.2

We are now considering the data within each sample, so n=5, the sample size. Each of the 3 methods (samples) contributes (n-1)=4 degrees of freedom, so in total we have 12 degrees of freedom for this estimate.
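Here is a Python sketch of the Within Samples step, with my own variable names and the Table 1 data. As an extra sanity check that is not part of the procedure above, it also confirms that the total sum of squares from Table 2 minus the within-samples sum gives the between-samples sum of 26.8.

```python
import math

# Times from Table 1, one list per training method.
times = {
    "Method 1": [15, 16, 14, 15, 17],
    "Method 2": [14, 13, 15, 16, 14],
    "Method 3": [13, 12, 11, 14, 11],
}

# Squared deviations of each element from its own sample mean (Tables 4 and 5).
within_ss = 0.0
for sample in times.values():
    sample_mean = sum(sample) / len(sample)
    within_ss += sum((x - sample_mean) ** 2 for x in sample)

# (n - 1) degrees of freedom per sample, added over all samples.
within_df = sum(len(sample) - 1 for sample in times.values())
within_variance = within_ss / within_df

# Sanity check: total sum of squares (Table 2) minus within gives between.
all_values = [x for sample in times.values() for x in sample]
overall_mean = sum(all_values) / len(all_values)
total_ss = sum((x - overall_mean) ** 2 for x in all_values)
assert math.isclose(total_ss - within_ss, 26.8)

print(round(within_ss, 1), within_df, round(within_variance, 2))
```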

Evaluating the Variance Ratio

Once we have estimated both variances from the Between Samples and Within Samples analyses (dividing each sum of squares by its number of degrees of freedom), we use them to test whether both estimates come from the same population. Following Eq.3, we calculate the F value as the ratio of the two variances:

[Eq.6]  F=\frac { Between\quad Samples\quad Variance }{ Within\quad Samples\quad Variance } =\frac { 13.4 }{ 1.43 } =\quad 9.37
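The whole calculation up to Eq.6 can be wrapped in one small function. This is only a sketch under my own naming (`variance_ratio` is hypothetical), not the way any particular software package does it. Note that it works with unrounded values, so it gives F ≈ 9.35; the 9.37 in Eq.6 comes from rounding the Within Samples variance to 1.43 before dividing.

```python
def variance_ratio(samples):
    """Return (between variance, within variance, F) for a list of samples."""
    all_values = [x for sample in samples for x in sample]
    overall_mean = sum(all_values) / len(all_values)

    between_ss = 0.0
    within_ss = 0.0
    for sample in samples:
        sample_mean = sum(sample) / len(sample)
        between_ss += len(sample) * (sample_mean - overall_mean) ** 2
        within_ss += sum((x - sample_mean) ** 2 for x in sample)

    between_df = len(samples) - 1
    within_df = len(all_values) - len(samples)

    between_variance = between_ss / between_df
    within_variance = within_ss / within_df
    return between_variance, within_variance, between_variance / within_variance


# Times from Table 1.
samples = [
    [15, 16, 14, 15, 17],
    [14, 13, 15, 16, 14],
    [13, 12, 11, 14, 11],
]
between_variance, within_variance, f_value = variance_ratio(samples)
print(round(between_variance, 2), round(within_variance, 2), round(f_value, 2))
```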

Now we summarise all the information calculated so far in Table 6:

[Table 6] Results from the analysis

Source of Variance    Sum of Sq    Degrees of Freedom    Variance    Fcalc    Tabulated F for 5%
Between Samples       26.8         2                     13.4        9.37     3.88
Within Samples        17.2         12                    1.43

If we want to test our hypothesis at a significance level of 5% (95% confidence level), we can use Eq.4 in either Excel or Calc to look up the F value for that significance level and the respective DFs:

=FINV(0.05; 2; 12) = 3.88 < calculated F = 9.37 → Reject hypothesis

Since the value returned by the function for that significance level is lower than the calculated F, we conclude that the methods affect the results (hypothesis rejected).
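If you don’t have a spreadsheet at hand, the tabulated value can be checked in Python. The sketch below relies on a special case: when the numerator has exactly 2 degrees of freedom, as in our example, the upper-tail probability of the F distribution is (1 + 2F/DF2)^(-DF2/2), which can be solved for F directly. The function name is hypothetical, this is not how FINV works internally, and for any other DF1 you would need a statistics library (for example `scipy.stats.f.ppf`).

```python
def f_critical_df1_2(p: float, df2: int) -> float:
    """Critical F for significance level p, valid only when DF1 = 2.

    Solves (1 + 2F/df2) ** (-df2 / 2) = p for F.
    """
    return (df2 / 2) * (p ** (-2 / df2) - 1)


f_critical = f_critical_df1_2(0.05, 12)
print(round(f_critical, 3))
```

The result agrees with the 3.88 returned by FINV(0.05; 2; 12) up to rounding.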

If we use the FDIST function, which is the counterpart of FINV and takes the F value as input and returns the probability, we can find the significance level that corresponds to our calculated F value:

=FDIST(9.37; 2; 12) = 0.0035 → 1 - 0.0035 = 0.9965

That is a confidence level of about 99.65%. It is higher than the 95% we required, so the variance ratio is significant and the hypothesis is rejected.
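The FDIST result can be checked the same way. This sketch uses the closed form of the upper-tail probability of the F distribution that holds only when the numerator has 2 degrees of freedom, which is our case; the function name is my own, and a statistics library would be needed for other values of DF1.

```python
def f_upper_tail_df1_2(f_value: float, df2: int) -> float:
    """Probability of exceeding f_value, valid only when DF1 = 2."""
    return (1 + 2 * f_value / df2) ** (-df2 / 2)


p_value = f_upper_tail_df1_2(9.37, 12)
confidence = 1 - p_value
print(round(p_value, 4), round(confidence, 4))
```

The probability is well below 0.05, which is the same conclusion we reached by comparing the calculated F with the tabulated one.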

At this point you know how to solve the problem. In the next post we’ll take a closer look at the process and its limitations, and see why we compare the variances in this way.

Thanks for reading!
