Using statistical methods and confidence intervals to draw measurable and meaningful inferences about data
Author: Sean David Christopher, MBA
This post was originally written in 2013. I wanted to repost it as I believe that it adds value to the Business Analysis Toolkit. Enjoy.
When business analysts start out on any project, whether the project supports business process renewal, the exploration of modern information technology solutions, or simple research on data from a wide variety of sources, the first activity they typically engage in is enterprise analysis. Enterprise analysis consists of a series of activities managed and monitored by the business analyst to help an organization develop requirements that eventually lead to a solution, or a series of solutions, that satisfies a business need. Guidance on how to perform these activities is found in Chapter 5: Enterprise Analysis of the Business Analysis Body of Knowledge (BABOK®).
A business analyst uses enterprise analysis to help an organization not only identify a business need but also conduct a capability gap analysis, determine a solution approach and scope, and develop a business case to justify an investment of time, money, and resources (material and human) in a project that manages progress towards the development of a solution.
The end result of these activities is the development, documentation, and management of business requirements. Business requirements consist of the high-level goals and objectives a business wants to achieve to measure success in a specific strategic area. Typically, business requirements identify a “to-be” state written in the form of high-level statements. Yet business analysis often includes too little data analysis to support discussion about those requirements.
A key component of the business requirements gathering toolkit is the use of statistical methods to help a business analyst find meaning in data. The results of data analysis can be vital inputs into discussions about SMART objectives.
The SMART principle is an overarching set of terms for identifying simple, measurable, achievable, realistic, and time-based objectives. Keep in mind that the SMART principle can also be applied to test assumptions. Goals are qualitative, high-level statements. Objectives are often quantifiable, meaning a metric is applied to them to determine success, or to gauge a company’s performance against the stated objectives once a solution has been implemented and time has passed.
The focus of this paper is the measurable aspect of SMART objectives. Often we read, whether online, in periodicals, or in industry journals, about a business objective and the measurable component attached to it.
But how does a business analyst acquire insight into data using statistical methods? What tools, other than subjective speculation, exist to help analyze the pools of data at one’s disposal? In other words, what method exists to maintain a business analyst’s objectivity while calculating results based on data inputs? And what can a business analyst do with work that yields results?
This paper will specify the learning objectives and key arguments a business analyst can apply as part of their core set of competencies while performing enterprise analysis.
According to the Business Analysis Body of Knowledge (BABOK®), knowledge of business principles and practices includes concepts related to business management and decision making. One tool at a business analyst’s disposal related to decision making is statistical analysis. Statistical analysis is a mathematical process of analyzing historical data to make inferences about future values.
Statistical analysis of data could foster progressive discussion between the business analysis team and the sponsors of a particular project, and perhaps the executive management team as part of strategic planning exercises. The discussion could also help participants make informed decisions about setting objectives.
This paper takes a fictitious case study that gathers data about annual cell phone expenditures in the Ottawa area. Statistical analysis will be used to generate confidence intervals given a historical list of annual expenditure data. The goal is not to question the results but to show how statistical measures and analysis can be used to estimate or generate measurable values that could later be adopted as inputs into discussions between the business analyst and stakeholders. Statistical analysis using the t-distribution can be a key tool in the BA competency toolkit.
From that analysis, the steps a BA can take, armed with statistical results, can contribute to a more reasonable and realistic discussion with sponsors about stated business objectives or assumptions. The assumption made here is that the metric is not already based on statistical methods and that sponsors are keen on feedback of this nature. From that activity, any task documented in the business requirements document and used to work toward such objectives can be clearly determined, given strong statistical backing.
A Profile of our Fictitious Company and Objective
The company: ACME Inc.
The objective: Determine a confidence interval to infer cell phone expenditures in Ottawa.
The method: Statistical analysis using t-distribution and confidence intervals.
The result: A 99.9% confidence interval.
The level: Enterprise Analysis
The Method – The Mean, Standard Deviation and Confidence/Prediction Intervals
Access to historical data is required in order for a business analyst to build a good confidence interval. In this case study, access to a list of annual cell phone expenditures in Ottawa is needed. How far back a BA needs to go is subjective and depends on how much data is made accessible.
Taking a sample of data from a single year is acceptable since, in today’s cell phone market, many users rely on a mobile device for communication. The example in this paper uses a sample size of 20.
Statistical methods, in this example, take the average of the expenditure data and calculate the dispersion of the data around that average, a measure called the standard deviation. In statistics, a t-distribution is used and will be the focus of our analysis. Why the t-distribution? In most cases we rarely have access to the standard deviation of the population, and the t-distribution is useful when working with sample sizes of 30 or less.
The t-distribution lets us calculate a range of values around the average. In the end, we come up with a confidence interval showing the range of cell phone expenditure values based on the calculated standard deviation.
The standard deviation shows how much variation or dispersion exists around the average of our data. Estimated values that fall outside the range of a confidence interval are considered statistically significant.
Sampling is used to make inferences about a population of data. In opinion polls, it is expensive and time consuming to call each and every citizen for their opinion, so pollsters call a cross-section of citizens to get their views on whatever the pollster is trying to estimate. From this activity, sample results can be used to make inferences about what the population (cell phone users and their expenditures) as a whole is spending.
A confidence interval estimates a range likely to contain the true population mean, a quantity that cannot be observed directly. A prediction interval is an estimate of an interval in which future observations will fall, with a certain probability, given what has already been observed (our cell phone expenditure data).
Figure 1 shows the t-distribution.
Figure 1: t-Distribution Curve (*1)
Random ACME Cell Phone Expenditures
First, we need to obtain and document a list of annual cell phone expenditures and this fictitious list is shown in Table 1.
Table 1: Random Annual Cell Phone Expenditures in Ottawa
The Average Expenditure
The first step in our analysis is to calculate the average expenditure from our sample size of 20. We calculate the average by adding up the expenditures and dividing by 20. This calculation yields an average cell phone expenditure per person of $677.
Figure 2 illustrates the formula for calculating a sample average.
Figure 2: Calculating the Average, or Mean (*2)
To determine the standard deviation of the expenditure data from the average, we need to calculate the variance of our sample. Variance is calculated to give us a sense of the magnitude of the dispersion of the sample expenditure data around the average.
The sample variance is calculated by taking the difference of each individual expenditure value from the average calculated above, squaring each resulting value, adding the squared values together, and dividing by 19 (not 20, as will be discussed shortly). The variance for our sample is 9801 (in squared dollars), which indicates that there is quite a bit of variation in the data. In business terms, not all cell phone users spend the same amount, and the amount they do spend varies widely. Subjectively, this statement may be obvious, as spending patterns, needs, and other factors contribute to the dispersion. Objectively, we can say so based on a quick statistical measurement.
Figure 3 illustrates the formula for calculating the sample variance.
Figure 3: Formula for Sample Variance (*3)
The sample standard deviation is then the square root of the sample variance.
Figure 4 illustrates the formula for calculating the sample standard deviation.
Figure 4 – Formula for Sample Standard Deviation (*4)
In our example the standard deviation of our expenditure data is $99.
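The calculations so far can be sketched in a few lines of Python. Since Table 1 is not reproduced here, the 20 expenditure values below are illustrative placeholders, not the article's actual data, so the results will differ from the $677 average and $99 standard deviation reported.

```python
import math
import statistics

# Hypothetical sample of 20 annual cell phone expenditures (illustrative
# placeholders; Table 1's actual values are not reproduced here).
expenditures = [520, 610, 745, 680, 590, 830, 655, 700, 575, 640,
                720, 805, 560, 690, 615, 770, 585, 735, 660, 625]

n = len(expenditures)
mean = sum(expenditures) / n  # sample average (x-bar)

# Sample variance: squared deviations from the mean, summed, divided by n - 1.
variance = sum((x - mean) ** 2 for x in expenditures) / (n - 1)

# Sample standard deviation: square root of the sample variance.
std_dev = math.sqrt(variance)

# The standard library agrees with the manual formulas.
assert math.isclose(variance, statistics.variance(expenditures))
assert math.isclose(std_dev, statistics.stdev(expenditures))
```

Note that Python's `statistics.variance` and `statistics.stdev` already divide by n − 1, which is why they match the manual calculation.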
Up to this point we have some good business data. From these two calculations, ACME can state that the average annual cell phone expenditure is $677 with a sample standard deviation of $99. In other words, an Ottawa cell phone customer spends almost $700 annually, and from customer to customer we estimate almost $100 of variation in that expenditure.
From these calculations we can quickly determine the degrees of freedom (DF), a key metric in our t-distribution discussion and analysis. DF is calculated by subtracting one from our sample size of 20, so our degrees of freedom is 19. Degrees of freedom distinguishes between the various t-distributions that arise from tests with different sample sizes. The larger the sample size, the closer the results converge to the normal distribution of the population. We use the degrees of freedom as one input in finding the critical value (the t-value) associated with the confidence interval we want to establish for our test. The other input is the confidence level itself.
In our example, using our degrees of freedom of 19, I want to create a 95% confidence interval, meaning an interval that, with 95% confidence, contains the true population average. Checking my t-table, I look for the value 19 in the DF column (Footnote 6) and read across the table until I come to the t-value associated with the 95% confidence level. The value is 2.093.
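The degrees-of-freedom calculation and the table lookup can be sketched as a small dictionary. The entries below are the standard two-tailed t-values for 19 degrees of freedom at the confidence levels used in this article.

```python
# Sample size and degrees of freedom.
n = 20
df = n - 1  # 19

# Two-tailed critical t-values for df = 19 from a standard t-table.
t_table_df19 = {0.95: 2.093, 0.99: 2.861, 0.999: 3.883}

# Look up the critical value for the 95% confidence level.
t_value = t_table_df19[0.95]  # 2.093
```

In practice a statistics library would compute these critical values for any df and confidence level, but a table excerpt keeps the sketch dependency-free.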
The formula for calculating the confidence interval for estimating our population average with unknown standard deviation is shown in Figure 5:
Figure 5: Confidence interval for t-distribution (*5)
Note: In the formula above, x-bar is the sample average, t is the t-value retrieved from the t-table, s is the sample standard deviation, and n is the sample size.
Using this formula we plug in the following: 677±2.093(99/4.47) = 677±46.35, where 4.47 is the square root of the sample size, 20.
Using a confidence level of 95%, with 19 degrees of freedom, my confidence interval is $630.65 to $723.35.
We could widen our confidence interval by calculating it at the 99% confidence level (the only value that changes is the t-value).
In this case, the formula yields the following: 677±2.861(99/4.47) = 677±63.36.
Using a confidence level of 99%, with 19 degrees of freedom, my confidence interval is $613.64 to $740.36.
And for the 99.9% confidence level, the formula yields the following: 677±3.883(99/4.47) = 677±85.99. Therefore, for a 99.9% confidence level, my confidence interval is between $591.01 and $752.99.
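The three intervals above can be reproduced with a short Python sketch using the article's summary statistics (mean $677, sample standard deviation $99, n = 20). The last digits differ slightly from the article's because it rounds the square root of 20 to 4.47.

```python
import math

mean, s, n = 677.0, 99.0, 20

# Two-tailed critical t-values for df = 19 at each confidence level.
t_values = {"95%": 2.093, "99%": 2.861, "99.9%": 3.883}

for level, t in t_values.items():
    half_width = t * s / math.sqrt(n)  # margin of error: t * s / sqrt(n)
    print(f"{level}: ${mean - half_width:.2f} to ${mean + half_width:.2f}")
```

Raising the confidence level only swaps in a larger t-value, which is why the interval widens from roughly ±$46 at 95% to roughly ±$86 at 99.9%.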
Statistical methods can be used to assess how likely a future, unobserved value is to fall within a given range. Keep in mind that these methods do not predict future expenditures or future values in any way; they merely empower a business analyst with a tool set for gauging confidence in the realism of a range of data.
Keep in mind also that if another set of sample data is used, say another random sample of annual cell phone expenditures, then a different average and a different confidence interval will be calculated.
The point here is that no matter what data is used as the sample source, a confidence interval can be determined and an assessment made of the confidence we can have in making inferences about the population. In our particular example, we have built 95%, 99% and 99.9% confidence intervals. We can make inferences about the population as a whole with these numbers.
What does this mean for our objective and for you as a business analyst? Our objective was to develop a confidence interval to calculate a range of cell phone expenditures. We have achieved that objective. As a business analyst, you can use the confidence interval data, say that of the 99.9% confidence interval, to make the inference that you are 99.9% confident the true average annual cell phone expenditure in the population falls between $591.01 and $752.99. You can use this information to feed discussion about an expenditure profile of Ottawa cell phone users. In turn, this could serve as input into marketing campaigns, reports for the Finance department, or discussions about setting business objectives for ACME if it chose to implement a new process to get more Ottawa cell phone users to spend more money on mobile services.
In summation, if we were to take the entire expenditure data and calculate the true population average, we can be confident that value would fall within our established confidence interval. But as I wrote earlier, it is expensive and time consuming to calculate the true average of the population, so we rely on confidence intervals like those built from the t-distribution to make an inference about the population data. While the data tells us little about why people spend a little or a lot on cell phone services, it does tell us, based on our example, the range of annual expenditures we can confidently attribute to an average person.
Dividing by N-1
Earlier in our calculation of sample variance and sample standard deviation, we divided our summed value by 19 and not 20 (as we did for calculating the average).
Why is this so?
Most of the time we do not know μ (the population average), so we estimate it with x-bar (the sample average). The formula for the sample variance measures squared deviations from x-bar rather than μ. Sample values tend to be closer to their own average x-bar than to μ, so we compensate by using the divisor (n-1) rather than n.
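A quick simulation illustrates the point. Assuming, hypothetically, a normal population with mean 677 and standard deviation 99 (so a true variance of 9801), drawing many samples of 20 shows that the n divisor systematically underestimates the true variance while the n − 1 divisor does not.

```python
import random

random.seed(42)
true_mean, true_sd, n, trials = 677, 99, 20, 5000

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    biased_sum += ss / n          # divide by n: biased low
    unbiased_sum += ss / (n - 1)  # divide by n - 1: Bessel's correction

# On average, the n - 1 estimate lands near the true variance (99^2 = 9801)
# while the n estimate falls short by roughly the factor (n - 1) / n.
print(biased_sum / trials, unbiased_sum / trials, true_sd ** 2)
```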
Business analysts play a key role in helping sponsors on projects and their stakeholders come up with and document business requirements as a result of proper enterprise analysis. A key component of business requirements is the presence of measurable objectives that are traced back to strategic goals. Good objectives consist of simple, measurable, achievable, realistic, and time-based statements.
A business analyst can use statistical methods on historical data to test hypotheses about population data. Tests of confidence can be used to open lines of communication with stakeholders and, more importantly, the sponsor, to adjust measurements where necessary, provided sponsors are willing and flexible enough to listen to objective input based on sound statistical calculations. Statistical methods add another tool to the competency set a business analyst possesses. Sample data used as the source of the analysis can generate confidence intervals across the range of data provided for the analysis.
Setting measurable objectives is itself a decision, whether that decision is made by a chief executive, a group of strategists, or a project sponsor. The point is that a decision is reached. The business analyst can help shape that decision by providing feedback based on statistical methods, informing the valuation of measurements associated with business objectives with a simple but powerful technique.
1. T-Distribution curve image courtesy of http://www.xycoon.com/images/studen24.jpg
2. Sample mean image courtesy of http://www.discover6sigma.org/img/samp-xbar.png
3. Sample variance formula courtesy of http://www.weibull.com/DOEWeb/graphics/chapter3__60.png
4. Sample standard deviation formula courtesy of http://cfacuecards.files.wordpress.com/2012/06/samplestandarddeviation.jpg
5. T-distribution theory provided by http://www.stat.tamu.edu/~sunwards/303/7.pdf
6. A t-table is available courtesy of http://www.stat.tamu.edu/stat30x/zttables.php#ttable for readers who want to follow the logic beyond this article.