This article is the first in a series of posts on quantitative research. It introduces basic statistics because, if your PhD thesis will use quantitative research methods, you need to understand the statistical concepts used when designing your study, analysing your data and interpreting the results.
Difference between population and sample
A population is a collection of all the items in your field of research interest. On the other hand, a sample is a subset of the population.
The size of a population is denoted by N, while the size of a sample is denoted by n.
The values obtained from a population are called parameters, while the values obtained from a sample are called statistics.
It is difficult to observe or contact a population, while it is easy to observe and contact a sample.
Example of population and sample
Assume your research interest is in the small and medium enterprises (SMEs) in your country and that there are 10,000 of them. Your study population is the 10,000 SMEs. However, since it is almost impossible to collect data from all 10,000 SMEs, you will draw a sample from them, for instance 500 SMEs, and collect your data from that sample.
For quantitative research, the sample drawn from the population should be random and representative if you want to generalise your study’s findings to the entire population.
A random sample means that all the elements in the population have an equal chance of being selected into the sample.
A representative sample means that the sample reflects the different kinds of elements found in the population. In the example of SMEs above, your sample will be representative if it includes: youth-owned SMEs, women-owned SMEs, men-owned SMEs, family-owned SMEs, SMEs of different sizes, SMEs that have been in operation for different numbers of years etc.
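As a rough illustration of drawing a simple random sample, the sketch below picks 500 SMEs out of 10,000 using Python's standard library. The ID numbers and the fixed seed are assumptions made only for this example; a real study would sample from an actual list of SMEs.

```python
import random

# Hypothetical sampling frame: ID numbers 1..10,000 stand in for the SMEs.
population = list(range(1, 10_001))

# A fixed seed makes the example reproducible; it is not needed in a real study.
rng = random.Random(42)

# Simple random sample without replacement: every SME has the same
# chance of being selected.
sample = rng.sample(population, k=500)

print(len(population))  # N, the population size
print(len(sample))      # n, the sample size
```

Note that a random draw does not by itself guarantee that every group of SMEs is included, so it is still worth checking the composition of the sample afterwards.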
Types of data
Understanding the different types of data is important because different types of data require different measurements and different analytical techniques as well as different visualisation methods.
There are two types of data: categorical and numerical data.
Categorical data are in categories/groups, for example: types of housing structure, types of schools, questions with two or more response categories etc.
Numerical data are further classified into discrete and continuous data. Discrete data can be counted and usually take only whole numbers, for instance, number of siblings, number of teachers in a school etc. Continuous data, on the other hand, are measured rather than counted and can take any value within a range, for instance, weight, height, distance etc.
Levels of measurement
In addition to understanding the types of data, it is also important to understand the levels of measurement for data. There are four levels of measurement: two for qualitative data and two for quantitative data.
For qualitative data, the levels of measurement include nominal and ordinal data.
Nominal data are categories that have no natural order, for instance, gender (either male or female), religion (Christianity, Islam, Hinduism, Buddhism etc) etc.
Ordinal data on the other hand are data that are in categories and can be ordered in a strict order, for instance, performance of students (poor, average, good, best), level of satisfaction (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied) etc.
For quantitative data, the levels of measurement include interval and ratio data. Both of these data types are represented by numbers.
The main difference is that ratio data have a true zero while interval data do not. Ratio data are more common and include distance, time, age, number of objects etc.
The most commonly cited example of interval data is temperature measured in degrees Celsius or Fahrenheit. A temperature of zero degrees on these scales does not imply the absence of temperature, hence these scales do not have a true zero.
Data visualisation means presenting data in a way that is easy to interpret. It is an important skill and one of the preliminary steps of data analysis.
The choice of data visualisation method depends on the type of data and their levels of measurement.
Data visualisation for categorical data
Categorical data can be presented in various ways. The four most common are:
Frequency distribution tables
A frequency distribution table has two columns: the category and the frequency of that category.
Bar charts/column charts
A bar chart/column chart is a type of graph in which the horizontal axis represents the category and the vertical axis represents the frequency of that category.
Pie charts
A pie chart is a type of graph that shows the category and its relative frequency (in percentage). All the relative frequencies/percentages should add up to 100%.
Pie charts are useful for conveying the share of the total of each category.
Pareto diagrams
A pareto diagram is a special type of bar chart in which the categories are shown in descending order of their frequencies.
A curve showing the cumulative relative frequency is then drawn over the bars, and a second vertical axis on the right side of the chart shows the cumulative percentages.
A pareto diagram is therefore a combination of the strengths of the bar chart and pie chart.
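The numbers behind a frequency table, a pie chart and a pareto diagram can be sketched as follows. The survey responses are hypothetical, and the code only computes the values; plotting them is left to whichever charting tool you prefer.

```python
from collections import Counter

# Hypothetical survey responses: type of SME ownership.
responses = ["family", "women", "family", "youth", "family",
             "women", "men", "family", "women", "family"]

counts = Counter(responses)
total = sum(counts.values())

# Sort categories in descending order of frequency, as in a pareto diagram,
# and accumulate the running (cumulative) percentage.
pareto_rows = []
cumulative = 0
for category, frequency in counts.most_common():
    cumulative += frequency
    pareto_rows.append((category, frequency,
                        100 * frequency / total,
                        100 * cumulative / total))

# Columns: category, frequency, relative frequency, cumulative frequency.
for category, frequency, rel_pct, cum_pct in pareto_rows:
    print(f"{category:<8}{frequency:>3}{rel_pct:>8.1f}%{cum_pct:>8.1f}%")
```

The first two columns are the frequency distribution table, the third column is what a pie chart displays, and the last column is the curve on a pareto diagram.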
Data visualisation for numerical data
Frequency distribution tables
Frequency distribution tables can also be used for numerical data but one must first group the data using intervals (for instance, intervals of 5, 10, 20 etc) depending on the data one has.
A frequency distribution table is then generated using the grouped data.
Including the relative frequencies for each group is recommended.
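A grouped frequency distribution table can be built along these lines. The ages and the interval width of 10 are assumptions chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical ages of 12 respondents.
ages = [21, 34, 25, 47, 38, 29, 52, 41, 33, 26, 45, 39]
width = 10  # interval width, chosen to suit the data

# Map each value to the lower bound of its interval, e.g. 34 -> 30.
grouped = Counter((age // width) * width for age in ages)

table = []
for lower in sorted(grouped):
    frequency = grouped[lower]
    relative = frequency / len(ages)
    table.append((f"{lower}-{lower + width - 1}", frequency, relative))

# Columns: interval, absolute frequency, relative frequency.
for interval, frequency, relative in table:
    print(f"{interval:<8}{frequency:>3}{relative:>8.2f}")
```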
Histograms are the most common charts used to present numerical data.
They are similar to bar charts but have numerical data on the horizontal axis instead of categories.
Histograms are created using the frequency distribution tables.
In histograms, the intervals are contiguous: one interval ends where the next begins. This is different from bar charts, where the categories are completely independent.
The vertical axis on histograms can have either the absolute frequencies or relative frequencies.
Cross tables are useful when you want to present data on two variables.
A cross table is used when you have a categorical variable and a numerical variable, for instance, amount of budgetary allocations by health facilities in a country.
It is important to include the totals for the rows and columns in a cross table.
An example of a cross table is shown below:
The data in a cross table can also be presented in a side-by-side bar chart as shown in the example below:
A scatter plot is also used when you have data on two variables. However, both variables should be numerical.
Scatter plots are loaded with information, for instance, they can tell if there is a trend in the data, if a relationship exists between the two variables, the nature of that relationship (positive or negative), and if there are any outliers in the data.
An example of a scatter plot is shown below:
Measures of central tendency
There are three measures of central tendency: mean, median and mode.
The mean is the simple average of the data.
It is obtained by summing the data points and dividing the sum by the number of data points.
Assuming you have the following data: 1, 2, 3, 4, 5, 6, 7, and 8.
The mean of this data will be: (1+2+3+4+5+6+7+8) / 8 = 4.5
The mean is the most common measure of central tendency but its disadvantage is that it is easily affected by outliers.
It is therefore not sufficient to use the mean to make general conclusions about a given set of data.
The median is the middle number in a dataset that has been ordered, that is, arranged either in an ascending or descending manner.
For a dataset with an odd number of data points, the median is straightforward, as shown in the example below:
1, 2, 3, 4, 5, 6, 7: In this dataset, the median is 4 because it is the middle number.
For a dataset with an even number of data points, the median is obtained by calculating the mean of the two middle numbers, as shown in the example below:
1, 2, 3, 4, 5, 6, 7, 8: In this dataset, the median is the mean of 4 and 5, which is 4.5.
The advantage of the median is that it is far less affected by outliers than the mean.
The mode is the most frequent number in a dataset.
It can be used for both numerical and categorical data.
Some datasets have no mode.
In the dataset: 1, 2, 2, 3, 4, 5, 6, the mode is 2 because it occurs more frequently than the other numbers.
In the dataset: 1, 2, 3, 4, 5, 6, 7, there is no mode because no single number occurs more frequently than the others.
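The three measures can be checked with Python's built-in statistics module (version 3.8 or later is assumed for multimode). The datasets are the ones used above; the dataset with an outlier is a hypothetical addition to show how the mean and median react differently.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]
mean_value = statistics.mean(data)      # 4.5
median_even = statistics.median(data)   # 4.5, the mean of 4 and 5
median_odd = statistics.median([1, 2, 3, 4, 5, 6, 7])  # 4

# multimode returns every value that shares the highest count.
modes = statistics.multimode([1, 2, 2, 3, 4, 5, 6])           # [2]
no_single_mode = statistics.multimode([1, 2, 3, 4, 5, 6, 7])  # all seven values

# Hypothetical outlier: replacing 8 with 80 moves the mean but not the median.
with_outlier = [1, 2, 3, 4, 5, 6, 7, 80]
mean_outlier = statistics.mean(with_outlier)      # 13.5
median_outlier = statistics.median(with_outlier)  # 4.5
```

When every value is returned by multimode, each one occurs equally often, which corresponds to the "no mode" case described above.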
Measures of asymmetry: skewness
The most common measure of asymmetry is the skewness.
Skewness indicates whether the data are concentrated on one side of the distribution.
There are three types of skewness:
Positive (right) skewness
In this case, the data points are concentrated on the left but the tail is leaning towards the right of the distribution graph.
The mean is greater than the median.
The mode is the highest point.
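Zero skewness (symmetry)
In this case, there is no skewness, meaning that the data points are symmetrically distributed around the centre and are concentrated on neither the left nor the right.
For a symmetric distribution with a single peak, the mean = median = mode.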
Negative (left) skewness
In this case, the data points are concentrated on the right but the tail is leaning towards the left of the distribution graph.
The mean is less than the median, and the mode is the highest point.
Skewness therefore refers to the direction in which the tail leans (not where the majority of the data points are concentrated). It is also where the outliers lie.
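One common way to put a number on skewness is the moment-based coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below uses that formula, which is an assumption on my part since other formulas exist, and the datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]   # hypothetical, long right tail
symmetric = [1, 2, 3, 4, 5]
left_skewed = [-x for x in right_skewed]   # mirror image, long left tail

for name, values in [("right", right_skewed),
                     ("symmetric", symmetric),
                     ("left", left_skewed)]:
    print(name, round(skewness(values), 3),
          statistics.fmean(values), statistics.median(values))
```

A positive result goes with a mean above the median, a negative result with a mean below the median, and a result of zero with a symmetric dataset.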
Measures of variability
There are several measures of variability:
The range is the difference between the highest and lowest points in a dataset.
In the dataset: 1, 2, 3, 4, 5, 6, 7, and 8, the highest point is 8 and the lowest point is 1. So the range is 8-1=7.
Variance measures how data points are dispersed around their mean.
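It is obtained by subtracting the mean from each data point, squaring the differences, summing the squared differences, and then dividing the sum by the number of data points (N) for a population or by the sample size less 1 (n-1) for a sample.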
Squaring the differences between the data points and their mean is done for two main reasons:
- It results in non-negative values.
- It enhances the effect of large differences.
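For the same set of data, the sample variance formula gives a larger value than the population variance formula because it divides by n-1 rather than N. This upward correction takes into account the fact that a sample tends to understate the variability of the population it was drawn from.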
The standard deviation is obtained by taking the square root of the variance.
The variance is expressed in squared units and is often a large number because of the squaring of the differences between the data points and their mean. The standard deviation is expressed in the same units as the data and is therefore easier to interpret.
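The calculation can be traced step by step on the dataset 1 to 8 used earlier. This is a minimal sketch; the statistics module offers pvariance, variance, pstdev and stdev, which should give the same results.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]
mean = statistics.mean(data)                       # 4.5
squared_diffs = [(x - mean) ** 2 for x in data]    # these sum to 42.0

population_variance = sum(squared_diffs) / len(data)      # divide by N: 5.25
sample_variance = sum(squared_diffs) / (len(data) - 1)    # divide by n-1: 6.0

# The standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```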
Coefficient of variation
The coefficient of variation is obtained by dividing the standard deviation by the mean.
Whereas the standard deviation is useful when you only have one dataset, it can be misleading when you want to compare the variability of two or more datasets that have different units or very different means. In this case, the coefficient of variation is the more useful measure of variability.
The coefficient of variation does not have a unit of measurement.
If two datasets have the same coefficient of variation, then one can conclude that the datasets have the same level of variability.
If one dataset has a higher coefficient of variation than another, then one can conclude that the dataset with a bigger coefficient of variation has more variability than the other dataset, and vice versa.
Consider two datasets on the household incomes of two countries A and B. The dataset on country A has a coefficient of variation of 2.35, and country B has a coefficient of variation of 0.79. Country A has more variability in its household income while country B has less variability in its household income. One can therefore conclude that country B is a more equal country than country A.
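The comparison can be sketched with two small hypothetical income datasets recorded in different currencies. The figures are invented for illustration and are not the values quoted above.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (has no unit)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical monthly household incomes, in two different currencies.
country_a = [200, 450, 800, 1500, 6000]
country_b = [30000, 34000, 36000, 39000, 41000]

sd_a = statistics.stdev(country_a)
sd_b = statistics.stdev(country_b)
cv_a = coefficient_of_variation(country_a)
cv_b = coefficient_of_variation(country_b)

print(round(sd_a, 1), round(sd_b, 1))   # standard deviations
print(round(cv_a, 2), round(cv_b, 2))   # coefficients of variation
```

Here country B has the larger standard deviation simply because its numbers are bigger, while the coefficient of variation shows that incomes in country A are relatively more spread out.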
Measures of relationship between variables
The measures discussed in the previous sections are applicable to one variable in a dataset.
When you have two variables in a dataset, you can use other measures to test the relationship between those variables. The most common measures are co-variance and correlation coefficient.
Assuming two variables A and B, the co-variance is obtained by multiplying, for each observation, the deviation of variable A from its mean by the deviation of variable B from its mean, summing these products, and then dividing the sum by the sample size less 1.
Unlike the variance, the co-variance can be positive, negative or zero:
- A positive co-variance (>0) means that the two variables are moving together in the same direction.
- A negative co-variance (<0) means that the two variables are moving in the opposite direction.
- A co-variance equal to zero means that the two variables have no linear relationship.
The downside to the co-variance is that its size depends on the units of the two variables and it can be a very large number, which makes its interpretation difficult.
A correlation coefficient is an adjustment to the co-variance obtained by dividing the co-variance by the product of the standard deviations of the two variables.
The correlation coefficient ranges between +1 and -1:
- a correlation coefficient of +1 is referred to as perfect positive correlation and implies that the entire variability of one variable is explained by the other variable and that the two variables are moving in the same direction.
- a correlation coefficient of -1 is referred to as perfect negative correlation and also implies that the entire variability of one variable is explained by the other variable but the two variables are moving in opposite directions.
- a correlation coefficient of 0 means that there is no linear correlation between the two variables.
The closer the correlation coefficient is to 1 (either +1 or -1), the stronger the correlation between the two variables, and the closer it is to 0, the weaker the correlation between the two variables.
Hence the correlation coefficient is interpreted in terms of both the strength and the direction of the correlation.
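Both measures can be computed directly from their definitions. The sketch below uses hypothetical data on years of education and monthly income; the function names are my own and not part of any library.

```python
import statistics

def covariance(xs, ys):
    """Sample co-variance: sum of products of deviations / (n - 1)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def correlation(xs, ys):
    """Co-variance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical data: years of education and monthly income.
education = [8, 10, 12, 14, 16, 18]
income = [300, 420, 480, 700, 820, 1100]

cov = covariance(education, income)
r = correlation(education, income)

print(round(cov, 1))   # large, and it depends on the units used
print(round(r, 2))     # always between -1 and +1
```

Multiplying the incomes by 100 would multiply the co-variance by 100 but leave the correlation coefficient unchanged, which is why the latter is easier to interpret.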
When discussing correlation between variables, one thing that often comes up is the issue of causality.
Two variables may be correlated, but this does not mean that one variable causes the other. Hence the famous saying in statistics that “correlation does not always imply causation.”
Consider two variables, level of education and income level. These two variables are highly likely to be positively correlated, in that higher levels of education are associated with higher income levels. However, whereas a higher level of education may lead to a higher income level, the opposite may not hold: higher income levels may not cause higher levels of education. The correlation alone therefore cannot tell you which variable influences the other.
A distribution is a function that shows all the possible values of a variable and how often those values occur in a dataset. The distribution is best presented in a graph.
The normal distribution
The normal distribution has a bell shape and is also referred to as the Gaussian distribution.
The normal distribution is symmetrical and has no skew; therefore the mean = median = mode.
The data points are perfectly centered around the mean.
The position of the normal distribution depends on the size of the mean. A dataset with a lower mean would have its normal distribution more on the left side whereas a dataset with a bigger mean would have its normal distribution more on the right side.
On the other hand, how thin or wide a normal distribution is spread out depends on the value of the standard deviation.
A lower standard deviation (with the mean unchanged) would lead to a normal distribution that has more data points concentrated in the middle and thinner tails.
A higher standard deviation (with the mean unchanged) would lead to a normal distribution that has more data points spread out with fewer data points in the middle and fatter tails. The normal distribution will therefore flatten out.
The standard normal distribution
All distributions, including the normal distribution, can be standardised.
A standardised distribution has a mean of 0 and a standard deviation of 1.
A standardised variable is referred to as the z-score. The score is obtained by subtracting the mean from each data point and then dividing the result by the standard deviation.
Standardisation is done to make predictions and inference much easier.
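Standardisation can be sketched in a few lines. The dataset is the one used earlier, and the population standard deviation is used here as an assumption; with a sample you would typically use the sample standard deviation instead.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]
mean = statistics.fmean(data)
sd = statistics.pstdev(data)   # population standard deviation

# z = (data point - mean) / standard deviation
z_scores = [(x - mean) / sd for x in data]

z_mean = statistics.fmean(z_scores)   # approximately 0
z_sd = statistics.pstdev(z_scores)    # approximately 1

print([round(z, 2) for z in z_scores])
```

The z-scores keep the shape of the original data but are re-expressed as the number of standard deviations each point lies above or below the mean.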
The central limit theorem (CLT)
The mean of a sample depends on the individuals in that sample. You can draw many samples from the same population and calculate their means, creating a sampling distribution of the mean. According to the central limit theorem, this sampling distribution will be approximately normal when the samples are large enough, irrespective of whether the population itself is normally distributed or skewed. The average of the sample means will also approximate the mean of the population.
On the other hand, the variance of the sampling distribution depends on the sample size: it equals the population variance divided by the sample size. The larger the sample, the lower this variance, and the closer a sample mean is likely to be to the population mean. Therefore, the accuracy of your statistical results depends on the sample size.
As a rule of thumb, a sample size of at least 30 observations is needed for the central limit theorem to hold. However, for a population with a normal distribution, the theorem will hold with even a smaller sample size.
The central limit theorem enables researchers to conduct tests and make inferences about the population using the normal distribution even when the population is not normally distributed.
The standard error is the standard deviation of a sampling distribution.
It is obtained by taking the square root of the variance divided by the sample size OR dividing the standard deviation by the square root of the sample size.
The standard error shows how the means of the various samples extracted vary.
The standard error is an important estimate and is usually reported in most statistical outputs. It is an indicator of how well a researcher was able to approximate the true population mean.
The standard error will reduce with an increase in the sample size. This is because larger samples result in better approximation of the population.
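A small simulation can make the central limit theorem and the standard error more concrete. Everything in it is an assumption chosen for illustration: a right-skewed population of 10,000 values, samples of size 30, and 2,000 repeated samples.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical right-skewed population of 10,000 values.
population = [rng.expovariate(1.0) for _ in range(10_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 30               # size of each sample
num_samples = 2_000  # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [statistics.fmean(rng.sample(population, n))
                for _ in range(num_samples)]

mean_of_means = statistics.fmean(sample_means)   # close to pop_mean
observed_se = statistics.stdev(sample_means)     # spread of the sample means
theoretical_se = pop_sd / math.sqrt(n)           # standard deviation / sqrt(n)

print(round(pop_mean, 3), round(mean_of_means, 3))
print(round(theoretical_se, 3), round(observed_se, 3))
```

Although the population is skewed, the sample means cluster around the population mean, and their spread is close to the standard deviation divided by the square root of the sample size. Increasing n in the sketch should shrink both standard error figures.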
Estimators and estimates
There are two types of estimates: point estimate and confidence interval.
Point estimates are single numbers, e.g. the sample mean is an estimate of the population mean, and the sample variance is an estimate of the population variance.
Point estimators have two desirable properties: efficiency and unbiasedness.
Efficient estimators are those with the least variability, while unbiased estimators are those with expected values equal to the population parameters.
Researchers should aim for the most efficient and unbiased estimators.
Unlike the point estimates, the confidence intervals are intervals, with a lower value and an upper value.
The confidence interval is related to the point estimate in that the point estimate lies in the middle of the confidence interval (for symmetric intervals such as those for a mean).
A confidence interval is a better estimate than a point estimate because it allows for some uncertainty that comes with statistics. A researcher cannot be 100 percent certain of the results obtained.
A confidence interval is constructed at a chosen confidence level, which is given by 1 - alpha, where alpha is the significance level. The confidence level lies between 0 and 1, but the most common confidence levels used in research are 90%, 95% and 99%, with alpha values of 10%, 5% and 1%, respectively.
A researcher who uses a 95% confidence level expects that, if the study were repeated many times, the intervals constructed would contain the true population parameter in 95% of the cases.
There is a tradeoff between the confidence level chosen and the confidence interval. A higher confidence level will result in a broader confidence interval which implies a lower estimate precision. The opposite is true: a lower confidence level will result in a narrower confidence interval which implies a higher estimate precision.
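The tradeoff can be seen by computing intervals at the three common confidence levels. The sketch assumes a large sample so that the normal distribution can be used for the critical value, and the summary statistics are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical summary statistics from a sample of household incomes.
sample_mean = 950.0
sample_sd = 200.0
n = 100
standard_error = sample_sd / math.sqrt(n)   # 20.0

def confidence_interval(confidence_level):
    """Normal-approximation interval: mean +/- z * standard error."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

intervals = {level: confidence_interval(level) for level in (0.90, 0.95, 0.99)}
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%}: {lower:.1f} to {upper:.1f}")
```

The 99% interval comes out widest and the 90% interval narrowest, while the point estimate of 950 sits in the middle of all three. For small samples, a t-distribution critical value would normally replace z.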
Hypothesis testing is fundamental to statistical tests.
A hypothesis is an idea that the researcher has and wants to test to see whether it is true or not.
There are two types of hypotheses:
Null and alternative hypotheses
The null hypothesis is what is tested, and the alternative hypothesis is everything else.
How a researcher frames the hypothesis will determine what types of tests will be carried out.
Hypothesis tests can either be one-tailed or two-tailed tests.
An example of a one-tailed hypothesis test is shown below:
Null hypothesis: the mean monthly household income in country A is greater than or equal to $1,000
Alternative hypothesis: the mean monthly household income in country A is less than $1,000.
An example of a two-tailed hypothesis test is shown below:
Null hypothesis: the mean monthly income in country A is $1,000
Alternative hypothesis: the mean monthly income in country A is not equal to $1,000
The researcher then conducts the appropriate tests and, depending on the results, either rejects the null hypothesis or fails to reject it.
Hypothesis testing is done using a selected significance level, denoted by alpha, which is the probability of rejecting the null hypothesis when it is in fact true.
Common alpha values include 0.01, 0.05 and 0.1, with 0.05 being the most common.
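The income hypotheses above can be tested along the following lines. This is a minimal sketch of a large-sample z-test; the sample mean, standard deviation and sample size are hypothetical, and other tests (such as the t-test) may be more appropriate for small samples.

```python
import math
from statistics import NormalDist

# Hypothetical sample summary for country A.
hypothesised_mean = 1000.0   # value stated in the null hypothesis
sample_mean = 950.0
sample_sd = 200.0
n = 100
alpha = 0.05

standard_error = sample_sd / math.sqrt(n)                 # 20.0
z = (sample_mean - hypothesised_mean) / standard_error    # -2.5

# Two-tailed test: the alternative is "not equal to $1,000".
p_two_tailed = 2 * NormalDist().cdf(-abs(z))
# One-tailed test: the alternative is "less than $1,000".
p_one_tailed = NormalDist().cdf(z)

reject_two_tailed = p_two_tailed < alpha
reject_one_tailed = p_one_tailed < alpha

print(round(z, 2), round(p_two_tailed, 4), round(p_one_tailed, 4))
print(reject_two_tailed, reject_one_tailed)
```

The null hypothesis is rejected when the p-value falls below alpha. With these invented numbers both tests reject it at the 0.05 level, but the decision could differ with other data.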
Type I and Type II errors
There are two types of errors that a researcher can make when conducting hypothesis tests: type I and type II errors.
Type I error is also referred to as a false positive, and is the error a researcher makes when rejecting a true null hypothesis.
The probability of making this error is denoted by the alpha, and is therefore under the control of the researcher.
Type II error is the error made by a researcher who fails to reject a false null hypothesis, and is also referred to as a false negative.
The probability of making a type II error is denoted by beta, which depends on, among other things, the sample size.
The probability of rejecting a false null hypothesis is denoted by 1-beta and is the goal of the researcher. 1-beta is therefore called the power of the test. The power of the test can be improved by increasing the sample size.
In a single test, it is not possible to make both type I and type II errors at the same time, because the null hypothesis is either true or false.
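The link between sample size, beta and the power of the test can be sketched for a one-tailed z-test. The function name and all the numbers are assumptions for illustration: the null hypothesis states a mean of $1,000 while the true mean is taken to be $950.

```python
import math
from statistics import NormalDist

def power_one_tailed(true_mean, null_mean, sd, n, alpha=0.05):
    """Power of a lower-tailed z-test (alternative: mean < null_mean)."""
    standard_error = sd / math.sqrt(n)
    # Reject the null when the sample mean falls below this cut-off.
    cutoff = null_mean + NormalDist().inv_cdf(alpha) * standard_error
    # Probability of landing below the cut-off when the true mean applies.
    return NormalDist(mu=true_mean, sigma=standard_error).cdf(cutoff)

# Hypothetical values: the null says $1,000 but the true mean is $950.
powers = {n: power_one_tailed(950, 1000, 200, n) for n in (25, 100, 400)}
for n, power in powers.items():
    print(n, round(power, 3), round(1 - power, 3))  # n, power, beta
```

Power rises and beta falls as the sample size grows. If the true mean is set equal to the null value, the function returns alpha, which is the probability of a type I error.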
In this article, I have touched on the very basics of statistics as a way of introducing common statistical concepts to students who have no background in statistics. Understanding statistics is especially important for students who plan to use quantitative research methods in their theses and dissertations.