Once you have designed and administered a questionnaire, the next step would be to enter the raw data in a data management program such as Microsoft Excel, SPSS, and STATA. There are many more other programs that can be used depending on one’s preference and field of study.
However, before entering the data, it is good practice to create a codebook for the raw data. There are two ways of going about this:
- The first would be to create the codebook in MS Excel and then import the data into another program where the data analysis will be done (e.g. SPSS, STATA).
- The other method would be to create the codebook directly into the program that will be used for the data analysis.
The advantage of creating the codebook and entering the raw data into Excel before exporting to other programs is that it is easier to clean and manipulate data in Excel than it is in the other programs. Excel is also a powerful tool for preliminary data analysis and data visualisation.
This post is a demonstration of how to create a codebook in MS Excel. The post will use a fictitious questionnaire (the sample questionnaire below) for the practical step-by-step instructions.
In Excel, the columns represent the variables, while the rows represent the observations (respondents to the questionnaire). It is therefore important to assign a unique identification number to each filled questionnaire before data entry begins.
When writing the variable names, you can either use the question number or short form of the question. For example, “what is your gender” can be entered as “gender” as the variable name. The latter (short form of the question) is preferred because it is easier for readers to know what the variable is about without referring back to the questionnaire.
For the sample questionnaire above, the column headers will look like this:
Each question represents a specific type of data and level of measurement, which determines how it is coded.
The section below explains the variable names used for the sample questionnaire above.
This is the first column and represents the unique number assigned to each filled questionnaire.
Question 1: named simply as “age.”
Question 2: named simply as “gender.”
Question 3: named simply as “education.”
Question 4: named as “biz_started.”
Question 5: named as “employees.”
Question 6: named as “family_biz.”
Question 7: named as “biz_category.”
Question 8: named as “revenues_2019.”
Question 9: names as “covid_pos_effect”
Question 10: names as “covid_neg_effect
Questions 11 and 12 are a bit different from question 1-10 because of the responses provided is a variable on its own, hence each of them will have its own column.
Question 11: the responses take a yes/no form; yes for the responses which are applicable to the respondent and no for the responses which are not applicable.
Question 12: this is a ranking question where the respondent is asked to rank the four measures in order of their preference, from 1 (most preferred) to 4 (least preferred).
After specifying the variable names in the column headers, the researcher (or research assistants) will now add the raw data for each questionnaire in rows; each questionnaire on its own row. The sheet with the raw data is called the “data view.”
To create the codebook, the researcher will open another sheet called “variable view” where all the codes for the responses will be generated.
Creating the codebook
Open the “variable view” sheet.
In the first column, insert the names of the variables (the names in the first row of “data view”).
In the second column, insert the questions as they are in the questionnaire.
In column three, include the levels of measurement of the variables, that is, whether they are nominal, ordinal, interval or ratio. This is important for coding and data analysis.
Lastly, in the fourth column insert the codes for each variable, where applicable. Some variables do not need codes because they are left open for the respondent to fill.
The codebook for the sample questionnaire is shown below:
As seen in the image above, questions 1, 4, 5, 8, and 12 were not coded because of the nature of their data.
The variable “gender” was coded 1 for male, and 2 for female. It does not matter what codes one uses because gender is a nominal variable and therefore there is no meaningful order in the codes used. This applies to the variables on family business, business category and effects of covid-19 on business.
However, the variable “education” is an ordinal variable therefore the codes used must have a meaningful order. A person moves from pre-primary to primary to secondary and tertiary levels of education. Therefore the code used for pre-primary education should be the lowest and the code for tertiary education should be the highest.
It is best practice to create the codebook before the questionnaire is administered because it will help the researcher to think through the codes that will be used, and how the data will be measured and analysed. Additionally, it will help the researcher recognise any responses that would be problematic to code and analyse and therefore give him a chance to correct the mistakes before the data collection is undertaken.
When entering data in Excel, one can either enter the values (codes) or the variable labels so long as one knows what the values represent. However, when it comes to data analysis, only the values can be analysed.
The next two articles will provide instructions on how to use the data entered into Excel and export it to other statistical programs such as SPSS and STATA as well as how to directly code and enter data into the SPSS and STATA.