In the previous post, we looked at simple linear regression in SPSS, where only one explanatory variable was considered. In real-life scenarios, however, a phenomenon is usually explained by more than one variable. In this post, we will discuss multiple linear regression in SPSS, in which two or more explanatory variables are included in the model.
Assumptions of linear regression
Before conducting a multiple linear regression analysis in SPSS, it is important to first examine whether the dataset to be used meets or violates the assumptions of linear regression. Checking these assumptions helps ensure that the estimates from the regression are unbiased and reliable. The assumptions include:
No multicollinearity
This assumption states that the independent variables should not be too highly correlated with one another.
A correlation coefficient above .70 is often used as a cutoff to indicate the presence of multicollinearity.
When two explanatory variables are highly correlated, it becomes difficult to isolate the individual effect of each variable on the dependent variable.
Multicollinearity can be checked using Tolerance and Variance Inflation Factor (VIF). Tolerance indicates how much of the variability of an independent variable is not explained by the other independent variables, whereas the VIF indicates how much of the variability of an independent variable is explained by the other independent variables.
Tolerance values of less than .10 and VIF values of more than 10 indicate the presence of multicollinearity.
When multicollinearity is detected, it is advisable to drop one of the highly correlated variables.
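For readers who want to verify these diagnostics outside SPSS, here is a minimal sketch in Python using statsmodels. The file name ("fertility.csv") and the column names ("educ_years", "age_first_birth", "num_children") are hypothetical stand-ins for the dataset and variables used in this post.

```python
# Minimal sketch of the Tolerance/VIF check using statsmodels.
# File and column names are hypothetical stand-ins for the data used here.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("fertility.csv")
X = sm.add_constant(df[["educ_years", "age_first_birth"]])  # predictors plus intercept

for i, name in enumerate(X.columns):
    if name == "const":
        continue                                   # skip the intercept column
    vif = variance_inflation_factor(X.values, i)   # VIF for this predictor
    tolerance = 1.0 / vif                          # Tolerance is simply 1 / VIF
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {tolerance:.2f}")
```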
Linearity
A second assumption of linear regression is that the dependent and independent variables should be linearly related.
That is, when the dependent variable is plotted against each of the independent variables, the points should follow an approximately straight line, indicating a constant change in the dependent variable for every one-unit change in the independent variable.
Linearity (or non-linearity) can easily be detected using scatter plots or Normal Probability (P-P) plots.
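A quick way to eyeball linearity outside SPSS, reusing the hypothetical `df` from the sketch above, might look like this:

```python
# Scatter plots of the dependent variable against each predictor,
# reusing the hypothetical `df` loaded in the previous sketch.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["educ_years", "age_first_birth"]):
    ax.scatter(df[col], df["num_children"], s=10, alpha=0.5)  # should look roughly linear
    ax.set_xlabel(col)
    ax.set_ylabel("num_children")
plt.tight_layout()
plt.show()
```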
Independence, homoscedasticity and normality of errors (residuals)
Independence of errors means that the error term of one observation should not be correlated with the error term of another observation.
Homoscedasticity means that the variance of the error terms should be constant.
When the variance of the error terms is not constant, it is referred to as heteroscedasticity.
Normality of errors means that the distribution of the errors should be normal (bell-shaped).
How to run a multiple linear regression in SPSS
For this illustration, we will use a woman's fertility (measured as the number of living children) as the dependent variable, and her education and age at first birth as the two independent variables. Hence we will examine whether and how fertility is predicted by a woman's education and age at first birth. From theory, we would expect fertility to decline with higher levels of education and with a higher age at first birth. Hence the relationship between fertility and each of the two independent variables is expected to be negative.
To run a multiple linear regression in SPSS:
- Go to the Analyze menu > Regression > Linear


- From the Linear Regression dialogue box, select the dependent variable and move it to the Dependent: box
- Then select the independent variables and move them to the Independent(s): box
- In the Method section, ensure that Enter is selected.
- Click the Statistics tab and, in the Linear Regression: Statistics dialogue box, check the Estimates, Confidence intervals, Model fit, Descriptives and Collinearity diagnostics options (the last one produces the Tolerance and VIF values discussed later).
- Click Continue.


- Next, click the Plots tab in the Linear Regression dialogue box.
- To plot the Normal Probability (P-P) plot of the residuals, check the “Normal probability plot” box. To obtain the residuals scatterplot, select ZRESID and move it to the Y: box, and ZPRED and move it to the X: box.
- Click Continue.
- Next, click the Save tab in the Linear Regression dialogue box and ensure that Mahalanobis and Cook’s are selected under the Distances section.
- Click Continue then OK.
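For readers who want to reproduce the same standard (Enter-method) regression outside the SPSS dialogues, here is a minimal sketch in Python with statsmodels; as before, the file and column names are hypothetical stand-ins for the dataset used here, and later sketches reuse the `df` and `results` objects defined below.

```python
# Minimal sketch of the same standard ("Enter") multiple regression in
# Python with statsmodels. File and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fertility.csv")
y = df["num_children"]                                       # dependent variable
X = sm.add_constant(df[["educ_years", "age_first_birth"]])   # both predictors entered at once

results = sm.OLS(y, X).fit()
print(results.summary())   # coefficients, standard errors, R-square, ANOVA F, etc.
```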


How to interpret the results of a multiple linear regression in SPSS
The results of a multiple linear regression in SPSS will have several outputs:
Descriptive statistics
The first output is the descriptive statistics which shows the mean, standard deviation and the sample size of all the variables entered in the model.

Correlations
The next output is the correlations among all the variables entered in the model.
The correlation coefficients will show both the magnitude and direction of the relationship between the variables.

The correlations matrix above shows that education in single years is more strongly correlated with the number of living children than age of respondent at first birth is.
It also shows that both independent variables are negatively (inversely) related to the dependent variable, as predicted earlier.
The correlations matrix also shows the correlation between the two independent variables. Given that this correlation is low (.292), one would not worry about multicollinearity issues.
Variables entered or removed
This output shows which variables were entered in the model and the method used.

In this illustration, the Enter method was used.
The variables entered were age of respondent at first birth and education in single years.
It also shows the dependent variable, which in this illustration is the number of living children.
The output would look different if other multiple regression methods were used, such as stepwise or hierarchical regression.
Model summary
The model summary output displays: the R (multiple correlation coefficient), the R-square, the Adjusted R-square, and the Standard Error of the estimate.

The R indicates the correlation between the dependent variable and the combined independent variables. It varies between 0 and 1, with higher values indicating a stronger relationship between the dependent variable and the combined independent variables.
The R-square is obtained by squaring the R and is more commonly reported. It indicates the proportion (or percentage) of variability in the dependent variable that is explained by the set of independent variables.
In the above illustration, education in single years and age of respondent at first birth explain 14.6 percent (.146 x 100) of the variability in a woman’s fertility.
The Adjusted R-square corrects the R-square for the number of independent variables and the sample size. It is especially useful in multiple regression when there are many independent variables, some of which make no meaningful contribution to explaining the dependent variable; in that case, the Adjusted R-square will be noticeably lower than the R-square.
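To see how the two statistics are related, the Adjusted R-square can be recomputed from the R-square, the sample size n and the number of predictors k; the sample size below is only a placeholder, not the actual n of this model.

```python
# Adjusted R-square recomputed from the R-square, the sample size n and
# the number of predictors k. The sample size used here is a placeholder.
def adjusted_r_square(r_square, n, k):
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(adjusted_r_square(r_square=0.146, n=1000, k=2))  # slightly below .146
```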
ANOVA
ANOVA, which stands for analysis of variance, is used to test the overall fit of the regression model, that is, whether the independent variables explain a significant proportion of the variance in the dependent variable compared to a model containing only the mean of the dependent variable.
The statistic used in the ANOVA is the F-statistic, which tests the null hypothesis that the coefficients of all the independent variables are zero and that the independent variables therefore do not explain the dependent variable.
If the F-statistic is statistically significant (as indicated by the Sig. value), the null hypothesis is rejected.

In the above illustration, the F-statistic is statistically significant, hence we reject the null hypothesis. This implies that the two independent variables together explain a statistically significant proportion of the variation in the dependent variable.
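For readers curious about where the F-statistic comes from, it can be reconstructed from the R-square, the sample size n and the number of predictors k, with the Sig. value being the upper-tail probability of the F distribution; the sample size below is again only a placeholder.

```python
# The ANOVA F-statistic reconstructed from the R-square; the Sig. value is
# the upper-tail probability of the F distribution with k and n-k-1 degrees
# of freedom. The sample size n is a placeholder.
from scipy import stats

r_square, n, k = 0.146, 1000, 2
F = (r_square / k) / ((1 - r_square) / (n - k - 1))
sig = stats.f.sf(F, k, n - k - 1)   # upper-tail probability = Sig.
print(F, sig)
```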
Coefficients
The coefficients output provides several useful pieces of information: the magnitude and sign of the regression coefficients, the standard errors of the coefficients, the standardized coefficients (which are useful for showing which of the independent variables has the largest impact on the dependent variable), the t statistics, and the statistical significance of the coefficients.

The unstandardized coefficients use the original units of measurement of the independent variables, for instance, education in single years and age at first birth in years. The standardized coefficients, on the other hand, are expressed in standard deviation units. Hence, the interpretation differs between the unstandardized and standardized coefficients.
In the above illustration, the unstandardized coefficients can be interpreted as follows:
An increase in the age of respondent at 1st birth by 1 year leads to a reduction in the number of living children of the respondent by .073, holding education constant.
An increase in a respondent’s education by 1 year leads to a reduction in the number of living children of the respondent by .160, holding age at first birth constant.
The standardized coefficients can be interpreted as follows:
An increase in the age of respondent at 1st birth by 1 standard deviation leads to a reduction in the number of living children by .129 standard deviations.
An increase in a respondent’s education by 1 standard deviation leads to a reduction in the number of living children by .325 standard deviations.
The advantage of using the standardized coefficients is that one can easily tell which of the independent variables has a bigger impact on the dependent variable. In the above illustration, education has a bigger effect on the number of living children than age at 1st birth.
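The conversion from unstandardized to standardized coefficients is a simple rescaling by the standard deviations of the predictor and of the dependent variable, as in this sketch (reusing the hypothetical `df` and fitted `results` from the earlier sketches):

```python
# Standardized (Beta) coefficients obtained by rescaling the unstandardized
# ones, reusing the hypothetical `df` and fitted `results` from above.
for name in ["educ_years", "age_first_birth"]:
    b = results.params[name]                                  # unstandardized coefficient
    beta = b * df[name].std() / df["num_children"].std()      # rescale by standard deviations
    print(f"{name}: b = {b:.3f}, Beta = {beta:.3f}")
```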
The t statistic shows how significantly different from zero a coefficient is. When the t statistic is small in absolute terms, the coefficient is likely not significantly different from zero, hence statistically insignificant.
The t values are obtained by dividing the unstandardized coefficients by their respective standard errors.
The Sig. values also show whether the coefficients are statistically significant or not.
They are usually compared against a significance level, with the most widely used significance level being 0.05.
If the Sig. value is less than .05, the coefficient is statistically significant; if it is greater than .05, the coefficient is not statistically significant.
One can also use the 95% confidence intervals to confirm the statistical significance of the coefficients: if the interval between the lower bound and the upper bound does not contain zero, the coefficient is statistically significant at the 5% level.
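The t statistic, Sig. value and confidence interval all follow from a coefficient and its standard error, as in this sketch; the standard error and residual degrees of freedom below are placeholders, not the actual output of this model.

```python
# How the t statistic, two-tailed Sig. value and 95% confidence interval
# are derived from a coefficient and its standard error. The standard error
# and residual degrees of freedom are placeholders.
from scipy import stats

b, se, df_resid = -0.160, 0.010, 997
t = b / se                                  # t = coefficient / standard error
sig = 2 * stats.t.sf(abs(t), df_resid)      # two-tailed Sig. value
t_crit = stats.t.ppf(0.975, df_resid)       # critical t for a 95% interval
lower, upper = b - t_crit * se, b + t_crit * se
print(t, sig, (lower, upper))               # significant if the interval excludes zero
```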
The coefficients output also shows the Tolerance and VIF values, which are used to check for multicollinearity.
In the above illustration, both Tolerance values are greater than .10 and both VIF values are less than 10. Hence, we can confidently conclude that there is no multicollinearity between the two independent variables.
Checking for linearity, normality, homoscedasticity and independence of residuals
These assumptions can be checked using the Normal Probability (P-P) plot and the scatterplot of the standardized residuals requested earlier.


In the Normal P-P plot, the points should lie close to the diagonal line, indicating normality of the residuals.
In the scatterplot, the residuals should ideally form a roughly rectangular distribution, with the majority of the scores concentrated in the centre (around the zero line). In the scatterplot above, there is evidence of non-linearity of the residuals, since they form an obvious pattern with one side higher and the other side lower.
The scatterplot also shows whether there are outliers in the dataset. Points with a standardized residual of more than 3.3 or less than -3.3 are considered outliers, which is the case in the scatterplot above.
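Rough Python counterparts of these diagnostics, reusing the fitted `results` from the earlier sketch, might look as follows; the Durbin-Watson statistic is added as a simple check of the independence-of-errors assumption.

```python
# Residual diagnostics: a Normal P-P plot of the standardized residuals, the
# ZRESID-vs-ZPRED scatterplot, and the Durbin-Watson statistic (independence
# of errors). Reuses the fitted `results` from the earlier sketch.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

zresid = (results.resid - results.resid.mean()) / results.resid.std()
zpred = (results.fittedvalues - results.fittedvalues.mean()) / results.fittedvalues.std()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Normal P-P plot: observed vs expected cumulative probabilities of the residuals
sorted_z = np.sort(zresid)
observed = np.arange(1, len(sorted_z) + 1) / (len(sorted_z) + 1)
expected = stats.norm.cdf(sorted_z)
ax1.plot(expected, observed, ".", markersize=3)
ax1.plot([0, 1], [0, 1])                     # points should hug this diagonal
ax1.set_xlabel("Expected cumulative probability")
ax1.set_ylabel("Observed cumulative probability")

# ZRESID vs ZPRED: should be a patternless band around zero, with
# standardized residuals beyond +/-3.3 flagged as outliers
ax2.scatter(zpred, zresid, s=10, alpha=0.5)
ax2.axhline(0, linewidth=1)
ax2.set_xlabel("Standardized predicted value")
ax2.set_ylabel("Standardized residual")
plt.tight_layout()
plt.show()

print("Durbin-Watson:", durbin_watson(results.resid))  # values near 2 suggest independent errors
```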
It is important to check the assumptions of linear regression and remedy any problems that may exist before running the multiple linear regression in SPSS or any other software.
Conclusion
Multiple linear regression is a powerful statistical method that is used to predict an outcome (dependent variable) from a set of independent variables. While there are different types of multiple regression methods, this post has illustrated how to run the standard multiple linear regression in SPSS and the key assumptions that underlie it.