R: Simple Linear Modeling

11/22/2014: heavily revised to facilitate interpretation and writing up your own reports.

The basic syntax for evaluating a linear relationship between two variables in R is lm(y~x). It is a linear model because we are making a lot of assumptions as we evaluate these variables, such as which is x (independent) and which is y (dependent).

A more compact way to run linear models in R is to have the software write the results into a named dataset, which I will generically call “LM_output”. The basic syntax is therefore:
LM_output = lm(y~x)

However, I will not use a generic name for the output dataset. Instead, I will name the dataset something that indicates what it contains. In this case, I will be analyzing the relationship between the proportion of highly-educated people in each Tract and the per-capita income of that Tract; so I will name the output data as: PCI_by_prpHiEd. Again, I am sticking to the “markers-along-the-trail” principle of creating names that help us retrace the meaning of what we are doing. Here is the syntax of the linear model to evaluate the relationship between per capita income (as an outcome) and the proportion of highly educated people (as the independent variable):

PCI_by_prpHiEd = lm(CO$PerCap_inc~CO$prpHiEdu)

In R-Studio, the new dataset PCI_by_prpHiEd appears in the Workspace Pane (upper right) below CO. To see the contents of this dataset, type:


R returns the following feedback (I have added line-numbers in brackets on the left):

[1]  Call:
[2]  lm(formula = CO$PerCap_inc ~ CO$prpHiEdu)

[3]  Residuals:
[4]    Min     1Q Median     3Q    Max 
[5]  -19338  -4381  -1054   3424  28534 

[6]  Coefficients:
[7]              Estimate Std. Error t value Pr(>|t|)    
[8]  (Intercept)    12391       1026   12.07   <2e-16 ***
[9]  CO$prpHiEdu    77555       2594   29.89   <2e-16 ***
[10] Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

[11] Residual standard error: 7553 on 205 degrees of freedom
[12] Multiple R-squared:  0.8134,    Adjusted R-squared:  0.8125 
[13] F-statistic: 893.7 on 1 and 205 DF,  p-value: < 2.2e-16

Interpreting the output, part 1: the formula for the line-of-estimation

What does this output mean? I am going to begin with lines 6 through 9 of the output, which describe the formula for the line-of-estimation in this linear model.
The basic formula for a Simple Linear Regression Model is: y = a + b(x) + e, where:
a = the “y-intercept”,
b = the slope of the best-fit line, and
e = the error term: in other words, the residual error when actual observations are compared to the line-of-estimation.
In this linear model output, a = 12391 (on line 8); b = 77555 (on line 9); and the standard error of the estimate for y = 1026. I will also discuss the standard error later; but right now I want to explain how to interpret the last two lines of output: lines 12 and 13.

The interpretation this data is a little abstract. Since I am using proportions (not percent), it means that one unit of change in x means a change from 0.0 (0%) to 1.0 (100%), in which case the estimated per-capita income rises from $12,391 when x = 0.0 to $89,946 when x = 1.0. I will re-run this analysis at the bottom of the page using percentages rather than proportions. But for now I will analyze this less-than-ideal model. In practice, this is a clearer example of how we actually work, so bear with me.

Interpreting the output, part 2: strength of relationship

Line 12 of the output above gives the multiple- and adjusted R-squared.
[Why these two numbers? Here is the explanation, but don’t let it derail you from tracking the basic concept. In simple linear regression models, multiple R2 and adjusted R2 are nearly identical, because y is being evaluated against only one x variable. In multiple regression, y is compared to multiple x-variables, such as: “how much is per capita income affected by levels of education, proportion of various races, and population density?” In such a multiple-regression model, the original formula for R2 goes awry; so adjusted R2 was designed to compensate for multiple variables.]

We will use the adjusted R-squared.
R-squared is in fact the square of Pearson’s R. And like Pearson’s R, R-squared indicates the strength of the relationship between these two variables.
In graphic terms, how tightly do the observed data points cluster around the line-of-estimation?
In analytical terms, how much does a change in x explain a change in y?
There is a trade-off between Pearson’s-R and R-squared: on the one hand, Pearson’s -R indicates both the strength and the direction of the relationship, because it can be either negative or positive. On the other hand, R-squared gives a direct, intelligible indication of the strength of the relationship. Here is the interpretation:

In the model-output I am showing above, (adjusted) R-squared = 0.8125, which means that 81.25% of the change in y is explained by a change in x.
So R-squared is a very useful number for explaining the meaning of your linear model in plain language.

For your future reference, here is the formula for calculating R-squared, written in equation-syntax:
R-squared = (cov(x,y)^2) / var(x) * var(y)
And the equation for Pearson’s R, written in the same syntax:
R = cov(x,y) / sd(x) * sd(y)
And in case you were wondering, this is how you would calculate covariance “by hand”:
cov(x,y) = sum((x – mean(x)) * (y – mean(y))) / (number-of-observations-minus-one)

Interpreting the output, part 3: p-value and the hypothesis-test

In bivariate analysis, the null-hypothesis is: “There is no relationship between the two variables.” A change in x has no meaningful effect on the y-values. Graphically, a scatterplot of two unrelated variables usually looks like a random cloud.

If the scatterplot does show a definite shape, then there is some sort of relationship. And if the plotted observations tend to clump along a straight line, then there is a linear relationship. When I reviewed the scatterplots of a number of different bivariate relations (see previous webpage), I found that per capita income had a strong and linear correlation with the proportion of higher-educated people in each tract. So that is the linear model I created. The null hypothesis, in this linear model, is that “there is no relationship between the relative proportion of highly-educated people and the per-capita income of people in Census-Tracts in Contra Costa County, California.”

I think I can reject is null-hypothesis. But how do I express my confidence that I can reject the null? I look at the p-value, which is short for probability-value, which means “the odds that this linear model I created happened to indicate a relationship between two variables when in fact there is no relationship between those two variables.” The last piece of output on line 13 (above) is the p-value for this linear model. R has calculated it as 2.2 to the negative-sixteenth power (2.2e-16 in shorthand). In plain numbers, this means you move the decimal place sixteen to the left; so the odds that I have mistakenly identified a relation (where in reality there is none) are 0.00000000000000022 (in pure proportions), or 0.000000000000022%. In other words, I can reject the null hypothesis with 99.999999999999988% confidence. That is why R gives it a three-star rating back on lines 8 and 9 of the output above. Here, I can confidently reject the null and state that this relationship actually exists. I still need to do some explanation of the mechanism of the relationship, because statistical analysis does not and cannot tell us why a particular relationship exists. In fact, simple linear modelling cannot even tell us whether we correctly identified which variable is the independent cause, and which is the dependent outcome. We need to use our own logical judgment–and literature research–to make this analysis socially meaningful.

Note that when Megan ran the same model for Humboldt County, the relationship was quite weak. The R-squared was a little over 0.50, and the p-value was about 0.21. Those numbers mean that relative concentration of highly-educated people in a tract only explains about 50% of the change in per capita income, and the odds that there is actually no meaningful relationship between PCI and education are about 21%, or one in five. I don’t want to exaggerate too much, but I imagine that there are quite a few hippies with Masters and PhDs from Humboldt State who are working in organic-food stores in Arcata making low wages, and a number of folks with only high-school diplomas or GEDs down in Eureka making good money working in the lumber mills.

The p-value is based on the probability that you would get the t-value, shown as the third column of data on lines [8] and [9] in the output (posted again below). Back in univariate analysis, we saw how observations cluster/distribute around the mean (expected single value) in a Gaussian distribution. When we normalize values (divide them by the standard deviation of that particular distribution), we get a z-score which tells us how far our observation is from the mean, in units of standard deviation. This also indicates probability: an observation which is two SD away from the mean is pretty unlikely. In a linear model, y-values scatter above and below the line-of-estimation in a similar way, but the tendency towards scatter is affected by the size of the data-set, so we use the t-distribution instead of the z-distribution to evaluate probability. When the dataset is large (above 30), the difference between a t-value and a z-value becomes puny. You could look up t-values, based on the degrees of freedom in your model, but R-software does this for you. The fourth column of data on lines [8] and [9] is the probability for the t-values of the y and x variables.

Interpreting the output, part 4: the standard error

To save you from scrolling back up, I am re-posting lines 6 through 9 of the model output here:

[6]  Coefficients:
[7]              Estimate Std. Error t value Pr(>|t|)    
[8]  (Intercept)    12391       1026   12.07   <2e-16 ***
[9]  CO$prpHiEdu    77555       2594   29.89   <2e-16 ***

The second column of numbers is the Standard Error of the Estimate, which is analogous to a standard deviation of the per cap income (the y variable). The difference is that the estimated value is a running estimate (a sloped line) between two variables, rather than a a fixed, singular estimated mean in the distribution of a single variable. In Simple Linear Models, we focus on the standard error of the y-value, which in this case is 1026. With this number, we can build a running confidence-interval, which is a “stripe of probability” above and below the line-of-estimation, running parallel with that estimate line. 68% of the observed values will fall within a “stripe” which one standard error above and below the line-of-estimation; more than 95% of the observed values will fall within a stripe which extends two standard-errors above and below the line-of-estimation.

Interpreting the output, part 5: plotting and evaluating residuals

[I will write up this section when I have time. Need to see to other obligations now]

And now, I re-run the linear model to make it even easier to interpret:

The example I gave above shows a strong relationship, but the meaning of the relationship is hard to interpret because I created the HiEdu variable as a pure proportion. Here is where Chris Bettinger is absolutely correct in his preference for percentages. I am going to create a new variable, pctHiEdu, by multiplying prpHiEdu by 100:

CO["pctHiEdu"] = CO$prpHiEdu * 100

Now, I am going to re-run the linear model and name the output dataset PCI_by_pctHiEd:

PCI_by_pctHiEd = lm(CO$PerCap_inc~CO$pctHiEdu)

This time, when I ask for a summary of the output dataset, the output is almost identical:

lm(formula = CO$PerCap_inc ~ CO$pctHiEdu)

   Min     1Q Median     3Q    Max 
-19338  -4381  -1054   3424  28534 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12391.23    1026.39   12.07   <2e-16 ***
CC$pctHiEdu   775.55      25.94   29.89   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7553 on 205 degrees of freedom
Multiple R-squared:  0.8134,    Adjusted R-squared:  0.8125 
F-statistic: 893.7 on 1 and 205 DF,  p-value: < 2.2e-16

The one difference is the slope of the line-of-estimation (b), which is now 775.55, rather than 77555. The Standard Error of the x-values is also shifted two decimal places. I have bolded both of these numbers and underlined the new (b).
Multiplying the proportion by 100 to get percent is a simple transformation, but it makes the analysis easier. Now we can make a plain-language explanation of what (b) means: “For every percentage-point increase in highly educated people in a census tract, the per capita income in that tract increases by $775.55 per year.”

This adjustment from proportions to percent makes the meaning of (b) more intuitive. But what does (a) mean?
(a) is the estimated per capita income (the y-value) when x = zero. In other words, it is an abstract projection of the line-of-estimation back to where it hits the y-axis. It is a necessary part of the formula for the line-of-estimation, but it is also a bit unreal. In plain language it means that “If there were a Census Tract in which zero people had higher degrees, the projected per-capita income in such a tract would be $12,391.23 per year.”