Bivariate linear transforms

This analysis is beyond the scope of what I taught in the fall of 2014, but it is worth considering how far we can push the Simple Linear Model. Oftentimes there is a definite relationship between two variables, but it is nonlinear because one or both of the variables are skewed. The classic example, in sociological and policy analysis, is the income-distribution of a geographic area. Given the “Free Market” myth that has governed the U.S. since Reagan, it should be no surprise that this frequency-distribution is badly skewed. In order to stick with the easier concepts and calculations of the Simple Linear Model, we can transform a skewed variable to reshape its distribution into something “more Gaussian” and closer to symmetrical in shape. In the page below I report my attempt to find a linear relationship between the proportion of African-Americans in tracts in Contra Costa County and income.

One outcome-variable (“y”) that we can consider is unemployment. We have two variables in our data that can be used to calculate the rate of unemployment in each Tract: not_Working and Work_Age. We could divide not_Working by Total, but then a high number of children (below 16) in the Tract would skew the ratio lower. Ideally, we should also exclude everyone above 65. Can you find the data to do this? It is worth 2 points in your semester grade.

Here is the command for generating an Unemployment variable in dataframe CO:

CO["unemp"] = CO$not_Working / CO$Work_Age

I also created a variable indicating relative proportion of higher-educated people per tract:

CO["hiEdu"] = CO$ED_BA + CO$ED_transBA
CO["prpHiEdu"] = CO$hiEdu / CO$Work_Age

Again, I used Work_Age as the denominator because it is a little unfair to ask how many children below 16 have a B.A. or higher. Also, after generating this variable in two steps, I realize I could have condensed the creation into one step:

CO["prpHiEdu"] = (CO$ED_BA + CO$ED_transBA) / CO$Work_Age
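If you want to try these commands without the real data, here is a minimal sketch on a made-up, three-tract version of CO. The column names come from the commands above; the numbers are invented purely for illustration and are not Census values.

```r
# Toy stand-in for the CO dataframe: three made-up tracts (illustrative values only).
CO <- data.frame(
  not_Working = c(120, 300, 80),
  Work_Age    = c(1500, 2400, 1000),
  ED_BA       = c(400, 300, 350),
  ED_transBA  = c(200, 100, 250)
)

# Same derived variables as in the text
CO["unemp"]    <- CO$not_Working / CO$Work_Age
CO["prpHiEdu"] <- (CO$ED_BA + CO$ED_transBA) / CO$Work_Age

# Both are proportions, so every value should fall between 0 and 1.
stopifnot(all(CO$unemp >= 0 & CO$unemp <= 1))
stopifnot(all(CO$prpHiEdu >= 0 & CO$prpHiEdu <= 1))

# Quick look at the range of each new variable
print(summary(CO[c("unemp", "prpHiEdu")]))
```

The two stopifnot() lines are a cheap sanity check: if a proportion comes out above 1, the denominator is probably the wrong column.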

Hopefully, as I show you this “thinking-out-loud” process, it helps reveal how to work with R.

Graphic analysis of bivariate relationships

Okay. Now I am going to poke around among my variables to see if I glimpse any significant relationships. Graphic plots are the best way to do this, because our brains see relationships in visual patterns very quickly, and with great precision. How do I know this? Because many Americans can distinguish a ’68 Camaro from a ’69 Camaro at a glance; others can discern the subtlest change in proportion of collar-widths and hem-lengths in clothing. So:

I suspect there is a relationship between proportion of African-Americans in Census-Tracts and proportion of unemployed. That would fit with a pattern of systematic discrimination.
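A scatterplot like this can be sketched in base R roughly as follows. This is not necessarily the exact command I ran: the data here are simulated, and prpBlack is an assumed column name for the proportion of African-Americans per tract, so the picture says nothing about the real county.

```r
set.seed(1)
# Simulated stand-in for CO: 50 made-up tracts, not Census data.
# prpBlack is an assumed column name (proportion African-American per tract).
CO <- data.frame(
  prpBlack = rbeta(50, 1, 10),
  unemp    = runif(50, min = 0.03, max = 0.15)
)

plot(CO$prpBlack, CO$unemp,
     xlab = "Proportion African-American (prpBlack)",
     ylab = "Proportion not working (unemp)",
     main = "Unemployment by proportion African-American")

# Dashed least-squares line as a visual guide
abline(lm(unemp ~ prpBlack, data = CO), lty = 2)

# Pearson correlation as a one-number summary of the scatter
r <- cor(CO$prpBlack, CO$unemp)
print(r)
```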



Hmm. I can read this two ways at a glance: either I am flat-out wrong, and the proportion of African-Americans per Tract has nothing to do with rates of unemployment in the Tract, or the proportion of African-Americans countywide (9%) may be too small to get a reliable measure this way.

What about Density by proportion of African-American? In “Five measures of segregation,” Massey and Denton (1988) point out that the reduced options for housing for African-Americans meant they tended to live in more densely-populated areas:


[Plot: 02_Density_by_prpBlack]

Hmm. Not a clear relationship either. Perhaps this particular aspect of residential segregation does not occur in Contra Costa, or again maybe the small proportion of African-Americans overall in the county is throwing off my reading.

What about per capita income?



Yes, that is showing a relationship. But it is not simple to evaluate: the proportion of African-Americans in the tract is inversely related to per capita income, and that relationship is not linear.

Rather than give up, this is where Blanchard’s analysis is meaningful. He did all sorts of transformations on his variables so that he could analyze their linear relationships. First, I am going to invert Per Capita Income and then take a look at the plot:

CO["invPCI"] = 1 / CO$PerCap_inc
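To see what the inversion does, here is a small sketch on simulated data. PerCap_inc comes from the command above; prpBlack is an assumed column name, and the relationship built into the fake data is made up for illustration only.

```r
set.seed(42)
# Simulated stand-in for CO: 200 made-up tracts, not Census data.
# prpBlack is an assumed column name; the income formula is invented.
prpBlack   <- pmin(rlnorm(200, meanlog = log(0.05), sdlog = 1), 1)
PerCap_inc <- 20000 * prpBlack^(-0.3) * exp(rnorm(200, sd = 0.25))
CO <- data.frame(prpBlack, PerCap_inc)

# Reciprocal of per capita income, as in the text
CO["invPCI"] <- 1 / CO$PerCap_inc

plot(CO$prpBlack, CO$invPCI,
     xlab = "Proportion African-American (prpBlack)",
     ylab = "1 / per capita income (invPCI)")
```

Inverting flips the direction of the relationship: tracts with high income get small values of invPCI, so an inverse relationship with income shows up as a rising cloud of points.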


Okay. Sort of a relationship. Before coffee this morning I thought maybe I should square both variables to clarify their relationship:



Whoa! No, wrong direction. Oh yes: income typically has a strong positive skew, and the way to correct for that is to take the log of income, or in this case, the log of the inverse of per capita income:


[Notice that for these quick glimpses of variable-relationships, I am doing some of the transformations on-the-fly rather than generating a whole new variable, such as logInvPCI.]
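An on-the-fly transformation just means putting the function call inside plot() instead of saving a new column. The sketch below uses simulated data (prpBlack is an assumed column name) and also checks a handy fact: the log of the inverse is simply the negative of the log, so this plot is the log-income plot flipped upside down.

```r
set.seed(42)
# Simulated stand-in for CO: 200 made-up tracts, not Census data.
# prpBlack is an assumed column name; the income formula is invented.
prpBlack   <- pmin(rlnorm(200, meanlog = log(0.05), sdlog = 1), 1)
PerCap_inc <- 20000 * prpBlack^(-0.3) * exp(rnorm(200, sd = 0.25))
CO <- data.frame(prpBlack, PerCap_inc)

# log(1/x) is the same as -log(x)
stopifnot(isTRUE(all.equal(log(1 / CO$PerCap_inc), -log(CO$PerCap_inc))))

# Transformation done inside the call; no logInvPCI column is created
plot(CO$prpBlack, log(1 / CO$PerCap_inc),
     xlab = "Proportion African-American (prpBlack)",
     ylab = "log(1 / per capita income)")

# CO still has only its two original columns
print(names(CO))
```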


Hmm. Not quite. Then I thought back on it, and realized that the distribution of the proportion of African-Americans in Census-Tracts is also uneven. So what if I log-transform both variables?
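One practical wrinkle when logging a proportion: any tract with zero African-American residents gives log(0), which is -Inf in R. The sketch below shows one way to handle that, by setting those tracts aside before plotting and fitting a line. It uses simulated data with two zero-tracts planted on purpose; prpBlack is an assumed column name, and dropping the zeros is my assumption for this illustration, not the only defensible choice.

```r
set.seed(7)
# Simulated stand-in for CO: 200 made-up tracts, two of them with prpBlack == 0.
# prpBlack is an assumed column name; the income formula is invented.
prpBlack   <- c(0, 0, pmin(rlnorm(198, meanlog = log(0.05), sdlog = 1), 1))
PerCap_inc <- 20000 * (prpBlack + 0.01)^(-0.3) * exp(rnorm(200, sd = 0.25))
CO <- data.frame(prpBlack, PerCap_inc)

# log(0) is -Inf, which lm() cannot use, so keep only tracts with prpBlack > 0
pos <- CO[CO$prpBlack > 0, ]

plot(log(pos$prpBlack), log(1 / pos$PerCap_inc),
     xlab = "log(prpBlack)",
     ylab = "log(1 / per capita income)")

# Simple linear model on the transformed variables, drawn as a dashed line
fit <- lm(log(1 / PerCap_inc) ~ log(prpBlack), data = pos)
abline(fit, lty = 2)
print(coef(fit))
```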



Okay. Clearly there is a relationship between these two variables. Here is the problem: although these are valid transformations, I have now lost a feeling for what these transformed variables mean. Yes, distribution of African-Americans across the county is uneven; if we plot a straight histogram of the frequency of tracts by the proportion of African-Americans in them, we get:
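A histogram like this takes one call to hist(). The sketch below draws the raw and the log-transformed versions side by side on simulated, deliberately right-skewed data (prpBlack is an assumed column name), and uses a small hypothetical helper, skew(), to put a number on the asymmetry.

```r
set.seed(3)
# Simulated stand-in for CO: 200 made-up tracts with a right-skewed proportion.
# prpBlack is an assumed column name; these are not Census values.
CO <- data.frame(prpBlack = pmin(rlnorm(200, meanlog = log(0.05), sdlog = 1), 1))

# Hypothetical helper: simple moment-based sample skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Raw and log-transformed histograms side by side
old_par <- par(mfrow = c(1, 2))
hist(CO$prpBlack, breaks = 20,
     main = "Raw proportion", xlab = "prpBlack")
hist(log(CO$prpBlack), breaks = 20,
     main = "Log-transformed", xlab = "log(prpBlack)")
par(old_par)

# Positive skew means a long right tail
print(c(raw = skew(CO$prpBlack), logged = skew(log(CO$prpBlack))))
```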



Yes, there is an extreme “positive” skew in this distribution, meaning that there are extremely few tracts with more than 20% African-American residents, and a LOT with fewer than 5%. So a log-transform of this variable is legitimate if we want to stick to an evaluation of linear relationships. But in the end, when I have hammered both variables into versions where we can clearly see the inverse relationship between proportion of African-Americans and per capita income by Tract, it becomes very hard to explain that relationship in plain language. So it is a valid analysis, but it is also beyond the scope of what I am teaching you this semester.
