A lot can be done with spreadsheet software like LibreOffice Calc and Google Sheets. With spreadsheets you can calculate measures of central tendency and dispersion. Most have excellent utilities for generating graphics. Even if you use R, you will still need to continue using spreadsheets to manage the translation of processed data into final reports: either slide presentations or word-processed documents. But at some point you may need to do more data-intensive analysis, and it becomes easier to do it in R (or other statistical software) than with spreadsheet software. This page is written to help you get a sense of how R-Studio works.
It is easiest to begin with a dataset that you find interesting. On previous pages I explain:
1.) How to get Census data at the County/Tract level.
2.) A convention for naming data in a format that is human-readable, machine-readable, and short.
3.) How to assemble that data into a “master” CSV that contains 41 different bits of information about each tract.
4.) How to import that file into R, calculate more variables in that dataset, and then export it again as a CSV.
I also explain how to manage some mistakes in R. I think these basic skills are necessary to avoid anxiety while using R.
But now that you have used R-Studio a little, it is time to learn more about the nature of this software, so you can use it with more confidence.
R is command-line
There is a good reason why we moved away from command-line interface (CLI) software in the 1980s and 1990s. That blinking cursor on a blank line might represent limitless possibilities to a computer programmer, but for most of us it represents limitless impossibilities, as in, “What do I do now???” Drop-down menus give us a clue about which commands we can run. Graphic menus also constrain our options, usually limiting us to the correct syntax for executing a command. For example, if you go to File > Save… in a graphic program, it will open up a dialog-box to help you find where you want to save a file. You could type out the same operation from a command-line, but if you make a typo, you may send the file to a location that does not exist. Depending upon the software you are using, you might or might not get an error-message, warning you that you are about to send your work into irretrievable oblivion.
Fortunately, R-Studio is a much better user-experience for several reasons.
First: R-Studio is a graphic front-end to R that enables you to use menu-driven commands for a lot of the basic operations like Save, Save As, Import, etc. R-Studio also makes visible some things that exist in R, but are invisible, like the History of every command that is entered.
Second: command-line operations are much easier now that we have the World Wide Web. If you are using software that a reasonable number of other people are also using, you can often solve problems like “What do I do now???” by typing your question into a search engine. Often, if someone else has solved the problem, they will also post the commands that worked. You can copy commands straight off a web-page and paste them onto the command-line of R and run them. Better yet, you can paste them onto a scratch sheet, adapt them to your circumstances, and then run them; when they fail due to some typo or omission, you can revise them and re-run them until they work. When they do work, you can save that scratch-sheet as a text file and use it later.
Third: if you work out a series of commands that need to be run multiple times, you can just highlight them all and then run them as a sequence. That is why the scratch-sheet is called a script sheet. Spreadsheet programs have a similar capability called macros. Worth learning, but the scripting capability in R-Studio is remarkably easy to use.
Syntax is half the battle in R: Some examples
On the page where I introduce a first session in R-Studio, I describe how to logically “build up” an R command.
You begin with the basic operation you are trying to do, such as dividing one variable by another.
Then you “wrap” that basic function with a series of specifications that tell R where to find the data, and also tell R where to store the data.
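The "build up" idea can be sketched on a toy dataframe. This is only an illustration: the three rows are made-up numbers, and the column name TotTr (total population per tract) is a hypothetical stand-in, not necessarily the name used in the real dataset.

```r
# Toy stand-in for the tract data. The values are invented, and TotTr is a
# hypothetical column name used only for this illustration.
CO <- data.frame(
  BlkTr = c(50, 200, 10),       # African-American residents per tract
  TotTr = c(1000, 4000, 2500)   # total population per tract (assumed name)
)

# Basic operation: divide one variable by another.
# Wrapping: "CO$" tells R where to find each variable, and
# "CO$prpBlkTr <-" tells R where to store the result.
CO$prpBlkTr <- CO$BlkTr / CO$TotTr

# Entering the new variable by itself prints the stored values.
CO$prpBlkTr
```

Read from the inside out: the division is the basic operation, and everything around it tells R where the inputs live and where the answer should go.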
Here I am going to show several simpler commands, for two reasons.
First, these examples show more of the basic syntax of R.
Second, they enable you to do some basic analysis on your data-set.
These examples are based on tract-level county data from the 2010 Census, which has been prepared in spreadsheet software and imported to R-Studio as a dataframe named “CO”. Other pages on this site explain how to create this dataset.
What is the range of the number of African-Americans in all tracts in this county?
Here is the way you would command R to do that calculation:
range(CO$BlkTr)
Interpretation of the syntax of the above command. First: the basic command for finding the range in R is range().
Note that in R documentation, commands are typically written as command().
Second: the stuff inside the parentheses is written as dataframe-dollarsign-variablename. With this syntax we are telling R to look within dataframe “CO” for variable “BlkTr” and then find the range of values within that variable. The answer is “returned” back on the command-line, just below the command you enter:
range(CO$BlkTr)
[1]    7 3703
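If you want to try the same syntax without the full dataset, here is a minimal sketch. The five values are a hand-picked subset chosen for illustration, so this small "CO" is an assumption and not the real dataframe.

```r
# A tiny stand-in dataframe with a single variable (illustrative values only).
CO <- data.frame(BlkTr = c(171, 447, 7, 3703, 263))

# range() returns two numbers: the smallest value, then the largest.
rng <- range(CO$BlkTr)
rng

# diff() turns those two numbers into the width of the range.
diff(rng)
```

Storing the result in a name like rng (my own choice of name) lets you reuse it in later commands instead of retyping the calculation.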
In many cases, R begins the answer with a bracketed number. This helps if R returns a lot of numbers on the command-line. Here is a command that will produce a lot of numbers, and show why the bracketed line-numbering is useful. Just enter a variable with no command in front of it, and by default R will “return” all the data within that variable:
CO$BlkTr
  [1]  171  447  263  434  716 1368  475  568  627 1442  412  707  140  274
 [15]  254   65   61   76  396  857  271  440  478  408  433  470 1163  388
 [29] 1132  818  667  915 1011  918  895 1468  943 1300  282  690  111  855
 [43]  676  824 1102  224  245  279   50   94  192   93   63  214  318   86
 [57]   59  160   85   31  175  186   98   26  272  151  171  130  198  150
 [71]  100  104  141  110  305   82   88   67  329  424  102  201   64  305
 [85]   98  247   54   37  134   80   15   48  155  147  127   60   56   51
 [99]   84   39  142   84   73   45  154  133  219   53   50  123   32  255
[113]   39    7   30   32   36   17   51   85   20   41   74   20   19    8
[127]   21  113   70   70   10   37   21   21   44  525 3036 1040  933 1331
[141]  115   90  221  426  101  156  811  144   40  127   50  496  758  126
[155]  901  908  837  504  737  446 1315  713  446  560  755  664  309 1502
[169]  701 1687 1720  575  833 2199  823  432  278 1465  466  412  757  772
[183]  463  847  554 1882 1600  250 2457 1599 2689 3703  511  397  153   79
[197]  468  228  140   89  187   94   32   65   59 1928  623
Structure of the output above. In the 2010 Decennial Census, there were 207 populated tracts in Contra Costa County. The number of African-Americans in each tract is loaded into R as one column of data within the dataframe “CO”. R has just “returned” all of that data in the console-window, just below the command; it is 207 different numbers. The first number on the first line is 171; and R notes that this is the first number by putting [1] next to it. On the last line (the fifteenth line), the first number is 468. That is the 197th number in a variable containing 207 different numbers. R indicates that this first number on the 15th line is the #197 within the variable by putting a [197] next to it.
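The bracketed numbers are positions, and you can use the same square brackets to ask R for the value at a given position. The sketch below types in the first 20 values from the output above by hand; the name blk_first20 is my own invention for this example.

```r
# First 20 values of the printed output, typed in by hand for illustration.
blk_first20 <- c(171, 447, 263, 434, 716, 1368, 475, 568, 627, 1442,
                 412, 707, 140, 274, 254, 65, 61, 76, 396, 857)

# Square brackets pick out one value by its position in the vector.
blk_first20[1]    # the value printed next to [1]
blk_first20[15]   # the value printed next to [15]

# length() reports how many values the vector holds.
length(blk_first20)
```

With the full dataframe loaded, the same idea would be written as CO$BlkTr[197] to pull out the first number on the last line.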
NOTE: You can select, copy, and paste everything shown in the Console into another document. I copied the output and pasted it on this webpage; you can copy/paste it into your report.
A little deeper analysis
In the previous section I focused on explaining what is going on with R. But if you look at the data itself, it shows that in one tract, there are only 7 African-Americans, and at the high end is one tract with 3,703 African-Americans. That is a wide range! Typically, the Census creates tracts where it expects about 2,000-6,000 people; so that high-end tract might be virtually all-Black. Such an extremely wide range implies significant segregation in Contra Costa County. But to get a more accurate read on this, we need to look at the proportion of African-Americans in tracts. I will shift to looking at the variable CO$prpBlkTr which we created after first importing this data into R-Studio.
The summary() command returns the range (Min./Max.), the first and third quartiles (the bounds of the inter-quartile range), and the Median and Mean values of a variable:
> summary(CO$prpBlkTr)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.001952 0.018500 0.046730 0.086620 0.134500 0.53170
This summary tells me that in Contra Costa County in 2010, the “least-Black” Tract has only about 0.2% African-Americans, and the “most-Black” Tract has 53.17% African-Americans. Also, note that the median (typical) proportion of African-Americans in Tracts is 4.67%. The fact that the high-end proportion is more than ten times higher than the Median is an indication of a very skewed distribution of the concentration of African-Americans in the county; in other words, residential segregation.
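Each number that summary() reports can also be requested on its own. Here is a small sketch using five made-up proportions (not the real tract data), so you can see which single-purpose function matches which column of the summary.

```r
# Five invented proportions, used only to illustrate the functions.
x <- c(0.002, 0.02, 0.05, 0.13, 0.53)

# All six statistics at once.
summary(x)

# The same statistics, one at a time.
min(x)
quantile(x, 0.25)   # 1st quartile
median(x)
mean(x)
quantile(x, 0.75)   # 3rd quartile
max(x)

# The inter-quartile range is the distance between the two quartiles.
IQR(x)
```

Note that summary() prints the two quartiles themselves; IQR() gives the single number you get by subtracting one from the other.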
For further analysis, we can also calculate the standard deviation in one step:
> sd(CO$prpBlkTr)
[1] 0.09475812
So if the mean is 8.66%, and the standard deviation is 9.48% (rounded), then a high of 53.17% is REALLY unusual: about 4.7 standard deviations above the mean, since (53.17 - 8.66) / 9.48 is roughly 4.7. Even without looking at a graphic plot we can estimate that this is a pretty skewed distribution. And as Massey and Denton (1993) argue, that degree of segregation doesn’t “just happen”.
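The distance between the maximum and the mean, measured in standard deviations, can be checked with a little arithmetic in R. This sketch types in the three figures reported by summary() and sd() above; the names mean_prp, sd_prp, max_prp and z_max are my own and are not part of the dataset.

```r
# Figures copied from the summary() and sd() output above.
mean_prp <- 0.086620
sd_prp   <- 0.09475812
max_prp  <- 0.53170

# How many standard deviations above the mean is the maximum?
z_max <- (max_prp - mean_prp) / sd_prp
round(z_max, 1)   # about 4.7

# With the dataframe loaded, the same calculation could be written as:
# (max(CO$prpBlkTr) - mean(CO$prpBlkTr)) / sd(CO$prpBlkTr)
```

The commented-out line at the end assumes the dataframe "CO" is loaded in your session; it is left as a comment so the sketch runs on its own.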
A few more thoughts about R Syntax
These few examples already reveal a great deal about R syntax.
1. You name your own variables. That way you can use names that help you avoid getting lost. I like to use a mix of uppercase and lowercase letters so that I can identify the subcomponent abbreviations I use to assemble names for complex variables.
2. The syntax for functions in R is similar to the same functions in spreadsheet software, although the function names sometimes differ. So if you have used functions in Excel or LibreOffice Calc, try the equivalent functions in R. One of the advantages in R is that your own named variables become less abstract. The standard deviation of a set of numbers?
sd(CO$BlkTr) in R. The standard deviation of tract-populations of African Americans in the county I am analyzing.
=STDEV(D2:D237) in a spreadsheet. This calculates the standard deviation of a set of numbers in Column D. The numbers could be anything. One negligent deletion of a column or group of cells could invalidate the calculation without any obvious “error message”.
More General Concepts
R-Studio Interface Terms
When you start R-Studio, it opens a window on your screen. That window has multiple panes. Not all the panes are visible at first. The upper left pane appears only after you import some data or open a blank tab for a new script. But to explain what is going on, we will use the following convention for the names of the panes:
In the following four sections of this page I will clarify a few issues before we move on to using R-Studio. I am making these clarifications because I am reviewing the standard literature on R (outside of Raykov & Marcoulides) to develop the best compromise between consistency and clarity. In case you are befuddled by the very concept of R, I have written a separate page to try and explain the nature of R in terms of software philosophy. Let me know if that one helps. And now, onward with some pragmatic explanations:
The idea of “objects” in R
By matching the standard terminology in other documentation, I’m hoping to explain R in a way that you can check against anything you look up on the web.
1. A scalar is a single value, like the total number of people in a county.
2. A vector is a series of related numbers, such as the 207 numbers representing the population of Asian-Americans in each census-tract in Contra Costa. A vector in R is like all the values in the column of a table.
3. A matrix is a group of numbers arranged in columns and rows (with no headers). It is a two-dimensional vector.
4. A data-frame is like a matrix, but the columns are labeled. Table P5 is an example of a data frame. In a data-frame, the columns are called variables and the rows are called observations. This makes sense for survey-data: every row is the data from a survey response, what might be recorded on one sheet of paper from one interview or ‘observation.’
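The four kinds of object listed above can each be created in one line. This is a minimal sketch with invented values; the names s, v, m and df are arbitrary choices for the example.

```r
# Scalar: a single value (R stores it as a vector of length 1).
s <- 42

# Vector: a series of related numbers.
v <- c(10, 20, 30)

# Matrix: numbers arranged in rows and columns (here 2 rows, 3 columns).
m <- matrix(1:6, nrow = 2)

# Data-frame: labeled columns (variables), one row per observation.
df <- data.frame(tract = c("A", "B", "C"), pop = v)

length(s)    # number of values in the scalar
length(v)    # number of values in the vector
dim(m)       # rows, then columns
names(df)    # the variable (column) labels
nrow(df)     # the number of observations (rows)
```

The functions length(), dim(), names() and nrow() are handy for checking what kind of object you are holding before you run a calculation on it.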
Should we use attach() or not? No.
Previously I recommended using the attach() command to simplify subsequent commands. Bettinger uses it. Raykov & Marcoulides use it. In practice this has not worked. If your laptop goes to sleep (such as when you close the lid), R detaches whatever object you had attached. Then every subsequent command will fail with an error message. Unwanted detachment is deeply upsetting, but it is a fact of life when working with R.
What this means is that you will need to use the dataframe-dollarsign-variable syntax to tell R to look inside the dataframe to find the variable. I will stick with this syntax throughout these webpages.
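Here is a short sketch of working without attach(), using a toy stand-in for "CO" with invented values. It also shows with(), a base R function that lets you name the dataframe once for a single command; that is offered only as an optional alternative, not as something the rest of these pages rely on.

```r
# Toy stand-in for the tract dataframe (illustrative values only).
CO <- data.frame(BlkTr = c(171, 447, 7, 3703))

# The dataframe-dollarsign-variable syntax works without attach().
range(CO$BlkTr)

# with() evaluates one command "inside" the dataframe, so the bare
# variable name is understood for that command only.
with(CO, range(BlkTr))
```

Both commands return the same answer. The difference from attach() is that nothing stays attached afterwards, so there is nothing to come undone later.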