R and R-Studio: general concepts

Revised 21 Dec to standardize variable-terminology.

A lot can be done with spreadsheet software like LibreOffice Calc and Google Sheets. With spreadsheets you can calculate measures of central tendency and dispersion. Most have excellent utilities for generating graphics. Even if you use R, you will still need spreadsheets to manage the translation of processed data into final reports: either slide presentations or word-processed documents. But at some point you may need to do more data-intensive analysis, and it becomes easier to do it in R (or other statistical software) than with spreadsheet software. This page is written to help you get a sense of how R-Studio works.

It is easiest to begin with a dataset that you find interesting. On previous pages I explain:
1.) How to get Census data at the County/Tract level.
2.) A convention for naming data in a format that is human-readable, machine-readable, and short.
3.) How to assemble that data into a “master” CSV that contains 41 different bits of information about each tract.
4.) How to import that file into R, calculate more variables in that dataset, and then export it again as a CSV.
I also explain how to manage some mistakes in R. I think these basic skills are necessary to avoid anxiety while using R.

But now that you have used R-Studio a little, it is time to learn more about the nature of this software, so you can use it with more confidence.

R is command-line

There is a good reason why we moved away from command-line interface (CLI) software in the 1980s and 1990s. That blinking cursor on a blank line might represent limitless possibilities to a computer programmer, but for most of us it represents limitless impossibilities, as in, “What do I do now???” Drop-down menus give us a clue about which commands we can run. Graphic menus also constrain our options, usually limiting us to the correct syntax for executing a command. For example, if you go to File > Save… in a graphic program, it will open up a dialog-box to help you find where you want to save a file. You could type out the same operation from a command-line, but if you make a typo, you may send the file to a location that does not exist. Depending upon the software you are using, you might or might not get an error-message, warning you that you are about to send your work into irretrievable oblivion.

Fortunately, R-Studio is a much better user-experience for several reasons.
First: R-Studio is a graphic front-end to R that enables you to use menu-driven commands for a lot of the basic operations like Save, Save As, Import, etc. R-Studio also makes visible some things that exist in R but are otherwise invisible, like the History of every command that is entered.
Second: command-line operations are much easier now that we have the World Wide Web. If you are using software that a reasonable number of other people are also using, you can often solve problems like “What do I do now???” by typing your question into a search engine. Often, if someone else has solved the problem, they will also post the commands that worked. You can copy commands straight off a web-page and paste them onto the command-line of R and run them. Better yet, you can paste them onto a scratch sheet, adapt them to your circumstances, and then run them; when they fail due to some typo or omission, you can revise them and re-run them until they work. When they do work, you can save that scratch-sheet as a text file and use it later.
Third: if you work out a series of commands that need to be run multiple times, you can just highlight them all and then run them as a sequence. That is why the scratch-sheet is called a script sheet. Spreadsheet programs have a similar capability called macros. Worth learning, but the scripting capability in R-Studio is remarkably easy to use.
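To make the idea of a script sheet concrete, here is a minimal sketch of what one might contain: an import, a calculation, and an export, run top to bottom. The column name TotTr and the toy numbers are assumptions for illustration only, and tempfile() stands in for a real file path so that the sketch runs on its own.

```r
# A small script sheet: each line is a command you could also type at the console.
# The data values and the column name TotTr are made up for illustration.
in_path <- tempfile(fileext = ".csv")
write.csv(data.frame(BlkTr = c(171, 447, 263), TotTr = c(4200, 5100, 3900)),
          in_path, row.names = FALSE)

CO <- read.csv(in_path)                      # import the CSV as a dataframe
CO$prpBlkTr <- CO$BlkTr / CO$TotTr           # calculate a new variable
out_path <- tempfile(fileext = ".csv")
write.csv(CO, out_path, row.names = FALSE)   # export the result as a CSV
```

In R-Studio you would highlight these lines in the script pane and run them together; if one line fails, you fix it and run the selection again.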

Syntax is half the battle in R: Some examples

On the page where I introduce a first session in R-Studio, I describe how to logically “build up” an R command.
You begin with the basic operation you are trying to do, such as dividing one variable by another.
Then you “wrap” that basic function with a series of specifications that tell R where to find the data, and also tell R where to store the data.
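The build-up described above can be sketched as follows. This is only an illustration on a three-row toy dataframe; the name TotTr (total tract population) is an assumption, not necessarily the name used in the master CSV.

```r
# Toy stand-in for the tract dataframe; values and the name TotTr are made up.
CO <- data.frame(
  BlkTr = c(171, 447, 263),
  TotTr = c(4200, 5100, 3900)
)

# Step 1: the basic operation is a division: BlkTr / TotTr
# Step 2: tell R where to find each variable: CO$BlkTr / CO$TotTr
# Step 3: tell R where to store the result: a new column inside CO
CO$prpBlkTr <- CO$BlkTr / CO$TotTr

CO   # print the dataframe to see the new column
```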

Here I am going to show several simpler commands, for two reasons.
First, these examples show more of the basic syntax of R.
Second, they enable you to do some basic analysis on your data-set.
These examples are based on tract-level county data from the 2010 Census, which has been prepared in spreadsheet software and imported to R-Studio as a dataframe named “CO”. Other pages on this site explain how to create this dataset.

What is the range of the number of African-Americans in all tracts in this county?
Here is the way you would command R to do that calculation:

range(CO$BlkTr)

Interpretation of the syntax of the above command. First: the basic command for finding the range in R is range().
Note that in R documentation, commands are typically written as command().
Second: the stuff inside the parentheses is written as dataframe-dollarsign-variablename. With this syntax we are telling R to look within dataframe “CO” for variable “BlkTr” and then find the range of values within that variable. The answer is “returned” back on the command-line, just below the command you enter:

range(CO$BlkTr)
[1]    7 3703
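If you want to try this without the full dataset, here is a minimal sketch on a toy dataframe. The five values are picked from the real output for illustration; range() returns a vector of two numbers, the minimum and the maximum.

```r
# Toy dataframe with one variable; five values only, for illustration.
CO <- data.frame(BlkTr = c(171, 447, 7, 3703, 263))

r <- range(CO$BlkTr)   # a vector of length 2: smallest value, largest value
r
diff(r)                # the width of the range: largest minus smallest
min(CO$BlkTr)          # the same two numbers, one at a time
max(CO$BlkTr)
```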

In many cases, R begins the answer with a bracketed number. This helps if R returns a lot of numbers on the command-line. Here is a command that will produce a lot of numbers, and show why the bracketed line-numbering is useful. Just enter a variable with no command in front of it, and by default R will “return” all the data within that variable:

CO$BlkTr
  [1]  171  447  263  434  716 1368  475  568  627 1442  412  707  140  274
 [15]  254   65   61   76  396  857  271  440  478  408  433  470 1163  388
 [29] 1132  818  667  915 1011  918  895 1468  943 1300  282  690  111  855
 [43]  676  824 1102  224  245  279   50   94  192   93   63  214  318   86
 [57]   59  160   85   31  175  186   98   26  272  151  171  130  198  150
 [71]  100  104  141  110  305   82   88   67  329  424  102  201   64  305
 [85]   98  247   54   37  134   80   15   48  155  147  127   60   56   51
 [99]   84   39  142   84   73   45  154  133  219   53   50  123   32  255
[113]   39    7   30   32   36   17   51   85   20   41   74   20   19    8
[127]   21  113   70   70   10   37   21   21   44  525 3036 1040  933 1331
[141]  115   90  221  426  101  156  811  144   40  127   50  496  758  126
[155]  901  908  837  504  737  446 1315  713  446  560  755  664  309 1502
[169]  701 1687 1720  575  833 2199  823  432  278 1465  466  412  757  772
[183]  463  847  554 1882 1600  250 2457 1599 2689 3703  511  397  153   79
[197]  468  228  140   89  187   94   32   65   59 1928  623

Structure of the output above. In the 2010 Decennial Census, there were 207 populated tracts in Contra Costa County. The number of African-Americans in each tract is loaded into R as one column of data within the dataframe “CO”. R has just “returned” all of that data in the console-window, just below the command; it is 207 different numbers. The first number on the first line is 171; and R notes that this is the first number by putting [1] next to it. On the last line (the fifteenth line), the first number is 468. That is the 197th number in a variable containing 207 different numbers. R indicates that this first number on the 15th line is the #197 within the variable by putting a [197] next to it.
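The bracketed numbers are positions, and you can use the same positions to pull out individual values. A minimal sketch, using only the first five values from the output above:

```r
# First five values of the variable, copied from the output above.
x <- c(171, 447, 263, 434, 716)

x[1]        # the value in position 1
x[3:5]      # the values in positions 3 through 5
length(x)   # how many values the vector holds
```

With the full dataframe loaded, CO$BlkTr[197] would return the value that R labels [197] in the console output, and length(CO$BlkTr) would return the number of tracts.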

NOTE: You can select, copy, and paste everything shown in the Console into another document. I copied the output and pasted it on this webpage; you can copy/paste it into your report.

A little deeper analysis

In the previous section I focused on explaining what is going on with R. But if you look at the data itself, it shows that in one tract, there are only 7 African-Americans, and at the high end is one tract with 3,703 African-Americans. That is a wide range! Typically, the Census creates tracts where it expects about 2,000-6,000 people; so that high-end tract might be virtually all-Black. Such an extremely wide range implies significant segregation in Contra Costa County. But to get a more accurate read on this, we need to look at the proportion of African-Americans in tracts. I will shift to looking at the variable CO$prpBlkTr which we created after first importing this data into R-Studio.

The summary() command returns the range (Min./Max.), the first and third quartiles (which bound the inter-quartile range), and the Median and Mean values of a variable:

> summary(CO$prpBlkTr)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001952 0.018500 0.046730 0.086620 0.134500 0.53170

This summary tells me that in Contra Costa County in 2010, the “least-Black” Tract has only about 0.2% African-Americans, and the “most-Black” Tract has 53.17% African-Americans. Also, note that the median (typical) proportion of African-Americans in Tracts is 4.67%. The fact that the high-end proportion is more than ten times higher than the Median is an indication of a very skewed distribution of the concentration of African-Americans in the county: in other words, residential segregation.
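The comparison between the maximum and the median can be checked with a line of arithmetic. This is a minimal sketch that uses the numbers copied from the summary() output above rather than the real dataframe; the names mx, med, and avg are just labels chosen here.

```r
# Values copied from the summary() output above.
mx  <- 0.53170    # Max.
med <- 0.046730   # Median
avg <- 0.086620   # Mean

mx / med    # how many times larger the maximum is than the median
avg > med   # a mean well above the median hints at a right-skewed distribution
```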

For further analysis, we can also calculate the standard deviation in one step:

> sd(CO$prpBlkTr)
[1] 0.09475812

So if the mean is 8.66%, and the standard deviation is 9.48% (rounded), then a high of 53.17% is REALLY unusual: about 4.7 standard deviations above the mean, since (53.17 − 8.66) / 9.48 ≈ 4.7. Even without looking at a graphic plot we can estimate that this is a pretty skewed distribution. And as Massey and Denton (1993) argue, that degree of segregation doesn’t “just happen”.
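You can let R do this arithmetic for you. The sketch below uses the values reported by summary() and sd() above; the names mx, avg, s, and z are labels chosen here for illustration.

```r
# Values copied from the summary() and sd() output above.
mx  <- 0.53170      # maximum proportion
avg <- 0.08662      # mean proportion
s   <- 0.09475812   # standard deviation

# Distance of the maximum from the mean, measured in standard deviations.
z <- (mx - avg) / s
round(z, 1)
```

With the dataframe loaded, the same calculation can be written in one line as (max(CO$prpBlkTr) - mean(CO$prpBlkTr)) / sd(CO$prpBlkTr).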

A few more thoughts about R Syntax

These few examples already reveal a great deal about R syntax.
1. You name your own variables. That way you can use names that help you avoid getting lost. I like to use a mix of uppercase and lowercase letters so that I can identify the subcomponent abbreviations I use to assemble names for complex variables.
2. The syntax for functions in R is similar to the same functions in spreadsheet software, although the function names sometimes differ. So if you have used functions in Excel or LibreOffice Calc, try the equivalent functions in R. One of the advantages in R is that your own named variables become less abstract. The standard deviation of a set of numbers?
sd(CO$BlkTr) in R. This is the standard deviation of tract-populations of African Americans in the county I am analyzing.
STDEV(D2:D237) in a spreadsheet. This calculates the standard deviation of a set of numbers in Column D. The numbers could be anything. One negligent deletion of a column or group of cells could invalidate the calculation without any obvious “error message”.
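Here is a small sketch of why a named variable is harder to break than a cell range. The dataframe is a toy, and the second column name, AsnTr, is a hypothetical name used only for this example.

```r
# Toy dataframe with two variables; AsnTr is a hypothetical column name.
CO <- data.frame(BlkTr = c(171, 447, 263, 434),
                 AsnTr = c(90, 120, 310, 75))

before <- sd(CO$BlkTr)

# Delete a different column. In a spreadsheet this could shift the cell range.
CO$AsnTr <- NULL

after <- sd(CO$BlkTr)   # still refers to the same variable, by name
before == after

# Asking for a variable that is not in the dataframe returns NULL,
# which is a visible sign that something is missing.
is.null(CO$NoSuchVar)
```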

More General Concepts

R-Studio Interface Terms

When you start R-Studio, it opens a window on your screen. That window has multiple panes. Not all the panes are visible at first. The upper left pane appears only after you import some data or open a blank tab for a new script. But to explain what is going on, we will use the following convention for the names of the panes:

In the following four sections of this page I will clarify a few issues before we move on to using R-Studio. I am making these clarifications because I am reviewing the standard literature on R (outside of Raykov & Marcoulides) to develop the best compromise between consistency and clarity. In case you are befuddled by the very concept of R, I have written a separate page to try and explain the nature of R in terms of software philosophy. Let me know if that one helps. And now, onward with some pragmatic explanations:

The idea of “objects” in R

By matching the standard terminology in other documentation, I’m hoping to explain R in a way that you can check against anything you look up on the web.
1. A scalar is a single value, like the total number of people in a county.
2. A vector is a series of related numbers, such as the 207 numbers representing the population of Asian-Americans in each census-tract in Contra Costa. A vector in R is like all the values in the column of a table.
3. A matrix is a group of numbers arranged in columns and rows (with no headers). It is a two-dimensional vector.
4. A data-frame is like a matrix, but the columns are labeled, and each column can hold a different type of data (numbers in one, text in another). Table P5 is an example of a data frame. In a data-frame, the columns are called variables and the rows are called observations. This makes sense for survey-data: every row is the data from a survey response, what might be recorded on one sheet of paper from one interview or ‘observation.’
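The four kinds of objects listed above can each be created in one line. This is a minimal sketch with made-up values; the object names s, v, m, and df, and the column name TotTr, are chosen here only for illustration.

```r
s  <- 5000                        # scalar: in R, simply a vector of length 1
v  <- c(171, 447, 263)            # vector: a series of related numbers
m  <- matrix(1:6, nrow = 2)       # matrix: 2 rows by 3 columns of numbers
df <- data.frame(BlkTr = v,       # data-frame: labeled columns (variables),
                 TotTr = c(4200, 5100, 3900))   # one row per observation

length(s)    # number of values in the scalar
length(v)    # number of values in the vector
dim(m)       # rows and columns of the matrix
names(df)    # the variable names
nrow(df)     # the number of observations
df$BlkTr     # one variable pulled out of the data-frame is itself a vector
```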

Should we use attach() or not? No.

Previously I recommended using the attach() command to simplify subsequent commands. Bettinger uses it. Raykov & Marcoulides use it. In practice this has not worked. If your laptop goes to sleep (such as when you close the lid), R detaches whatever object you had attached. Then every subsequent command will fail with an error message. Unwanted detachment is deeply upsetting, but it is a fact of life when working with R.

What this means is that you will need to use the dataframe-dollarsign-variable syntax to tell R to look inside the dataframe to find the variable. I will stick with this syntax throughout these webpages.
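A short sketch of the difference, on a toy dataframe with made-up values: without attach(), a bare variable name is not found, while the dataframe-dollarsign-variable form works on its own. The tryCatch() wrapper is used here only so that the example keeps running after the error; the replacement message is my own wording, not R's.

```r
# Toy dataframe; nothing has been attached.
CO <- data.frame(BlkTr = c(171, 447, 263))

# The bare name fails, because R does not look inside CO by itself.
res <- tryCatch(mean(BlkTr),
                error = function(e) "error: R cannot find BlkTr")
res

# The full syntax tells R exactly where to look, so it always works.
mean(CO$BlkTr)
```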