Dec 21, 2014: other pages updated to conform to these conventions.
Whether you are using a spreadsheet or other data-management software (including R), one of the challenges is keeping track of your data. A crucial method for maintaining control of your data is to name the variables in human-readable ways. There is an art to naming variables. It is a compromise among three goals:
1.) keeping the names short,
2.) keeping them human-readable, and
3.) keeping them software-readable.
For software-readability, we need to avoid most punctuation marks because software might mistake them for operators. Arithmetic operations, for instance, are represented by +, -, *, and /, so you shouldn’t even use a hyphen (-) in a name. Periods could be mistaken for decimal points. Commas, tabs, and spaces are sometimes data-separators, such as in CSV files. Exclamation points, question marks, colons, semicolons, parentheses (), brackets [], curly braces {}, hash-marks (#), dollar-signs ($), the at-sign (@), and the ampersand (&) are all interpreted by one program or another as a command or function, so we should not use them in variable-names.
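To make the risk concrete, here is a small sketch of what R does with a hyphen and a space in a name. The values and the CSV text are invented for illustration; `read.csv()` and its `check.names` argument are part of base R.

```r
# Two ordinary variables (values invented for illustration).
Lat <- 10
Tr  <- 3

# R reads the hyphen as subtraction, not as part of a name.
Lat-Tr
# → 7

# A CSV header containing a hyphen and a space (hypothetical data).
csv_text <- "Lat-Tr,Blk Tr\n120,45\n98,60"

# By default, read.csv() silently repairs illegal names by substituting periods.
DF <- read.csv(text = csv_text)
names(DF)
# → "Lat.Tr" "Blk.Tr"

# With check.names = FALSE the original names survive,
# but every later reference to them must be wrapped in back-quotes.
DF_raw <- read.csv(text = csv_text, check.names = FALSE)
DF_raw$`Lat-Tr`
```

Either way, the names you typed are no longer the names you work with, which is the kind of quiet incompatibility worth avoiding.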
Thus, I recommend a pretty conservative approach to naming variables. To minimize the risk of software-disaster, I advise avoiding all punctuation marks except the underscore mark (_). This maximizes the chances that your dataset will remain compatible with a wide variety of different programs. I like to maximize interoperability so that you can move between different programs, even on different computers, to do whatever you need to do to get the job done. The last thing you want to deal with is a mysterious glitch or error caused by some obscure incompatibility in your data.
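One way to apply this rule mechanically is to screen candidate names before using them. The sketch below is only an illustration: `is_safe_name` is a hypothetical helper, not a built-in R function, and its requirement that a name begin with a letter is an added assumption that goes slightly beyond the advice above.

```r
# Hypothetical helper: TRUE when a name contains only letters, digits, and
# underscores. Requiring a leading letter is an added assumption.
is_safe_name <- function(x) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", x)
}

# Candidate names, some of which break the rule (invented examples).
candidates <- c("LatTr", "prp_Lat_Tr", "Lat-Tr", "Blk.Co", "pct Wht", "2010Tot")

# Show each candidate next to its verdict.
data.frame(name = candidates, safe = is_safe_name(candidates))
```

Only the first two candidates pass; the others contain a hyphen, a period, a space, or a leading digit.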
Furthermore, variables often represent complex information, like the proportion of African-Americans in each Census Tract. To create the names for complex variables, I recommend a modular approach. Figure out the shortest possible abbreviation for each aspect of a variable, and combine these chunks together to make variable names. The table below is the set of abbreviations we came up with for Census data:
multiracial: Mlt | African-American: Blk | Hawai’ian/API: Api |
other: Oth | Asian-American: Asn | AmerInd/Alaska Native: Ntv |
Total: Tot | Latino, all races: Lat | White (incl. Middle Eastern): Wht |
County: Co | Tract: Tr | per capita income: PCI |
proportion: prp | percent (prp * 100): pct | density: Dns |
average: av | Household: HH | Education: ED |
minimum: min | maximum: max | Languages: use ISO 2-letter codes |
Job: Jb | Work: Wrk | data-frame: DF |
Combine these abbreviations to make descriptive but short variable names. You can use an underscore to separate the chunks for legibility. Otherwise, jam the chunks directly together to make the shortest possible variable names. Examples:
- A vector that includes the tract-populations of Latinos should be called LatTr.
- The total number of Latinos in the county is a single number; however, in some software you may want to assign it to a named variable. As a variable it would be named LatCo.
- The proportion of Latinos in each Tract would be prpLatTr.
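The three examples above might look like this in R. This is a minimal sketch with invented numbers for a county of four tracts; `TotTr` and `pctLatTr` are additional names built from the same abbreviation table.

```r
# Invented tract-level counts for a county with four tracts.
LatTr <- c(120, 340, 75, 210)      # Latino population in each tract
TotTr <- c(1500, 2100, 900, 1800)  # total population in each tract

# County-level total: a single number stored as a named variable.
LatCo <- sum(LatTr)
LatCo
# → 745

# Proportion of Latinos in each tract, and the same figure as a percent.
prpLatTr <- LatTr / TotTr
pctLatTr <- prpLatTr * 100

# The names themselves are just chunks joined together;
# an underscore separator is optional.
paste0("prp", "Lat", "Tr")
# → "prpLatTr"
paste("prp", "Lat", "Tr", sep = "_")
# → "prp_Lat_Tr"
```

Because every chunk comes from the table, a name such as `pctLatTr` can be decoded at a glance: percent, Latino, by tract.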
A closing thought: this page is designed to give you guidelines for naming variables, in our case to be used mostly in RStudio. Once we get into using RStudio, the command-syntax looks slightly Martian until you get used to it. But always remember, YOU are the one who names the variables, so name them so that they make sense to you. That way you can stay focused on the analysis and on troubleshooting the command-structure in R. Those two things are hard enough to wrestle with. You don’t want the added problem of trying to figure out what data is actually contained in your variables.