Dec 21, 2014: now consistent with variable naming conventions. Note that AIAN -> Ntv, and NHPI -> Api.
Assemble data in a CSV for general analysis
Why do this?
I am asking you to gather current Census data into one “master-csv” file, with intelligible variable-names. Then you can import all the assembled data into a single data-frame in R-Studio.
Once it is in R-Studio, you will be able to do a series of operations on the data.
In R-Studio I will show you how to keep most of this data in a single data-frame, so that you can export it from R-Studio and use the data in a spreadsheet or word-processing file—in other words, your final report for the semester.
How to do this?
1.) Make sure you have the following files downloaded and unzipped:
a) 2010 Decennial Census Table P5
b) The shapefile for your county, from the same Factfinder page as your Table P5
c) ACS 2010 Table DP-02
d) ACS 2010 Table DP-03.
2.) Make sure your files are clean.
The second row and beyond should just be tract-data. No summary data, no extra descriptions.
The final row should be the last tract in your county. No sums, stats, and especially no text.
3.) We will create a master-csv by combining selected data from all three files. Begin by using “Save As…” so save one of the files as a new master-csv named {yourcounty}2010.csv
I began with Table P5, but you could begin with any of the tables because they all have some data that will be included in the final master-csv.
[NOTE: I am using the programmer’s convention of curly-braces {} to represent terms that you yourself must determine. In this case, you must determine the name of your county.]
The software may complain that the CSV format does not support many of the features of a spreadsheet. That is true, but for now you are gathering raw data together so that you can import it into R-Studio.
4.) Rename selected column names per the table shown below. We will use the variable naming conventions used explained on the linked page.
In spreadsheet-software, column names appear on row #1, which is confusing because all the rows below that–from row #2 down to the last tract–are all data. In R-Studio, the names in the first row will become the names of the variables, and appear as column headers.
Original & new Column Name |
Description |
GEO.id → GEOcode | Unique geo-identifier for this tract with all Census datasets. |
D001 → TotTr | The total population in each tract. |
D003 → WhtTr | Population of non-Latino whites in each tract. |
D004 → BlkTr | Population of non-Latino African-Americans in each tract. |
D005 → NtvTr | Pop. non-Latino Native Americans / Alaska Natives, ea. tract. |
D006 → AsnTr | Population of non-Latino Asians in each tract. |
D007 → ApiTr | Pop. of non-Latino Native Hawai’ian / Pacific Islanders |
D008 → OthTr | Population of “Others” (Martians, Jedi, White Walkers…) |
D009 → MltTr | Population of non-Latino multi-racial residents, each tract. |
D010 → LatTr | Population of Latinos (all races) in each tract. |
Delete all the other columns that were inherited from P5: GEO.id2, GEO.display-label, D002, and D011 through D017.
5.) Now open Table DP02. There are ten more columns of data I want you to copy from DP02 and paste into {yourcounty}2010.csv: In the tables below, the upper row is the name that these columns have in DP02. copy the data from these columns, paste it into {yourcounty}2010.csv, and rename them according to the names in the second row on the table:
Original → final “master-csv” variable name |
Description |
HC01_VC20 → avHHsiz | Average household size |
HC01_VC86 → ED_noGED | Pop. without a GED in the tract. |
HC01_VC87 → ED_GED | Pop. with a High School / GED but no higher in tract. |
HC01_VC89 → ED_AA | Pop. with Associates’ Degree, but no higher, in tract. |
HC01_VC90 → ED_BA | Pop. with a Bachelors Degree, but no higher, in tract. |
HC01_VC91 → ED_transBA | Pop. with post-graduate degrees. |
HC01_VC134 → FgnBorn | # foreign-born people in each tract. |
HC01_VC167 → HmLng_onlyEN | Language spoken at home: English only. |
HC01_VC168 → HmLng_noEN | Language spoken at home: no English |
HC01_VC171 → HmLng_ES | Language spoken at home: Spanish |
Once you have pasted all the data from these columns into {yourcounty}2010.csv, close Table DP02. Don’t save any changes to DP02 when closing it. We’ll leave DP02 in the same, original condition.
6.) Open Table DP03. There are 19 variables to copy from this table into {yourcounty}2010.csv:
Original → final “master-csv” variable name |
|
HC01_VC04 → WrkAge | # ppl over 16 in tract. |
HC01_VC05 → Wrking | # people working in the tract. |
HC01_VC10 → not_Wrking | # people not working in the tract. |
HC01_VC50 → JbPrimary | # People working in Primary Sector (farm, mining) |
HC01_VC51 → JbCnstr | # People working in Construction. |
HC01_VC52 → JbManuf | # People working in Manufacturing |
HC01_VC53 → JbWhlsale | # People working in Wholesale |
HC01_VC54 → JbRetail | # People working in Retail |
HC01_VC55 → JbTranspo | # People working in Transportation |
HC01_VC56 → JbInfo | # People working in Information |
HC01_VC57 → JbFIRE | # People working in Finance, Insurance, Real Estate |
HC01_VC58 → JbExec | # People working as Executives, other than FIRE |
HC01_VC59 → JbED_Health | # People working in Education, Healthcare |
HC01_VC60 → JbArts | # People working in in the arts. |
HC01_VC61 → JbOther | # People working in other type jobs |
HC01_VC62 → JbPubAdmin | # People working in Public Administration |
HC01_VC85 → MedIncTr | Median income in each tract. |
HC01_VC86 → AvIncTr | Average income in each tract. |
HC01_VC115 → PCITr | Per capita income for each tract. |
Having fun yet? Save {yourcounty}2010.csv with these additions, and close DP03 without saving any changes to that ACS file.
7.) Lastly, open the unzipped Shapefile for your county. If you have not downloaded this file, go and seek out the 2010 Table P5 (by Tract) from American FactFinder, download, move the .zip archive to your soc393 working folder, and unzip the archive. (Detailed instructions here.) Now, open the very first file within the folder. It is named 140_00.dbf and it is in tabular data format. LibreOffice Calc will open this the same way it will open a .csv file. With Excel, this webpage states that you have to specify that you are opening a database file, using the drop-down options in the “Open…” dialog box. (If someone finds me a better set of instructions, I will revise this page to post them here.) If your software asks “What character encoding should I use?” tell it “Western Europe ISO-8859-1.”
When your software opens the DBF into a spreadsheet, copy the following two columns and paste them into {yourcounty}2010.csv:
NAME,C,90 → TractNum | Tract number. |
SHAPE_AREA,N,19,11 → Area_SM | Area of tract in square meters |
Okay! Save the file {yourcounty}2010.csv, and quit. Next time I will show some of what you can do in R-Studio with this assembled data.
Here is a recap of what variables you should have. I am organizing this table vertically, because it includes too many variables to be displayed horizontally. Please use this as a checklist, but DO NOT misinterpret this as a recommendation to turn your table sideways!
Name of the original file where the data came from: | Original variable name from the original file: | Final name in your “master” CSV: |
DEC_10_SF1_P5.csv | GEO.id | GEOcode |
DEC_10_SF1_P5.csv | D001 | TotTr |
DEC_10_SF1_P5.csv | D003 | WhtTr |
DEC_10_SF1_P5.csv | D004 | BlkTr |
DEC_10_SF1_P5.csv | D005 | NtvTr |
DEC_10_SF1_P5.csv | D006 | AsnTr |
DEC_10_SF1_P5.csv | D007 | ApiTr |
DEC_10_SF1_P5.csv | D008 | OthTr |
DEC_10_SF1_P5.csv | D009 | MltTr |
DEC_10_SF1_P5.csv | D010 | LatTr |
ACS_10_5YR_DP02.csv | HC01_VC20 | avHHsiz |
ACS_10_5YR_DP02.csv | HC01_VC86 | ED_noGED |
ACS_10_5YR_DP02.csv | HC01_VC87 | ED_GED |
ACS_10_5YR_DP02.csv | HC01_VC89 | ED_AA |
ACS_10_5YR_DP02.csv | HC01_VC90 | ED_BA |
ACS_10_5YR_DP02.csv | HC01_VC91 | ED_transBA |
ACS_10_5YR_DP02.csv | HC01_VC134 | FgnBorn |
ACS_10_5YR_DP02.csv | HC01_VC167 | HmLng_onlyEN |
ACS_10_5YR_DP02.csv | HC01_VC168 | HmLng_noEN |
ACS_10_5YR_DP02.csv | HC01_VC171 | HmLng_ES |
ACS_10_5YR_DP03.csv | HC01_VC04 | Wrk_Age |
ACS_10_5YR_DP03.csv | HC01_VC05 | Wrking |
ACS_10_5YR_DP03.csv | HC01_VC10 | not_Wrking |
ACS_10_5YR_DP03.csv | HC01_VC50 | JbPrimary |
ACS_10_5YR_DP03.csv | HC01_VC51 | JbCnstr |
ACS_10_5YR_DP03.csv | HC01_VC52 | JbManuf |
ACS_10_5YR_DP03.csv | HC01_VC53 | JbWhlsale |
ACS_10_5YR_DP03.csv | HC01_VC54 | JbRetail |
ACS_10_5YR_DP03.csv | HC01_VC55 | JbTranspo |
ACS_10_5YR_DP03.csv | HC01_VC56 | JbInfo |
ACS_10_5YR_DP03.csv | HC01_VC57 | JbFIRE |
ACS_10_5YR_DP03.csv | HC01_VC58 | JbExec |
ACS_10_5YR_DP03.csv | HC01_VC59 | JbED_Health |
ACS_10_5YR_DP03.csv | HC01_VC60 | JnArts |
ACS_10_5YR_DP03.csv | HC01_VC61 | JbOther |
ACS_10_5YR_DP03.csv | HC01_VC62 | JbPubAdmin |
ACS_10_5YR_DP03.csv | HC01_VC85 | MedIncTr |
ACS_10_5YR_DP03.csv | HC01_VC86 | AvIncTr |
ACS_10_5YR_DP03.csv | HC01_VC115 | PCITr |
140_00.dbf (inside Shapefile folder) |
NAME,C,90 | TractNum |
140_00.dbf | SHAPE_AREA,N,19,11 | Area_SM |