Gather Data into a “Master-CSV” for Analysis

Dec 21, 2014: now consistent with variable naming conventions. Note that AIAN -> Ntv, and NHPI -> Api.

Assemble data in a CSV for general analysis

Why do this?
I am asking you to gather current Census data into one “master-csv” file, with intelligible variable-names. Then you can import all the assembled data into a single data-frame in R-Studio.
Once it is in R-Studio, you will be able to do a series of operations on the data.
In R-Studio I will show you how to keep most of this data in a single data-frame, so that you can export it from R-Studio and use the data in a spreadsheet or word-processing file—in other words, your final report for the semester.

How to do this?
1.) Make sure you have the following files downloaded and unzipped:
a) 2010 Decennial Census Table P5
b) The shapefile for your county, from the same Factfinder page as your Table P5
c) ACS 2010 Table DP-02
d) ACS 2010 Table DP-03.

2.) Make sure your files are clean.
The second row and beyond should just be tract-data. No summary data, no extra descriptions.
The final row should be the last tract in your county. No sums, stats, and especially no text.

3.) We will create a master-csv by combining selected data from all three files. Begin by using “Save As…” so save one of the files as a new master-csv named {yourcounty}2010.csv
I began with Table P5, but you could begin with any of the tables because they all have some data that will be included in the final master-csv.
[NOTE: I am using the programmer’s convention of curly-braces {} to represent terms that you yourself must determine. In this case, you must determine the name of your county.]
The software may complain that the CSV format does not support many of the features of a spreadsheet. That is true, but for now you are gathering raw data together so that you can import it into R-Studio.

4.) Rename selected column names per the table shown below. We will use the variable naming conventions used explained on the linked page.
In spreadsheet-software, column names appear on row #1, which is confusing because all the rows below that–from row #2 down to the last tract–are all data. In R-Studio, the names in the first row will become the names of the variables, and appear as column headers.

Original & new
Column Name
Description
GEO.id → GEOcode Unique geo-identifier for this tract with all Census datasets.
D001 → TotTr The total population in each tract.
D003 → WhtTr Population of non-Latino whites in each tract.
D004 → BlkTr Population of non-Latino African-Americans in each tract.
D005 → NtvTr Pop. non-Latino Native Americans / Alaska Natives, ea. tract.
D006 → AsnTr Population of non-Latino Asians in each tract.
D007 → ApiTr Pop. of non-Latino Native Hawai’ian / Pacific Islanders
D008 → OthTr Population of “Others” (Martians, Jedi, White Walkers…)
D009 → MltTr Population of non-Latino multi-racial residents, each tract.
D010 → LatTr Population of Latinos (all races) in each tract.

Delete all the other columns that were inherited from P5: GEO.id2, GEO.display-label, D002, and D011 through D017.

5.) Now open Table DP02. There are ten more columns of data I want you to copy from DP02 and paste into {yourcounty}2010.csv: In the tables below, the upper row is the name that these columns have in DP02. copy the data from these columns, paste it into {yourcounty}2010.csv, and rename them according to the names in the second row on the table:

Original → final “master-csv”
variable name
Description
HC01_VC20 → avHHsiz Average household size
HC01_VC86 → ED_noGED Pop. without a GED in the tract.
HC01_VC87 → ED_GED Pop. with a High School / GED but no higher in tract.
HC01_VC89 → ED_AA Pop. with Associates’ Degree, but no higher, in tract.
HC01_VC90 → ED_BA Pop. with a Bachelors Degree, but no higher, in tract.
HC01_VC91 → ED_transBA Pop. with post-graduate degrees.
HC01_VC134 → FgnBorn # foreign-born people in each tract.
HC01_VC167 → HmLng_onlyEN Language spoken at home: English only.
HC01_VC168 → HmLng_noEN Language spoken at home: no English
HC01_VC171 → HmLng_ES Language spoken at home: Spanish

Once you have pasted all the data from these columns into {yourcounty}2010.csv, close Table DP02. Don’t save any changes to DP02 when closing it. We’ll leave DP02 in the same, original condition.

6.) Open Table DP03. There are 19 variables to copy from this table into {yourcounty}2010.csv:

Original → final “master-csv”
variable name
HC01_VC04 → WrkAge # ppl over 16 in tract.
HC01_VC05 → Wrking # people working in the tract.
HC01_VC10 → not_Wrking # people not working in the tract.
HC01_VC50 → JbPrimary # People working in Primary Sector (farm, mining)
HC01_VC51 → JbCnstr # People working in Construction.
HC01_VC52 → JbManuf # People working in Manufacturing
HC01_VC53 → JbWhlsale # People working in Wholesale
HC01_VC54 → JbRetail # People working in Retail
HC01_VC55 → JbTranspo # People working in Transportation
HC01_VC56 → JbInfo # People working in Information
HC01_VC57 → JbFIRE # People working in Finance, Insurance, Real Estate
HC01_VC58 → JbExec # People working as Executives, other than FIRE
HC01_VC59 → JbED_Health # People working in Education, Healthcare
HC01_VC60 → JbArts # People working in in the arts.
HC01_VC61 → JbOther # People working in other type jobs
HC01_VC62 → JbPubAdmin # People working in Public Administration
HC01_VC85 → MedIncTr Median income in each tract.
HC01_VC86 → AvIncTr Average income in each tract.
HC01_VC115 → PCITr Per capita income for each tract.

Having fun yet? Save {yourcounty}2010.csv with these additions, and close DP03 without saving any changes to that ACS file.

7.) Lastly, open the unzipped Shapefile for your county. If you have not downloaded this file, go and seek out the 2010 Table P5 (by Tract) from American FactFinder, download, move the .zip archive to your soc393 working folder, and unzip the archive. (Detailed instructions here.) Now, open the very first file within the folder. It is named 140_00.dbf and it is in tabular data format. LibreOffice Calc will open this the same way it will open a .csv file. With Excel, this webpage states that you have to specify that you are opening a database file, using the drop-down options in the “Open…” dialog box. (If someone finds me a better set of instructions, I will revise this page to post them here.) If your software asks “What character encoding should I use?” tell it “Western Europe ISO-8859-1.”

When your software opens the DBF into a spreadsheet, copy the following two columns and paste them into {yourcounty}2010.csv:

NAME,C,90 → TractNum Tract number.
SHAPE_AREA,N,19,11 → Area_SM Area of tract in square meters

Okay! Save the file {yourcounty}2010.csv, and quit. Next time I will show some of what you can do in R-Studio with this assembled data.


Here is a recap of what variables you should have. I am organizing this table vertically, because it includes too many variables to be displayed horizontally. Please use this as a checklist, but DO NOT misinterpret this as a recommendation to turn your table sideways!

Name of the original file where the data came from: Original variable name from the original file: Final name in your “master” CSV:
DEC_10_SF1_P5.csv GEO.id GEOcode
DEC_10_SF1_P5.csv D001 TotTr
DEC_10_SF1_P5.csv D003 WhtTr
DEC_10_SF1_P5.csv D004 BlkTr
DEC_10_SF1_P5.csv D005 NtvTr
DEC_10_SF1_P5.csv D006 AsnTr
DEC_10_SF1_P5.csv D007 ApiTr
DEC_10_SF1_P5.csv  D008 OthTr
DEC_10_SF1_P5.csv  D009 MltTr
DEC_10_SF1_P5.csv  D010 LatTr
ACS_10_5YR_DP02.csv  HC01_VC20 avHHsiz
ACS_10_5YR_DP02.csv  HC01_VC86 ED_noGED
ACS_10_5YR_DP02.csv  HC01_VC87 ED_GED
ACS_10_5YR_DP02.csv  HC01_VC89 ED_AA
ACS_10_5YR_DP02.csv  HC01_VC90 ED_BA
ACS_10_5YR_DP02.csv  HC01_VC91 ED_transBA
ACS_10_5YR_DP02.csv  HC01_VC134 FgnBorn
ACS_10_5YR_DP02.csv  HC01_VC167 HmLng_onlyEN
ACS_10_5YR_DP02.csv  HC01_VC168 HmLng_noEN
ACS_10_5YR_DP02.csv  HC01_VC171  HmLng_ES
ACS_10_5YR_DP03.csv  HC01_VC04  Wrk_Age
ACS_10_5YR_DP03.csv  HC01_VC05  Wrking
ACS_10_5YR_DP03.csv  HC01_VC10  not_Wrking
ACS_10_5YR_DP03.csv  HC01_VC50  JbPrimary
ACS_10_5YR_DP03.csv  HC01_VC51  JbCnstr
ACS_10_5YR_DP03.csv  HC01_VC52  JbManuf
ACS_10_5YR_DP03.csv  HC01_VC53  JbWhlsale
ACS_10_5YR_DP03.csv  HC01_VC54  JbRetail
ACS_10_5YR_DP03.csv  HC01_VC55  JbTranspo
ACS_10_5YR_DP03.csv  HC01_VC56  JbInfo
ACS_10_5YR_DP03.csv  HC01_VC57  JbFIRE
ACS_10_5YR_DP03.csv  HC01_VC58  JbExec
ACS_10_5YR_DP03.csv  HC01_VC59  JbED_Health
ACS_10_5YR_DP03.csv  HC01_VC60  JnArts
ACS_10_5YR_DP03.csv  HC01_VC61  JbOther
ACS_10_5YR_DP03.csv  HC01_VC62  JbPubAdmin
ACS_10_5YR_DP03.csv  HC01_VC85  MedIncTr
ACS_10_5YR_DP03.csv  HC01_VC86  AvIncTr
ACS_10_5YR_DP03.csv  HC01_VC115  PCITr
140_00.dbf
(inside Shapefile folder)
NAME,C,90 TractNum
140_00.dbf SHAPE_AREA,N,19,11 Area_SM