Gather Data into a “Master-CSV” for Analysis

Dec 21, 2014: now consistent with variable naming conventions. Note that AIAN -> Ntv, and NHPI -> Api.

Assemble data in a CSV for general analysis

Why do this?
I am asking you to gather current Census data into one “master-csv” file, with intelligible variable-names. Then you can import all the assembled data into a single data-frame in R-Studio.
Once it is in R-Studio, you will be able to do a series of operations on the data.
In R-Studio I will show you how to keep most of this data in a single data-frame, so that you can export it from R-Studio and use the data in a spreadsheet or word-processing file—in other words, your final report for the semester.

How to do this?
1.) Make sure you have the following files downloaded and unzipped:
a) 2010 Decennial Census Table P5
b) The shapefile for your county, from the same Factfinder page as your Table P5
c) ACS 2010 Table DP-02
d) ACS 2010 Table DP-03.

2.) Make sure your files are clean.
The second row and beyond should just be tract-data. No summary data, no extra descriptions.
The final row should be the last tract in your county. No sums, stats, and especially no text.

3.) We will create a master-csv by combining selected data from all three files. Begin by using “Save As…” so save one of the files as a new master-csv named {yourcounty}2010.csv
I began with Table P5, but you could begin with any of the tables because they all have some data that will be included in the final master-csv.
[NOTE: I am using the programmer’s convention of curly-braces {} to represent terms that you yourself must determine. In this case, you must determine the name of your county.]
The software may complain that the CSV format does not support many of the features of a spreadsheet. That is true, but for now you are gathering raw data together so that you can import it into R-Studio.

4.) Rename selected column names per the table shown below. We will use the variable naming conventions used explained on the linked page.
In spreadsheet-software, column names appear on row #1, which is confusing because all the rows below that–from row #2 down to the last tract–are all data. In R-Studio, the names in the first row will become the names of the variables, and appear as column headers.

Original & new Column Name	Description
GEO.id → GEOcode	Unique geo-identifier for this tract with all Census datasets.
D001 → TotTr	The total population in each tract.
D003 → WhtTr	Population of non-Latino whites in each tract.
D004 → BlkTr	Population of non-Latino African-Americans in each tract.
D005 → NtvTr	Pop. non-Latino Native Americans / Alaska Natives, ea. tract.
D006 → AsnTr	Population of non-Latino Asians in each tract.
D007 → ApiTr	Pop. of non-Latino Native Hawai’ian / Pacific Islanders
D008 → OthTr	Population of “Others” (Martians, Jedi, White Walkers…)
D009 → MltTr	Population of non-Latino multi-racial residents, each tract.
D010 → LatTr	Population of Latinos (all races) in each tract.

Delete all the other columns that were inherited from P5: GEO.id2, GEO.display-label, D002, and D011 through D017.

5.) Now open Table DP02. There are ten more columns of data I want you to copy from DP02 and paste into {yourcounty}2010.csv: In the tables below, the upper row is the name that these columns have in DP02. copy the data from these columns, paste it into {yourcounty}2010.csv, and rename them according to the names in the second row on the table:

Original → final “master-csv” variable name	Description
HC01_VC20 → avHHsiz	Average household size
HC01_VC86 → ED_noGED	Pop. without a GED in the tract.
HC01_VC87 → ED_GED	Pop. with a High School / GED but no higher in tract.
HC01_VC89 → ED_AA	Pop. with Associates’ Degree, but no higher, in tract.
HC01_VC90 → ED_BA	Pop. with a Bachelors Degree, but no higher, in tract.
HC01_VC91 → ED_transBA	Pop. with post-graduate degrees.
HC01_VC134 → FgnBorn	# foreign-born people in each tract.
HC01_VC167 → HmLng_onlyEN	Language spoken at home: English only.
HC01_VC168 → HmLng_noEN	Language spoken at home: no English
HC01_VC171 → HmLng_ES	Language spoken at home: Spanish

Once you have pasted all the data from these columns into {yourcounty}2010.csv, close Table DP02. Don’t save any changes to DP02 when closing it. We’ll leave DP02 in the same, original condition.

6.) Open Table DP03. There are 19 variables to copy from this table into {yourcounty}2010.csv:

Original → final “master-csv” variable name
HC01_VC04 → WrkAge	# ppl over 16 in tract.
HC01_VC05 → Wrking	# people working in the tract.
HC01_VC10 → not_Wrking	# people not working in the tract.
HC01_VC50 → JbPrimary	# People working in Primary Sector (farm, mining)
HC01_VC51 → JbCnstr	# People working in Construction.
HC01_VC52 → JbManuf	# People working in Manufacturing
HC01_VC53 → JbWhlsale	# People working in Wholesale
HC01_VC54 → JbRetail	# People working in Retail
HC01_VC55 → JbTranspo	# People working in Transportation
HC01_VC56 → JbInfo	# People working in Information
HC01_VC57 → JbFIRE	# People working in Finance, Insurance, Real Estate
HC01_VC58 → JbExec	# People working as Executives, other than FIRE
HC01_VC59 → JbED_Health	# People working in Education, Healthcare
HC01_VC60 → JbArts	# People working in in the arts.
HC01_VC61 → JbOther	# People working in other type jobs
HC01_VC62 → JbPubAdmin	# People working in Public Administration
HC01_VC85 → MedIncTr	Median income in each tract.
HC01_VC86 → AvIncTr	Average income in each tract.
HC01_VC115 → PCITr	Per capita income for each tract.

Having fun yet? Save {yourcounty}2010.csv with these additions, and close DP03 without saving any changes to that ACS file.

7.) Lastly, open the unzipped Shapefile for your county. If you have not downloaded this file, go and seek out the 2010 Table P5 (by Tract) from American FactFinder, download, move the .zip archive to your soc393 working folder, and unzip the archive. (Detailed instructions here.) Now, open the very first file within the folder. It is named 140_00.dbf and it is in tabular data format. LibreOffice Calc will open this the same way it will open a .csv file. With Excel, this webpage states that you have to specify that you are opening a database file, using the drop-down options in the “Open…” dialog box. (If someone finds me a better set of instructions, I will revise this page to post them here.) If your software asks “What character encoding should I use?” tell it “Western Europe ISO-8859-1.”

When your software opens the DBF into a spreadsheet, copy the following two columns and paste them into {yourcounty}2010.csv:

NAME,C,90 → TractNum	Tract number.
SHAPE_AREA,N,19,11 → Area_SM	Area of tract in square meters

Okay! Save the file {yourcounty}2010.csv, and quit. Next time I will show some of what you can do in R-Studio with this assembled data.

Here is a recap of what variables you should have. I am organizing this table vertically, because it includes too many variables to be displayed horizontally. Please use this as a checklist, but DO NOT misinterpret this as a recommendation to turn your table sideways!

Name of the original file where the data came from:	Original variable name from the original file:	Final name in your “master” CSV:
DEC_10_SF1_P5.csv	GEO.id	GEOcode
DEC_10_SF1_P5.csv	D001	TotTr
DEC_10_SF1_P5.csv	D003	WhtTr
DEC_10_SF1_P5.csv	D004	BlkTr
DEC_10_SF1_P5.csv	D005	NtvTr
DEC_10_SF1_P5.csv	D006	AsnTr
DEC_10_SF1_P5.csv	D007	ApiTr
DEC_10_SF1_P5.csv	D008	OthTr
DEC_10_SF1_P5.csv	D009	MltTr
DEC_10_SF1_P5.csv	D010	LatTr
ACS_10_5YR_DP02.csv	HC01_VC20	avHHsiz
ACS_10_5YR_DP02.csv	HC01_VC86	ED_noGED
ACS_10_5YR_DP02.csv	HC01_VC87	ED_GED
ACS_10_5YR_DP02.csv	HC01_VC89	ED_AA
ACS_10_5YR_DP02.csv	HC01_VC90	ED_BA
ACS_10_5YR_DP02.csv	HC01_VC91	ED_transBA
ACS_10_5YR_DP02.csv	HC01_VC134	FgnBorn
ACS_10_5YR_DP02.csv	HC01_VC167	HmLng_onlyEN
ACS_10_5YR_DP02.csv	HC01_VC168	HmLng_noEN
ACS_10_5YR_DP02.csv	HC01_VC171	HmLng_ES
ACS_10_5YR_DP03.csv	HC01_VC04	Wrk_Age
ACS_10_5YR_DP03.csv	HC01_VC05	Wrking
ACS_10_5YR_DP03.csv	HC01_VC10	not_Wrking
ACS_10_5YR_DP03.csv	HC01_VC50	JbPrimary
ACS_10_5YR_DP03.csv	HC01_VC51	JbCnstr
ACS_10_5YR_DP03.csv	HC01_VC52	JbManuf
ACS_10_5YR_DP03.csv	HC01_VC53	JbWhlsale
ACS_10_5YR_DP03.csv	HC01_VC54	JbRetail
ACS_10_5YR_DP03.csv	HC01_VC55	JbTranspo
ACS_10_5YR_DP03.csv	HC01_VC56	JbInfo
ACS_10_5YR_DP03.csv	HC01_VC57	JbFIRE
ACS_10_5YR_DP03.csv	HC01_VC58	JbExec
ACS_10_5YR_DP03.csv	HC01_VC59	JbED_Health
ACS_10_5YR_DP03.csv	HC01_VC60	JnArts
ACS_10_5YR_DP03.csv	HC01_VC61	JbOther
ACS_10_5YR_DP03.csv	HC01_VC62	JbPubAdmin
ACS_10_5YR_DP03.csv	HC01_VC85	MedIncTr
ACS_10_5YR_DP03.csv	HC01_VC86	AvIncTr
ACS_10_5YR_DP03.csv	HC01_VC115	PCITr
140_00.dbf (inside Shapefile folder)	NAME,C,90	TractNum
140_00.dbf	SHAPE_AREA,N,19,11	Area_SM