Create IPUMS NHGIS Data Extracts
Below we provide examples in R, Python and curl showing how to work with the IPUMS API to create and manage NHGIS data extracts.
Get your key from https://account.ipums.org/api_keys. Make sure to replace MY_KEY (all caps) in the snippets below with your key.
Load Libraries and Set Key
For R, you may have to install the httr and jsonlite libraries if they are not already installed.
import requests
import json
from pprint import pprint

my_key = "MY_KEY"
library(httr)
library(jsonlite)

my_key <- "MY_KEY"
export MY_KEY=MY_KEY # set the MY_KEY environment variable in a bash shell
Submit a Data Extract Request
To submit a data extract request, you need to pass a valid JSON-formatted extract request in the body of your POST. The names to use for values in the data extract request can be discovered via our metadata API endpoints.
Data Extract Request Fields
datasets
: An object where each key is the name of a requested dataset and each value is another object describing your selections for that dataset:

    data_tables
    : (Required) A list of selected data table names.

    geog_levels
    : (Required) A list of selected geographic level names.

    years
    : A list of selected years. To select all years, use ["*"]. Only required when the dataset has multiple years.

    breakdown_values
    : A list of selected breakdown values. Defaults to the first breakdown value. If more than one is selected, then specify breakdown_and_data_type_layout at the root of the request body.

time_series_tables
: An object where each key is the name of a requested time series table and each value is another object describing your selections for that time series table:

    geog_levels
    : (Required) A list of selected geographic level names.

shapefiles
: A list of selected shapefiles.

description
: A short description of your extract.

data_format
: The requested format of your data. Valid choices are csv_no_header, csv_header, and fixed_width. csv_header adds a second, more descriptive header row; contrary to the name, csv_no_header still provides a minimal header in the first row. Required when any datasets or time_series_tables are selected.

breakdown_and_data_type_layout
: The layout of your dataset data when multiple data types or breakdown combos are present. Valid choices are separate_files (split each data type or breakdown combo into its own file) and single_file (keep all data types and breakdown combos in one file). Required when a dataset has multiple breakdowns or data types.

time_series_table_layout
: The layout of your time series table data. Valid choices are time_by_column_layout, time_by_row_layout, and time_by_file_layout. Required when any time series tables are selected. See the NHGIS documentation for more information.

geographic_extents
: A list of geographic_instances to use as extents for all datasets in this request. To select all extents, use ["*"]. Only applies to geographic levels where has_geog_extent_selection is true. Required when a geographic level on a dataset is specified where has_geog_extent_selection is true.
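To see how the required and optional fields fit together, here is a minimal sketch in Python that assembles a request body containing only one dataset with its required data_tables and geog_levels, plus the data_format that becomes required once any datasets are selected. The helper function is our own illustration, not part of the API; the dataset and table names are taken from the examples below.

```python
import json

def build_minimal_request(dataset, data_tables, geog_levels,
                          data_format="csv_header"):
    """Assemble a minimal NHGIS extract request body: per-dataset
    data_tables and geog_levels are required, and data_format is
    required whenever any datasets are selected."""
    return {
        "datasets": {
            dataset: {
                "data_tables": data_tables,
                "geog_levels": geog_levels,
            }
        },
        "data_format": data_format,
        "description": "minimal example",
    }

payload = build_minimal_request("2000_SF1b", ["NP001A"], ["blck_grp"])
print(json.dumps(payload, indent=2))
```

A payload built this way can be passed directly as the json argument of requests.post, as in the full example below.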
my_headers = {"Authorization": my_key}
url = "https://api.ipums.org/extracts/?collection=nhgis&version=v1"
er = """
{
"datasets": {
"1988_1997_CBPa": {
"years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
"breakdown_values": ["bs30.si0762", "bs30.si2026"],
"data_tables": [
"NT001"
],
"geog_levels": [
"county"
]
},
"2000_SF1b": {
"data_tables": [
"NP001A"
],
"geog_levels": [
"blck_grp"
]
}
},
"time_series_tables": {
"A00": {
"geog_levels": [
"state"
]
}
},
"shapefiles": [
"us_state_1790_tl2000"
],
"time_series_table_layout": "time_by_file_layout",
"geographic_extents": ["010"],
"data_format": "csv_no_header",
"description": "sample6",
"breakdown_and_data_type_layout": "single_file"
}
"""
result = requests.post(url, headers=my_headers, json=json.loads(er))
my_extract_number = result.json()["number"]
print(my_extract_number)
# Results
9
url <- "https://api.ipums.org/extracts/?collection=nhgis&version=v1"
mybody <- '
{
"datasets": {
"1988_1997_CBPa": {
"years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
"breakdown_values": ["bs30.si0762", "bs30.si2026"],
"data_tables": [
"NT001"
],
"geog_levels": [
"county"
]
},
"2000_SF1b": {
"data_tables": [
"NP001A"
],
"geog_levels": [
"blck_grp"
]
}
},
"time_series_tables": {
"A00": {
"geog_levels": [
"state"
]
}
},
"shapefiles": [
"us_state_1790_tl2000"
],
"time_series_table_layout": "time_by_file_layout",
"geographic_extents": ["010"],
"data_format": "csv_no_header",
"description": "sample6",
"breakdown_and_data_type_layout": "single_file"
}
'
mybody_json <- fromJSON(mybody, simplifyVector = FALSE)
result <- POST(url, add_headers(Authorization = my_key), body = mybody_json, encode = "json", verbose())
res_df <- content(result, "parsed", simplifyDataFrame = TRUE)
my_number <- res_df$number
my_number
# Results
[1] 9
curl -X POST \
"https://api.ipums.org/extracts/?collection=nhgis&version=v1" \
-H "Content-Type: application/json" \
-H "Authorization: $MY_KEY" \
-d '
{
"datasets": {
"1988_1997_CBPa": {
"years": ["1988", "1989", "1990", "1991", "1992", "1993", "1994"],
"breakdown_values": ["bs30.si0762", "bs30.si2026"],
"data_tables": [
"NT001"
],
"geog_levels": [
"county"
]
},
"2000_SF1b": {
"data_tables": [
"NP001A"
],
"geog_levels": [
"blck_grp"
]
}
},
"time_series_tables": {
"A00": {
"geog_levels": [
"state"
]
}
},
"shapefiles": [
"us_state_1790_tl2000"
],
"time_series_table_layout": "time_by_file_layout",
"geographic_extents": ["010"],
"data_format": "csv_no_header",
"description": "sample6",
"breakdown_and_data_type_layout": "single_file"
}
'
A successful request will return a response that includes an extract number in the number attribute:
{
"number": 49,
"data_format": "csv_no_header",
"description": "sample6",
"status": "queued",
"download_links": null,
"datasets": {
"1988_1997_CBPa": {
"years": [
"1988",
"1989",
"1990",
"1991",
"1992",
"1993",
"1994"
],
"breakdown_values": [
"bs30.si0762",
"bs30.si2026"
],
"data_tables": [
"NT001"
],
"geog_levels": [
"county"
]
},
"2000_SF1b": {
"data_tables": [
"NP001A"
],
"geog_levels": [
"blck_grp"
]
}
},
"time_series_tables": {
"A00": {
"geog_levels": [
"state"
]
}
},
"time_series_table_layout": "time_by_file_layout",
"shapefiles": [
"us_state_1790_tl2000"
],
"geographic_extents": [
"010"
],
"breakdown_and_data_type_layout": "single_file"
}
Get a Request’s Status
After submitting your extract request, you can use the extract number to retrieve the request’s status. Here we’re retrieving status for extract number 743.
r = requests.get(
"https://api.ipums.org/extracts/743?collection=nhgis&version=v1",
headers=my_headers
)
pprint(r.json())
{'data_format': 'csv_header',
'datasets': {'1790_cPop': {'data_tables': ['NT2'], 'geog_levels': ['state']}},
'description': 'testing123',
'download_links': {'codebook_preview': 'https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip',
'table_data': 'https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip'},
'number': 743,
'status': 'completed',
'time_series_table_layout': 'time_by_row_layout',
'time_series_tables': {'B79': {'geog_levels': ['state']}}}
data_extract_status_res <- GET("https://api.ipums.org/extracts/743?collection=nhgis&version=v1", add_headers(Authorization = my_key))
des_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
des_df
$data_format
[1] "csv_header"
$description
[1] "testing123"
$time_series_table_layout
[1] "time_by_row_layout"
$datasets
$datasets$`1790_cPop`
$datasets$`1790_cPop`$data_tables
$datasets$`1790_cPop`$data_tables[[1]]
[1] "NT2"
$datasets$`1790_cPop`$geog_levels
$datasets$`1790_cPop`$geog_levels[[1]]
[1] "state"
$time_series_tables
$time_series_tables$B79
$time_series_tables$B79$geog_levels
$time_series_tables$B79$geog_levels[[1]]
[1] "state"
$number
[1] 743
$status
[1] "completed"
$download_links
$download_links$codebook_preview
[1] "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip"
$download_links$table_data
[1] "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip"
curl -X GET "https://api.ipums.org/extracts/743?collection=nhgis&version=v1" -H "Content-Type: application/json" -H "Authorization: $MY_KEY"
# response:
{
"data_format": "csv_header",
"description": "testing123",
"time_series_table_layout": "time_by_row_layout",
"datasets": {
"1790_cPop": {
"data_tables": [
"NT2"
],
"geog_levels": [
"state"
]
}
},
"time_series_tables": {
"B79": {
"geog_levels": [
"state"
]
}
},
"number": 743,
"status": "completed",
"download_links": {
"codebook_preview": "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv_PREVIEW.zip",
"table_data": "https://demo.data2.nhgis.org/extracts/325460ab-055e-11e5-9e17-9c961dceb418/743/nhgis0743_csv.zip"
}
}
You will get a status of queued, started, produced, canceled, failed, or completed.
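Since queued, started, and produced all mean the extract is still in progress, a common pattern is to poll the status endpoint until a terminal status is reached. A minimal sketch using the requests library (the function names and polling interval are our own choices, not part of the API):

```python
import time
import requests

# Statuses after which the extract will not change on its own.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}

def is_terminal(status):
    """True once the extract has finished, failed, or been canceled."""
    return status in TERMINAL_STATUSES

def wait_for_extract(extract_number, headers, poll_seconds=30):
    """Poll the extract's status endpoint until it reaches a terminal
    status, then return that status."""
    url = (f"https://api.ipums.org/extracts/{extract_number}"
           "?collection=nhgis&version=v1")
    while True:
        status = requests.get(url, headers=headers).json()["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

For example, wait_for_extract(743, my_headers) would block until extract 743 finishes and return "completed" (or "failed"/"canceled").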
Retrieving Your Extract
To retrieve a completed extract (using extract number 743 as the example again):
1. Using the request status query above, wait until the status is completed.
2. Extract the download URL from the response, which is in the download_links attribute:
r = requests.get(
"https://api.ipums.org/extracts/743?collection=nhgis&version=v1",
headers=my_headers
)
extract = r.json()
my_extract_links = extract["download_links"]
data_extract_status_res <- GET("https://api.ipums.org/extracts/743?collection=nhgis&version=v1", add_headers(Authorization = my_key))
des_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
des_df$download_links
curl -X GET \
"https://api.ipums.org/extracts/743?collection=nhgis&version=v1" \
-H "Content-Type: application/json" \
-H "Authorization: $MY_KEY"
The download_links portion of the response will look like:
"download_links": {
"codebook_preview": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv_PREVIEW.zip",
"table_data": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv.zip",
"gis_data": "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_shape.zip"
},
Next, retrieve the file(s) from the URL. You will need to pass the Authorization header with your API key to the server in order to download the data.
# get the file from the URL and write it out to a local file
r = requests.get(my_extract_links["table_data"], allow_redirects=True, headers=my_headers)
with open("nhgis0061_csv.zip", "wb") as f:
    f.write(r.content)
# Retrieve the file from the URL and read it into R using the ipumsr
# library (https://cran.r-project.org/web/packages/ipumsr/index.html).
# Import the ipumsr library
library(ipumsr)
# Download table data and read into a data frame
# Destination file
zip_file <- "NHGIS_tables.zip"
# Download extract to destination file
download.file(des_df$download_links$table_data, zip_file, headers=c(Authorization=my_key))
# List extract files in ZIP archive
unzip(zip_file, list=TRUE)
# Read 2000 block-group CSV file into a data frame
bg2000_table <- read_nhgis(zip_file, data_layer = contains("2000_blck_grp.csv"))
head(bg2000_table)
curl -H "Authorization: $MY_KEY" "https://api.ipums.org/downloads/nhgis/api/v1/extracts/9123456/nhgis0033_csv.zip" > mydata.zip
Now you are ready for further processing and analysis as you desire.
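The downloaded ZIP archive contains codebook text files alongside the CSV tables, so a small helper to pick out the table files can be handy before loading them. The helper below and the pandas usage sketch are illustrative (the file name matches the Python download example above; pandas is an assumed extra dependency):

```python
def csv_members(names):
    """Select the CSV table files from an extract archive listing,
    skipping the codebook .txt files packaged alongside them."""
    return sorted(n for n in names if n.lower().endswith(".csv"))

# Usage, assuming the ZIP downloaded above and pandas are available:
# import zipfile
# import pandas as pd
# with zipfile.ZipFile("nhgis0061_csv.zip") as zf:
#     df = pd.read_csv(zf.open(csv_members(zf.namelist())[0]))
```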
Get a Listing of Recent Extract Requests
You may also find it useful to get a historical listing of your extract requests. If you omit an extract number in your API call, the API returns your 10 most recent extract requests by default. To adjust the amount returned, you may optionally add a ?limit=## parameter to get the ## most recent extracts instead.
r = requests.get(
"https://api.ipums.org/extracts?collection=nhgis&version=v1",
headers=my_headers
)
pprint(r.json()[0:5])
[{'data_format': 'csv_header',
'datasets': {'1790_cPop': {'data_tables': ['NT2'], 'geog_levels': ['state']}},
'description': 'testing123',
'download_links': {},
'number': 61,
'status': 'started'},
{'data_format': 'csv_header',
'datasets': {'2006_2010_ACS5a': {'data_tables': ['B01001', 'B15002'],
'geog_levels': ['state']}},
'description': 'test',
'download_links': {},
'number': 60,
'status': 'completed'},
{'data_format': 'csv_header',
'datasets': {'2009_2013_ACS5a': {'data_tables': ['B25003'],
'geog_levels': ['puma']}},
'description': 'Revision of 56: PUMA in 2013 5-year file',
'download_links': {},
'number': 59,
'status': 'completed'},
{'data_format': 'csv_header',
'datasets': {'2017_ACS1': {'data_tables': ['B01001'],
'geog_levels': ['nation']}},
'description': '',
'download_links': {},
'number': 58,
'status': 'completed'},
{'data_format': 'csv_header',
'datasets': {'2009_2013_ACS5a': {'data_tables': ['B25003'],
'geog_levels': ['puma']}},
'description': 'PUMA in 2013 5-year file',
'download_links': {},
'number': 56,
'status': 'completed'}]
for extract in r.json():
    if extract["number"] == my_extract_number:
        my_extract_status = extract["status"]
        break

print(my_extract_status)
data_extract_status_res <- GET("https://api.ipums.org/extracts?collection=nhgis&version=v1", add_headers(Authorization = my_key))
de10_df <- content(data_extract_status_res, "parsed", simplifyDataFrame = TRUE)
de10_df[,c("number","status","description")]
curl -X GET \
"https://api.ipums.org/extracts?collection=nhgis&version=v1" \
-H "Content-Type: application/json" \
-H "Authorization: $MY_KEY"
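To make the optional limit parameter concrete, here is a small helper (our own convenience function, not part of the API) that appends it to the listing URL when a cap is requested:

```python
def listing_url(limit=None):
    """Build the NHGIS extract-listing URL, optionally capping how many
    recent extract requests are returned via the limit parameter."""
    url = "https://api.ipums.org/extracts?collection=nhgis&version=v1"
    if limit is not None:
        url += f"&limit={limit}"
    return url

# e.g. fetch the 20 most recent extract requests:
# r = requests.get(listing_url(20), headers=my_headers)
```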