Create IPUMS USA Data Extracts

Below we provide examples in R, Python and curl showing how to work with the IPUMS API to create and manage USA data extracts.

Get your API key from your IPUMS user account management page at https://account.ipums.org/api_keys.

Load Libraries and Set Key

The Python examples use the ipumspy client library. You will need to install the ipumspy module into your Python environment if you haven’t already. You may also want to reference the ipumspy documentation site.

The R examples use the ipumsr R package. For instructions on installing ipumsr, as well as an overview of ipumsr API functionality, see the API vignette. You may also want to reference the ipumsr documentation site.

import os

from ipumspy import IpumsApiClient, UsaExtract

# create a new API client, reading the API key from the IPUMS_API_KEY environment variable
ipums = IpumsApiClient(os.environ["IPUMS_API_KEY"])
library(ipumsr)
set_ipums_api_key("YOUR_API_KEY_HERE")
# set the IPUMS_API_KEY environment variable using bash shell
export IPUMS_API_KEY=YOUR_API_KEY_HERE

Submit a Data Extract Request

To submit a data extract request, either use the Python or R client library to construct an extract object, or construct a JSON payload manually if you are using a tool like curl. Once the request is formed, submit it to the API.

The names to use for samples and variables in the data extract request can be discovered on our website.

# submit an extract request to the Microdata Extract API
extract = UsaExtract(
    ["us2018a", "us2019a"],
    ["AGE", "SEX", "RACE", "STATEFIP"],
    description = "My IPUMS USA API-submitted extract"
)
ipums.submit_extract(extract)
# create a new extract object
extract_definition <- define_extract_usa(
    "This is an example extract to submit via API.",
    c("us2018a","us2019a"),
    c("AGE","SEX","RACE","STATEFIP")
)

# submit the extract to IPUMS USA for processing
submitted_extract <- submit_extract(extract_definition)

# access the extract number, stored in the return value of submit_extract
submitted_extract$number
# construct the JSON payload manually and submit it 
curl --location --request POST 'https://api.ipums.org/extracts?collection=usa&version=beta' \
--header 'Authorization: YOUR_API_KEY_HERE' \
--header 'Content-Type: application/json' \
--data-raw '{
    "description": "my extract",
    "data_structure": { 
        "rectangular": {
            "on": "P"
        }
    },
    "data_format": "fixed_width",
    "samples": {
      "us2018a": {},
      "us2019a": {}
    },
    "variables":{
      "AGE": {},
      "SEX": {},
      "RACE": {},
      "STATEFIP": {}
    }
}'

# A successful request will return a response that includes an extract number in the number attribute:

{"data_structure":{"rectangular":{"on":"P"}},"data_format":"fixed_width","description":"my extract","samples":{"us2018a":{},"us2019a":{}},"variables":{"YEAR":{"preselected":true},"SAMPLE":{"preselected":true},"SERIAL":{"preselected":true},"CBSERIAL":{"preselected":true},"HHWT":{"preselected":true},"CLUSTER":{"preselected":true},"STATEFIP":{},"STRATA":{"preselected":true},"GQ":{"preselected":true},"PERNUM":{"preselected":true},"PERWT":{"preselected":true},"SEX":{},"AGE":{},"RACE":{}},"download_links":{},"number":25,"status":"queued"}
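If you are scripting without a client library, the JSON payload from the curl example can also be assembled programmatically rather than typed by hand. The following is a minimal sketch using only the Python standard library. The helper name build_usa_payload is hypothetical (it is not part of ipumspy), and the field names simply mirror the example request above.

```python
import json


def build_usa_payload(samples, variables, description):
    """Assemble a request body shaped like the curl example's JSON payload.

    Hypothetical helper for illustration; it is not part of ipumspy.
    """
    return {
        "description": description,
        "data_structure": {"rectangular": {"on": "P"}},
        "data_format": "fixed_width",
        # each sample and variable name maps to an empty options object
        "samples": {name: {} for name in samples},
        "variables": {name: {} for name in variables},
    }


payload = build_usa_payload(
    ["us2018a", "us2019a"],
    ["AGE", "SEX", "RACE", "STATEFIP"],
    "my extract",
)
# serialized text suitable for curl's --data-raw argument
body = json.dumps(payload, indent=2)
print(body)
```

Generating the payload from lists keeps the sample and variable names in one place, which may help when the same request is reused across several scripts.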

Checking a Request’s Status

After submitting your extract request, you can use the Python and R client library functions to monitor the request’s status, or query the API directly with the extract’s number.

# check status of a request
extract_status = ipums.extract_status(extract)
# This will update the submitted_extract object with the latest extract status
submitted_extract <- get_extract_info(submitted_extract)

# Or, if you didn't capture the return value of submit_extract, but you know the extract number is 10, you can use:
submitted_extract <- get_extract_info("usa:10")

# Then you can check its status like this:
# status will be one of: `queued`, `started`, `produced`, `canceled`, `failed` or `completed`
status <- submitted_extract$status

# Or if you just want to know if the extract is ready to download (TRUE or FALSE):
is_extract_ready(submitted_extract)
curl --request GET 'https://api.ipums.org/extracts/25?collection=usa&version=beta' --header 'Content-Type: application/json' --header 'Authorization: YOUR_API_KEY_HERE'

# A successful request will provide a response object like below. The exact fields may vary depending on how far along the extract is in processing.
# You will get a status such as `queued`, `started`, `produced`, `canceled`, `failed` or `completed` in the status field.
# Here the extract is started and some of the ancillary files, like codebooks and syntax files, are already available, but the data file itself is not yet available.

{"data_structure":{"rectangular":{"on":"P"}},"data_format":"fixed_width","description":"my extract","samples":{"us2018a":{},"us2019a":{}},"variables":{"YEAR":{"preselected":true},"SAMPLE":{"preselected":true},"SERIAL":{"preselected":true},"CBSERIAL":{"preselected":true},"HHWT":{"preselected":true},"CLUSTER":{"preselected":true},"STATEFIP":{},"STRATA":{"preselected":true},"GQ":{"preselected":true},"PERNUM":{"preselected":true},"PERWT":{"preselected":true},"SEX":{},"AGE":{},"RACE":{}},"download_links":{"ddi_codebook":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.xml","bytes":89795,"sha256":"c308d26e093bf9b212f5cce679af3f19115d353a9581660090f102663b9e560e"},"spss_command_file":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.sps","bytes":18696,"sha256":"1e355d825e451c2e589ebee5c3b86f097a153da5590d8eb74baa1b9bdff428b7"},"basic_codebook":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.cbk","bytes":24569,"sha256":"f47edbd827e0babdd9b5533ff6f2147f15e2599f2e855af57c99d07e016db2c0"},"stata_command_file":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.do","bytes":34101,"sha256":"e94fe2944f97874ae2cd3da52464cea9771b4b1dd3e3efddb382fcbbe7d97e0c"},"R_command_file":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.R","bytes":406,"sha256":"9b88924c1f57af38795ca4e6c57d09ecbfb6f5bd3a296cf19e6ab726329c369f"},"sas_command_file":{"url":"https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.sas","bytes":18095,"sha256":"a621a2b064d50dd860e6dd200628ca57e0bfe48ac81ba1c158ad711cd154ee34"}},"number":25,"status":"started"}
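When calling the API directly, the status response is plain JSON, so the extract number, its status, and the files available so far can be pulled out with a few lines of code. This is a minimal sketch using the Python standard library. The function name summarize_status is hypothetical, and the example input is an abbreviated stand-in for the response above, not real API output.

```python
import json


def summarize_status(response_text):
    """Return (number, status, available file keys) from a status response."""
    info = json.loads(response_text)
    links = info.get("download_links", {})
    return info["number"], info["status"], sorted(links)


# abbreviated stand-in for the response above (URLs and hashes omitted)
example = json.dumps({
    "number": 25,
    "status": "started",
    "download_links": {
        "ddi_codebook": {"bytes": 89795},
        "basic_codebook": {"bytes": 24569},
    },
})

number, status, files = summarize_status(example)
print(f"extract {number} is {status}; files so far: {', '.join(files)}")
```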

The client libraries also provide some convenience functions so that you don’t have to manually poll your extract’s status.

# convenience method that will wait until extract processing is complete before returning.
ipums.wait_for_extract(extract)
# This function call will wait until extract processing is complete before returning.
# By default, the function will periodically print messages while it waits for the extract to finish processing. Set argument verbose to FALSE if you want it to wait silently.
# The return value will include URLs for your extract files, so it is useful to capture that value by naming it.
downloadable_extract <- wait_for_extract(submitted_extract)
# This feature is not available without a client library.
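Without a client library, similar waiting behavior can be approximated with a small polling loop. The sketch below is one possible approach, not how ipumspy or ipumsr implement it. The function poll_until_done and its fetch_status argument are hypothetical, and treating `completed`, `failed` and `canceled` as the final statuses is an assumption based on the status names listed above.

```python
import time

# statuses after which the extract is assumed not to change further;
# this grouping is an assumption based on the status names listed above
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def poll_until_done(fetch_status, interval_seconds=30.0, timeout_seconds=3600.0):
    """Call fetch_status() repeatedly until it returns a terminal status.

    fetch_status is a hypothetical caller-supplied function that performs the
    status request and returns the value of the "status" field.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"extract still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)


# demonstration with canned statuses instead of real API calls
canned = iter(["queued", "started", "produced", "completed"])
final_status = poll_until_done(lambda: next(canned), interval_seconds=0)
print(final_status)
```

Passing the status request in as a function keeps the loop independent of the HTTP tool you use, and a generous interval avoids sending requests more often than necessary.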

Retrieving Your Extract

To retrieve a completed extract, we will once again use client library helper functions, or we can do so directly with the API using the extract’s number.

# download an extract to the current working directory
# you can use the optional download_dir="DIRECTORY_NAME" parameter
# to specify a different location
ipums.download_extract(extract)
# This will save the extract files to the current directory
# use the download_dir argument to specify a different location
# The return value is the path to the DDI codebook file, which can then be passed to read_ipums_micro to read the data
path_to_ddi_file <- download_extract(downloadable_extract)
data <- read_ipums_micro(path_to_ddi_file)
# download the data file using the link that came back in the extract status response once the extract completed
curl -H "Authorization: YOUR_API_KEY_HERE" https://api.ipums.org/downloads/usa/api/v1/extracts/1234567/usa_00025.dat.gz > my_ipums_usa_extract_25_dat.gz
# repeat for the other files e.g. codebook etc...
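The status response lists a `bytes` and a `sha256` value for each file, which suggests a way to check a manual download. The sketch below assumes that `sha256` is the hexadecimal SHA-256 digest of the file contents and that `bytes` is the file size. The function name verify_download is hypothetical, and the demonstration uses a throwaway file rather than a real extract.

```python
import hashlib
import os
import tempfile


def verify_download(path, expected_bytes, expected_sha256):
    """Return True if the file's size and SHA-256 hex digest both match."""
    if os.path.getsize(path) != expected_bytes:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # read in chunks so large data files need not fit in memory
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# demonstration on a throwaway file rather than a real extract
contents = b"example contents"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(contents)
    demo_path = tmp.name

demo_ok = verify_download(
    demo_path,
    expected_bytes=len(contents),
    expected_sha256=hashlib.sha256(contents).hexdigest(),
)
os.remove(demo_path)
print(demo_ok)
```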

You are now ready for further processing and analysis.

Get a Listing of Recent Extract Requests

You may also find it useful to get a historical listing of your extract requests.

# get a list of previous extracts as a list of extract definition dicts
# by default it will return the 10 most recent extracts. This can be overridden
# with the limit=## parameter. 
recent_extracts = ipums.retrieve_previous_extracts('usa')

# display extract IDs and descriptions
for ex in recent_extracts:
    print(f"{ex['number']}: {ex['description']}")

# produces output like:
# 34: My recent IPUMS extract
# 33: Added 2018 sample
# 32: Added more demographic variables
# 31: Revision of (Another extract)
# 30: Another extract
# 29: Revision of (Added SEX and RACE variables)
# 28: Added SEX and RACE variables
# 27: Added AGE variable
# 26: Revision of (My first extract)
# 25: My first extract
    
# re-download an extract to the current working directory
# this assumes the data files haven't been purged from IPUMS servers yet,
# which is done regularly to conserve disk space
ipums.download_extract(extract="34", collection="usa")

# Note: ipumspy does not have support for "reloading" a former extract into an extract object for revision, nor does it yet have support for resubmission.
# By default, returns the 10 most recent extracts. Use the how_many argument to override.
recent_extracts <- get_recent_extracts_info_list("usa", how_many = 5)

# recent_extracts is a list of extract objects:
# [[1]]
# Submitted IPUMS USA extract number 17
# Samples: us2018a, us2019a
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...
# [[2]]
# Submitted IPUMS USA extract number 16
# Samples: us2018a, us2019a
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...
# [[3]]
# Submitted IPUMS USA extract number 15
# Samples: us1850a, us1860a, us1870a, us1880d, us1900j...
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...
# [[4]]
# Submitted IPUMS USA extract number 14
# Samples: us1850a, us1860a, us1870a, us1880d, us1900j...
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...
# [[5]]
# Submitted IPUMS USA extract number 13
# Samples: us2013a
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, NUMPREC...

# If you'd rather have a tibble / data.frame of extract definitions, use:
recent_extracts_tbl <- get_recent_extracts_info_tbl("usa", how_many = 5)
curl -X GET \
  'https://api.ipums.org/extracts?collection=usa&version=beta' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: YOUR_API_KEY_HERE'

# If you omit an extract number in your API call, by default this will return the 10 most recent extract requests. To adjust the number returned, you may optionally add a `limit=##` query parameter to get the ## most recent extracts instead.
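Because the listing URL carries several query parameters, it can be less error-prone to build it in code than to concatenate strings by hand. The following is a minimal sketch using the Python standard library. The helper name build_listing_url is hypothetical, and the parameter names are taken from the curl example and the note above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.ipums.org/extracts"


def build_listing_url(collection="usa", version="beta", limit=None):
    """Build the URL for listing recent extracts, optionally with a limit."""
    params = {"collection": collection, "version": version}
    if limit is not None:
        params["limit"] = limit
    return f"{BASE_URL}?{urlencode(params)}"


print(build_listing_url())
print(build_listing_url(limit=5))
```

When pasting the resulting URL into a shell command, keep it in quotes so that the `&` characters are not interpreted by the shell.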

Revising and Resubmitting a Prior Extract (R/ipumsr-specific functionality)

The R client also has specific support for modifying and resubmitting prior extracts.

# This feature is not applicable without the ipumsr library.
# To pull down the definition of USA extract number 17
old_extract <- get_extract_info("usa:17")
# Note that there are no spaces before or after the colon, and that the extract number doesn't need to be zero-padded.

# old_extract is now:
print(old_extract)
# Submitted IPUMS USA extract number 17
# Samples: us2018a, us2019a
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...

# full list of variables:
old_extract$variables
# [1] "YEAR"     "SAMPLE"   "SERIAL"   "CBSERIAL" "HHWT"     "CLUSTER"  "STATEFIP" "STRATA"   "GQ"       "PERNUM"   "PERWT"    "SEX"      "AGE"      "RACE"

revised_extract <- revise_extract_micro(
    old_extract,
    samples = "us2017a",
    vars = "EDUC"
)

revised_extract <- remove_from_extract(
    revised_extract,
    vars = "RACE"
)

# revised_extract has been reset to an unsubmitted state:
print(revised_extract)
# Unsubmitted IPUMS USA extract
# Samples: us2018a, us2019a, us2017a
# Variables: YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT...

# revised extract has the updated samples and variables
revised_extract$samples
# [1] "us2018a" "us2019a" "us2017a"
revised_extract$variables
# [1] "YEAR"     "SAMPLE"   "SERIAL"   "CBSERIAL" "HHWT"     "CLUSTER"  "STATEFIP" "STRATA"   "GQ"       "PERNUM"   "PERWT"    "SEX"      "AGE"      "EDUC"

newly_submitted_extract <- submit_extract(revised_extract)