The State Cancer Profiles website hosts visualization and data exploration tools for cancer incidence and mortality in the United States. The website works well for casual browsing, but it offers neither bulk downloads nor an API for accessing the underlying data. We therefore provide bulk data scraped from the website for data science applications.
For the TLDR, the data for incidence and mortality are available for download from:
A new release is generated every month with the latest data.
- State cancer profiles scraper
- Contents
- About the code
- About the data
- Using the data
- Example rows from the data
- Local scraper usage
- License
- Contributing
This is a simple web scraper that extracts cancer data from the State Cancer Profiles website. The data is extracted for all states and saved in two CSV files: one for cancer incidence and one for cancer mortality.
The data is extracted from the State Cancer Profiles website, which is a part of the National Cancer Institute. The website provides cancer statistics for all 50 states, the District of Columbia, and Puerto Rico. The data is available for cancer incidence and cancer mortality, and it is based on the SEER and NPCR programs.
The two CSV files use a long format, with each row representing a single observation: one cancer type in one locale (a state or a county) for one combination of year, sex, race, stage, and age group.
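Because every stratifier lives in its own column, picking out a slice of a long-format file comes down to matching column values. The following is a minimal sketch of that idea, not part of the scraper: `select_rows` and `SAMPLE` are hypothetical names, and the sample holds only a handful of columns and values copied from the example rows in this README.

```python
import csv
import io

# A few columns copied from the example rows; the released files have many more.
SAMPLE = """\
locale,state,measurement,sex,race,age_adjusted_rate_per_100_000
Harrison County,Kentucky,incidence,Male,All Races (includes Hispanic),601.1
Lumpkin County,Georgia,incidence,Male,All Races (includes Hispanic),600.1
Boyd County,Kentucky,incidence,Male,All Races (includes Hispanic),600.1
Greenwood County,Kansas,mortality,Both Sexes,White (Non-Hispanic),182
"""

# csv.DictReader yields one dict per row, keyed by the header names.
ROWS = list(csv.DictReader(io.StringIO(SAMPLE)))


def select_rows(rows, **criteria):
    """Keep the rows whose columns equal every requested value."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


if __name__ == "__main__":
    for row in select_rows(ROWS, state="Kentucky", measurement="incidence"):
        print(row["locale"], row["age_adjusted_rate_per_100_000"])
```

All values are read as strings here, so numeric columns need converting before any arithmetic.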
The data are available for download from the releases page, with a link to the latest release here.
Note that R, Python, and many other languages can read CSV files directly from a URL, without downloading them first, which might be a fast and easy way to access the data.
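As a rough illustration in Python, the sketch below streams a gzip-compressed CSV from a URL using only the standard library. The helper name `read_gzipped_csv` is hypothetical, and the URL is the 2025-02-10 incidence file that appears in the query examples in this README; newer releases will have a different date in the path.

```python
import csv
import gzip
import io
from itertools import islice
from urllib.request import urlopen


def read_gzipped_csv(url, limit=None):
    """Read a gzip-compressed CSV from a URL and return its rows as dicts.

    Pass limit to stop after that many rows; None reads the whole file.
    """
    with urlopen(url) as response:
        # GzipFile decompresses the response stream as it is read.
        with gzip.GzipFile(fileobj=response) as unzipped:
            with io.TextIOWrapper(unzipped, encoding="utf-8", newline="") as text:
                return list(islice(csv.DictReader(text), limit))


if __name__ == "__main__":
    url = (
        "https://github.com/seandavi/state-cancer-profile-scraper/releases/"
        "download/2025-02-10/state_cancer_profiles_incidence.csv.gz"
    )
    for row in read_gzipped_csv(url, limit=5):
        print(row["locale"], row["state"], row["age_adjusted_rate_per_100_000"])
```

Libraries such as pandas can usually do the same in a single call; the standard-library version is shown only to keep the example free of dependencies.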
For those who simply want to query the data in place, both DuckDB and ClickHouse can query CSV files directly over HTTP. For example, using DuckDB, you can run the following code to query the data:
install httpfs;
load httpfs;
select
*
from
read_csv('https://github.com/seandavi/state-cancer-profile-scraper/releases/download/2025-02-10/state_cancer_profiles_incidence.csv.gz')
limit 10;
Or, using ClickHouse, start clickhouse-local and run the following query:
select
*
from url('https://github.com/seandavi/state-cancer-profile-scraper/releases/download/2025-02-10/state_cancer_profiles_incidence.csv.gz')
limit 10
settings max_http_get_redirects = 10;
reported_locale | fips | 2023_rural_urban_continuum_codesrural_urban_note | age_adjusted_rate_per_100_000 | lower_ci_rate | upper_ci_rate | ci_rank | lower_ci_rank | upper_ci_rank | average_annual_count | recent_trend | recent_5_year_trend_in_rate | lower_ci_trend_in_rate | upper_ci_trend_in_rate | year | sex | stage | race | cancer | areatype | age | state_fips | measurement | locale_type | _extracted_at | url | percent_of_cases_with_late_stage | locale | state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Harrison County, Kentucky(7) | 21097 | Rural | 601.1 | 537.6 | 670.6 | \N | \N | \N | 71 | stable | -0.6000000000000001 | -1.7000000000000002 | 0.5 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 21 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Harrison County | Kentucky |
Metcalfe County, Kentucky(7) | 21169 | Rural | 601 | 519 | 693.4 | \N | \N | \N | 41 | stable | -0.1 | -1.4 | 1.2 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 21 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Metcalfe County | Kentucky |
Lumpkin County, Georgia(7) | 13187 | Urban | 600.1 | 550.7 | 653.2 | \N | \N | \N | 121 | stable | -0.2 | -1.4 | 1.2 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 13 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Lumpkin County | Georgia |
Boyd County, Kentucky(7) | 21019 | Urban | 600.1 | 561.7 | 640.7 | \N | \N | \N | 192 | stable | 0.1 | -0.5 | 1.7000000000000002 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 21 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Boyd County | Kentucky |
Greene County, Illinois(7) | 17061 | Rural | 599.8 | 525.6 | 682.5 | \N | \N | \N | 50 | stable | -0.4 | -1.7000000000000002 | 0.9 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 17 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Greene County | Illinois |
Washington County, Maine(6) | 23029 | Rural | 599.3 | 554.3 | 647.6 | \N | \N | \N | 152 | stable | 5.4 | -0.6000000000000001 | 9.6 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 23 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Washington County | Maine |
Jones County, North Carolina(6) | 37103 | Rural | 599 | 516.3 | 692.9 | \N | \N | \N | 43 | stable | -0.8 | -2.8 | 1.2 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 37 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Jones County | North Carolina |
Butler County, Kentucky(7) | 21031 | Urban | 598.8 | 522.9 | 683.2 | \N | \N | \N | 49 | stable | -0.7000000000000001 | -2.5 | 1.1 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 21 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Butler County | Kentucky |
Rolette County, North Dakota(6) | 38079 | Rural | 598.7 | 509 | 699.4 | \N | \N | \N | 35 | stable | 0.30000000000000004 | -1.6 | 2.2 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 38 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Rolette County | North Dakota |
Kalkaska County, Michigan(6) | 26079 | Urban | 598.7 | 536.1 | 667.2 | \N | \N | \N | 77 | stable | -0.4 | -2 | 1.2 | Latest 5-year average | Male | All Stages | All Races (includes Hispanic) | All Cancer Sites | By County | All Ages | 26 | incidence | county | 2025-02-10 19:36:38.120091000 | https://statecancerprofiles.cancer.gov/incidencerates/index.php?stateFIPS=00&areatype=county&cancer=001&race=00&stage=999&year=0&sex=1&age=001&type=incd | \N | Kalkaska County | Michigan |
reported_locale | fips | 2023_rural_urban_continuum_codesrural_urban_note | age_adjusted_rate_per_100_000 | lower_ci_rate | upper_ci_rate | ci_rank | lower_ci_rank | upper_ci_rank | average_annual_count | recent_trend | recent_5_year_trend_in_rate | lower_ci_trend_in_rate | upper_ci_trend_in_rate | year | sex | stage | race | cancer | areatype | age | state_fips | measurement | locale_type | _extracted_at | url | locale | state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Greenwood County, Kansas | 20073 | Rural | 182 | 144.9 | 228.2 | \N | 3 | 96 | 19 | stable | -0.9 | -1.9 | 0 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 20 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Greenwood County | Kansas |
Geneva County, Alabama | 1061 | Urban | 182 | 162.1 | 204 | \N | 2 | 54 | 64 | stable | -0.4 | -1.1 | 0.30000000000000004 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 1 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Geneva County | Alabama |
Robertson County, Texas | 48395 | Urban | 182 | 152.5 | 216.6 | \N | 13 | 204 | 31 | falling | -0.9 | -1.7000000000000002 | -0.2 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 48 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Robertson County | Texas |
Van Buren County, Arkansas | 5141 | Rural | 182 | 159.5 | 207.6 | \N | 8 | 73 | 51 | falling | -0.7000000000000001 | -1.2 | -0.1 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 5 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Van Buren County | Arkansas |
Perry County, Missouri | 29157 | Rural | 182 | 159.1 | 207.6 | \N | 8 | 107 | 48 | stable | -0.5 | -1.2 | 0.30000000000000004 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 29 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Perry County | Missouri |
Elbert County, Georgia | 13105 | Rural | 181.9 | 157 | 210.4 | \N | 6 | 137 | 41 | stable | -0.7000000000000001 | -1.4 | 0 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 13 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Elbert County | Georgia |
Ware County, Georgia | 13299 | Rural | 181.8 | 161.7 | 204.1 | \N | 7 | 121 | 62 | stable | -0.5 | -1 | 0 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 13 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Ware County | Georgia |
Nemaha County, Kansas | 20131 | Rural | 181.8 | 151 | 217.8 | \N | 4 | 85 | 28 | stable | 1.7000000000000002 | -0.5 | 12.2 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 20 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Nemaha County | Kansas |
Bryan County, Oklahoma (6, 7) | 40013 | Rural | 181.8 | 165.1 | 199.9 | \N | 18 | 69 | 93 | falling | -1.4 | -2.4 | -0.5 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 40 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Bryan County | Oklahoma |
Cherokee County, Alabama | 1019 | Rural | 181.7 | 162.6 | 202.9 | \N | 2 | 53 | 71 | stable | -0.30000000000000004 | -0.8 | 0.30000000000000004 | Latest 5-year average | Both Sexes | Late Stage (Regional & Distant) | White (Non-Hispanic) | All Cancer Sites | By County | All Ages | 1 | mortality | county | 2025-02-10 19:58:25.293785000 | https://statecancerprofiles.cancer.gov/deathrates/index.php?stateFIPS=00&areatype=county&cancer=001&race=07&stage=211&year=0&sex=0&age=001&type=death | Cherokee County | Alabama |
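The example rows show a few quirks worth handling before analysis: missing values appear as `\N`, some trend values carry floating-point noise (for example `-0.6000000000000001`), and county FIPS codes in states with a leading zero lose it (`1061` for Geneva County, Alabama, whose standard five-digit code is `01061`). The sketch below is one way to tidy these up. The helper names are hypothetical, rounding to one decimal place is an assumption based on the values shown, and whether the CSV files themselves contain `\N` or an empty field for missing values has not been verified here, so both are treated as missing.

```python
# Markers treated as missing: an empty field or the \N shown in the example rows.
MISSING = {"", r"\N"}


def parse_number(value, digits=1):
    """Convert a text field to a rounded float, or None when it is missing."""
    if value is None or value.strip() in MISSING:
        return None
    return round(float(value), digits)


def county_fips(value):
    """Left-pad a county FIPS code to the standard five digits."""
    return value.strip().zfill(5)


if __name__ == "__main__":
    print(parse_number("-0.6000000000000001"))
    print(parse_number(r"\N"))
    print(county_fips("1061"))
```

Keeping FIPS codes as zero-padded strings rather than integers makes joins against other county-level datasets less error-prone.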
While most users will simply want to download the data, the scraper is available for those who want to run it themselves.
To install the required packages, run the following command:
pip install git+https://github.com/seandavi/state-cancer-profiles-scraper.git
To run the scraper, use the following command:
python -m scps.scraper
The scraper will save the data in the current working directory.
This project is licensed under the MIT License - see the LICENSE file for details.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.