US Wages Dataset
This is a dataset collected from the BLS Occupational Employment & Wage Statistics program.
My plan was to use this to build a comprehensive, data-driven site around wages by occupation at national, industry and state level. This could be a huge and useful site for workers to compare salaries from state to state.
The idea was to use the data and some basic maths to answer questions like the following with long-form content including charts, comparisons and historical tables:
- "How much does a {job_title} make?"
- "How much does a {job_title} in {state} make?"
- "Average {job_title} salary"
- "Average {job_title} salary in {state}"
- "Highest paying states for {job_title}"
- "Best state to be a {job_title}"
It covers all years from 2014 to the latest available, 2022, and can be used for some very interesting analysis, such as finding which professions are growing fastest by wage.
The BLS also provides an inflation API which could be used to compare whether wages are increasing in line with inflation.
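As a rough sketch of that comparison, nominal wage growth can be deflated by the change in a price index over the same period. The function name and the numbers below are illustrative placeholders, not values from this dataset or from the BLS API.

```python
def real_wage_growth(wage_start, wage_end, cpi_start, cpi_end):
    """Return (nominal_growth, real_growth) as fractions, e.g. 0.10 for 10%."""
    nominal = wage_end / wage_start - 1
    inflation = cpi_end / cpi_start - 1
    # Real growth is nominal growth deflated by inflation over the same period.
    real = (1 + nominal) / (1 + inflation) - 1
    return nominal, real


# Placeholder figures for illustration only.
nominal, real = real_wage_growth(50_000, 55_000, 100.0, 110.0)
print(f"nominal: {nominal:.1%}, real: {real:.1%}")
```

A real growth figure near zero means the wage roughly kept pace with inflation; a negative figure means it fell behind.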
The data includes:
- Mean wage
- Median wage
- Percentiles
- Total employment
- Employment per 1k jobs
- Categorised jobs
- All years from 2014 to 2022
Make sure to read up on the BLS site to understand the data.
The raw data included non-numeric fields that I have cleaned as follows for easy checks when building out the app:
na_values = {
'*': -1, # wage estimate not available
'**': -2, # employment estimate not available
'#': -3, # wage equal to or higher than $100 an hour or $208,000 a year
'~': -4, # percentage of establishments reporting the occupation is less than 0.5%
}
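A minimal sketch of how a mapping like this could be applied to a single raw cell. The `clean_value` helper is hypothetical and app.py may do this differently; the handling of thousands separators is also an assumption about the raw files.

```python
# Mapping copied from above; everything else in this block is a hypothetical sketch.
NA_VALUES = {
    "*": -1,   # wage estimate not available
    "**": -2,  # employment estimate not available
    "#": -3,   # wage equal to or higher than $100 an hour or $208,000 a year
    "~": -4,   # percentage of establishments reporting the occupation is less than 0.5%
}


def clean_value(raw):
    """Convert one raw cell to a number, using the codes above for BLS symbols."""
    text = raw.strip()
    if text in NA_VALUES:
        return NA_VALUES[text]
    # Assumption: numeric cells may contain thousands separators such as "52,000".
    return float(text.replace(",", ""))


print(clean_value("**"))
print(clean_value("52,000"))
```

Because every code is negative and real wages and employment counts are not, a simple `value >= 0` check is enough to tell real data from a placeholder.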
Do with this whatever you like, but let me know if you build something with it - I'd love to check it out.
The Python script works but does require a manual download of 4 files because BLS don't let you scrape them and I couldn't be bothered to figure out proxies for it!
Prerequisites
Before running this script, ensure that you have the following installed:
- Python 3.9
- MySQL Server
- Pipenv for Python package management
Installation
Install the required packages:
pipenv install
Create a MySQL database named wages (or update the database connection in app.py if you prefer a different name).
Run the script:
pipenv run python app.py
Usage
There is an optional -year flag that can be passed in to specify a specific year to download. If no year is specified, the script will default to the most recent year available.
You can run it like this:
pipenv run python app.py -year 2019
Data is appended to the database, so be careful running the same year twice. You may need to delete the data from the database before running the script again.
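One way to make re-runs safe is to delete a year's rows before inserting them again. The sketch below uses Python's built-in sqlite3 as a stand-in so it runs without a MySQL server; the table name `oews` and its columns are hypothetical and will not match the schema app.py creates.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch is self-contained.
# The table name "oews" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oews (year INTEGER, occupation TEXT, mean_wage REAL)")


def load_year(conn, year, rows):
    """Replace all rows for one year so a second run does not duplicate data."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM oews WHERE year = ?", (year,))
        conn.executemany(
            "INSERT INTO oews (year, occupation, mean_wage) VALUES (?, ?, ?)",
            [(year, occupation, wage) for occupation, wage in rows],
        )


rows = [("Example occupation", 1.0)]  # placeholder row, not real data
load_year(conn, 2019, rows)
load_year(conn, 2019, rows)  # second run replaces the year instead of appending
count = conn.execute("SELECT COUNT(*) FROM oews WHERE year = 2019").fetchone()[0]
print(count)
```

With a MySQL driver the same pattern applies, though the common drivers use `%s` rather than `?` as the parameter placeholder.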
Data cleaning
The BLS data has some values that have been cleaned to make them easier to use in code when building a site around it. These values are represented by different symbols in the original data and are converted to distinct numerical codes for easier identification and processing. Below is a summary of how each NA value is handled:
- '*': Replaced with -1. Indicates that the wage estimate is not available.
- '**': Replaced with -2. Signifies that the employment estimate is not available.
- '#': Replaced with -3. Used when the wage is equal to or higher than $100 an hour or $208,000 a year.
- '~': Replaced with -4. Represents cases where the percentage of establishments reporting the occupation is less than 0.5%.
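When consuming the cleaned data, the codes need to be kept out of any arithmetic, otherwise a -1 would drag an average down. The helpers below are a hypothetical sketch of that, with names invented for illustration; the meanings are taken from the list above.

```python
from statistics import mean

# Meanings taken from the list above; the helper names are hypothetical.
CODE_MEANINGS = {
    -1: "wage estimate not available",
    -2: "employment estimate not available",
    -3: "wage equal to or higher than $100 an hour or $208,000 a year",
    -4: "less than 0.5% of establishments reporting the occupation",
}


def average_ignoring_codes(values):
    """Average only real (non-negative) values; return None if none remain."""
    real = [v for v in values if v >= 0]
    return mean(real) if real else None


def describe(value):
    """Return a label for a code, or the number formatted for display."""
    return CODE_MEANINGS.get(value, f"{value:,.0f}")


print(average_ignoring_codes([10, -1, 20, -3]))
print(describe(-2))
```

Note that -3 is not really missing data: it marks a wage at or above the BLS reporting cap, so a site might prefer to display it as "$208,000+" rather than hide it.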
Provided as-is
This script was built to extract the data for my own use and is provided as-is. I have no plans to maintain it or provide support for it.
Use this as a starting point for building your own dataset.