Regression Model for Data Scientist Pay Range

There is an immense variety of data scientist jobs out there, and the titles rarely delineate between job levels (i.e., Junior/Senior) or the actual work (i.e., doing NLP vs. deep learning vs. time-series vs. regressions, etc.). So... I was curious whether I could build a model that, given a set of skills, would show what the pay range would be. Before I began this part, I scraped about 10,000 job descriptions for "data scientist" and "data analyst" roles from 20 major metro areas in the United States. Then, I created buckets of words that I was interested in (a rough sketch of this bucketing step follows the list). Here's a sampling of them with part of their list of words:
  • Skill Machine Learning Simple (prediction, regression, linear model, data modeling...)
  • Skill Machine Learning Advanced (decision trees, random forest, sklearn, reinforcement learning, gradient boosting machine...)
  • Skill Machine Learning NLP (natural language processing, nlp, nltk, spacy, sentiment, named entity recognition...)
  • Skill Machine Learning Deep Learning (tensorflow, deep learning, pytorch, keras, neural networks...)
  • Skill SQL (sql)
  • Skill Excel (spreadsheet, excel...)
  • Skill Web Analytics (google analytics, web analytics, cookies, internet, web logs...)
  • Tool Cloud Platform (Azure, AWS, Google Cloud)
  • Tool Data Visualization (looker, tableau, power bi, spotfire, d3js, matplotlib, ggplot2...)
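To make the bucketing concrete, here is a minimal sketch of how a single description could be turned into those flags. The bucket names match the feature names used later, but the keyword lists and the encode_description helper are illustrative, not the exact code used on the real dataset.

# A minimal sketch of the bucketing step: scan a lowercased job description
# for each bucket's keywords and emit a 0/1 flag per bucket.
library(tidyverse)

skill_buckets <- list(
	skill_machinelearning_nlp = c("natural language processing", "nlp", "nltk", "spacy", "sentiment"),
	skill_deeplearning        = c("tensorflow", "deep learning", "pytorch", "keras", "neural networks"),
	skill_sql                 = c("sql"),
	tool_cloudplatform        = c("azure", "aws", "google cloud")
)

encode_description <- function(text, buckets) {
	text <- str_to_lower(text)
	map_int(buckets, ~ as.integer(any(str_detect(text, fixed(.x)))))
}

encode_description("Looking for NLP experience with spaCy, SQL, and AWS", skill_buckets)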

The output of this processing was 64 one-hot-encoded features, which included:
  • various skills (around data engineering, data science, and software engineering)
  • job titles (parsed out known job titles - like data scientist, data analyst, software engineer, data engineer, etc...)
  • job levels (junior, senior, II, III, director, etc...)
  • geolocation (just Seattle currently, since that's all I care about)
  • degree (bachelor, masters, PhD)
  • departments (marketing, sales, customer service, operations, supply chain, IT)

Using these inputs, I created two generalized linear models with h2o to predict the low and high ends of the pay range for a job posting, given the list of skills it mentions.
First, load your libraries and initialize h2o (note: you need Java installed on your machine or in your Docker image to run this).
library(h2o)
library(tidyverse)
h2o.init()
seed <- 3336666

Then, I load in the data, filter it down to jobs whose titles match data analyst, data scientist, analyst, or senior data analyst (and drop director-, manager-, principal-, and architect-level postings)... and convert it to an H2OFrame.
df.final <- read.csv("jobs/df.final.csv")

df.hex <- as.h2o(
	df.final %>% 
		filter(jobtitle_analyst == 1 | jobtitle_datascientist == 1 | jobtitle_seniordataanalyst == 1 | jobtitle_dataanalyst == 1) %>%
		filter(jobtitle_level_director == 0 & jobtitle_level_manager == 0 & jobtitle_level_principle == 0 & jobtitle_level_architect == 0)
	)

Do the train-test split for the h2o data and train the two models.
predictors <- c("geolocation_seattle","degree_bachelor","degree_masters","degree_phd","jobtitle_analyst","jobtitle_datascientist","jobtitle_dataanalyst","jobtitle_seniordataanalyst","jobtitle_dataengineer","jobtitle_machinelearningengineer","jobtitle_businessanalyst","jobtitle_softwareengineer","jobtitle_businessintelligence","jobtitle_level_junior","jobtitle_level_senior","jobtitle_keyword_research","skill_abtesting","skill_agile","skill_algorithms","skill_bigdata","skill_commandline","skill_containers","skill_csharp","skill_databases","skill_datamining","skill_datapipelines","skill_deeplearning","skill_deeplearning_vision","skill_deployment","skill_devops","skill_etl","skill_excel","skill_machinelearning_simple","skill_machinelearning_advanced","skill_machinelearning_nlp","skill_machinelearning_timeseries","skill_problemsolving","skill_python","skill_r","skill_reporting","skill_sas","skill_java","skill_julia","skill_scala","skill_simulation","skill_sourcecontrol","skill_sql","skill_ssrs","skill_statistics","skill_testing","skill_userexperience","skill_webanalytics","tool_cloudplatform","tool_datavis","department_humanresources","department_finance","department_marketing","department_supplychain","department_operations","department_customerservice")

# Split dataset giving the training dataset 75% of the data
wages.split <- h2o.splitFrame(data=df.hex, ratios=0.75)

# Create a training set from the 1st dataset in the split
wages.train <- wages.split[[1]]

# Create a testing set from the 2nd dataset in the split
wages.test <- wages.split[[2]]

# Train GLM models (one for the low end of the pay range, one for the high end)
response <- "payestimate_low"
wages.glm.low <- h2o.glm(family= "gaussian", x= predictors, y=response, training_frame=wages.train, lambda = 0, compute_p_values = TRUE, seed = seed)

response <- "payestimate_high"
wages.glm.high <- h2o.glm(family= "gaussian", x= predictors, y=response, training_frame=wages.train, lambda = 0, compute_p_values = TRUE, seed = seed)
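Since compute_p_values = TRUE is set, the per-coefficient standard errors and p-values are stored on the fitted models; a quick way to pull them (an aside, not part of the original write-up) is:

# Inspect standard errors and p-values for the low-range model's coefficients.
wages.glm.low@model$coefficients_table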


Get the coefficients for each of the models and save the chart for both.
# Coefficients that can be applied to the non-standardized data, convert to DF
coeffs = h2o.coef(wages.glm.low)
intercept = coeffs["Intercept"]
coeffs = cbind(read.table(text = names(coeffs)), coeffs) %>% filter(V1 != "Intercept")

plot <- ggplot(data=coeffs, aes(x=V1,y=coeffs)) +
	geom_bar(position="dodge",stat="identity") + 
	coord_flip() +
	ggtitle(paste0("Attribute with contribution (Low $", round(intercept,0),")"))

ggsave(file=paste0("coeffs-glm-low.svg"), plot=plot, width=10, height=8)

# Coefficients that can be applied to the non-standardized data, convert to DF
coeffs = h2o.coef(wages.glm.high)
intercept = coeffs["Intercept"]
coeffs = cbind(read.table(text = names(coeffs)), coeffs) %>% filter(V1 != "Intercept")

plot <- ggplot(data=coeffs, aes(x=V1,y=coeffs)) +
	geom_bar(position="dodge",stat="identity") + 
	coord_flip() +
	ggtitle(paste0("Attribute with contribution (High $", round(intercept,0),")"))


ggsave(file=paste0("coeffs-glm-high.svg"), plot=plot, width=10, height=8)

The output looks okay, I guess. For the "low" model, the intercept is $77.6k, and it's dominated by job title and a Seattle geolocation. This makes sense, but some features I would have expected to carry more weight - like a PhD, deep learning, and NLP - are either small or negative. On the other hand, this could just be an interaction between the jobtitle_datascientist feature and the others - the model jumps way up (+$15K) for the job title, and then evens out around that value, nudging slightly up or down depending on the skills (which have a smaller impact than the title itself).

![coeffs-glm-low](//images.ctfassets.net/umusgtot1lpw/5lyiwdbFtDtqXWX8vpmr4E/4605691d334179ccd00f73662dda3c9f/coeffs-glm-low.svg)

![coeffs-glm-high](//images.ctfassets.net/umusgtot1lpw/kBKyCfMd6GfS7mYeNFrdy/22d403be1221a1e59924b62d794e4895/coeffs-glm-high.svg)
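As an aside (not part of the original pipeline), one way to sanity-check those relative magnitudes is to look at the coefficients on the standardized scale, which h2o exposes via h2o.coef_norm():

# Sanity check: compare coefficients fit to the standardized data,
# which adjusts for how common each 0/1 flag is in the training set.
std_coeffs <- h2o.coef_norm(wages.glm.low)
std_coeffs <- cbind(read.table(text = names(std_coeffs)), std_coeffs) %>%
	filter(V1 != "Intercept") %>%
	arrange(desc(abs(std_coeffs)))

head(std_coeffs, 10)  # the ten features with the largest standardized effect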
Finally, predict a value for an input (or dataset):
newdata <- as.h2o(tibble(geolocation_seattle = 1,
									degree_bachelor = 1,
									degree_masters = 0,
									degree_phd = 0,
									jobtitle_analyst = 0,
									jobtitle_datascientist = 1,
									jobtitle_dataanalyst = 0,
									jobtitle_seniordataanalyst = 0,
									jobtitle_dataengineer = 0,
									jobtitle_machinelearningengineer = 0,
									jobtitle_businessanalyst = 0,
									jobtitle_softwareengineer = 0,
									jobtitle_businessintelligence = 0,
									jobtitle_level_junior = 0,
									jobtitle_level_senior = 0, 
									jobtitle_keyword_research = 0, 
									skill_abtesting = 1,
									skill_agile = 1,
									skill_algorithms = 0,
									skill_bigdata = 1,
									skill_commandline = 1,
									skill_containers = 0, 
									skill_csharp = 1,
									skill_databases = 1,
									skill_datamining = 1,
									skill_datapipelines = 0,
									skill_deeplearning = 1,
									skill_deeplearning_vision = 0,
									skill_deployment = 0,
									skill_devops = 1,
									skill_etl = 0,
									skill_excel = 0,
									skill_machinelearning_simple = 1,
									skill_machinelearning_advanced = 1,
									skill_machinelearning_nlp = 1,
									skill_machinelearning_timeseries = 1,
									skill_problemsolving = 1,
									skill_python = 1,
									skill_r = 1,
									skill_reporting = 1,
									skill_sas = 0,
									skill_java = 0,
									skill_julia = 0,
									skill_scala = 0,
									skill_simulation = 0,
									skill_sourcecontrol = 1,
									skill_sql = 1,
									skill_ssrs = 0,
									skill_statistics = 1,
									skill_testing = 0,
									skill_userexperience = 0,
									skill_webanalytics = 1, 
									tool_cloudplatform = 1, 
									tool_datavis = 1,
									department_humanresources = 0,
									department_finance = 0,
									department_marketing = 0,
									department_supplychain = 0,
									department_operations = 0,
									department_customerservice = 0))
                  
h2o.predict(object=wages.glm.low, newdata=newdata)
h2o.predict(object=wages.glm.high, newdata=newdata)

These models predict the wage for this job description would be $103K to $140K, plus or minus $3,700 (ish). This generally jibes with what I've personally seen in job listings: wide wage ranges reflecting the large gap in experience between junior data scientists and seasoned ones. So, I think this is a pretty good estimate! This model obviously doesn't account for years of experience in any particular thing - it's just looking for individual words/phrases in titles and descriptions.
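That plus-or-minus figure is just the models' typical error; since the 25% test split from earlier hasn't been used yet, a quick way to check the error on held-out data is:

# Evaluate both models on the 25% held-out split created earlier.
perf.low  <- h2o.performance(wages.glm.low,  newdata = wages.test)
perf.high <- h2o.performance(wages.glm.high, newdata = wages.test)

h2o.rmse(perf.low)   # typical error, in dollars, for the low-end predictions
h2o.rmse(perf.high)  # typical error for the high-end predictions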
BONUS!! Another thing I tried was using the AutoML module of h2o (which... is brilliant, and why I started using this package in the first place). Using this, the linear model's hyperparameters are automatically optimized based on deviance (by default).
response <- "payestimate_low"
aml <- h2o.automl(x= predictors, y=response,
									include_algos = c("GLM"),
									training_frame = wages.train,
									max_models = 20,
									seed = seed)

coeffs = h2o.coef(aml@leader)
intercept = coeffs["Intercept"]
coeffs = cbind(read.table(text = names(coeffs)), coeffs) %>% filter(V1 != "Intercept")

ggplot(data=coeffs, aes(x=V1,y=coeffs)) +
	geom_bar(position="dodge",stat="identity") + 
	coord_flip() +
	ggtitle(paste0("Attribute with contribution (Auto Low $", round(intercept,0),")"))

h2o.predict(object=aml@leader, newdata=newdata)

This model FEELS a lot better, looking at the coefficients. I like that most skills push you UP by some amount, and job titles (i.e., analyst/data analyst) push you DOWN. It's possible that deviance isn't the right optimization metric here, so you might try some others (like RMSE).
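For example, h2o.automl takes a sort_metric argument that controls how the leaderboard is ranked, so (assuming I'm remembering the API correctly) re-running with RMSE would look something like this:

# Re-run AutoML, ranking the GLM leaderboard by RMSE instead of deviance.
aml.rmse <- h2o.automl(x = predictors, y = response,
									include_algos = c("GLM"),
									training_frame = wages.train,
									max_models = 20,
									sort_metric = "RMSE",
									seed = seed)

h2o.predict(object = aml.rmse@leader, newdata = newdata)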