TF-IDF Analysis on Data-Related Job Descriptions


This is the code/methodology for the article "[What the heck is a data scientist?]( "Data Scientist - Wut?")".

The setup...

In this analysis, I use R (tidyverse, tidytext) to find out what makes a data scientist special compared to other software- and data-related jobs. I scraped around 10,000 job descriptions from 20 major metro areas in the United States (San Francisco, Seattle, Austin, NYC, etc.) for two search queries: "data scientist" and "data analyst". My main goal is to figure out where, in aggregate, people draw the line between data scientist and data analyst jobs. I regularly see data analyst jobs that require a machine learning skill set, and data scientist jobs that don't require one at all - both of which I find odd.

The data...

The end result of scraping and cleaning the job descriptions was a CSV file with just two columns:
  • Job Category (the job title): "1 - Data Scientist", "2 - Senior Data Analyst", "3 - Data Engineer", "4 - Data Analyst", "5 - Machine Learning Engineer", "6 - Business Analyst", "7 - Other Data Jobs"
  • Job Description (plain text of the full job description)
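For illustration, the file might look something like this (hypothetical rows; the column names `category` and `jobdescription` are the ones the code below expects):

```csv
category,jobdescription
"1 - Data Scientist","We are looking for a data scientist to build predictive models..."
"4 - Data Analyst","The data analyst will create dashboards and reports for stakeholders..."
```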

The assumptions...

While I could probably process the job descriptions further to isolate just the skills and/or responsibilities sections, I assumed that the company-specific fluff would differ enough between job descriptions not to matter for this methodology, and that the boilerplate common to all job descriptions would carry no signal either.

The methodology...

I'm using TF-IDF for this analysis. It stands for "term frequency-inverse document frequency", and it's a word/phrase weighting scheme that measures how important a term is to a particular document within a collection. In this case, a "document" is the collection of job descriptions for a particular category (job title), and the terms are the 1-grams and 2-grams (single words and two-word phrases) in that document.
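As a sketch of the scoring (these match the default definitions used by tidytext's `bind_tf_idf`: term frequency normalized by document length, and a natural-log inverse document frequency):

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}
\qquad
\mathrm{idf}(t, D) = \ln \frac{|D|}{|\{d \in D : t \in d\}|}
\qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Intuitively, a term scores high for a category when it appears often in that category's job descriptions but rarely in the other categories'.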

The code...

First, import the required libraries and load data.
library(tidyverse)
library(tidytext)

job_descriptions <- read.csv("jobs/job_descriptions.csv")
I've included a bunch of custom stopwords that didn't mean anything to me. You can exclude this part if you want.
custom_stopwords <- tibble(word = c("â", "scientist", "data scientist", "amazon", "microsoft", "cirium", "data engineer",
																		"expedia", "analyst", "senior data analyst", "cerner", "senior business", "business analyst",
																		"mattersight", "engineer", "pimco", "kapsch", "swedish", "ods", "the ods", "kpmg",
																		"password", "employment", "office", "company", "is a", "and or", "we are", "as a", 
																		"elsevier", "perform", "working with", "environment", "2", "level", "opportunities", "ensure",
																		"include", "understand", "duties", "with a", "degree in", "internal", "process", "issues", 
																		"on the", "1", "solving", "external", "analysis and", "status", "national", "clinical", "healthcare", 
																		"care", "description", "products", "medical", "apply", "claims provider", "responsible for", "health",
																		"education", "payroll controls", "departmental payroll", "race", "megacenter", "qualified",
																		"to support", "sources", "for a", "a data", "applicants", "of our", "controls auditor", "assist",
																		"minimum", "color", "to ensure", "origin", "ms", "programs", "religion", "to be", "the ability",
																		"and analyze", "management", "massmutual", "smiths", "bird", "shutterstock", "mount sinai", "sinai", 
																		"smiths medical", "n26", "the data", "capital one", "gsk", "bcg", "virgin pulse", "join",
																		"skills and", "national origin", "school", "detail", "employees", "client", "responsible",
																		"citi", "pm ba", "heart association", "american heart", "and business", "the pm", "life", "skills and",
																		"accenture", "management and", "wells fargo", "fargo", "r n", "n r", "gs", "t r", "pfizer", "pay band", 
																		"johnson", "federal service", "pk", "n t"))
Create the various "n-grams".
# 1-Grams (word)
job_words <- job_descriptions %>%
	filter(!is.na(jobdescription)) %>%
	unnest_tokens(word, jobdescription) %>%
	count(category, word, sort = TRUE) %>%
	anti_join(custom_stopwords, by = "word") %>%
	anti_join(stop_words, by = "word")

colnames(job_words) <- c("category", "ngram", "n")

# Create 2-grams
job_words_2gram <- job_descriptions %>%
	filter(!is.na(jobdescription)) %>%
	unnest_tokens(ngram, jobdescription, token = "ngrams", n = 2) %>%
	count(category, ngram, sort = TRUE) %>%
	anti_join(custom_stopwords, by = c("ngram" = "word"))
Combine the n-grams and group by "document" (aka category, aka job title).
all_the_grams <- job_words %>%
	union(job_words_2gram) %>%
	anti_join(custom_stopwords, by = c("ngram" = "word"))

words_per_category <- all_the_grams %>% 
	group_by(category) %>% 
	summarize(total = sum(n))
Finally, loop through each category and save an image file for each.
categories <- unique(all_the_grams$category)

for (cat in categories) {
	catform <- tolower(cat)
	job_words_combined <- left_join(all_the_grams, words_per_category, by = "category") %>%
		bind_tf_idf(ngram, category, n) %>%
		filter(n > 10 & category == cat)
	plot <- job_words_combined %>%
		arrange(desc(tf_idf)) %>%
		mutate(ngram = factor(ngram, levels = rev(unique(ngram)))) %>%
		group_by(category) %>%
		top_n(50, tf_idf) %>%
		ungroup() %>%
		ggplot(aes(ngram, tf_idf, fill = category)) +
		geom_col(show.legend = FALSE) +
		labs(x = NULL, y = "tf-idf") +
		facet_wrap(~category, ncol = 2, scales = "free")
	ggsave(file = paste0(catform, "-tfidf.svg"), plot = plot, width = 10, height = 8)
}

