Back to questions
This is the same question as problem #8 in the SQL Chapter of Ace the Data Science Interview!
Assume you're given a table containing job postings from various companies on the LinkedIn platform. Write a query to retrieve the count of companies that have posted duplicate job listings.
Definition:
Column Name | Type |
---|---|
job_id | integer |
company_id | integer |
title | string |
description | string |
job_id | company_id | title | description |
---|---|---|---|
248 | 827 | Business Analyst | Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations. |
149 | 845 | Business Analyst | Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations. |
945 | 345 | Data Analyst | Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems. |
164 | 345 | Data Analyst | Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems. |
172 | 244 | Data Engineer | Data engineer works in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. |
duplicate_companies |
---|
1 |
There is one company ID 345 that posted duplicate job listings. The duplicate listings, IDs 945 and 164 have identical titles and descriptions.
The dataset you are querying against may have different input & output - this is just an example!
The first step to solving this LinkedIn problem statement is connecting with me on LinkedIn. 🥺
Step 1: Identify companies with duplicate job listings
But seriously, the first step is to determine which companies have posted duplicate job listings with the same title and description. We can do so by utilizing the aggregate function to calculate the number of entries grouped by , , and .
The output may appear as follows:
company_id | title | description | job_count |
---|---|---|---|
827 | Data Scientist | Data scientist uses data to understand and explain the phenomena around them, and help organizations make better decisions. | 2 |
244 | Data Engineer | Data engineer works in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. | 1 |
845 | Business Analyst | Business analyst evaluates past and current business data with the primary goal of improving decision-making processes within organizations. | 1 |
244 | Software Engineer | Software engineers design and create computer systems and applications to solve real-world problems. | 2 |
345 | Data Analyst | Data analyst reviews data to identify key insights into a business's customers and ways the data can be used to solve problems. | 2 |
Step 2: Filter for companies with duplicate job listings
Next, we can proceed to filter the previous query by converting it into a Common Table Expression (CTE) and selecting only the rows where the is greater than 1, indicating the presence of duplicate job listings.
To obtain a count of unique companies, we apply the keyword on the column.
The output will present the count of companies with duplicate job listings:
duplicate_companies |
---|
3 |
Method #2: Using Subquery
Need a refresher on Subquery? Check out our tutorial!