At Procore, SQL is used frequently to analyze construction project data for insights and to power BI reporting within the company. Because of this, Procore often asks SQL coding questions in interviews for Data Analytics, Data Science, and Data Engineering roles.
To help you study for the Procore SQL interview, we've collected 11 Procore SQL interview questions – can you solve them?
For a construction management software company like Procore, a power user is likely one who not only logs into the app frequently, but who also actively creates and manages projects. We might define a "power user" as someone who logs into Procore at least 5 days a week and creates at least 2 projects per week.
The task is to write a SQL query that identifies these power users from Procore's user activity and project creation log tables and returns their user ID, total number of logins, and total number of projects created.
Given the following tables:

Login activity:
log_id | user_id | login_date |
---|---|---|
1001 | 1 | 06/10/2022 |
1002 | 1 | 06/11/2022 |
1003 | 2 | 06/10/2022 |
1004 | 1 | 06/12/2022 |
1005 | 2 | 06/13/2022 |
Project creations:

project_id | user_id | creation_date |
---|---|---|
501 | 1 | 06/10/2022 |
502 | 2 | 06/10/2022 |
503 | 1 | 06/11/2022 |
504 | 2 | 06/13/2022 |
505 | 1 | 06/13/2022 |
506 | 1 | 06/12/2022 |
507 | 1 | 06/13/2022 |
For this problem, we will join two subqueries: one counting the unique login days per user and one counting the project creations per user.
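Here's a PostgreSQL sketch of this approach. The prompt doesn't name the tables, so `logins` and `projects` below are assumed names:

```sql
WITH weekly_logins AS (
  -- one row per user: distinct days they logged in over the past week
  SELECT user_id, COUNT(DISTINCT login_date) AS total_logins
  FROM logins
  WHERE login_date >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY user_id
),
weekly_projects AS (
  -- one row per user: projects they created over the past week
  SELECT user_id, COUNT(project_id) AS total_projects
  FROM projects
  WHERE creation_date >= CURRENT_DATE - INTERVAL '7 days'
  GROUP BY user_id
)
SELECT
  l.user_id,
  l.total_logins,
  p.total_projects
FROM weekly_logins l
JOIN weekly_projects p ON l.user_id = p.user_id
WHERE l.total_logins >= 5    -- logged in at least 5 days this week
  AND p.total_projects >= 2; -- created at least 2 projects this week
```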
This query fetches the users who logged in on at least 5 unique days in the past week and created at least 2 projects in the past week. Only users who meet both criteria (power users) are returned in the final result. Note: CURRENT_DATE refers to the current date.
To solve a related customer analytics question on DataLemur's free interactive coding environment, try this Microsoft Azure Cloud SQL Interview Question:
Suppose you're given a dataset containing project feedback for the company Procore. The data is stored in a table titled "project_reviews", where each record represents a review left by a user for a specific project. Each review is identified by a unique review ID and contains the user ID, the date the review was submitted, the project ID, and a rating out of 5 stars. Procore would like to analyze how its projects are performing monthly, so they need a SQL query that calculates the average rating of each project per month.
Here's some example input data and expected output:
review_id | user_id | submit_date | project_id | stars |
---|---|---|---|---|
1739 | 123 | 2022-04-19 | 111 | 3 |
1820 | 520 | 2022-04-05 | 222 | 5 |
1934 | 245 | 2022-05-03 | 333 | 4 |
2055 | 340 | 2022-05-17 | 222 | 2 |
2161 | 456 | 2022-06-25 | 111 | 4 |
Expected output:

month | project_id | avg_rating |
---|---|---|
4 | 111 | 3 |
4 | 222 | 5 |
5 | 222 | 2 |
5 | 333 | 4 |
6 | 111 | 4 |
You could use PostgreSQL's EXTRACT() function to pull the month out of submit_date. Then, group by month and project_id, and calculate the average rating:
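One way to write it:

```sql
SELECT
  EXTRACT(MONTH FROM submit_date) AS month,
  project_id,
  AVG(stars) AS avg_rating
FROM project_reviews
GROUP BY month, project_id
ORDER BY month, project_id;
```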
This query breaks down the average rating by project for each month. It also orders the result set by month and project ID so that it's easy to follow how each project is performing over time.
To solve a related window function SQL problem on DataLemur's free online SQL code editor, solve this Google SQL Interview Question:
Database denormalization is when you add redundancy to a database, breaking the typical normalization rules (specified by the 1st, 2nd, and 3rd normal forms). There are a few reasons to denormalize a database:
Improved performance: Denormalization often reduces the number of costly join operations needed to retrieve data. This is helpful when the database is used for OLAP (Online Analytical Processing) use cases, as joins can be expensive and slow.
Simplification: Denormalization can also be used to simplify the design of a database by reducing the number of tables and relationships that need to be managed. This can make it easier to understand and maintain the database.
Ease of use: Denormalization can also make it easier for users to work with a database by providing them with a more intuitive and straightforward data model.
While denormalization can be a useful tool for improving performance and scalability, it's important to keep in mind that it can make update and delete operations more complex. This is because denormalization can create data duplicates, which can make it harder to maintain data integrity. So, to avoid any headaches, it's a good idea to start with a well-normalized database design and then consider denormalization only if it's absolutely necessary for your specific performance and scalability needs. In other words, denormalization can be a bit of a wild card, so it's best to handle it with care!
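To make the tradeoff concrete, here's a minimal sketch using hypothetical orders and customers tables. The denormalization step copies each customer's name onto their order rows so that hot read paths can skip the join:

```sql
-- Normalized design: customer_name lives only in customers,
-- so every reporting query needs a join.
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Denormalized: copy customer_name onto each order row
-- so hot read paths can skip the join entirely.
ALTER TABLE orders ADD COLUMN customer_name TEXT;

UPDATE orders o
SET customer_name = c.customer_name
FROM customers c
WHERE o.customer_id = c.customer_id;

-- Reads get simpler and faster...
SELECT order_id, customer_name FROM orders;

-- ...but a customer rename must now update both tables,
-- which is the data-integrity cost described above.
```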
Procore is a company that provides construction project management software. One of the challenging tasks they face is to track the resource usage in their ongoing projects. Let's assume that for each project, there are different types of resources like personnel, materials, and machines. The question is how would you design a relational database to efficiently keep track of usage of these various resources in each project? Furthermore, can you write a PostgreSQL query to find out the total number of hours each type of resource is utilized on a given project?
Sample tables:

Projects:
project_id | project_name | start_date | end_date |
---|---|---|---|
1 | Project Alpha | 06/01/2022 | 09/01/2022 |
2 | Project Beta | 07/01/2022 | 11/01/2022 |
3 | Project Gamma | 08/01/2022 | 12/01/2022 |
Resource types:

resource_type_id | resource_type |
---|---|
1 | Personnel |
2 | Material |
3 | Machine |
Resource usage:

resource_usage_id | project_id | resource_type_id | hours_used |
---|---|---|---|
1 | 1 | 1 | 500 |
2 | 1 | 2 | 300 |
3 | 1 | 3 | 200 |
4 | 2 | 1 | 800 |
5 | 2 | 2 | 400 |
6 | 3 | 1 | 600 |
7 | 3 | 3 | 500 |
Expected output:

project_name | resource_type | total_hours_used |
---|---|---|
Project Alpha | Personnel | 500 |
Project Alpha | Material | 300 |
Project Alpha | Machine | 200 |
Project Beta | Personnel | 800 |
Project Beta | Material | 400 |
Project Gamma | Personnel | 600 |
Project Gamma | Machine | 500 |
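Assuming the three tables are named projects, resource_types, and resource_usage (the prompt shows their columns but not their names), the query could look like this:

```sql
SELECT
  p.project_name,
  rt.resource_type,
  SUM(ru.hours_used) AS total_hours_used
FROM resource_usage ru
JOIN projects p        ON ru.project_id = p.project_id
JOIN resource_types rt ON ru.resource_type_id = rt.resource_type_id
GROUP BY p.project_name, rt.resource_type
ORDER BY p.project_name, total_hours_used DESC;
```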
This query joins the resource_usage, projects, and resource_types tables on their respective ID fields. The result is grouped by project name and resource type, and the SUM() function adds up all the hours used for each combination. This gives us a comprehensive view of resource usage per project for each type of resource.
A join in SQL allows you to retrieve data from multiple tables and combine it into a single set of results.
To demonstrate the difference between left vs. right join, imagine you had two database tables, an advertising campaigns table which had information on Google Ads keywords and how much was bid for each keyword, and a sales table, which has data on how many products were sold and which Google Ads keyword drove that sale.
LEFT JOIN: A LEFT JOIN retrieves all rows from the left table (in this case, the ad campaigns table) and any matching rows from the right table (the sales table). If there is no match in the right table, NULL values will be returned for the right table's columns.
RIGHT JOIN: A RIGHT JOIN retrieves all rows from the right table (in this case, the sales table) and any matching rows from the left table (the ad campaigns table). If there is no match in the left table, NULL values will be returned for the left table's columns.
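Here's a sketch of both joins using hypothetical ad_campaigns and sales tables that match the scenario above:

```sql
-- LEFT JOIN: keeps every ad keyword, even ones with no sales;
-- sales columns come back NULL for unmatched keywords.
SELECT a.keyword, a.bid, s.product_sold
FROM ad_campaigns a
LEFT JOIN sales s ON a.keyword = s.keyword;

-- RIGHT JOIN: keeps every sale, even ones whose keyword is
-- missing from ad_campaigns; campaign columns come back NULL.
SELECT a.keyword, a.bid, s.product_sold
FROM ad_campaigns a
RIGHT JOIN sales s ON a.keyword = s.keyword;
```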
Given the table below, write a SQL query to calculate the average duration of projects in days for each project manager.
project_id | start_date | end_date | project_manager_id |
---|---|---|---|
1201 | 2020-01-15 | 2020-05-22 | 555 |
1202 | 2020-02-06 | 2020-06-12 | 555 |
1203 | 2020-03-01 | 2020-06-25 | 777 |
1204 | 2020-01-20 | 2020-05-30 | 555 |
1205 | 2020-02-01 | 2020-06-10 | 777 |
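Assuming the table is named projects, here's one way to write it in PostgreSQL. Subtracting two DATE values yields an integer number of days, which can be averaged directly:

```sql
SELECT
  project_manager_id,
  AVG(end_date - start_date) AS avg_duration_days
FROM projects
GROUP BY project_manager_id;
```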
This query calculates each project's duration by subtracting start_date from end_date, then averages those durations for each project manager. This average gives the mean project duration in days for each manager.
To practice a very similar question try this interactive Facebook Average Post Hiatus (Part 1) Question which is similar for calculating duration between dates or this Microsoft Teams Power Users Question which is similar for grouping and ranking data.
There are several normal forms that define the rules for normalizing a database:
A database is in first normal form (1NF) if it meets the following criteria:
- Each column contains atomic (indivisible) values, with no arrays or comma-separated lists in a single cell
- There are no repeating groups of columns
- Each row is unique, typically enforced by a primary key
A database is in second normal form (2NF) if it meets the following criteria:
- It is already in 1NF
- Every non-key column depends on the entire primary key, not just part of it (i.e. no partial dependencies)
Said another way, to achieve 2NF, besides following all the rules from 1NF all the columns in a given table should be dependent only on that table's primary key.
A database is in third normal form (3NF) if it meets the following criteria:
- It is already in 2NF
- There are no transitive dependencies: non-key columns depend only on the primary key, not on other non-key columns
A transitive dependency means that a piece of data in one column is derived from another column. For example, it wouldn't make sense to keep both a "user's age" column and a "user's birthdate" column (because age can be derived from birthdate).
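As a quick sketch of that example (hypothetical users tables), the 3NF-friendly design stores only the birthdate and derives age at query time:

```sql
-- Violates 3NF: age is transitively dependent on birthdate.
CREATE TABLE users_bad (
  user_id   INT PRIMARY KEY,
  birthdate DATE,
  age       INT   -- redundant: derivable from birthdate
);

-- 3NF-compliant: store birthdate only, derive age when needed.
CREATE TABLE users (
  user_id   INT PRIMARY KEY,
  birthdate DATE
);

SELECT user_id,
       DATE_PART('year', AGE(birthdate)) AS age
FROM users;
```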
While there are also 4th and 5th normal forms, they're too pedantic to be in scope for the Procore SQL interview.
Procore deals with customers worldwide and maintains a large customer records database. As a data analyst at Procore, one of your duties is filtering customers based on their email domains. For this task, write a SQL query that returns the customers whose email domain is '@gmail.com'.
customer_id | first_name | last_name | email | sign_up_date |
---|---|---|---|---|
6543 | John | Doe | john.doe@gmail.com | 01/14/2022 |
4312 | Mary | Smith | mary.smith@outlook.com | 02/25/2022 |
5523 | Alice | Johnson | alice.johnson@hotmail.com | 03/14/2022 |
7123 | James | Wilson | james.wilson@gmail.com | 04/05/2022 |
5432 | Emily | Taylor | emily.taylor@yahoo.com | 05/21/2022 |
Expected output:

customer_id | first_name | last_name | email |
---|---|---|---|
6543 | John | Doe | john.doe@gmail.com |
7123 | James | Wilson | james.wilson@gmail.com |
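Assuming the table is named customers, you can match the domain with the LIKE operator:

```sql
SELECT customer_id, first_name, last_name, email
FROM customers
WHERE email LIKE '%@gmail.com';
```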
This query searches the 'email' field of the 'customers' table for any entries ending with '@gmail.com'. The '%' sign is a wildcard that matches any sequence of characters, so the query returns only the customers who have a Gmail account.
As a data analyst at Procore, your manager asks you to analyze the company's sales database. The sales database comprises two main tables, 'customer_info' and 'project_spending'. The 'customer_info' table contains the customer 'id', 'name', and 'company', while the 'project_spending' table contains 'customer_id', 'project_id', and 'spending'. Write a SQL query to find the average project spending per company and order the result by company name.
The tables are structured as follows:
customer_info:

id | name | company |
---|---|---|
1 | John Doe | ABC Corp |
2 | Jane Smith | XYZ Inc |
3 | Robert Johnson | ABC Corp |
4 | Emily Davis | PQR Partners |
project_spending:

customer_id | project_id | spending |
---|---|---|
1 | 1000 | 50000 |
2 | 2000 | 30000 |
3 | 3000 | 40000 |
1 | 4000 | 60000 |
4 | 5000 | 100000 |
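One way to write the query:

```sql
SELECT
  ci.company,
  AVG(ps.spending) AS avg_spending
FROM project_spending ps
INNER JOIN customer_info ci ON ps.customer_id = ci.id
GROUP BY ci.company
ORDER BY ci.company;
```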
This query first performs an inner join between the 'project_spending' and 'customer_info' tables, matching 'customer_id' from the 'project_spending' table with 'id' from the 'customer_info' table. After joining, it groups by 'company' and calculates the average 'spending' for each company. Finally, it orders the results by 'company' name.
Expected output:

company | avg_spending |
---|---|
ABC Corp | 50000 |
PQR Partners | 100000 |
XYZ Inc | 30000 |
Because joins come up so often during SQL interviews, try an interactive Spotify JOIN SQL question:
The COALESCE() function can be used to replace NULL values with a specified value. For instance, if you are a Data Analyst at a company working on a customer analytics project to identify the most engaged customers, you may have access to the following data:
customer_id | email_engagement | sms_engagement |
---|---|---|
101 | moderately_active | not_opted_in |
201 | un-subscribed | NULL |
301 | NULL | not_opted_in |
401 | not_active | very_active |
501 | very_active | very_active |
601 | NULL | NULL |
Before you could proceed with your analysis, you would need to remove the NULLs and replace them with the default email engagement value (not_active) and the default SMS engagement level (not_opted_in).
To do this, you'd run the following query:
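Assuming the table is named customers:

```sql
SELECT
  customer_id,
  COALESCE(email_engagement, 'not_active')  AS email_engagement,
  COALESCE(sms_engagement, 'not_opted_in')  AS sms_engagement
FROM customers;
```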
This would get you the following output:
customer_id | email_engagement | sms_engagement |
---|---|---|
101 | moderately_active | not_opted_in |
201 | un-subscribed | not_opted_in |
301 | not_active | not_opted_in |
401 | not_active | very_active |
501 | very_active | very_active |
601 | not_active | not_opted_in |
Procore provides cloud-based construction management software. Suppose the company tracks project details in a table, including the estimated and actual costs, and wants to know each project's cost deviation as a percentage. The cost deviation is (actual_cost - estimated_cost) / estimated_cost * 100. Write a SQL query to return the project name, the absolute value of the cost deviation rounded to two decimal places, and the square root of the actual cost (also rounded to two decimal places). Assume the estimated and actual costs are in millions.
project_id | project_name | estimated_cost | actual_cost |
---|---|---|---|
1 | Hotel Construction | 10 | 9.2 |
2 | Mall Construction | 7 | 8.2 |
3 | Bridge Repair | 5 | 6.2 |
4 | Restaurant Interior | 3 | 2.9 |
5 | Highway Expansion | 12 | 11.9 |
Expected output:

project_name | cost_deviation | sqrt_actual_cost |
---|---|---|
Hotel Construction | 8.00 | 3.03 |
Mall Construction | 17.14 | 2.86 |
Bridge Repair | 24.00 | 2.49 |
Restaurant Interior | 3.33 | 1.70 |
Highway Expansion | 0.83 | 3.45 |
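Assuming the table is named projects, here's a PostgreSQL sketch. The ::numeric casts keep ROUND(value, 2) working even if the cost columns are stored as floats:

```sql
SELECT
  project_name,
  ROUND(ABS((actual_cost - estimated_cost) / estimated_cost * 100)::numeric, 2) AS cost_deviation,
  ROUND(SQRT(actual_cost)::numeric, 2) AS sqrt_actual_cost
FROM projects;
```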
This SQL query first calculates the cost deviation by subtracting the estimated cost from the actual cost, dividing by the estimated cost, and multiplying by 100 to get a percentage. The ABS() function ensures the output is always positive, so the deviation is reported regardless of whether the project came in over or under budget. The query also fetches the square root of the actual cost. Both values are rounded to two decimal places for readability.
To practice a very similar question try this interactive Wayfair Y-on-Y Growth Rate Question which is similar for requiring numerical calculation on datasets or this McKinsey 3-Topping Pizzas Question which is similar for needing to perform operations on costs.
The key to acing a Procore SQL interview is to practice, practice, and then practice some more! Beyond just solving the above Procore SQL interview questions, you should also solve the 200+ SQL questions from real Data Science & Analytics interviews which come from companies like Google, Facebook, Microsoft and Amazon.
Each interview question has multiple hints, fully explained answers, and a discussion board to see how others solved it. Most importantly, there's an online SQL coding environment so you can write your query right in the browser and have it graded.
To prep for the Procore SQL interview, it can also be useful to practice SQL questions from other tech companies.
However, if your SQL skills are weak, forget about diving straight into solving questions – go learn SQL with this SQL tutorial for Data Scientists & Analysts.
This tutorial covers SQL topics like handling timestamps and aggregate functions like MIN()/MAX(), both of which pop up often in Procore SQL assessments.
In addition to SQL interview questions, the Procore Data Science interview covers several other topics as well.
The best way to prepare for Procore Data Science interviews is by reading Ace the Data Science Interview.