Data Science, Data Engineering and Data Analytics employees at PwC use SQL to analyze large datasets of financial transactions and accounting records for identifying anomalies and trends. It is also used to create complex financial models that forecast revenue and expenses for consulting engagements, which is why PwC uses SQL questions during job interviews.
So, to help you practice for the PwC SQL interview, we've curated 8 PricewaterhouseCoopers SQL interview questions – can you solve them?
PwC is a global network of firms delivering assurance, tax and consulting services for your business. Let's consider a scenario where you have been given a dataset for PwC employees with their Department details, Id, Name, Salary etc.
The data schema for employee
table is as follows:
employee_id | first_name | last_name | department | salary |
---|---|---|---|---|
101 | John | Doe | IT | 50000 |
102 | Jane | Smith | Accounting | 60000 |
103 | Mary | Johnson | Marketing | 55000 |
104 | James | Brown | IT | 70000 |
105 | Patricia | Jones | Accounting | 65000 |
The task is to write a PostgreSQL query that ranks these employees within their respective departments based on their salary, descending (i.e., the employee with the highest salary in a department should have a rank of 1).
In PostgreSQL, you can use the RANK() function which is a part of window functions to solve this problem.
SELECT employee_id, first_name, last_name, department, salary, RANK () OVER ( PARTITION BY department ORDER BY salary DESC ) AS rank FROM employee;
This PostgreSQL query uses the RANK()
window function to return a unique rank for each row within the partition of resultset. The PARTITION BY
clause is used to divide the entire result set into partitions (in this case, 'departments'). For each partition, the rank is calculated and presented. The ORDER BY
clause is used to arrange the salary column in descending format to ensure that the employee having highest salary gets the highest rank i.e., 1.
Please note that in case of a "tie" (same salary), the same rank is assigned to both tied entries, skipping the next rank. If you would prefer not to skip the next rank in case of ties, consider using the DENSE_RANK()
function instead.
p.s. Window functions show up pretty frequently during SQL interviews, so practice the 27+ window function questions on DataLemur
Given a table of PwC employee salary information, write a SQL query to find the top 3 highest paid employees in each department.
employees
Example Input:employee_id | name | salary | department_id |
---|---|---|---|
1 | Emma Thompson | 3800 | 1 |
2 | Daniel Rodriguez | 2230 | 1 |
3 | Olivia Smith | 2000 | 1 |
4 | Noah Johnson | 6800 | 2 |
5 | Sophia Martinez | 1750 | 1 |
8 | William Davis | 6800 | 2 |
10 | James Anderson | 4000 | 1 |
department
Example Input:department_id | department_name |
---|---|
1 | Data Analytics |
2 | Data Science |
department_name | name | salary |
---|---|---|
Data Analytics | James Anderson | 4000 |
Data Analytics | Emma Thompson | 3800 |
Data Analytics | Daniel Rodriguez | 2230 |
Data Science | Noah Johnson | 6800 |
Data Science | William Davis | 6800 |
Check your SQL query for this question and run your code right in DataLemur's online SQL environment:
We use the DENSE_RANK() window function to generate unique ranks for each employee's salary within their department, with higher salaries receiving lower ranks. Then, we wrap this up in a CTE and filter the employees with a ranking of 3 or lower.
WITH ranked_salary AS ( SELECT name, salary, department_id, DENSE_RANK() OVER ( PARTITION BY department_id ORDER BY salary DESC) AS ranking FROM employee ) SELECT d.department_name, rs.name, rs.salary FROM ranked_salary AS rs INNER JOIN department AS d ON rs.department_id = d.department_id WHERE rs.ranking <= 3 ORDER BY d.department_id, rs.salary DESC, rs.name ASC;
If the solution above is hard to understand, you can find a detailed solution here: Top 3 Department Salaries.
To find records in one table that aren't in another, you can use a LEFT JOIN
and check for NULL
values in the right-side table.
Here is an example using two tables, PwC employees and PwC managers:
SELECT * FROM pwc_employees LEFT JOIN pwc_managers ON pwc_employees.id = pwc_managers.id WHERE pwc_managers.id IS NULL;
This will return all rows from PwC employees where there is no matching row in managers based on the id
column.
You can also use the EXCEPT
operator in PostgreSQL and Microsoft SQL Server to return the records that are in the first table but not in the second. Here is an example:
SELECT * FROM pwc_employees EXCEPT SELECT * FROM pwc_managers
This will retrieve all rows from employees that do not appear in managers. The EXCEPT
operator works by retreivingthe rows that are returned by the first query, but not by the second.
Please note that EXCEPT
is not supported by all DBMS systems, such as MySQL and Oracle (however, you can use the MINUS
operator to achieve a similar outcome).
As a part of ongoing efforts at PwC to understand customer behavior and improve services, you've been tasked to filter and analyze the customer data. Write a SQL query that filters down the customers
database to only show records of customers who are from the "Information Technology" industry, are located in "New York" and have had a total revenue of over "50000" in the past year.
customers
Example Input:customer_id | industry | location | revenue_last_year |
---|---|---|---|
5690 | "Information Technology" | "New York" | 60000 |
8472 | "Healthcare" | "Los Angeles" | 75000 |
4926 | "Information Technology" | "San Francisco" | 70000 |
7643 | "Information Technology" | "New York" | 50000 |
3819 | "Manufacturing" | "New York" | 80000 |
9584 | "Information Technology" | "New York" | 65000 |
customers
Example Output:customer_id | industry | location | revenue_last_year |
---|---|---|---|
5690 | "Information Technology" | "New York" | 60000 |
9584 | "Information Technology" | "New York" | 65000 |
SELECT * FROM customers WHERE industry = 'Information Technology' AND location = 'New York' AND revenue_last_year > 50000;
This query uses the WHERE
clause to filter the customers based on boolean conditions. It checks three conditions - industry
being 'Information Technology', location
being 'New York' and revenue_last_year
being more than 50000. The AND
operator is used to ensure that all three conditions must be true for a record to be included in the query result.
The three levels of database normalization (also known as normal forms) are:
First Normal Form (1NF):
Second Normal Form (2NF)
Said another way, to achieve 2NF, besides following the 1NF criteria, the columns should also be dependent only on that table's primary key.
Third normal form (3NF) if it meets the following criteria:
A transitive dependency means values in one column are derived from data in another column. For example, it wouldn't make sense to keep a column called "vehicle's age" and "vehicle's manufacture date" (because the vehicle's age can be derived from the date it was manufactured).
Although there is a 4NF and 5NF, you probably don't need to know it for the PwC interview.
PwC wants to assess the effectiveness of their digital marketing campaigns. They are particularly interested in the click-through-rates (CTR) of their advertisements. Each time an advertisement is displayed it is considered as an "impression". Every time that advertisement is clicked, it is recorded as a "click".
Calculate the CTR for each advertisement for the month of August, where CTR is defined as the total number of clicks divided by the total number of impressions, multiplied by 100 to get it in percentage.
impressions
Example Input:impression_id | ad_id | impression_date |
---|---|---|
1 | 401 | 08/01/2022 00:00:00 |
2 | 402 | 08/01/2022 00:00:00 |
3 | 401 | 08/02/2022 00:00:00 |
4 | 403 | 08/02/2022 00:00:00 |
5 | 402 | 08/03/2022 00:00:00 |
clicks
Example Input:click_id | ad_id | click_date |
---|---|---|
1 | 401 | 08/01/2022 00:00:00 |
2 | 402 | 08/01/2022 00:00:00 |
3 | 401 | 08/01/2022 00:00:00 |
4 | 403 | 08/03/2022 00:00:00 |
SELECT i.ad_id, COUNT(c.click_id) * 100.0 / COUNT(i.impression_id) as ctr FROM impressions i LEFT JOIN clicks c ON i.ad_id = c.ad_id and date(c.click_date) = date(i.impression_date) WHERE date(i.impression_date) BETWEEN '2022-08-01' AND '2022-08-31' GROUP BY i.ad_id;
In this query, we join the two tables on the ad_id, making sure to also join on the date in case a click happens on a different day than the impression. We then group by the ad_id to get the total impressions and clicks for each ad. From this, we calculate the CTR by dividing the number of clicks by the number of impressions and multiplying by 100 to get the percentage.
To practice a similar SQL problem on DataLemur's free interactive coding environment, try this Meta SQL interview question:
While both the BETWEEN
and IN
operators are used to filter data based on some criteria, BETWEEN
selects for values within a given range, whereas for IN
it checks if the value is in a given list of values.
For example, say you had a table called pwc_employees
, which had the salary of the employee, along with the country in which they reside.
To find all employees who made between 120k, you could use the BETWEEN
operator:
SELECT * FROM pwc_employees WHERE salary BETWEEN 80000 AND 120000;
To find all employees that reside in the US or Canada, you could use the IN
operator:
SELECT * FROM pwc_employees WHERE country IN ("USA", "Canada");
Within PwC, they have multiple clients who have different email domains. They store this data within their customer database. They want to write a SQL query which allows them to filter the customer records on the basis of the email domain. The task is to write a PostgreSQL query that returns all customers whose email domain is "hotmail.com".
customer_records
Example Input:customer_id | first_name | last_name | email_address |
---|---|---|---|
1 | William | Smith | william_smith@gmail.com |
2 | Joshua | Brown | joshua_brown@hotmail.com |
3 | Emma | Jones | emma.jones@yahoo.com |
4 | Ava | Evans | ava.evans@hotmail.com |
5 | Sophia | Wright | sophia_wright@gmail.com |
customer_id | first_name | last_name | email_address |
---|---|---|---|
2 | Joshua | Brown | joshua_brown@hotmail.com |
4 | Ava | Evans | ava.evans@hotmail.com |
SELECT * FROM customer_records WHERE email_address LIKE '%hotmail.com';
This query returns all the records from the customer_records
table where the email_address
ends with "hotmail.com". The %
symbol in the LIKE
clause is a wildcard that matches any sequence of characters, ensuring we select all email addresses that end with "hotmail.com", regardless of what comes before it.
The key to acing a PwC SQL interview is to practice, practice, and then practice some more! In addition to solving the above PwC SQL interview questions, you should also solve the 200+ FAANG SQL Questions on DataLemur which come from companies like Google, Uber, and Microsoft.
Each exercise has hints to guide you, full answers and best of all, there's an online SQL coding environment so you can instantly run your SQL query and have it checked.
To prep for the PwC SQL interview you can also be useful to solve SQL questions from other accounting & consulting companies like:
Discover how PwC is using Artificial Intelligence to help organizations solve complex problems and make better decisions!
But if your SQL skills are weak, don't worry about jumping right into solving questions – improve your SQL foundations with this free SQL for Data Analytics course.
This tutorial covers things like filtering on multiple conditions using AND/OR/NOT and SQL joins with practice exercises – both of which show up frequently during PwC interviews.
In addition to SQL query questions, the other types of problems to prepare for the PwC Data Science Interview include:
To prepare for PwC Data Science interviews read the book Ace the Data Science Interview because it's got:
Also focus on the behavioral interview – prepare for it with this list of common Data Scientist behavioral interview questions.