Uber manages a fleet of Cassandra database servers which store petabytes of data, and handle millions of queries per second. To manage this database fleet, and extract insights from their data, Uber uses SQL across every business unit.
That's why Uber evaluates jobseekers on SQL coding questions during interviews for Uber Data Analyst, Uber Business Analyst, Data Science, and BI jobs. For example, one Redditor said they applied for a Data Science role, and it was "3 medium level SQL heavily focused on window functions (PARTITION OVER/LEAD/LAG)."
So, if you're studying for a SQL assessment, practice these 10 real SQL interview questions recently asked at Uber – how many can you solve?
Assume you are given the table below on Uber transactions made by users. Write a query to obtain the third transaction of every user. Output the user id, spend and transaction date.
Column Name | Type |
---|---|
user_id | integer |
spend | decimal |
transaction_date | timestamp |
Example Input:
user_id | spend | transaction_date |
---|---|---|
111 | 100.50 | 01/08/2022 12:00:00 |
111 | 55.00 | 01/10/2022 12:00:00 |
121 | 36.00 | 01/18/2022 12:00:00 |
145 | 24.99 | 01/26/2022 12:00:00 |
111 | 89.60 | 02/05/2022 12:00:00 |
Example Output:
user_id | spend | transaction_date |
---|---|---|
111 | 89.60 | 02/05/2022 12:00:00 |
To practice this question interactively, and run SQL queries against this Uber problem directly in the browser, go here:
First, we obtain the order of transactions for each user. We can do this with the ROW_NUMBER() window function, partitioning all the transactions by user_id and ordering each partition by transaction_date.
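As a sketch of that first step (the table name isn't given in the problem statement, so assume it's called `transactions`):

```sql
SELECT
  user_id,
  spend,
  transaction_date,
  -- number each user's transactions in date order
  ROW_NUMBER() OVER (
    PARTITION BY user_id
    ORDER BY transaction_date) AS row_num
FROM transactions;
```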
Here's what the first 5 rows of output look like:
user_id | spend | transaction_date | row_num |
---|---|---|---|
111 | 100.50 | 01/08/2022 12:00:00 | 1 |
111 | 55.00 | 01/10/2022 12:00:00 | 2 |
111 | 89.60 | 02/05/2022 12:00:00 | 3 |
121 | 36.00 | 01/18/2022 12:00:00 | 1 |
121 | 22.20 | 04/01/2022 12:00:00 | 2 |
From there, we can convert the query into a subquery and filter for each user's third transaction – that is, the rows where the row number equals 3.
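Putting it together (again assuming a hypothetical `transactions` table name):

```sql
SELECT user_id, spend, transaction_date
FROM (
  SELECT
    user_id,
    spend,
    transaction_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date) AS row_num
  FROM transactions) AS numbered_transactions
-- keep only each user's third transaction
WHERE row_num = 3;
```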
If you aren't familiar with the ROW_NUMBER() window function, see this tutorial, or try solving the many window function questions on DataLemur:
As a data analyst at Uber, it's your job to report the latest metrics for specific groups of Uber users. Some riders create their Uber account the same day they book their first ride; the rider engagement team calls them "in-the-moment" users.
Uber wants to know the average delay between the day of user sign-up and the day of their 2nd ride. Write a query to pull the average 2nd ride delay for "in-the-moment" Uber users. Round the answer to 2-decimal places.
Column Name | Type |
---|---|
user_id | integer |
registration_date | date |
user_id | registration_date |
---|---|
1 | 08/15/2022 |
2 | 08/21/2022 |
Column Name | Type |
---|---|
ride_id | integer |
user_id | integer |
ride_date | date |
ride_id | user_id | ride_date |
---|---|---|
1 | 1 | 08/15/2022 |
2 | 1 | 08/16/2022 |
3 | 2 | 09/20/2022 |
4 | 2 | 09/23/2022 |
Example Output:
average_delay |
---|
1.00 |
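One way to approach it (the table names aren't shown in the problem, so assume they're called `users` and `rides`): number each user's rides in date order, keep only users whose first ride lands on their registration date, then average the gap to their second ride.

```sql
WITH ranked_rides AS (
  SELECT
    user_id,
    ride_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY ride_date) AS ride_num
  FROM rides
),
in_the_moment AS (
  -- users whose first ride happened on their sign-up day
  SELECT u.user_id, u.registration_date
  FROM users u
  JOIN ranked_rides r
    ON u.user_id = r.user_id
   AND r.ride_num = 1
   AND r.ride_date = u.registration_date
)
SELECT
  -- date subtraction gives days in PostgreSQL
  ROUND(AVG(r.ride_date - i.registration_date), 2) AS average_delay
FROM in_the_moment i
JOIN ranked_rides r
  ON i.user_id = r.user_id
 AND r.ride_num = 2;
```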
To solve this question on DataLemur, go here: Uber SQL Interview Question.
Denormalization is a technique used to improve the read performance of a database, typically at the expense of some write performance.
By adding redundant copies of data or grouping data together in a way that does not follow normalization rules, denormalization improves the performance and scalability of a database by eliminating costly join operations, which is important for OLAP use cases that are read-heavy and have minimal updates/inserts.
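As a hedged illustration with made-up `orders` and `customers` tables: instead of joining on every read, you can copy the customer's name onto each order row.

```sql
-- Normalized: every read of orders with customer names needs a join
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Denormalized: customer_name is duplicated onto orders, trading extra
-- storage and write-time upkeep for join-free reads
ALTER TABLE orders ADD COLUMN customer_name TEXT;

UPDATE orders o
SET customer_name = c.customer_name
FROM customers c
WHERE o.customer_id = c.customer_id;

-- Reads no longer touch the customers table at all
SELECT order_id, customer_name FROM orders;
```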
Uber has a diverse range of vehicles, from bikes and scooters to premium luxury cars. To tailor its services better, Uber wants to understand its customers' preferences. The task is to write a SQL query that finds the most used vehicle type among Uber's customers in the past year. The results should exclude rides that were cancelled by either the driver or the user.
The database contains two tables: one with a record of all rides that have taken place, and one with information about the various vehicle types Uber offers.
ride_id | user_id | vehicle_type_id | start_time | end_time | cancelled |
---|---|---|---|---|---|
88031 | 61023 | 5 | 2021-07-01 08:15:00 | 2021-07-01 08:45:00 | false |
88032 | 61024 | 1 | 2021-07-01 09:15:00 | 2021-07-01 09:45:00 | false |
88033 | 61025 | 2 | 2021-07-01 10:15:00 | 2021-07-01 10:45:00 | true |
88034 | 61026 | 5 | 2021-07-01 11:15:00 | 2021-07-01 11:45:00 | false |
88035 | 61027 | 3 | 2021-07-01 12:15:00 | 2021-07-01 12:45:00 | false |
type_id | vehicle_type |
---|---|
1 | Bike |
2 | Car |
3 | SUV |
4 | Luxury Car |
5 | Scooter |
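A possible solution, assuming the tables are named `rides` and `vehicle_types` (the names aren't given in the problem):

```sql
SELECT
  vt.vehicle_type,
  COUNT(*) AS total_rides
FROM rides r
JOIN vehicle_types vt
  ON r.vehicle_type_id = vt.type_id
WHERE r.cancelled = false                            -- completed rides only
  AND r.start_time >= CURRENT_DATE - INTERVAL '1 year'  -- past year only
GROUP BY vt.vehicle_type
ORDER BY total_rides DESC
LIMIT 1;  -- keep just the single most used vehicle type
```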
The approach: join the ride data with the vehicle type data on the vehicle type id, then filter to only completed rides (where cancelled is false) from the past year. The result is grouped by vehicle type and the total rides for each type are counted. Finally, we order the result in descending order by ride count and limit the output to the single most used vehicle type.
To solve a similar problem about calculating rates, try this SQL interview question from TikTok within DataLemur's interactive SQL code editor:
Normalization involves breaking up your tables into smaller, more specialized ones and using primary and foreign keys to define relationships between them. Not only does this make your database more flexible and scalable, it also makes it easier to maintain. Plus, normalization helps to keep your data accurate by reducing the chance of inconsistencies and errors.
The only downside is that your queries will now involve more joins, which are slow AF and often a DB performance bottleneck.
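A quick sketch of what normalization looks like in practice, using hypothetical `rides` and `cities` tables: city details are stored once and referenced by a foreign key, rather than repeated on every ride row.

```sql
-- Normalized: city info lives in one place, rides reference it
CREATE TABLE cities (
  city_id       SERIAL PRIMARY KEY,
  city_name     TEXT,
  city_timezone TEXT
);

CREATE TABLE rides (
  ride_id SERIAL PRIMARY KEY,
  city_id INT REFERENCES cities (city_id),  -- foreign key to cities
  fare    DECIMAL
);

-- The trade-off: reads now need a join
SELECT r.ride_id, c.city_name
FROM rides r
JOIN cities c ON r.city_id = c.city_id;
```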
Assume that you are a data analyst at Uber. The marketing team runs several digital ads attracting users to book a cab. You are required to analyze the click-through conversion rates, i.e., from viewing an ad to booking a cab.
You have two tables: one recording ad clicks and one recording cab bookings.
click_id | user_id | ad_id | click_date |
---|---|---|---|
101 | 7 | 500 | 07/01/2022 00:00:00 |
102 | 8 | 600 | 07/02/2022 00:00:00 |
103 | 7 | 500 | 07/03/2022 00:00:00 |
104 | 8 | 600 | 07/04/2022 00:00:00 |
105 | 9 | 700 | 07/05/2022 00:00:00 |
booking_id | user_id | booking_date |
---|---|---|
201 | 7 | 07/01/2022 00:00:00 |
202 | 7 | 07/03/2022 00:00:00 |
203 | 9 | 07/05/2022 00:00:00 |
204 | 10 | 07/06/2022 00:00:00 |
The task is to find the click-through conversion rate for each ad: the number of bookings made after viewing an ad, divided by the total number of clicks for that ad.
In PostgreSQL, you can use the following query:
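A sketch, assuming the tables are named `clicks` and `bookings` (the names aren't shown in the problem):

```sql
SELECT
  c.ad_id,
  -- cast to DECIMAL so the division isn't integer division
  COUNT(DISTINCT b.booking_id)::DECIMAL
    / COUNT(DISTINCT c.click_id) AS conversion_rate
FROM clicks c
LEFT JOIN bookings b
  ON c.user_id = b.user_id
 AND c.click_date <= b.booking_date  -- booking happened after the click
GROUP BY c.ad_id;
```

The LEFT JOIN keeps ads whose clicks never converted (their rate comes out as 0), and the DISTINCT counts guard against double-counting when a user has multiple matching clicks or bookings.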
The query joins the ad-click data with the booking data on the user id, with the condition that the ad click occurred on or before the booking. It then groups by ad and calculates the conversion rate as the count of bookings divided by the count of clicks. We typecast to DECIMAL to get fractional results instead of integer division.
Please note that this logic assumes that no factors other than seeing these ads influence a user's decision to book a cab.
To solve a related CTR problem on DataLemur's free interactive SQL code editor, try this SQL interview question asked by Facebook:
A self-join is a type of join in which a table is joined to itself. To perform a self-join, you need to specify the table name twice in the FROM clause, and give each instance of the table a different alias. You can then join the two instances of the table using a JOIN clause, and use a WHERE clause to specify the relationship between the rows.
Self-joins are the go-to technique for any data analysis that involves pairs of the same thing, like identifying pairs of products that are frequently purchased together, as in this Walmart SQL interview question.
For another example, say you were doing an HR analytics project and needed to analyze how much all Uber employees in the same department interact with each other. Here's a self-join query you could use to retrieve all pairs of Uber employees who work in the same department:
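A sketch of that self-join, assuming a hypothetical `employees` table with `employee_id`, `name`, and `department` columns:

```sql
SELECT
  e1.name AS employee_1,
  e2.name AS employee_2,
  e1.department
FROM employees e1
JOIN employees e2
  ON e1.department = e2.department      -- same department
 AND e1.employee_id <> e2.employee_id;  -- exclude self-pairings
```

If you only want each pair once (not both "Alice, Bob" and "Bob, Alice"), swap `<>` for `<` in the id comparison.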
This query returns all pairs of Uber employees who work in the same department, and excludes pairs where the employee's id is the same (since this would represent the same Uber employee being paired with themselves).
As a data analyst for Uber, you are asked to determine each driver's average ratings for each city. This will help Uber to monitor performance and perhaps highlight any problems that might be arising in any specific city.
We have two tables: one with ride details and one with ride ratings.
ride_id | driver_id | city | fare_amount |
---|---|---|---|
101 | 201 | New York | 25.50 |
102 | 202 | San Francisco | 18.00 |
103 | 203 | Chicago | 22.75 |
104 | 201 | San Francisco | 30.00 |
105 | 202 | New York | 20.00 |
ride_id | rating |
---|---|
101 | 4.3 |
102 | 4.1 |
103 | 4.8 |
104 | 4.7 |
105 | 3.9 |
Your task is to write a SQL query that will create a summary table with each driver's average ratings for each city.
Example Output:
driver_id | city | avg_rating |
---|---|---|
201 | New York | 4.3 |
202 | San Francisco | 4.1 |
203 | Chicago | 4.8 |
201 | San Francisco | 4.7 |
202 | New York | 3.9 |
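One way to write it, assuming the tables are named `rides` and `ratings` (the names aren't given in the problem):

```sql
SELECT
  r.driver_id,
  r.city,
  AVG(rt.rating) AS avg_rating
FROM rides r
JOIN ratings rt
  ON r.ride_id = rt.ride_id   -- match each ride to its rating
GROUP BY r.driver_id, r.city; -- one row per driver per city
```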
This query joins the ride data with the ratings data on the ride id. It then groups the result by driver and city, applying the AVG function to the rating column. This gives the average rating for each driver in each city.
As an SQL analyst at Uber, you are asked to filter out the customers who have registered using their Gmail IDs. You are given a table named 'users' whose records contain multiple email domains. You need to write an SQL query that returns only those records where the 'email' field contains 'gmail.com'.
Example Input:
user_id | full_name | registration_date | email |
---|---|---|---|
7162 | John Doe | 05/04/2019 | johndoe@gmail.com |
7625 | Jane Smith | 11/09/2020 | janesmith@yahoo.com |
5273 | Steve Johnson | 06/20/2018 | stevejohnson@gmail.com |
6322 | Emily Davis | 08/14/2021 | emilydavis@hotmail.com |
4812 | Olivia Brown | 09/30/2019 | oliviabrown@gmail.com |
Example Output:
user_id | full_name | registration_date | email |
---|---|---|---|
7162 | John Doe | 05/04/2019 | johndoe@gmail.com |
5273 | Steve Johnson | 06/20/2018 | stevejohnson@gmail.com |
4812 | Olivia Brown | 09/30/2019 | oliviabrown@gmail.com |
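Here's one way to write it, using the 'users' table from the problem:

```sql
SELECT user_id, full_name, registration_date, email
FROM users
WHERE email LIKE '%gmail.com';  -- matches any address ending in gmail.com
```

If you wanted to be stricter, `LIKE '%@gmail.com'` would avoid matching domains that merely end in "gmail.com", like "notgmail.com".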
This query uses the LIKE operator in SQL with the wildcard character '%' to filter all records where the 'email' field contains 'gmail.com'. The '%' character is a wildcard in SQL that matches any sequence of characters. When used in conjunction with the LIKE operator, it allows you to perform pattern matching on your data. In this case, we're interested in any email addresses that end with 'gmail.com', regardless of what characters precede it.
A Data Engineer interviewing at Uber was asked about the most important tables when it comes to Uber's ride-sharing service. Specifically, they were asked:
In an interview, you'll want to draw an entity-relationship-diagram (ERD) similar to this:
For a more comprehensive solution, check out this Uber/Lyft System Design article, which goes more in-depth into this!
The key to acing an Uber SQL interview is to practice, practice, and then practice some more! In addition to solving the above Uber SQL interview questions, you should also solve the 200+ SQL questions from real Data Science & Analytics interviews at FAANG companies (FB, Apple, Amazon, Netflix, Google).
Each DataLemur SQL question has multiple hints, detailed solutions and, best of all, an online SQL coding environment, so you can write your SQL query right in the browser and have it executed.
To prep for the Uber SQL interview it is also a great idea to practice interview questions from other tech companies like:
But if your SQL query skills are weak, forget about going right into solving questions – improve your SQL foundations with this SQL interview tutorial.
This tutorial covers things like CASE/WHEN/ELSE statements and turning a subquery into a CTE – both of which show up routinely during Uber SQL assessments.
For the Uber Data Science Interview, in addition to SQL query questions, here are the other types of questions to practice:
To prepare for Uber Data Science interviews read the book Ace the Data Science Interview because it's got:
Here's a testimonial from someone who actually read the book and got an Uber Data Science job in 2022: