At Lyft, SQL is used day-to-day for analyzing rider behavior patterns to optimize routes and producing insights to streamline ride-hailing processes. Because of this, Lyft typically asks SQL coding questions during interviews for Data Science, Analytics, and & Data Engineering jobs.
As such, to help you ace the Lyft SQL interview, this blog covers 10 Lyft SQL interview questions – able to solve them?
Lyft wants to identify their "VIP" customers which are defined by the number of rides they have taken and the total value of those rides. Specifically, such a user has taken at least 50 rides and spent at least $500 in total. The analysis should be based on the last 6 months data.
ride_id | user_id | date | price |
---|---|---|---|
101 | 567 | 06/15/2022 | 20 |
102 | 890 | 06/17/2022 | 28 |
103 | 890 | 06/25/2022 | 22 |
104 | 567 | 07/24/2022 | 14 |
105 | 567 | 07/26/2022 | 36 |
user_id | name |
---|---|
567 | Alice |
890 | Bob |
The result should list VIP customer's user_id, name and total spending.
This query joins the rides and users tables based on user_id and filters for rides from the last six months. The VIP customers are identified if they satisfy both the thresholds: total rides >= 50 and total spent >= $500. The result is grouped by user_id and name to ensure each row represents a unique customer.
To solve a super-customer analysis question on DataLemur's free online SQL coding environment, try this Microsoft Azure Cloud SQL Interview Question:
Lyft has a database table where riders submit a review about a driver after each trip. It includes the , , , and (rating given by the rider to the driver). The column holds integers between 1 and 5. Your task is to write a SQL query that returns the average rating trusted drivers receive on a monthly basis. A driver is considered trusted if they have more than 100 reviews in total since they registered.
review_id | rider_id | driver_id | submit_date | stars |
---|---|---|---|---|
890 | 312 | 488 | 06/01/2022 00:00:00 | 5 |
891 | 289 | 488 | 06/03/2022 00:00:00 | 4 |
892 | 987 | 921 | 06/05/2022 00:00:00 | 3 |
893 | 123 | 488 | 06/02/2022 00:00:00 | 4 |
894 | 333 | 921 | 06/09/2022 00:00:00 | 5 |
895 | 848 | 488 | 07/02/2022 00:00:00 | 2 |
896 | 776 | 488 | 07/06/2022 00:00:00 | 5 |
897 | 234 | 921 | 07/12/2022 00:00:00 | 2 |
898 | 432 | 621 | 07/14/2022 00:00:00 | 4 |
899 | 555 | 621 | 07/17/2022 00:00:00 | 4 |
This query first creates a subquery to return a list of trusted driver IDs who have more than 100 reviews each. Then we use this list of trusted driver IDs to filter our main query. The main query groups the table by and , and for each group, it computes the average stars. This calculated field is labeled as . Finally, we sort the results by in ascending order and in descending order, so that the output starts with the earliest month and for each month, drivers with higher average ratings come first.
To practice a related window function SQL problem on DataLemur's free interactive coding environment, solve this Amazon SQL question asked in a BI Engineer interview:
Both the and window functions are used to access a row at a specific offset from the current row.
However, the function retrieves a value from a row that follows the current row, whereas the function retrieves a value from a row that precedes the current row.
Say you had a table of salary data for Lyft employees:
name | salary |
---|---|
Amanda | 130000 |
Brandon | 90000 |
Carlita | 80000 |
You could use the function to output the salary of each employee, along with the next highest-paid employee:
This would yield the following output:
name | salary | next_salary |
---|---|---|
Amanda | 130000 | 90000 |
Brandon | 90000 | 80000 |
Carlita | 80000 | NULL |
Swapping for would get you the salary of the person who made just more than you:
name | salary | next_salary |
---|---|---|
Amanda | 130000 | NULL |
Brandon | 90000 | 130000 |
Carlita | 80000 | 90000 |
As a database engineer at Lyft, your manager approached you to assist in understanding some key metrics about the rides happening through Lyft. The company keeps separate tables for drivers, riders, and rides. For rides, they keep track of the ride's ID, the rider's ID, the driver's ID, the start time of the ride, the end time of the ride, the pick-up location, the drop-off location, and the cost of the ride.
Your task is to design a query that will yield, for each driver, the total number of rides they’ve given and the average ride cost. Also, we need to know how many of these rides were over $20 for each particular driver.
driver_id | name |
---|---|
1 | Justin |
2 | Amanda |
3 | Ben |
ride_id | rider_id | driver_id | start_time | end_time | pick_up | drop_off | cost |
---|---|---|---|---|---|---|---|
2001 | 1001 | 1 | 06/08/2021 09:20:00 | 05/08/2021 09:50:00 | "Location-A" | "Location-B" | 25.0 |
2002 | 1002 | 2 | 06/08/2021 10:00:00 | 05/08/2021 11:00:00 | "Location-B" | "Location-C" | 15.0 |
2003 | 1002 | 1 | 06/08/2021 12:00:00 | 05/08/2021 12:30:00 | "Location-C" | "Location-A" | 22.0 |
2004 | 1003 | 1 | 06/08/2021 13:00:00 | 05/08/2021 13:30:00 | "Location-C" | "Location-D" | 30.0 |
2005 | 1004 | 3 | 06/08/2021 14:00:00 | 05/08/2021 14:30:00 | "Location-D" | "Location-A" | 18.0 |
This query will return a summary for each driver, including the total number of rides they've given, the average ride cost, and the number of rides over 20.
Stored procedures are like functions in Python – they can accept input params and return values, and are used to encapsulate complex logic.
For example, if you worked as a Data Analyst in support of the Marketing Analytics team at Lyft, a common task might be to find the conversion rate for your ads given a specific time-frame. Instead of having to write this query over-and-over again, you could write a stored procedure like the following:
To call this stored procedure, you'd execute the following query:
Lyft wants to keep a close eye on the average cost of their trips for each city where they operate. They have a table that keeps track of all individual rides, recording the cost for each ride along with the city where the ride occurred. Given this data, can you write a query to find the average cost of trips in each city?
ride_id | city | trip_cost |
---|---|---|
001 | Rome | 14 |
002 | Paris | 18 |
003 | Rome | 20 |
004 | Berlin | 24 |
005 | Rome | 12 |
006 | Paris | 22 |
007 | Berlin | 30 |
008 | Berlin | 28 |
This SQL query groups the data by the city and calculates the average cost for each group (city). The function AVG() is used to calculate the average cost of trips within each group. The GROUP BY statement is used to arrange the data into groups defined by the column specified, in this case, the city. The average fare for each city is calculated and the result set displays the average trip cost for each city where Lyft operates.
Both left and right joins in SQL allow you to combine data from different tables based on a shared key or set of keys. For a concrete example of the difference between these two join types, say you had sales data exported from Lyft's Salesforce CRM stored in a PostgreSQL database, and had access to two tables: and .
: retrieves all rows from the left table (in this case, the sales table) and any matching rows from the right table (the table). If there is no match in the right table, NULL values will be returned for the right table's columns.
Lyft maintains a database of its users and their associated rides. As a data analyst at Lyft, you are asked to find all the users whose first name starts with 'J' and who have taken a ride in 'San Francisco'. Your task is to write a SQL query to filter these user records from the database.
user_id | first_name | last_name | |
---|---|---|---|
123 | John | Doe | john.doe@lyft.com |
265 | Jane | Smith | jane.smith@lyft.com |
362 | Jack | Taylor | jack.taylor@lyft.com |
192 | Jennifer | Johnson | jennifer.johnson@lyft.com |
981 | Alice | Kim | alice.kim@lyft.com |
ride_id | user_id | city | ride_date |
---|---|---|---|
101 | 123 | San Francisco | 06/08/2022 00:00:00 |
102 | 265 | San Francisco | 06/10/2022 00:00:00 |
103 | 362 | New York | 06/18/2022 00:00:00 |
104 | 192 | San Francisco | 07/26/2022 00:00:00 |
105 | 981 | Los Angeles | 07/05/2022 00:00:00 |
This query first performs an inner join operation between the users and rides tables based on the user_id. The LIKE keyword is then used to filter the user records with 'first_name' starting with 'J'. Furthermore, only those records are considered where the 'city' is 'San Francisco'.
Lyft wants to evaluate its drivers' performance based on the trip completion duration and the tips received per trip. Suppose you are a data scientist and you are tasked with calculating the average absolute tip amount, the square root of the total trips completed by each driver, and the rounded ratio of tip received and the trip duration. Also calculate the total amount of power(2, tips) the driver received. Here is the trips data:
trip_id | driver_id | trip_duration_minutes | tip_amount |
---|---|---|---|
1001 | 1 | 30 | 5 |
1002 | 1 | 45 | 10 |
1003 | 2 | 60 | 7 |
1004 | 2 | 20 | 4 |
1005 | 3 | 50 | 9 |
driver_id | avg_abs_tip | sqrt_total_trips | rounded_tip_duration_ratio | pow_of_tips |
---|---|---|---|---|
1 | 7.5 | 1.4 | 0.27 | 525 |
2 | 5.5 | 1.4 | 0.18 | 121 |
3 | 9.0 | 1.0 | 0.18 | 81 |
You assume that all columns in the input are not null.
In this SQL script, we're using a couple of mathematical functions. Here's what they do:
The script averages the absolute tips, calculates the square root of the total trips a driver has made, computes the rounded ratio of the tips received and the trips duration, sums the power of 2 raised to tips for each driver. It'll give us insights into the driver's performance.
To practice a very similar question try this interactive Amazon Highest-Grossing Items Question which is similar for calculating performance metrics or this Twitter Tweets' Rolling Averages Question which is similar for performing calculations over time.
Note: interviews at Lyft often aren't trying to test you on a specific flavor of SQL. As such, you don't need to exactly know that is available in PostgreSQL and SQL Server, while MINUS is available in MySQL and Oracle – you just need to know the general concept!
Your answer should mention that the / operator is used to remove to return all rows from the first SELECT statement that are not returned by the second SELECT statement.
Here's a PostgreSQL example of using EXCEPT to find all of Lyft's Facebook video ads with more than 50k views that aren't also being run on YouTube:
If you want to retain duplicates, you can use the EXCEPT ALL operator instead of EXCEPT. The EXCEPT ALL operator will return all rows, including duplicates.
Assuming that you've already got basic SQL skills, the next best tip we have to prepare for the Lyft SQL interview is to solve as many practice SQL interview questions as you can! Beyond just solving the earlier Lyft SQL interview questions, you should also solve the 200+ SQL questions from real Data Science & Analytics interviews which come from companies like Google, Uber, and Microsoft.
Each problem on DataLemur has hints to guide you, full answers and most importantly, there's an online SQL code editor so you can right in the browser run your SQL query answer and have it executed.
To prep for the Lyft SQL interview it is also wise to solve interview questions from other tech companies like:
In case your SQL skills are weak, forget about diving straight into solving questions – go learn SQL with this free SQL for Data Analytics course.
This tutorial covers SQL concepts such as UNION and joining a table to itself – both of which show up often in Lyft SQL assessments.
Besides SQL interview questions, the other types of questions to practice for the Lyft Data Science Interview are:
To prepare for Lyft Data Science interviews read the book Ace the Data Science Interview because it's got: