At Spotify, SQL is used across the company for analyzing user listening behaviors for custom playlist creation and managing databases to optimize server performance during peak streaming hours. Unsurprisingly this is why Spotify asks SQL query questions during interviews for Data Analytics, Data Science, and Data Engineering jobs.
In case you're preparing for a SQL Assessment, here’s 9 Spotify SQL interview questions to practice, which are similar to recently asked questions at Spotify – can you solve them?
Sure, here is a SQL interview question potentially suitable for Spotify:
Assume there are three Spotify tables: artists
, songs
, and global_song_rank
, which contain information about the artists, songs, and music charts, respectively.
Write a query to find the top 5 artists whose songs appear most frequently in the Top 10 of the global_song_rank
table. Display the top 5 artist names in ascending order, along with their song appearance ranking.
If two or more artists have the same number of song appearances, they should be assigned the same ranking, and the rank numbers should be continuous (i.e. 1, 2, 2, 3, 4, 5). If you've never seen a rank order like this before, do the rank window function tutorial.
artists
Table:Column Name | Type |
---|---|
artist_id | integer |
artist_name | varchar |
label_owner | varchar |
artists
Example Input:artist_id | artist_name | label_owner |
---|---|---|
101 | Ed Sheeran | Warner Music Group |
120 | Drake | Warner Music Group |
125 | Bad Bunny | Rimas Entertainment |
songs
Table:Column Name | Type |
---|---|
song_id | integer |
artist_id | integer |
name | varchar |
songs
Example Input:song_id | artist_id | name |
---|---|---|
55511 | 101 | Perfect |
45202 | 101 | Shape of You |
22222 | 120 | One Dance |
19960 | 120 | Hotline Bling |
global_song_rank
Table:Column Name | Type |
---|---|
day | integer (1-52) |
song_id | integer |
rank | integer (1-1,000,000) |
global_song_rank
Example Input:day | song_id | rank |
---|---|---|
1 | 45202 | 5 |
3 | 45202 | 2 |
1 | 19960 | 3 |
9 | 19960 | 15 |
artist_name | artist_rank |
---|---|
Ed Sheeran | 1 |
Drake | 2 |
WITH top_10_cte AS ( SELECT artists.artist_name, DENSE_RANK() OVER ( ORDER BY COUNT(songs.song_id) DESC) AS artist_rank FROM artists INNER JOIN songs ON artists.artist_id = songs.artist_id INNER JOIN global_song_rank AS ranking ON songs.song_id = ranking.song_id WHERE ranking.rank <= 10 GROUP BY artists.artist_name ) SELECT artist_name, artist_rank FROM top_10_cte WHERE artist_rank <= 5;
To try this question out for yourself for FREE go here, Spotify SQL Interview Question.
Question Description: Let's assume you are a Data Analyst at Spotify. You are given a data table named artist_listens
containing daily listening counts for different artists. The table has three columns: artist_id
, listen_date
, and daily_listens
.
You are required to write a SQL query to calculate the 7-day rolling average of daily listens for each artist. The rolling average should be calculated for each day for each artist based on the previous 7 days (including the current day).
artist_listens
Example Input:artist_id | listen_date | daily_listens |
---|---|---|
1 | 2022-06-01 | 15000 |
1 | 2022-06-02 | 21000 |
1 | 2022-06-03 | 17000 |
2 | 2022-06-01 | 25000 |
2 | 2022-06-02 | 27000 |
2 | 2022-06-03 | 29000 |
Notice that the listen_date
column is date formatted.
artist_id | listen_date | rolling_avg_listens |
---|---|---|
1 | 2022-06-01 | 15000.00 |
1 | 2022-06-02 | 18000.00 |
1 | 2022-06-03 | 17666.67 |
2 | 2022-06-01 | 25000.00 |
2 | 2022-06-02 | 26000.00 |
2 | 2022-06-03 | 27000.00 |
Please pay attention to the rounding in the result.
Answer:
SELECT artist_id, listen_date, AVG(daily_listens) OVER ( PARTITION BY artist_id ORDER BY listen_date RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW ) AS rolling_avg_listens FROM artist_listens ORDER BY artist_id, listen_date;
Explanation:
This SQL query computes the rolling average by using a window function AVG with a window frame defined as "the previous 6 days plus the current row".
The PARTITION BY
clause ensures the rolling average is calculated separately for each artist.
The ORDER BY
clause is used to order the rows in each partition by the listen_date
.
The RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
clause defines the window frame for the window function. It states that for a given row, consider all rows from 6 days before to the current row.
Finally, the AVG()
function calculates the average of daily_listens over the defined window frame.
The ORDER BY artist_id, listen_date;
at the end just keeps the result set ordered by artist and date.
To solve another question about calculating rates, try this TikTok SQL question on DataLemur's interactive coding environment:
You're given two tables containing data on Spotify users' streaming activity: songs_history
which has historical streaming data, and songs_weekly
which has data from the current week.
Write a query that outputs the user ID, song ID, and cumulative count of song plays up to August 4th, 2022, sorted in descending order.
Assume that there may be new users or songs in the songs_weekly
table that are not present in the songs_history
table.
songs_history
Table:Column Name | Type |
---|---|
history_id | integer |
user_id | integer |
song_id | integer |
song_plays | integer |
songs_history
Example Input:history_id | user_id | song_id | song_plays |
---|---|---|---|
10011 | 777 | 1238 | 11 |
12452 | 695 | 4520 | 1 |
songs_weekly
Table:Column Name | Type |
---|---|
user_id | integer |
song_id | integer |
listen_time | datetime |
songs_weekly
Example Input:user_id | song_id | listen_time |
---|---|---|
777 | 1238 | 08/01/2022 12:00:00 |
695 | 4520 | 08/04/2022 08:00:00 |
125 | 9630 | 08/04/2022 16:00:00 |
695 | 9852 | 08/07/2022 12:00:00 |
user_id | song_id | song_plays |
---|---|---|
777 | 1238 | 12 |
695 | 4520 | 2 |
125 | 9630 | 1 |
WITH history AS ( SELECT user_id, song_id, song_plays FROM songs_history UNION ALL SELECT user_id, song_id, COUNT(song_id) AS song_plays FROM songs_weekly WHERE listen_time <= '08/04/2022 23:59:59' GROUP BY user_id, song_id ) SELECT user_id, song_id, SUM(song_plays) AS song_count FROM history GROUP BY user_id, song_id ORDER BY song_count DESC;
To see an additional solution go to DataLemur and solve this Spotify SQL Interview Question.
As a Data Analyst of Spotify, suppose your team is interested in understanding the listening habits of the users. You're provided with the following tables:
users
table contains information about users.songs
table contains information about songs.artists
table contains information about song artists.streaming
table logs every song listened to by each user.The following relationships hold:
Your goal is to write a SQL query that returns each user's favourite artist, based on the number of songs they've listened to by the artist.
users
Example Input:user_id | username | country |
---|---|---|
1 | user101 | USA |
2 | user202 | UK |
3 | user303 | Brazil |
songs
Example Input:song_id | song_name | artist_id |
---|---|---|
101 | song101 | 1001 |
102 | song102 | 1002 |
103 | song103 | 1001 |
artists
Example Input:artist_id | artist_name |
---|---|
1001 | artist1001 |
1002 | artist1002 |
streaming
Example Input:user_id | song_id | stream_time |
---|---|---|
1 | 101 | 5:00 |
1 | 102 | 5:30 |
1 | 103 | 6:00 |
2 | 101 | 8:00 |
2 | 103 | 9:00 |
3 | 102 | 10:00 |
SELECT u.username, a.artist_name FROM ( SELECT stream.user_id, songs.artist_id, count(*) as num_songs FROM streaming AS stream JOIN songs ON stream.song_id = songs.song_id GROUP BY stream.user_id, songs.artist_id ORDER BY num_songs DESC ) AS sub_query JOIN users AS u ON u.user_id = sub_query.user_id JOIN artists AS a ON a.artist_id = sub_query.artist_id LIMIT 1;
The above query will find the count of songs each user has listened to by each artist, grouping by both user_id and artist_id. This result is sorted in descending order of the count, and the top record for each user represents their favorite artist. The outer query then joins this result back to the artists and users tables to get the respective names.
To practice a related SQL problem on DataLemur's free online SQL coding environment, attempt this Facebook SQL Interview question:
Denormalization is a technique used to improve the read performance of a database, typically at the expense of some write performance.
By adding redundant copies of data or grouping data together in a way that does not follow normalization rules, denormalization improves the performance and scalability of a database by eliminating costly join operations, which is important for OLAP use cases that are read-heavy and have minimal updates/inserts.
As a data analyst at Spotify, you are tasked with extracting a list of active Premium subscribers who have listened to at least 15 different artists in the current month. Active users are those who have logged in within the last 30 days.
Assuming you have two tables:
users
Example Input:|**user_id**|**subscription_status**|**last_login**|
|:----|:----|:----|
|1|Premium|2022-08-20|
|2|Free|2022-08-01|
|3|Premium|2022-07-30|
|4|Premium|2022-08-21|
activity
Example Input:|**user_id**|**artist_name**|**month**|
|:----|:----|:----|
|1|Artist 1|August|
|1|Artist 2|August|
|1|Artist 3|August|
|2|Artist 4|August|
|1|Artist 5|August|
|1|Artist 6|August|
|1|Artist 7|August|
|1|Artist 8|August|
|1|Artist 9|August|
|1|Artist 10|August|
|1|Artist 11|August|
|1|Artist 12|August|
|1|Artist 13|August|
|1|Artist 14|August|
|1|Artist 15|August|
|3|Artist 1|July|
|4|Artist 2|August|
SELECT u.user_id FROM users u INNER JOIN ( SELECT user_id, COUNT(DISTINCT artist_name) as cnt FROM activity WHERE month = 'August' GROUP BY user_id ) a ON u.user_id = a.user_id WHERE u.subscription_status = 'Premium' AND u.last_login >= current_date - interval '30 days' AND a.cnt >= 15;
This query first groups the activity
table by user_id
and calculates the count of distinct artists each user interacted with in August. It then joins this table with the users
table on user_id
. The WHERE clause filters for Premium users who have logged in within the last 30 days and have interacted with at least 15 different artists.
To solve a similar problem about calculating rates, solve this SQL interview question from TikTok within DataLemur's interactive SQL code editor:
"In SQL, a join generally retrieves rows from multiple tables and combines them into a single result set. For an example of the difference between a left vs. right join, suppose you had a table of Spotify orders and Spotify customers.
A LEFT JOIN
retrieves all rows from the left table (in this case, the Orders table) and any matching rows from the right table (the Customers table). If there is no match in the right table, NULL values will be returned for the right table's columns.
A RIGHT JOIN
combines all rows from the right table (in this case, the Customers table) and any matching rows from the left table (the Orders table). If there is no match in the left table, NULL values will be displayed for the left table's columns.
Suppose that Spotify would like to understand better the average listening duration for each genre of music on their platform. As a data scientist, your task is to write a SQL query that calculates the average listening duration per genre for every month.
Assume you have access to a user_activity table and a songs table with the following schema:
user_activity Example Input:
activity_id | user_id | song_id | timestamp | listening_duration_sec |
---|---|---|---|---|
1 | 101 | 5001 | 2022-03-01 09:00:00 | 210 |
2 | 102 | 6985 | 2022-03-01 11:30:00 | 120 |
3 | 103 | 5001 | 2022-03-01 15:45:00 | 300 |
4 | 101 | 6985 | 2022-04-01 08:45:00 | 180 |
5 | 102 | 5001 | 2022-04-01 10:00:00 | 240 |
songs Example Input:
song_id | genre |
---|---|
5001 | Rock |
6985 | Pop |
Your aim is to produce a table like:
mth | genre | avg_listening_duration_sec |
---|---|---|
3 | Rock | 255 |
3 | Pop | 120 |
4 | Rock | 240 |
4 | Pop | 180 |
In order to get to the answer, we need to join the two tables on song_id and then use the GROUP BY clause. The AVG() function can be used with GROUP BY to find the average listening duration for each music genre.
SELECT EXTRACT(MONTH FROM ua.timestamp) AS mth, s.genre, AVG(ua.listening_duration_sec) FROM user_activity ua JOIN songs s ON ua.song_id = s.song_id GROUP BY mth, s.genre;
This SQL query does the following:
To solve another question about calculating rates, solve this SQL interview question from TikTok within DataLemur's interactive coding environment:
As an analyst at Spotify, you're tasked to identify all users who have listened to all albums of the artist "Adele". Assume you have access to a users
table that keeps track of user information and an album_listens
table that keeps track of all instances where a user listened to an album. Here are the table structures:
users table
user_id | user_name |
---|---|
1 | John |
2 | Jane |
3 | Alice |
album_listens table
listen_id | user_id | artist_name | album_name |
---|---|---|---|
101 | 1 | Adele | 19 |
102 | 1 | Adele | 21 |
103 | 1 | Adele | 25 |
104 | 2 | Adele | 21 |
105 | 3 | Adele | 21 |
106 | 3 | Adele | 25 |
Using SQL, write a query to find all users who have listened to all Adele's albums (19, 21, 25).
SELECT u.user_id, u.user_name FROM users u WHERE NOT EXISTS (SELECT 1 FROM (SELECT DISTINCT album_name FROM album_listens WHERE artist_name = 'Adele') a WHERE NOT EXISTS (SELECT 1 FROM album_listens al WHERE al.user_id = u.user_id AND al.album_name = a.album_name AND al.artist_name = 'Adele'))
This query works by first identifying all distinct Adele's albums and then checking whether there are any of these albums that a certain user has not listened to. If a user has listened to all Adele's albums, they will not have any of Adele's albums that they have not listened to. Such users are selected by the query.
To solve a related SQL interview question on DataLemur's free interactive SQL code editor, solve this Meta SQL interview question:
The key to acing a Spotify SQL interview is to practice, practice, and then practice some more!
In addition to solving the above Spotify SQL interview questions, you should also solve the 200+ tricky sql questions on DataLemur which come from companies like Facebook, Google and unicorn tech startups.
Each exercise has multiple hints, detailed solutions and best of all, there is an interactive coding environment so you can instantly run your SQL query and have it executed.
To prep for the Spotify SQL interview you can also be useful to practice SQL problems from other tech companies like:
But if your SQL query skills are weak, forget about going right into solving questions – improve your SQL foundations with this DataLemur SQL Tutorial.
This tutorial covers SQL topics like joining a table to itself and window functions – both of which pop up routinely in Spotify interviews.
For the Spotify Data Science Interview, besides SQL questions, the other types of questions to practice:
The best way to prepare for Spotify Data Science interviews is by reading Ace the Data Science Interview. The book's got: