At Dropbox, SQL is used quite frequently for analyzing product usage data to inform feature development. Because of this, Dropbox almost always asks SQL problems in interviews for Data Science and Data Engineering positions.
As such, to help you ace the Dropbox SQL interview, we've collected 11 Dropbox SQL interview questions – can you solve them?
Given the customer and file activity database for Dropbox, identify the power users who upload files frequently to the platform.
Below are the sample tables with sample data:
user_id | user_name | signup_date |
---|---|---|
101 | James | 2019-04-21 |
102 | Emma | 2020-01-19 |
103 | Olivia | 2021-03-05 |
104 | John | 2019-08-15 |
upload_id | user_id | upload_date | file_size (MB) |
---|---|---|---|
201 | 101 | 2022-06-10 | 350 |
202 | 101 | 2022-06-12 | 120 |
203 | 102 | 2022-06-14 | 500 |
204 | 103 | 2022-06-16 | 150 |
205 | 101 | 2022-06-18 | 320 |
We are interested in users who have uploaded more than 1000 MB of files in the past month.
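One way to write this query (a sketch only — the `users` and `uploads` table names and the exact date window are assumptions based on the samples above):

```sql
-- Total MB uploaded per user in the past month, joined back to get user names
SELECT u.user_name,
       t.total_upload_mb
FROM (
    SELECT user_id,
           SUM(file_size) AS total_upload_mb
    FROM uploads
    WHERE upload_date >= CURRENT_DATE - INTERVAL '1 month'
    GROUP BY user_id
) t
JOIN users u ON u.user_id = t.user_id
WHERE t.total_upload_mb > 1000;
```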
This query first calculates each user's total upload size over the past month in a subquery. It then joins that result with the users table to get the user names, and finally filters for users whose cumulative upload size exceeds 1000 MB.
To solve a related customer analytics question on DataLemur's free interactive SQL code editor, try this recently asked Microsoft SQL interview question:
For this question, you work at Dropbox and you're interested in analyzing user feedback. The database stores reviews which users submit for various products, where products are identified by their product_id. In the table, each review has a number of stars (0-5) - this is how users rate products. Also, each review has a submit_date (timestamp), user_id, and review_id.
Write a SQL query that calculates the average rating (stars), for each product (product_id), for each month. The submit_date (a timestamp) field should be used to determine the month. The results should be ordered by year-month in ascending order, and then by product_id in ascending order.
review_id | user_id | submit_date | product_id | stars |
---|---|---|---|---|
6171 | 123 | 06-08-2022 00:00:00 | 50001 | 4 |
7802 | 265 | 06-10-2022 00:00:00 | 69852 | 4 |
5293 | 362 | 06-18-2022 00:00:00 | 50001 | 3 |
6352 | 192 | 07-26-2022 00:00:00 | 69852 | 3 |
4517 | 981 | 07-05-2022 00:00:00 | 69852 | 2 |
mth | product | avg_stars |
---|---|---|
2022-06 | 50001 | 3.50 |
2022-06 | 69852 | 4.00 |
2022-07 | 69852 | 2.50 |
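A query along these lines produces that output (a sketch; the `reviews` table name is assumed):

```sql
-- Average stars per product per month, formatted as YYYY-MM
SELECT TO_CHAR(submit_date, 'YYYY-MM') AS mth,
       product_id                      AS product,
       ROUND(AVG(stars), 2)            AS avg_stars
FROM reviews
GROUP BY TO_CHAR(submit_date, 'YYYY-MM'), product_id
ORDER BY mth, product_id;
```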
This query groups the reviews by month and product_id, using the TO_CHAR function to format submit_date as 'YYYY-MM' (year and month) and the AVG function to compute the average stars for each group. Results are ordered first by month and then by product_id, both ascending.
Pro Tip: Window functions are a frequent SQL interview topic, so practice all the window function problems on DataLemur.
Both the LEAD() and LAG() window functions are used to access a row at a specific offset from the current row.
However, the LEAD() function retrieves a value from a row that follows the current row, whereas the LAG() function retrieves a value from a row that precedes the current row.
Often, the offset for both functions is 1, which gives access to the immediately following/preceding row. Here's a SQL query example:
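For instance, a sketch using a hypothetical daily_uploads(upload_date, total_uploads) table:

```sql
-- Compare each day's upload count with the previous and next day's
SELECT upload_date,
       total_uploads,
       LAG(total_uploads, 1)  OVER (ORDER BY upload_date) AS prev_day_uploads,
       LEAD(total_uploads, 1) OVER (ORDER BY upload_date) AS next_day_uploads
FROM daily_uploads;
```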
You are tasked to design a simple SQL database for Dropbox. Dropbox is a hosting service where users can store and share files and folders. They can also track their share activities. For simplification, consider there's only one type of user and there are two tables - users and activity.
The users table has columns user_id (unique identifier for users), email (email of the user), and date_joined (when the user joined Dropbox).
The activity table has columns activity_id (unique identifier for the action), sender_id (the user who shared the file, related to user_id in the users table), receiver_id (the user with whom the file was shared, related to user_id in the users table), file_id (unique identifier for the file), file_size (size of the file in MB), and date_shared (when the file was shared).
Write a SQL query that returns the email of the user who has shared the most files and the total number of files they have shared.
user_id | email | date_joined |
---|---|---|
1 | john@gmail.com | 04/03/2015 |
2 | sarah@yahoo.com | 10/21/2017 |
3 | emma@hotmail.com | 09/12/2020 |
activity_id | sender_id | receiver_id | file_id | file_size | date_shared |
---|---|---|---|---|---|
101 | 1 | 2 | 5001 | 13.5 | 08/30/2021 |
102 | 1 | 3 | 5002 | 30.2 | 09/05/2021 |
103 | 2 | 1 | 5003 | 7.8 | 09/12/2021 |
104 | 1 | 2 | 5004 | 16.7 | 09/15/2021 |
105 | 3 | 1 | 5005 | 24.5 | 09/18/2021 |
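A query along these lines works (a sketch; the `users` and `activity` table names are assumed from the samples):

```sql
-- Email of the user who shared the most files, with their share count
SELECT u.email,
       COUNT(*) AS total_files_shared
FROM activity a
JOIN users u ON u.user_id = a.sender_id
GROUP BY u.email
ORDER BY total_files_shared DESC
LIMIT 1;
```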
This query joins the users and activity tables on the user_id and sender_id columns to relate users with their share activities. It then groups the resulting records by email and counts the number of records per user with COUNT(*). The result is ordered in descending order (ORDER BY ... DESC) and limited to the top row (LIMIT 1) to get the email of the user who has shared the most files and the total number of files they have shared.
In SQL, a join is used to combine rows from different tables based on a shared key or set of keys, resulting in a single merged table of data.
There are four distinct types of JOINs: INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
(INNER) JOIN: Retrieves records that have matching values in both tables involved in the join.
LEFT (OUTER) JOIN: Retrieves all the rows from the left table and the matched rows from the right table.
RIGHT (OUTER) JOIN: Retrieves all the rows from the right table and the matched rows from the left table.
FULL (OUTER) JOIN: Retrieves all the records where there is a match in either the left or right table.
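As an illustration, a sketch using the hypothetical `users` and `uploads` tables from earlier:

```sql
-- LEFT JOIN keeps every user, even those with no uploads (upload_id is NULL)
SELECT u.user_id,
       u.user_name,
       up.upload_id
FROM users u
LEFT JOIN uploads up ON up.user_id = u.user_id;
```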
Your task is to help Dropbox identify active users in its database. An active user is defined as a user who has uploaded at least one file within the last 30 days and has a premium account. Filter out these users from the given users and files tables.
user_id | username | email | account_type |
---|---|---|---|
1001 | john_doe | john_doe@gmail.com | Premium |
1002 | mary_smith | mary_smith@yahoo.com | Free |
1003 | sam_brown | sam_brown@hotmail.com | Premium |
1004 | jane_green | jane_green@gmail.com | Premium |
1005 | tom_black | tom_black@yahoo.com | Free |
file_id | user_id | upload_date |
---|---|---|
3001 | 1001 | 07/20/2022 |
3002 | 1002 | 07/28/2022 |
3003 | 1003 | 06/20/2022 |
3004 | 1004 | 07/10/2022 |
3005 | 1001 | 07/28/2022 |
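A sketch of such a query (the `users` and `files` table names are assumed from the samples):

```sql
-- Premium users with at least one upload in the last 30 days
SELECT DISTINCT u.user_id,
       u.username
FROM users u
JOIN files f ON f.user_id = u.user_id
WHERE u.account_type = 'Premium'
  AND f.upload_date >= CURRENT_DATE - INTERVAL '30 days';
```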
This PostgreSQL query joins the users and files tables on user_id to associate each user with their respective files. A WHERE clause with AND then applies both conditions: it filters for users with a 'Premium' account type who have at least one file uploaded within the last 30 days.
One way to find duplicates is to use a GROUP BY clause and then a HAVING COUNT(*) > 1 condition to find groups with more than one row.
You could also use a window function such as ROW_NUMBER() to flag repeated rows:
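For example, both approaches sketched against a hypothetical files(file_id, file_name) table:

```sql
-- GROUP BY / HAVING: file names that appear more than once
SELECT file_name,
       COUNT(*) AS occurrences
FROM files
GROUP BY file_name
HAVING COUNT(*) > 1;

-- ROW_NUMBER(): number each copy within a file_name; rows with rn > 1 are duplicates
SELECT file_id,
       file_name,
       ROW_NUMBER() OVER (PARTITION BY file_name ORDER BY file_id) AS rn
FROM files;
```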
As an employee of Dropbox, you are frequently tasked with analyzing the usage patterns of its users. In this scenario, you are given a 'files' table. Each row represents a file stored in Dropbox, including the name of the file, the size of the file (in KB), and the user_id of the user who stored it.
Can you write a PostgreSQL query to determine the average size of the files stored on Dropbox?
file_id | user_id | file_name | size_kb |
---|---|---|---|
1 | 101 | "test.txt" | 50 |
2 | 101 | "rock.mp3" | 7000 |
3 | 102 | "image.png" | 500 |
4 | 103 | "time.csv" | 10 |
5 | 104 | "something.mp4" | 40000 |
average_size_kb |
---|
9512 |
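The query itself is a one-liner (a sketch; the `files` table name matches the sample above):

```sql
-- Average file size in KB across all stored files
SELECT AVG(size_kb) AS average_size_kb
FROM files;
```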
This PostgreSQL query calculates the average size of the files stored in Dropbox. This result can then be utilized to understand how much storage each file typically consumes. The AVG function calculates the average value of a particular column. Here, it's being applied to the size_kb column of the files table.
To practice a very similar question, try this interactive Alibaba Compressed Mean Question, which also involves calculating a mean value, or this Facebook Average Review Ratings Question, which similarly requires computing an average.
Dropbox heavily focuses on digital marketing and converting users to paid customers. Let's assume each user starts with a free account and can then click ads or upgrade links leading to a paid product page. Imagine you're given two tables: one recording every time a user views an ad or upgrade link (views) and one recording every time a user actually clicks (clicks). Calculate the click-through rate (CTR) for each ad.
view_id | user_id | view_date | ad_id |
---|---|---|---|
101 | 57 | 06/08/2022 00:00:00 | 202 |
102 | 36 | 06/10/2022 00:00:00 | 500 |
103 | 57 | 06/18/2022 00:00:00 | 202 |
104 | 42 | 07/26/2022 00:00:00 | 300 |
105 | 57 | 07/05/2022 00:00:00 | 202 |
click_id | user_id | click_date | ad_id |
---|---|---|---|
6201 | 57 | 06/08/2022 00:00:00 | 202 |
7942 | 36 | 06/10/2022 00:00:00 | 500 |
7523 | 57 | 06/18/2022 00:00:00 | 202 |
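A sketch of the CTR calculation (the `views` and `clicks` table names are assumptions based on the samples):

```sql
-- CTR per ad: distinct clicking users divided by distinct viewing users
SELECT v.ad_id,
       COUNT(DISTINCT c.user_id)::DECIMAL
         / NULLIF(COUNT(DISTINCT v.user_id), 0) AS ctr
FROM views v
LEFT JOIN clicks c
  ON c.user_id = v.user_id
 AND c.ad_id   = v.ad_id
GROUP BY v.ad_id;
```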
This query starts by joining the views and clicks tables on the user_id and ad_id columns. It then groups by ad_id and counts the number of unique views and clicks for each ad; COUNT(DISTINCT ...) ensures that multiple views or clicks from the same user on the same ad are counted only once. The click-through rate (CTR) is then calculated by dividing the number of clicks by the number of views, with the NULLIF function used to prevent division-by-zero errors.
To practice a similar problem about calculating rates, solve this TikTok SQL question within DataLemur's online SQL coding environment:
Database denormalization is when you add redundancy to a database, breaking typical normalization rules (specified by the 1st, 2nd, and 3rd normal forms). There are a few reasons to denormalize a database:
Improved performance: Denormalization often reduces the number of costly join operations needed to retrieve data. This is helpful when the database is used for OLAP (Online Analytical Processing) workloads, as joins can be expensive and slow.
Simplification: Denormalization can also be used to simplify the design of a database by reducing the number of tables and relationships that need to be managed. This can make it easier to understand and maintain the database.
Ease of use: Denormalization can also make it easier for users to work with a database by providing them with a more intuitive and straightforward data model.
While denormalization can be a useful tool for improving performance and scalability, it's important to keep in mind that it can make update and delete operations more complex. This is because denormalization can create data duplicates, which can make it harder to maintain data integrity. So, to avoid any headaches, it's a good idea to start with a well-normalized database design and then consider denormalization only if it's absolutely necessary for your specific performance and scalability needs. In other words, denormalization can be a bit of a wild card, so it's best to handle it with care!
As a data analyst at Dropbox, you are given the task to analyze user behavior. Given the users and usage tables shown below, can you write a SQL query that calculates the average storage used per user each month?
user_id | username | registration_date |
---|---|---|
1 | userA | 2021-02-01 |
2 | userB | 2021-05-01 |
3 | userC | 2021-09-01 |
usage_id | user_id | usage_date | storage_used |
---|---|---|---|
101 | 1 | 2021-07-10 | 50 |
102 | 1 | 2021-07-20 | 55 |
103 | 2 | 2021-07-15 | 60 |
104 | 2 | 2021-07-25 | 75 |
105 | 3 | 2021-07-30 | 85 |
Your output should have the format:
month | year | avg_storage |
---|---|---|
7 | 2021 | 65 |
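One way to produce that shape of output (a sketch; the `usage` table name and the use of EXTRACT are assumptions):

```sql
-- Average storage_used per calendar month across all usage rows
SELECT EXTRACT(MONTH FROM usage_date) AS month,
       EXTRACT(YEAR  FROM usage_date) AS year,
       AVG(storage_used)              AS avg_storage
FROM usage
GROUP BY EXTRACT(YEAR FROM usage_date), EXTRACT(MONTH FROM usage_date)
ORDER BY year, month;
```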
This query extracts the month and year from usage_date and groups by these extracted values to calculate the average storage used per month. A date function such as EXTRACT (or TO_CHAR) pulls the month and year out of the date for grouping, and the AVG function performs the average calculation. The result is then ordered by year and month.
The best way to prepare for a SQL interview, besides making sure you have strong SQL fundamentals, is to practice a ton of real SQL questions that were asked in recent job interviews. Beyond just solving the earlier Dropbox SQL interview questions, you should also solve the 200+ tricky SQL questions on DataLemur, which come from companies like Amazon, Microsoft, Meta, and smaller tech companies.
Each problem on DataLemur has multiple hints, fully explained answers, and a discussion board to see how others solved it. Best of all, there's an interactive coding environment so you can instantly run your SQL query and have it graded.
To prep for the Dropbox SQL interview it is also helpful to solve SQL problems from other tech companies like:
However, if your SQL foundations are weak, don't worry about diving straight into solving questions – improve your SQL foundations with this SQL interview tutorial.
This tutorial covers SQL concepts such as using ORDER BY and LEAD/LAG – both of which come up routinely during SQL job interviews at Dropbox.
In addition to SQL interview questions, the other topics to prepare for the Dropbox Data Science Interview are:
The best way to prepare for Dropbox Data Science interviews is by reading Ace the Data Science Interview. The book's got: