At Box, SQL is often used for analyzing customer usage trends for product enhancements and managing vast databases for cloud content management services. For this reason Box often tests SQL questions during interviews for Data Science and Data Engineering positions.
To help you prepare for the Box SQL interview, we've collected 10 Box SQL interview questions – how many can you solve?
Assume you are working for Box, an online file sharing and personal cloud content management service for businesses. Your task is to write a SQL query to find the "whale" users, i.e., the users who have uploaded the most data over the last 30 days.
The two tables you have access to are the `users` and `uploads` tables.
`users` Example Input:

user_id | name | signup_date |
---|---|---|
1 | John Doe | 2016-01-01 |
2 | Jane Smith | 2019-03-05 |
3 | Mary Johnson | 2021-06-15 |
4 | James Brown | 2018-09-30 |
`uploads` Example Input:

upload_id | user_id | upload_date | size_gb |
---|---|---|---|
501 | 1 | 2022-06-01 | 10 |
502 | 2 | 2022-06-30 | 20 |
503 | 3 | 2022-07-01 | 5 |
504 | 2 | 2022-07-21 | 12 |
505 | 1 | 2022-07-22 | 15 |
We want to see all users who have uploaded more than 10GB in the last 30 days. We assume that today's date is 2022-07-31.
```sql
SELECT u.user_id, u.name, SUM(up.size_gb) AS total_upload_gb
FROM users u
JOIN uploads up ON u.user_id = up.user_id
WHERE up.upload_date >= DATE '2022-07-31' - INTERVAL '30 days'
GROUP BY u.user_id, u.name
HAVING SUM(up.size_gb) > 10
ORDER BY total_upload_gb DESC;
```
This query joins the `users` and `uploads` tables on the `user_id` field. It then groups by user and keeps only those who have uploaded more than 10 GB in the last 30 days. These users are then ordered by the total amount they have uploaded, in descending order, to list the top "whale" users first.
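If you want to sanity-check this logic outside of PostgreSQL, here's a quick sketch using Python's built-in `sqlite3` module. SQLite has no `NOW()` or `INTERVAL`, so the assumed "today" of 2022-07-31 is hard-coded via SQLite's `DATE()` function:

```python
import sqlite3

# Build the sample tables from the question in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT);
CREATE TABLE uploads (upload_id INTEGER PRIMARY KEY, user_id INTEGER,
                      upload_date TEXT, size_gb INTEGER);
INSERT INTO users VALUES (1,'John Doe','2016-01-01'), (2,'Jane Smith','2019-03-05'),
                         (3,'Mary Johnson','2021-06-15'), (4,'James Brown','2018-09-30');
INSERT INTO uploads VALUES (501,1,'2022-06-01',10), (502,2,'2022-06-30',20),
                           (503,3,'2022-07-01',5), (504,2,'2022-07-21',12),
                           (505,1,'2022-07-22',15);
""")

# DATE('2022-07-31', '-30 days') plays the role of NOW() - INTERVAL '30 days'.
rows = conn.execute("""
    SELECT u.user_id, u.name, SUM(up.size_gb) AS total_upload_gb
    FROM users u
    JOIN uploads up ON u.user_id = up.user_id
    WHERE up.upload_date >= DATE('2022-07-31', '-30 days')
    GROUP BY u.user_id, u.name
    HAVING SUM(up.size_gb) > 10
    ORDER BY total_upload_gb DESC;
""").fetchall()
print(rows)  # [(1, 'John Doe', 15), (2, 'Jane Smith', 12)]
```

Only John Doe and Jane Smith survive the `HAVING` filter; Mary Johnson's 5 GB upload falls below the 10 GB threshold.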
To practice a similar power-user data analysis problem on DataLemur's free online SQL code editor, try this Microsoft Teams Power User SQL Interview Question:
At Box, we are interested in understanding our product's performance over time. To achieve this, one of the metrics we look at is the average rating that each product receives each month.
Using the `reviews` table, write a SQL query to calculate the monthly average rating for each product. The `reviews` table has the following columns:

- `review_id`: a unique identifier for each review
- `user_id`: the ID of the user who left the review
- `submit_date`: the date the review was submitted
- `product_id`: the ID of the product being reviewed
- `stars`: the number of stars given by the user in the review (from 1 to 5)

`reviews` Example Input:

review_id | user_id | submit_date | product_id | stars |
---|---|---|---|---|
6171 | 123 | 06/08/2022 | 50001 | 4 |
7802 | 265 | 06/10/2022 | 69852 | 4 |
5293 | 362 | 06/18/2022 | 50001 | 3 |
6352 | 192 | 07/26/2022 | 69852 | 3 |
4517 | 981 | 07/05/2022 | 69852 | 2 |
```sql
SELECT DISTINCT
  DATE_PART('month', submit_date) AS mth,
  product_id AS product,
  AVG(stars) OVER (PARTITION BY product_id, DATE_TRUNC('month', submit_date)) AS avg_stars
FROM reviews;
```
This query uses the window function `AVG()` with `PARTITION BY product_id, DATE_TRUNC('month', submit_date)` to calculate the average rating for each product per month. `DATE_PART('month', submit_date)` extracts the month portion of the `submit_date`, while `DATE_TRUNC('month', submit_date)` truncates the `submit_date` to the start of its month, effectively grouping all `submit_date`s in the same month together. The `AVG(stars) OVER (...)` expression then computes the average rating for each product per month.
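To see the numbers this produces on the sample data, here's a small check using Python's `sqlite3`. SQLite lacks `DATE_PART`/`DATE_TRUNC`, so this sketch substitutes `strftime('%m', ...)` and a plain `GROUP BY`, which yields the same per-product monthly averages (dates are stored in ISO format):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reviews (review_id INTEGER, user_id INTEGER, submit_date TEXT,
                      product_id INTEGER, stars INTEGER);
INSERT INTO reviews VALUES
  (6171, 123, '2022-06-08', 50001, 4),
  (7802, 265, '2022-06-10', 69852, 4),
  (5293, 362, '2022-06-18', 50001, 3),
  (6352, 192, '2022-07-26', 69852, 3),
  (4517, 981, '2022-07-05', 69852, 2);
""")

# strftime('%m', ...) plays the role of DATE_PART('month', ...) here.
rows = conn.execute("""
    SELECT strftime('%m', submit_date) AS mth, product_id, AVG(stars) AS avg_stars
    FROM reviews
    GROUP BY mth, product_id
    ORDER BY mth, product_id;
""").fetchall()
print(rows)  # [('06', 50001, 3.5), ('06', 69852, 4.0), ('07', 69852, 2.5)]
```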
Please note that actual syntax might vary slightly based on the SQL dialect you are using (PostgreSQL, MS SQL, MySQL, etc.).
To solve another window function question on DataLemur's free online SQL coding environment, solve this Amazon BI Engineer interview question:
A primary key is a special type of column or set of columns in a database table that is used to identify each row in the table. It helps to ensure that all the rows are distinct and prevents null values. Additionally, primary keys play a crucial role in maintaining the accuracy of the data by making sure that there are no duplicate rows. They are also used to establish relationships between tables.
To create a primary key in a SQL database, you can use the `PRIMARY KEY` constraint. For example, in the table `box_employees`, the `EmployeeID` column is the primary key.
```sql
CREATE TABLE box_employees (
  EmployeeID INTEGER PRIMARY KEY,
  LastName VARCHAR(255),
  FirstName VARCHAR(255),
  Age INTEGER,
  Salary DECIMAL(8,2)
);
```
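To see the constraint in action, here's a small sketch using Python's `sqlite3` module: inserting a second row with the same `EmployeeID` raises an `IntegrityError`, because a primary key must be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE box_employees (
  EmployeeID INTEGER PRIMARY KEY,
  LastName VARCHAR(255),
  FirstName VARCHAR(255),
  Age INTEGER,
  Salary DECIMAL(8,2)
);
""")
conn.execute("INSERT INTO box_employees VALUES (1, 'Doe', 'John', 30, 90000.00)")

# A duplicate EmployeeID violates the PRIMARY KEY constraint.
try:
    conn.execute("INSERT INTO box_employees VALUES (1, 'Smith', 'Jane', 28, 95000.00)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```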
Box, a cloud storage company, is interested in understanding its users' interactions with their service. Specifically, they want to analyze the active users and their file uploads over the past month. Active users are defined as those who have uploaded at least one file in the past month.
To accomplish this, you will be provided with two tables, 'users' and 'uploads'. 'users' contains the list of Box's users with their details, and 'uploads' holds the details of each file uploaded by users.
`users` Sample Input:

user_id | user_name | registration_date |
---|---|---|
1 | John | 2021-01-20 |
2 | Linda | 2021-10-05 |
3 | Sara | 2022-01-11 |
4 | Mike | 2022-07-07 |
`uploads` Sample Input:

upload_id | user_id | file_name | upload_date |
---|---|---|---|
1 | 1 | file1.docx | 2022-08-01 |
2 | 2 | file2.pdf | 2022-08-15 |
3 | 3 | file3.jpg | 2022-07-30 |
4 | 1 | file4.docx | 2022-08-05 |
5 | 2 | file5.pdf | 2022-08-02 |
Create a SQL query to identify active users by their names and count the number of files they uploaded in August 2022.
```sql
SELECT us.user_name, COUNT(u.upload_id) AS num_of_uploads
FROM users AS us
INNER JOIN uploads AS u ON u.user_id = us.user_id
WHERE u.upload_date BETWEEN '2022-08-01' AND '2022-08-31'
GROUP BY us.user_name;
```
This query performs an INNER JOIN on the 'users' and 'uploads' tables on the 'user_id' fields. The WHERE clause filters the data to only include uploads from August 2022. The GROUP BY statement groups the data according to 'user_name', and the COUNT function is used to calculate the number of uploads for each active user. The result would be a table listing active users by their names and the number of files they uploaded in August 2022.
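Here's a quick way to verify the August filter on the sample data with Python's `sqlite3` (an `ORDER BY` is added so the output order is deterministic). Note that Sara's 2022-07-30 upload correctly falls outside the window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT, registration_date TEXT);
CREATE TABLE uploads (upload_id INTEGER PRIMARY KEY, user_id INTEGER,
                      file_name TEXT, upload_date TEXT);
INSERT INTO users VALUES (1,'John','2021-01-20'), (2,'Linda','2021-10-05'),
                         (3,'Sara','2022-01-11'), (4,'Mike','2022-07-07');
INSERT INTO uploads VALUES (1,1,'file1.docx','2022-08-01'), (2,2,'file2.pdf','2022-08-15'),
                           (3,3,'file3.jpg','2022-07-30'), (4,1,'file4.docx','2022-08-05'),
                           (5,2,'file5.pdf','2022-08-02');
""")

rows = conn.execute("""
    SELECT us.user_name, COUNT(u.upload_id) AS num_of_uploads
    FROM users AS us
    INNER JOIN uploads AS u ON u.user_id = us.user_id
    WHERE u.upload_date BETWEEN '2022-08-01' AND '2022-08-31'
    GROUP BY us.user_name
    ORDER BY us.user_name;
""").fetchall()
print(rows)  # [('John', 2), ('Linda', 2)]
```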
Database views are virtual tables built from one or more underlying tables, and you can query a view just like a regular table. So why even use one if it behaves like a regular table?
Views are useful for creating a simplified version of your data for specific users, or for hiding sensitive data from certain users while still allowing them to access other data.
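For instance, here's a minimal Python `sqlite3` sketch (the table and column names are made up for illustration) where a view exposes an employee directory while hiding the sensitive `salary` column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
INSERT INTO employees VALUES (1, 'Ana', 120000), (2, 'Ben', 95000);

-- The view exposes only the non-sensitive columns, but is queried like a table.
CREATE VIEW employee_directory AS
  SELECT id, name FROM employees;
""")

rows = conn.execute("SELECT * FROM employee_directory ORDER BY id").fetchall()
print(rows)  # [(1, 'Ana'), (2, 'Ben')]  (no salaries exposed)
```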
At Box, the amount of storage that users are utilizing is very important for business operations. Could you write a SQL query to find the average amount of storage used by each user?
`user_storage` Example Input:

user_id | username | storage_used_gb |
---|---|---|
101 | johndoe | 10 |
102 | janedoe | 15 |
103 | superuser | 30 |
104 | doejohn | 20 |
105 | superadmin | 50 |
Example Output:

average_storage_gb |
---|
25 |
```sql
SELECT AVG(storage_used_gb) AS average_storage_gb
FROM user_storage;
```
This SQL query uses the `AVG` function to calculate the average storage used per user in the `user_storage` table. The `AS` clause gives the calculated result a friendly alias (`average_storage_gb`) so the output is easy to understand.
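Running the query on the sample data with Python's `sqlite3` confirms the expected output of 25:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_storage (user_id INTEGER PRIMARY KEY, username TEXT,
                           storage_used_gb INTEGER);
INSERT INTO user_storage VALUES
  (101, 'johndoe', 10), (102, 'janedoe', 15), (103, 'superuser', 30),
  (104, 'doejohn', 20), (105, 'superadmin', 50);
""")

avg_gb = conn.execute(
    "SELECT AVG(storage_used_gb) AS average_storage_gb FROM user_storage"
).fetchone()[0]
print(avg_gb)  # 25.0, i.e. (10 + 15 + 30 + 20 + 50) / 5
```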
To practice a very similar question try this interactive Twitter Histogram of Tweets Question which is similar for requiring per-user averages or this Amazon Average Review Ratings Question which is similar for calculating average user metrics.
To normalize a database, tables are divided into smaller, more specialized ones and relationships between them are defined via primary and foreign keys. This minimizes redundancy, making the database more flexible, scalable, and easier to maintain. Normalization also helps to ensure the accuracy of the data by reducing the likelihood of inconsistencies and errors.
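As a tiny illustrative sketch (the table names here are hypothetical, not from Box's schema), customer details live in one table and are referenced from an `orders` table via a foreign key, instead of being repeated on every order row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized design: customer details are stored once, not on every order.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (
  order_id INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(customer_id),
  amount REAL
);
INSERT INTO customers VALUES (1, 'Ana', 'ana@example.com');
INSERT INTO orders VALUES (10, 1, 19.99), (11, 1, 5.00);
""")

# A join reassembles the full picture when needed.
rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id) AS num_orders
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name;
""").fetchall()
print(rows)  # [('Ana', 2)]
```

If Ana's email changes, only one row in `customers` needs updating, which is exactly the redundancy reduction normalization aims for.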
Box is an online file storage and sharing service. In an effort to improve their subscription service, they have started a marketing campaign and are monitoring their ad click-through rates (clicks/impressions) and conversion rates (adding a product to the cart after viewing it).
They have two tables: `ads`, which records the impressions and clicks of ads for their products, and `conversions`, which records product views and add-to-cart events.
`ads` Example Input:

ad_id | product_id | impressions | clicks |
---|---|---|---|
101 | 123 | 20000 | 300 |
102 | 456 | 15000 | 600 |
103 | 789 | 10000 | 200 |
104 | 123 | 25000 | 500 |
105 | 789 | 30000 | 400 |
`conversions` Example Input:

conversion_id | product_id | views | add_to_cart |
---|---|---|---|
201 | 123 | 5000 | 100 |
202 | 456 | 3500 | 60 |
203 | 789 | 4000 | 40 |
204 | 123 | 6000 | 150 |
205 | 789 | 7000 | 80 |
In PostgreSQL SQL dialect:
```sql
-- calculate ad click-through rate
WITH ads_ctr AS (
  SELECT product_id,
         SUM(clicks) * 1.0 / SUM(impressions) AS click_through_rate
  FROM ads
  GROUP BY product_id
),
-- calculate conversion rate
conversions_rate AS (
  SELECT product_id,
         SUM(add_to_cart) * 1.0 / SUM(views) AS conversion_rate
  FROM conversions
  GROUP BY product_id
)
-- join ads_ctr and conversions_rate
SELECT a.product_id, a.click_through_rate, c.conversion_rate
FROM ads_ctr a
JOIN conversions_rate c ON a.product_id = c.product_id;
```
This query first calculates the click-through rate for each product from the `ads` table and the conversion rate for each product from the `conversions` table in two separate CTEs. The final SELECT statement joins these two CTEs on `product_id` to provide a combined view of both the click-through rates and conversion rates for each product.
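SQLite also supports CTEs, so the same shape of query can be sanity-checked on the sample data with Python's `sqlite3` (the `* 1.0` converts integer division into a fractional rate):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ads (ad_id INTEGER, product_id INTEGER, impressions INTEGER, clicks INTEGER);
CREATE TABLE conversions (conversion_id INTEGER, product_id INTEGER,
                          views INTEGER, add_to_cart INTEGER);
INSERT INTO ads VALUES (101,123,20000,300), (102,456,15000,600), (103,789,10000,200),
                       (104,123,25000,500), (105,789,30000,400);
INSERT INTO conversions VALUES (201,123,5000,100), (202,456,3500,60), (203,789,4000,40),
                               (204,123,6000,150), (205,789,7000,80);
""")

rows = conn.execute("""
    WITH ads_ctr AS (
      SELECT product_id, SUM(clicks) * 1.0 / SUM(impressions) AS click_through_rate
      FROM ads GROUP BY product_id
    ),
    conversions_rate AS (
      SELECT product_id, SUM(add_to_cart) * 1.0 / SUM(views) AS conversion_rate
      FROM conversions GROUP BY product_id
    )
    SELECT a.product_id, a.click_through_rate, c.conversion_rate
    FROM ads_ctr a JOIN conversions_rate c ON a.product_id = c.product_id
    ORDER BY a.product_id;
""").fetchall()

# e.g. product 123: CTR = 800 / 45000 ≈ 0.0178, conversion rate = 250 / 11000 ≈ 0.0227
for product_id, ctr, conv in rows:
    print(product_id, round(ctr, 4), round(conv, 4))
```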
To solve another question about calculating rates, try this SQL interview question from TikTok on DataLemur's interactive coding environment:
Box is a cloud content management and file sharing service for businesses. Assume the company wants to understand the average usage of their cloud storage service by their customers by calculating average storage used in a certain billing period.
The data is stored in a table called `usage`.

`usage` Example Input:

usage_id | billing_period | customer_id | storage_used (GB) |
---|---|---|---|
125 | January-2023 | 2335 | 75 |
126 | January-2023 | 5092 | 105 |
127 | February-2023 | 8871 | 185 |
128 | February-2023 | 9025 | 215 |
129 | March-2023 | 2335 | 85 |
130 | March-2023 | 9025 | 225 |
The company wants to know the average storage used per billing period to better understand resource usage.
Example Output:

billing_period | avg_storage_used (GB) |
---|---|
January-2023 | 90 |
February-2023 | 200 |
March-2023 | 155 |
```sql
SELECT billing_period, AVG(storage_used) AS avg_storage_used
FROM usage
GROUP BY billing_period;
```
This SQL query groups the usage data by `billing_period` and then calculates the average storage used in each billing period using the `AVG` function, which returns the average value of a numeric column. The `GROUP BY` clause groups a selected set of rows into summary rows by the values of columns or expressions.
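The same grouping can be checked against the sample data and expected output with Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage (usage_id INTEGER, billing_period TEXT,
                    customer_id INTEGER, storage_used INTEGER);
INSERT INTO usage VALUES
  (125,'January-2023',2335,75), (126,'January-2023',5092,105),
  (127,'February-2023',8871,185), (128,'February-2023',9025,215),
  (129,'March-2023',2335,85), (130,'March-2023',9025,225);
""")

rows = conn.execute("""
    SELECT billing_period, AVG(storage_used) AS avg_storage_used
    FROM usage
    GROUP BY billing_period;
""").fetchall()

result = dict(rows)
print(result['January-2023'], result['February-2023'], result['March-2023'])
# 90.0 200.0 155.0 -- matching the expected output table
```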
Normalization can reduce redundancy and improve performance and flexibility in a database.
By dividing larger tables into smaller, more specific ones linked by foreign keys, it is often possible to improve query speed. Note: query speed is not always improved, because joins can be costly. Still, normalization is generally a good idea, because it also increases flexibility by making it easier to alter individual tables without affecting others, which is useful for adapting Box's database to ever-changing business needs.
The key to acing a Box SQL interview is to practice, practice, and then practice some more!
In addition to solving the above Box SQL interview questions, you should also solve the 200+ SQL coding questions which come from companies like FAANG and tech startups.
Each DataLemur SQL question has multiple hints, step-by-step solutions and best of all, there is an online SQL code editor so you can instantly run your query and have it graded.
To prep for the Box SQL interview, it can also be helpful to practice SQL questions from other tech companies like:
If your SQL foundations are weak, don't jump straight into solving questions – first refresh your SQL knowledge with this free SQL tutorial.
This tutorial covers topics including math functions in SQL and handling date/timestamp data – both of which come up often during Box SQL interviews.
Besides SQL interview questions, the other types of problems to prepare for the Box Data Science Interview are:
To prepare for Box Data Science interviews read the book Ace the Data Science Interview because it's got: