At GitLab, SQL is used frequently to analyze product usage data and improve source-control features. Because GitLab is an open-source company, they put out neat SQL style docs for the world to learn from. So when it comes to the technical interview process, it shouldn't surprise you that GitLab LOVES to ask SQL coding questions during interviews for Data Science, Data Engineering, and Data Analytics jobs.
Thus, to help you ace the GitLab SQL interview, here are 10 GitLab SQL interview questions – can you answer each one?
Assume GitLab keeps track of its users' activities in an activity table. Each time a user performs an important activity (e.g. commit code, create merge request, or push code), a new row is added to the table. Write a SQL query to calculate the number of monthly active users (MAU), where an active user in a given month is defined as a user who has performed at least one activity in that month.
Here is some sample data for this activity table:
activity_id | user_id | activity_date | activity_type |
---|---|---|---|
101 | 123 | 2022-08-06 00:00:00 | commit |
102 | 265 | 2022-10-06 00:00:00 | merge_request |
103 | 362 | 2022-11-18 00:00:00 | commit |
104 | 192 | 2023-02-26 00:00:00 | push |
105 | 981 | 2023-05-05 00:00:00 | merge_request |
The output should have one row for each month (with the month formatted as yyyy-mm), and include the total count of active users for that month.
month | active_users |
---|---|
2022-08 | 1 |
2022-10 | 1 |
2022-11 | 1 |
2023-02 | 1 |
2023-05 | 1 |
This query first creates a month column (formatted as 'yyyy-mm') for each activity using TO_CHAR. It then groups by this month column and, within each group, counts the unique users with COUNT(DISTINCT user_id). The result is ordered by month.
p.s. Window functions show up super frequently during SQL interviews, so practice the 27+ window function questions on DataLemur
As an analyst at GitLab, one of your tasks is to monitor user activity on the platform. Write an SQL query that filters for GitLab users who have logged in within the last 30 days but have not used any GitLab features within that same period.
Please consider the following markdown-formatted tables for this task:
user_id | username | login_date |
---|---|---|
6171 | user123 | 08/28/2022 |
7802 | user265 | 09/08/2022 |
5293 | user362 | 07/20/2022 |
6352 | user192 | 08/30/2022 |
4517 | user981 | 07/12/2022 |
activity_id | user_id | feature_used | date |
---|---|---|---|
8521 | 6171 | Repository Creation | 08/15/2022 |
3102 | 7802 | Merge Requests | 09/08/2022 |
4209 | 5293 | CI/CD Pipeline | 05/18/2022 |
7305 | 6352 | Issue Tracker | 07/26/2022 |
9701 | 4517 | Repository Creation | 07/05/2022 |
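Here's a sketch of one approach. The table names logins and feature_usage are assumptions (only the sample data survived), and CURRENT_DATE is used as the reference point:

```sql
SELECT l.user_id, l.username, l.login_date
FROM logins l
LEFT JOIN feature_usage a
  ON  a.user_id = l.user_id
  AND a.date >= CURRENT_DATE - INTERVAL '30 days'        -- only consider recent feature usage
WHERE l.login_date >= CURRENT_DATE - INTERVAL '30 days'  -- logged in within the last 30 days
  AND a.activity_id IS NULL;                             -- ...but no matching recent activity
```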
This query combines the login data with the feature-usage data using a LEFT JOIN that only matches activity rows from the last 30 days. It then filters for users who have logged in within the last 30 days and keeps only those whose join produced no match (the joined activity columns come back NULL), indicating no feature usage from the user in that window.
Denormalization is the practice of altering a database schema in a way that breaks the normalization rules (1NF, 2NF, 3NF, etc.).
This is typically done to improve the performance of the database when it is being used for online analytics processing (OLAP), rather than online transaction processing (OLTP) use cases.
Denormalization can improve query performance by reducing the number of expensive joins required to retrieve data, but it comes with its own drawbacks too. By adding redundant columns, you incur more data storage costs, and you need more complex update and delete logic to maintain data integrity across all the duplicated data. Thus, it's important to carefully consider the trade-offs involved before implementing denormalization.
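As a hypothetical illustration (the table and column names below are made up), denormalizing a repositories table might mean copying the owner's name onto each repository row so reads no longer need a join:

```sql
-- Normalized: the owner's name lives only in users, so reads need a join
SELECT r.repo_id, r.repo_name, u.user_name
FROM repositories r
JOIN users u ON u.user_id = r.owner_id;

-- Denormalized: copy owner_name onto repositories to skip the join on reads,
-- at the cost of keeping the duplicated value in sync on every update
ALTER TABLE repositories ADD COLUMN owner_name TEXT;

UPDATE repositories r
SET    owner_name = u.user_name
FROM   users u
WHERE  u.user_id = r.owner_id;
```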
GitLab uses the concept of "merge requests" for proposing changes from one branch to another. In a typical setup, developers make changes in their local branches and then create a merge request to propose those changes into a shared branch (like 'master' or 'main').
Assuming we have a table that records each merge request, including its id, the requester_id (the developer who proposed the change), the creation_date (the date and time when the merge request was created), the merge_date (the date and time when the merge request was accepted and merged), and the branch_id (the id of the branch into which the changes are proposed), can you write a SQL query to find the average time (in hours) it takes for a merge request to be accepted, for each branch?
id | requester_id | creation_date | merge_date | branch_id |
---|---|---|---|---|
1 | 101 | 2022-06-01 10:00:00 | 2022-06-01 12:00:00 | 5001 |
2 | 102 | 2022-06-02 09:00:00 | 2022-06-02 11:30:00 | 5001 |
3 | 101 | 2022-07-01 14:00:00 | 2022-07-02 10:00:00 | 5002 |
4 | 103 | 2022-07-02 09:00:00 | 2022-07-03 09:30:00 | 5002 |
5 | 104 | 2022-07-03 11:00:00 | NULL | 5002 |
branch_id | avg_merge_time_in_hours |
---|---|
5001 | 2.25 |
5002 | 22.25 |
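Here's a PostgreSQL sketch; the table name merge_requests is an assumption since it wasn't preserved above:

```sql
SELECT
  branch_id,
  -- EXTRACT(EPOCH FROM ...) gives the duration in seconds; divide by 3600 for hours
  AVG(EXTRACT(EPOCH FROM (merge_date - creation_date)) / 3600) AS avg_merge_time_in_hours
FROM merge_requests               -- table name is an assumption
WHERE merge_date IS NOT NULL      -- ignore merge requests that were never accepted
GROUP BY branch_id;
```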
The above PostgreSQL query first calculates the duration between creation_date and merge_date in hours using EXTRACT(EPOCH FROM ...) / 3600 for each row where merge_date is not null. It then averages these durations grouped by branch_id to get the average merge time in hours for each branch. The WHERE clause excludes merge requests that have not been accepted yet (i.e., merge_date is null), so they are not included in the average merge time calculation. The GROUP BY clause is required because we want the average for each individual branch, not the overall average.
To practice a very similar question, try this interactive Amazon Server Utilization Time Question, which is similar for calculating total time, or this Twitter Tweets' Rolling Averages Question, which is similar for using window functions.
The BETWEEN operator is used to select rows that fall within a certain range of values, while the IN operator is used to select rows that match values in a specified list.
For example, suppose you are a data analyst at GitLab and have a table of advertising campaign data. To find campaigns with a spend within a certain range, say between $1k and $5k, you could use BETWEEN:
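A minimal sketch, assuming a hypothetical marketing_campaigns table with a spend column:

```sql
SELECT campaign_id, campaign_name, spend
FROM marketing_campaigns
WHERE spend BETWEEN 1000 AND 5000;   -- inclusive range: $1k through $5k
```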
To find advertising campaigns that were video or image based (as opposed to text or billboard ads), you could use the IN operator:
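A sketch of the IN version, assuming a hypothetical ad_type column on the same table:

```sql
SELECT campaign_id, campaign_name, ad_type
FROM marketing_campaigns
WHERE ad_type IN ('video', 'image');   -- match any value in the list
```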
In GitLab, there are two important touchpoints users go through. The first is viewing the home page (recorded as page views), and the second is clicking on a particular repository (recorded as repository clicks).
The click-through rate (CTR) is calculated as the number of repository clicks divided by the number of page views. Your task is to write an SQL query to determine the CTR for each day.
view_id | user_id | view_date | page_id |
---|---|---|---|
12345 | 567 | 2022/06/08 | 0101 |
12346 | 568 | 2022/06/08 | 0101 |
12347 | 569 | 2022/06/09 | 0101 |
12348 | 570 | 2022/06/09 | 0101 |
12349 | 571 | 2022/06/10 | 0101 |
click_id | user_id | click_date | repo_id |
---|---|---|---|
56789 | 567 | 2022/06/08 | 0202 |
56790 | 568 | 2022/06/08 | 0101 |
56791 | 569 | 2022/06/09 | 0101 |
56792 | 560 | 2022/06/10 | 0202 |
56793 | 561 | 2022/06/10 | 0101 |
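Here's a sketch; the table names page_views and repo_clicks are assumptions based on the sample data:

```sql
WITH daily_views AS (
  SELECT view_date AS day, COUNT(*) AS total_views    -- total page views per day
  FROM page_views
  GROUP BY view_date
),
daily_clicks AS (
  SELECT click_date AS day, COUNT(*) AS total_clicks  -- total repository clicks per day
  FROM repo_clicks
  GROUP BY click_date
)
SELECT
  v.day,
  -- clicks divided by views; NULLIF guards against division by zero
  COALESCE(c.total_clicks, 0)::DECIMAL / NULLIF(v.total_views, 0) AS ctr
FROM daily_views v
LEFT JOIN daily_clicks c ON c.day = v.day
ORDER BY v.day;
```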
In this answer, we first define two Common Table Expressions (CTEs): one aggregating page views per day and one aggregating repository clicks per day; both perform a GROUP BY on the date.
Next, we join the two CTEs on the date and divide the number of clicks by the number of page views to calculate the CTR. The NULLIF function avoids division-by-zero errors if there were no page views on a particular date. The result is a table with each date and its CTR.
To practice a similar SQL problem on DataLemur's free online SQL code editor, attempt this Facebook SQL Interview question:
Cross joins and natural joins are two types of JOIN operations in SQL that are used to combine data from multiple tables. A cross join creates a new table by combining each row from the first table with every row from the second table, and is also known as a cartesian join. On the other hand, a natural join combines rows from two or more tables based on their common columns, forming a new table. One key difference between these types of JOINs is that cross joins do not require common columns between the tables being joined, while natural joins do.
Here's an example of a cross join:
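A sketch using hypothetical products and colors tables:

```sql
-- Every product paired with every color: rows(products) x rows(colors) combinations
SELECT p.product_name, c.color_name
FROM products p
CROSS JOIN colors c;
```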
If you have 20 products and 10 colors, that's 200 rows right there!
Here's a natural join example using two tables, GitLab employees and GitLab managers:
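A sketch, assuming both tables share a common column such as employee_id (the real column name wasn't preserved):

```sql
-- NATURAL JOIN matches on every column the two tables share by name
-- (assumed here to be just employee_id)
SELECT *
FROM gitlab_employees
NATURAL JOIN gitlab_managers;
```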
This natural join returns the rows where the two tables have matching values in their shared column (employee_id in the sketch above).
GitLab is a popular online service that allows developers to collaborate on projects. Suppose GitLab has a users database, and you're given a task to find all users who are using a specific company email domain (for example, 'gitlab.com') for their account.
Please filter the data from the users table to return only those records that have an email address ending in '@gitlab.com'. Additionally, we are interested in the 'username', 'email', 'confirmed_at', and 'last_sign_in_at' fields.
user_id | username | email | confirmed_at | last_sign_in_at |
---|---|---|---|---|
1001 | user1 | user1@company.com | 2020-01-01 | 2021-12-30 |
1002 | user2 | user2@company.com | 2020-02-02 | 2022-01-15 |
1003 | user3 | user3@gitlab.com | 2020-03-03 | 2022-01-20 |
1004 | user4 | user4@gitlab.com | 2020-04-04 | 2021-12-25 |
1005 | user5 | user5@anothercompany.com | 2020-05-05 | 2022-01-25 |
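Here's a sketch, assuming the table is simply named users:

```sql
SELECT username, email, confirmed_at, last_sign_in_at
FROM users
WHERE email LIKE '%@gitlab.com';   -- % matches any characters before the domain
```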
This SQL query uses the LIKE operator with the % wildcard to match any email address that ends with '@gitlab.com'. It returns the 'username', 'email', 'confirmed_at', and 'last_sign_in_at' fields for the matching records in the users table.
username | email | confirmed_at | last_sign_in_at |
---|---|---|---|
user3 | user3@gitlab.com | 2020-03-03 | 2022-01-20 |
user4 | user4@gitlab.com | 2020-04-04 | 2021-12-25 |
As a data analyst at GitLab, one thing you might be interested in is the engagement level of the different projects happening within the company. You are given two tables: a users table, which contains each user's name and user ID, and a commits table, which records each project's commit history (the user ID of the contributor, the project ID they worked on, and the date they made the commit).
Your task is to write an SQL query to find the total number of commits each user has made per project, join that with the users' table, and order the result by the project ID in ascending order and then the total number of commits in descending order.
user_id | user_name |
---|---|
101 | Alice |
102 | Bob |
103 | Charlie |
104 | David |
user_id | project_id | commit_date |
---|---|---|
101 | 1001 | 2022-01-02 |
101 | 1001 | 2022-01-05 |
102 | 1001 | 2022-01-07 |
101 | 1002 | 2022-01-08 |
103 | 1002 | 2022-01-11 |
103 | 1002 | 2022-01-14 |
104 | 1003 | 2022-01-22 |
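Here's a sketch, assuming the tables are named users and commits:

```sql
WITH commit_counts AS (
  -- total commits per user per project
  SELECT user_id, project_id, COUNT(*) AS total_commits
  FROM commits
  GROUP BY user_id, project_id
)
SELECT
  cc.project_id,
  u.user_name,
  cc.total_commits
FROM commit_counts cc
JOIN users u ON u.user_id = cc.user_id          -- attach user names
ORDER BY cc.project_id ASC, cc.total_commits DESC;
```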
In the above query, we first find the total number of commits each user has made per project using the GROUP BY clause and the COUNT function. Then, we join this result with the users table to get the names of the users. Lastly, we order the result by project ID in ascending order and total commits in descending order using the ORDER BY clause.
Because joins come up frequently during SQL interviews, take a stab at an interactive Spotify JOIN SQL question:
Database views provide customized, virtual versions of your data that you can query just like a regular table. So why even use one if they behave just like a regular table?
Views are useful for creating a simplified version of your data for specific users, or for hiding sensitive data from certain users while still allowing them to access other data.
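For instance, a hypothetical view that exposes only the non-sensitive columns of a users table might look like this:

```sql
-- The view stores no data itself; it re-runs this query each time it's used
CREATE VIEW public_users AS
SELECT user_id, username, confirmed_at
FROM users;

-- Query the view just like a regular table
SELECT *
FROM public_users
WHERE confirmed_at IS NOT NULL;
```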
The key to acing a GitLab SQL interview is to practice, practice, and then practice some more! Beyond just solving the above GitLab SQL interview questions, you should also solve the 200+ SQL questions on DataLemur which come from companies like Netflix, Google, and Amazon.
Each interview question has hints to guide you, fully explained answers, and a discussion board to see how others solved it. Best of all, there's an interactive coding environment so you can write and run your SQL query right in the browser and have it checked.
To prep for the GitLab SQL interview, it is also useful to practice SQL interview questions from other tech companies.
In case your SQL skills are weak, don't worry about jumping right into solving questions – improve your SQL foundations with this interactive SQL tutorial.
This tutorial covers SQL topics like filtering on multiple conditions using AND/OR/NOT and aggregate functions like MIN()/MAX() – both of which come up routinely in SQL interviews at GitLab.
In addition to SQL interview questions, the GitLab Data Science Interview tests several other question categories as well.
To prepare for GitLab Data Science interviews, it also helps to read the book Ace the Data Science Interview.