logo

10 GitLab SQL Interview Questions (Updated 2024)

Updated on

March 2, 2024

At GitLab, SQL is used frequently for analyzing product usage data to improve source-control software features. Because they are an open-source company, the put out neat SQL style docs for the world to learn from. So when it comes to the technical interview process, it shouldn't surpirse you that GitLab LOVES to ask SQL coding questions during interviews for Data Science, Data Engineering and Data Analytics jobs.

Thus, to help you ace the GitLab SQL interview, here’s 10 GitLab SQL interview questions – can you answer each one?

10 GitLab SQL Interview Questions

SQL Question 1: Calculate the Monthly Active Users

Assume GitLab keeps track of its users' activities in a table . Each time a user performs an important activity (e.g. commit code, create merge request, or push code), a new row is added to the table. Write a SQL query to calculate the number of monthly active users (MAU), where an active user in a given month is defined as a user who has performed at least one activity in that month.

Here is some sample data for :

Example Input:
activity_iduser_idactivity_dateactivity_type
1011232022-08-06 00:00:00commit
1022652022-10-06 00:00:00merge_request
1033622022-11-18 00:00:00commit
1041922023-02-26 00:00:00push
1059812023-05-05 00:00:00merge_request

The output should have one row for each month (with the month formatted as yyyy-mm), and include the total count of active users for that month.

Example Output:
monthactive_users
2022-081
2022-101
2022-111
2023-021
2023-051

Answer:


This SQL query first creates a column for the month (formatted as 'yyyy-mm') of each activity using the function. It then groups by this month column and in each group, it counts the unique users (). The result is ordered by the month.

p.s. Window functions show up super frequently during SQL interviews, so practice the 27+ window function questions on DataLemur

DataLemur SQL Questions

SQL Question 2: Filtering GitLab User Activities

As an analyst at GitLab, one of your task is to monitor user activity on the platform. Create an SQL query that filters GitLab users who had logged in within the last 30 days but have not used any GitLab features within the same period.

Please consider the following markdown-formatted tables for this task:

Example Input:
user_idusernamelogin_date
6171user12308/28/2022
7802user26509/08/2022
5293user36207/20/2022
6352user19208/30/2022
4517user98107/12/2022
Example Input:
activity_iduser_idfeature_useddate
85216171Repository Creation08/15/2022
31027802Merge Requests09/08/2022
42095293CI/CD Pipeline05/18/2022
73056352Issue Tracker07/26/2022
97014517Repository Creation07/05/2022

Answer:


This query combines data from the and tables using a operation. It filters the users who have logged in in the last 30 days using . Among these users, it filters those who have not used any GitLab features within the same 30-day period by checking that the user activity date is either more than 30 days ago or there is null activity (), indicating no activity from the user.

SQL Question 3: What is denormalization?

Denormalization is the practice of altering a database schema in a way that breaks the normalization rules (1NF, 2NF, 3NF, etc.).

This is typically done to improve the performance of the database when it is being used for online analytics processing (OLAP), rather than online transaction processing (OLTP) use cases.

Denormalization can improve query performance by reducing the number of expensive joins required to retrieve data, but it comes with it's own drawbacks too. By adding redundant columns, you incur more data storage costs, and there's need for more complex update and delete operations in order to maintain data integrity across all the duplicated data. Thus, it's important to carefully consider the trade-offs involved before implementing denormalization.

GitLab SQL Interview Questions

SQL Question 4: Calculating Average Merge Request Time in GitLab

GitLab uses the concept of "merge requests" for proposing changes from one branch to another. In a typical setup, developers make changes in their local branches and then create a merge request to propose those changes into a shared branch (like 'master' or 'main').

Assuming we have a table that records each merge request, including its , the (the developer who proposed the change), the (the date and time when the merge request was created), the (the date and time when the merge request was accepted and merged), and the (the id of the branch into which the changes are proposed), can you write a SQL query to find the average time (in hours) it takes for a merge request to be accepted for each branch?

Example Input:
idrequester_idcreation_datemerge_datebranch_id
11012022-06-01 10:00:002022-06-01 12:00:005001
21022022-06-02 09:00:002022-06-02 11:30:005001
31012022-07-01 14:00:002022-07-02 10:00:005002
41032022-07-02 09:00:002022-07-03 09:30:005002
51042022-07-03 11:00:00NULL5002
Example Output:
branch_idavg_merge_time_in_hours
50012.25
500220.25

Answer:


The above PostgreSQL query first calculates the duration between and in hours using the function for each row where is not null. Then it takes the average of these durations grouped by to get the average merge time in hours for each branch. The clause is used to exclude merge requests that have not been accepted yet (i.e., is null). These are not included in the average merge time calculation. The clause is required because we want the average for each individual branch (not the overall average).

To practice a very similar question try this interactive Amazon Server Utilization Time Question which is similar for calculating total time or this Twitter Tweets' Rolling Averages Question which is similar for using window function.

SQL Question 5: What sets the 'BETWEEN' and 'IN' operators apart?

The operator is used to select rows that fall within a certain range of values, while the operator is used to select rows that match values in a specified list.

For example, suppose you are a data analyst at GitLab and have a table of advertising campaign data. To find campaigns with a spend between 1kand1k and 5k, you could use BETWEEN:


To find advertising campaigns that were video and image based (as opposed to text or billboard ads), you could use the operator:


SQL Question 6: Analyze Click-Through-Rate for GitLab

In GitLab, there are two important touchpoints users go through. The first is visiting the home page (), and the second is clicking on a particular repository ().

The click-through rate (CTR) is calculated as the number of repository clicks divided by the number of page views. Your task is to write an SQL query to determine the CTR for each day.

Example Input:
view_iduser_idview_datepage_id
123455672022/06/080101
123465682022/06/080101
123475692022/06/090101
123485702022/06/090101
123495712022/06/100101
Example Input:
click_iduser_idclick_daterepo_id
567895672022/06/080202
567905682022/06/080101
567915692022/06/090101
567925602022/06/100202
567935612022/06/100101

Answer:


In this answer, we first form two Common Table Expressions (CTE) named and . Both CTEs perform a GROUP BY operation to sum up the total page views and repository clicks for each date.

Next, we join these two CTEs on the date and use the formula to calculate the CTR (number of clicks divided by the number of page views). We use the NULLIF function to avoid division by zero errors if there were no page views on a particular date. This will result in a table with the date and the CTR for each date.

To practice a similar SQL problem on DataLemur's free online SQL code editor, attempt this Facebook SQL Interview question: Meta SQL interview question

SQL Question 7: How do cross joins and natural joins differ?

Cross joins and natural joins are two types of JOIN operations in SQL that are used to combine data from multiple tables. A cross join creates a new table by combining each row from the first table with every row from the second table, and is also known as a cartesian join. On the other hand, a natural join combines rows from two or more tables based on their common columns, forming a new table. One key difference between these types of JOINs is that cross joins do not require common columns between the tables being joined, while natural joins do.

Here's an example of a cross join:


If you have 20 products and 10 colors, that's 200 rows right there!

Here's a natural join example using two tables, GitLab employees and GitLab managers:


This natural join returns all rows from GitLab employees where there is no matching row in managers based on the column.

SQL Question 8: Finding GitLab Users Based on Email Domain

GitLab is a popular online service that allows developers to collaborate on projects. Suppose GitLab has a users database, and you're given a task to find all users who are using a specific company email domain (for example, 'gitlab.com') for their account.

Please filter the data from the database table to return only those records that have an email address ending in '@gitlab.com'. Additionally, we are interested in the 'username', 'email', 'confirmed_at', and 'last_sign_in_at' fields.

Example Input:

user_idusernameemailconfirmed_atlast_sign_in_at
1001user1user1@company.com2020-01-012021-12-30
1002user2user2@company.com2020-02-022022-01-15
1003user3user3@gitlab.com2020-03-032022-01-20
1004user4user4@gitlab.com2020-04-042021-12-25
1005user5user5@anothercompany.com2020-05-052022-01-25

Answer:


This SQL query uses the LIKE operator with a wildcard to match any email addresses that ends with '@gitlab.com'. It will return the 'username', 'email', 'confirmed_at', and 'last_sign_in_at' fields for those matching records from the table.

Example Output:

usernameemailconfirmed_atlast_sign_in_at
user3user3@gitlab.com2020-03-032022-01-20
user4user4@gitlab.com2020-04-042021-12-25

SQL Question 9: Determining the Project Engagement Level

As a data analyst at GitLab, one company data you might be interested in is the engagement level of different projects happening within the company. You are given two tables, and . The table contains information about each user's name and user ID. The table contains data about each project's commit history: the user ID of the contributor, the project ID they worked on, and the date they made the commit.

Your task is to write an SQL query to find the total number of commits each user has made per project, join that with the users' table, and order the result by the project ID in ascending order and then the total number of commits in descending order.

Example Input:
user_iduser_name
101Alice
102Bob
103Charlie
104David
Example Input:
user_idproject_idcommit_date
10110012022-01-02
10110012022-01-05
10210012022-01-07
10110022022-01-08
10310022022-01-11
10310022022-01-14
10410032022-01-22

Answer:


In the above query, we first find the total number of commits each user has made per project using the clause and the . Then, we join this result with the table to get the names of the users. Lastly, we order the result by the project ID in ascending order and the total number of commits in descending order using the clause.

Because joins come up frequently during SQL interviews, take a stab at an interactive Spotify JOIN SQL question: Spotify JOIN SQL question

SQL Question 10: What are database views used for?

Database views are created to provide customized, read-only versions of your data that you can query just like a regular table. So why even use one if they're just like a regular table?

Views are useful for creating a simplified version of your data for specific users, or for hiding sensitive data from certain users while still allowing them to access other data.

How To Prepare for the GitLab SQL Interview

The key to acing a GitLab SQL interview is to practice, practice, and then practice some more! Beyond just solving the above GitLab SQL interview questions, you should also solve the 200+ SQL questions on DataLemur which come from companies like Netflix, Google, and Amazon. DataLemur Question Bank

Each interview question has hints to guide you, fully explained answers along with a discussion board to see how others solved it and best of all, there is an interactive coding environment so you can right online code up your SQL query answer and have it checked.

To prep for the GitLab SQL interview it is also useful to practice interview questions from other tech companies like:

In case your SQL skills are weak, don't worry about jumping right into solving questions – improve your SQL foundations with this interactive SQL tutorial.

DataLemur SQL Course

This tutorial covers SQL topics like filtering on multiple conditions using AND/OR/NOT and aggreage functions like MIN()/MAX() – both of which come up routinely in SQL interviews at GitLab.

GitLab Data Science Interview Tips

What Do GitLab Data Science Interviews Cover?

In addition to SQL interview questions, the other question categories tested in the GitLab Data Science Interview are:

  • Statistics and Probability Questions
  • Python or R Programming Questions
  • Product Analytics Questions
  • Machine Learning Questions
  • Resume-Based Behavioral Questions

GitLab Data Scientist

How To Prepare for GitLab Data Science Interviews?

To prepare for GitLab Data Science interviews read the book Ace the Data Science Interview because it's got:

  • 201 interview questions sourced from Microsoft, Amazon & startups
  • a crash course covering SQL, Product-Sense & ML
  • over 900+ 5-star reviews on Amazon

Ace the DS Interview