At Trainline, SQL is used day-to-day for extracting and analyzing railway booking data, and for managing and manipulating databases for customer trend prediction. So, it shouldn't surprise you that Trainline frequently asks SQL coding questions during interviews for Data Science and Data Engineering positions.
To help you study for the Trainline SQL interview, we've collected 10 Trainline SQL interview questions – able to answer them all?
Trainline, being a digital platform for selling train tickets, may be interested in understanding how their ticket sales have evolved over time for each route.
Let's say we have a simple database table, which contains the details of each ticket sold. The table schema and some sample data are as follows:
sale_id | route_id | sale_date | ticket_price |
---|---|---|---|
2321 | 1001 | 06/08/2022 00:00:00 | 50.00 |
3452 | 2001 | 06/10/2022 00:00:00 | 45.00 |
9563 | 1001 | 06/30/2022 00:00:00 | 55.00 |
4222 | 3001 | 07/01/2022 00:00:00 | 80.00 |
7245 | 2001 | 07/22/2022 00:00:00 | 46.00 |
8562 | 3001 | 07/25/2022 00:00:00 | 80.00 |
The SQL interview question could be: Write a SQL query to calculate the total monthly revenue and average ticket price for each route.
This question requires knowledge of SQL aggregate functions: summing and averaging over groups of records which fall within a specific date range (a month), grouped by each route.
The expected output may look as follows:
mth | route_id | total_revenue | avg_ticket_price |
---|---|---|---|
6 | 1001 | 105.00 | 52.50 |
6 | 2001 | 45.00 | 45.00 |
7 | 2001 | 46.00 | 46.00 |
7 | 3001 | 160.00 | 80.00 |
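One way to write this in PostgreSQL is sketched below (the table name ticket_sales is an assumption, since the question doesn't name the table):

```sql
-- Monthly revenue and average ticket price per route.
-- DATE_TRUNC('month', ...) buckets each sale into its calendar month.
SELECT
  EXTRACT(MONTH FROM DATE_TRUNC('month', sale_date)) AS mth,
  route_id,
  SUM(ticket_price) AS total_revenue,
  AVG(ticket_price) AS avg_ticket_price
FROM ticket_sales
GROUP BY DATE_TRUNC('month', sale_date), route_id
ORDER BY mth, total_revenue DESC;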
This query groups the data by month and route_id and then calculates the total and average ticket price for each group. By using the date_trunc function, we can easily group the data by month. The ORDER BY clause ensures that the result is sorted by month and then by total revenue in descending order.
Pro Tip: Window functions are a frequent SQL interview topic, so practice all the window function problems on DataLemur.
Imagine you are a data engineer at Trainline, a company that sells train tickets. Your task is to design a database for handling bookings. Essential entities to consider are stations, trains, and tickets.
Trainline's ticket booking system needs, at a minimum, tables for stations, trains, schedules, and tickets:
station_id | station_name |
---|---|
1 | London |
2 | Manchester |
3 | Glasgow |
train_id | train_name |
---|---|
1 | Galia Express |
2 | Victoria Line |
schedule_id | train_id | departure_station_id | arrival_station_id | departure_time | arrival_time |
---|---|---|---|---|---|
1 | 1 | 1 | 2 | 09:00:00 | 11:30:00 |
2 | 2 | 2 | 3 | 14:00:00 | 16:00:00 |
ticket_id | passenger_name | booking_date | departure_station_id | arrival_station_id | departure_time | arrival_time |
---|---|---|---|---|---|---|
1 | John Doe | 2022-01-01 | 1 | 2 | 09:00:00 | 11:30:00 |
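A possible schema for these four tables is sketched below (column types are assumptions based on the sample data above):

```sql
CREATE TABLE stations (
  station_id   INT PRIMARY KEY,
  station_name VARCHAR(100) NOT NULL
);

CREATE TABLE trains (
  train_id   INT PRIMARY KEY,
  train_name VARCHAR(100) NOT NULL
);

-- Each schedule links a train to a departure/arrival station pair.
CREATE TABLE schedules (
  schedule_id          INT PRIMARY KEY,
  train_id             INT REFERENCES trains(train_id),
  departure_station_id INT REFERENCES stations(station_id),
  arrival_station_id   INT REFERENCES stations(station_id),
  departure_time       TIME,
  arrival_time         TIME
);

CREATE TABLE tickets (
  ticket_id            INT PRIMARY KEY,
  passenger_name       VARCHAR(100),
  booking_date         DATE,
  departure_station_id INT REFERENCES stations(station_id),
  arrival_station_id   INT REFERENCES stations(station_id),
  departure_time       TIME,
  arrival_time         TIME
);
```

Using foreign keys back to stations and trains keeps the schema normalized: route and train details live in one place, and tickets reference them by ID.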
To answer how many tickets were sold between two stations in a specific time range:
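A sketch of such a query follows (the station IDs and dates are placeholders to substitute with your own values):

```sql
-- Count tickets sold between two stations within a date range.
SELECT COUNT(*) AS tickets_sold
FROM tickets
WHERE departure_station_id = 1          -- placeholder: departure station
  AND arrival_station_id   = 2          -- placeholder: arrival station
  AND booking_date BETWEEN '2022-01-01' AND '2022-03-31';  -- placeholder range
```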
This query will return the count of all tickets sold for a specific route in a specific time range. You replace the two station IDs and the start and end dates with your desired values. The BETWEEN operator in the WHERE clause will ensure you only consider tickets booked within the time range you want.
To find records in one table that aren't in another, you can use a LEFT JOIN and check for NULL values in the right-side table.
Here's an example using two tables, Trainline employees and Trainline managers:
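A sketch of this pattern is shown below (the join column manager_id is an assumed name for illustration):

```sql
-- Employees with no matching row in managers.
-- The LEFT JOIN keeps every employee; unmatched rows have NULL
-- in the managers columns, which the WHERE clause filters for.
SELECT e.*
FROM employees e
LEFT JOIN managers m
  ON e.manager_id = m.manager_id
WHERE m.manager_id IS NULL;
```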
This query returns all rows from Trainline employees where there is no matching row in managers based on the join column.
You can also use the EXCEPT operator in PostgreSQL and Microsoft SQL Server to return the records that are in the first table but not in the second. Here is an example:
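A minimal sketch (again assuming an employee_id column shared by both tables):

```sql
-- Rows returned by the first SELECT that do not appear in the second.
SELECT employee_id FROM employees
EXCEPT
SELECT employee_id FROM managers;
```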
This will return all rows from employees that are not in managers. The EXCEPT operator works by returning the rows that are returned by the first query, but not by the second.
Note that EXCEPT isn't supported by all DBMS systems, like MySQL and Oracle (but have no fear, since you can use the MINUS operator to achieve a similar result).
As a data analyst for Trainline, your task is to filter the customer records for First Class bookings departing from London within the last 6 months that have not been cancelled within the past year.
You are provided with two tables, bookings and cancellations.
The bookings table is formatted as follows:
booking_id | customer_id | booking_date | departure_station | arrival_station | travel_class |
---|---|---|---|---|---|
1001 | 200 | 2022-01-20 | London | Manchester | First Class |
1002 | 201 | 2022-02-15 | Birmingham | London | Standard |
1003 | 200 | 2022-03-10 | London | Edinburgh | First Class |
1004 | 202 | 2022-04-25 | London | Bristol | Standard |
1005 | 203 | 2022-05-30 | Manchester | Birmingham | First Class |
The cancellations table is formatted as follows:
cancel_id | booking_id | cancel_date |
---|---|---|
501 | 1002 | 2022-02-16 |
502 | 1005 | 2022-06-01 |
503 | 1004 | 2022-04-27 |
504 | 1004 | 2022-05-01 |
Here is the SQL Postgres query to solve the above problem:
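A sketch of such a query follows (it assumes the conditions described above: London departures, First Class, booked in the last 6 months, and not cancelled within the past year):

```sql
-- Most recent qualifying booking's customer.
SELECT b.customer_id
FROM bookings b
WHERE b.booking_date >= CURRENT_DATE - INTERVAL '6 months'
  AND b.departure_station = 'London'
  AND b.travel_class = 'First Class'
  -- exclude bookings cancelled within the last year
  AND b.booking_id NOT IN (
    SELECT c.booking_id
    FROM cancellations c
    WHERE c.cancel_date >= CURRENT_DATE - INTERVAL '1 year'
  )
ORDER BY b.booking_date DESC
LIMIT 1;
```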
This query finds the customers who satisfy all the given conditions, providing the customer_id of the user who made their most recent booking under these conditions.
First, it filters for records with a booking date within the last 6 months, where the departure station is London and the travel class is First Class. It then excludes the bookings which appear in the cancellations table within the last year. The ORDER BY and LIMIT 1 clauses ensure that we retrieve only the most recent booking that meets these conditions.
The FOREIGN KEY constraint is used to establish a relationship between two tables in a database. This ensures the referential integrity of the data in the database.
For example, if you have a table of Trainline customers and an orders table, the customer_id column in the orders table could be a FOREIGN KEY that references the id column (which is the primary key) in the Trainline customers table.
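That example can be sketched in DDL like so (table and column names follow the description above):

```sql
CREATE TABLE customers (
  id   INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,
  customer_id INT,
  -- Rejects any order whose customer_id has no matching customers.id
  FOREIGN KEY (customer_id) REFERENCES customers(id)
);
```

With this constraint in place, the database will reject an INSERT into orders that references a non-existent customer, and (by default) a DELETE of a customer who still has orders.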
Trainline is a company that offers various routes and ticket bookings for trains and coaches. As a part of their digital marketing strategy, they often use ads that lead a user to their website or app. From there, they hope users will not just view the different routes and tickets, but also add them to their cart.
Given a table of ad clicks and a table of cart additions, your task is to calculate the click-through rate (CTR) from viewing a route to adding it to the cart by date. The first table logs every click on an ad that redirects to a route view, while the second logs every addition of a ticket to the cart.
click_id | user_id | click_date | route_id |
---|---|---|---|
1508 | 429 | 06/08/2022 | 30169 |
1782 | 713 | 06/10/2022 | 40590 |
1769 | 446 | 06/12/2022 | 30169 |
2295 | 145 | 06/15/2022 | 50702 |
2790 | 366 | 06/18/2022 | 40590 |
addition_id | user_id | add_date | route_id |
---|---|---|---|
2715 | 429 | 06/08/2022 | 30169 |
3463 | 713 | 06/10/2022 | 40590 |
3673 | 446 | 06/12/2022 | 30169 |
4953 | 145 | 06/15/2022 | 50702 |
5702 | 434 | 06/18/2022 | 40590 |
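A sketch of the CTR calculation is below (the table names clicks and cart_additions are assumptions, since the source doesn't name them):

```sql
-- CTR per date and route: additions matched to clicks by user,
-- route, and date, divided by total clicks, as a percentage.
SELECT
  c.click_date,
  c.route_id,
  COUNT(a.addition_id) * 100.0 / COUNT(c.click_id) AS ctr_pct
FROM clicks c
LEFT JOIN cart_additions a
  ON  c.user_id    = a.user_id
  AND c.route_id   = a.route_id
  AND c.click_date = a.add_date
GROUP BY c.click_date, c.route_id
ORDER BY c.click_date;
```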
To calculate the CTR, this query joins the clicks and cart-additions tables on the user_id, route_id, and date fields. Then it counts the number of tickets added to the cart and divides it by the number of ad clicks for each date and route. This value is then multiplied by 100 to get a percentage. The results are grouped by date and route_id, and ordered by date to provide a chronological view of CTR.
To practice another question about calculating rates, try this TikTok SQL question within DataLemur's online SQL code editor:
In a database, an index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and the use of more storage space to maintain the index data structure.
There are several types of indexes that can be used in a database, including clustered and non-clustered indexes, unique indexes, and composite indexes that span multiple columns.
Suppose we want to find out the most popular routes by the number of tickets sold per month. Each ticket sale is logged to a sales table, which includes the train route id, ticket id, and the sale date.
sale_id | route_id | ticket_id | sale_date |
---|---|---|---|
2387 | 501 | 10301 | 06/08/2022 |
1293 | 702 | 13456 | 06/10/2022 |
5093 | 501 | 12872 | 06/18/2022 |
1132 | 702 | 14236 | 07/26/2022 |
4917 | 501 | 15641 | 07/05/2022 |
month | route | ticket_count |
---|---|---|
6 | 501 | 2 |
6 | 702 | 1 |
7 | 501 | 1 |
7 | 702 | 1 |
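One way to write this in PostgreSQL, using the sales table described above:

```sql
-- Tickets sold per route per month, most popular first.
SELECT
  EXTRACT(MONTH FROM sale_date) AS month,
  route_id AS route,
  COUNT(ticket_id) AS ticket_count
FROM sales
GROUP BY month, route
ORDER BY month, ticket_count DESC;
```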
This SQL query will group all ticket sales by month and route_id, count the number of tickets for each group, and order the results by month and the number of tickets in descending order. This gives us the most popular routes (identified by 'route') for each month.
Assume you are given a database of Trainline customers who have booked train tickets in the past. Your task is to find customers whose first name starts with 'M' and who have booked tickets on the 'London-Plymouth' route. The query should return the customer's first name, last name, and the date they booked the ticket.
customer_id | first_name | last_name |
---|---|---|
007 | Mark | Smith |
463 | John | Brown |
591 | Matthew | Jones |
812 | Mario | Rossi |
073 | Madeline | Archibald |
booking_id | customer_id | booking_date | route |
---|---|---|---|
521 | 007 | 06/08/2022 | London-Plymouth |
612 | 463 | 06/10/2022 | London-Bristol |
349 | 591 | 06/18/2022 | London-Plymouth |
724 | 812 | 07/26/2022 | London-Edinburgh |
965 | 073 | 07/05/2022 | London-Plymouth |
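A sketch of such a query, using the customers and bookings tables shown above:

```sql
-- Customers whose first name starts with 'M' who booked London-Plymouth.
SELECT c.first_name, c.last_name, b.booking_date
FROM customers c
JOIN bookings b
  ON c.customer_id = b.customer_id
WHERE c.first_name LIKE 'M%'        -- pattern match on the first name
  AND b.route = 'London-Plymouth';
```

The `LIKE 'M%'` pattern matches any first name beginning with 'M', so Mark, Matthew, Mario, and Madeline would all qualify.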
This SQL command joins the 'customers' and 'bookings' tables on the 'customer_id'. It then filters the result where the customer's first name starts with 'M' and the route is 'London-Plymouth'.
The DISTINCT keyword removes duplicates from a query's results.
Suppose you had a table of Trainline customers, and wanted to figure out which cities the customers lived in, but didn't want duplicate results.
customers table:
name | city |
---|---|
Akash | SF |
Brittany | NYC |
Carlos | NYC |
Diego | Seattle |
Eva | SF |
Faye | Seattle |
You could write a query like this to filter out the repeated cities:
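A minimal sketch, assuming the table is named customers:

```sql
-- Each city appears only once in the result.
SELECT DISTINCT city
FROM customers;
```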
Your result would be:
city |
---|
SF |
NYC |
Seattle |
The key to acing a Trainline SQL interview is to practice, practice, and then practice some more! Beyond just solving the earlier Trainline SQL interview questions, you should also solve the 200+ FAANG SQL Questions on DataLemur which come from companies like Facebook, Google and unicorn tech startups.
Each problem on DataLemur has multiple hints, step-by-step solutions and, most importantly, an interactive coding environment so you can write your SQL query right in the browser and have it checked.
To prep for the Trainline SQL interview it is also helpful to practice SQL questions from other tech companies like:
But if your SQL foundations are weak, don't jump right into solving questions – go learn SQL with this SQL tutorial for Data Scientists & Analysts.
This tutorial covers topics including CASE/WHEN/ELSE statements and handling missing data (NULLs) – both of these pop up often during Trainline SQL assessments.
In addition to SQL interview questions, the other types of questions tested in the Trainline Data Science Interview are:
The best way to prepare for Trainline Data Science interviews is by reading Ace the Data Science Interview. The book's got: