So far in the SQL tutorial we've only been analyzing data from one table at a time. However, in the real world, companies have databases containing thousands of tables. To combine multiple tables, and analyze their data simultaneously, we can write a SQL query β the focus of this tutorial.
For example, let's pretend we work at Spotify where we have a table for albums, a table for artists, a table to represent users, and the list π΅ goes on and on and onπ΅ (sung to the tune of Don't Stop Believin').
As a Data Analyst or Data Scientist, you'll almost always be pulling data from multiple tables, which is why the SQL JOIN is crucial.
For example, at Spotify you might analyze Drake π by writing queries that combine two tables like:
Now that you understand the real-world use case of a , let's write a specific SQL join query using Spotify data.
Suppose you work at Spotify and want to join information in the table with information in the table. Here's the schema for the two tables:
Example Input:
artist_id | artist_name | label_owner |
---|---|---|
101 | Ed Sheeran | Warner Music Group |
120 | Drake | Warner Music Group |
125 | Bad Bunny | Rimas Entertainment |
Example Input:
song_id | artist_id | name |
---|---|---|
55511 | 101 | Perfect |
45202 | 101 | Shape of You |
22222 | 120 | One Dance |
19960 | 120 | Hotline Bling |
Notice how the two tables have the column in common:
Now, let's actually build our SQL JOIN query to combine the and table.
There are three key parts to get right when writing a in SQL. First, you need to specify what columns to show. For our x example, we'll just use for that.
Secondly, you need to specify the names of the two tables we are joining. Since we are joining the with table, our query so far is as follows:
The third and final in a SQL JOIN query is the clause, which explains to the RDBMS (Relational Database Management System) how the two tables relate to each other.
We need to explicitly write in the SQL query that the column in the table matches up against the column in the table with this clause:
Putting these three parts together, we get the following SQL JOIN query:
The above query combines info from the and table into the result below:
artist_id | artist_name | label_owner | song_id | artist_id | name |
---|---|---|---|---|---|
101 | Ed Sheeran | Warner Music Group | 55511 | 101 | Perfect |
101 | Ed Sheeran | Warner Music Group | 45202 | 101 | Shape of You |
120 | Drake | Warner Music Group | 22222 | 120 | One Dance |
120 | Drake | Warner Music Group | 19960 | 120 | Hotline Bling |
125 | Bad Bunny | Rimas Entertainment | 12636 | 125 | Mia |
Now, let's practice writing a on a different dataset.
Suppose you work as a Data Scientist at the stock-trading app Robinhood. Assume you're given access to a table called which contains information about trades placed on the platform, and a table called which has information about a specific user.
Here's what the data looks like in both tables:
Example Input:
order_id | user_id | price | quantity | status | timestamp |
---|---|---|---|---|---|
100101 | 111 | 9.80 | 10 | Cancelled | 08/17/2022 12:00:00 |
100102 | 111 | 10.00 | 10 | Completed | 08/17/2022 12:00:00 |
100259 | 148 | 5.10 | 35 | Completed | 08/25/2022 12:00:00 |
100264 | 148 | 4.80 | 40 | Completed | 08/26/2022 12:00:00 |
100305 | 300 | 10.00 | 15 | Completed | 09/05/2022 12:00:00 |
100400 | 178 | 9.90 | 15 | Completed | 09/09/2022 12:00:00 |
100565 | 265 | 25.60 | 5 | Completed | 12/19/2022 12:00:00 |
Example Input:
user_id | city | signup_date | |
---|---|---|---|
111 | San Francisco | rrok10@gmail.com | 08/03/2021 12:00:00 |
148 | Boston | sailor9820@gmail.com | 08/20/2021 12:00:00 |
178 | San Francisco | harrypotterfan182@gmail.com | 01/05/2022 12:00:00 |
265 | Denver | shadower_@hotmail.com | 02/26/2022 12:00:00 |
300 | San Francisco | houstoncowboy1122@hotmail.com | 06/30/2022 12:00:00 |
Write a SQL query to join the and table.
The output should look something like this:
order_id | user_id | quantity | status | date | price | user_id | city | signup_date | |
---|---|---|---|---|---|---|---|---|---|
100102 | 111 | 10 | Completed | 08/17/2022 12:00:00 | 10.00 | 111 | San Francisco | rrok10@gmail.com | 08/03/2021 12:00:00 |
100101 | 111 | 10 | Cancelled | 08/17/2022 12:00:00 | 9.80 | 111 | San Francisco | rrok10@gmail.com | 08/03/2021 12:00:00 |
100900 | 148 | 50 | Completed | 07/14/2022 12:00:00 | 9.78 | 148 | Boston | sailor9820@gmail.com | 08/20/2021 12:00:00 |
100259 | 148 | 35 | Completed | 08/25/2022 12:00:00 | 5.10 | 148 | Boston | sailor9820@gmail.com | 08/20/2021 12:00:00 |
100264 | 148 | 40 | Completed | 08/26/2022 12:00:00 | 4.80 | 148 | Boston | sailor9820@gmail.com | 08/20/2021 12:00:00 |
100777 | 178 | 60 | Completed | 07/25/2022 17:47:00 | 3.56 | 178 | San Francisco | harrypotterfan182@gmail.com | 01/05/2022 12:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Be sure to complete this SQL JOIN practice exercise, as it's meant to warm you up before you solve a real SQL interview question using the same data in the next section.
A Data Scientist interviewing at Robinhood was asked this SQL interview question where you need to write a SQL JOIN query to find the top three cities that have the highest number of completed trade orders.
Your output should look like this:
city | total_orders |
---|---|
San Francisco | 3 |
Boston | 2 |
Denver | 1 |
Besides needing a , you'll also need and commands to solve this interview question.
So far in this tutorial, we've simply use the keyword, which is short for an . There are actually 4 different types of JOINS in SQL which we'll now cover in detail.
Type of Joins | Description |
---|---|
INNER JOIN | Returns only the rows with matching values from both tables. |
LEFT JOIN | Returns all the rows from the left table and the matching rows from the right table. |
RIGHT JOIN | Returns all the rows from the right table and the matching rows from the left table. |
FULL OUTER JOIN | Returns all rows when there is a match in either the left or the right table. If there is no match, values are returned for columns from the table without a match. |
We'll be using the Goodreads Books dataset to demonstrate these different types of JOINs.
An returns only the rows with matching values from both tables.
Picture this: you want to find out how many books priced at $20 and above have been ordered, and who the buyers are.
The returns only the rows where there is a matching book ID in both the and tables, focusing on books priced $20 and above.
Next, let's explore how the works using our and tables.
Suppose we want to retrieve all the orders along with their corresponding deliveries information. Here's the query:
With , all rows from the left table () are fetched, along with matching rows from the right table (). If there is no matching data in the right table, the result will still include the left table's data with values in the columns from the right table.
In our example, we're displaying a subset of 10 rows out of 20:
But what if we flip the tables? By swapping the positions of the tables in the , the same result can be achieved:
Notice how order IDs 2005, 2008, and 2010 are missing from the results? Since these order IDs are absent in the table, there are no corresponding matches in the table. This results in the values you see.
A is the opposite of a . It returns all rows from the right table and pairs them with matching rows from the left table. If there is no match, the columns from the left table have values.
Run the following query:
You can also swap the positions of the tables, and you'll get the exact same results as a .
π‘A little tip: In the real world, is rarely used. Most people naturally think from left to right, making easier to understand and implement. So, if you find yourself in a situation where you might consider a , simply switch the positions of the tables and use a instead!
Now, let's explore the which allows you to bring unmatched rows into the results.
Ever wondered what happens when both tables have something to offer, but they're not a perfect match?
A returns all rows when there is a match in either the left or the right table. If there is no match, values are returned for columns from the table without a match.
Running this query is like uniting two tables: and . Each row represents an attempt at a match. If there's no match, you'll see values in the columns from the table without a match.
The first 20 rows are expected (we're just showing you the first 5 rows):
Scroll down, and you'll encounter rows with order IDs (marked in the red box). The ensures that even if there isn't an to match these rows, they're still include in the results:
π‘ A little tip: Be careful with as it can introduce duplicate rows to your results without you realizing!
Conditional joins involve using single or multiple conditions using logical operators like and , or incorporating complex logical expressions such as ranges, pattern matching, or subqueries.
These conditions allow you to filter and retrieve data based on specific criteria during the join operation using the clause to specify conditions.
Example 1: Joining and table with a condition on quantity
Retrieve the list of book titles and their quantities ordered where the quantity ordered is more than 2.
What happens is:
Example 2: Joining and tables with a condition on
Retrieve order IDs and their corresponding delivery status where the delivery status is either 'Delivered' or 'Shipped'.
Example 3: Joining and with multiple conditions
Retrieve book titles, their authors, and the order dates for books released after 2015 and ordered in quantities greater than 1.
Example 4: Joining all three tables with a condition on and
Retrieve the book titles, their average ratings, order dates, and delivery statuses for books with a rating higher than 4.0 that have been delivered.
These examples illustrate how to use conditional joins to filter and retrieve specific data based on various conditions across multiple tables.
Let's tackle a real Facebook/Meta SQL Interview which requires a JOIN along with some null handling.
Remember the last lesson on how we used CASE WHEN to segment Marvel Actors based on how active they were on social media:
Let's combine the statement, along with a to solve this real Meta SQL interview question about segmenting Facebook advertisers based on their payment status:
Now that we've learned to join tables, there's one last thing we need to cover in order to finish the intermediate SQL tutorial: handling dates & timestamps.
Next Lesson
DATE FUNCTIONS π