When used effectively, a database index in SQL can feel like magic. However, as this series of exercises will demonstrate, the fundamental principles of SQL indexes and their proper usage are quite logical.
This series, SQL Indexes Explained, will delve into the reasons behind using indexes for data retrieval and the design choices made by modern RDBMSs. We’ll also examine the algorithms used to fetch data for particular query patterns.
You only need a basic understanding of SQL and any programming language to follow along with SQL Indexes Explained. Here’s a glimpse of the main topics we’ll be exploring:
- The necessity of SQL database indexes and visualizing execution plans with indexes
- Index design: Selecting indexes that optimize query speed and efficiency
- Writing queries that effectively leverage indexes
- The impact of SQL index usage on read/write performance
- Covering indexes
- Partitioning, its influence on reading and writing, and when to employ it
This is more than a basic SQL index tutorial; it’s a thorough examination of the inner workings of indexes.
We will decipher how an RDBMS utilizes indexes through practical exercises and analysis of our problem-solving techniques. Our practice material will be in the form of read-only Google Sheets. To participate in an exercise, simply make a copy of the Google Sheet (File → Make a copy) or transfer its contents to your personal Google Sheet.
We’ll be using Oracle syntax for SQL queries in each exercise. Dates will adhere to the ISO 8601 format: YYYY-MM-DD.
Exercise 1: Retrieving All Reservations for a Client
Your initial task—hold off on starting just yet—is to locate all rows within the Reservation spreadsheet pertaining to a specific client in a hotel reservation system and replicate them to your spreadsheet. This simulates the execution of the following query:
| |
We’ll tackle this using a particular method.
Approach 1: No Sorting or Filtering
For your first attempt, refrain from using any sorting or filtering features. Please track your completion time. The resulting sheet should contain 73 rows.
This pseudocode outlines the algorithm for completing the task without sorting:
| |
In this scenario, we had to examine all 841 rows to retrieve and copy the 73 rows that met the condition.
Approach 2: Sorting Only
This time, sort the sheet by the values in the ClientID column without employing filters. Note the time taken and compare it to the time required without sorting.
The approach post-sorting looks like this:
| |
This method required checking only 780 rows. If we could directly access the first relevant row, the process would be even faster.
However, creating a program for this task would render this solution slower than the initial approach. This is because sorting the entire dataset would necessitate accessing each row at least once. This method is efficient only if the sheet is pre-sorted in the desired order.
Exercise 2: Determining Reservation Count for a Specific Date
Your next task is to calculate the number of check-ins on August 16th, 2020:
| |
Utilize the spreadsheet from Exercise 1 and compare the time taken to complete the task with and without sorting. The correct count is 91.
The algorithm for the non-sorting approach mirrors that of Exercise 1.
The sorting approach is also comparable to the previous exercise, with the loop divided into two segments:
| |
Exercise 3: Assisting a Criminal Investigation
A police inspector requires a list of guests who checked into the hotel on August 13th and 14th, 2020.
| |
Approach 1: Sorting by Date Only
Speed is crucial for the inspector. We know that sorting the table/spreadsheet by arrival date would be beneficial. If we just completed Exercise 2, we’re in luck, as the table is already sorted. Let’s use an approach similar to the one from Exercise 2.
Attempt this approach, noting the time taken, the number of rows read, and the number of entries on the final list.
| |
This approach required reading 511 rows to compile a 46-guest list. If we had a method to pinpoint the first relevant row, we could have avoided 324 redundant reads in the repeat cycle, which were solely for locating the first arrival on August 13th. Nonetheless, we still had to process over 100 rows to verify arrivals at the hotel with a HotelID of 3.
The inspector wouldn’t be pleased with the results after the long wait. Instead of receiving guest names and pertinent information, they only received a list of IDs, which hold little meaning in this context.
We’ll address this later in the series. For now, let’s focus on generating the list more efficiently.
Approach 2: Sorting by Hotel, Then Date
To sort rows by HotelID followed by DateFrom, select all columns and use the Google Sheets menu option Data → Sort range.
| |
We had to bypass the first 338 arrivals to find the first one relevant to our hotel. Then, we reviewed 103 earlier arrivals to identify the first one on August 13th. Finally, we copied 46 consecutive ClientID values. Copying a consecutive block of IDs in the third step was helpful. Ideally, we could jump straight to the first row of that block.
Approach 3: Sorting by Hotel Only
Now, let’s retry the exercise using the spreadsheet sorted solely by HotelID.
The algorithm applied to the table sorted by HotelID alone is less efficient than sorting by HotelID and then DateFrom:
| |
Here, we have to examine all 166 arrivals at the hotel with a HotelID of 3 and verify if each DateFrom falls within the specified timeframe.
Approach 4: Sorting by Date, Then Hotel
Is there a significant difference between sorting by HotelID then DateFrom versus the reverse order? Let’s investigate by sorting first by DateFrom and then by HotelID.
| |
We found the first row with the target date and continued reading until locating the first arrival at the desired hotel. For a certain number of rows, both conditions (correct date and hotel) were met. However, after arrivals at Hotel 3, we encountered arrivals for hotels 4, 5, and so on, for the same date. Subsequently, we had to read rows for the next day, starting with hotels 1 and 2, until reaching consecutive arrivals for our target hotel.

Evidently, each approach involves a single continuous block of data representing partially matched information within the complete dataset. Approaches 2 and 4 are the only ones where the logic allows us to halt the algorithm before reaching the end of these partial matches.
While Approach 4 has fully matched data in two separate blocks, only Approach 2 consolidates all target data into one consecutive block.
| Approach 1 | Approach 2 | Approach 3 | Approach 4 | |
|---|---|---|---|---|
| Initial skippable rows | 324 | 338 + 103 = 441 | 342 | 324 |
| Candidate rows to examine | 188 | 46 | 166 | 159 |
| Skippable rows after algorithm stops | 328 | 353 | 332 | 357 |
| Total skippable rows | 652 | 794 | 674 | 681 |
The numbers highlight the advantages of Approach 2 in this scenario.
SQL Indexes Explained: Key Takeaways and Next Steps
These exercises highlight several key points:
- Retrieving data from a pre-sorted table is faster.
- If a table requires sorting, it’s sorting takes more time than reading from an unsorted table.
- A mechanism to jump directly to the first matching row within a sorted table would significantly reduce read operations.
- Having a pre-sorted table is highly advantageous.
- Maintaining sorted copies of the table for frequently executed queries would be beneficial.
The concept of a sorted table copy closely resembles a database index. The next installment of SQL Indexes Explained will delve into a basic index implementation. Thank you for reading!