Explanation of SQL Indexes, Part 1

When used effectively, a database index in SQL can feel like magic. However, as this series of exercises will demonstrate, the fundamental principles of SQL indexes and their proper usage are quite logical.

This series, SQL Indexes Explained, will delve into the reasons behind using indexes for data retrieval and the design choices made by modern RDBMSs. We’ll also examine the algorithms used to fetch data for particular query patterns.

You only need a basic understanding of SQL and any programming language to follow along with SQL Indexes Explained. Here’s a glimpse of the main topics we’ll be exploring:

  • The necessity of SQL database indexes and visualizing execution plans with indexes
  • Index design: Selecting indexes that optimize query speed and efficiency
  • Writing queries that effectively leverage indexes
  • The impact of SQL index usage on read/write performance
  • Covering indexes
  • Partitioning, its influence on reading and writing, and when to employ it

This is more than a basic SQL index tutorial; it’s a thorough examination of the inner workings of indexes.

We will decipher how an RDBMS utilizes indexes through practical exercises and analysis of our problem-solving techniques. Our practice material will be in the form of read-only Google Sheets. To participate in an exercise, simply make a copy of the Google Sheet (File → Make a copy) or transfer its contents to your personal Google Sheet.

We’ll be using Oracle syntax for SQL queries in each exercise. Dates will adhere to the ISO 8601 format: YYYY-MM-DD.

Exercise 1: Retrieving All Reservations for a Client

Your initial task—hold off on starting just yet—is to locate all rows within the Reservation spreadsheet pertaining to a specific client in a hotel reservation system and replicate them to your spreadsheet. This simulates the execution of the following query:

1
2
3
4
5
6
SELECT
        *
    FROM
        Reservations
    WHERE
        ClientID = 12;

We’ll tackle this using a particular method.

Approach 1: No Sorting or Filtering

For your first attempt, refrain from using any sorting or filtering features. Please track your completion time. The resulting sheet should contain 73 rows.

This pseudocode outlines the algorithm for completing the task without sorting:

1
2
For each row from Reservations
  If Reservations.ClientID = 12 then fetch Reservations.*

In this scenario, we had to examine all 841 rows to retrieve and copy the 73 rows that met the condition.

Approach 2: Sorting Only

This time, sort the sheet by the values in the ClientID column without employing filters. Note the time taken and compare it to the time required without sorting.

The approach post-sorting looks like this:

1
2
3
For each row from Reservations
  If ClientID = 12 then fetch Reservations.*
  Else if ClientID > 12 exit

This method required checking only 780 rows. If we could directly access the first relevant row, the process would be even faster.

However, creating a program for this task would render this solution slower than the initial approach. This is because sorting the entire dataset would necessitate accessing each row at least once. This method is efficient only if the sheet is pre-sorted in the desired order.

Exercise 2: Determining Reservation Count for a Specific Date

Your next task is to calculate the number of check-ins on August 16th, 2020:

1
2
3
4
5
6
SELECT
        COUNT (*)
    FROM
        Reservations
    WHERE
        DateFrom = TO_DATE('2020-08-16', 'YYYY-MM-DD')

Utilize the spreadsheet from Exercise 1 and compare the time taken to complete the task with and without sorting. The correct count is 91.

The algorithm for the non-sorting approach mirrors that of Exercise 1.

The sorting approach is also comparable to the previous exercise, with the loop divided into two segments:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Assumption: Table reservation is sorted by DateFrom
-- Find the first reservation from the 16th of August 2020.
Repeat
  Read next row
Until DateFrom = '2020-08-16'

-- Calculate the count
While DateFrom = '2020-08-16'
  Increase the count
  Read the next row

Exercise 3: Assisting a Criminal Investigation

A police inspector requires a list of guests who checked into the hotel on August 13th and 14th, 2020.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT
        ClientID
    FROM
        Reservations
    WHERE
        DateFrom BETWEEN (
            TO_DATE('2020-08-13', 'YYYY-MM-DD') AND
            TO_DATE('2020-08-14', 'YYYY-MM-DD')
        )
        AND HotelID = 3;

Approach 1: Sorting by Date Only

Speed is crucial for the inspector. We know that sorting the table/spreadsheet by arrival date would be beneficial. If we just completed Exercise 2, we’re in luck, as the table is already sorted. Let’s use an approach similar to the one from Exercise 2.

Attempt this approach, noting the time taken, the number of rows read, and the number of entries on the final list.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Assumption: Table reservation is sorted by DateFrom
-- Find the first reservation from the 13th of August 2020.
Repeat
  Read next row
Until DateFrom >= '2020-08-13'

-- Prepare the list
While DateFrom < '2020-08-15'
  If HotelID = 3 then write down the ClientID
  Read the next row

This approach required reading 511 rows to compile a 46-guest list. If we had a method to pinpoint the first relevant row, we could have avoided 324 redundant reads in the repeat cycle, which were solely for locating the first arrival on August 13th. Nonetheless, we still had to process over 100 rows to verify arrivals at the hotel with a HotelID of 3.

The inspector wouldn’t be pleased with the results after the long wait. Instead of receiving guest names and pertinent information, they only received a list of IDs, which hold little meaning in this context.

We’ll address this later in the series. For now, let’s focus on generating the list more efficiently.

Approach 2: Sorting by Hotel, Then Date

To sort rows by HotelID followed by DateFrom, select all columns and use the Google Sheets menu option Data → Sort range.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
-- Assumption: Sorted according to HotelID and DateFrom
-- Find the first reservation for the HotelID = 3.
Repeat
  Read next row
Until HotelID >= 3

-- Find the first arrival at the hotel on 13th of August
While HotelID = 3 and DateFrom < '2020-08-13'
  Read the next row

-- Prepare the list
While HotelID = 3 and DateFrom < '2020-08-15'
  Write down the ClientID
  Read the next row

We had to bypass the first 338 arrivals to find the first one relevant to our hotel. Then, we reviewed 103 earlier arrivals to identify the first one on August 13th. Finally, we copied 46 consecutive ClientID values. Copying a consecutive block of IDs in the third step was helpful. Ideally, we could jump straight to the first row of that block.

Approach 3: Sorting by Hotel Only

Now, let’s retry the exercise using the spreadsheet sorted solely by HotelID.

The algorithm applied to the table sorted by HotelID alone is less efficient than sorting by HotelID and then DateFrom:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
-- Assumption: Sorted according to HotelID
-- Find the first reservation for the HotelID = 3.
Repeat
  Read next row
Until HotelID >= 3

-- Prepare the list
While HotelID = 3
  If DateFrom between '2020-08-13' and '2020-08-14'
      Write down the ClientID
  Read the next row

Here, we have to examine all 166 arrivals at the hotel with a HotelID of 3 and verify if each DateFrom falls within the specified timeframe.

Approach 4: Sorting by Date, Then Hotel

Is there a significant difference between sorting by HotelID then DateFrom versus the reverse order? Let’s investigate by sorting first by DateFrom and then by HotelID.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
-- Assumption: Sorted according to DateFrom and HotelID
-- Find the first arrival on 13th of August
While DateFrom < '2020-08-13'
  Read the next row

--Find the first arrival at the Hotel 
While HotelID < 3 and DateFrom < '2020-08-15'
	Read the next row

Repeat

  If HotelID = 3
    Write down the ClientID

  Read the next row

Until DateFrom > '2020-08-14' or (DateFrom = '2020-08-14' and HotelID > 3)

We found the first row with the target date and continued reading until locating the first arrival at the desired hotel. For a certain number of rows, both conditions (correct date and hotel) were met. However, after arrivals at Hotel 3, we encountered arrivals for hotels 4, 5, and so on, for the same date. Subsequently, we had to read rows for the next day, starting with hotels 1 and 2, until reaching consecutive arrivals for our target hotel.

Illustration of data layout using different sorting approaches, as described further in the article text.

Evidently, each approach involves a single continuous block of data representing partially matched information within the complete dataset. Approaches 2 and 4 are the only ones where the logic allows us to halt the algorithm before reaching the end of these partial matches.

While Approach 4 has fully matched data in two separate blocks, only Approach 2 consolidates all target data into one consecutive block.

 Approach 1Approach 2Approach 3Approach 4
Initial skippable rows324338 + 103 = 441342324
Candidate rows to examine18846166159
Skippable rows after algorithm stops328353332357
Total skippable rows652794674681

The numbers highlight the advantages of Approach 2 in this scenario.

SQL Indexes Explained: Key Takeaways and Next Steps

These exercises highlight several key points:

  1. Retrieving data from a pre-sorted table is faster.
  2. If a table requires sorting, it’s sorting takes more time than reading from an unsorted table.
  3. A mechanism to jump directly to the first matching row within a sorted table would significantly reduce read operations.
  4. Having a pre-sorted table is highly advantageous.
  5. Maintaining sorted copies of the table for frequently executed queries would be beneficial.

The concept of a sorted table copy closely resembles a database index. The next installment of SQL Indexes Explained will delve into a basic index implementation. Thank you for reading!

Licensed under CC BY-NC-SA 4.0