Exploring the Performance of Entity Framework When Utilizing the "Contains" Method

In my daily development work, I heavily rely on Entity Framework. While it offers great convenience, it sometimes suffers from performance issues. There is plenty of excellent advice on improving EF performance (simplify queries, use parameters effectively, select only the fields you need), but certain scenarios remain challenging. One such case is a Contains-style filter over multiple fields, which is essentially a join against an in-memory list.
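
For a single field, Contains works well: EF translates it into a SQL IN clause. A minimal sketch (names follow the schema used later in this article):

// Single-field Contains translates cleanly to "WHERE Ticker IN (...)".
// The trouble starts when the filter must span several fields at once.
var tickers = new List<string> { "Ticker1", "Ticker2" };
var prices = context.Prices
    .Where(p => tickers.Contains(p.Security.Ticker))
    .ToList();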

Problem

Let’s examine this issue with an example:

// localData is an in-memory list holding Ticker, TradedOn and PriceSourceId
var localData = GetDataFromApiOrUser();
var query = from p in context.Prices
            join s in context.Securities on
              p.SecurityId equals s.SecurityId
            join t in localData on
              new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
              new { t.Ticker, t.TradedOn, t.PriceSourceId }
            select p;
var result = query.ToList();

This code doesn’t work in EF 6, and although it runs in EF Core, the join is executed locally: with a database holding ten million records, the entire dataset is pulled into memory, causing significant memory consumption. This behavior is expected and not a bug in EF, but a solution for such scenarios would be highly beneficial. This article explores alternative approaches to this performance bottleneck.
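
As a side note, EF Core 2.x could be configured to fail fast instead of silently evaluating the join on the client. A minimal sketch using the warnings API of that era (later EF Core versions throw for untranslatable queries by default); the context name is hypothetical:

using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Diagnostics;

public class PricesContext : DbContext   // hypothetical context name
{
  protected override void OnConfiguring(DbContextOptionsBuilder options)
  {
    // Turn the "query is being evaluated locally" warning into an exception.
    options.ConfigureWarnings(w =>
        w.Throw(RelationalEventId.QueryClientEvaluationWarning));
  }
}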

Solution

We’ll experiment with various methods, progressing from basic to more advanced techniques. Each step includes a code snippet and performance metrics (execution time and memory usage); any run exceeding ten minutes was treated as a timeout.

The benchmarking program, available in the accompanying repository, is implemented in C# with .NET Core, EF Core, and PostgreSQL. The test machine has an Intel Core i5 processor, 8 GB of RAM, and an SSD.
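
For reference, here is a minimal sketch of how time and memory can be measured per run; this is an assumption about the harness, not the actual code from the repository:

// Hypothetical measurement harness: wall-clock time via Stopwatch,
// managed-heap delta via GC (the real benchmark may measure differently).
var sw = System.Diagnostics.Stopwatch.StartNew();
var memoryBefore = GC.GetTotalMemory(forceFullCollection: true);

RunExperiment();   // hypothetical: one of the options described below

sw.Stop();
var memoryAfter = GC.GetTotalMemory(forceFullCollection: false);
Console.WriteLine(
    $"Time: {sw.Elapsed}, " +
    $"Memory: ~{(memoryAfter - memoryBefore) / (1024 * 1024)} MB");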

The database schema used for testing is as follows:

Just three tables: prices, securities and price sources. The prices table has tens of millions of records.

Option 1: Simple and Naive

We’ll begin with a straightforward approach as a starting point.

var result = new List<Price>();
using (var context = CreateContext())
{
  // One database round-trip per element of the test data.
  foreach (var testElement in TestData)
  {
    result.AddRange(context.Prices.Where(
      x => x.Security.Ticker == testElement.Ticker &&
           x.TradedOn == testElement.TradedOn &&
           x.PriceSourceId == testElement.PriceSourceId));
  }
}

This algorithm is quite basic: for each element in the test data, it locates a corresponding element in the database and adds it to the result set. While easy to implement, read, and maintain, its simplicity comes at the cost of performance. Even with indexes on all three columns, network communication overhead creates a bottleneck.

The performance metrics are as follows:

Results of the first experiment

For larger datasets, the execution time reaches approximately one minute. Memory consumption remains reasonable.

Option 2: Naive with Parallelism

Let’s introduce parallelism into the mix. The underlying idea is to leverage multiple threads to query the database concurrently, potentially improving performance.

var result = new ConcurrentBag<Price>();
var partitioner = Partitioner.Create(0, TestData.Count);

Parallel.ForEach(partitioner, range =>
{
  var subList = TestData.Skip(range.Item1)
                        .Take(range.Item2 - range.Item1)
                        .ToList();
  // DbContext is not thread-safe, so each partition gets its own.
  using (var context = CreateContext())
  {
    foreach (var testElement in subList)
    {
      var query = context.Prices.Where(
            x => x.Security.Ticker == testElement.Ticker &&
                 x.TradedOn == testElement.TradedOn &&
                 x.PriceSourceId == testElement.PriceSourceId);
      foreach (var el in query)
      {
        result.Add(el);
      }
    }
  }
});

Interestingly, this approach is slower than the first solution on smaller datasets but outperforms it on larger ones, showing roughly a two-fold improvement in this instance. Memory consumption increases slightly but stays in the same range.

Results of the second experiment

Option 3: Multiple Contains

Let’s explore another strategy:

  • Create three separate collections containing the unique values of Ticker, PriceSourceId, and TradedOn.
  • Execute a single query filtered by three Contains operations.
  • Perform a local recheck (explained below).
var result = new List<Price>();
using (var context = CreateContext())
{
  var tickers = TestData.Select(x => x.Ticker).Distinct().ToList();
  var dates = TestData.Select(x => x.TradedOn).Distinct().ToList();
  var ps = TestData.Select(x => x.PriceSourceId)
                   .Distinct().ToList();

  var data = context.Prices
               .Where(x => tickers.Contains(x.Security.Ticker) &&
                           dates.Contains(x.TradedOn) &&
                           ps.Contains(x.PriceSourceId))
               .Select(x => new { 
                           x.PriceSourceId,
                           Price = x, 
                           Ticker = x.Security.Ticker, 
                      })
                .ToList();

  // Local recheck: keep only rows that match an exact
  // (Ticker, TradedOn, PriceSourceId) combination from the test data.
  var lookup = data.ToLookup(x =>
     $"{x.Ticker}, {x.Price.TradedOn}, {x.PriceSourceId}");

  foreach (var el in TestData)
  {
    var key = $"{el.Ticker}, {el.TradedOn}, {el.PriceSourceId}";
    result.AddRange(lookup[key].Select(x => x.Price));
  }
}

This approach poses some issues due to its data-dependent nature. While it can be highly performant if only the required records are retrieved, there’s a risk of fetching significantly more data, potentially up to 100 times the necessary amount.

Consider the following test data:

Response data

This query aims to retrieve prices for Ticker1 traded on 2018-01-01 and Ticker2 traded on 2018-01-02, i.e. two records. However, the distinct Ticker values are Ticker1 and Ticker2, and the distinct TradedOn values are 2018-01-01 and 2018-01-02, so the three Contains filters match the full cross product: (Ticker1, 2018-01-01), (Ticker1, 2018-01-02), (Ticker2, 2018-01-01), and (Ticker2, 2018-01-02). The query therefore returns four records instead of two.

This is why a local recheck is required, and it highlights the potential pitfalls of this approach.
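
As an aside, the recheck can be keyed on a ValueTuple instead of a concatenated string, which avoids accidental key collisions (a minimal alternative sketch, not the benchmarked code):

// Tuple keys give structural equality without string formatting,
// so field values containing commas cannot collide.
var lookup = data.ToLookup(
    x => (x.Ticker, x.Price.TradedOn, x.PriceSourceId));

foreach (var el in TestData)
{
  result.AddRange(
      lookup[(el.Ticker, el.TradedOn, el.PriceSourceId)]
          .Select(x => x.Price));
}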

Here are the metrics:

Results of the third experiment

The memory consumption is exceptionally high, and tests with larger datasets timed out after ten minutes.

Option 4: Predicate Builder

Let’s shift gears and explore building an Expression clause for each element of the test data.

var result = new List<Price>();
using (var context = CreateContext())
{
  var baseQuery = from p in context.Prices
                  join s in context.Securities on 
                    p.SecurityId equals s.SecurityId
                  select new TestData()
                  {
                    Ticker = s.Ticker,
                    TradedOn = p.TradedOn,
                    PriceSourceId = p.PriceSourceId,
                    PriceObject = p
                  };

  var tradedOnProperty = typeof(TestData).GetProperty("TradedOn");
  var priceSourceIdProperty =
    typeof(TestData).GetProperty("PriceSourceId");
  var tickerProperty = typeof(TestData).GetProperty("Ticker");

  var paramExpression = Expression.Parameter(typeof(TestData));
  Expression wholeClause = null;
  // Build (TradedOn == .. && PriceSourceId == .. && Ticker == ..) for
  // each element and OR all element clauses together.
  foreach (var td in TestData)
  {
    var elementClause = 
      Expression.AndAlso(
        Expression.Equal(
          Expression.MakeMemberAccess(
            paramExpression, tradedOnProperty),
          Expression.Constant(td.TradedOn)
        ),
        Expression.AndAlso(
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, priceSourceIdProperty),
            Expression.Constant(td.PriceSourceId)
          ),
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, tickerProperty), 
              Expression.Constant(td.Ticker))
          ));

    if (wholeClause == null)
      wholeClause = elementClause;
    else
      wholeClause = Expression.OrElse(wholeClause, elementClause);
  }

  var query = baseQuery.Where(
  (Expression<Func<TestData, bool>>)Expression.Lambda(
     wholeClause, paramExpression)).Select(x => x.PriceObject);

  result.AddRange(query);
}

The resulting code is quite complex, and building expressions involves reflection, which tends to be slow. On the upside, this approach constructs a single query using a chain of (.. AND .. AND ..) OR (.. AND .. AND ..) OR ... conditions.
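
The hand-built tree can be wrapped in a small reusable helper. Below is a minimal sketch (OrAll and ParameterReplacer are hypothetical names, not part of the benchmark) that OR-combines strongly typed predicates by rebinding them to a single shared parameter:

static Expression<Func<T, bool>> OrAll<T>(
    IEnumerable<Expression<Func<T, bool>>> predicates)
{
  var param = Expression.Parameter(typeof(T));
  Expression body = null;
  foreach (var p in predicates)
  {
    // Rewrite each predicate's own parameter to the shared one.
    var rebased = new ParameterReplacer(p.Parameters[0], param)
                      .Visit(p.Body);
    body = body == null ? rebased : Expression.OrElse(body, rebased);
  }
  // An empty input yields a predicate that matches nothing.
  return Expression.Lambda<Func<T, bool>>(
      body ?? Expression.Constant(false), param);
}

sealed class ParameterReplacer : ExpressionVisitor
{
  private readonly ParameterExpression _from, _to;
  public ParameterReplacer(ParameterExpression from,
                           ParameterExpression to)
  { _from = from; _to = to; }

  protected override Expression VisitParameter(ParameterExpression node)
      => node == _from ? _to : base.VisitParameter(node);
}

With such a helper, the per-element clauses can be written as ordinary lambdas and the query shrinks to baseQuery.Where(OrAll(clauses)).Select(x => x.PriceObject).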

Let’s analyze the results:

Results of the fourth experiment

Unfortunately, this approach performs even worse than the previous ones. Because every value is embedded as a constant, each run produces an expression tree with a unique shape, so the translated query cannot be cached, and both translation and execution cost grow with the number of elements.

Option 5: Shared Query Data Table

Our next approach involves introducing an additional table in the database to store query data temporarily. This table facilitates the following steps for each query:

  • Initiate a transaction (if not already started).
  • Populate the temporary table with query data.
  • Execute the query.
  • Roll back the transaction to remove the temporary data.
var result = new List<Price>();
using (var context = CreateContext())
{
  context.Database.BeginTransaction();

  var reducedData = TestData.Select(x => new SharedQueryModel()
  {
    PriceSourceId = x.PriceSourceId,
    Ticker = x.Ticker,
    TradedOn = x.TradedOn
  }).ToList();
  
  // Store the query data in the shared table
  context.QueryDataShared.AddRange(reducedData);
  context.SaveChanges();

  var query = from p in context.Prices
         join s in context.Securities on 
           p.SecurityId equals s.SecurityId
         join t in context.QueryDataShared on 
           new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
           new { t.Ticker, t.TradedOn, t.PriceSourceId }
         select p;
  result.AddRange(query);

  context.Database.RollbackTransaction();
}
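
For completeness, here is a hypothetical sketch of the shared table's entity and its registration; the names SharedQueryModel and QueryDataShared come from the snippet above, while the exact shape is an assumption:

// Hypothetical entity backing the shared query-data table.
public class SharedQueryModel
{
  public int Id { get; set; }              // surrogate key
  public string Ticker { get; set; }
  public DateTime TradedOn { get; set; }
  public int PriceSourceId { get; set; }
}

// Registered on the DbContext so EF can write to it and join against it:
// public DbSet<SharedQueryModel> QueryDataShared { get; set; }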

Let’s examine the metrics:

Results of the fifth experiment

This approach yields impressive results with fast execution times and reasonable memory consumption. However, it comes with several drawbacks:

  • An extra table is required solely for this specific query type.
  • Transactions, which consume database resources, are necessary.
  • Data is written to the database during a read operation, potentially causing issues with read replicas.

Despite these drawbacks, the approach is efficient and readable, and it benefits from query plan caching.

Option 6: MemoryJoin Extension

This approach utilizes a NuGet package called EntityFrameworkCore.MemoryJoin, which, despite its name, supports both EF 6 and EF Core. MemoryJoin sends the query data as VALUES to the server, enabling the SQL server to handle the entire operation.

Let’s take a look at the code:

var result = new List<Price>();
using (var context = CreateContext())
{
  // Select only the properties needed for the join (fewer VALUES columns to transfer)
  var reducedData = TestData.Select(x => new { 
    x.Ticker, 
    x.TradedOn, 
    x.PriceSourceId 
  }).ToList();

  var queryable = context.FromLocalList(reducedData);
  var query = from p in context.Prices
              join s in context.Securities on 
                p.SecurityId equals s.SecurityId
              join t in queryable on 
                new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
                new { t.Ticker, t.TradedOn, t.PriceSourceId }
              select p;

  result.AddRange(query);
}

And here are the metrics:

Results of the final experiment

The results are remarkable, achieving three times the speed of the previous approach and clocking in at 3.5 seconds for 64K records. The code remains simple and understandable, and it functions seamlessly with read-only replicas.

Let’s analyze the generated query for three elements:

SELECT "p"."PriceId",
       "p"."ClosePrice",
       "p"."OpenPrice",
       "p"."PriceSourceId",
       "p"."SecurityId",
       "p"."TradedOn",
       "t"."Ticker",
       "t"."TradedOn",
       "t"."PriceSourceId"
FROM "Price" AS "p"
INNER JOIN "Security" AS "s" ON "p"."SecurityId" = "s"."SecurityId"
INNER JOIN
  ( SELECT "x"."string1" AS "Ticker",
           "x"."date1" AS "TradedOn",
           CAST("x"."long1" AS int4) AS "PriceSourceId"
   FROM
     ( SELECT *
      FROM (
            VALUES (1, @__gen_q_p0, @__gen_q_p1, @__gen_q_p2), 
                   (2, @__gen_q_p3, @__gen_q_p4, @__gen_q_p5), 
                   (3, @__gen_q_p6, @__gen_q_p7, @__gen_q_p8)
           ) AS __gen_query_data__ (id, string1, date1, long1)
       ) AS "x"
   ) AS "t" ON (("s"."Ticker" = "t"."Ticker")
AND ("p"."PriceSourceId" = "t"."PriceSourceId")

In this case, the actual values are efficiently transferred from memory to the SQL server using the VALUES construction. This allows the SQL server to leverage indexes effectively and perform a fast join operation.

However, there are a few limitations (more details available on my blog):

  • An extra DbSet needs to be added to the model (although it doesn’t have to be created in the database).
  • The extension has limitations regarding the number of properties supported in model classes, supporting up to three properties of each data type: string, date, GUID, float/double, and int/byte/long/decimal. While this limitation might be irrelevant in most cases, custom classes can be used as a workaround.
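
To illustrate the second point, here is a hypothetical sketch of the extra entity; the property names are inferred from the column names visible in the generated SQL above (string1, date1, long1), so consult the package documentation for the exact shape:

// Hypothetical query-model entity for MemoryJoin; the table itself
// does not need to exist in the database.
public class QueryModelClass
{
  public int Id { get; set; }
  public string String1 { get; set; }
  public DateTime Date1 { get; set; }
  public long Long1 { get; set; }
  // ... up to three properties of each supported type
}

// On the DbContext:
// public DbSet<QueryModelClass> QueryData { get; set; }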

Conclusion

Based on the tested approaches, MemoryJoin emerges as the clear winner. Some might argue against it because of its limitations, but the performance gains are significant and usually outweigh the drawbacks. Like any powerful tool, it demands careful use and an understanding of its inner workings; in skilled hands it can boost performance dramatically. Perhaps one day Microsoft will incorporate dynamic VALUES support directly into EF Core.

Finally, here are some diagrams to summarize the results:

The first diagram shows the execution time for each approach. MemoryJoin clearly stands out, completing the task within a reasonable timeframe even on the largest inputs; it is one of only four approaches that handled the larger datasets at all.

Time taken in various cases for each experiment

The next diagram illustrates memory consumption, showing comparable results for all approaches except for the “Multiple Contains” method, which exhibits exceptionally high memory usage as previously discussed.

Memory consumption in various cases for each experiment
Licensed under CC BY-NC-SA 4.0