Design Meeting Notes - August 2, 2012

CUD Batching

There have been various CodePlex discussions around adding batching support to EF. These have produced useful information and ideas, which are gathered here together with ideas from the team as a starting point for the EF team and/or the community to implement this feature.

Currently, when executing SaveChanges, EF produces and executes a discrete DB operation for each CUD operation. However, SQL Server and some other backends provide a way of sending a set of operations as a single unit, which may result in better performance, especially when the connection to the database has high latency.

This is the most-voted issue on CodePlex, with 14 votes, and a close second on UserVoice, with 1492 votes.

To support batching we would need to solve two problems:

  1. Enable batching in the SQL Server provider
  2. Change the provider model to enable different batching implementations

SQL Server

There are several approaches we could consider:

  • Append all operations in a single command text (see the sketch after this list).
    • If server-generated values are present we could either
      • Preserve the order of the operations and output the values in different result sets
      • Alter the operations to output the generated values to a temporary table and then query that table in a single statement
    • This option can “break” plan caching
      • Not clear that this will be a real problem in the wild
      • Benefits of batching will likely outweigh it, especially in high-latency scenarios
      • Is it possible to disable plan caching? Is it worth it?
      • Might be a reason for allowing batching to be switched off
  • For inserts it is possible to use a single statement (INSERT INTO … OUTPUT INSERTED … VALUES)
    • We need to do some prototyping to get perf measurements to see how much of an advantage this would provide
  • Use SqlDataAdapter
    • It seems the only public way to do this is with DataSets, which is a dependency we should not take and which may not provide real perf improvements anyway
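
As a rough illustration of the first option, the sketch below uses plain ADO.NET (System.Data.SqlClient); the table, column and parameter names and connectionString are invented. An UPDATE, an INSERT and a DELETE are appended into one command, and the INSERT uses OUTPUT INSERTED to return its generated key in its own result set so that operation order is preserved. The multi-row INSERT … OUTPUT INSERTED … VALUES form mentioned above would work the same way, batching several inserts into one statement.

// Rough sketch of the "append all operations in a single command text"
// approach. Table/column names and connectionString are invented; assumes
// using System.Data.SqlClient.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();

    var command = connection.CreateCommand();
    command.CommandText =
        "UPDATE [Customers] SET [Name] = @p0 WHERE [Id] = @p1; " +
        "INSERT INTO [Orders] ([CustomerId], [Total]) OUTPUT INSERTED.[Id] VALUES (@p2, @p3); " +
        "DELETE FROM [OrderLines] WHERE [Id] = @p4;";
    command.Parameters.AddWithValue("@p0", "New Name");
    command.Parameters.AddWithValue("@p1", 42);
    command.Parameters.AddWithValue("@p2", 42);
    command.Parameters.AddWithValue("@p3", 99.95m);
    command.Parameters.AddWithValue("@p4", 7);

    using (var reader = command.ExecuteReader())
    {
        // Only statements with an OUTPUT clause produce a result set; here
        // that is the INSERT, whose generated key comes back in order.
        do
        {
            while (reader.Read())
            {
                var generatedOrderId = reader.GetInt32(0);
            }
        }
        while (reader.NextResult());
    }
}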

Provider Model

Other backends might be restricted to one of the above options or have a different way of achieving this. We need a flexible model to accommodate them.

If an operation relies on a server-generated value from a different operation, the two usually can't be batched together. We can deal with this in one of the following ways:

  • Split the batches in the update pipeline before sending them to the provider
    • This is basically not dealing with the problem, but could be a first step
  • Add the notion of temporary keys to command trees
    • This requires the provider to understand temp keys and key propagation, which is non-trivial
  • Add client key generation strategies: GUID, sequences, hi-lo, etc.
    • When using key generation strategies the key propagation is handled by the state manager before anything is sent to the provider, which means that we would not need to split the batches (see the sketch after this list)
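
For example (Blog, Post and context are invented for illustration), with client-generated GUID keys the dependent row already has its foreign key value before anything reaches the provider, so related inserts do not have to be split:

// Hypothetical example of a client key generation strategy: GUID keys are
// assigned in the application, so key propagation is handled by the state
// manager and both inserts could go to the provider in a single batch.
var blog = new Blog { Id = Guid.NewGuid(), Name = "EF Design Notes" };
var post = new Post { Id = Guid.NewGuid(), BlogId = blog.Id, Title = "CUD Batching" };

context.Blogs.Add(blog);
context.Posts.Add(post);
context.SaveChanges();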

Different providers may have different limits on what they can batch, so the provider itself must be able to decide where to split a batch. Some options:

  • Send one update at a time to the provider.
    • The provider may choose to hold onto the update or to send it to the database batched with previous updates.
    • If it sends the updates, it returns the results back to us as multiple result sets.
    • We will also tell the provider when we have no more updates to send so that the provider can finish the last batch.
  • Send one update at a time to the provider, with the provider using an event to report results back when it decides to send the batch to the server (a possible shape is sketched after this list).
  • Send all updates to the provider at once.
    • This could be a push from EF to the provider, or the provider could pull updates from EF
    • The provider returns one or more data readers with multiple result sets back to EF
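
A possible shape for the event-based option is sketched below; all of the names are invented and this is only one way it could look.

// Hypothetical shape of the event-based option: EF adds one operation at a
// time and the provider raises an event whenever it decides to flush the
// accumulated batch to the server.
abstract class DbBufferedBatch : IDisposable
{
    public abstract void Add(DbCommandTree commandTree);
    public abstract void Complete(); // no more updates; flush whatever is left
    public abstract void Dispose();

    // Raised after each flush; the reader exposes one result set per
    // operation in the flushed batch that returned server-generated values.
    public event EventHandler<BatchSentEventArgs> BatchSent;

    protected virtual void OnBatchSent(BatchSentEventArgs e)
    {
        var handler = BatchSent;
        if (handler != null) handler(this, e);
    }
}

class BatchSentEventArgs : EventArgs
{
    public DbDataReader Reader { get; internal set; }
}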

Open questions

  • Where should MaxBatchSize be exposed?

Possible classes for one-at-a-time approach (async version):

abstract class DbBatchCommand : IDisposable
{
    DbTransaction Transaction
    DbConnection Connection
    int Timeout
    DbBatchResult AddAndExecute(DbCommandTree) 
    Task<DbBatchResult> AddAndExecuteAsync(DbCommandTree)
    DbBatchResult Execute()
    Task<DbBatchResult> ExecuteAsync()
}

class DbBatchResult
{
    bool Executed
    bool HasReader
    int RowsAffected
    DbDataReader Reader
}

abstract class DbProviderServices
{
    DbBatchCommand StartBatch()
}
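
A rough usage sketch for these classes follows; GetCommandTrees, ProcessServerGeneratedValues, providerServices, connection and transaction are placeholders for the update pipeline's own machinery.

// Hypothetical driver for the one-at-a-time API above. The provider decides
// when a buffered batch is actually sent to the server.
using (DbBatchCommand batch = providerServices.StartBatch())
{
    batch.Connection = connection;
    batch.Transaction = transaction;
    batch.Timeout = 30;

    foreach (DbCommandTree commandTree in GetCommandTrees())
    {
        // The provider may buffer the operation or flush a full batch;
        // Executed indicates whether anything was sent to the server.
        DbBatchResult result = batch.AddAndExecute(commandTree);
        if (result.Executed && result.HasReader)
        {
            ProcessServerGeneratedValues(result.Reader);
        }
    }

    // No more updates to send: let the provider finish the last batch.
    DbBatchResult lastResult = batch.Execute();
    if (lastResult.Executed && lastResult.HasReader)
    {
        ProcessServerGeneratedValues(lastResult.Reader);
    }
}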

Bulk operations

There have also been CodePlex discussions about bulk operations. The idea here is to improve the perf of updates and deletes by providing a way to perform many of them on the server without first having to bring the entities into the context. Some suggestions for how the code might look are:

var user = new UserInfoFields(); 
var update = user.Update().Set 
( 
   user.Field1 = 1, 
   user.Field2 = "xxxx" 
).Where(user.Name == "Jim" && user.Enable); 

update.Execute();

context.Employees 
    .Where(e => e.Title == "Spectre") 
    .Update(e => new Northwind.Employee 
    { 
        Title = "Commander"
    });

There are two important questions that need to be answered about a bulk operation implementation:

  • Should calling the API cause the updates to happen immediately or should they happen when SaveChanges is called?
    • At first glance, deferring until SaveChanges is called seems to match existing EF behavior. However, SaveChanges is currently only concerned with writing changes that have been detected in tracked entities. For bulk updates to happen as well, the state manager would have to track which bulk operations are pending, which is a significant change in both mental model and implementation.
    • In addition, the APIs feel like they should send updates immediately, so deferring until SaveChanges could be unintuitive.
    • Decision: Use immediate execution
  • Should performing a bulk update affect local entities being tracked by the state manager?
    • If local entities are not touched, it would be very easy for the state manager to get out of sync with the database, resulting in unexpected behavior on subsequent uses of the context. In other words, it would be easy for people to shoot themselves in the foot.
    • The problem is that it is not in general possible to know exactly what changes will be made to the database, so those changes cannot always be reflected locally.
    • This could be due to the database being out of sync with the context or because of semantic differences in how we interpret the query compared to how the database interprets it—for example, string compare differences.
    • Decision: While it is not possible to be sure that the local changes will exactly match the database changes, it seems that we can get close enough to avoid most foot-shooting. We should aim for this; the sketch below illustrates the intended behavior.
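
To illustrate the two decisions using the second API suggestion above (the Update method is still only a proposal): the bulk update executes immediately, and the state manager also applies the change to matching tracked entities so the context stays close to the database.

// Hypothetical illustration of the decisions: Update executes against the
// database immediately (no SaveChanges needed), and matching tracked
// entities are also updated locally.
var tracked = context.Employees.First(e => e.Title == "Spectre");

context.Employees
    .Where(e => e.Title == "Spectre")
    .Update(e => new Northwind.Employee { Title = "Commander" }); // runs now

Debug.Assert(tracked.Title == "Commander"); // local entity reflects the bulk update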
