Query generation in batch insert scenarios

Topics: EF Runtime
Jul 19, 2012 at 11:30 PM
Edited Jul 24, 2012 at 12:09 PM

I have noticed that EF produces a large number of discrete INSERT statements when SaveChanges is called with a large number of new rows. It would be sensible here if, instead of issuing individual INSERT statements, a single multi-row INSERT ... VALUES statement were used.

In my test this has improved performance of regular SQL statements by a factor of about 8 or 9 and also performs much better in high latency scenarios. 
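To illustrate the difference, here is a sketch using a made-up Customers table (the table and values are invented for illustration):

```sql
-- Current behaviour: one discrete statement (and round trip) per new row
INSERT INTO [Customers] ([Name]) VALUES ('Alice');
INSERT INTO [Customers] ([Name]) VALUES ('Bob');

-- Proposed: a single multi-row statement
INSERT INTO [Customers] ([Name]) VALUES ('Alice'), ('Bob');
```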

Jul 20, 2012 at 4:55 PM


This is definitely something we want to add to the product, we've even done some prototyping in the past around this. The relevant feature on the backlog is here - http://entityframework.codeplex.com/workitem/53.


Jul 21, 2012 at 11:10 PM

Hi Rowan, can you make that prototype public, either here or on the ADO.NET blog?

Jul 22, 2012 at 2:54 AM
Edited Jul 22, 2012 at 4:13 AM

Hi Ladislav,

Great to hear from you! I don't think we have kept a copy of the prototype Rowan is referring to. In any case, it was only an initial prototype used as a proof of concept to demonstrate the potential perf benefits of batching, not a general-purpose implementation. Here are some details I remember:

  1. It was implemented as a wrapping provider (based on the same library for wrapping providers we have here: http://code.msdn.microsoft.com/EFProviderWrappers-c0b88f32). 
  2. Our goal was not to batch only inserts but all CUD operations performed by SaveChanges. Therefore, it didn't really create a single INSERT statement for all inserts but rather concatenated groups of CUD operations into a single CommandText.
  3. DbCommand.ExecuteNonQuery would in most cases cause the CommandText to be appended to the previous commands instead of causing actual execution.
  4. The connection kept track of pending commands and parameters.
  5. Only in some cases we would flush the pending commands:
    1. If an output parameter was passed (we had some ideas for managing store-generated values but didn't implement any in the prototype, so the strategy was simply to give up on batching if an INSERT/UPDATE was sent that required retrieving store-generated values afterwards)
    2. If a stored procedure was used, or some other way of executing the command, e.g. ExecuteScalar
    3. The configured maximum size of the batch was reached on the connection.
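The buffering behaviour described in points 2-5 could be sketched like this (all names here are invented for illustration; the real prototype wrapped the provider's DbCommand via the wrapping-provider library and would also have to rename parameters to avoid collisions between buffered commands):

```csharp
using System.Data.Common;
using System.Text;

// Hypothetical sketch of the buffering idea: instead of executing, a
// wrapping command appends its CommandText to a shared buffer, which is
// flushed when a batch-breaking condition is hit.
public class BatchingCommand
{
    private readonly StringBuilder _pending = new StringBuilder();
    private readonly DbConnection _connection;
    private readonly int _maxBatchSize;
    private int _pendingCount;

    public BatchingCommand(DbConnection connection, int maxBatchSize)
    {
        _connection = connection;
        _maxBatchSize = maxBatchSize;
    }

    public void ExecuteNonQuery(string commandText, bool hasOutputParameters)
    {
        if (hasOutputParameters)
        {
            // Give up on batching for commands that need store-generated values
            Flush();
            RunImmediately(commandText);
            return;
        }

        _pending.Append(commandText).AppendLine(";");
        if (++_pendingCount >= _maxBatchSize)
        {
            Flush();
        }
    }

    public void Flush()
    {
        if (_pendingCount == 0) return;
        RunImmediately(_pending.ToString());
        _pending.Length = 0;
        _pendingCount = 0;
    }

    private void RunImmediately(string sql)
    {
        using (DbCommand cmd = _connection.CreateCommand())
        {
            cmd.CommandText = sql;
            cmd.ExecuteNonQuery();
        }
    }
}
```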

Challenges of the design:

  1. One of the main challenges with this approach (and the reason the prototype wasn't a good starting point for the real implementation) is that EF expects each CUD operation executed in SaveChanges to report a rows-affected count of 1. EF uses this mainly to detect concurrency errors. The strategy implemented in the prototype was to simply lie to EF and always return 1, which isn't really a safe thing to do. A real implementation of batching would need to address this somehow. It is not necessarily the only approach, but one thing that comes to mind is to add a new API to the provider that could be used to build batches and to somehow defer the detection of concurrency errors until the command is actually executed.
  2. As mentioned above we also would like to experiment with ways we can handle store generated values without having to break the batch. Store generated values often need to be propagated to other in-memory objects but also to other CUD statements that could be part of the same batch or another batch. For the case of IDENTITY columns in particular if we had a first class concept of identity generators we could rely on this instead of having to retrieve the values from the server on every INSERT, but it would be nice to solve the general problem as well.
  3. One of the things we need to figure out is what is the minimal abstraction that data providers would need to support (we want the feature to be implementable by any provider, but different back-ends and providers may have different capabilities that may affect how batching needs to work).


Jul 22, 2012 at 10:38 AM

Hi Diego,

that was quite an interesting idea, basing the batching on a provider wrapper - I never thought about it this way. My own initial design (without knowledge of the code that must be changed) was based on solving the second challenge, and I think it could solve the first challenge as well - but it is a somewhat brute-force approach.

Every modification command is currently followed by a select command returning values for identity or computed columns and the row count. With computed and database-generated columns this simply cannot be avoided, and using database-generated Ids has been considered "a best practice" for all versions of EF - I don't think this can simply be changed to generators. There should still be some support for batching with database-generated columns and computed columns as well.

My high level idea for solving the challenge with database generated values is based on:

  • Use SSDL to define a temporary table for the batch - the table will have a record Id column and columns for every identity or computed property from all entities in the batch
  • Update command trees to insert a record into the temporary table instead of selecting values after the command - each command tree will have a unique Id in some map which pairs the command (and the record in the temporary table) with the entity. The map must allow pairing an entity with multiple commands to support TPT and entity splitting. The map must also allow pairing a single command with multiple entities to support table splitting.
  • Create the temporary table at the beginning of the batch and drop it when the batch completes
  • Select all records from the temporary table once all modifications in the batch have executed
  • Use the Id from the table to pair each result-set record with its entity and update the entity with the retrieved values
  • Context options will have a control property to enable or disable batching
  • Context options will have a control property defining the minimum number of commands for batching to be used (because handling the temporary table may have some performance impact)
  • Context options will have a control property defining the maximum size of the batch - multiple batches will be used if the batch size doesn't allow all necessary commands to be executed in a single batch, but the temporary table will be shared among all these batches.
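A rough T-SQL sketch of what a batch built this way might look like (table and column names are invented; the real design would generate this from the SSDL):

```sql
-- Hypothetical temporary table holding store-generated values for the batch
CREATE TABLE #BatchResults
(
    RecordId    int NOT NULL,         -- pairs the row with a command in the batch
    GeneratedId int NULL,             -- IDENTITY value produced by the command
    ComputedVersion varbinary(8) NULL -- e.g. a rowversion column value
);

-- Each modification writes its generated values into the table
-- instead of SELECTing them back immediately
INSERT INTO [Orders] ([CustomerName]) VALUES ('Alice');
INSERT INTO #BatchResults (RecordId, GeneratedId)
VALUES (1, SCOPE_IDENTITY());

-- ...more commands...

-- One SELECT at the end of the batch retrieves everything at once
SELECT RecordId, GeneratedId, ComputedVersion FROM #BatchResults;
DROP TABLE #BatchResults;
```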

Its biggest challenge is the same as your third challenge: it can cause breaking changes for all existing providers. It will also not work with mapped stored procedures or custom modification commands.

I think this idea can be extended to include row count as well but EF will be notified about concurrency error only after executing the whole batch which is a big disadvantage.


Jul 24, 2012 at 11:48 AM
Edited Jul 24, 2012 at 12:44 PM

Hey Diego + Ladislav

I quite like the idea of pushing the EF CUD operations into a single command text for transmission to the database server. In my tests this method did around 20% better than EF 5 on batch inserts. I think this would be a great addition especially for batch (but not group based) updates.

However my tests suggest that you can do heaps better by going to a single insert statement for batch creates.

On my box I could insert 10k rows into a simple table in ~8s with a single command text containing multiple statements, and in ~1s when using a single statement*

I think that as long as the dataset can be reduced to a consecutive set of inserts to a table (which I believe the current implementation does?), this method could give drastic performance benefits when inserting more than 100 items.

I'm not too sure what the story is around concurrency detection but at a guess I would think something like the following statement structure:

  INSERT INTO [test-db].[dbo].[TestEntities] (TestInt, TestString) OUTPUT INSERTED.[id] VALUES (1, 'sffsf'), (2, 'sfdf')

If this returns fewer than 2 rows then the statement has failed; if not, we can harvest the generated ids from it to populate the entities.

Perhaps this is a bit simplistic, am I missing something key?

I'm still not too sure how this will work with a provider model. I think you would need some kind of fall-back to non-batched execution if the provider doesn't support some part of the batched query.

* (this isn't quite true, I actually chunked it into 200 row statements due to http://stackoverflow.com/questions/8635818/multiple-insert-statements-vs-single-insert-with-multiple-values)

Jul 25, 2012 at 9:54 AM

@lukemcgregor: Nice! I didn't know about the OUTPUT feature of T-SQL modification commands. That makes my idea with a temporary table obsolete for SQL Server.

Jul 25, 2012 at 11:21 AM

Yeah, it seems pretty cool. I'm not too sure about its performance as I haven't run it through my tester yet, but hopefully it will be comparable. I'm also not sure how many providers implement something similar - I think in Oracle it's replaced by a RETURNING clause.
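For reference, the Oracle equivalent would look roughly like this (a sketch only; as I understand it, Oracle's RETURNING ... INTO works per statement with bind variables rather than with multi-row VALUES lists, so the batching story would differ):

```sql
-- Single-row insert returning the generated key into a bind variable
INSERT INTO TestEntities (TestInt, TestString)
VALUES (1, 'sffsf')
RETURNING id INTO :new_id;
```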

Jul 30, 2012 at 3:20 PM

Hello, guys! Good work!

Just my five cents: what might also be interesting is the possibility of having different id generation strategies instead of server-generated ones. I'm thinking of NHibernate's HiLo, for example, which allows batching very efficiently.
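For readers unfamiliar with the pattern, a minimal HiLo-style generator might look like this (all names are invented for illustration; NHibernate's actual implementation also persists the hi value in a table and synchronizes concurrent access):

```csharp
using System;

// Sketch of a HiLo id generator: the database is contacted only once per
// block of ids, so inserts can be batched without waiting for IDENTITY
// values to come back from the server.
public class HiLoGenerator
{
    private readonly int _blockSize;
    private readonly Func<long> _fetchNextHi; // e.g. an UPDATE on a HiLo table
    private long _hi = -1;
    private int _lo;

    public HiLoGenerator(int blockSize, Func<long> fetchNextHi)
    {
        _blockSize = blockSize;
        _fetchNextHi = fetchNextHi;
    }

    public long Next()
    {
        if (_hi < 0 || _lo >= _blockSize)
        {
            _hi = _fetchNextHi(); // one round trip reserves a whole block
            _lo = 0;
        }
        return _hi * _blockSize + _lo++;
    }
}
```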



Jul 30, 2012 at 3:25 PM

Even created a feature suggestion a while ago:


Aug 3, 2012 at 9:08 AM

Adding batch inserts to an ORM is a trap. Here's why: an ORM is a system with a persistence core and, on top of that, entity services which provide additional functionality to the developer for working with entities and entity sets.

As soon as you add batching for inserts or updates, you can't have any pre/post save services for an entity if they rely on data inside the DB. For example, auditing successful saves is impossible, as a batch fails or succeeds in one go (it's one command). Things like graph persistence are also not helped by batching: no FK synchronization between the PK side and FK side is possible inside a batch.

If you look at the use cases for batching and when it can be efficient, there are two which come to mind:

1) useless benchmarks
2) import systems.

I'm sure you'll agree that use case 1) can be ignored. Use case 2 is an interesting one. Every database system has its own bulk import system. In general these systems are optimized to bypass the SQL interpreter entirely and to move as much data as possible directly from input to destination. Building an import system on top of an ORM is very inefficient compared to any bulk import system, e.g. SQL bulk copy. The only situation where you might want to use an ORM here is when you want to do client-side pre-processing on the entity data before you insert it. In the 10 years I've now been writing ORMs, I can assure you the number of times I've run into clients using that kind of system is small compared to general ORM usage.

For updates, SQL bulk copy isn't useful, but a separate bulk API on the ORM is, e.g. to bulk delete or bulk update entity sets directly inside the DB with 1 query.

These two things - SQL bulk copy and bulk update/deletes at the API level - combined with the limited set of cases where batching really is more efficient, show that this feature, although much requested, is not something most applications will really benefit from.


Aug 3, 2012 at 8:15 PM

Hi Frans,

Thanks for the feedback. It is our intention to test various options in high-latency situations where it appears that batching could improve performance for relatively common scenarios--i.e. not simply bulk imports and not unrepresentative benchmarks. If the results from these tests indicate that there would be no real benefit from batching then we would reconsider implementing it.


Aug 17, 2012 at 8:48 AM

Hey guys

I did a few tests on the comparative performance of the different SQL insert methods, which show up some of the details around query plan calculation and SQL bulk copy, and how all of this stacks up against doing individual inserts.

Sql insert performance comparison


For more details have a look at my post on this available here.

Aug 17, 2012 at 5:15 PM

Hi Luke,

That's some great information. One thing I'm curious about is how these numbers change as the latency between the app and the server increases. What is your current setup in terms of where the server is located compared to the app? Same machine? Local network? Different continents? It might be possible to use networking tools to artificially increase the latency to the server--Andriy on our team did something like this a while back for async testing, so he can probably provide some pointers if needed.


Aug 17, 2012 at 5:40 PM
Edited Aug 22, 2012 at 9:22 AM

Hey Arthur

Yeah, those tests were against a local database. I'll grab a latency generator and try a few more scenarios; we already use a few at work, as NZ is a long way from everywhere :)



Aug 27, 2012 at 10:45 AM

Hey Arthur

I've done a few tests on how latency affects insert performance, and the results are about what you would expect, i.e. when you batch multiple statements into a single statement or transmission you are affected far less by latency. For example, in my tests at 100ms (intercontinental latency) it took around 17 mins to insert 10k rows via sequential insert commands (such as those currently used by EF) but only 7.5s to insert those same rows using batched multi-row inserts (approx 135x faster). With a more reasonable 10ms latency it took a little over 2 mins for sequential inserts and 1.6 seconds for multi-row inserts (approx 75x faster).

Have a read of my article on this which goes into more details around the results.

Aug 27, 2012 at 11:05 AM

In all honesty, who's going to insert 10K rows over a 100ms latency connection? Be aware that batching only has benefits in bulk inserts.

Aug 27, 2012 at 12:37 PM
Edited Aug 27, 2012 at 12:38 PM

Hi Frans

I can tell that you do not want the query generation in Entity Framework modified, but I am unsure why you are so concerned about the proposed modifications.

When testing, it's often important to test beyond the scope of normal operation. Such testing, especially in performance scenarios, outlines and exaggerates potential bottlenecks and Big O issues.

If it would seem more reasonable to you to consider a much smaller amount of latency and fewer rows, I would encourage you to read my results and examine the differences at a mere 5ms RTT and 100 rows.

I think it is also really important to note that EF is a batch-based transactional system by design: inserts, updates and deletes are all processed at the same time when you call SaveChanges. While other frameworks (which in my experience mostly transmit to SQL on the spot and then commit on save) have less opportunity to reduce latency by aggregating statements, EF is in a somewhat unique position to do so.

As a bit of a footnote, I would like to challenge your assumption that ORMs shouldn't be used in bulk scenarios. Often performance is not the only factor in such scenarios: Entity Framework provides a number of benefits around readability and simplicity of code over and above simply allowing you to write to a database. In addition, I feel there is value in being able to use a single ORM across your whole application, even if a section of that application involves some kind of bulk operation. For these reasons developers are often willing to take a performance hit (as long as it isn't too large) to get a more maintainable system.

As an example of this, just last week I was asked whether there was any way to use EF for a 1,000,000-row recalculation operation. While you may feel this is the job of a bulk tool, the individual who asked had found that using a bulk tool (while giving performance gains) substantially increased the maintenance costs of the application; in addition, the code was harder to follow and needed extra transformation logic. The recalculation was a small component of an application which successfully used EF for the remainder of its code. In this particular scenario the developer was willing to sacrifice some performance for some application simplicity (and this isn't the only case like this I've heard of).

I hope this helps you understand why examining options for query improvement has some really great potential gains for Entity Framework. I think everyone here is interested in getting the very best results possible and would not be interested in such improvements if they felt they were not warranted or sensible.


Luke McGregor

Sep 5, 2012 at 11:46 PM


Thanks for your excellent analysis. This is some really interesting information. We're currently starting to ramp up on this on our side and are looking at how we can collaborate with you and others on the implementation, possibly through a shared fork or similar.


Sep 6, 2012 at 12:26 AM

Hey Arthur

Sounds great, I think EF is a really brilliant framework and I'm really keen to contribute wherever I can :)


Sep 7, 2012 at 4:18 PM
Edited Sep 7, 2012 at 4:18 PM

I created a fork to prototype the approaches mentioned above and see how they compare in EF. If you would like to contribute please send me a message.



Sep 11, 2012 at 10:22 PM
Edited Sep 11, 2012 at 10:24 PM

Yesterday I made the first commit to the fork with some initial prototyping of the provider API for batching:

public class DbBatchCommand : IDisposable
{
    public virtual DbTransaction Transaction { get; set; }
    public virtual DbConnection Connection { get; set; }
    public virtual int? CommandTimeout { get; set; }
    public virtual DbParameterCollection Parameters { get; }

    public virtual AddToBatchResult TryAdd(DbCommandTree commandTree,
        Dictionary<string, Tuple<TypeUsage, object>> parameterValues, bool hasReader);

    public DbDataReader Execute();
    public DbDataReader Execute(CommandBehavior behavior);
    public Task<DbDataReader> ExecuteAsync(CancellationToken cancellationToken);
    public Task<DbDataReader> ExecuteAsync(CommandBehavior behavior, CancellationToken cancellationToken);
}

public enum AddToBatchResult
{
    NotAdded,
    AddedSameResultSet,
    AddedDifferentResultSet
}

abstract class DbProviderServices
{
    public DbBatchCommand StartBatch();
}

public interface IDbBatchConfiguration
{
    int MaxUpdateBatchSize { get; }
}

The idea behind it is that both the provider and the caller can end the batch at any time, and the result of executing the batch is presented as a single DbDataReader that can have a different result set for each of the batched commands, or some of the commands grouped into one result set. Commands that would not normally return a DbDataReader should now return the number of rows affected as a row. This allows checking for concurrency issues on an individual-command basis while still allowing the provider to batch together different types of commands.
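A caller might consume this API roughly as follows (an illustrative sketch based on the semantics just described; UpdateCommand, orderedCommands and the helper method are invented names, and the actual code in the fork may differ):

```csharp
// Hypothetical consumption of the proposed batching API.
DbBatchCommand batch = providerServices.StartBatch();
batch.Connection = connection;
batch.Transaction = transaction;

foreach (UpdateCommand command in orderedCommands)
{
    AddToBatchResult result =
        batch.TryAdd(command.Tree, command.ParameterValues, false);
    if (result == AddToBatchResult.NotAdded)
    {
        // The provider ended the batch: execute it and start a new one.
        ExecuteAndVerify(batch);
        batch = providerServices.StartBatch();
        batch.TryAdd(command.Tree, command.ParameterValues, false);
    }
}
ExecuteAndVerify(batch);

// Non-reader commands report their rows-affected count as a row, so
// concurrency violations are still detectable per command.
static void ExecuteAndVerify(DbBatchCommand batch)
{
    using (DbDataReader reader = batch.Execute())
    {
        do
        {
            while (reader.Read())
            {
                if (reader.GetInt32(0) != 1)
                    throw new OptimisticConcurrencyException();
            }
        }
        while (reader.NextResult());
    }
}
```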

This approach, however, limits batching to groups of commands that are mutually independent (e.g. that don't use server-generated values from previous commands). One way of dealing with this, as already mentioned in the thread, is using id generation strategies; this could also be prototyped in the fork. Another way is to change the command trees sent to the provider to include the notion of temporary ids that EF uses internally; the provider could then replace them with parameters and still execute all the commands in one batch, but it would break existing providers.

I also included a default provider implementation that creates batches of a single command. And updated the SQL Server provider to do batching by concatenating the commands.

We are interested in measuring how different batching strategies compare when used with EF. Currently the candidates are:

  1. Concatenating CRUD statements (already in the fork)
  2. Grouping sequential inserts or deletes in a single statement
  3. Using Table-Valued parameters to group sequential inserts or updates

Since the last two options only work for certain type of operations a combination of different strategies could be used to achieve the best results.

I am currently still busy with other tasks, so I probably won't be able to work on the prototype in the following two weeks, but will monitor the discussions. Any further feedback, ideas or code contributions would be very helpful at this stage.

Some other open questions include handling output parameters and store procedures that return multiple result sets as well as whether other providers can accommodate the results from a batch into one data reader or should we change the API to accept multiple readers.



Sep 19, 2012 at 8:55 AM

Hey Andriy

I've just been taking a look through the EF source to try and understand how SaveChanges fits together, I'm especially looking to understand at what point the batching stuff is hooked into the process. Im getting a bit of an idea from checking out the changes in your last commit but if you guys have any diagrams or the like I would find them really helpful.


Sep 26, 2012 at 7:59 PM

There isn't anything that explains the internal workings of EF yet. I could give you an overview of what happens during SaveChanges:

  1. Extract changes from the ObjectStateManager
  2. Group changes by C-Space extent (entity set)
  3. For each C-Space extent use the CQT view represented by a DbCommandTree to perform propagation (get changes in S-Space terms represented by ChangeNodes). This is handled by System.Data.Entity.Core.Mapping.Update.Internal.Propagator
  4. Merge S-Space inserts and deletes into updates where appropriate
  5. Produce S-Space commands implementing the modifications (insert, delete and update SQL statements) represented by UpdateCommands
  6. Produce a topological ordering of the commands, as some of them could depend on server-generated values produced by others. The secondary ordering is by kind (SQL statement or stored procedure call), operation type, target table, entity container of the table, keys in the table and finally a deterministic differentiator based on identifier values.
  7. Execute the commands recording the server-generated values and propagating them to subsequent commands
  8. Propagate the server-generated values to the ObjectStateManager

Most of this is handled by System.Data.Entity.Core.Mapping.Update.Internal.UpdateTranslator

The batching support only affects steps 6 and 7. Now we use the topological ordering to produce batches of commands, with each command in a batch independent from the others. We then execute the commands in batches, taking into consideration that the provider could end a batch after any command.

Oct 9, 2012 at 12:19 PM
Edited Oct 9, 2012 at 12:22 PM
AndriySvyryd wrote:

    public virtual AddToBatchResult TryAdd(DbCommandTree commandTree,
        Dictionary<string, Tuple<TypeUsage, object>> parameterValues, bool hasReader)


Did you consider using a method definition like the following?

 public virtual AddToBatchResult TryAdd(DbCommandDefinition commandDefinition,
        Dictionary<string, Tuple<TypeUsage, object>> parameterValues, bool hasReader)

In this way, during common scenarios like inserting/updating/deleting multiple records in the same table, the provider does not have to construct the command definition for each command, because the same DbCommandDefinition instance can be passed here (which would clone a prototype DbCommand instance). I currently do not see the downsides of this solution, so let me know if I am wrong.

Oct 10, 2012 at 11:16 PM

Hello tamasflamich,

Yes, I did consider the signature that you are suggesting. Currently we don't reuse the same DbCommandDefinition instance, but even if we did choose to use this optimization it would still be possible to do on the provider side. The main reason I decided to pass DbCommandTree instead is that it allows the provider to build one DbCommandDefinition that represents several trees. The secondary reason is that in the future we might add temporary identifiers to the trees to allow batching of dependent commands.

Oct 13, 2012 at 9:09 AM

Hey Andriy

I think I understand a little more about how the code fits together, but before I go too much further I just wanted to bounce my idea for implementing multiple inserts in a single statement off you.

I've taken a look at DbmlSqlGenerator, which compiles the query structure of insert statements, and broken it into parts to separate the individual row-generation portion of the insert statement from the surrounding code. This means we can execute the middle row-generation portion multiple times in a batch scenario.

The next step is to differentiate between batch inserts and regular inserts, and to do this I think the best way is to create a new CommandTree for batch inserts which is basically an enumerable of insert command trees. This could be passed down from your batch command: instead of creating multiple individual insert command trees, we would only create one bulk-insert command tree.


Oct 15, 2012 at 10:08 PM

Hi Luke,

I think you are moving in the right direction. If creating a single tree proves too cumbersome, an alternative could be adding an overload that accepts an IEnumerable of CommandTree instead of just one CommandTree.

It's good to see you are making progress.



Nov 21, 2012 at 1:31 PM

Hello Luke,

I'm wondering if you have anything new to say about this. We regularly have to insert 1000+ rows for some tasks, and having this as a batch operation would speed up the process so much.

Nov 21, 2012 at 6:13 PM

Hey @SomebodyYouKnow

I have been meaning to get a bit more traction on the improvements I've outlined in this thread but have been pretty busy lately, so I haven't had as much time to work on it as I would like. In any case, such changes will take a while to filter into the main trunk and be released publicly.

Having said that, there are a bunch of improvements you can make to tune EF performance under the current releases in scenarios such as those you are talking about. I would recommend reading up on DetectChanges and the impact it has on EF performance as you increase the batch size, as this is probably the biggest optimisation you can currently make with the framework. There are a bunch of articles out there on what DetectChanges does and how disabling AutoDetectChanges while in a batch can improve performance in those scenarios.
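For example, the AutoDetectChanges tweak looks like this with the EF 4.1+ DbContext API (MyContext, MyEntities and itemsToInsert are placeholder names):

```csharp
// Disable automatic change detection while adding many entities, then
// restore it before saving. DetectChanges otherwise runs on every Add,
// which gets progressively slower as the tracked set grows.
using (var context = new MyContext())
{
    try
    {
        context.Configuration.AutoDetectChangesEnabled = false;
        foreach (string item in itemsToInsert)
        {
            context.MyEntities.Add(new MyEntity { Name = item });
        }
    }
    finally
    {
        context.Configuration.AutoDetectChangesEnabled = true;
    }
    context.SaveChanges();
}
```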

Arthur Vickers has a great article series on how DetectChanges works, and I've written an article on what these performance impacts look like as batch size increases (http://blog.staticvoid.co.nz/2012/05/entityframework-performance-and.html).

Hope this helps

Feb 7, 2013 at 5:25 PM
I'm pretty sure Entity Framework doesn't even use prepared commands when doing inserts/updates, e.g. it's not calling DbCommand.Prepare() the first time through and then only sending parameter values for the actual inserts/updates. I found this surprising. Does anyone know if there is a reason for this? I'm pretty sure that if it at least used prepared commands, insert/update performance would be significantly improved even without batching. I'm wondering if I should make a separate feature request for this.
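For reference, the prepared-command pattern in plain ADO.NET looks roughly like this (the table, column and collection names are made up for illustration):

```csharp
using System.Data;
using System.Data.Common;

// Prepare the parameterized statement once, then reuse it per row,
// sending only new parameter values on each execution.
using (DbCommand cmd = connection.CreateCommand())
{
    cmd.CommandText = "INSERT INTO [Items] ([Name]) VALUES (@name)";
    DbParameter p = cmd.CreateParameter();
    p.ParameterName = "@name";
    p.DbType = DbType.String;
    p.Size = 100;
    cmd.Parameters.Add(p);

    cmd.Prepare(); // compile the statement on the server once

    foreach (string name in names)
    {
        p.Value = name; // only the parameter value changes per row
        cmd.ExecuteNonQuery();
    }
}
```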

Also, I would like a method that allows you to clear the DbContext so that you can minimize the number of objects being tracked. NHibernate has a way of doing this. As far as I know, the only way to do this with Entity Framework is to create a new DbContext and dispose of the old one, and I have found that doing this significantly slows things down.

Regarding what someone else was saying about how batch updates aren't something EF should focus on: I agree with Luke that batching is still useful in terms of code simplicity and maintainability even if it is slower - it just depends how much slower. The good thing about EF is that it's DBMS-independent. Say you want to develop an import application that supports any DBMS: EF is a great way to do that. It would be unfortunate if your main app was DBMS-independent thanks to EF and you then had to write lower-level, DBMS-specific code for the import app.
Mar 10, 2013 at 2:27 AM
I just wanted to say thanks for all the work everyone is putting into this. I'm definitely looking forward to when this gets into a usable form. I currently have one project that was using datasets over MySQL, and Visual Studio 2012 doesn't seem to like the dataset designer anymore, so I've been converting the project over to EF 5. However, after I was done I quickly realized just how much slower EF 5 was than datasets because of the lack of batch support for inserts/updates/deletes. The application regularly sends between 1-60k SQL statements (insert/update/delete) on each thread, with 8 threads, to a remote database (10-50ms latency). It is extremely painful to watch.

Disappointing, as EF seemed like such a perfect fit. All the data is already in objects that mirror a database table, reading from multiple different source types, and I wanted a database-agnostic solution: we are using MySQL currently and likely will continue to for some time, but we do want to move to another database platform "at some point".

Plus, we were already using EF at my day job so it seemed nice to consolidate on one technology if possible. I've always hated datasets, and LINQtoSQL was great, but it only worked on SQL Server. EF being able to work over SQL Server, SQL Compact, MySQL, Postgresql, and AzureSQL seemed like it would do the trick.
Mar 14, 2013 at 10:45 PM
In our data access layer based on EF 5.0, we've had great results for some time now doing bulk inserts using the LinqEntityDataReader. It basically lets you build up a huge collection of entities, then bulk insert them using the SqlBulkCopy object.

Here is a basic example of the usage. In this case TempSessionId is an EF entity created in our DB-first model. I build up a huge list of new ones that I want to bulk insert, then invoke SqlBulkCopy, referencing the table to which the entity maps.
List<TempSessionId> newEntries = new List<TempSessionId>();

foreach (int id in itemIds)
{
    TempSessionId atsa = new TempSessionId();
    atsa.ItemID = id;
    atsa.SessionID = setId;
    newEntries.Add(atsa);
}

//Bulk insert account temp session
using (SqlConnection con = new SqlConnection(context.Database.Connection.ConnectionString))
{
    con.Open();
    using (SqlTransaction tran = con.BeginTransaction())
    {
        SqlBulkCopy bc = new SqlBulkCopy(con,
            SqlBulkCopyOptions.CheckConstraints |
            SqlBulkCopyOptions.FireTriggers |
            SqlBulkCopyOptions.KeepNulls, tran);

        bc.BatchSize = 1000;
        bc.DestinationTableName = "TempSessionIds";

        // Stream the entities to the server via the LinqEntityDataReader
        // extension, which exposes the list as an IDataReader
        bc.WriteToServer(newEntries.AsDataReader());
        tran.Commit();
    }
}