Thursday, March 2, 2017

Speeding up INSERTs to Columnstore index by bypassing the Deltastore

Columnstore index is designed for retrieving large number of records much faster with aggregations and grouping. With SQL Server 2016, we can make the table as Clustered Columnstore table and have additional Rowstore indexes for handing SEEK types of requests. Generally we make our data warehousing tables as Clustered Columnstore table but it is not limited to data warehouses.

When inserting records to Clustered Columnstore table, there are few best practices for speeding up the process;
  1. Use Bulk Operations for inserting instead of inserting row by row.
  2. Load data parallel
There are multiple ways of inserting records as a bulk insert. Generally, when we have data to be inserted, in data warehousing, we load the data into staging and then load to main table. When loading from staging to main table, if we use INSERT INTO SELECT... then it becomes a bulk operation. What is the benefit? Benefit is, getting data directly into Row Group without sending them into Deltastore. However, this happens only when the batch size is 102,400 or above.

See below code. First code load data using a bulk operation but number of records is less than 102,400. See the result after that.

INSERT INTO dbo.Sales_2
 ([DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
      ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
      ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate])
SELECT TOP (5000) [DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
      ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
      ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate]
FROM ContosoRetailDW.dbo.FactSales
WHERE YEAR(datekey) IN (2008, 2009);


As you see, Deltastore is used for holding records. The next code inserts more than 102,400 records. See the result.

SELECT * 
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.Sales_2');

INSERT INTO dbo.Sales_2
 ([DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
      ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
      ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate])
SELECT TOP (102401) [DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
      ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
      ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate]
FROM ContosoRetailDW.dbo.FactSales
WHERE YEAR(datekey) IN (2008, 2009)
ORDER BY DateKey;

-- check all rowgroups
SELECT * 
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.Sales_2');


As you see, records go directly to Compressed Row Group which is the best. See what happen when you insert same number of records without using a bulk operation.

SELECT * 
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.Sales_2');

DECLARE cur CURSOR
FOR
SELECT TOP (102401) [DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
      ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
      ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate]
FROM ContosoRetailDW.dbo.FactSales
WHERE YEAR(datekey) IN (2008, 2009)
ORDER BY DateKey;

DECLARE @DateKey datetime ,@channelKey int ,@StoreKey int ,@ProductKey int ,@PromotionKey int ,@CurrencyKey int ,
  @UnitCost money ,@UnitPrice money ,@SalesQuantity int ,@ReturnQuantity int ,@ReturnAmount money ,@DiscountQuantity int ,
  @DiscountAmount money ,@TotalCost money ,@SalesAmount money ,@ETLLoadID int ,@LoadDate datetime ,@UpdateDate datetime 

OPEN cur
FETCH NEXT FROM cur INTO @DateKey , @channelKey, @StoreKey, @ProductKey, @PromotionKey, @CurrencyKey,@UnitCost, @UnitPrice,
  @SalesQuantity, @ReturnQuantity, @ReturnAmount, @DiscountQuantity, @DiscountAmount, @TotalCost,
  @SalesAmount, @ETLLoadID, @LoadDate, @UpdateDate

WHILE (@@FETCH_STATUS = 0)
BEGIN

 INSERT INTO dbo.Sales_2
  ([DateKey] ,[channelKey] ,[StoreKey] ,[ProductKey] ,[PromotionKey] ,[CurrencyKey]
    ,[UnitCost] ,[UnitPrice] ,[SalesQuantity] ,[ReturnQuantity] ,[ReturnAmount] ,[DiscountQuantity]
    ,[DiscountAmount] ,[TotalCost] ,[SalesAmount] ,[ETLLoadID] ,[LoadDate] ,[UpdateDate])
 values
  (@DateKey , @channelKey, @StoreKey, @ProductKey, @PromotionKey, @CurrencyKey,@UnitCost, @UnitPrice,
  @SalesQuantity, @ReturnQuantity, @ReturnAmount, @DiscountQuantity, @DiscountAmount, @TotalCost,
  @SalesAmount, @ETLLoadID, @LoadDate, @UpdateDate)

 FETCH NEXT FROM cur INTO @DateKey , @channelKey, @StoreKey, @ProductKey, @PromotionKey, @CurrencyKey,@UnitCost, @UnitPrice,
  @SalesQuantity, @ReturnQuantity, @ReturnAmount, @DiscountQuantity, @DiscountAmount, @TotalCost,
  @SalesAmount, @ETLLoadID, @LoadDate, @UpdateDate
END 

CLOSE cur
DEALLOCATE cur

-- check all rowgroups
SELECT * 
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.Sales_2');
GO


See the result. Row group is not compressed. This will be automatically compressed when the number of records exceed 1,048,576 but it is better to use a bulk operation for getting it compressed with initial loading.

No comments: