Monday, February 22, 2010

Understanding the TOP and TABLESAMPLE operators

Result set can be filtered in many ways. Most common ways is, using the 3rd primary property of the SQL SELECT statement which is the WHERE condition which filters specific set of rows from the result set.

This post is not about WHERE condition, it is all about TOP and TABLESAMPLE.

The TOP operator gives you top n number of records as you need. The “top n” can be changed by changing the order of the result set by using ORDER BY clause.

The TABLESAMPLE operator allows you to randomly pick data from the table. You can instruct to SQL Server to return specific number of records or percent of rows.

Where we can use this? I think that this is commonly used for getting the average from a large result set. If you have a table with millions of records and average has to be calculated on one of the columns, it is worthwhile to use TABLESAMPLE (10 PERCENT) rather than using all records.

   1: -- use all rows in the table for calculating
   2: SELECT AVG(Freight)
   3: FROM Sales.SalesOrderHeader
   4: -- use approximately 10% of rows
   5: SELECT AVG(Freight)
   6: FROM Sales.SalesOrderHeader
   7: TABLESAMPLE (10 PERCENT)

REPEATABLE OPTION
If you execute the above two codes again and again, the average returns from first statement is always same but second statement. This is because of the way it picks records. If you need the same average for all the execution of second statement, REPEATABLE option should be used. The REPEATABLE option has to be used with a repeat_seed, and as long as the repeat_seed is same and no records have been change, same average is returned.

   1: SELECT AVG(Freight)
   2: FROM Sales.SalesOrderHeader
   3: TABLESAMPLE (10 PERCENT) REPEATABLE (1)

What is SYSTEM option?
The SYSTEM option is optional, but it is used by default though you do not use it in your query.

   1: SELECT AVG(Freight)
   2: FROM Sales.SalesOrderHeader
   3: TABLESAMPLE SYSTEM (10 PERCENT)

The SYSTEM option returns approximate percentage of rows and generates a random value for each data page. SQL Server decides which data pages to be included for the sample based on the random value generated and the percentage specified in the query. If a page is decided to to be included, all rows in the page is included for sampling, else the page is excluded. If you use set the TABLESAMPLE with number of rows, instead of a percentage, number of rows will be converted to a percent and process same way.

No comments: