One of my favourite questions to ask during our SQL performance training is whether people think an index scan is better or worse than an index seek. Normally most people will go for the worse option and sure, in many cases that could well be the case, but just like many other questions in SQL Server the actual answer is “it depends”!

Consider the following query on the AdventureWorks database:

SELECT *
FROM [AdventureWorks].[Production].[ProductSubcategory]
WHERE ProductCategoryID = 2;

Pretty straightforward and in order to (hopefully) assist with the where clause (ProductCategoryID = 2) I’ve added the following non-clustered index on the ProductCategoryID column:

CREATE NONCLUSTERED INDEX [IX_NC_ProductCategory] ON [Production].[ProductSubcategory]
(
[ProductCategoryID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

Running the query returns 14 results and the actual execution plan looks like the following:

The plan shows a clustered index scan which might not be what we expected considering that a perfectly good index is available that satisfies the where clause. So in this case, why has the optimiser chosen a scan of the clustered index ahead of an index seek?

Let’s have a look at the estimated subtree cost of the current execution plan by hovering the mouse over the select operator in our graphical plan where the following tooltip is displayed, here we can see that the cost for this query is 0.0033227:

Now we’re going to run the query again but this time we’ll overrule the optimiser from doing what it thinks is the best way to execute the query and force an index seek on the ProductSubCategory table. Our query now looks like this:

SELECT
*
FROM [AdventureWorks].[Production].[ProductSubcategory] WITH (FORCESEEK)
WHERE ProductCategoryID = 2;

We get the following execution plan:

Now we get some different operators. We have the index seek as expected but underneath it we’ve got a key lookup to the clustered index and there’s a nested loop join too. With this additional work let’s see if the cost of this query is higher (worse) or lower (better) than the cost of the query containing the index scan:

Our cost now is 0.0086943 which is actually over double the cost of our index scan. This means that for this particular query an index seek is not the most efficient way to return the desired results which is exactly why the optimiser chose to use an index scan in the first place.

To answer why the optimiser chose this route is all down to the presence of the key lookup operator, which accounts for 61% of our query cost. Because our non-clustered index does not contain the columns asked for in the SELECT statement, in this case * or all of them the execution plan has to join on to the clustered index to retrieve the columns that we’ve asked to be returned and we can check the properties of the lookup operator (again hover the mouse for the tooltip) to see what these columns are (Output List):

The question then becomes is the SELECT * absolutely necessary. If we modify our statement so we only return the column in our non-clustered index, ProductCategoryID, then the query will now look like this:

SELECT
ProductCategoryID
FROM [AdventureWorks].[Production].[ProductSubcategory]
WHERE ProductCategoryID = 2;

We now get the index seek on the non-clustered index:

Interestingly enough the cost is now 0.0032988 which is only slightly less than our original index scan but the main point here is that the optimiser has chosen to utilise the non-clustered index and it is the less expensive option.

In this example an index scan proved to be much better in terms of cost than an index seek but that was all down to the use of SELECT * in the query. If we didn’t actually need every column then we’ve forced the optimiser to perform a complete scan of the table (clustered index) rather than use an index seek, and for no good reason.

Whilst scans can be used by the optimiser because of things like low row counts, non-selective columns or a lack of indexing for example you can imagine how frustrating it would be if a lot of time has been spent on designing effective indexes only for poor coding techniques to completely bypass them, not to mention the potential performance problems.

In this case the difference wasn’t that much in terms of cost difference but the reality is that SELECT * type queries can cause a lot of unnecessary overhead, particularly on large tables so when writing queries remember to limit the returned result sets so that they only contain the columns that are actually required.

WARNING!

The final point is on hints, like the FORCESEEK one we used earlier. Although useful, query or table hints should only be used for very specific performance reasons (or demo purposes). It’s worth bearing in mind that table data will inevitably change over time and that means your execution plans will too, at that point having hard-coded hints to force optimiser behaviour could be seriously detrimental to performance.

 

It Depends: Scans vs Seeks and the dangers of SELECT *
Tagged on:             

Leave a Reply

Your email address will not be published. Required fields are marked *