SQL Server has been around over 30 years, and I’ve been working with SQL Server for almost as long. I’ve seen a lot of changes over the years (and decades!) and versions of this incredible product. In these posts, I’ll share with you how I look at some of the features or aspects of SQL Server, sometimes along with a bit of historical perspective.
Last time I talked about hashing in a SQL Server query plan as a potentially problematic operator. Hashing is frequently used for joins and aggregation when there is no useful index. And like scans (which I talked about in the first post in this series), there are times when hashing is actually a better choice than the alternatives. For hash joins, one of the alternatives is LOOP JOIN, which I also told you about last time.
In this post, I’ll tell you about another alternative for hashing. Most of the alternatives to hashing require that the data be sorted, so either the plan needs to include a SORT operator, or the required data must be already sorted due to existing indexes.
For JOIN operations, the most common and useful type of JOIN is a LOOP JOIN. I described the algorithm for a LOOP JOIN in the previous post. Although the data itself doesn’t need to be sorted for a LOOP JOIN, the presence of an index on the inner table makes the join much more efficient and as you should know, the presence of an index implies some sorting. While a clustered index sorts the data itself, a nonclustered index sorts the index key columns. In fact, in most cases, without the index, SQL Server’s optimizer will choose to use the HASH JOIN algorithm. We saw this in the example last time, that without indexes, HASH JOIN was chosen, and with indexes, we got a LOOP JOIN.
The third type of join is a MERGE JOIN. This algorithm works on two already sorted datasets. If we are trying to combine (or JOIN) two sets of data that are already sorted, it just takes a single pass through each set to find the matching rows. Here’s the pseudocode for the merge join algorithm:
get first row R1 from input 1
get first row R2 from input 2
while not at the end of either input
if R1 joins with R2
output (R1, R2)
get next row R2 from input 2
else if R1 < R2
get next row R1 from input 1
get next row R2 from input 2
Although MERGE JOIN is a very efficient algorithm, it does require that both input datasets be sorted by the join key, which usually means having a clustered index on the join key for both of the tables. Since you only get one clustered index per table, choosing the clustered key column just to allow MERGE JOINS to happen might not be the best overall choice for clustering key. So usually, I don’t recommend that you try to build indexes just for purpose of MERGE JOINS, but if you end up getting a MERGE JOIN due to already existing indexes, it’s usually a good thing. In addition to requiring that both input datasets be sorted, MERGE JOIN also requires that at least one of the datasets have unique values for the join key. Let’s look at an example. First, we’ll recreate the Headers and Details tables:
DROP TABLE IF EXISTS Details;
SELECT * INTO Details FROM Sales.SalesOrderDetail;
DROP TABLE IF EXISTS Headers;
SELECT * INTO Headers FROM Sales.SalesOrderHeader;
CREATE CLUSTERED INDEX Header_index on Headers(SalesOrderID);
CREATE CLUSTERED INDEX Detail_index on Details(SalesOrderID);
Next, look at the plan for a join between these tables:
FROM Details d JOIN Headers h
ON d.SalesOrderID = h.SalesOrderID;
Here’s the plan:
Note that even with a clustered index on both tables, we get a HASH JOIN. We can rebuild one of the indexes to be UNIQUE. In this case, it has to be the index on the Headers table, because that is the only one that has unique values for SalesOrderID.
CREATE UNIQUE CLUSTERED INDEX Header_index on Headers(SalesOrderID) WITH DROP_EXISTING;
Now, run the query again, and notice that the plan does how a MERGE JOIN.
These plans benefit from having the data already sorted in an index, as the execution plan can take advantage of the sorting. But sometimes, SQL Server has to do sorting as part of its query execution. You may occasionally see a SORT operator show up in a plan even if you don’t ask for sorted output. If SQL Server thinks that a MERGE JOIN might be a good option, but one of the tables does not have the appropriate clustered index, and it is small enough to make the sorting very inexpensive, a SORT could be performed to allow MERGE JOIN to be used.
But usually, the SORT operator shows up in queries where we have asked for sorted data with ORDER BY, as in the following example.
SELECT * FROM Details
ORDER BY ProductID;
The clustered index is scanned (which is the same as scanning the table) and then the rows are sorted as requested.
But what if the data is already sorted in a clustered index, and the query includes an ORDER BY on the clustered key column? In the example above, we built a clustered index on SalesOrderID in the Details table. Look at the following two queries:
SELECT * FROM Details;
SELECT * FROM Details
ORDER BY SalesOrderID;
If we run these queries together, the Quest Spotlight Tuning Pack Analysis Window indicates that the two plans are equal cost; each is 50% of the total. So, what is actually the difference between them?
Both queries are scanning the clustered index and SQL Server knows that if the pages of the leaf level are followed in order, the data will come back in clustered key order. No additional sorting needs to be done, so no SORT operator is added to the plan. But there IS a difference. We can click on the Clustered Index Scan operator and will get some detailed information.
First, look at the detailed information for the first plan, for the query without the ORDER BY.
The details tell us that the “Ordered” property is False. There is no requirement here that the data be returned in sorted order. It turns out that in most cases, the easiest way to retrieve the data is to follow the pages of the clustered index, so the data will end up being returned in order, but there is no guarantee. What the False property means is that there is no requirement that SQL Server follows the ordered pages to return the result. There are actually other ways that SQL Server can get all the rows for the table, without following the clustered index. If during execution, SQL Server chooses to use a different method to get the rows, we would not see ordered results.
For the second query, the details look like this:
Because the query included an ORDER BY, there IS a requirement that the data be returned in sorted order and SQL Server will follow the pages of the clustered index, in order.
The most important thing to remember here is that NO guarantee of sorted data if you don’t include ORDER BY in your query. Just because you have a clustered index, there is still no guarantee! Even if every single time for last year that you ran the query, you got the data back in order without the ORDER BY, there is no guarantee that you will continue to get the data back in order. Using ORDER BY is the only way to guarantee the order in which your results are returned.
So, is a SORT an operation to be avoided? Just like scans and hash operations, the answer is, of course, ‘it depends’. Sorting can be very expensive, especially on large datasets. Proper indexing does help SQL Server avoid performing SORT operations because an index basically means your data is presorted. But indexing comes with a cost. There is storage cost, in addition to maintenance cost, for every index. If your data is heavily updated, you need to keep the number of indexes to a minimum.
If you find that some of your slow running queries do show SORT operations in their plans, and if those SORTs are among the most expensive operators in the plan, you can consider building indexes that allow SQL Server to avoid the sorting. But you’ll need to do thorough testing to make sure that the additional indexes do not slow down other queries that are crucial to your overall application performance.