T-SQL commands performance comparison – NOT IN vs SQL NOT EXISTS vs SQL LEFT JOIN vs SQL EXCEPT
This article gives you a performance comparison of NOT IN, SQL NOT EXISTS, SQL LEFT JOIN and SQL EXCEPT.
The T-SQL command library in Microsoft SQL Server, which is updated in each version with new commands and enhancements to existing ones, provides us with different ways to perform the same action. In addition to this ever-evolving toolkit of commands, different developers apply different techniques and approaches to the same problem sets and challenges.
For example, three different SQL Server developers can get the same data using three different queries, each following their own approach to writing the T-SQL that retrieves or modifies the data. The database administrator, however, will not necessarily be happy with all of these approaches, because the administrator looks at them from angles the developers may not concentrate on. Although all of the queries may return the same required result, each one behaves differently and consumes a different amount of SQL Server resources over a different execution time. Together, these parameters shape the query performance. It is the database administrator's role to tune the performance of these queries and choose the method with the minimum possible effect on overall SQL Server performance.
In this article, we will describe the different ways to retrieve the rows of one table that do not exist in another table, and compare the performance of these approaches. The methods use the NOT IN, SQL NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands. Before starting the performance comparison, we will briefly describe each of these commands.
The SQL NOT IN command allows you to specify multiple values in the WHERE clause. You can imagine it as a series of NOT EQUAL TO comparisons combined with the AND condition. The NOT IN command compares the values of a column in the first table with the values of a column in the second table or a subquery, and returns all values from the first table that are not found in the second table, without filtering for distinct values. NULL values need care: because a comparison with NULL evaluates to UNKNOWN, a single NULL returned by the subquery makes NOT IN return no rows at all.
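The way NOT IN treats NULL is easy to check on a small scale. The sketch below uses two hypothetical table variables, @A and @B, which are illustration-only names and not part of the test tables used later in this article.

```sql
-- Hypothetical demo data (not the article's test tables).
DECLARE @A TABLE (Cat_ID INT);
DECLARE @B TABLE (Cat_ID INT);

INSERT INTO @A (Cat_ID) VALUES (1), (2), (3);
INSERT INTO @B (Cat_ID) VALUES (1), (NULL);

-- Returns no rows: for 2 and 3 the comparison with NULL is UNKNOWN,
-- so the NOT IN predicate never evaluates to TRUE.
SELECT Cat_ID
FROM @A
WHERE Cat_ID NOT IN (SELECT Cat_ID FROM @B);

-- Excluding NULLs inside the subquery returns 2 and 3.
SELECT Cat_ID
FROM @A
WHERE Cat_ID NOT IN (SELECT Cat_ID FROM @B WHERE Cat_ID IS NOT NULL);
```

If the compared column is declared NOT NULL in both tables, the two queries return the same rows.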
The SQL NOT EXISTS command is used to check that the provided subquery returns no rows. The subquery itself does not return any data to the outer query; the check evaluates to TRUE when the subquery finds no matching row and to FALSE when it finds at least one.
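As a minimal sketch of this existence check, the example below again uses hypothetical table variables (@A and @B are illustration-only names). Because the subquery is only tested for rows, its select list does not matter, and SELECT 1 is a common convention.

```sql
-- Hypothetical demo data (not the article's test tables).
DECLARE @A TABLE (Cat_ID INT);
DECLARE @B TABLE (Cat_ID INT);

INSERT INTO @A (Cat_ID) VALUES (1), (2), (3);
INSERT INTO @B (Cat_ID) VALUES (1), (NULL);

-- Returns 2 and 3: for each row of @A the correlated subquery is tested
-- for matching rows only. The NULL in @B never matches, so it does not
-- suppress the result.
SELECT A.Cat_ID
FROM @A AS A
WHERE NOT EXISTS (SELECT 1 FROM @B AS B WHERE B.Cat_ID = A.Cat_ID);
```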
The LEFT JOIN command returns all records from the first (left) table together with the matched records from the second (right) table. For left table records that have no match in the right table, the right-side columns are returned as NULL.
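The following sketch shows what a LEFT JOIN returns before and after an IS NULL filter is applied. The table variables @A and @B are hypothetical names used only for illustration, and @B deliberately contains a duplicate value.

```sql
-- Hypothetical demo data (not the article's test tables).
DECLARE @A TABLE (Cat_ID INT);
DECLARE @B TABLE (Cat_ID INT);

INSERT INTO @A (Cat_ID) VALUES (1), (2), (3);
INSERT INTO @B (Cat_ID) VALUES (1), (1);

-- Unfiltered join: Cat_ID 1 appears twice (one row per match in @B),
-- while 2 and 3 appear once each with NULL on the right side.
SELECT A.Cat_ID AS A_Cat_ID, B.Cat_ID AS B_Cat_ID
FROM @A AS A
LEFT JOIN @B AS B ON A.Cat_ID = B.Cat_ID;

-- Keeping only the rows without a match returns 2 and 3.
SELECT A.Cat_ID
FROM @A AS A
LEFT JOIN @B AS B ON A.Cat_ID = B.Cat_ID
WHERE B.Cat_ID IS NULL;
```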
The EXCEPT command returns all distinct records from the first SELECT statement that are not returned by the second SELECT statement, with each SELECT statement considered a separate dataset. In other words, it takes all distinct records from the first dataset and removes from that result the records returned from the second dataset. You can imagine it as a combination of the SQL NOT EXISTS command and the DISTINCT clause. Take into consideration that the left and right datasets of the EXCEPT command must have the same number of columns, in the same order and with compatible data types.
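To illustrate the DISTINCT behavior of EXCEPT, the sketch below compares it with NOT EXISTS on the same data. The table variables @A and @B are hypothetical names used only for illustration; @A contains a duplicate and a NULL on purpose.

```sql
-- Hypothetical demo data (not the article's test tables).
DECLARE @A TABLE (Cat_ID INT);
DECLARE @B TABLE (Cat_ID INT);

INSERT INTO @A (Cat_ID) VALUES (1), (2), (2), (NULL);
INSERT INTO @B (Cat_ID) VALUES (1);

-- EXCEPT removes duplicates: the result holds 2 and NULL, once each
-- (row order is not guaranteed).
SELECT Cat_ID FROM @A
EXCEPT
SELECT Cat_ID FROM @B;

-- NOT EXISTS keeps duplicates: the result holds 2, 2 and NULL.
SELECT A.Cat_ID
FROM @A AS A
WHERE NOT EXISTS (SELECT 1 FROM @B AS B WHERE B.Cat_ID = A.Cat_ID);
```

The two forms are therefore interchangeable only when the selected columns of the first dataset contain no duplicate rows.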
Now, let us see in practical terms how to retrieve the data of one table that does not exist in another table using the different methods, and compare their performance to conclude which one behaves best. We will start by creating two new tables, using the T-SQL script below:
USE SQLShackDemo
GO
CREATE TABLE Category_A
( Cat_ID INT ,
Cat_Name VARCHAR(50)
)
GO
CREATE TABLE Category_B
( Cat_ID INT ,
Cat_Name VARCHAR(50)
)
GO
After creating the tables, we will fill each table with 10K records for testing purposes, using ApexSQL Generate.
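If ApexSQL Generate is not at hand, a comparable data set can be produced in plain T-SQL. The sketch below is an assumption-laden substitute, not the data used for the measurements in this article: the sequential Cat_ID values, the generated names, and the 5000-row offset that makes half of the keys overlap are all choices made here for illustration.

```sql
USE SQLShackDemo
GO
-- Generate 10,000 sequential numbers by cross joining a system catalog view.
WITH Numbers AS (
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects AS o1
    CROSS JOIN sys.all_objects AS o2
)
INSERT INTO Category_A (Cat_ID, Cat_Name)
SELECT n, 'Category ' + CAST(n AS VARCHAR(10))
FROM Numbers;

-- Category_B gets keys 5001..15000, so keys 1..5000 exist only in Category_A
-- (assumed overlap, chosen for illustration).
WITH Numbers AS (
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects AS o1
    CROSS JOIN sys.all_objects AS o2
)
INSERT INTO Category_B (Cat_ID, Cat_Name)
SELECT n + 5000, 'Category ' + CAST(n + 5000 AS VARCHAR(10))
FROM Numbers;
GO
```

Timings and read counts obtained with this data will differ from the figures reported below.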
The testing tables are ready now. We will enable the TIME and IO statistics and use them to compare the performance of the different methods. After that, we will prepare the T-SQL queries that pull the data that exists in the Category_A table but does not exist in the Category_B table using four methods: the NOT IN command, the SQL NOT EXISTS command, the LEFT JOIN command and finally the EXCEPT command. This can be achieved using the T-SQL script below:
USE SQLShackDemo
GO
SET STATISTICS TIME ON
SET STATISTICS IO ON
-- NOT IN
SELECT Cat_ID
FROM Category_A WHERE Cat_ID NOT IN (SELECT Cat_ID FROM Category_B)
GO
-- NOT EXISTS
SELECT A.Cat_ID
FROM Category_A A WHERE NOT EXISTS (SELECT B.Cat_ID FROM Category_B B WHERE B.Cat_ID = A.Cat_ID)
GO
-- LEFT JOIN
SELECT A.Cat_ID
FROM Category_A A
LEFT JOIN Category_B B ON A.Cat_ID = B.Cat_ID
WHERE B.Cat_ID IS NULL
GO
-- EXCEPT
SELECT A.Cat_ID
FROM Category_A A
EXCEPT
SELECT B.Cat_ID
FROM Category_B B
GO
If you execute the previous script, you will find that the four methods return the same result, with the same number of records returned by each command.
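One way to confirm that the four methods agree is to count the rows each one returns in a single statement. This is only a sketch; the column aliases are names chosen here for illustration.

```sql
USE SQLShackDemo
GO
-- One row with four counts; matching counts indicate the methods agree
-- on the number of returned records.
SELECT
    (SELECT COUNT(*)
     FROM Category_A
     WHERE Cat_ID NOT IN (SELECT Cat_ID FROM Category_B)) AS Not_In_Rows,
    (SELECT COUNT(*)
     FROM Category_A AS A
     WHERE NOT EXISTS (SELECT 1 FROM Category_B AS B
                       WHERE B.Cat_ID = A.Cat_ID)) AS Not_Exists_Rows,
    (SELECT COUNT(*)
     FROM Category_A AS A
     LEFT JOIN Category_B AS B ON A.Cat_ID = B.Cat_ID
     WHERE B.Cat_ID IS NULL) AS Left_Join_Rows,
    (SELECT COUNT(*)
     FROM (SELECT Cat_ID FROM Category_A
           EXCEPT
           SELECT Cat_ID FROM Category_B) AS E) AS Except_Rows;
```

The counts can legitimately differ when Cat_ID contains duplicates or NULL values, because EXCEPT returns distinct rows and NOT IN returns nothing once the subquery yields a NULL.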
At this step, the SQL Server developer will be happy, as whichever method is used returns the same result. But what about the SQL Server database administrator who needs to check the performance of each approach? If we review the IO and TIME statistics generated after executing the previous script, we see that the query that uses the NOT IN command performs 10062 logical reads on the Category_B table, takes 228ms to complete and consumes 63ms of CPU time.
On the other hand, the query that uses the SQL NOT EXISTS command performs only 29 logical reads on the Category_B table, takes 154ms to complete and consumes 15ms of CPU time, which is much better in all aspects than the previous method that uses NOT IN.
The query that uses the LEFT JOIN command performs the same number of logical reads as the SQL NOT EXISTS method, 29 logical reads, takes 151ms to complete and consumes 16ms of CPU time, which is close to the statistics of the SQL NOT EXISTS method.
Finally, the statistics generated by the method that uses the EXCEPT command show that it again performs 29 logical reads, takes 218ms to complete and consumes 15ms of CPU time, which is worse than the SQL NOT EXISTS and LEFT JOIN methods in terms of execution time.
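The elapsed times above differ by tens of milliseconds, so single runs can be affected by caching. As a hedged sketch of one way to make repeated measurements more comparable, the script below clears the buffer cache before a run. It is not part of the original test procedure, it needs elevated permissions, and it should only ever be run on a test instance.

```sql
USE SQLShackDemo
GO
-- TEST INSTANCE ONLY: flush dirty pages, then empty the buffer cache so the
-- next query starts from a cold cache.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
-- Repeat this batch for each of the four methods, clearing the cache before each.
SELECT A.Cat_ID
FROM Category_A A WHERE NOT EXISTS (SELECT B.Cat_ID FROM Category_B B WHERE B.Cat_ID = A.Cat_ID)
GO
```

Logical reads count pages requested from the buffer cache, so they stay the same between cold and warm runs; the cold cache mainly changes physical reads and elapsed time.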
Up to this point, we can derive from the IO and TIME statistics that the methods using the SQL NOT EXISTS and LEFT JOIN commands behave best, with the best overall performance. But will the query execution plans tell us the same story? Let us check the execution plans generated for the previous queries using ApexSQL Plan, a free tool for SQL Server query plan analysis.
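The plans can also be captured without a separate tool. As a minimal sketch, SET STATISTICS XML returns the actual execution plan of each statement as an extra XML result set; this is an alternative shown for convenience, not the method used to produce the plans discussed in this article.

```sql
USE SQLShackDemo
GO
-- Return the actual execution plan as XML alongside the query results.
SET STATISTICS XML ON
GO
SELECT A.Cat_ID
FROM Category_A A WHERE NOT EXISTS (SELECT B.Cat_ID FROM Category_B B WHERE B.Cat_ID = A.Cat_ID)
GO
SET STATISTICS XML OFF
GO
```

In SQL Server Management Studio, clicking the returned XML opens the graphical plan, where operators such as Hash Match or Nested Loops can be inspected.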
The execution plans cost summary window shows that the methods using the SQL NOT EXISTS and LEFT JOIN commands have the lowest execution costs, and that the method using the NOT IN command has the heaviest query cost.
Let us dive deeper to understand how each method behaves by studying its execution plan.
The execution plan for the query that uses the NOT IN command is a complex plan with a number of heavy operators that perform looping and counting operations. What we will concentrate on here, for performance comparison purposes, are the Nested Loops operators. Looking at these operators, you can see that they are not full join operators; they perform something called a Left Anti Semi Join. This partial join operator returns all rows from the first (left) table that have no matching rows in the second (right) table, skipping all rows that match between the two tables. The heaviest operator in the plan generated for the NOT IN query is the Row Count Spool operator, which scans the unsorted Category_B table, counts how many rows are returned, and returns only the row count without any data, for row existence check purposes only.
The execution plan generated for the query that uses the SQL NOT EXISTS command is simpler than the previous plan. Its heaviest operator is the Hash Match operator, which again performs a Left Anti Semi Join partial join operation that checks for unmatched rows, as described previously.
Comparing the execution plan of the SQL NOT EXISTS query with the plan generated for the query that uses the LEFT JOIN command, the new plan replaces the partial join with a Filter operator. This operator applies the IS NULL filtering to the data returned from the Right Outer Join operator, which returns the matching rows from the second table and may include duplicates.
The last execution plan, generated for the query that uses the EXCEPT command, also contains the Left Anti Semi Join partial join operation that checks for unmatched rows, as shown previously. In addition, it performs an Aggregate operation, implemented here as a Hash Aggregate because the table is large and its records are unsorted. The Hash Aggregate builds a hash table in memory, which makes it a heavy operation. A hash value is calculated for each processed row, and for each calculated hash value the operator checks the rows in the resulting hash bucket for matching rows.
We can conclude again from the execution plans generated for each command that the best two methods are the ones using the SQL NOT EXISTS and LEFT JOIN commands. Recall that the data in the tables is not sorted, because the tables have no indexes. So, let us create an index on the joining column, Cat_ID, in both tables using the T-SQL script below:
USE [SQLShackDemo]
GO
CREATE NONCLUSTERED INDEX [IX_CAT_ID] ON [dbo].[Category_A]
(
[Cat_ID] ASC
)
GO
CREATE NONCLUSTERED INDEX [IX_CAT_ID] ON [dbo].[Category_B]
(
[Cat_ID] ASC
)
GO
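Before re-running the comparison, it can be worth confirming that both indexes exist. The query below is a small sketch that reads the sys.indexes catalog view; the column aliases are names chosen here for illustration.

```sql
USE [SQLShackDemo]
GO
-- List the indexes on both test tables. A heap (a table without a clustered
-- index) also appears as a row with a NULL name and type_desc = 'HEAP'.
SELECT OBJECT_NAME(i.object_id) AS Table_Name,
       i.name                   AS Index_Name,
       i.type_desc              AS Index_Type
FROM sys.indexes AS i
WHERE i.object_id IN (OBJECT_ID('dbo.Category_A'), OBJECT_ID('dbo.Category_B'));
GO
```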
The execution plans cost summary window generated by ApexSQL Plan after re-running the previous SELECT statements shows that the method using the SQL NOT EXISTS command is still the best one, and that the method using the EXCEPT command improved clearly after the indexes were added to the tables.
Checking the execution plan for the SQL NOT EXISTS query, the Hash Match operator seen earlier is now replaced by a Merge Join operator, because the indexes provide the data in sorted order.
The query that uses the EXCEPT command improved clearly after the indexes were added to the tables and became one of the best methods to achieve our goal here. This is also visible in its new execution plan, in which the previous partial join operation is likewise replaced by a Merge Join operator, as the indexes provide the data in sorted order. The Hash Aggregate operator is also replaced by a Stream Aggregate operator, because it now aggregates sorted data.
Conclusion:
SQL Server provides us with different ways to retrieve the same data, leaving it to each SQL Server developer to follow their own development approach. For example, there are different ways to retrieve data from one table that does not exist in another table. In this article, we gave a brief description of the NOT IN, SQL NOT EXISTS, LEFT JOIN and EXCEPT T-SQL commands, showed how to get such data with each of them, and compared the performance of the resulting queries. We conclude, first, that the SQL NOT EXISTS and LEFT JOIN commands are the best choices from all performance aspects on tables without indexes. We also added an index on the joining column of both tables, after which the query that uses the EXCEPT command improved clearly and showed better performance, while the SQL NOT EXISTS command remained the best choice overall.