Menu:
miscellaneous-optimization-tips:: Other Optimization Tips
The core logic of a database application is performed through SQL statements, whether issued directly through an interpreter or submitted behind the scenes through an API. The tuning guidelines in this section help to speed up all kinds of MySQL applications. The guidelines cover SQL operations that read and write data, the behind-the-scenes overhead for SQL operations in general, and operations used in specific scenarios such as database monitoring.
File: manual.info.tmp, Node: select-optimization, Next: subquery-optimization, Prev: statement-optimization, Up: statement-optimization
Menu:
table-scan-avoidance:: Avoiding Full Table Scans
Queries, in the form of *note 'SELECT': select. statements, perform all the lookup operations in the database. Tuning these statements is a top priority, whether to achieve sub-second response times for dynamic web pages, or to chop hours off the time to generate huge overnight reports.
Besides *note 'SELECT': select. statements, the tuning techniques for queries also apply to constructs such as *note 'CREATE TABLE...AS SELECT': create-table-select, *note 'INSERT INTO...SELECT': insert-select, and 'WHERE' clauses in *note 'DELETE': delete. statements. Those statements have additional performance considerations because they combine write operations with the read-oriented query operations.
NDB Cluster supports a join pushdown optimization whereby a qualifying join is sent in its entirety to NDB Cluster data nodes, where it can be distributed among them and executed in parallel. For more information about this optimization, see *note ndb_join_pushdown-conditions::.
The main considerations for optimizing queries are:
To make a slow 'SELECT ... WHERE' query faster, the first thing to check is whether you can add an index. Set up indexes on columns used in the 'WHERE' clause, to speed up evaluation, filtering, and the final retrieval of results. To avoid wasted disk space, construct a small set of indexes that speed up many related queries used in your application.
Indexes are especially important for queries that reference different tables, using features such as joins and foreign keys. You can use the *note 'EXPLAIN': explain. statement to determine which indexes are used for a *note 'SELECT': select. See *note mysql-indexes:: and *note using-explain::.
Isolate and tune any part of the query, such as a function call, that takes excessive time. Depending on how the query is structured, a function could be called once for every row in the result set, or even once for every row in the table, greatly magnifying any inefficiency.
Minimize the number of full table scans in your queries, particularly for big tables.
Keep table statistics up to date by using the *note 'ANALYZE TABLE': analyze-table. statement periodically, so the optimizer has the information needed to construct an efficient execution plan.
Learn the tuning techniques, indexing techniques, and configuration parameters that are specific to the storage engine for each table. Both 'InnoDB' and 'MyISAM' have sets of guidelines for enabling and sustaining high performance in queries. For details, see *note optimizing-innodb-queries:: and *note optimizing-queries-myisam::.
You can optimize single-query transactions for 'InnoDB' tables, using the technique in *note innodb-performance-ro-txn::.
Avoid transforming the query in ways that make it hard to understand, especially if the optimizer does some of the same transformations automatically.
If a performance issue is not easily solved by one of the basic guidelines, investigate the internal details of the specific query by reading the *note 'EXPLAIN': explain. plan and adjusting your indexes, 'WHERE' clauses, join clauses, and so on. (When you reach a certain level of expertise, reading the *note 'EXPLAIN': explain. plan might be your first step for every query.)
Adjust the size and properties of the memory areas that MySQL uses for caching. With efficient use of the 'InnoDB' buffer pool, 'MyISAM' key cache, and the MySQL query cache, repeated queries run faster because the results are retrieved from memory the second and subsequent times.
Even for a query that runs fast using the cache memory areas, you might still optimize further so that it requires less cache memory, making your application more scalable. Scalability means that your application can handle more simultaneous users, larger requests, and so on without experiencing a big drop in performance.
Deal with locking issues, where the speed of your query might be affected by other sessions accessing the tables at the same time.
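A minimal sketch of the first two guidelines above (adding an index, verifying it with 'EXPLAIN', and refreshing statistics with 'ANALYZE TABLE'); the table and column names are hypothetical:

```sql
-- Hypothetical table and column names, for illustration only
CREATE INDEX idx_customer_city ON customer (city);

-- Verify that the optimizer now chooses the index for the WHERE clause
EXPLAIN SELECT name FROM customer WHERE city = 'Oslo';

-- Keep table statistics current so the optimizer can build a good plan
ANALYZE TABLE customer;
```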
File: manual.info.tmp, Node: where-optimization, Next: range-optimization, Prev: select-optimization, Up: select-optimization
8.2.1.1 WHERE Clause Optimization
.................................
This section discusses optimizations that can be made for processing 'WHERE' clauses. The examples use *note 'SELECT': select. statements, but the same optimizations apply for 'WHERE' clauses in *note 'DELETE': delete. and *note 'UPDATE': update. statements.
Note:
Because work on the MySQL optimizer is ongoing, not all of the optimizations that MySQL performs are documented here.
You might be tempted to rewrite your queries to make arithmetic operations faster, while sacrificing readability. Because MySQL does similar optimizations automatically, you can often avoid this work, and leave the query in a more understandable and maintainable form. Some of the optimizations performed by MySQL follow:
Removal of unnecessary parentheses:
((a AND b) AND c OR (((a AND b) AND (c AND d))))
-> (a AND b AND c) OR (a AND b AND c AND d)
Constant folding:
(a<b AND b=c) AND a=5
-> b>5 AND b=c AND a=5
Constant condition removal:
(b>=5 AND b=5) OR (b=6 AND 5=5) OR (b=7 AND 5=6)
-> b=5 OR b=6
Constant expressions used by indexes are evaluated only once.
'COUNT(*)' on a single table without a 'WHERE' is retrieved directly from the table information for 'MyISAM' and 'MEMORY' tables. This is also done for any 'NOT NULL' expression when used with only one table.
Early detection of invalid constant expressions. MySQL quickly detects that some *note 'SELECT': select. statements are impossible and returns no rows.
'HAVING' is merged with 'WHERE' if you do not use 'GROUP BY' or aggregate functions ('COUNT()', 'MIN()', and so on).
For each table in a join, a simpler 'WHERE' is constructed to get a fast 'WHERE' evaluation for the table and also to skip rows as soon as possible.
All constant tables are read first before any other tables in the query. A constant table is any of the following:
* An empty table or a table with one row.
* A table that is used with a 'WHERE' clause on a 'PRIMARY KEY'
or a 'UNIQUE' index, where all index parts are compared to
constant expressions and are defined as 'NOT NULL'.
All of the following tables are used as constant tables:
SELECT * FROM t WHERE PRIMARY_KEY=1;
SELECT * FROM t1,t2
WHERE t1.PRIMARY_KEY=1 AND t2.PRIMARY_KEY=t1.id;
The best join combination for joining the tables is found by trying all possibilities. If all columns in 'ORDER BY' and 'GROUP BY' clauses come from the same table, that table is preferred first when joining.
If there is an 'ORDER BY' clause and a different 'GROUP BY' clause, or if the 'ORDER BY' or 'GROUP BY' contains columns from tables other than the first table in the join queue, a temporary table is created.
If you use the 'SQL_SMALL_RESULT' modifier, MySQL uses an in-memory temporary table.
Each table index is queried, and the best index is used unless the optimizer believes that it is more efficient to use a table scan. At one time, a scan was used based on whether the best index spanned more than 30% of the table, but a fixed percentage no longer determines the choice between using an index or a scan. The optimizer now is more complex and bases its estimate on additional factors such as table size, number of rows, and I/O block size.
In some cases, MySQL can read rows from the index without even consulting the data file. If all columns used from the index are numeric, only the index tree is used to resolve the query.
Before each row is output, those that do not match the 'HAVING' clause are skipped.
Some examples of queries that are very fast:
SELECT COUNT(*) FROM TBL_NAME;
SELECT MIN(KEY_PART1),MAX(KEY_PART1) FROM TBL_NAME;
SELECT MAX(KEY_PART2) FROM TBL_NAME
WHERE KEY_PART1=CONSTANT;
SELECT ... FROM TBL_NAME
ORDER BY KEY_PART1,KEY_PART2,... LIMIT 10;
SELECT ... FROM TBL_NAME
ORDER BY KEY_PART1 DESC, KEY_PART2 DESC, ... LIMIT 10;
MySQL resolves the following queries using only the index tree, assuming that the indexed columns are numeric:
SELECT KEY_PART1,KEY_PART2 FROM TBL_NAME WHERE KEY_PART1=VAL;
SELECT COUNT(*) FROM TBL_NAME
WHERE KEY_PART1=VAL1 AND KEY_PART2=VAL2;
SELECT MAX(KEY_PART2) FROM TBL_NAME GROUP BY KEY_PART1;
The following queries use indexing to retrieve the rows in sorted order without a separate sorting pass:
SELECT ... FROM TBL_NAME
ORDER BY KEY_PART1,KEY_PART2,... ;
SELECT ... FROM TBL_NAME
ORDER BY KEY_PART1 DESC, KEY_PART2 DESC, ... ;
File: manual.info.tmp, Node: range-optimization, Next: index-merge-optimization, Prev: where-optimization, Up: select-optimization
8.2.1.2 Range Optimization
..........................
The 'range' access method uses a single index to retrieve a subset of table rows that are contained within one or several index value intervals. It can be used for a single-part or multiple-part index. The following sections describe conditions under which the optimizer uses range access.
*note range-access-single-part::
*note range-access-multi-part::
*note equality-range-optimization::
*note row-constructor-range-optimization::
*note range-optimization-memory-use::
Range Access Method for Single-Part Indexes
For a single-part index, index value intervals can be conveniently represented by corresponding conditions in the 'WHERE' clause, denoted as range conditions rather than 'intervals.'
The definition of a range condition for a single-part index is as follows:
For both 'BTREE' and 'HASH' indexes, comparison of a key part with a constant value is a range condition when using the '=', '<=>', 'IN()', 'IS NULL', or 'IS NOT NULL' operators.
Additionally, for 'BTREE' indexes, comparison of a key part with a constant value is a range condition when using the '>', '<', '>=', '<=', 'BETWEEN', '!=', or '<>' operators, or 'LIKE' comparisons if the argument to 'LIKE' is a constant string that does not start with a wildcard character.
For all index types, multiple range conditions combined with 'OR' or 'AND' form a range condition.
'Constant value' in the preceding descriptions means one of the following:
A constant from the query string
A column of a 'const' or 'system' table from the same join
The result of an uncorrelated subquery
Any expression composed entirely from subexpressions of the preceding types
Here are some examples of queries with range conditions in the 'WHERE' clause:
SELECT * FROM t1
WHERE KEY_COL > 1
AND KEY_COL < 10;
SELECT * FROM t1
WHERE KEY_COL = 1
OR KEY_COL IN (15,18,20);
SELECT * FROM t1
WHERE KEY_COL LIKE 'ab%'
OR KEY_COL BETWEEN 'bar' AND 'foo';
Some nonconstant values may be converted to constants during the optimizer constant propagation phase.
MySQL tries to extract range conditions from the 'WHERE' clause for each of the possible indexes. During the extraction process, conditions that cannot be used for constructing the range condition are dropped, conditions that produce overlapping ranges are combined, and conditions that produce empty ranges are removed.
Consider the following statement, where 'key1' is an indexed column and 'nonkey' is not indexed:
SELECT * FROM t1 WHERE
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR key1 LIKE '%b')) OR
(key1 < 'bar' AND nonkey = 4) OR
(key1 < 'uux' AND key1 > 'z');
The extraction process for key 'key1' is as follows:
Start with original 'WHERE' clause:
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR key1 LIKE '%b')) OR
(key1 < 'bar' AND nonkey = 4) OR
(key1 < 'uux' AND key1 > 'z')
Remove 'nonkey = 4' and 'key1 LIKE '%b'' because they cannot be used for a range scan. The correct way to remove them is to replace them with 'TRUE', so that we do not miss any matching rows when doing the range scan. Replacing them with 'TRUE' yields:
(key1 < 'abc' AND (key1 LIKE 'abcde%' OR TRUE)) OR
(key1 < 'bar' AND TRUE) OR
(key1 < 'uux' AND key1 > 'z')
Collapse conditions that are always true or false:
* '(key1 LIKE 'abcde%' OR TRUE)' is always true
* '(key1 < 'uux' AND key1 > 'z')' is always false
Replacing these conditions with constants yields:
(key1 < 'abc' AND TRUE) OR (key1 < 'bar' AND TRUE) OR (FALSE)
Removing unnecessary 'TRUE' and 'FALSE' constants yields:
(key1 < 'abc') OR (key1 < 'bar')
Combining overlapping intervals into one yields the final condition to be used for the range scan:
(key1 < 'bar')
In general (and as demonstrated by the preceding example), the condition used for a range scan is less restrictive than the 'WHERE' clause. MySQL performs an additional check to filter out rows that satisfy the range condition but not the full 'WHERE' clause.
The range condition extraction algorithm can handle nested 'AND'/'OR' constructs of arbitrary depth, and its output does not depend on the order in which conditions appear in the 'WHERE' clause.
MySQL does not support merging multiple ranges for the 'range' access method for spatial indexes. To work around this limitation, you can use a *note 'UNION': union. with identical *note 'SELECT': select. statements, except that you put each spatial predicate in a different *note 'SELECT': select.
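The workaround just described can be sketched as follows; the table, columns, and spatial predicates are hypothetical, and 'MBRContains()' stands in for whatever spatial predicate the query needs:

```sql
-- Instead of combining two spatial predicates with OR in one SELECT,
-- put each predicate in its own SELECT and merge the results with UNION,
-- so each SELECT can use a single range on the spatial index.
SELECT id FROM shops
WHERE MBRContains(@area1, location)
UNION
SELECT id FROM shops
WHERE MBRContains(@area2, location);
```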
Range Access Method for Multiple-Part Indexes
Range conditions on a multiple-part index are an extension of range conditions for a single-part index. A range condition on a multiple-part index restricts index rows to lie within one or several key tuple intervals. Key tuple intervals are defined over a set of key tuples, using ordering from the index.
For example, consider a multiple-part index defined as 'key1(KEY_PART1, KEY_PART2, KEY_PART3)', and the following set of key tuples listed in key order:
KEY_PART1 KEY_PART2 KEY_PART3
NULL 1 'abc'
NULL 1 'xyz'
NULL 2 'foo'
1 1 'abc'
1 1 'xyz'
1 2 'abc'
2 1 'aaa'
The condition 'KEY_PART1 = 1' defines this interval:
(1,-inf,-inf) <= (KEY_PART1,KEY_PART2,KEY_PART3) < (1,+inf,+inf)
The interval covers the 4th, 5th, and 6th tuples in the preceding data set and can be used by the range access method.
By contrast, the condition 'KEY_PART3 = 'abc'' does not define a single interval and cannot be used by the range access method.
The following descriptions indicate how range conditions work for multiple-part indexes in greater detail.
For 'HASH' indexes, each interval containing identical values can be used. This means that the interval can be produced only for conditions in the following form:
KEY_PART1 CMP CONST1
AND KEY_PART2 CMP CONST2
AND ...
AND KEY_PARTN CMP CONSTN;
Here, CONST1, CONST2, ... are constants, CMP is one of the '=', '<=>', or 'IS NULL' comparison operators, and the conditions cover all index parts. (That is, there are N conditions, one for each part of an N-part index.) For example, the following is a range condition for a three-part 'HASH' index:
KEY_PART1 = 1 AND KEY_PART2 IS NULL AND KEY_PART3 = 'foo'
For the definition of what is considered to be a constant, see *note range-access-single-part::.
For a 'BTREE' index, an interval might be usable for conditions combined with 'AND', where each condition compares a key part with a constant value using '=', '<=>', 'IS NULL', '>', '<', '>=', '<=', '!=', '<>', 'BETWEEN', or 'LIKE 'PATTERN'' (where ''PATTERN'' does not start with a wildcard). An interval can be used as long as it is possible to determine a single key tuple containing all rows that match the condition (or two intervals if '<>' or '!=' is used).
The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is '=', '<=>', or 'IS NULL'. If the operator is '>', '<', '>=', '<=', '!=', '<>', 'BETWEEN', or 'LIKE', the optimizer uses it but considers no more key parts. For the following expression, the optimizer uses '=' from the first comparison. It also uses '>=' from the second comparison but considers no further key parts and does not use the third comparison for interval construction:
KEY_PART1 = 'foo' AND KEY_PART2 >= 10 AND KEY_PART3 > 10
The single interval is:
('foo',10,-inf) < (KEY_PART1,KEY_PART2,KEY_PART3) < ('foo',+inf,+inf)
It is possible that the created interval contains more rows than the initial condition. For example, the preceding interval includes the value '('foo', 11, 0)', which does not satisfy the original condition.
If conditions that cover sets of rows contained within intervals are combined with 'OR', they form a condition that covers a set of rows contained within the union of their intervals. If the conditions are combined with 'AND', they form a condition that covers a set of rows contained within the intersection of their intervals. For example, for this condition on a two-part index:
(KEY_PART1 = 1 AND KEY_PART2 < 2) OR (KEY_PART1 > 5)
The intervals are:
(1,-inf) < (KEY_PART1,KEY_PART2) < (1,2)
(5,-inf) < (KEY_PART1,KEY_PART2)
In this example, the interval on the first line uses one key part for the left bound and two key parts for the right bound. The interval on the second line uses only one key part. The 'key_len' column in the *note 'EXPLAIN': explain. output indicates the maximum length of the key prefix used.
In some cases, 'key_len' may indicate that a key part was used, but that might not be what you would expect. Suppose that KEY_PART1 and KEY_PART2 can be 'NULL'. Then the 'key_len' column displays two key part lengths for the following condition:
KEY_PART1 >= 1 AND KEY_PART2 < 2
But, in fact, the condition is converted to this:
KEY_PART1 >= 1 AND KEY_PART2 IS NOT NULL
For a description of how optimizations are performed to combine or eliminate intervals for range conditions on a single-part index, see *note range-access-single-part::. Analogous steps are performed for range conditions on multiple-part indexes.
Equality Range Optimization of Many-Valued Comparisons
Consider these expressions, where COL_NAME is an indexed column:
COL_NAME IN(VAL1, ..., VALN)
COL_NAME = VAL1 OR ... OR COL_NAME = VALN
Each expression is true if COL_NAME is equal to any of several values. These comparisons are equality range comparisons (where the 'range' is a single value). The optimizer estimates the cost of reading qualifying rows for equality range comparisons as follows:
If there is a unique index on COL_NAME, the row estimate for each range is 1 because at most one row can have the given value.
Otherwise, any index on COL_NAME is nonunique and the optimizer can estimate the row count for each range using dives into the index or index statistics.
With index dives, the optimizer makes a dive at each end of a range and uses the number of rows in the range as the estimate. For example, the expression 'COL_NAME IN (10, 20, 30)' has three equality ranges and the optimizer makes two dives per range to generate a row estimate. Each pair of dives yields an estimate of the number of rows that have the given value.
Index dives provide accurate row estimates, but as the number of comparison values in the expression increases, the optimizer takes longer to generate a row estimate. Use of index statistics is less accurate than index dives but permits faster row estimation for large value lists.
The 'eq_range_index_dive_limit' system variable enables you to configure the number of values at which the optimizer switches from one row estimation strategy to the other. To permit use of index dives for comparisons of up to N equality ranges, set 'eq_range_index_dive_limit' to N + 1. To disable use of statistics and always use index dives regardless of N, set 'eq_range_index_dive_limit' to 0.
To update table index statistics for best estimates, use *note 'ANALYZE TABLE': analyze-table.
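The switchover behavior described above can be set per session; the limit value and table name here are illustrative:

```sql
-- Allow index dives for comparisons of up to 200 equality ranges
-- (set the variable to N + 1)
SET SESSION eq_range_index_dive_limit = 201;

-- Always use index dives, regardless of the number of values
SET SESSION eq_range_index_dive_limit = 0;

-- When statistics are used instead of dives, keep them fresh
ANALYZE TABLE t1;
```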
Even under conditions when index dives would otherwise be used, they are skipped for queries that satisfy all these conditions:
A single-index 'FORCE INDEX' index hint is present. The idea is that if index use is forced, there is nothing to be gained from the additional overhead of performing dives into the index.
The index is nonunique and not a 'FULLTEXT' index.
No subquery is present.
No 'DISTINCT', 'GROUP BY', or 'ORDER BY' clause is present.
Those dive-skipping conditions apply only for single-table queries. Index dives are not skipped for multiple-table queries (joins).
Range Optimization of Row Constructor Expressions
The optimizer is able to apply the range scan access method to queries of this form:
SELECT ... FROM t1 WHERE ( col_1, col_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ));
Previously, for range scans to be used, it was necessary to write the query as:
SELECT ... FROM t1 WHERE ( col_1 = 'a' AND col_2 = 'b' )
OR ( col_1 = 'c' AND col_2 = 'd' );
For the optimizer to use a range scan, queries must satisfy these conditions:
Only 'IN()' predicates are used, not 'NOT IN()'.
On the left side of the 'IN()' predicate, the row constructor contains only column references.
On the right side of the 'IN()' predicate, row constructors contain only runtime constants, which are either literals or local column references that are bound to constants during execution.
On the right side of the 'IN()' predicate, there is more than one row constructor.
For more information about the optimizer and row constructors, see *note row-constructor-optimization::
Limiting Memory Use for Range Optimization
To control the memory available to the range optimizer, use the 'range_optimizer_max_mem_size' system variable:
A value of 0 means 'no limit.'
With a value greater than 0, the optimizer tracks the memory consumed when considering the range access method. If the specified limit is about to be exceeded, the range access method is abandoned and other methods, including a full table scan, are considered instead. This could be less optimal. If this happens, the following warning occurs (where N is the current 'range_optimizer_max_mem_size' value):
Warning 3170 Memory capacity of N bytes for
'range_optimizer_max_mem_size' exceeded. Range
optimization was not done for this query.
For *note 'UPDATE': update. and *note 'DELETE': delete. statements, if the optimizer falls back to a full table scan and the 'sql_safe_updates' system variable is enabled, an error occurs rather than a warning because, in effect, no key is used to determine which rows to modify. For more information, see *note safe-updates::.
For individual queries that exceed the available range optimization memory and for which the optimizer falls back to less optimal plans, increasing the 'range_optimizer_max_mem_size' value may improve performance.
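For example, to inspect and raise the limit for the current session (the 16MB value is an arbitrary illustration, not a recommendation):

```sql
-- Check the current limit
SELECT @@range_optimizer_max_mem_size;

-- Raise the session limit, here to 16MB
SET SESSION range_optimizer_max_mem_size = 16777216;

-- A value of 0 removes the limit entirely
SET SESSION range_optimizer_max_mem_size = 0;
```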
To estimate the amount of memory needed to process a range expression, use these guidelines:
For a simple query such as the following, where there is one candidate key for the range access method, each predicate combined with 'OR' uses approximately 230 bytes:
SELECT COUNT(*) FROM t
WHERE a=1 OR a=2 OR a=3 OR ... a=N;
Similarly for a query such as the following, each predicate combined with 'AND' uses approximately 125 bytes:
SELECT COUNT(*) FROM t
WHERE a=1 AND b=1 AND c=1 ... N;
For a query with 'IN()' predicates:
SELECT COUNT(*) FROM t
WHERE a IN (1,2, ..., M) AND b IN (1,2, ..., N);
Each literal value in an 'IN()' list counts as a predicate combined with 'OR'. If there are two 'IN()' lists, the number of predicates combined with 'OR' is the product of the number of literal values in each list. Thus, the number of predicates combined with 'OR' in the preceding case is M x N.
Before 5.7.11, the number of bytes per predicate combined with 'OR' was higher, approximately 700 bytes.
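Putting the preceding guidelines together, a rough estimate for a query with two 'IN()' lists might look like this; the list sizes are illustrative:

```sql
-- An IN() list of 100 values combined by AND with an IN() list of
-- 50 values yields 100 * 50 = 5000 predicates combined with OR,
-- so roughly 5000 * 230 bytes, or about 1.1 MB, of range
-- optimizer memory.
SELECT COUNT(*) FROM t
WHERE a IN (1, 2, ..., 100) AND b IN (1, 2, ..., 50);
```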
File: manual.info.tmp, Node: index-merge-optimization, Next: engine-condition-pushdown-optimization, Prev: range-optimization, Up: select-optimization
8.2.1.3 Index Merge Optimization
................................
The Index Merge access method retrieves rows with multiple 'range' scans and merges their results into one. This access method merges index scans from a single table only, not scans across multiple tables. The merge can produce unions, intersections, or unions-of-intersections of its underlying scans.
Example queries for which Index Merge may be used:
SELECT * FROM TBL_NAME WHERE KEY1 = 10 OR KEY2 = 20;
SELECT * FROM TBL_NAME
WHERE (KEY1 = 10 OR KEY2 = 20) AND NON_KEY = 30;
SELECT * FROM t1, t2
WHERE (t1.KEY1 IN (1,2) OR t1.KEY2 LIKE 'VALUE%')
AND t2.KEY1 = t1.SOME_COL;
SELECT * FROM t1, t2
WHERE t1.KEY1 = 1
AND (t2.KEY1 = t1.SOME_COL OR t2.KEY2 = t1.SOME_COL2);
Note:
The Index Merge optimization algorithm has the following known limitations:
If your query has a complex 'WHERE' clause with deep 'AND'/'OR' nesting and MySQL does not choose the optimal plan, try distributing terms using the following identity transformations:
(X AND Y) OR Z => (X OR Z) AND (Y OR Z)
(X OR Y) AND Z => (X AND Z) OR (Y AND Z)
Index Merge is not applicable to full-text indexes.
In *note 'EXPLAIN': explain. output, the Index Merge method appears as 'index_merge' in the 'type' column. In this case, the 'key' column contains a list of indexes used, and 'key_len' contains a list of the longest key parts for those indexes.
The Index Merge access method has several algorithms, which are displayed in the 'Extra' field of *note 'EXPLAIN': explain. output:
'Using intersect(...)'
'Using union(...)'
'Using sort_union(...)'
The following sections describe these algorithms in greater detail. The optimizer chooses between different possible Index Merge algorithms and other access methods based on cost estimates of the various available options.
Use of Index Merge is subject to the value of the 'index_merge', 'index_merge_intersection', 'index_merge_union', and 'index_merge_sort_union' flags of the 'optimizer_switch' system variable. See *note switchable-optimizations::. By default, all those flags are 'on'. To enable only certain algorithms, set 'index_merge' to 'off', and enable only such of the others as should be permitted.
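For instance, to permit only the union algorithm for the current session, turn off the blanket flag and re-enable the one algorithm you want:

```sql
-- Disable Index Merge generally, then allow only the union algorithm
SET SESSION optimizer_switch = 'index_merge=off,index_merge_union=on';

-- Inspect the resulting flag settings
SELECT @@optimizer_switch;
```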
*note index-merge-intersection::
*note index-merge-union::
*note index-merge-sort-union::
Index Merge Intersection Access Algorithm
This access algorithm is applicable when a 'WHERE' clause is converted to several range conditions on different keys combined with 'AND', and each condition is one of the following:
An N-part expression of this form, where the index has exactly N parts (that is, all index parts are covered):
KEY_PART1 = CONST1 AND KEY_PART2 = CONST2 ... AND KEY_PARTN = CONSTN
Any range condition over the primary key of an 'InnoDB' table.
Examples:
SELECT * FROM INNODB_TABLE
WHERE PRIMARY_KEY < 10 AND KEY_COL1 = 20;
SELECT * FROM TBL_NAME
WHERE KEY1_PART1 = 1 AND KEY1_PART2 = 2 AND KEY2 = 2;
The Index Merge intersection algorithm performs simultaneous scans on all used indexes and produces the intersection of row sequences that it receives from the merged index scans.
If all columns used in the query are covered by the used indexes, full table rows are not retrieved (*note 'EXPLAIN': explain. output contains 'Using index' in 'Extra' field in this case). Here is an example of such a query:
SELECT COUNT(*) FROM t1 WHERE key1 = 1 AND key2 = 1;
If the used indexes do not cover all columns used in the query, full rows are retrieved only when the range conditions for all used keys are satisfied.
If one of the merged conditions is a condition over the primary key of an 'InnoDB' table, it is not used for row retrieval, but is used to filter out rows retrieved using other conditions.
Index Merge Union Access Algorithm
The criteria for this algorithm are similar to those for the Index Merge intersection algorithm. The algorithm is applicable when the table's 'WHERE' clause is converted to several range conditions on different keys combined with 'OR', and each condition is one of the following:
An N-part expression of this form, where the index has exactly N parts (that is, all index parts are covered):
KEY_PART1 = CONST1 AND KEY_PART2 = CONST2 ... AND KEY_PARTN = CONSTN
Any range condition over a primary key of an 'InnoDB' table.
A condition for which the Index Merge intersection algorithm is applicable.
Examples:
SELECT * FROM t1
WHERE KEY1 = 1 OR KEY2 = 2 OR KEY3 = 3;
SELECT * FROM INNODB_TABLE
WHERE (KEY1 = 1 AND KEY2 = 2)
OR (KEY3 = 'foo' AND KEY4 = 'bar') AND KEY5 = 5;
Index Merge Sort-Union Access Algorithm
This access algorithm is applicable when the 'WHERE' clause is converted to several range conditions combined by 'OR', but the Index Merge union algorithm is not applicable.
Examples:
SELECT * FROM TBL_NAME
WHERE KEY_COL1 < 10 OR KEY_COL2 < 20;
SELECT * FROM TBL_NAME
WHERE (KEY_COL1 > 10 OR KEY_COL2 = 20) AND NONKEY_COL = 30;
The difference between the sort-union algorithm and the union algorithm is that the sort-union algorithm must first fetch row IDs for all rows and sort them before returning any rows.
File: manual.info.tmp, Node: engine-condition-pushdown-optimization, Next: index-condition-pushdown-optimization, Prev: index-merge-optimization, Up: select-optimization
8.2.1.4 Engine Condition Pushdown Optimization
..............................................
This optimization improves the efficiency of direct comparisons between a nonindexed column and a constant. In such cases, the condition is 'pushed down' to the storage engine for evaluation. This optimization can be used only by the *note 'NDB': mysql-cluster. storage engine.
For NDB Cluster, this optimization can eliminate the need to send nonmatching rows over the network between the cluster's data nodes and the MySQL server that issued the query, and can speed up queries where it is used by a factor of 5 to 10 times over cases where condition pushdown could be but is not used.
Suppose that an NDB Cluster table is defined as follows:
CREATE TABLE t1 (
a INT,
b INT,
KEY(a)
) ENGINE=NDB;
Engine condition pushdown can be used with queries such as the one shown here, which includes a comparison between a nonindexed column and a constant:
SELECT a, b FROM t1 WHERE b = 10;
The use of engine condition pushdown can be seen in the output of *note 'EXPLAIN': explain.:
mysql> EXPLAIN SELECT a,b FROM t1 WHERE b = 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 10
Extra: Using where with pushed condition
However, engine condition pushdown cannot be used with either of these two queries:
SELECT a,b FROM t1 WHERE a = 10;
SELECT a,b FROM t1 WHERE b + 1 = 10;
Engine condition pushdown is not applicable to the first query because an index exists on column 'a'. (An index access method would be more efficient and so would be chosen in preference to condition pushdown.) Engine condition pushdown cannot be employed for the second query because the comparison involving the nonindexed column 'b' is indirect. (However, engine condition pushdown could be applied if you were to reduce 'b + 1 = 10' to 'b = 9' in the 'WHERE' clause.)
Engine condition pushdown may also be employed when an indexed column is compared with a constant using a '>' or '<' operator:
mysql> EXPLAIN SELECT a, b FROM t1 WHERE a < 2\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
type: range
possible_keys: a
key: a
key_len: 5
ref: NULL
rows: 2
Extra: Using where with pushed condition
Other supported comparisons for engine condition pushdown include the following:
'COLUMN [NOT] LIKE PATTERN'
PATTERN must be a string literal containing the pattern to be matched; for syntax, see *note string-comparison-functions::.
'COLUMN IS [NOT] NULL'
'COLUMN IN (VALUE_LIST)'
Each item in the VALUE_LIST must be a constant, literal value.
'COLUMN BETWEEN CONSTANT1 AND CONSTANT2'
CONSTANT1 and CONSTANT2 must each be a constant, literal value.
In all of the cases in the preceding list, it is possible for the condition to be converted into the form of one or more direct comparisons between a column and a constant.
Engine condition pushdown is enabled by default. To disable it at server startup, set the 'optimizer_switch' system variable's 'engine_condition_pushdown' flag to 'off'. For example, in a 'my.cnf' file, use these lines:
[mysqld]
optimizer_switch=engine_condition_pushdown=off
At runtime, disable condition pushdown like this:
SET optimizer_switch='engine_condition_pushdown=off';
Limitations
Engine condition pushdown is subject to the following limitations:
Engine condition pushdown is supported only by the *note 'NDB': mysql-cluster. storage engine.
Columns may be compared with constants only; however, this includes expressions which evaluate to constant values.
Columns used in comparisons cannot be of any of the *note 'BLOB': blob. or *note 'TEXT': blob. types. This exclusion extends to *note 'JSON': json, *note 'BIT': bit-type, and *note 'ENUM': enum. columns as well.
A string value to be compared with a column must use the same collation as the column.
Joins are not directly supported; conditions involving multiple tables are pushed separately where possible. Use extended *note 'EXPLAIN': explain. output to determine which conditions are actually pushed down. See *note explain-extended::.
File: manual.info.tmp, Node: index-condition-pushdown-optimization, Next: nested-loop-joins, Prev: engine-condition-pushdown-optimization, Up: select-optimization
8.2.1.5 Index Condition Pushdown Optimization
.............................................
Index Condition Pushdown (ICP) is an optimization for the case where MySQL retrieves rows from a table using an index. Without ICP, the storage engine traverses the index to locate rows in the base table and returns them to the MySQL server which evaluates the 'WHERE' condition for the rows. With ICP enabled, and if parts of the 'WHERE' condition can be evaluated by using only columns from the index, the MySQL server pushes this part of the 'WHERE' condition down to the storage engine. The storage engine then evaluates the pushed index condition by using the index entry and only if this is satisfied is the row read from the table. ICP can reduce the number of times the storage engine must access the base table and the number of times the MySQL server must access the storage engine.
Applicability of the Index Condition Pushdown optimization is subject to these conditions:
ICP is used for the 'range', 'ref', 'eq_ref', and 'ref_or_null' access methods when there is a need to access full table rows.
ICP can be used for *note 'InnoDB': innodb-storage-engine. and *note 'MyISAM': myisam-storage-engine. tables, including partitioned 'InnoDB' and 'MyISAM' tables.
For 'InnoDB' tables, ICP is used only for secondary indexes. The goal of ICP is to reduce the number of full-row reads and thereby reduce I/O operations. For 'InnoDB' clustered indexes, the complete record is already read into the 'InnoDB' buffer. Using ICP in this case does not reduce I/O.
ICP is not supported with secondary indexes created on virtual generated columns. 'InnoDB' supports secondary indexes on virtual generated columns.
Conditions that refer to subqueries cannot be pushed down.
Conditions that refer to stored functions cannot be pushed down. Storage engines cannot invoke stored functions.
Triggered conditions cannot be pushed down. (For information about triggered conditions, see *note subquery-optimization-with-exists::.)
To understand how this optimization works, first consider how an index scan proceeds when Index Condition Pushdown is not used:
Get the next row, first by reading the index tuple, and then by using the index tuple to locate and read the full table row.
Test the part of the 'WHERE' condition that applies to this table. Accept or reject the row based on the test result.
Using Index Condition Pushdown, the scan proceeds like this instead:
Get the next row's index tuple (but not the full table row).
Test the part of the 'WHERE' condition that applies to this table and can be checked using only index columns. If the condition is not satisfied, proceed to the index tuple for the next row.
If the condition is satisfied, use the index tuple to locate and read the full table row.
Test the remaining part of the 'WHERE' condition that applies to this table. Accept or reject the row based on the test result.
*note 'EXPLAIN': explain. output shows 'Using index condition' in the 'Extra' column when Index Condition Pushdown is used. It does not show 'Using index' because that does not apply when full table rows must be read.
Suppose that a table contains information about people and their addresses and that the table has an index defined as 'INDEX (zipcode, lastname, firstname)'. If we know a person's 'zipcode' value but are not sure about the last name, we can search like this:
SELECT * FROM people
WHERE zipcode='95054'
AND lastname LIKE '%etrunia%'
AND address LIKE '%Main Street%';
MySQL can use the index to scan through people with 'zipcode='95054''. The second part ('lastname LIKE '%etrunia%'') cannot be used to limit the number of rows that must be scanned, so without Index Condition Pushdown, this query must retrieve full table rows for all people who have 'zipcode='95054''.
With Index Condition Pushdown, MySQL checks the 'lastname LIKE '%etrunia%'' part before reading the full table row. This avoids reading full rows corresponding to index tuples that match the 'zipcode' condition but not the 'lastname' condition.
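The row-read savings can be illustrated with a toy in-memory model of the people table and its 'INDEX (zipcode, lastname, firstname)' (the data below is made up for illustration; 'LIKE '%etrunia%'' is approximated by a substring test):

```python
# Toy base table keyed by row ID; contents are illustrative only.
table = {
    1: {"zipcode": "95054", "lastname": "Petrunia",  "address": "12 Main Street"},
    2: {"zipcode": "95054", "lastname": "Smith",     "address": "9 Elm Street"},
    3: {"zipcode": "95054", "lastname": "Detrunias", "address": "3 Main Street"},
    4: {"zipcode": "10001", "lastname": "Petrunia",  "address": "1 Main Street"},
}
# Index entries (zipcode, lastname, rowid), kept in index order.
index = sorted((r["zipcode"], r["lastname"], rid) for rid, r in table.items())

def scan(use_icp):
    full_row_reads = 0
    matches = []
    for zipcode, lastname, rid in index:
        if zipcode != "95054":            # index range condition
            continue
        if use_icp and "etrunia" not in lastname:
            continue                      # pushed index condition: skip the row read
        row = table[rid]                  # full table row read
        full_row_reads += 1
        if "etrunia" in row["lastname"] and "Main Street" in row["address"]:
            matches.append(rid)
    return matches, full_row_reads
```

Both scans return the same matching rows, but the ICP-style scan reads fewer full rows (2 instead of 3 in this toy data), because rows failing the 'lastname' condition are skipped using only the index entry.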
Index Condition Pushdown is enabled by default. It can be controlled with the 'optimizer_switch' system variable by setting the 'index_condition_pushdown' flag:
SET optimizer_switch = 'index_condition_pushdown=off';
SET optimizer_switch = 'index_condition_pushdown=on';
See *note switchable-optimizations::.
File: manual.info.tmp, Node: nested-loop-joins, Next: nested-join-optimization, Prev: index-condition-pushdown-optimization, Up: select-optimization
8.2.1.6 Nested-Loop Join Algorithms
...................................
MySQL executes joins between tables using a nested-loop algorithm or variations on it.
*note nested-loop-join-algorithm::
*note block-nested-loop-join-algorithm::
Nested-Loop Join Algorithm
A simple nested-loop join (NLJ) algorithm reads rows from the first table in a loop one at a time, passing each row to a nested loop that processes the next table in the join. This process is repeated as many times as there remain tables to be joined.
Assume that a join between three tables 't1', 't2', and 't3' is to be executed using the following join types:
Table Join Type
t1 range
t2 ref
t3 ALL
If a simple NLJ algorithm is used, the join is processed like this:
for each row in t1 matching range {
for each row in t2 matching reference key {
for each row in t3 {
if row satisfies join conditions, send to client
}
}
}
Because the NLJ algorithm passes rows one at a time from outer loops to inner loops, it typically reads tables processed in the inner loops many times.
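The three-loop pattern can be sketched directly in Python (a toy model; rows and join conditions are illustrative, not MySQL internals):

```python
# t1 joins t2 on column a; t2 joins t3 on column b,
# mirroring the pseudocode above.
t1 = [{"a": 1}, {"a": 2}]
t2 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
t3 = [{"b": 10}, {"b": 30}]

def nested_loop_join(t1, t2, t3):
    out = []
    for r1 in t1:                       # outer loop (range access)
        for r2 in t2:                   # middle loop (ref access)
            if r2["a"] != r1["a"]:
                continue
            for r3 in t3:               # inner loop (ALL: scanned once
                if r3["b"] == r2["b"]:  # per qualifying t1/t2 pair)
                    out.append((r1["a"], r2["b"], r3["b"]))
    return out
```

Note that t3 is scanned in full once for every qualifying (t1, t2) pair, which is exactly the cost the block nested-loop variant below reduces.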
Block Nested-Loop Join Algorithm
A Block Nested-Loop (BNL) join algorithm uses buffering of rows read in outer loops to reduce the number of times that tables in inner loops must be read. For example, if 10 rows are read into a buffer and the buffer is passed to the next inner loop, each row read in the inner loop can be compared against all 10 rows in the buffer. This reduces by an order of magnitude the number of times the inner table must be read.
MySQL join buffering has these characteristics:
Join buffering can be used when the join is of type 'ALL' or 'index' (in other words, when no possible keys can be used, and a full scan is done, of either the data or index rows, respectively), or 'range'. Use of buffering is also applicable to outer joins, as described in *note bnl-bka-optimization::.
A join buffer is never allocated for the first nonconstant table, even if it would be of type 'ALL' or 'index'.
Only columns of interest to a join are stored in its join buffer, not whole rows.
The 'join_buffer_size' system variable determines the size of each join buffer used to process a query.
One buffer is allocated for each join that can be buffered, so a given query might be processed using multiple join buffers.
A join buffer is allocated prior to executing the join and freed after the query is done.
For the example join described previously for the NLJ algorithm (without buffering), the join is done as follows using join buffering:
for each row in t1 matching range {
for each row in t2 matching reference key {
store used columns from t1, t2 in join buffer
if buffer is full {
for each row in t3 {
for each t1, t2 combination in join buffer {
if row satisfies join conditions, send to client
}
}
empty join buffer
}
}
}
if buffer is not empty {
for each row in t3 {
for each t1, t2 combination in join buffer {
if row satisfies join conditions, send to client
}
}
}
If S is the size of each stored 't1', 't2' combination in the join buffer and C is the number of combinations in the buffer, the number of times table 't3' is scanned is:
(S * C)/join_buffer_size + 1
The number of 't3' scans decreases as the value of 'join_buffer_size' increases, up to the point when 'join_buffer_size' is large enough to hold all previous row combinations. At that point, no speed is gained by making it larger.
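With illustrative numbers (the combination size, combination count, and buffer sizes below are assumptions, not values from the text), the formula can be evaluated directly:

```python
def t3_scans(S, C, join_buffer_size):
    # One scan of t3 each time the join buffer fills, plus one final
    # scan for the last, partially filled buffer (integer division
    # models the formula in the text).
    return (S * C) // join_buffer_size + 1
```

For example, buffering 10,000 combinations of 100 bytes each with a 256 KB buffer requires 4 scans of 't3'; doubling the buffer to 512 KB reduces that to 2.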
File: manual.info.tmp, Node: nested-join-optimization, Next: outer-join-optimization, Prev: nested-loop-joins, Up: select-optimization
8.2.1.7 Nested Join Optimization
................................
The syntax for expressing joins permits nested joins. The following discussion refers to the join syntax described in *note join::.
The syntax of TABLE_FACTOR is extended in comparison with the SQL Standard. The latter accepts only TABLE_REFERENCE, not a list of them inside a pair of parentheses. This is a conservative extension if we consider each comma in a list of TABLE_REFERENCE items as equivalent to an inner join. For example:
SELECT * FROM t1 LEFT JOIN (t2, t3, t4)
ON (t2.a=t1.a AND t3.b=t1.b AND t4.c=t1.c)
Is equivalent to:
SELECT * FROM t1 LEFT JOIN (t2 CROSS JOIN t3 CROSS JOIN t4)
ON (t2.a=t1.a AND t3.b=t1.b AND t4.c=t1.c)
In MySQL, 'CROSS JOIN' is syntactically equivalent to 'INNER JOIN'; they can replace each other. In standard SQL, they are not equivalent. 'INNER JOIN' is used with an 'ON' clause; 'CROSS JOIN' is used otherwise.
In general, parentheses can be ignored in join expressions containing only inner join operations. Consider this join expression:
t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL)
ON t1.a=t2.a
After removing parentheses and grouping operations to the left, that join expression transforms into this expression:
(t1 LEFT JOIN t2 ON t1.a=t2.a) LEFT JOIN t3
ON t2.b=t3.b OR t2.b IS NULL
Yet, the two expressions are not equivalent. To see this, suppose that the tables 't1', 't2', and 't3' have the following state:
Table 't1' contains rows '(1)', '(2)'
Table 't2' contains row '(1,101)'
Table 't3' contains row '(101)'
In this case, the first expression returns a result set including the rows '(1,1,101,101)', '(2,NULL,NULL,NULL)', whereas the second expression returns the rows '(1,1,101,101)', '(2,NULL,NULL,101)':
mysql> SELECT *
FROM t1
LEFT JOIN
(t2 LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL)
ON t1.a=t2.a;
+------+------+------+------+
| a | a | b | b |
+------+------+------+------+
| 1 | 1 | 101 | 101 |
| 2 | NULL | NULL | NULL |
+------+------+------+------+
mysql> SELECT *
FROM (t1 LEFT JOIN t2 ON t1.a=t2.a)
LEFT JOIN t3
ON t2.b=t3.b OR t2.b IS NULL;
+------+------+------+------+
| a | a | b | b |
+------+------+------+------+
| 1 | 1 | 101 | 101 |
| 2 | NULL | NULL | 101 |
+------+------+------+------+
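The difference between the two groupings can be reproduced with a small left-join evaluator in Python (a toy sketch: rows are dicts, 'None' stands in for SQL 'NULL', and the data matches the example tables above):

```python
# Minimal left-join evaluator. right_cols lists the right table's
# columns so a NULL-complemented row can be built when nothing matches.
def left_join(left, right, right_cols, on):
    out = []
    for l in left:
        matched = False
        for r in right:
            if on(l, r):
                out.append({**l, **r})
                matched = True
        if not matched:
            out.append({**l, **{c: None for c in right_cols}})
    return out

t1 = [{"t1_a": 1}, {"t1_a": 2}]
t2 = [{"t2_a": 1, "t2_b": 101}]
t3 = [{"t3_b": 101}]
cond = lambda l, r: l["t2_b"] == r["t3_b"] or l["t2_b"] is None

# t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL) ON t1.a=t2.a
inner = left_join(t2, t3, ["t3_b"], cond)
q1 = left_join(t1, inner, ["t2_a", "t2_b", "t3_b"],
               lambda l, r: l["t1_a"] == r["t2_a"])

# (t1 LEFT JOIN t2 ON t1.a=t2.a) LEFT JOIN t3 ON t2.b=t3.b OR t2.b IS NULL
first = left_join(t1, t2, ["t2_a", "t2_b"],
                  lambda l, r: l["t1_a"] == r["t2_a"])
q2 = left_join(first, t3, ["t3_b"], cond)
```

In 'q2', the NULL-complemented row for 't1_a = 2' has 't2_b' set to 'None', so the 't2.b IS NULL' branch of the condition matches it against the 't3' row, producing the extra '101' value shown in the second result set above.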
In the following example, an outer join operation is used together with an inner join operation:
t1 LEFT JOIN (t2, t3) ON t1.a=t2.a
That expression cannot be transformed into the following expression:
t1 LEFT JOIN t2 ON t1.a=t2.a, t3
For the given table states, the two expressions return different sets of rows:
mysql> SELECT *
FROM t1 LEFT JOIN (t2, t3) ON t1.a=t2.a;
+------+------+------+------+
| a | a | b | b |
+------+------+------+------+
| 1 | 1 | 101 | 101 |
| 2 | NULL | NULL | NULL |
+------+------+------+------+
mysql> SELECT *
FROM t1 LEFT JOIN t2 ON t1.a=t2.a, t3;
+------+------+------+------+
| a | a | b | b |
+------+------+------+------+
| 1 | 1 | 101 | 101 |
| 2 | NULL | NULL | 101 |
+------+------+------+------+
Therefore, if we omit parentheses in a join expression with outer join operators, we might change the result set for the original expression.
More exactly, we cannot ignore parentheses in the right operand of the left outer join operation and in the left operand of a right join operation. In other words, we cannot ignore parentheses for the inner table expressions of outer join operations. Parentheses for the other operand (operand for the outer table) can be ignored.
The following expression:
(t1,t2) LEFT JOIN t3 ON P(t2.b,t3.b)
Is equivalent to this expression for any tables 't1,t2,t3' and any condition 'P' over attributes 't2.b' and 't3.b':
t1, t2 LEFT JOIN t3 ON P(t2.b,t3.b)
Whenever the order of execution of join operations in a join expression (JOINED_TABLE) is not from left to right, we talk about nested joins. Consider the following queries:
SELECT * FROM t1 LEFT JOIN (t2 LEFT JOIN t3 ON t2.b=t3.b) ON t1.a=t2.a
WHERE t1.a > 1
SELECT * FROM t1 LEFT JOIN (t2, t3) ON t1.a=t2.a
WHERE (t2.b=t3.b OR t2.b IS NULL) AND t1.a > 1
Those queries are considered to contain these nested joins:
t2 LEFT JOIN t3 ON t2.b=t3.b
t2, t3
In the first query, the nested join is formed with a left join operation. In the second query, it is formed with an inner join operation.
In the first query, the parentheses can be omitted: The grammatical structure of the join expression dictates the same order of execution for join operations. For the second query, the parentheses cannot be omitted, although the join expression here can be interpreted unambiguously without them. In our extended syntax, the parentheses in '(t2, t3)' of the second query are required, although theoretically the query could be parsed without them: We still would have unambiguous syntactical structure for the query because 'LEFT JOIN' and 'ON' play the role of the left and right delimiters for the expression '(t2,t3)'.
The preceding examples demonstrate these points:
For join expressions involving only inner joins (and not outer joins), parentheses can be removed and joins evaluated left to right. In fact, tables can be evaluated in any order.
The same is not true, in general, for outer joins or for outer joins mixed with inner joins. Removal of parentheses may change the result.
Queries with nested outer joins are executed in the same pipeline manner as queries with inner joins. More exactly, a variation of the nested-loop join algorithm is exploited. Recall the algorithm by which the nested-loop join executes a query (see *note nested-loop-joins::). Suppose that a join query over 3 tables 'T1,T2,T3' has this form:
SELECT * FROM T1 INNER JOIN T2 ON P1(T1,T2)
INNER JOIN T3 ON P2(T2,T3)
WHERE P(T1,T2,T3)
Here, 'P1(T1,T2)' and 'P2(T2,T3)' are join conditions ('ON' expressions), whereas 'P(T1,T2,T3)' is a condition over columns of tables 'T1,T2,T3'.
The nested-loop join algorithm would execute this query in the following manner:
FOR each row t1 in T1 {
FOR each row t2 in T2 such that P1(t1,t2) {
FOR each row t3 in T3 such that P2(t2,t3) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
}
}
}
The notation 't1||t2||t3' indicates a row constructed by concatenating the columns of rows 't1', 't2', and 't3'. In some of the following examples, 'NULL' where a table name appears means a row in which 'NULL' is used for each column of that table. For example, 't1||t2||NULL' indicates a row constructed by concatenating the columns of rows 't1' and 't2', and 'NULL' for each column of 't3'. Such a row is said to be 'NULL'-complemented.
Now consider a query with nested outer joins:
SELECT * FROM T1 LEFT JOIN
(T2 LEFT JOIN T3 ON P2(T2,T3))
ON P1(T1,T2)
WHERE P(T1,T2,T3)
For this query, modify the nested-loop pattern to obtain:
FOR each row t1 in T1 {
BOOL f1:=FALSE;
FOR each row t2 in T2 such that P1(t1,t2) {
BOOL f2:=FALSE;
FOR each row t3 in T3 such that P2(t2,t3) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
f2=TRUE;
f1=TRUE;
}
IF (!f2) {
IF P(t1,t2,NULL) {
t:=t1||t2||NULL; OUTPUT t;
}
f1=TRUE;
}
}
IF (!f1) {
IF P(t1,NULL,NULL) {
t:=t1||NULL||NULL; OUTPUT t;
}
}
}
In general, for any nested loop for the first inner table in an outer join operation, a flag is introduced that is turned off before the loop and is checked after the loop. The flag is turned on when for the current row from the outer table a match from the table representing the inner operand is found. If at the end of the loop cycle the flag is still off, no match has been found for the current row of the outer table. In this case, the row is complemented by 'NULL' values for the columns of the inner tables. The result row is passed to the final check for the output or into the next nested loop, but only if the row satisfies the join condition of all embedded outer joins.
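The flag-based pattern above can be rendered in executable Python (a toy model; the row shapes and predicates are illustrative, with 'None' standing in for a NULL-complemented value):

```python
# T2 rows are (a, b) pairs; T1 and T3 rows are scalars.
def nested_outer_join(T1, T2, T3, P1, P2, P):
    out = []
    for t1 in T1:
        f1 = False                      # any T2 match for this t1?
        for t2 in T2:
            if not P1(t1, t2):
                continue
            f2 = False                  # any T3 match for this (t1, t2)?
            for t3 in T3:
                if not P2(t2, t3):
                    continue
                if P(t1, t2, t3):
                    out.append((t1, t2, t3))
                f2 = True
                f1 = True
            if not f2:                  # NULL-complement the T3 columns
                if P(t1, t2, None):
                    out.append((t1, t2, None))
                f1 = True
        if not f1:                      # NULL-complement T2 and T3 columns
            if P(t1, None, None):
                out.append((t1, None, None))
    return out

rows = nested_outer_join(
    T1=[1, 2, 3],
    T2=[(1, 101), (3, 303)],
    T3=[101],
    P1=lambda a, b: a == b[0],
    P2=lambda b, c: b[1] == c,
    P=lambda a, b, c: True,             # trivial WHERE condition
)
```

Row 2 of 'T1' has no 'T2' match and is complemented for both inner tables; row 3 matches in 'T2' but not 'T3', so only the 'T3' columns are complemented.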
In the example, the outer join table expressed by the following expression is embedded:
(T2 LEFT JOIN T3 ON P2(T2,T3))
For the query with inner joins, the optimizer could choose a different order of nested loops, such as this one:
FOR each row t3 in T3 {
FOR each row t2 in T2 such that P2(t2,t3) {
FOR each row t1 in T1 such that P1(t1,t2) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
}
}
}
For queries with outer joins, the optimizer can choose only such an order where loops for outer tables precede loops for inner tables. Thus, for our query with outer joins, only one nesting order is possible. For the following query, the optimizer evaluates two different nestings. In both nestings, 'T1' must be processed in the outer loop because it is used in an outer join. 'T2' and 'T3' are used in an inner join, so that join must be processed in the inner loop. However, because the join is an inner join, 'T2' and 'T3' can be processed in either order.
SELECT * FROM T1 LEFT JOIN (T2,T3) ON P1(T1,T2) AND P2(T1,T3)
WHERE P(T1,T2,T3)
One nesting evaluates 'T2', then 'T3':
FOR each row t1 in T1 {
BOOL f1:=FALSE;
FOR each row t2 in T2 such that P1(t1,t2) {
FOR each row t3 in T3 such that P2(t1,t3) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
f1:=TRUE
}
}
IF (!f1) {
IF P(t1,NULL,NULL) {
t:=t1||NULL||NULL; OUTPUT t;
}
}
}
The other nesting evaluates 'T3', then 'T2':
FOR each row t1 in T1 {
BOOL f1:=FALSE;
FOR each row t3 in T3 such that P2(t1,t3) {
FOR each row t2 in T2 such that P1(t1,t2) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
f1:=TRUE
}
}
IF (!f1) {
IF P(t1,NULL,NULL) {
t:=t1||NULL||NULL; OUTPUT t;
}
}
}
When discussing the nested-loop algorithm for inner joins, we omitted some details whose impact on the performance of query execution may be huge. We did not mention so-called 'pushed-down' conditions. Suppose that our 'WHERE' condition 'P(T1,T2,T3)' can be represented by a conjunctive formula:
P(T1,T2,T3) = C1(T1) AND C2(T2) AND C3(T3).
In this case, MySQL actually uses the following nested-loop algorithm for the execution of the query with inner joins:
FOR each row t1 in T1 such that C1(t1) {
FOR each row t2 in T2 such that P1(t1,t2) AND C2(t2) {
FOR each row t3 in T3 such that P2(t2,t3) AND C3(t3) {
IF P(t1,t2,t3) {
t:=t1||t2||t3; OUTPUT t;
}
}
}
}
Each of the conjuncts 'C1(T1)', 'C2(T2)', and 'C3(T3)' is pushed out of the innermost loop to the outermost loop where it can be evaluated. If 'C1(T1)' is a very restrictive condition, this condition pushdown may greatly reduce the number of rows from table 'T1' passed to the inner loops. As a result, the execution time for the query may improve immensely.
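The payoff of pushing a restrictive conjunct outward can be measured with a toy model (tables, predicates, and row counts below are illustrative assumptions):

```python
# C1 restricts T1 heavily, so evaluating it in the outermost loop
# avoids most inner-loop work while producing the same result rows.
def run(T1, T2, C1, P1, push_down):
    inner_visits = 0
    out = []
    for t1 in T1:
        if push_down and not C1(t1):
            continue                    # C1 checked in the outermost loop
        for t2 in T2:
            inner_visits += 1
            # Without pushdown, C1 is only checked at the innermost level.
            if P1(t1, t2) and (push_down or C1(t1)):
                out.append((t1, t2))
    return out, inner_visits

C1 = lambda t1: t1 < 5                  # very restrictive condition on T1
P1 = lambda t1, t2: True                # trivial join condition
plain = run(range(100), range(10), C1, P1, push_down=False)
pushed = run(range(100), range(10), C1, P1, push_down=True)
```

Both runs produce identical rows, but pushdown cuts the inner-loop visits from 1000 to 50 in this toy setup.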
For a query with outer joins, the 'WHERE' condition is to be checked only after it has been found that the current row from the outer table has a match in the inner tables. Thus, the optimization of pushing conditions out of the inner nested loops cannot be applied directly to queries with outer joins. Here we must introduce conditional pushed-down predicates guarded by the flags that are turned on when a match has been encountered.
Recall this example with outer joins:
P(T1,T2,T3) = C1(T1) AND C2(T2) AND C3(T3)
For that example, the nested-loop algorithm using guarded pushed-down conditions looks like this:
FOR each row t1 in T1 such that C1(t1) {
BOOL f1:=FALSE;
FOR each row t2 in T2
such that P1(t1,t2) AND (f1?C2(t2):TRUE) {
BOOL f2:=FALSE;
FOR each row t3 in T3
such that P2(t2,t3) AND (f1&&f2?C3(t3):TRUE) {
IF (f1&&f2?TRUE:(C2(t2) AND C3(t3))) {
t:=t1||t2||t3; OUTPUT t;
}
f2=TRUE;
f1=TRUE;
}
IF (!f2) {
IF (f1?TRUE:C2(t2) && P(t1,t2,NULL)) {
t:=t1||t2||NULL; OUTPUT t;
}
f1=TRUE;
}
}
IF (!f1 && P(t1,NULL,NULL)) {
t:=t1||NULL||NULL; OUTPUT t;
}
}
In general, pushed-down predicates can be extracted from join conditions such as 'P1(T1,T2)' and 'P2(T2,T3)'. In this case, a pushed-down predicate is guarded also by a flag that prevents checking the predicate for the 'NULL'-complemented row generated by the corresponding outer join operation.
Access by key from one inner table to another in the same nested join is prohibited if it is induced by a predicate from the 'WHERE' condition.
File: manual.info.tmp, Node: outer-join-optimization, Next: outer-join-simplification, Prev: nested-join-optimization, Up: select-optimization
8.2.1.8 Outer Join Optimization
...............................
Outer joins include 'LEFT JOIN' and 'RIGHT JOIN'.
MySQL implements an 'A LEFT JOIN B JOIN_SPECIFICATION' as follows:
Table B is set to depend on table A and all tables on which A depends.
Table A is set to depend on all tables (except B) that are used in the 'LEFT JOIN' condition.
The 'LEFT JOIN' condition is used to decide how to retrieve rows from table B. (In other words, any condition in the 'WHERE' clause is not used.)
All standard join optimizations are performed, with the exception that a table is always read after all tables on which it depends. If there is a circular dependency, an error occurs.
All standard 'WHERE' optimizations are performed.
If there is a row in A that matches the 'WHERE' clause, but there is no row in B that matches the 'ON' condition, an extra B row is generated with all columns set to 'NULL'.
If you use 'LEFT JOIN' to find rows that do not exist in some table and you have the following test: 'COL_NAME IS NULL' in the 'WHERE' part, where COL_NAME is a column that is declared as 'NOT NULL', MySQL stops searching for more rows (for a particular key combination) after it has found one row that matches the 'LEFT JOIN' condition.
The 'RIGHT JOIN' implementation is analogous to that of 'LEFT JOIN' with the table roles reversed. Right joins are converted to equivalent left joins, as described in *note outer-join-simplification::.
For a 'LEFT JOIN', if the 'WHERE' condition is always false for the generated 'NULL' row, the 'LEFT JOIN' is changed to an inner join. For example, the 'WHERE' clause would be false in the following query if 't2.column2' were 'NULL':
SELECT * FROM t1 LEFT JOIN t2 ON (t1.column1 = t2.column1) WHERE t2.column2=5;
Therefore, it is safe to convert the query to an inner join:
SELECT * FROM t1, t2 WHERE t2.column2=5 AND t1.column1=t2.column1;
Now the optimizer can use table 't2' before table 't1' if doing so would result in a better query plan. To provide a hint about the table join order, use 'STRAIGHT_JOIN'; see *note select::. However, 'STRAIGHT_JOIN' may prevent indexes from being used because it disables semijoin transformations; see *note semijoins::.
File: manual.info.tmp, Node: outer-join-simplification, Next: mrr-optimization, Prev: outer-join-optimization, Up: select-optimization
8.2.1.9 Outer Join Simplification
.................................
Table expressions in the 'FROM' clause of a query are simplified in many cases.
At the parser stage, queries with right outer join operations are converted to equivalent queries containing only left join operations. In the general case, the conversion is performed such that this right join:
(T1, ...) RIGHT JOIN (T2, ...) ON P(T1, ..., T2, ...)
Becomes this equivalent left join:
(T2, ...) LEFT JOIN (T1, ...) ON P(T1, ..., T2, ...)
All inner join expressions of the form 'T1 INNER JOIN T2 ON P(T1,T2)' are replaced by the list 'T1,T2', 'P(T1,T2)' being joined as a conjunct to the 'WHERE' condition (or to the join condition of the embedding join, if there is any).
When the optimizer evaluates plans for outer join operations, it takes into consideration only plans where, for each such operation, the outer tables are accessed before the inner tables. The optimizer choices are limited because only such plans enable outer joins to be executed using the nested-loop algorithm.
Consider a query of this form, where 'R(T2)' greatly narrows the number of matching rows from table 'T2':
SELECT * FROM T1
LEFT JOIN T2 ON P1(T1,T2)
WHERE P(T1,T2) AND R(T2)
If the query is executed as written, the optimizer has no choice but to access the less-restricted table 'T1' before the more-restricted table 'T2', which may produce a very inefficient execution plan.
Instead, MySQL converts the query to a query with no outer join operation if the 'WHERE' condition is null-rejected. (That is, it converts the outer join to an inner join.) A condition is said to be null-rejected for an outer join operation if it evaluates to 'FALSE' or 'UNKNOWN' for any 'NULL'-complemented row generated for the operation.
Thus, for this outer join:
T1 LEFT JOIN T2 ON T1.A=T2.A
Conditions such as these are null-rejected because they cannot be true for any 'NULL'-complemented row (with 'T2' columns set to 'NULL'):
T2.B IS NOT NULL
T2.B > 3
T2.C <= T1.C
T2.B < 2 OR T2.C > 1
Conditions such as these are not null-rejected because they might be true for a 'NULL'-complemented row:
T2.B IS NULL
T1.B < 3 OR T2.B IS NOT NULL
T1.B < 3 OR T2.B > 3
The general rules for checking whether a condition is null-rejected for an outer join operation are simple:
It is of the form 'A IS NOT NULL', where 'A' is an attribute of any of the inner tables
It is a predicate containing a reference to an inner table that evaluates to 'UNKNOWN' when one of its arguments is 'NULL'
It is a conjunction containing a null-rejected condition as a conjunct
It is a disjunction of null-rejected conditions
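These rules hinge on SQL three-valued logic, which can be illustrated with a small Python model ('None' stands in for both 'NULL' and 'UNKNOWN'; an illustrative sketch, not MySQL's implementation):

```python
# Three-valued SQL comparison: any comparison with NULL yields UNKNOWN.
def sql_gt(a, b):
    return None if a is None or b is None else a > b

def sql_is_not_null(a):
    return a is not None

def sql_is_null(a):
    return a is None

# For T1 LEFT JOIN T2 ON T1.A=T2.A, a NULL-complemented row has every
# T2 column set to NULL:
t2_b = None

# A condition filters the NULL-complemented row unless it evaluates
# to TRUE for that row.
filters_null_row = {
    "T2.B > 3":         sql_gt(t2_b, 3) is not True,         # UNKNOWN: filtered
    "T2.B IS NOT NULL": sql_is_not_null(t2_b) is not True,   # FALSE: filtered
    "T2.B IS NULL":     sql_is_null(t2_b) is not True,       # TRUE: kept
}
```

The first two conditions are null-rejected (they can never be true for the complemented row), so an outer join guarded by them behaves like an inner join; the third keeps the complemented row and is therefore not null-rejected.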
A condition can be null-rejected for one outer join operation in a query and not null-rejected for another. In this query, the 'WHERE' condition is null-rejected for the second outer join operation but is not null-rejected for the first one:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A
LEFT JOIN T3 ON T3.B=T1.B
WHERE T3.C > 0
If the 'WHERE' condition is null-rejected for an outer join operation in a query, the outer join operation is replaced by an inner join operation.
For example, in the preceding query, the second outer join is null-rejected and can be replaced by an inner join:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A
INNER JOIN T3 ON T3.B=T1.B
WHERE T3.C > 0
For the original query, the optimizer evaluates only plans compatible with the single table-access order 'T1,T2,T3'. For the rewritten query, it additionally considers the access order 'T3,T1,T2'.
A conversion of one outer join operation may trigger a conversion of another. Thus, the query:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A
LEFT JOIN T3 ON T3.B=T2.B
WHERE T3.C > 0
Is first converted to the query:
SELECT * FROM T1 LEFT JOIN T2 ON T2.A=T1.A
INNER JOIN T3 ON T3.B=T2.B
WHERE T3.C > 0
Which is equivalent to the query:
SELECT * FROM (T1 LEFT JOIN T2 ON T2.A=T1.A), T3
WHERE T3.C > 0 AND T3.B=T2.B
The remaining outer join operation can also be replaced by an inner join because the condition 'T3.B=T2.B' is null-rejected. This results in a query with no outer joins at all:
SELECT * FROM (T1 INNER JOIN T2 ON T2.A=T1.A), T3
WHERE T3.C > 0 AND T3.B=T2.B
Sometimes the optimizer succeeds in replacing an embedded outer join operation, but cannot convert the embedding outer join. The following query:
SELECT * FROM T1 LEFT JOIN
(T2 LEFT JOIN T3 ON T3.B=T2.B)
ON T2.A=T1.A
WHERE T3.C > 0
Is converted to:
SELECT * FROM T1 LEFT JOIN
(T2 INNER JOIN T3 ON T3.B=T2.B)
ON T2.A=T1.A
WHERE T3.C > 0
That can be rewritten only to the form still containing the embedding outer join operation:
SELECT * FROM T1 LEFT JOIN
(T2,T3)
ON (T2.A=T1.A AND T3.B=T2.B)
WHERE T3.C > 0
Any attempt to convert an embedded outer join operation in a query must take into account the join condition for the embedding outer join together with the 'WHERE' condition. In this query, the 'WHERE' condition is not null-rejected for the embedded outer join, but the join condition of the embedding outer join 'T2.A=T1.A AND T3.C=T1.C' is null-rejected:
SELECT * FROM T1 LEFT JOIN
(T2 LEFT JOIN T3 ON T3.B=T2.B)
ON T2.A=T1.A AND T3.C=T1.C
WHERE T3.D > 0 OR T1.D > 0
Consequently, the query can be converted to:
SELECT * FROM T1 LEFT JOIN
(T2, T3)
ON T2.A=T1.A AND T3.C=T1.C AND T3.B=T2.B
WHERE T3.D > 0 OR T1.D > 0
File: manual.info.tmp, Node: mrr-optimization, Next: bnl-bka-optimization, Prev: outer-join-simplification, Up: select-optimization
8.2.1.10 Multi-Range Read Optimization
......................................
Reading rows using a range scan on a secondary index can result in many random disk accesses to the base table when the table is large and not stored in the storage engine's cache. With the Disk-Sweep Multi-Range Read (MRR) optimization, MySQL tries to reduce the number of random disk accesses for range scans by first scanning the index only and collecting the keys for the relevant rows. Then the keys are sorted, and finally the rows are retrieved from the base table in the order of the primary key. The motivation for Disk-Sweep MRR is to reduce the number of random disk accesses and instead achieve a more sequential scan of the base table data.
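The key idea can be sketched with a toy in-memory model (the index and table contents are made up; row IDs stand in for primary keys):

```python
# The secondary index maps key -> row ID; the base table is keyed by
# row ID. MRR collects matching row IDs from the index alone, sorts
# them, then fetches rows in ascending ID order, turning scattered
# lookups into a forward sweep.
index = [("apple", 42), ("apricot", 7), ("avocado", 91), ("banana", 3)]
table = {3: "row 3", 7: "row 7", 42: "row 42", 91: "row 91"}

def range_scan_mrr(index, table, lo, hi):
    rowids = [rid for key, rid in index if lo <= key < hi]  # index-only pass
    rowids.sort()                          # the MRR step: sort by row ID
    return [table[rid] for rid in rowids]  # fetch rows in a forward sweep
```

Without the sort, the rows would be fetched in index-key order (42, 7, 91), bouncing back and forth across the table; sorting first yields a single forward pass.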
The Multi-Range Read optimization provides these benefits:
MRR enables data rows to be accessed sequentially rather than in random order, based on index tuples. The server obtains a set of index tuples that satisfy the query conditions, sorts them according to data row ID order, and uses the sorted tuples to retrieve data rows in order. This makes data access more efficient and less expensive.
MRR enables batch processing of requests for key access for operations that require access to data rows through index tuples, such as range index scans and equi-joins that use an index for the join attribute. MRR iterates over a sequence of index ranges to obtain qualifying index tuples. As these results accumulate, they are used to access the corresponding data rows. It is not necessary to acquire all index tuples before starting to read data rows.
The MRR optimization is not supported with secondary indexes created on virtual generated columns. 'InnoDB' supports secondary indexes on virtual generated columns.
The following scenarios illustrate when MRR optimization can be advantageous:
Scenario A: MRR can be used for 'InnoDB' and 'MyISAM' tables for index range scans and equi-join operations.
A portion of the index tuples are accumulated in a buffer.
The tuples in the buffer are sorted by their data row ID.
Data rows are accessed according to the sorted index tuple sequence.
Scenario B: MRR can be used for *note 'NDB': mysql-cluster. tables for multiple-range index scans or when performing an equi-join by an attribute.
A portion of ranges, possibly single-key ranges, is accumulated in a buffer on the central node where the query is submitted.
The ranges are sent to the execution nodes that access data rows.
The accessed rows are packed into packages and sent back to the central node.
The received packages with data rows are placed in a buffer.
Data rows are read from the buffer.
When MRR is used, the 'Extra' column in *note 'EXPLAIN': explain. output shows 'Using MRR'.
'InnoDB' and 'MyISAM' do not use MRR if full table rows need not be accessed to produce the query result. This is the case if results can be produced entirely on the basis of information in the index tuples (through a covering index); in that case, MRR provides no benefit.
Two 'optimizer_switch' system variable flags provide an interface to the use of MRR optimization. The 'mrr' flag controls whether MRR is enabled. If 'mrr' is enabled ('on'), the 'mrr_cost_based' flag controls whether the optimizer attempts to make a cost-based choice between using and not using MRR ('on') or uses MRR whenever possible ('off'). By default, 'mrr' is 'on' and 'mrr_cost_based' is 'on'. See *note switchable-optimizations::.
For MRR, a storage engine uses the value of the 'read_rnd_buffer_size' system variable as a guideline for how much memory it can allocate for its buffer. The engine uses up to 'read_rnd_buffer_size' bytes and determines the number of ranges to process in a single pass.
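As a rough illustration of that guideline, the arithmetic below divides a buffer budget into per-pass batches. The entry size and key count are assumed values, not MySQL's real buffer layout:

```python
# Hypothetical arithmetic: an engine dividing its MRR buffer budget into
# per-pass batches. Sizes are illustrative, not MySQL's internal layout.
read_rnd_buffer_size = 262144   # session value, in bytes (assumed)
entry_bytes = 16                # assumed size of one buffered (key, rowid) pair
keys_per_pass = read_rnd_buffer_size // entry_bytes
total_keys = 1_000_000          # keys produced by the index scan (assumed)
passes = -(-total_keys // keys_per_pass)  # ceiling division
print(keys_per_pass, passes)
```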
File: manual.info.tmp, Node: bnl-bka-optimization, Next: condition-filtering, Prev: mrr-optimization, Up: select-optimization
8.2.1.11 Block Nested-Loop and Batched Key Access Joins .......................................................
In MySQL, a Batched Key Access (BKA) Join algorithm is available that uses both index access to the joined table and a join buffer. The BKA algorithm supports inner join, outer join, and semijoin operations, including nested outer joins. Benefits of BKA include improved join performance due to more efficient table scanning. Also, the Block Nested-Loop (BNL) Join algorithm previously used only for inner joins is extended and can be employed for outer join and semijoin operations, including nested outer joins.
The following sections discuss the join buffer management that underlies the extension of the original BNL algorithm, the extended BNL algorithm, and the BKA algorithm. For information about semijoin strategies, see *note semijoins::.
*note join-buffer-management::
*note bnl-optimization::
*note bka-optimization::
*note bnl-bka-optimizer-hints::
Join Buffer Management for Block Nested-Loop and Batched Key Access Algorithms
MySQL can employ join buffers to execute not only inner joins without index access to the inner table, but also outer joins and semijoins that appear after subquery flattening. Moreover, a join buffer can be effectively used when there is an index access to the inner table.
The join buffer management code utilizes join buffer space slightly more efficiently when storing the values of the interesting row columns: No additional bytes are allocated in buffers for a row column if its value is 'NULL', and the minimum number of bytes is allocated for any value of the *note 'VARCHAR': char. type.
The code supports two types of buffers, regular and incremental. Suppose that join buffer 'B1' is employed to join tables 't1' and 't2' and the result of this operation is joined with table 't3' using join buffer 'B2':
A regular join buffer contains columns from each join operand. If 'B2' is a regular join buffer, each row R put into 'B2' is composed of the columns of a row R1 from 'B1' and the interesting columns of a matching row R2 from table 't3'.
An incremental join buffer contains only columns from rows of the table produced by the second join operand. That is, it is incremental to a row from the first operand buffer. If 'B2' is an incremental join buffer, it contains the interesting columns of the row R2 together with a link to the row R1 from 'B1'.
Incremental join buffers are always incremental relative to a join buffer from an earlier join operation, so the buffer from the first join operation is always a regular buffer. In the example just given, the buffer 'B1' used to join tables 't1' and 't2' must be a regular buffer.
Each row of the incremental buffer used for a join operation contains only the interesting columns of a row from the table to be joined. These columns are augmented with a reference to the interesting columns of the matched row from the table produced by the first join operand. Several rows in the incremental buffer can refer to the same row R whose columns are stored in the previous join buffers insofar as all these rows match row R.
Incremental buffers enable less frequent copying of columns from buffers used for previous join operations. This provides a savings in buffer space because in the general case a row produced by the first join operand can be matched by several rows produced by the second join operand. It is unnecessary to make several copies of a row from the first operand. Incremental buffers also provide a savings in processing time due to the reduction in copying time.
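A rough Python model of that space saving follows. The buffer contents are toy tuples, not MySQL's internal row format:

```python
# Hypothetical model of regular vs. incremental join buffers.
# B1 holds joined (t1, t2) rows; B2 buffers the join with t3.

b1 = [("t1r1", "t2r1"), ("t1r2", "t2r2")]  # regular buffer for t1 JOIN t2

# Regular B2: every matching t3 row copies the full B1 row alongside it.
regular_b2 = [(b1[0], "t3a"), (b1[0], "t3b"), (b1[1], "t3c")]

# Incremental B2: store only the t3 columns plus a link into B1, so a B1
# row matched by several t3 rows is never copied into B2.
incremental_b2 = [(0, "t3a"), (0, "t3b"), (1, "t3c")]

# Either representation reconstructs the same joined rows:
joined = [b1[link] + (t3,) for link, t3 in incremental_b2]
print(joined)
```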
The 'block_nested_loop' and 'batched_key_access' flags of the 'optimizer_switch' system variable control how the optimizer uses the Block Nested-Loop and Batched Key Access join algorithms. By default, 'block_nested_loop' is 'on' and 'batched_key_access' is 'off'. See *note switchable-optimizations::. Optimizer hints may also be applied; see *note bnl-bka-optimizer-hints::.
For information about semijoin strategies, see *note semijoins::.
Block Nested-Loop Algorithm for Outer Joins and Semijoins
The original implementation of the MySQL BNL algorithm is extended to support outer join and semijoin operations.
When these operations are executed with a join buffer, each row put into the buffer is supplied with a match flag.
If an outer join operation is executed using a join buffer, each row of the table produced by the second operand is checked for a match against each row in the join buffer. When a match is found, a new extended row is formed (the original row plus columns from the second operand) and sent for further extensions by the remaining join operations. In addition, the match flag of the matched row in the buffer is enabled. After all rows of the table to be joined have been examined, the join buffer is scanned. Each row from the buffer that does not have its match flag enabled is extended by 'NULL' complements ('NULL' values for each column in the second operand) and sent for further extensions by the remaining join operations.
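The match-flag mechanism above can be sketched as follows. The function and data names are hypothetical, and each row's "interesting columns" are modeled as a dict:

```python
# Hypothetical sketch of a buffered outer join: rows in the join buffer
# carry a match flag; rows still unmatched after the inner table has been
# scanned are extended with a NULL complement.

def buffered_left_join(buffer_rows, inner_rows, key):
    matched = [False] * len(buffer_rows)
    out = []
    for inner in inner_rows:                      # one scan of the inner table
        for i, outer in enumerate(buffer_rows):   # check against whole buffer
            if outer[key] == inner[key]:
                out.append((outer, inner))        # extended row
                matched[i] = True                 # enable the match flag
    for i, outer in enumerate(buffer_rows):       # final scan of the buffer
        if not matched[i]:
            out.append((outer, None))             # NULL complement
    return out

left = [{"id": 1}, {"id": 2}]
right = [{"id": 1, "v": "x"}]
print(buffered_left_join(left, right, "id"))
```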
The 'block_nested_loop' flag of the 'optimizer_switch' system variable controls how the optimizer uses the Block Nested-Loop algorithm. By default, 'block_nested_loop' is 'on'. See *note switchable-optimizations::. Optimizer hints may also be applied; see *note bnl-bka-optimizer-hints::.
In *note 'EXPLAIN': explain. output, use of BNL for a table is signified when the 'Extra' value contains 'Using join buffer (Block Nested Loop)' and the 'type' value is 'ALL', 'index', or 'range'.
Some cases involving the combination of one or more subqueries with one or more left joins, particularly those returning many rows, may use BNL even though it is not ideal in such instances. This is a known issue that is fixed in MySQL 8.0. If upgrading MySQL is not immediately feasible, you can improve the performance of such queries by disabling BNL, either by setting 'optimizer_switch='block_nested_loop=off'' or by employing the 'NO_BNL' optimizer hint; by using one or more index hints (see *note index-hints::); or by some combination of these.
For information about semijoin strategies, see *note semijoins::.
Batched Key Access Joins
MySQL implements a method of joining tables called the Batched Key Access (BKA) join algorithm. BKA can be applied when there is an index access to the table produced by the second join operand. Like the BNL join algorithm, the BKA join algorithm employs a join buffer to accumulate the interesting columns of the rows produced by the first operand of the join operation. Then the BKA algorithm builds keys to access the table to be joined for all rows in the buffer and submits these keys in a batch to the database engine for index lookups. The keys are submitted to the engine through the Multi-Range Read (MRR) interface (see *note mrr-optimization::). After submission of the keys, the MRR engine functions perform lookups in the index in an optimal way, fetching the rows of the joined table found by these keys, and start feeding the BKA join algorithm with matching rows. Each matching row is coupled with a reference to a row in the join buffer.
When BKA is used, the value of 'join_buffer_size' defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access is made to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the 'batched_key_access' flag of the 'optimizer_switch' system variable must be set to 'on'. BKA uses MRR, so the 'mrr' flag must also be 'on'. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for 'mrr_cost_based' to be 'off' for BKA to be used. The following setting enables BKA:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
There are two scenarios by which MRR functions execute:
The first scenario is used for conventional disk-based storage engines such as note 'InnoDB': innodb-storage-engine. and note 'MyISAM': myisam-storage-engine. For these engines, usually the keys for all rows from the join buffer are submitted to the MRR interface at once. Engine-specific MRR functions perform index lookups for the submitted keys, get row IDs (or primary keys) from them, and then fetch rows for all these selected row IDs one by one by request from the BKA algorithm. Every row is returned with an association reference that enables access to the matched row in the join buffer. The rows are fetched by the MRR functions in an optimal way: They are fetched in row ID (primary key) order. This improves performance because reads are in disk order rather than random order.
The second scenario is used for remote storage engines such as *note 'NDB': mysql-cluster. A package of keys for a portion of rows from the join buffer, together with their associations, is sent by a MySQL Server (SQL node) to NDB Cluster data nodes. In return, the SQL node receives a package (or several packages) of matching rows coupled with corresponding associations. The BKA join algorithm takes these rows and builds new joined rows. Then a new set of keys is sent to the data nodes and the rows from the returned packages are used to build new joined rows. The process continues until the last keys from the join buffer are sent to the data nodes, and the SQL node has received and joined all rows matching these keys. This improves performance because fewer key-bearing packages sent by the SQL node to the data nodes means fewer round trips between it and the data nodes to perform the join operation.
With the first scenario, a portion of the join buffer is reserved to store row IDs (primary keys) selected by index lookups and passed as a parameter to the MRR functions.
There is no special buffer to store keys built for rows from the join buffer. Instead, a function that builds the key for the next row in the buffer is passed as a parameter to the MRR functions.
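The batch-and-associate pattern described above can be sketched in Python. The index is modeled as a dict and all names are hypothetical; a real engine would probe through the MRR interface rather than a dict:

```python
# Hypothetical sketch of one BKA pass: build join keys for all buffered
# rows, submit them as one sorted batch, and couple each fetched inner row
# back to its buffer row via the association returned with it.

def bka_batch(join_buffer, index, key_col):
    # Build (key, association) pairs for the whole buffer at once.
    batch = [(row[key_col], i) for i, row in enumerate(join_buffer)]
    results = []
    for key, assoc in sorted(batch):       # probe in key order, MRR-style
        for inner in index.get(key, []):
            results.append((join_buffer[assoc], inner))
    return results

buffer_rows = [{"a": 2}, {"a": 1}]
inner_index = {1: ["i1"], 2: ["i2a", "i2b"]}
result = bka_batch(buffer_rows, inner_index, "a")
print(result)
```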
In *note 'EXPLAIN': explain. output, use of BKA for a table is signified when the 'Extra' value contains 'Using join buffer (Batched Key Access)' and the 'type' value is 'ref' or 'eq_ref'.
Optimizer Hints for Block Nested-Loop and Batched Key Access Algorithms
In addition to using the 'optimizer_switch' system variable to control optimizer use of the BNL and BKA algorithms session-wide, MySQL supports optimizer hints to influence the optimizer on a per-statement basis. See *note optimizer-hints::.
To use a BNL or BKA hint to enable join buffering for any inner table of an outer join, join buffering must be enabled for all inner tables of the outer join.
File: manual.info.tmp, Node: condition-filtering, Next: is-null-optimization, Prev: bnl-bka-optimization, Up: select-optimization
8.2.1.12 Condition Filtering ............................
In join processing, prefix rows are those rows passed from one table in a join to the next. In general, the optimizer attempts to put tables with low prefix counts early in the join order to keep the number of row combinations from increasing rapidly. The better the optimizer can use information about conditions on rows selected from one table and passed to the next, the more accurately it can compute row estimates and choose the best execution plan.
Without condition filtering, the prefix row count for a table is based on the estimated number of rows selected by the 'WHERE' clause according to whichever access method the optimizer chooses. Condition filtering enables the optimizer to use other relevant conditions in the 'WHERE' clause not taken into account by the access method, and thus improve its prefix row count estimates. For example, even though there might be an index-based access method that can be used to select rows from the current table in a join, there might also be additional conditions for the table in the 'WHERE' clause that can filter (further restrict) the estimate for qualifying rows passed to the next table.
A condition contributes to the filtering estimate only if:
It refers to the current table.
It depends on a constant value or values from earlier tables in the join sequence.
It was not already taken into account by the access method.
In *note 'EXPLAIN': explain. output, the 'rows' column indicates the row estimate for the chosen access method, and the 'filtered' column reflects the effect of condition filtering. 'filtered' values are expressed as percentages. The maximum value is 100, which means no filtering of rows occurred. Values decreasing from 100 indicate increasing amounts of filtering.
The prefix row count (the number of rows estimated to be passed from the current table in a join to the next) is the product of the 'rows' and 'filtered' values. That is, the prefix row count is the estimated row count, reduced by the estimated filtering effect. For example, if 'rows' is 1000 and 'filtered' is 20%, condition filtering reduces the estimated row count of 1000 to a prefix row count of 1000 x 20% = 1000 x .2 = 200.
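The formula just given can be checked numerically; the values are the ones from the example in the text:

```python
# Prefix row count = rows x filtered, with 'filtered' given in percent.
rows = 1000          # estimate from the chosen access method
filtered = 20.0      # EXPLAIN 'filtered' value, in percent
prefix_rows = rows * filtered / 100
print(prefix_rows)
```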
Consider the following query:
SELECT *
FROM employee JOIN department ON employee.dept_no = department.dept_no
WHERE employee.first_name = 'John'
AND employee.hire_date BETWEEN '2018-01-01' AND '2018-06-01';
Suppose that the data set has these characteristics:
The 'employee' table has 1024 rows.
The 'department' table has 12 rows.
Both tables have an index on 'dept_no'.
The 'employee' table has an index on 'first_name'.
8 rows satisfy this condition on 'employee.first_name':
employee.first_name = 'John'
150 rows satisfy this condition on 'employee.hire_date':
employee.hire_date BETWEEN '2018-01-01' AND '2018-06-01'
1 row satisfies both conditions:
employee.first_name = 'John'
AND employee.hire_date BETWEEN '2018-01-01' AND '2018-06-01'
Without condition filtering, *note 'EXPLAIN': explain. produces output like this:
+----+------------+--------+------------------+---------+---------+------+----------+
| id | table | type | possible_keys | key | ref | rows | filtered |
+----+------------+--------+------------------+---------+---------+------+----------+
| 1 | employee | ref | name,h_date,dept | name | const | 8 | 100.00 |
| 1 | department | eq_ref | PRIMARY | PRIMARY | dept_no | 1 | 100.00 |
+----+------------+--------+------------------+---------+---------+------+----------+
For 'employee', the access method on the 'name' index picks up the 8 rows that match a name of ''John''. No filtering is done ('filtered' is 100%), so all rows are prefix rows for the next table: The prefix row count is 'rows' x 'filtered' = 8 x 100% = 8.
With condition filtering, the optimizer additionally takes into account conditions from the 'WHERE' clause not taken into account by the access method. In this case, the optimizer uses heuristics to estimate a filtering effect of 16.31% for the 'BETWEEN' condition on 'employee.hire_date'. As a result, *note 'EXPLAIN': explain. produces output like this:
+----+------------+--------+------------------+---------+---------+------+----------+
| id | table | type | possible_keys | key | ref | rows | filtered |
+----+------------+--------+------------------+---------+---------+------+----------+
| 1 | employee | ref | name,h_date,dept | name | const | 8 | 16.31 |
| 1 | department | eq_ref | PRIMARY | PRIMARY | dept_no | 1 | 100.00 |
+----+------------+--------+------------------+---------+---------+------+----------+
Now the prefix row count is 'rows' x 'filtered' = 8 x 16.31% = 1.3, which more closely reflects the actual data set.
Normally, the optimizer does not calculate the condition filtering effect (prefix row count reduction) for the last joined table because there is no next table to pass rows to. An exception occurs for *note 'EXPLAIN': explain.: To provide more information, the filtering effect is calculated for all joined tables, including the last one.
To control whether the optimizer considers additional filtering conditions, use the 'condition_fanout_filter' flag of the 'optimizer_switch' system variable (see *note switchable-optimizations::). This flag is enabled by default but can be disabled to suppress condition filtering (for example, if a particular query is found to yield better performance without it).
If the optimizer overestimates the effect of condition filtering, performance may be worse than if condition filtering is not used. In such cases, these techniques may help:
If a column is not indexed, index it so that the optimizer has some information about the distribution of column values and can improve its row estimates.
Change the join order. Ways to accomplish this include join-order optimizer hints (see *note optimizer-hints::), 'STRAIGHT_JOIN' immediately following the 'SELECT', and the 'STRAIGHT_JOIN' join operator.
Disable condition filtering for the session:
SET optimizer_switch = 'condition_fanout_filter=off';
File: manual.info.tmp, Node: is-null-optimization, Next: order-by-optimization, Prev: condition-filtering, Up: select-optimization
8.2.1.13 IS NULL Optimization .............................
MySQL can perform the same optimization on COL_NAME 'IS NULL' that it can use for COL_NAME '=' CONSTANT_VALUE. For example, MySQL can use indexes and ranges to search for 'NULL' with 'IS NULL'.
Examples:
SELECT * FROM TBL_NAME WHERE KEY_COL IS NULL;
SELECT * FROM TBL_NAME WHERE KEY_COL <=> NULL;
SELECT * FROM TBL_NAME
WHERE KEY_COL=CONST1 OR KEY_COL=CONST2 OR KEY_COL IS NULL;
If a 'WHERE' clause includes a COL_NAME 'IS NULL' condition for a column that is declared as 'NOT NULL', that expression is optimized away. This optimization does not occur in cases when the column might produce 'NULL' anyway (for example, if it comes from a table on the right side of a 'LEFT JOIN').
MySQL can also optimize the combination 'COL_NAME = EXPR OR COL_NAME IS NULL', a form that is common in resolved subqueries. *note 'EXPLAIN': explain. shows 'ref_or_null' when this optimization is used.
This optimization can handle one 'IS NULL' for any key part.
Some examples of queries that are optimized, assuming that there is an index on columns 'a' and 'b' of table 't2':
SELECT * FROM t1 WHERE t1.a=EXPR OR t1.a IS NULL;
SELECT * FROM t1, t2 WHERE t1.a=t2.a OR t2.a IS NULL;
SELECT * FROM t1, t2
WHERE (t1.a=t2.a OR t2.a IS NULL) AND t2.b=t1.b;
SELECT * FROM t1, t2
WHERE t1.a=t2.a AND (t2.b=t1.b OR t2.b IS NULL);
SELECT * FROM t1, t2
WHERE (t1.a=t2.a AND t2.a IS NULL AND ...)
OR (t1.a=t2.a AND t2.a IS NULL AND ...);
'ref_or_null' works by first doing a read on the reference key, and then a separate search for rows with a 'NULL' key value.
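That two-probe strategy can be sketched as follows. The index is modeled as a dict mapping key values to row IDs, with Python 'None' standing in for SQL 'NULL'; the function name is hypothetical:

```python
# Hypothetical sketch of ref_or_null: one probe for the constant key value,
# then a second probe for NULL, mirroring "col = const OR col IS NULL".

def ref_or_null_lookup(index, value):
    # index maps key value -> list of row IDs; None models SQL NULL
    return index.get(value, []) + index.get(None, [])

idx = {7: [101, 102], None: [205], 9: [300]}
print(ref_or_null_lookup(idx, 7))  # rows with key 7, then rows with NULL key
```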
The optimization can handle only one 'IS NULL' level. In the following query, MySQL uses key lookups only on the expression '(t1.a=t2.a AND t2.a IS NULL)' and is not able to use the key part on 'b':
SELECT * FROM t1, t2
WHERE (t1.a=t2.a AND t2.a IS NULL)
OR (t1.b=t2.b AND t2.b IS NULL);
File: manual.info.tmp, Node: order-by-optimization, Next: group-by-optimization, Prev: is-null-optimization, Up: select-optimization
8.2.1.14 ORDER BY Optimization ..............................
This section describes when MySQL can use an index to satisfy an 'ORDER BY' clause, the 'filesort' operation used when an index cannot be used, and execution plan information available from the optimizer about 'ORDER BY'.
An 'ORDER BY' with and without 'LIMIT' may return rows in different orders, as discussed in *note limit-optimization::.
*note order-by-index-use::
*note order-by-filesort::
*note order-by-optimizer-control::
*note order-by-diagnostic-information::
Use of Indexes to Satisfy ORDER BY
In some cases, MySQL may use an index to satisfy an 'ORDER BY' clause and avoid the extra sorting involved in performing a 'filesort' operation.
The index may also be used even if the 'ORDER BY' does not match the index exactly, as long as all unused portions of the index and all extra 'ORDER BY' columns are constants in the 'WHERE' clause. If the index does not contain all columns accessed by the query, the index is used only if index access is cheaper than other access methods.
Assuming that there is an index on '(KEY_PART1, KEY_PART2)', the following queries may use the index to resolve the 'ORDER BY' part. Whether the optimizer actually does so depends on whether reading the index is more efficient than a table scan if columns not in the index must also be read.
In this query, the index on '(KEY_PART1, KEY_PART2)' enables the optimizer to avoid sorting:
SELECT * FROM t1
ORDER BY KEY_PART1, KEY_PART2;
However, the query uses 'SELECT *', which may select more columns than KEY_PART1 and KEY_PART2. In that case, scanning an entire index and looking up table rows to find columns not in the index may be more expensive than scanning the table and sorting the results. If so, the optimizer is not likely to use the index. If 'SELECT *' selects only the index columns, the index is used and sorting avoided.
If 't1' is an 'InnoDB' table, the table primary key is implicitly part of the index, and the index can be used to resolve the 'ORDER BY' for this query:
SELECT PK, KEY_PART1, KEY_PART2 FROM t1
ORDER BY KEY_PART1, KEY_PART2;
In this query, KEY_PART1 is constant, so all rows accessed through the index are in KEY_PART2 order, and an index on '(KEY_PART1, KEY_PART2)' avoids sorting if the 'WHERE' clause is selective enough to make an index range scan cheaper than a table scan:
SELECT * FROM t1
WHERE KEY_PART1 = CONSTANT
ORDER BY KEY_PART2;
In the next two queries, whether the index is used is similar to the same queries without 'DESC' shown previously:
SELECT * FROM t1
ORDER BY KEY_PART1 DESC, KEY_PART2 DESC;
SELECT * FROM t1
WHERE KEY_PART1 = CONSTANT
ORDER BY KEY_PART2 DESC;
In the next two queries, KEY_PART1 is compared to a constant. The index is used if the 'WHERE' clause is selective enough to make an index range scan cheaper than a table scan:
SELECT * FROM t1
WHERE KEY_PART1 > CONSTANT
ORDER BY KEY_PART1 ASC;
SELECT * FROM t1
WHERE KEY_PART1 < CONSTANT
ORDER BY KEY_PART1 DESC;
In the next query, the 'ORDER BY' does not name KEY_PART1, but all rows selected have a constant KEY_PART1 value, so the index can still be used:
SELECT * FROM t1
WHERE KEY_PART1 = CONSTANT1 AND KEY_PART2 > CONSTANT2
ORDER BY KEY_PART2;
In some cases, MySQL cannot use indexes to resolve the 'ORDER BY', although it may still use indexes to find the rows that match the 'WHERE' clause. Examples:
The query uses 'ORDER BY' on different indexes:
SELECT * FROM t1 ORDER BY KEY1, KEY2;
The query uses 'ORDER BY' on nonconsecutive parts of an index:
SELECT * FROM t1 WHERE KEY2=CONSTANT ORDER BY KEY1_PART1, KEY1_PART3;
The query mixes 'ASC' and 'DESC':
SELECT * FROM t1 ORDER BY KEY_PART1 DESC, KEY_PART2 ASC;
The index used to fetch the rows differs from the one used in the 'ORDER BY':
SELECT * FROM t1 WHERE KEY2=CONSTANT ORDER BY KEY1;
The query uses 'ORDER BY' with an expression that includes terms other than the index column name:
SELECT * FROM t1 ORDER BY ABS(KEY);
SELECT * FROM t1 ORDER BY -KEY;
The query joins many tables, and the columns in the 'ORDER BY' are not all from the first nonconstant table that is used to retrieve rows. (This is the first table in the *note 'EXPLAIN': explain. output that does not have a 'const' join type.)
The query has different 'ORDER BY' and 'GROUP BY' expressions.
There is an index on only a prefix of a column named in the 'ORDER BY' clause. In this case, the index cannot be used to fully resolve the sort order. For example, if only the first 10 bytes of a *note 'CHAR(20)': char. column are indexed, the index cannot distinguish values past the 10th byte and a 'filesort' is needed.
The index does not store rows in order. For example, this is true for a 'HASH' index in a 'MEMORY' table.
Availability of an index for sorting may be affected by the use of column aliases. Suppose that the column 't1.a' is indexed. In this statement, the name of the column in the select list is 'a'. It refers to 't1.a', as does the reference to 'a' in the 'ORDER BY', so the index on 't1.a' can be used:
SELECT a FROM t1 ORDER BY a;
In this statement, the name of the column in the select list is also 'a', but it is the alias name. It refers to 'ABS(a)', as does the reference to 'a' in the 'ORDER BY', so the index on 't1.a' cannot be used:
SELECT ABS(a) AS a FROM t1 ORDER BY a;
In the following statement, the 'ORDER BY' refers to a name that is not the name of a column in the select list. But there is a column in 't1' named 'a', so the 'ORDER BY' refers to 't1.a' and the index on 't1.a' can be used. (The resulting sort order may be completely different from the order for 'ABS(a)', of course.)
SELECT ABS(a) AS b FROM t1 ORDER BY a;
By default, MySQL sorts 'GROUP BY COL1, COL2, ...' queries as if you also included 'ORDER BY COL1, COL2, ...' in the query. If you include an explicit 'ORDER BY' clause that contains the same column list, MySQL optimizes it away without any speed penalty, although the sorting still occurs.
If a query includes 'GROUP BY' but you want to avoid the overhead of sorting the result, you can suppress sorting by specifying 'ORDER BY NULL'. For example:
INSERT INTO foo
SELECT a, COUNT(*) FROM bar GROUP BY a ORDER BY NULL;
The optimizer may still choose to use sorting to implement grouping operations. 'ORDER BY NULL' suppresses sorting of the result, not prior sorting done by grouping operations to determine the result.
Note:
'GROUP BY' implicitly sorts by default (that is, in the absence of 'ASC' or 'DESC' designators for 'GROUP BY' columns). However, relying on implicit 'GROUP BY' sorting (that is, sorting in the absence of 'ASC' or 'DESC' designators) or explicit sorting for 'GROUP BY' (that is, by using explicit 'ASC' or 'DESC' designators for 'GROUP BY' columns) is deprecated. To produce a given sort order, provide an 'ORDER BY' clause.
Use of filesort to Satisfy ORDER BY
If an index cannot be used to satisfy an 'ORDER BY' clause, MySQL performs a 'filesort' operation that reads table rows and sorts them. A 'filesort' constitutes an extra sorting phase in query execution.
To obtain memory for 'filesort' operations, the optimizer allocates a fixed amount of 'sort_buffer_size' bytes up front. Individual sessions can change the session value of this variable as desired to avoid excessive memory use, or to allocate more memory as necessary.
A 'filesort' operation uses temporary disk files as necessary if the result set is too large to fit in memory. Some types of queries are particularly suited to completely in-memory 'filesort' operations. For example, the optimizer can use 'filesort' to efficiently handle in memory, without temporary files, the 'ORDER BY' operation for queries (and subqueries) of the following form:
SELECT ... FROM SINGLE_TABLE ... ORDER BY NON_INDEX_COLUMN [DESC] LIMIT [M,]N;
Such queries are common in web applications that display only a few rows from a larger result set. Examples:
SELECT col1, ... FROM t1 ... ORDER BY name LIMIT 10;
SELECT col1, ... FROM t1 ... ORDER BY RAND() LIMIT 15;
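One reason such queries suit an in-memory 'filesort' can be sketched with a bounded priority queue, which is one way (under the assumptions noted in the comments) to produce the top N rows without holding or spilling the full result set:

```python
# Hypothetical sketch: ORDER BY non_index_column LIMIT N via a bounded
# priority queue, so memory use depends on N, not on the result set size.
import heapq

def order_by_limit(rows, key, n):
    # heapq.nsmallest keeps at most n entries while scanning all rows.
    return heapq.nsmallest(n, rows, key=key)

rows = [{"name": c} for c in "delta alpha echo bravo charlie".split()]
print(order_by_limit(rows, lambda r: r["name"], 2))
```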
Influencing ORDER BY Optimization
For slow 'ORDER BY' queries for which 'filesort' is not used, try lowering the 'max_length_for_sort_data' system variable to a value that is appropriate to trigger a 'filesort'. (A symptom of setting the value of this variable too high is a combination of high disk activity and low CPU activity.)
To increase 'ORDER BY' speed, check whether you can get MySQL to use indexes rather than an extra sorting phase. If this is not possible, try the following strategies:
Increase the 'sort_buffer_size' variable value. Ideally, the value should be large enough for the entire result set to fit in the sort buffer (to avoid writes to disk and merge passes), but at minimum the value must be large enough to accommodate 15 tuples. (Up to 15 temporary disk files are merged and there must be room in memory for at least one tuple per file.)
Take into account that the size of column values stored in the sort buffer is affected by the 'max_sort_length' system variable value. For example, if tuples store values of long string columns and you increase the value of 'max_sort_length', the size of sort buffer tuples increases as well and may require you to increase 'sort_buffer_size'. For column values calculated as a result of string expressions (such as those that invoke a string-valued function), the 'filesort' algorithm cannot tell the maximum length of expression values, so it must allocate 'max_sort_length' bytes for each tuple.
To monitor the number of merge passes (to merge temporary files), check the 'Sort_merge_passes' status variable.
Increase the 'read_rnd_buffer_size' variable value so that more rows are read at a time.
Change the 'tmpdir' system variable to point to a dedicated file system with large amounts of free space. The variable value can list several paths that are used in round-robin fashion; you can use this feature to spread the load across several directories. Separate the paths by colon characters (':') on Unix and semicolon characters (';') on Windows. The paths should name directories in file systems located on different physical disks, not different partitions on the same disk.
ORDER BY Execution Plan Information Available
With *note 'EXPLAIN': explain. (see *note using-explain::), you can check whether MySQL can use indexes to resolve an 'ORDER BY' clause:
If the 'Extra' column of *note 'EXPLAIN': explain. output does not contain 'Using filesort', the index is used and a 'filesort' is not performed.
If the 'Extra' column of *note 'EXPLAIN': explain. output contains 'Using filesort', the index is not used and a 'filesort' is performed.
In addition, if a 'filesort' is performed, optimizer trace output includes a 'filesort_summary' block. For example:
"filesort_summary": {
"rows": 100,
"examined_rows": 100,
"number_of_tmp_files": 0,
"sort_buffer_size": 25192,
"sort_mode": "<sort_key, packed_additional_fields>"
}
The 'sort_mode' value provides information about the contents of tuples in the sort buffer:
'<sort_key, rowid>': Sort buffer tuples are pairs that contain the sort key value and the row ID of the original table row. The tuples are sorted by sort key value, and the row ID is used to read the row from the table.
'<sort_key, additional_fields>': Sort buffer tuples contain the sort key value and the columns referenced by the query. The tuples are sorted by sort key value, and column values are read directly from the tuple.
'<sort_key, packed_additional_fields>': Like the previous variant, but the additional columns are packed together rather than using a fixed-length encoding.
*note 'EXPLAIN': explain. does not distinguish whether the optimizer does or does not perform a 'filesort' in memory. Use of an in-memory 'filesort' can be seen in optimizer trace output. Look for 'filesort_priority_queue_optimization'. For information about the optimizer trace, see *note optimizer-tracing::.
File: manual.info.tmp, Node: group-by-optimization, Next: distinct-optimization, Prev: order-by-optimization, Up: select-optimization
8.2.1.15 GROUP BY Optimization ..............................
The most general way to satisfy a 'GROUP BY' clause is to scan the whole table and create a new temporary table where all rows from each group are consecutive, and then use this temporary table to discover groups and apply aggregate functions (if any). In some cases, MySQL is able to do much better than that and avoid creation of temporary tables by using index access.
The most important preconditions for using indexes for 'GROUP BY' are that all 'GROUP BY' columns reference attributes from the same index, and that the index stores its keys in order (as is true, for example, for a 'BTREE' index, but not for a 'HASH' index). Whether use of temporary tables can be replaced by index access also depends on which parts of an index are used in a query, the conditions specified for these parts, and the selected aggregate functions.
There are two ways to execute a 'GROUP BY' query through index access, as detailed in the following sections. The first method applies the grouping operation together with all range predicates (if any). The second method first performs a range scan, and then groups the resulting tuples.
In MySQL, 'GROUP BY' is used for sorting, so the server may also apply 'ORDER BY' optimizations to grouping. However, relying on implicit or explicit 'GROUP BY' sorting is deprecated. See *note order-by-optimization::.
*note loose-index-scan::
*note tight-index-scan::
Loose Index Scan
The most efficient way to process 'GROUP BY' is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, 'BTREE'). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all 'WHERE' conditions. This access method considers only a fraction of the keys in an index, so it is called a Loose Index Scan. When there is no 'WHERE' clause, a Loose Index Scan reads as many keys as the number of groups, which may be a much smaller number than that of all keys. If the 'WHERE' clause contains range predicates (see the discussion of the 'range' join type in *note using-explain::), a Loose Index Scan looks up the first key of each group that satisfies the range conditions, and again reads the smallest possible number of keys. This is possible under the following conditions:
The query is over a single table.
The 'GROUP BY' names only columns that form a leftmost prefix of the index and no other columns. (If, instead of 'GROUP BY', the query has a 'DISTINCT' clause, all distinct attributes refer to columns that form a leftmost prefix of the index.) For example, if a table 't1' has an index on '(c1,c2,c3)', Loose Index Scan is applicable if the query has 'GROUP BY c1, c2'. It is not applicable if the query has 'GROUP BY c2, c3' (the columns are not a leftmost prefix) or 'GROUP BY c1, c2, c4' ('c4' is not in the index).
The only aggregate functions used in the select list (if any) are 'MIN()' and 'MAX()', and all of them refer to the same column. The column must be in the index and must immediately follow the columns in the 'GROUP BY'.
Any other parts of the index than those from the 'GROUP BY' referenced in the query must be constants (that is, they must be referenced in equalities with constants), except for the argument of 'MIN()' or 'MAX()' functions.
For columns in the index, full column values must be indexed, not just a prefix. For example, with 'c1 VARCHAR(20), INDEX (c1(10))', the index uses only a prefix of 'c1' values and cannot be used for Loose Index Scan.
If Loose Index Scan is applicable to a query, the *note 'EXPLAIN': explain. output shows 'Using index for group-by' in the 'Extra' column.
Assume that there is an index 'idx(c1,c2,c3)' on table 't1(c1,c2,c3,c4)'. The Loose Index Scan access method can be used for the following queries:
SELECT c1, c2 FROM t1 GROUP BY c1, c2;
SELECT DISTINCT c1, c2 FROM t1;
SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
SELECT c1, c2 FROM t1 WHERE c1 < CONST GROUP BY c1, c2;
SELECT MAX(c3), MIN(c3), c1, c2 FROM t1 WHERE c2 > CONST GROUP BY c1, c2;
SELECT c2 FROM t1 WHERE c1 < CONST GROUP BY c1, c2;
SELECT c1, c2 FROM t1 WHERE c3 = CONST GROUP BY c1, c2;
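As an illustration only (a minimal Python sketch of the behavior just described, not MySQL server code), the following models a Loose Index Scan over an ordered index on '(c1, c2)': it reads the first key of each 'c1' group, then jumps past the rest of the group instead of examining every key:

```python
import bisect

# A "BTREE index" on (c1, c2): a sorted list of key tuples.
index = sorted([(1, 10), (1, 20), (1, 30), (2, 5), (2, 15), (3, 7)])

def loose_index_scan_groups(index):
    """Yield the first key of each c1 group, skipping the remainder
    of the group with a binary search instead of reading every key."""
    reads = 0
    pos = 0
    groups = []
    while pos < len(index):
        key = index[pos]
        reads += 1              # only one key read per group
        groups.append(key[0])
        # Jump past all remaining keys with the same c1 value.
        pos = bisect.bisect_right(index, (key[0], float('inf')), lo=pos)
    return groups, reads

groups, reads = loose_index_scan_groups(index)
```

Only three keys are read for three groups, rather than all six keys in the index, which is the "reads as many keys as the number of groups" property described above.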
The following queries cannot be executed with this quick select method, for the reasons given:
There are aggregate functions other than 'MIN()' or 'MAX()':
SELECT c1, SUM(c2) FROM t1 GROUP BY c1;
The columns in the 'GROUP BY' clause do not form a leftmost prefix of the index:
SELECT c1, c2 FROM t1 GROUP BY c2, c3;
The query refers to a part of a key that comes after the 'GROUP BY' part, and for which there is no equality with a constant:
SELECT c1, c3 FROM t1 GROUP BY c1, c2;
Were the query to include 'WHERE c3 = CONST', Loose Index Scan could be used.
The Loose Index Scan access method can be applied to other forms of aggregate function references in the select list, in addition to the 'MIN()' and 'MAX()' references already supported:
'AVG(DISTINCT)', 'SUM(DISTINCT)', and 'COUNT(DISTINCT)' are supported. 'AVG(DISTINCT)' and 'SUM(DISTINCT)' take a single argument. 'COUNT(DISTINCT)' can have more than one column argument.
There must be no 'GROUP BY' or 'DISTINCT' clause in the query.
The Loose Index Scan limitations described previously still apply.
Assume that there is an index 'idx(c1,c2,c3)' on table 't1(c1,c2,c3,c4)'. The Loose Index Scan access method can be used for the following queries:
SELECT COUNT(DISTINCT c1), SUM(DISTINCT c1) FROM t1;
SELECT COUNT(DISTINCT c1, c2), COUNT(DISTINCT c2, c1) FROM t1;
Tight Index Scan
A Tight Index Scan may be either a full index scan or a range index scan, depending on the query conditions.
When the conditions for a Loose Index Scan are not met, it still may be possible to avoid creation of temporary tables for 'GROUP BY' queries. If there are range conditions in the 'WHERE' clause, this method reads only the keys that satisfy these conditions. Otherwise, it performs an index scan. Because this method reads all keys in each range defined by the 'WHERE' clause, or scans the whole index if there are no range conditions, it is called a Tight Index Scan. With a Tight Index Scan, the grouping operation is performed only after all keys that satisfy the range conditions have been found.
For this method to work, it is sufficient that there be a constant equality condition for all columns in a query referring to parts of the key coming before or in between parts of the 'GROUP BY' key. The constants from the equality conditions fill in any 'gaps' in the search keys so that it is possible to form complete prefixes of the index. These index prefixes then can be used for index lookups. If the 'GROUP BY' result requires sorting, and it is possible to form search keys that are prefixes of the index, MySQL also avoids extra sorting operations because searching with prefixes in an ordered index already retrieves all the keys in order.
Assume that there is an index 'idx(c1,c2,c3)' on table 't1(c1,c2,c3,c4)'. The following queries do not work with the Loose Index Scan access method described previously, but still work with the Tight Index Scan access method.
There is a gap in the 'GROUP BY', but it is covered by the condition 'c2 = 'a'':
SELECT c1, c2, c3 FROM t1 WHERE c2 = 'a' GROUP BY c1, c3;
The 'GROUP BY' does not begin with the first part of the key, but there is a condition that provides a constant for that part:
SELECT c1, c2, c3 FROM t1 WHERE c1 = 'a' GROUP BY c2, c3;
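As an illustration only (a Python sketch, not server code), the following models a Tight Index Scan for a query like the first example above ('WHERE c2 = 'a' GROUP BY c1, c3'): every key in the index (or range) is read, and grouping happens only after filtering. Because keys arrive in index order, the group keys come out already sorted, which is why no extra sort is needed:

```python
# "Index" on (c1, c2, c3): sorted key tuples, as in a BTREE.
index = sorted([
    (1, 'a', 1), (1, 'a', 2), (1, 'b', 3),
    (2, 'a', 4), (2, 'b', 5), (3, 'a', 6),
])

def tight_index_scan(index, c2_const):
    """Model WHERE c2 = c2_const GROUP BY c1, c3: read all keys,
    filter on the constant that fills the gap between the GROUP BY
    parts, and collect distinct (c1, c3) group keys."""
    reads = 0
    groups = []
    for c1, c2, c3 in index:        # every key is read (tight scan)
        reads += 1
        if c2 == c2_const and (c1, c3) not in groups:
            groups.append((c1, c3))
    return groups, reads

groups, reads = tight_index_scan(index, 'a')
```

Contrast with the Loose Index Scan: here all six keys are read even though only four groups result.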
File: manual.info.tmp, Node: distinct-optimization, Next: limit-optimization, Prev: group-by-optimization, Up: select-optimization
8.2.1.16 DISTINCT Optimization ..............................
'DISTINCT' combined with 'ORDER BY' needs a temporary table in many cases.
Because 'DISTINCT' may use 'GROUP BY', learn how MySQL works with columns in 'ORDER BY' or 'HAVING' clauses that are not part of the selected columns. See *note group-by-handling::.
In most cases, a 'DISTINCT' clause can be considered as a special case of 'GROUP BY'. For example, the following two queries are equivalent:
SELECT DISTINCT c1, c2, c3 FROM t1
WHERE c1 > CONST;
SELECT c1, c2, c3 FROM t1
WHERE c1 > CONST GROUP BY c1, c2, c3;
Due to this equivalence, the optimizations applicable to 'GROUP BY' queries can be also applied to queries with a 'DISTINCT' clause. Thus, for more details on the optimization possibilities for 'DISTINCT' queries, see *note group-by-optimization::.
When combining 'LIMIT ROW_COUNT' with 'DISTINCT', MySQL stops as soon as it finds ROW_COUNT unique rows.
If you do not use columns from all tables named in a query, MySQL stops scanning any unused tables as soon as it finds the first match. In the following case, assuming that 't1' is used before 't2' (which you can check with *note 'EXPLAIN': explain.), MySQL stops reading from 't2' (for any particular row in 't1') when it finds the first row in 't2':
SELECT DISTINCT t1.a FROM t1, t2 WHERE t1.a=t2.a;
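As an illustration only (a Python sketch of the early-out behavior, not server internals), a nested-loop execution of the query above can stop reading 't2' at the first match for each 't1' row, because further matches cannot add new 'DISTINCT' values:

```python
t1 = [{'a': 1}, {'a': 2}, {'a': 3}]
t2 = [{'a': 1}, {'a': 1}, {'a': 2}, {'a': 2}]

def distinct_t1_a(t1, t2):
    """SELECT DISTINCT t1.a FROM t1, t2 WHERE t1.a = t2.a:
    for each t1 row, stop scanning t2 at the first match."""
    t2_reads = 0
    result = []
    for r1 in t1:
        for r2 in t2:
            t2_reads += 1
            if r1['a'] == r2['a']:
                result.append(r1['a'])
                break           # early out: first match is enough
    return result, t2_reads

result, t2_reads = distinct_t1_a(t1, t2)
```

Here eight 't2' reads suffice instead of the twelve a full cross-scan would need.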
File: manual.info.tmp, Node: limit-optimization, Next: function-optimization, Prev: distinct-optimization, Up: select-optimization
8.2.1.17 LIMIT Query Optimization .................................
If you need only a specified number of rows from a result set, use a 'LIMIT' clause in the query, rather than fetching the whole result set and throwing away the extra data.
MySQL sometimes optimizes a query that has a 'LIMIT ROW_COUNT' clause and no 'HAVING' clause:
If you select only a few rows with 'LIMIT', MySQL uses indexes in some cases when normally it would prefer to do a full table scan.
If you combine 'LIMIT ROW_COUNT' with 'ORDER BY', MySQL stops sorting as soon as it has found the first ROW_COUNT rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the 'LIMIT' clause are selected, and most or all of them are sorted, before the first ROW_COUNT are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
One manifestation of this behavior is that an 'ORDER BY' query with and without 'LIMIT' may return rows in different order, as described later in this section.
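The early termination described above is commonly implemented with a size-bounded priority queue; MySQL's in-memory filesort for 'LIMIT' queries works along these lines (a hedged Python sketch, not server code):

```python
import heapq

rows = [(3, 'c'), (1, 'a'), (4, 'd'), (2, 'b'), (5, 'e')]

def order_by_limit(rows, row_count):
    """ORDER BY key LIMIT row_count via a bounded max-heap: only
    row_count rows are kept in memory, and any row that cannot make
    the top row_count is discarded immediately."""
    heap = []                        # max-heap via negated keys
    for key, payload in rows:
        if len(heap) < row_count:
            heapq.heappush(heap, (-key, payload))
        elif -heap[0][0] > key:      # beats the current worst kept key
            heapq.heapreplace(heap, (-key, payload))
    return sorted((-k, p) for k, p in heap)

top2 = order_by_limit(rows, 2)
```

This produces the same first ROW_COUNT rows as sorting the entire result, without ever sorting more than ROW_COUNT rows at a time.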
If you combine 'LIMIT ROW_COUNT' with 'DISTINCT', MySQL stops as soon as it finds ROW_COUNT unique rows.
In some cases, a 'GROUP BY' can be resolved by reading the index in order (or doing a sort on the index), then calculating summaries until the index value changes. In this case, 'LIMIT ROW_COUNT' does not calculate any unnecessary 'GROUP BY' values.
As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using 'SQL_CALC_FOUND_ROWS'. In that case, the number of rows can be retrieved with 'SELECT FOUND_ROWS()'. See *note information-functions::.
'LIMIT 0' quickly returns an empty set. This can be useful for checking the validity of a query. It can also be employed to obtain the types of the result columns within applications that use a MySQL API that makes result set metadata available. With the *note 'mysql': mysql. client program, you can use the '--column-type-info' option to display result column types.
If the server uses temporary tables to resolve a query, it uses the 'LIMIT ROW_COUNT' clause to calculate how much space is required.
If an index is not used for 'ORDER BY' but a 'LIMIT' clause is also present, the optimizer may be able to avoid using a merge file and sort the rows in memory using an in-memory 'filesort' operation.
If multiple rows have identical values in the 'ORDER BY' columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns.
One factor that affects the execution plan is 'LIMIT', so an 'ORDER BY' query with and without 'LIMIT' may return rows in different orders. Consider this query, which is sorted by the 'category' column but nondeterministic with respect to the 'id' and 'rating' columns:
mysql> SELECT * FROM ratings ORDER BY category;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
| 1 | 1 | 4.5 |
| 5 | 1 | 3.2 |
| 3 | 2 | 3.7 |
| 4 | 2 | 3.5 |
| 6 | 2 | 3.5 |
| 2 | 3 | 5.0 |
| 7 | 3 | 2.7 |
+----+----------+--------+
Including 'LIMIT' may affect order of rows within each 'category' value. For example, this is a valid query result:
mysql> SELECT * FROM ratings ORDER BY category LIMIT 5;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
| 1 | 1 | 4.5 |
| 5 | 1 | 3.2 |
| 4 | 2 | 3.5 |
| 3 | 2 | 3.7 |
| 6 | 2 | 3.5 |
+----+----------+--------+
In each case, the rows are sorted by the 'ORDER BY' column, which is all that is required by the SQL standard.
If it is important to ensure the same row order with and without 'LIMIT', include additional columns in the 'ORDER BY' clause to make the order deterministic. For example, if 'id' values are unique, you can make rows for a given 'category' value appear in 'id' order by sorting like this:
mysql> SELECT * FROM ratings ORDER BY category, id;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
| 1 | 1 | 4.5 |
| 5 | 1 | 3.2 |
| 3 | 2 | 3.7 |
| 4 | 2 | 3.5 |
| 6 | 2 | 3.5 |
| 2 | 3 | 5.0 |
| 7 | 3 | 2.7 |
+----+----------+--------+
mysql> SELECT * FROM ratings ORDER BY category, id LIMIT 5;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
| 1 | 1 | 4.5 |
| 5 | 1 | 3.2 |
| 3 | 2 | 3.7 |
| 4 | 2 | 3.5 |
| 6 | 2 | 3.5 |
+----+----------+--------+
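The effect of the tiebreaker column can be demonstrated outside of SQL (a Python sketch; Python's stable sort stands in for two different execution plans feeding rows in different orders):

```python
rows = [
    {'id': 1, 'category': 1, 'rating': 4.5},
    {'id': 5, 'category': 1, 'rating': 3.2},
    {'id': 3, 'category': 2, 'rating': 3.7},
    {'id': 4, 'category': 2, 'rating': 3.5},
]

# ORDER BY category alone: any permutation within a category is a
# valid result, so two plans that visit rows in different orders
# may return different (but equally correct) results.
plan_a = sorted(rows, key=lambda r: r['category'])
plan_b = sorted(reversed(rows), key=lambda r: r['category'])

# ORDER BY category, id: the unique tiebreaker makes the result
# deterministic regardless of the order in which rows arrive.
det_a = sorted(rows, key=lambda r: (r['category'], r['id']))
det_b = sorted(reversed(rows), key=lambda r: (r['category'], r['id']))
```

'plan_a' and 'plan_b' disagree within each category, while 'det_a' and 'det_b' are identical.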
For a query with an 'ORDER BY' or 'GROUP BY' and a 'LIMIT' clause, the optimizer tries to choose an ordered index by default when it appears doing so would speed up query execution. Prior to MySQL 5.7.33, there was no way to override this behavior, even in cases where using some other optimization might be faster. Beginning with MySQL 5.7.33, it is possible to turn off this optimization by setting the 'optimizer_switch' system variable's 'prefer_ordering_index' flag to 'off'.
Example: First we create and populate a table 't' as shown here:
# Create and populate a table t:
mysql> CREATE TABLE t (
-> id1 BIGINT NOT NULL,
-> id2 BIGINT NOT NULL,
-> c1 VARCHAR(50) NOT NULL,
-> c2 VARCHAR(50) NOT NULL,
-> PRIMARY KEY (id1),
-> INDEX i (id2, c1)
-> );
# [Insert some rows into table t - not shown]
Verify that the 'prefer_ordering_index' flag is enabled:
mysql> SELECT @@optimizer_switch LIKE '%prefer_ordering_index=on%';
+------------------------------------------------------+
| @@optimizer_switch LIKE '%prefer_ordering_index=on%' |
+------------------------------------------------------+
| 1 |
+------------------------------------------------------+
Since the following query has a 'LIMIT' clause, we expect it to use an ordered index, if possible. In this case, as we can see from the *note 'EXPLAIN': explain. output, it uses the table's primary key.
mysql> EXPLAIN SELECT c2 FROM t
-> WHERE id2 > 3
-> ORDER BY id1 ASC LIMIT 2\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
partitions: NULL
type: index
possible_keys: i
key: PRIMARY
key_len: 8
ref: NULL
rows: 2
filtered: 70.00
Extra: Using where
Now we disable the 'prefer_ordering_index' flag, and re-run the same query; this time it uses the index 'i' (which includes the 'id2' column used in the 'WHERE' clause), and a filesort:
mysql> SET optimizer_switch = "prefer_ordering_index=off";
mysql> EXPLAIN SELECT c2 FROM t
-> WHERE id2 > 3
-> ORDER BY id1 ASC LIMIT 2\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t
partitions: NULL
type: range
possible_keys: i
key: i
key_len: 8
ref: NULL
rows: 14
filtered: 100.00
Extra: Using index condition; Using filesort
See also *note switchable-optimizations::.
File: manual.info.tmp, Node: function-optimization, Next: row-constructor-optimization, Prev: limit-optimization, Up: select-optimization
8.2.1.18 Function Call Optimization ...................................
MySQL functions are tagged internally as deterministic or nondeterministic. A function is nondeterministic if, given fixed values for its arguments, it can return different results for different invocations. Examples of nondeterministic functions: 'RAND()', 'UUID()'.
If a function is tagged nondeterministic, a reference to it in a 'WHERE' clause is evaluated for every row (when selecting from one table) or combination of rows (when selecting from a multiple-table join).
MySQL also determines when to evaluate functions based on types of arguments, whether the arguments are table columns or constant values. A deterministic function that takes a table column as argument must be evaluated whenever that column changes value.
Nondeterministic functions may affect query performance. For example, some optimizations may not be available, or more locking might be required. The following discussion uses 'RAND()' but applies to other nondeterministic functions as well.
Suppose that a table 't' has this definition:
CREATE TABLE t (id INT NOT NULL PRIMARY KEY, col_a VARCHAR(100));
Consider these two queries:
SELECT * FROM t WHERE id = POW(1,2);
SELECT * FROM t WHERE id = FLOOR(1 + RAND() * 49);
Both queries appear to use a primary key lookup because of the equality comparison against the primary key, but that is true only for the first of them:
The first query always produces a maximum of one row because 'POW()' with constant arguments is a constant value and is used for index lookup.
The second query contains an expression that uses the nondeterministic function 'RAND()', which is not constant in the query but in fact has a new value for every row of table 't'. Consequently, the query reads every row of the table, evaluates the predicate for each row, and outputs all rows for which the primary key matches the random value. This might be zero, one, or multiple rows, depending on the 'id' column values and the values in the 'RAND()' sequence.
The effects of nondeterminism are not limited to *note 'SELECT': select. statements. This *note 'UPDATE': update. statement uses a nondeterministic function to select rows to be modified:
UPDATE t SET col_a = SOME_EXPR WHERE id = FLOOR(1 + RAND() * 49);
Presumably the intent is to update at most a single row for which the primary key matches the expression. However, it might update zero, one, or multiple rows, depending on the 'id' column values and the values in the 'RAND()' sequence.
The behavior just described has implications for performance and replication:
Because a nondeterministic function does not produce a constant value, the optimizer cannot use strategies that might otherwise be applicable, such as index lookups. The result may be a table scan.
'InnoDB' might escalate to a range-key lock rather than taking a single row lock for one matching row.
Updates that do not execute deterministically are unsafe for replication.
The difficulties stem from the fact that the 'RAND()' function is evaluated once for every row of the table. To avoid multiple function evaluations, use one of these techniques:
Move the expression containing the nondeterministic function to a separate statement, saving the value in a variable. In the original statement, replace the expression with a reference to the variable, which the optimizer can treat as a constant value:
SET @keyval = FLOOR(1 + RAND() * 49);
UPDATE t SET col_a = SOME_EXPR WHERE id = @keyval;
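The difference between per-row evaluation and evaluating the expression once can be sketched in Python (an illustration of the behavior described above, not MySQL internals; 'rng.random()' stands in for 'RAND()'):

```python
import random

def make_table():
    # Rows with primary key ids 1..49, matching FLOOR(1 + RAND() * 49).
    return [{'id': i, 'col_a': 'old'} for i in range(1, 50)]

def update_per_row(table, rng):
    """WHERE id = FLOOR(1 + RAND() * 49) evaluated per row: the
    expression takes a new value for every row, so zero, one, or
    several rows may be updated."""
    changed = 0
    for row in table:
        if row['id'] == int(1 + rng.random() * 49):
            row['col_a'] = 'new'
            changed += 1
    return changed

def update_hoisted(table, rng):
    """SET @keyval = FLOOR(1 + RAND() * 49) first: the value is
    computed once and acts as a constant, so the primary key
    comparison matches exactly one row here."""
    keyval = int(1 + rng.random() * 49)
    changed = 0
    for row in table:
        if row['id'] == keyval:
            row['col_a'] = 'new'
            changed += 1
    return changed
```

The hoisted version always changes exactly one row for this table, while the per-row version may change any number of rows.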
Assign the random value to a variable in a derived table. This technique causes the variable to be assigned a value, once, prior to its use in the comparison in the 'WHERE' clause:
SET optimizer_switch = 'derived_merge=off';
UPDATE t, (SELECT @keyval := FLOOR(1 + RAND() * 49)) AS dt
SET col_a = SOME_EXPR WHERE id = @keyval;
As mentioned previously, a nondeterministic expression in the 'WHERE' clause might prevent optimizations and result in a table scan. However, it may be possible to partially optimize the 'WHERE' clause if other expressions are deterministic. For example:
SELECT * FROM t WHERE partial_key=5 AND some_column=RAND();
If the optimizer can use 'partial_key' to reduce the set of rows selected, 'RAND()' is executed fewer times, which diminishes the effect of nondeterminism on optimization.
File: manual.info.tmp, Node: row-constructor-optimization, Next: table-scan-avoidance, Prev: function-optimization, Up: select-optimization
8.2.1.19 Row Constructor Expression Optimization ................................................
Row constructors permit simultaneous comparisons of multiple values. For example, these two statements are semantically equivalent:
SELECT * FROM t1 WHERE (column1,column2) = (1,1);
SELECT * FROM t1 WHERE column1 = 1 AND column2 = 1;
In addition, the optimizer handles both expressions the same way.
The optimizer is less likely to use available indexes if the row constructor columns do not cover the prefix of an index. Consider the following table, which has a primary key on '(c1, c2, c3)':
CREATE TABLE t1 (
c1 INT, c2 INT, c3 INT, c4 CHAR(100),
PRIMARY KEY(c1,c2,c3)
);
In this query, the 'WHERE' clause uses all columns in the index. However, the row constructor itself does not cover an index prefix, with the result that the optimizer uses only 'c1' ('key_len=4', the size of 'c1'):
mysql> EXPLAIN SELECT * FROM t1
WHERE c1=1 AND (c2,c3) > (1,1)\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 3
filtered: 100.00
Extra: Using where
In such cases, rewriting the row constructor expression using an equivalent nonconstructor expression may result in more complete index use. For the given query, the row constructor and equivalent nonconstructor expressions are:
(c2,c3) > (1,1)
c2 > 1 OR ((c2 = 1) AND (c3 > 1))
Rewriting the query to use the nonconstructor expression results in the optimizer using all three columns in the index ('key_len=12'):
mysql> EXPLAIN SELECT * FROM t1
WHERE c1 = 1 AND (c2 > 1 OR ((c2 = 1) AND (c3 > 1)))\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 12
ref: NULL
rows: 3
filtered: 100.00
Extra: Using where
Thus, for better results, avoid mixing row constructors with 'AND'/'OR' expressions. Use one or the other.
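The equivalence of the row constructor and its expanded form is just lexicographic tuple comparison, which can be checked exhaustively over a small domain (a Python sketch):

```python
from itertools import product

# (c2, c3) > (1, 1) versus c2 > 1 OR (c2 = 1 AND c3 > 1):
# both express lexicographic order, so they agree on every input.
def constructor(c2, c3):
    return (c2, c3) > (1, 1)

def expanded(c2, c3):
    return c2 > 1 or (c2 == 1 and c3 > 1)

mismatches = [(c2, c3)
              for c2, c3 in product(range(-2, 4), repeat=2)
              if constructor(c2, c3) != expanded(c2, c3)]
```

No pair in the tested domain distinguishes the two predicates, so the rewrite is safe while letting the optimizer use more of the index.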
Under certain conditions, the optimizer can apply the range access method to 'IN()' expressions that have row constructor arguments. See *note row-constructor-range-optimization::.
File: manual.info.tmp, Node: table-scan-avoidance, Prev: row-constructor-optimization, Up: select-optimization
8.2.1.20 Avoiding Full Table Scans ..................................
The output from *note 'EXPLAIN': explain. shows 'ALL' in the 'type' column when MySQL uses a full table scan to resolve a query. This usually happens under the following conditions:
The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
There are no usable restrictions in the 'ON' or 'WHERE' clause for indexed columns.
You are comparing indexed columns with constant values and MySQL has calculated (based on the index tree) that the constants cover too large a part of the table and that a table scan would be faster. See *note where-optimization::.
You are using a key with low cardinality (many rows match the key value) through another column. In this case, MySQL assumes that by using the key it is likely to perform many key lookups and that a table scan would be faster.
For small tables, a table scan often is appropriate and the performance impact is negligible. For large tables, try the following techniques to avoid having the optimizer incorrectly choose a table scan:
Use 'ANALYZE TABLE TBL_NAME' to update the key distributions for the scanned table. See *note analyze-table::.
Use 'FORCE INDEX' for the scanned table to tell MySQL that table scans are very expensive compared to using the given index:
SELECT * FROM t1, t2 FORCE INDEX (INDEX_FOR_COLUMN)
WHERE t1.COL_NAME=t2.COL_NAME;
See *note index-hints::.
Start *note 'mysqld': mysqld. with the '--max-seeks-for-key=1000' option or use 'SET max_seeks_for_key=1000' to tell the optimizer to assume that no key scan causes more than 1,000 key seeks. See *note server-system-variables::.
File: manual.info.tmp, Node: subquery-optimization, Next: information-schema-optimization, Prev: select-optimization, Up: statement-optimization
Menu:
derived-table-optimization:: Optimizing Derived Tables and View References with Merging or Materialization
The MySQL query optimizer has different strategies available to evaluate subqueries:
For 'IN' (or '=ANY') subqueries, the optimizer has these choices:
* Semijoin
* Materialization
* 'EXISTS' strategy
For 'NOT IN' (or '<>ALL') subqueries, the optimizer has these choices:
* Materialization
* 'EXISTS' strategy
For derived tables, the optimizer has these choices (which also apply to view references):
Merge the derived table into the outer query block
Materialize the derived table to an internal temporary table
The following discussion provides more information about the preceding optimization strategies.
Note:
A limitation on *note 'UPDATE': update. and *note 'DELETE': delete. statements that use a subquery to modify a single table is that the optimizer does not use semijoin or materialization subquery optimizations. As a workaround, try rewriting them as multiple-table *note 'UPDATE': update. and *note 'DELETE': delete. statements that use a join rather than a subquery.
File: manual.info.tmp, Node: semijoins, Next: subquery-materialization, Prev: subquery-optimization, Up: subquery-optimization
8.2.2.1 Optimizing Subqueries, Derived Tables, and View References with Semijoin Transformations ................................................................................................
A semijoin is a preparation-time transformation that enables multiple execution strategies such as table pullout, duplicate weedout, first match, loose scan, and materialization. The optimizer uses semijoin strategies to improve subquery execution, as described in this section.
For an inner join between two tables, the join returns a row from one table as many times as there are matches in the other table. But for some questions, the only information that matters is whether there is a match, not the number of matches. Suppose that there are tables named 'class' and 'roster' that list classes in a course curriculum and class rosters (students enrolled in each class), respectively. To list the classes that actually have students enrolled, you could use this join:
SELECT class.class_num, class.class_name
FROM class INNER JOIN roster
WHERE class.class_num = roster.class_num;
However, the result lists each class once for each enrolled student. For the question being asked, this is unnecessary duplication of information.
Assuming that 'class_num' is a primary key in the 'class' table, duplicate suppression is possible by using *note 'SELECT DISTINCT': select, but it is inefficient to generate all matching rows first only to eliminate duplicates later.
The same duplicate-free result can be obtained by using a subquery:
SELECT class_num, class_name
FROM class
WHERE class_num IN (SELECT class_num FROM roster);
Here, the optimizer can recognize that the 'IN' clause requires the subquery to return only one instance of each class number from the 'roster' table. In this case, the query can use a semijoin; that is, an operation that returns only one instance of each row in 'class' that is matched by rows in 'roster'.
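The "one instance per matched row" property can be sketched in Python (an illustration of the FirstMatch-style shortcutting described in this section, not server code):

```python
classes = [(101, 'Algebra'), (102, 'Biology'), (103, 'Chemistry')]
roster = [(101, 'amy'), (101, 'bob'), (103, 'cal')]

def semijoin(classes, roster):
    """Return each class at most once if any roster row matches,
    stopping the inner scan at the first match."""
    result = []
    for class_num, class_name in classes:
        for r_num, _student in roster:
            if class_num == r_num:
                result.append((class_num, class_name))
                break       # first match suffices: no duplicate rows
    return result

result = semijoin(classes, roster)
```

Unlike the inner join, class 101 appears once even though two students are enrolled, and class 102 (no students) is absent.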
Outer join and inner join syntax is permitted in the outer query specification, and table references may be base tables, derived tables, or view references.
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an 'IN' (or '=ANY') subquery that appears at the top level of the 'WHERE' or 'ON' clause, possibly as a term in an 'AND' expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, 'ot_I' and 'it_I' represent tables in the outer and inner parts of the query, and 'oe_I' and 'ie_I' represent expressions that refer to columns in the outer and inner tables.
It must be a single *note 'SELECT': select. without *note 'UNION': union. constructs.
It must not contain a 'GROUP BY' or 'HAVING' clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have 'ORDER BY' with 'LIMIT'.
The statement must not use the 'STRAIGHT_JOIN' join type in the outer query.
The 'STRAIGHT_JOIN' modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
The subquery may be correlated or uncorrelated. 'DISTINCT' is permitted, as is 'LIMIT' unless 'ORDER BY' is also used.
If a subquery meets the preceding criteria, MySQL converts it to a semijoin and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semijoin as if it was a join and remove duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
Materialize the subquery into an indexed temporary table that is used to perform a join, where the index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned. For more information about materialization, see *note subquery-materialization::.
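For contrast with FirstMatch, the Duplicate Weedout strategy can be sketched as follows (a Python illustration, not server code): the semijoin runs as an ordinary join, and a temporary table of outer-row IDs weeds out duplicate result rows after the fact:

```python
classes = [(101, 'Algebra'), (102, 'Biology'), (103, 'Chemistry')]
roster = [(101, 'amy'), (101, 'bob'), (103, 'cal')]

def duplicate_weedout(classes, roster):
    """Run the semijoin as a plain join, then discard duplicate
    outer rows using a temporary table keyed on the outer rowid."""
    tmp = set()                  # temporary table of seen rowids
    result = []
    for rowid, (class_num, class_name) in enumerate(classes):
        for r_num, _student in roster:
            if class_num == r_num and rowid not in tmp:
                tmp.add(rowid)
                result.append((class_num, class_name))
    return result

result = duplicate_weedout(classes, roster)
```

Both strategies yield the same duplicate-free result; they differ in when duplicates are eliminated (during the scan versus afterward via the temporary table).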
Each of these strategies can be enabled or disabled using the following 'optimizer_switch' system variable flags:
The 'semijoin' flag controls whether semijoins are used.
If 'semijoin' is enabled, the 'firstmatch', 'loosescan', 'duplicateweedout', and 'materialization' flags enable finer control over the permitted semijoin strategies.
If the 'duplicateweedout' semijoin strategy is disabled, it is not used unless all other applicable strategies are also disabled.
If 'duplicateweedout' is disabled, on occasion the optimizer may generate a query plan that is far from optimal. This occurs due to heuristic pruning during greedy search, which can be avoided by setting 'optimizer_prune_level=0'.
These flags are enabled by default. See *note switchable-optimizations::.
The optimizer minimizes differences in handling of views and derived tables. This affects queries that use the 'STRAIGHT_JOIN' modifier and a view with an 'IN' subquery that can be converted to a semijoin. The following query illustrates this because the change in processing causes a change in transformation, and thus a different execution strategy:
CREATE VIEW v AS
SELECT *
FROM t1
WHERE a IN (SELECT b
FROM t2);
SELECT STRAIGHT_JOIN *
FROM t3 JOIN v ON t3.x = v.a;
The optimizer first looks at the view and converts the 'IN' subquery to a semijoin, then checks whether it is possible to merge the view into the outer query. Because the 'STRAIGHT_JOIN' modifier in the outer query prevents semijoin, the optimizer refuses the merge, causing derived table evaluation using a materialized table.
*note 'EXPLAIN': explain. output indicates the use of semijoin strategies as follows:
Semijoined tables show up in the outer select. For extended *note 'EXPLAIN': explain. output, the text displayed by a following *note 'SHOW WARNINGS': show-warnings. shows the rewritten query, which displays the semijoin structure. (See *note explain-extended::.) From this you can get an idea about which tables were pulled out of the semijoin. If a subquery was converted to a semijoin, you can see that the subquery predicate is gone and its tables and 'WHERE' clause were merged into the outer query join list and 'WHERE' clause.
Temporary table use for Duplicate Weedout is indicated by 'Start temporary' and 'End temporary' in the 'Extra' column. Tables that were not pulled out and are in the range of *note 'EXPLAIN': explain. output rows covered by 'Start temporary' and 'End temporary' have their 'rowid' in the temporary table.
'FirstMatch(TBL_NAME)' in the 'Extra' column indicates join shortcutting.
'LooseScan(M..N)' in the 'Extra' column indicates use of the LooseScan strategy. M and N are key part numbers.
Temporary table use for materialization is indicated by rows with a 'select_type' value of 'MATERIALIZED' and rows with a 'table' value of '<subqueryN>'.
File: manual.info.tmp, Node: subquery-materialization, Next: subquery-optimization-with-exists, Prev: semijoins, Up: subquery-optimization
8.2.2.2 Optimizing Subqueries with Materialization
..................................................
The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
Subquery materialization uses an in-memory temporary table when possible, falling back to on-disk storage if the table becomes too large. See *note internal-temporary-tables::.
If materialization is not used, the optimizer sometimes rewrites a noncorrelated subquery as a correlated subquery. For example, the following 'IN' subquery is noncorrelated (WHERE_CONDITION involves only columns from 't2' and not 't1'):
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2 WHERE WHERE_CONDITION);
The optimizer might rewrite this as an 'EXISTS' correlated subquery:
SELECT * FROM t1
WHERE EXISTS (SELECT t2.b FROM t2 WHERE WHERE_CONDITION AND t1.a=t2.b);
Subquery materialization using a temporary table avoids such rewrites and makes it possible to execute the subquery only once rather than once per row of the outer query.
For subquery materialization to be used in MySQL, the 'optimizer_switch' system variable 'materialization' flag must be enabled. (See *note switchable-optimizations::.) With the 'materialization' flag enabled, materialization applies to subquery predicates that appear anywhere (in the select list, 'WHERE', 'ON', 'GROUP BY', 'HAVING', or 'ORDER BY'), for predicates that fall into any of these use cases:
The predicate has this form, when no outer expression OE_I or inner expression IE_I is nullable. N is 1 or larger.
(OE_1, OE_2, ..., OE_N) [NOT] IN (SELECT IE_1, IE_2, ..., IE_N ...)
The predicate has this form, when there is a single outer expression OE and inner expression IE. The expressions can be nullable.
OE [NOT] IN (SELECT IE ...)
The predicate is 'IN' or 'NOT IN' and a result of 'UNKNOWN' ('NULL') has the same meaning as a result of 'FALSE'.
The following examples illustrate how the requirement for equivalence of 'UNKNOWN' and 'FALSE' predicate evaluation affects whether subquery materialization can be used. Assume that WHERE_CONDITION involves columns only from 't2' and not 't1' so that the subquery is noncorrelated.
This query is subject to materialization:
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2 WHERE WHERE_CONDITION);
Here, it does not matter whether the 'IN' predicate returns 'UNKNOWN' or 'FALSE'. Either way, the row from 't1' is not included in the query result.
An example where subquery materialization is not used is the following query, where 't2.b' is a nullable column:
SELECT * FROM t1
WHERE (t1.a,t1.b) NOT IN (SELECT t2.a,t2.b FROM t2
WHERE WHERE_CONDITION);
The following restrictions apply to the use of subquery materialization:
The types of the inner and outer expressions must match. For example, the optimizer might be able to use materialization if both expressions are integer or both are decimal, but cannot if one expression is integer and the other is decimal.
The inner expression cannot be a *note 'BLOB': blob.
Use of *note 'EXPLAIN': explain. with a query provides some indication of whether the optimizer uses subquery materialization:
Compared to query execution that does not use materialization, 'select_type' may change from 'DEPENDENT SUBQUERY' to 'SUBQUERY'. This indicates that, for a subquery that would be executed once per outer row, materialization enables the subquery to be executed just once.
For extended *note 'EXPLAIN': explain. output, the text displayed by a following *note 'SHOW WARNINGS': show-warnings. includes 'materialize' and 'materialized-subquery'.
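For example, to check whether a given 'IN' subquery is materialized (illustrative table names; actual plans depend on your data and settings):

```sql
EXPLAIN
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2);

-- The rewritten query shown in the resulting Note contains
-- 'materialize' / 'materialized-subquery' if materialization was used.
SHOW WARNINGS;
```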
File: manual.info.tmp, Node: subquery-optimization-with-exists, Next: derived-table-optimization, Prev: subquery-materialization, Up: subquery-optimization
8.2.2.3 Optimizing Subqueries with the EXISTS Strategy
......................................................
Certain optimizations are applicable to comparisons that use the 'IN' (or '=ANY') operator to test subquery results. This section discusses these optimizations, particularly with regard to the challenges that 'NULL' values present. The last part of the discussion suggests how you can help the optimizer.
Consider the following subquery comparison:
OUTER_EXPR IN (SELECT INNER_EXPR FROM ... WHERE SUBQUERY_WHERE)
MySQL evaluates queries 'from outside to inside.' That is, it first obtains the value of the outer expression OUTER_EXPR, and then runs the subquery and captures the rows that it produces.
A very useful optimization is to 'inform' the subquery that the only rows of interest are those where the inner expression INNER_EXPR is equal to OUTER_EXPR. This is done by pushing down an appropriate equality into the subquery's 'WHERE' clause to make it more restrictive. The converted comparison looks like this:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE AND OUTER_EXPR=INNER_EXPR)
After the conversion, MySQL can use the pushed-down equality to limit the number of rows it must examine to evaluate the subquery.
More generally, a comparison of N values to a subquery that returns N-value rows is subject to the same conversion. If OE_I and IE_I represent corresponding outer and inner expression values, this subquery comparison:
(OE_1, ..., OE_N) IN
(SELECT IE_1, ..., IE_N FROM ... WHERE SUBQUERY_WHERE)
Becomes:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE
AND OE_1 = IE_1
AND ...
AND OE_N = IE_N)
For simplicity, the following discussion assumes a single pair of outer and inner expression values.
The conversion just described has its limitations. It is valid only if we ignore possible 'NULL' values. That is, the 'pushdown' strategy works as long as both of these conditions are true:
OUTER_EXPR and INNER_EXPR cannot be 'NULL'.
You need not distinguish 'NULL' from 'FALSE' subquery results. If the subquery is a part of an 'OR' or 'AND' expression in the 'WHERE' clause, MySQL assumes that you do not care. Another instance where the optimizer notices that 'NULL' and 'FALSE' subquery results need not be distinguished is this construct:
... WHERE OUTER_EXPR IN (SUBQUERY)
In this case, the 'WHERE' clause rejects the row whether 'IN (SUBQUERY)' returns 'NULL' or 'FALSE'.
When either or both of those conditions do not hold, optimization is more complex.
Suppose that OUTER_EXPR is known to be a non-'NULL' value but the subquery does not produce a row such that OUTER_EXPR = INNER_EXPR. Then 'OUTER_EXPR IN (SELECT ...)' evaluates as follows:
'NULL', if the *note 'SELECT': select. produces any row where INNER_EXPR is 'NULL'
'FALSE', if the *note 'SELECT': select. produces only non-'NULL' values or produces nothing
In this situation, the approach of looking for rows with 'OUTER_EXPR = INNER_EXPR' is no longer valid. It is necessary to look for such rows, but if none are found, also look for rows where INNER_EXPR is 'NULL'. Roughly speaking, the subquery can be converted to something like this:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE AND
(OUTER_EXPR=INNER_EXPR OR INNER_EXPR IS NULL))
The need to evaluate the extra 'IS NULL' condition is why MySQL has the 'ref_or_null' access method:
mysql> EXPLAIN
SELECT OUTER_EXPR IN (SELECT t2.maybe_null_key
FROM t2, t3 WHERE ...)
FROM t1;
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
...
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: t2
type: ref_or_null
possible_keys: maybe_null_key
key: maybe_null_key
key_len: 5
ref: func
rows: 2
Extra: Using where; Using index
...
The 'unique_subquery' and 'index_subquery' subquery-specific access methods also have 'or NULL' variants.
The additional 'OR ... IS NULL' condition makes query execution slightly more complicated (and some optimizations within the subquery become inapplicable), but generally this is tolerable.
The situation is much worse when OUTER_EXPR can be 'NULL'. According to the SQL interpretation of 'NULL' as 'unknown value,' 'NULL IN (SELECT INNER_EXPR ...)' should evaluate to:
'NULL', if the *note 'SELECT': select. produces any rows
'FALSE', if the *note 'SELECT': select. produces no rows
For proper evaluation, it is necessary to be able to check whether the *note 'SELECT': select. has produced any rows at all, so 'OUTER_EXPR = INNER_EXPR' cannot be pushed down into the subquery. This is a problem because many real world subqueries become very slow unless the equality can be pushed down.
Essentially, there must be different ways to execute the subquery depending on the value of OUTER_EXPR.
The optimizer chooses SQL compliance over speed, so it accounts for the possibility that OUTER_EXPR might be 'NULL':
If OUTER_EXPR is 'NULL', to evaluate the following expression, it is necessary to execute the *note 'SELECT': select. to determine whether it produces any rows:
NULL IN (SELECT INNER_EXPR FROM ... WHERE SUBQUERY_WHERE)
It is necessary to execute the original *note 'SELECT': select. here, without any pushed-down equalities of the kind mentioned previously.
On the other hand, when OUTER_EXPR is not 'NULL', it is absolutely essential that this comparison:
OUTER_EXPR IN (SELECT INNER_EXPR FROM ... WHERE SUBQUERY_WHERE)
Be converted to this expression that uses a pushed-down condition:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE AND OUTER_EXPR=INNER_EXPR)
Without this conversion, subqueries are slow.
To solve the dilemma of whether or not to push down conditions into the subquery, the conditions are wrapped within 'trigger' functions. Thus, an expression of the following form:
OUTER_EXPR IN (SELECT INNER_EXPR FROM ... WHERE SUBQUERY_WHERE)
Is converted into:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE
AND trigcond(OUTER_EXPR=INNER_EXPR))
More generally, if the subquery comparison is based on several pairs of outer and inner expressions, the conversion takes this comparison:
(OE_1, ..., OE_N) IN (SELECT IE_1, ..., IE_N FROM ... WHERE SUBQUERY_WHERE)
And converts it to this expression:
EXISTS (SELECT 1 FROM ... WHERE SUBQUERY_WHERE
AND trigcond(OE_1=IE_1)
AND ...
AND trigcond(OE_N=IE_N)
)
Each 'trigcond(X)' is a special function that evaluates to the following values:
X when the 'linked' outer expression OE_I is not 'NULL'
'TRUE' when the 'linked' outer expression OE_I is 'NULL'
Note:
Trigger functions are not triggers of the kind that you create with *note 'CREATE TRIGGER': create-trigger.
Equalities that are wrapped within 'trigcond()' functions are not first-class predicates for the query optimizer. Most optimizations cannot deal with predicates that may be turned on and off at query execution time, so they assume any 'trigcond(X)' to be an unknown function and ignore it. The following optimizations can use triggered equalities:
Reference optimizations: 'trigcond(X=Y [OR Y IS NULL])' can be used to construct 'ref', 'eq_ref', or 'ref_or_null' table accesses.
Index lookup-based subquery execution engines: 'trigcond(X=Y)' can be used to construct 'unique_subquery' or 'index_subquery' accesses.
Table-condition generator: If the subquery is a join of several tables, the triggered condition is checked as soon as possible.
When the optimizer uses a triggered condition to create some kind of index lookup-based access (as for the first two items of the preceding list), it must have a fallback strategy for the case when the condition is turned off. This fallback strategy is always the same: Do a full table scan. In *note 'EXPLAIN': explain. output, the fallback shows up as 'Full scan on NULL key' in the 'Extra' column:
mysql> EXPLAIN SELECT t1.col1,
t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2=t1.col2) FROM t1\G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
...
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: t2
type: index_subquery
possible_keys: key1
key: key1
key_len: 5
ref: func
rows: 2
Extra: Using where; Full scan on NULL key
If you run *note 'EXPLAIN': explain. followed by *note 'SHOW WARNINGS': show-warnings, you can see the triggered condition:
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,
<in_optimizer>(`test`.`t1`.`col1`,
<exists>(<index_lookup>(<cache>(`test`.`t1`.`col1`) in t2
on key1 checking NULL
where (`test`.`t2`.`col2` = `test`.`t1`.`col2`) having
trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
`t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)`
from `test`.`t1`
The use of triggered conditions has some performance implications. A 'NULL IN (SELECT ...)' expression now may cause a full table scan (which is slow) when it previously did not. This is the price paid for correct results (the goal of the trigger-condition strategy is to improve compliance, not speed).
For multiple-table subqueries, execution of 'NULL IN (SELECT ...)' is particularly slow because the join optimizer does not optimize for the case where the outer expression is 'NULL'. It assumes that subquery evaluations with 'NULL' on the left side are very rare, even if there are statistics that indicate otherwise. On the other hand, if the outer expression might be 'NULL' but never actually is, there is no performance penalty.
To help the query optimizer better execute your queries, use these suggestions:
Declare a column as 'NOT NULL' if it really is. This also helps other aspects of the optimizer by simplifying condition testing for the column.
If you need not distinguish a 'NULL' from 'FALSE' subquery result, you can easily avoid the slow execution path. Replace a comparison that looks like this:
OUTER_EXPR IN (SELECT INNER_EXPR FROM ...)
with this expression:
(OUTER_EXPR IS NOT NULL) AND (OUTER_EXPR IN (SELECT INNER_EXPR FROM ...))
Then 'NULL IN (SELECT ...)' is never evaluated because MySQL stops evaluating 'AND' parts as soon as the expression result is clear.
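As a concrete sketch with hypothetical tables 't1' and 't2', the guarded form looks like this:

```sql
-- Slow path possible: if t1.a can be NULL, the subquery may have to
-- be executed without pushed-down equalities for those rows.
SELECT * FROM t1
WHERE t1.a IN (SELECT t2.b FROM t2);

-- Guarded form: the IS NOT NULL test short-circuits the AND, so the
-- subquery is never evaluated with a NULL outer value.
SELECT * FROM t1
WHERE t1.a IS NOT NULL
  AND t1.a IN (SELECT t2.b FROM t2);
```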
Another possible rewrite:
EXISTS (SELECT INNER_EXPR FROM ...
WHERE INNER_EXPR=OUTER_EXPR)
This would apply when you need not distinguish 'NULL' from 'FALSE' subquery results, in which case you may actually want 'EXISTS'.
The 'subquery_materialization_cost_based' flag of the 'optimizer_switch' system variable enables control over the choice between subquery materialization and 'IN'-to-'EXISTS' subquery transformation. See *note switchable-optimizations::.
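A session-level sketch of toggling that choice:

```sql
-- Choose between materialization and IN-to-EXISTS on a cost basis
-- (the default):
SET optimizer_switch = 'subquery_materialization_cost_based=on';

-- Disable the cost-based choice: materialization is preferred
-- whenever the 'materialization' flag permits it.
SET optimizer_switch = 'subquery_materialization_cost_based=off';
```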
File: manual.info.tmp, Node: derived-table-optimization, Prev: subquery-optimization-with-exists, Up: subquery-optimization
8.2.2.4 Optimizing Derived Tables and View References with Merging or Materialization
.....................................................................................
The optimizer can handle derived table references using two strategies (which also apply to view references):
Merge the derived table into the outer query block
Materialize the derived table to an internal temporary table
Example 1:
SELECT * FROM (SELECT * FROM t1) AS derived_t1;
With merging of the derived table 'derived_t1', that query is executed similar to:
SELECT * FROM t1;
Example 2:
SELECT *
FROM t1 JOIN (SELECT t2.f1 FROM t2) AS derived_t2 ON t1.f2=derived_t2.f1
WHERE t1.f1 > 0;
With merging of the derived table 'derived_t2', that query is executed similar to:
SELECT t1.*, t2.f1
FROM t1 JOIN t2 ON t1.f2=t2.f1
WHERE t1.f1 > 0;
With materialization, 'derived_t1' and 'derived_t2' are each treated as a separate table within their respective queries.
The optimizer handles derived tables and view references the same way: It avoids unnecessary materialization whenever possible, which enables pushing down conditions from the outer query to derived tables and produces more efficient execution plans. (For an example, see *note subquery-materialization::.)
If merging would result in an outer query block that references more than 61 base tables, the optimizer chooses materialization instead.
The optimizer propagates an 'ORDER BY' clause in a derived table or view reference to the outer query block if these conditions are all true:
The outer query is not grouped or aggregated.
The outer query does not specify 'DISTINCT', 'HAVING', or 'ORDER BY'.
The outer query has this derived table or view reference as the only source in the 'FROM' clause.
Otherwise, the optimizer ignores the 'ORDER BY' clause.
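For instance (hypothetical table 't1' with column 'f1'), the first query below satisfies all three conditions, so the derived table's 'ORDER BY' is propagated to the outer query block; the second does not, so it is ignored:

```sql
-- ORDER BY f1 is propagated: the outer query is ungrouped, specifies
-- no DISTINCT, HAVING, or ORDER BY, and has the derived table as the
-- only source in its FROM clause.
SELECT * FROM (SELECT * FROM t1 ORDER BY f1) AS dt;

-- ORDER BY f1 is ignored: the outer query specifies DISTINCT.
SELECT DISTINCT * FROM (SELECT * FROM t1 ORDER BY f1) AS dt;
```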
The following means are available to influence whether the optimizer attempts to merge derived tables and view references into the outer query block:
The 'derived_merge' flag of the 'optimizer_switch' system variable can be used, assuming that no other rule prevents merging. See *note switchable-optimizations::. By default, the flag is enabled to permit merging. Disabling the flag prevents merging and avoids 'ER_UPDATE_TABLE_USED' (https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_update_table_used) errors.
The 'derived_merge' flag also applies to views that contain no 'ALGORITHM' clause. Thus, if an 'ER_UPDATE_TABLE_USED' (https://dev.mysql.com/doc/mysql-errors/5.7/en/server-error-reference.html#error_er_update_table_used) error occurs for a view reference that uses an expression equivalent to the subquery, adding 'ALGORITHM=TEMPTABLE' to the view definition prevents merging and takes precedence over the 'derived_merge' value.
It is also possible to prevent merging by including in the subquery any construct that rules it out, although such constructs are less explicit in their effect on materialization. The constructs that prevent merging are the same for derived tables and view references:
* Aggregate functions ('SUM()', 'MIN()', 'MAX()', 'COUNT()', and so forth)
* 'DISTINCT'
* 'GROUP BY'
* 'HAVING'
* 'LIMIT'
* *note 'UNION': union. or *note 'UNION ALL': union.
* Subqueries in the select list
* Assignments to user variables
* References only to literal values (in this case, there is no underlying table)
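A sketch of both mechanisms, using a hypothetical view 'v' over table 't1':

```sql
-- Disable merging for all derived tables and views in this session:
SET optimizer_switch = 'derived_merge=off';

-- Or prevent merging for one view regardless of the derived_merge
-- setting, by forcing the temptable (materialization) algorithm:
CREATE ALGORITHM=TEMPTABLE VIEW v AS SELECT * FROM t1;
```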
If the optimizer chooses the materialization strategy rather than merging for a derived table, it handles the query as follows:
The optimizer postpones derived table materialization until its contents are needed during query execution. This improves performance because delaying materialization may result in not having to do it at all. Consider a query that joins the result of a derived table to another table: If the optimizer processes that other table first and finds that it returns no rows, the join need not be carried out further and the optimizer can completely skip materializing the derived table.
During query execution, the optimizer may add an index to a derived table to speed up row retrieval from it.
Consider the following *note 'EXPLAIN': explain. statement, for a *note 'SELECT': select. query that contains a derived table:
EXPLAIN SELECT * FROM (SELECT * FROM t1) AS derived_t1;
The optimizer avoids materializing the derived table by delaying it until the result is needed during *note 'SELECT': select. execution. In this case, the query is not executed (because it occurs in an *note 'EXPLAIN': explain. statement), so the result is never needed.
Even for queries that are executed, delay of derived table materialization may enable the optimizer to avoid materialization entirely. When this happens, query execution is quicker by the time needed to perform materialization. Consider the following query, which joins the result of a derived table to another table:
SELECT *
FROM t1 JOIN (SELECT t2.f1 FROM t2) AS derived_t2
ON t1.f2=derived_t2.f1
WHERE t1.f1 > 0;
If the optimizer processes 't1' first and the 'WHERE' clause produces an empty result, the join must necessarily be empty and the derived table need not be materialized.
For cases when a derived table requires materialization, the optimizer may add an index to the materialized table to speed up access to it. If such an index enables 'ref' access to the table, it can greatly reduce the amount of data read during query execution. Consider the following query:
SELECT *
FROM t1 JOIN (SELECT DISTINCT f1 FROM t2) AS derived_t2
ON t1.f1=derived_t2.f1;
The optimizer constructs an index over column 'f1' from 'derived_t2' if doing so would enable use of 'ref' access for the lowest cost execution plan. After adding the index, the optimizer can treat the materialized derived table the same as a regular table with an index, and it benefits similarly from the generated index. The overhead of index creation is negligible compared to the cost of query execution without the index. If 'ref' access would result in higher cost than some other access method, the optimizer creates no index and loses nothing.
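One way to observe this (illustrative; exact output depends on your data and server version) is to examine the plan, where an automatically generated key for the materialized derived table typically appears under a name such as '<auto_key0>' in the 'key' column of the derived table's row:

```sql
EXPLAIN
SELECT *
FROM t1 JOIN (SELECT DISTINCT f1 FROM t2) AS derived_t2
  ON t1.f1 = derived_t2.f1;
```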
For optimizer trace output, a merged derived table or view reference is not shown as a node. Only its underlying tables appear in the top query's plan.
File: manual.info.tmp, Node: information-schema-optimization, Next: data-change-optimization, Prev: subquery-optimization, Up: statement-optimization
8.2.3 Optimizing INFORMATION_SCHEMA Queries
-------------------------------------------

Applications that monitor databases may make frequent use of 'INFORMATION_SCHEMA' tables. Certain types of queries for 'INFORMATION_SCHEMA' tables can be optimized to execute more quickly. The goal is to minimize file operations (for example, scanning a directory or opening a table file) to collect the information that makes up these dynamic tables.
Note:
Comparison behavior for database and table names in 'INFORMATION_SCHEMA' queries might differ from what you expect. For details, see *note charset-collation-information-schema::.
1) Try to use constant lookup values for database and table names in the 'WHERE' clause
You can take advantage of this principle as follows:
To look up databases or tables, use expressions that evaluate to a constant, such as literal values, functions that return a constant, or scalar subqueries.
Avoid queries that use a nonconstant database name lookup value (or no lookup value) because they require a scan of the data directory to find matching database directory names.
Within a database, avoid queries that use a nonconstant table name lookup value (or no lookup value) because they require a scan of the database directory to find matching table files.
This principle applies to the 'INFORMATION_SCHEMA' tables shown in the following table, which shows the columns for which a constant lookup value enables the server to avoid a directory scan. For example, if you are selecting from *note 'TABLES': information-schema-tables-table, using a constant lookup value for 'TABLE_SCHEMA' in the 'WHERE' clause enables a data directory scan to be avoided.
Table                        Column to specify to avoid     Column to specify to avoid
                             data directory scan            database directory scan
*note 'COLUMNS': information-schema-columns-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'KEY_COLUMN_USAGE': information-schema-key-column-usage-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'PARTITIONS': information-schema-partitions-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'REFERENTIAL_CONSTRAINTS': information-schema-referential-constraints-table.
'CONSTRAINT_SCHEMA' 'TABLE_NAME'
*note 'STATISTICS': information-schema-statistics-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'TABLES': information-schema-tables-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'TABLE_CONSTRAINTS': information-schema-table-constraints-table.
'TABLE_SCHEMA' 'TABLE_NAME'
*note 'TRIGGERS': information-schema-triggers-table.
'EVENT_OBJECT_SCHEMA' 'EVENT_OBJECT_TABLE'
*note 'VIEWS': information-schema-views-table.
'TABLE_SCHEMA' 'TABLE_NAME'
The benefit of a query that is limited to a specific constant database name is that checks need be made only for the named database directory. Example:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test';
Use of the literal database name 'test' enables the server to check only the 'test' database directory, regardless of how many databases there might be. By contrast, the following query is less efficient because it requires a scan of the data directory to determine which database names match the pattern ''test%'':
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA LIKE 'test%';
For a query that is limited to a specific constant table name, checks need be made only for the named table within the corresponding database directory. Example:
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't1';
Use of the literal table name 't1' enables the server to check only the files for the 't1' table, regardless of how many tables there might be in the 'test' database. By contrast, the following query requires a scan of the 'test' database directory to determine which table names match the pattern ''t%'':
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME LIKE 't%';
The following query requires a scan of the database directory to determine matching database names for the pattern ''test%'', and for each matching database, it requires a scan of the database directory to determine matching table names for the pattern ''t%'':
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test%' AND TABLE_NAME LIKE 't%';
2) Write queries that minimize the number of table files that must be opened
For queries that refer to certain 'INFORMATION_SCHEMA' table columns, several optimizations are available that minimize the number of table files that must be opened. Example:
SELECT TABLE_NAME, ENGINE FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test';
In this case, after the server has scanned the database directory to determine the names of the tables in the database, those names become available with no further file system lookups. Thus, 'TABLE_NAME' requires no files to be opened. The 'ENGINE' (storage engine) value can be determined by opening the table's '.frm' file, without touching other table files such as the '.MYD' or '.MYI' file.
Some values, such as 'INDEX_LENGTH' for 'MyISAM' tables, require opening the '.MYD' or '.MYI' file as well.
The file-opening optimization types are denoted thus:
'SKIP_OPEN_TABLE': Table files do not need to be opened. The information has already become available within the query by scanning the database directory.
'OPEN_FRM_ONLY': Only the table's '.frm' file need be opened.
'OPEN_TRIGGER_ONLY': Only the table's '.TRG' file need be opened.
'OPEN_FULL_TABLE': The unoptimized information lookup. The '.frm', '.MYD', and '.MYI' files must be opened.
The following list indicates how the preceding optimization types apply to 'INFORMATION_SCHEMA' table columns. For tables and columns not named, none of the optimizations apply.
*note 'COLUMNS': information-schema-columns-table.: 'OPEN_FRM_ONLY' applies to all columns
*note 'KEY_COLUMN_USAGE': information-schema-key-column-usage-table.: 'OPEN_FULL_TABLE' applies to all columns
*note 'PARTITIONS': information-schema-partitions-table.: 'OPEN_FULL_TABLE' applies to all columns
*note 'REFERENTIAL_CONSTRAINTS': information-schema-referential-constraints-table.: 'OPEN_FULL_TABLE' applies to all columns
*note 'STATISTICS': information-schema-statistics-table.:
Column Optimization type
'TABLE_CATALOG' 'OPEN_FRM_ONLY'
'TABLE_SCHEMA' 'OPEN_FRM_ONLY'
'TABLE_NAME' 'OPEN_FRM_ONLY'
'NON_UNIQUE' 'OPEN_FRM_ONLY'
'INDEX_SCHEMA' 'OPEN_FRM_ONLY'
'INDEX_NAME' 'OPEN_FRM_ONLY'
'SEQ_IN_INDEX' 'OPEN_FRM_ONLY'
'COLUMN_NAME' 'OPEN_FRM_ONLY'
'COLLATION' 'OPEN_FRM_ONLY'
'CARDINALITY' 'OPEN_FULL_TABLE'
'SUB_PART' 'OPEN_FRM_ONLY'
'PACKED' 'OPEN_FRM_ONLY'
'NULLABLE' 'OPEN_FRM_ONLY'
'INDEX_TYPE' 'OPEN_FULL_TABLE'
'COMMENT' 'OPEN_FRM_ONLY'
*note 'TABLES': information-schema-tables-table.:
Column Optimization type
'TABLE_CATALOG' 'SKIP_OPEN_TABLE'
'TABLE_SCHEMA' 'SKIP_OPEN_TABLE'
'TABLE_NAME' 'SKIP_OPEN_TABLE'
'TABLE_TYPE' 'OPEN_FRM_ONLY'
'ENGINE' 'OPEN_FRM_ONLY'
'VERSION' 'OPEN_FRM_ONLY'
'ROW_FORMAT' 'OPEN_FULL_TABLE'
'TABLE_ROWS' 'OPEN_FULL_TABLE'
'AVG_ROW_LENGTH' 'OPEN_FULL_TABLE'
'DATA_LENGTH' 'OPEN_FULL_TABLE'
'MAX_DATA_LENGTH' 'OPEN_FULL_TABLE'
'INDEX_LENGTH' 'OPEN_FULL_TABLE'
'DATA_FREE' 'OPEN_FULL_TABLE'
'AUTO_INCREMENT' 'OPEN_FULL_TABLE'
'CREATE_TIME' 'OPEN_FULL_TABLE'
'UPDATE_TIME' 'OPEN_FULL_TABLE'
'CHECK_TIME' 'OPEN_FULL_TABLE'
'TABLE_COLLATION' 'OPEN_FRM_ONLY'
'CHECKSUM' 'OPEN_FULL_TABLE'
'CREATE_OPTIONS' 'OPEN_FRM_ONLY'
'TABLE_COMMENT' 'OPEN_FRM_ONLY'
*note 'TABLE_CONSTRAINTS': information-schema-table-constraints-table.: 'OPEN_FULL_TABLE' applies to all columns
*note 'TRIGGERS': information-schema-triggers-table.: 'OPEN_TRIGGER_ONLY' applies to all columns
*note 'VIEWS': information-schema-views-table.:
Column Optimization type
'TABLE_CATALOG' 'OPEN_FRM_ONLY'
'TABLE_SCHEMA' 'OPEN_FRM_ONLY'
'TABLE_NAME' 'OPEN_FRM_ONLY'
'VIEW_DEFINITION' 'OPEN_FRM_ONLY'
'CHECK_OPTION' 'OPEN_FRM_ONLY'
'IS_UPDATABLE' 'OPEN_FULL_TABLE'
'DEFINER' 'OPEN_FRM_ONLY'
'SECURITY_TYPE' 'OPEN_FRM_ONLY'
'CHARACTER_SET_CLIENT' 'OPEN_FRM_ONLY'
'COLLATION_CONNECTION' 'OPEN_FRM_ONLY'
3) Use *note 'EXPLAIN': explain. to determine whether the server can use 'INFORMATION_SCHEMA' optimizations for a query
This applies particularly for 'INFORMATION_SCHEMA' queries that search for information from more than one database, which might take a long time and impact performance. The 'Extra' value in *note 'EXPLAIN': explain. output indicates which, if any, of the optimizations described earlier the server can use to evaluate 'INFORMATION_SCHEMA' queries. The following examples demonstrate the kinds of information you can expect to see in the 'Extra' value.
mysql> EXPLAIN SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS WHERE
TABLE_SCHEMA = 'test' AND TABLE_NAME = 'v1'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: VIEWS
type: ALL
possible_keys: NULL
key: TABLE_SCHEMA,TABLE_NAME
key_len: NULL
ref: NULL
rows: NULL
Extra: Using where; Open_frm_only; Scanned 0 databases
Use of constant database and table lookup values enables the server to avoid directory scans. For references to 'VIEWS.TABLE_NAME', only the '.frm' file need be opened.
mysql> EXPLAIN SELECT TABLE_NAME, ROW_FORMAT FROM INFORMATION_SCHEMA.TABLES\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: TABLES
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Open_full_table; Scanned all databases
No lookup values are provided (there is no 'WHERE' clause), so the server must scan the data directory and each database directory. For each table thus identified, the table name and row format are selected. 'TABLE_NAME' requires no further table files to be opened (the 'SKIP_OPEN_TABLE' optimization applies). 'ROW_FORMAT' requires all table files to be opened ('OPEN_FULL_TABLE' applies). *note 'EXPLAIN': explain. reports 'OPEN_FULL_TABLE' because it is more expensive than 'SKIP_OPEN_TABLE'.
mysql> EXPLAIN SELECT TABLE_NAME, TABLE_TYPE FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'test'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: TABLES
type: ALL
possible_keys: NULL
key: TABLE_SCHEMA
key_len: NULL
ref: NULL
rows: NULL
Extra: Using where; Open_frm_only; Scanned 1 database
No table name lookup value is provided, so the server must scan the 'test' database directory. For the 'TABLE_NAME' and 'TABLE_TYPE' columns, the 'SKIP_OPEN_TABLE' and 'OPEN_FRM_ONLY' optimizations apply, respectively. *note 'EXPLAIN': explain. reports 'OPEN_FRM_ONLY' because it is more expensive.
mysql> EXPLAIN SELECT B.TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES AS A, INFORMATION_SCHEMA.COLUMNS AS B
WHERE A.TABLE_SCHEMA = 'test'
AND A.TABLE_NAME = 't1'
AND B.TABLE_NAME = A.TABLE_NAME\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: A
type: ALL
possible_keys: NULL
key: TABLE_SCHEMA,TABLE_NAME
key_len: NULL
ref: NULL
rows: NULL
Extra: Using where; Skip_open_table; Scanned 0 databases
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: B
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Using where; Open_frm_only; Scanned all databases;
Using join buffer
For the first *note 'EXPLAIN': explain. output row: Constant database and table lookup values enable the server to avoid directory scans for 'TABLES' values. References to 'TABLES.TABLE_NAME' require no further table files.
For the second *note 'EXPLAIN': explain. output row: All *note 'COLUMNS': information-schema-columns-table. table values are 'OPEN_FRM_ONLY' lookups, so 'COLUMNS.TABLE_NAME' requires the '.frm' file to be opened.
mysql> EXPLAIN SELECT * FROM INFORMATION_SCHEMA.COLLATIONS\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: COLLATIONS
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra:
In this case, no optimizations apply because *note 'COLLATIONS': information-schema-collations-table. is not one of the 'INFORMATION_SCHEMA' tables for which optimizations are available.
File: manual.info.tmp, Node: data-change-optimization, Next: permission-optimization, Prev: information-schema-optimization, Up: statement-optimization
Menu:
delete-optimization:: Optimizing DELETE Statements
This section explains how to speed up data change statements: *note 'INSERT': insert, *note 'UPDATE': update, and *note 'DELETE': delete. Traditional OLTP applications and modern web applications typically perform many small data change operations, where concurrency is vital. Data analysis and reporting applications typically run data change operations that affect many rows at once, where the main considerations are the I/O to write large amounts of data and to keep indexes up to date. For inserting and updating large volumes of data (known in the industry as ETL, for 'extract-transform-load'), sometimes you use other SQL statements or external commands that mimic the effects of *note 'INSERT': insert, *note 'UPDATE': update, and *note 'DELETE': delete. statements.
File: manual.info.tmp, Node: insert-optimization, Next: update-optimization, Prev: data-change-optimization, Up: data-change-optimization
8.2.4.1 Optimizing INSERT Statements
....................................
To optimize insert speed, combine many small operations into a single large operation. Ideally, you make a single connection, send the data for many new rows at once, and delay all index updates and consistency checking until the very end.
The time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:
Connecting: (3)
Sending query to server: (2)
Parsing query: (2)
Inserting row: (1 x size of row)
Inserting indexes: (1 x number of indexes)
Closing: (1)
This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query.
The size of the table slows down the insertion of indexes by log N, assuming B-tree indexes.
You can use the following methods to speed up inserts:
If you are inserting many rows from the same client at the same time, use *note 'INSERT': insert. statements with multiple 'VALUES' lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row *note 'INSERT': insert. statements. If you are adding data to a nonempty table, you can tune the 'bulk_insert_buffer_size' variable to make data insertion even faster. See *note server-system-variables::.
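For example, a single multiple-row insert replaces several single-row statements (the table and column names here are hypothetical):

     INSERT INTO t1 (name, qty) VALUES
       ('apple', 10),
       ('banana', 25),
       ('cherry', 5);

The server parses one statement and performs one client-server round trip instead of three.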
When loading a table from a text file, use *note 'LOAD DATA': load-data. This is usually 20 times faster than using *note 'INSERT': insert. statements. See *note load-data::.
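For example, loading a tab-delimited text file into a hypothetical table 't1' (the file path is illustrative):

     LOAD DATA INFILE '/tmp/t1_data.txt'
     INTO TABLE t1
     FIELDS TERMINATED BY '\t';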
Take advantage of the fact that columns have default values. Insert values explicitly only when the value to be inserted differs from the default. This reduces the parsing that MySQL must do and improves the insert speed.
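For example, if a hypothetical 'status' column has the default value 'active', omit it from the column list and let the server fill it in:

     INSERT INTO t1 (name) VALUES ('alice');  -- status defaults to 'active'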
See *note optimizing-innodb-bulk-data-loading:: for tips specific to 'InnoDB' tables.
See *note optimizing-myisam-bulk-data-loading:: for tips specific to 'MyISAM' tables.
File: manual.info.tmp, Node: update-optimization, Next: delete-optimization, Prev: insert-optimization, Up: data-change-optimization
8.2.4.2 Optimizing UPDATE Statements
....................................
An update statement is optimized like a *note 'SELECT': select. query with the additional overhead of a write. The speed of the write depends on the amount of data being updated and the number of indexes that are updated. Indexes that are not changed do not get updated.
Another way to get fast updates is to delay updates and then perform many of them together later. Performing multiple updates together is much quicker than doing them one at a time if you lock the table.
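For example, a batch of updates against a hypothetical table 't1' grouped under a single table lock:

     LOCK TABLES t1 WRITE;
     UPDATE t1 SET qty = qty - 1 WHERE id = 1;
     UPDATE t1 SET qty = qty + 1 WHERE id = 2;
     UPDATE t1 SET qty = 0 WHERE id = 3;
     UNLOCK TABLES;

For a 'MyISAM' table, the key cache is flushed only once, after the last update, rather than after each statement.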
For a 'MyISAM' table that uses dynamic row format, updating a row to a longer total length may split the row. If you do this often, it is very important to use *note 'OPTIMIZE TABLE': optimize-table. occasionally. See *note optimize-table::.
File: manual.info.tmp, Node: delete-optimization, Prev: update-optimization, Up: data-change-optimization
8.2.4.3 Optimizing DELETE Statements
....................................
The time required to delete individual rows in a 'MyISAM' table is exactly proportional to the number of indexes. To delete rows more quickly, you can increase the size of the key cache by increasing the 'key_buffer_size' system variable. See *note server-configuration::.
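For example, to enlarge the key cache to 64MB at runtime (the value shown is only illustrative; an appropriate size depends on available memory):

     SET GLOBAL key_buffer_size = 67108864;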
To delete all rows from a 'MyISAM' table, 'TRUNCATE TABLE TBL_NAME' is faster than 'DELETE FROM TBL_NAME'. Truncate operations are not transaction-safe; an error occurs when attempting one in the course of an active transaction or active table lock. See *note truncate-table::.
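For example, given a hypothetical 'MyISAM' table 't1', the two ways to empty it compare as follows:

     TRUNCATE TABLE t1;   -- fast: empties the table by re-creating it
     DELETE FROM t1;      -- slower: removes rows one by one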
File: manual.info.tmp, Node: permission-optimization, Next: miscellaneous-optimization-tips, Prev: data-change-optimization, Up: statement-optimization
The more complex your privilege setup, the more overhead applies to all SQL statements. Simplifying the privileges established by *note 'GRANT': grant. statements enables MySQL to reduce permission-checking overhead when clients execute statements. For example, if you do not grant any table-level or column-level privileges, the server need not ever check the contents of the 'tables_priv' and 'columns_priv' tables. Similarly, if you place no resource limits on any accounts, the server does not have to perform resource counting. If you have a very high statement-processing load, consider using a simplified grant structure to reduce permission-checking overhead.
File: manual.info.tmp, Node: miscellaneous-optimization-tips, Prev: permission-optimization, Up: statement-optimization
This section lists a number of miscellaneous tips for improving query processing speed:
If your application makes several database requests to perform related updates, combining the statements into a stored routine can help performance. Similarly, if your application computes a single result based on several column values or large volumes of data, combining the computation into a loadable function can help performance. The resulting fast database operations are then available to be reused by other queries, applications, and even code written in different programming languages. See *note stored-routines:: and Adding Functions to MySQL (https://dev.mysql.com/doc/extending-mysql/5.7/en/adding-functions.html) for more information.
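For example, two related updates combined into a hypothetical stored procedure (the table, columns, and procedure name are illustrative):

     DELIMITER //
     CREATE PROCEDURE transfer_qty(IN from_id INT, IN to_id INT, IN amount INT)
     BEGIN
       UPDATE t1 SET qty = qty - amount WHERE id = from_id;
       UPDATE t1 SET qty = qty + amount WHERE id = to_id;
     END//
     DELIMITER ;

A single 'CALL transfer_qty(1, 2, 5)' then performs both updates in one client-server round trip, from any client language.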
To fix any compression issues that occur with 'ARCHIVE' tables, use *note 'OPTIMIZE TABLE': optimize-table. See *note archive-storage-engine::.
If possible, classify reports as 'live' or as 'statistical', where data needed for statistical reports is created only from summary tables that are generated periodically from the live data.
If you have data that does not conform well to a rows-and-columns table structure, you can pack and store data into a *note 'BLOB': blob. column. In this case, you must provide code in your application to pack and unpack information, but this might save I/O operations to read and write the sets of related values.
With Web servers, store images and other binary assets as files, with the path name stored in the database rather than the file itself. Most Web servers are better at caching files than database contents, so using files is generally faster. (Although you must handle backups and storage issues yourself in this case.)
If you need really high speed, look at the low-level MySQL interfaces. For example, by accessing the MySQL *note 'InnoDB': innodb-storage-engine. or *note 'MyISAM': myisam-storage-engine. storage engine directly, you could get a substantial speed increase compared to using the SQL interface.
Similarly, for databases using the *note 'NDBCLUSTER': mysql-cluster. storage engine, you may wish to investigate possible use of the NDB API (see MySQL NDB Cluster API Developer Guide (https://dev.mysql.com/doc/ndbapi/en/)).
Replication can provide a performance benefit for some operations. You can distribute client retrievals among replicas to split up the load. To avoid slowing down the source while making backups, you can make backups using a replica. See *note replication::.
File: manual.info.tmp, Node: optimization-indexes, Next: optimizing-database-structure, Prev: statement-optimization, Up: optimization