1. Overview of Join Algorithms
MySQL relies on three core algorithms to execute join queries:
1.1 Nested-Loop Join
Basic principle: Use the left table as the driving table, iterate through each row of the driving table, then match all eligible rows in the right table (driven table).
-- Example query
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;
-- Pseudocode logic
for each row in orders { -- Iterate through the driving table (orders)
for each row in order_items { -- Iterate through the driven table (order_items)
if (orders.id = order_items.order_id) { -- Match join conditions
output matched row -- Output matching results
}
}
}
1.1.1 Simple Nested-Loop Join
- The most primitive nested loop method without any optimization.
- Time complexity: O(M*N) (M is the number of rows in the driving table, N is the number of rows in the driven table).
- Extremely poor performance, MySQL rarely chooses this algorithm.
1.1.2 Index Nested-Loop Join
Applicable scenario: When there is an index on the join field of the driven table.
-- Example: There is an index on the order_id field of the driven table (order_items)
SELECT * FROM orders o -- Driving table
INNER JOIN order_items oi ON o.id = oi.order_id; -- Driven table (relies on index)
-- Pseudocode logic
for each row in orders {
lookup order_items using index(order_id) -- Use index for fast matching row lookup
}
- Core optimization: Reduce the scanning range through the index of the driven table to avoid full table traversal.
- Time complexity: O(M*logN) (the time complexity of index query is logN).
- One of the most commonly used optimization methods in MySQL.
1.1.3 Block Nested-Loop Join
Applicable scenario: When there is no index on the driven table, reduce disk I/O through memory buffering.
-- Configuration: Set join buffer size (default value can be adjusted via system parameters)
SET join_buffer_size = 1048576; -- 1MB
-- Pseudocode logic
while (rows in orders not loaded) {
load batch of orders rows into join_buffer; -- Batch load driving table data into memory buffer
for each row in order_items { -- Iterate through the driven table
for each row in join_buffer { -- Match with driving table data in the buffer
if (orders.id = order_items.order_id) {
output matched row
}
}
}
}
- Core optimization: Batch cache driving table data through join_buffer to reduce repeated scans of the driven table.
- Time complexity is still O(M*N), but actual performance is better than Simple Nested-Loop Join (reduces the number of disk accesses).
1.2 Hash Join
Applicable version: Supported in MySQL 8.0.18 and above, specifically for equi-joins (ON A.key = B.key).
-- Example query
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;
-- Execution process
1. Build phase (construct hash table):
- Select the smaller table as the "build table" and construct a hash table (in memory) using the join key (e.g., order_id).
2. Probe phase (probe for matches):
- Scan the larger table (probe table) and calculate the hash value for the join key of each row.
- Quickly find matching hash values in the hash table to get results.
1.2.1 Advantages
- No reliance on indexes, suitable for joining large tables without indexes.
- Time complexity: O(M+N) (linear complexity, high efficiency).
- Friendly to large table joins, especially when the table data volume exceeds memory, it can be processed in blocks.
1.2.2 Limitations
- Only supports equi-joins (cannot handle range conditions such as <, >).
- Requires additional memory to store the hash table; if memory is insufficient, it will be written to disk, resulting in performance degradation.
1.3 Sort-Merge Join
Applicable scenario: When the join keys are already sorted (or can be sorted through indexes), often used in non-equi-joins or scenarios requiring sorting.
-- Example query (join keys need to be sorted)
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
ORDER BY o.id;
-- Execution process
1. Sort phase:
- If the join keys of the two tables are not sorted, sort them by the join keys first.
2. Merge phase:
- Simultaneously scan the two sorted tables and match and merge results in the order of the join keys.
- Core logic: Utilize the order of sorted data to avoid row-by-row matching in nested loops, suitable for ordered data.
- Time complexity: O(M logM + N logN) (the main time consumption is in the sorting phase).
2. Optimizer’s Selection Strategy
The MySQL optimizer automatically selects the optimal join algorithm based on factors such as table data volume and index conditions.
2.1 Cost Estimation
The optimizer’s selection results can be viewed through the EXPLAIN command:
EXPLAIN SELECT * FROM orders oINNER JOIN order_items oi ON o.id = oi.order_id;
The optimizer mainly considers the following cost factors:
- Number of rows in the table (data volume).
- Whether there is an index on the join field and the index efficiency.
- Data distribution (e.g., the repetition rate of join keys).
- System parameters (such as join_buffer_size, hash_join_buffer_size, etc.).
2.2 Access Method Selection
Example: When the driven table has an index, the optimizer typically selects:
-- Create an index for the driven table
CREATE INDEX idx_order_id ON order_items(order_id);
-- The optimizer may choose:
1. Use the small table (orders) as the driving table (reduce the number of outer loop iterations).
2. Use the Index Nested-Loop Join algorithm.
3. Utilize the idx_order_id index of the order_items table to speed up matching.
3. JOIN Optimization Practices
3.1 Index Optimization
- Create indexes on the join fields of the driven table (prioritize supporting Index Nested-Loop Join).
- For complex scenarios, composite indexes can be used (including join keys and fields required for the query).
-- Single-field indexCREATE INDEX idx_order_id ON order_items(order_id);-- Composite index (covers join keys and query fields)CREATE INDEX idx_order_product ON order_items(order_id, product_id);
3.2 Small Table Drives Large Table
- Prefer to select a small table as the driving table (reduce the number of outer loop iterations and reduce total time consumption).
- The join order can be forced through STRAIGHT_JOIN (use with caution, only when it is confirmed that the optimizer’s selection is unreasonable).
-- Example of small table driving large table
SELECT * FROM small_table s
INNER JOIN big_table b ON s.id = b.small_id;
-- Force join order
SELECT STRAIGHT_JOIN * FROM small_table s
INNER JOIN big_table b ON s.id = b.small_id;
3.3 JOIN Buffer Optimization
- Adjust the join_buffer_size (default 256KB, maximum not exceeding 50% of physical memory).
- Monitor buffer usage:
-- Set buffer size
SET join_buffer_size = 4194304; -- 4MB
-- View buffer usage status
SHOW STATUS LIKE 'Join%'; -- Pay attention to indicators such as Join_buffer_use
3.4 Pagination Optimization
- For large data volume join queries, reduce the scanning range through primary key range restrictions:
-- Optimize pagination query (avoid full table scan)
SELECT o.*, oi.*
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
WHERE o.id > last_id -- Start from the ID of the last record of the previous page
LIMIT 100;
4. Common Issues and Solutions
4.1 Avoid Cartesian Products
Problem: When no join condition is specified, the two tables will generate M*N invalid results.
-- Wrong example (no join condition)
SELECT * FROM orders, order_items;
-- Correct example (specify join condition)
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;
4.2 Handling NULL Values
Problem: NULL cannot be matched through =, so IS NULL should be used.
-- Find orders without order items (data exists in the left table, no match in the right table)
SELECT * FROM orders o
LEFT JOIN order_items oi ON o.id = oi.order_id
WHERE oi.id IS NULL;
4.3 Multi-table JOIN Optimization
Problem: Directly joining multiple tables may lead to a sharp increase in complexity, and queries can be split.
-- Split complex queries (simplify intermediate results with CTE first)
WITH order_info AS ( -- Subquery/CTE splits the first step of joining
SELECT o.id, o.order_no, oi.product_id
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
)
-- Then join other tables
SELECT oi.*, p.name
FROM order_info oi
INNER JOIN products p ON oi.product_id = p.id;