Underlying Principles of MySQL Join Queries

1. Overview of Join Algorithms

MySQL relies on three core algorithms to execute join queries:

1.1 Nested-Loop Join

Basic principle: Use the left table as the driving table, iterate through each row of the driving table, then match all eligible rows in the right table (driven table).

-- Example query
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;

-- Pseudocode logic
for each row in orders {  -- Iterate through the driving table (orders)
    for each row in order_items {  -- Iterate through the driven table (order_items)
        if (orders.id = order_items.order_id) {  -- Match join conditions
            output matched row  -- Output matching results
        }
    }
}

1.1.1 Simple Nested-Loop Join

The most primitive nested loop method without any optimization.
Time complexity: O(M*N) (M is the number of rows in the driving table, N is the number of rows in the driven table).
Extremely poor performance, MySQL rarely chooses this algorithm.

1.1.2 Index Nested-Loop Join

Applicable scenario: When there is an index on the join field of the driven table.

-- Example: There is an index on the order_id field of the driven table (order_items)
SELECT * FROM orders o  -- Driving table
INNER JOIN order_items oi ON o.id = oi.order_id;  -- Driven table (relies on index)

-- Pseudocode logic
for each row in orders {
    lookup order_items using index(order_id)  -- Use index for fast matching row lookup
}

Core optimization: Reduce the scanning range through the index of the driven table to avoid full table traversal.
Time complexity: O(M*logN) (the time complexity of index query is logN).
One of the most commonly used optimization methods in MySQL.

1.1.3 Block Nested-Loop Join

Applicable scenario: When there is no index on the driven table, reduce disk I/O through memory buffering.

-- Configuration: Set join buffer size (default value can be adjusted via system parameters)
SET join_buffer_size = 1048576;  -- 1MB

-- Pseudocode logic
while (rows in orders not loaded) {
    load batch of orders rows into join_buffer;  -- Batch load driving table data into memory buffer
    for each row in order_items {  -- Iterate through the driven table
        for each row in join_buffer {  -- Match with driving table data in the buffer
            if (orders.id = order_items.order_id) {
                output matched row
            }
        }
    }
}

Core optimization: Batch cache driving table data through join_buffer to reduce repeated scans of the driven table.
Time complexity is still O(M*N), but actual performance is better than Simple Nested-Loop Join (reduces the number of disk accesses).

1.2 Hash Join

Applicable version: Supported in MySQL 8.0.18 and above, specifically for equi-joins (ON A.key = B.key).

-- Example query
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;

-- Execution process
1. Build phase (construct hash table):
   - Select the smaller table as the "build table" and construct a hash table (in memory) using the join key (e.g., order_id).
2. Probe phase (probe for matches):
   - Scan the larger table (probe table) and calculate the hash value for the join key of each row.
   - Quickly find matching hash values in the hash table to get results.

1.2.1 Advantages

No reliance on indexes, suitable for joining large tables without indexes.
Time complexity: O(M+N) (linear complexity, high efficiency).
Friendly to large table joins, especially when the table data volume exceeds memory, it can be processed in blocks.

1.2.2 Limitations

Only supports equi-joins (cannot handle range conditions such as <, >).
Requires additional memory to store the hash table; if memory is insufficient, it will be written to disk, resulting in performance degradation.

1.3 Sort-Merge Join

Applicable scenario: When the join keys are already sorted (or can be sorted through indexes), often used in non-equi-joins or scenarios requiring sorting.

-- Example query (join keys need to be sorted)
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
ORDER BY o.id;

-- Execution process
1. Sort phase:
   - If the join keys of the two tables are not sorted, sort them by the join keys first.
2. Merge phase:
   - Simultaneously scan the two sorted tables and match and merge results in the order of the join keys.

Core logic: Utilize the order of sorted data to avoid row-by-row matching in nested loops, suitable for ordered data.
Time complexity: O(M logM + N logN) (the main time consumption is in the sorting phase).

2. Optimizer’s Selection Strategy

The MySQL optimizer automatically selects the optimal join algorithm based on factors such as table data volume and index conditions.

2.1 Cost Estimation

The optimizer’s selection results can be viewed through the EXPLAIN command:

EXPLAIN SELECT * FROM orders oINNER JOIN order_items oi ON o.id = oi.order_id;

The optimizer mainly considers the following cost factors:

Number of rows in the table (data volume).
Whether there is an index on the join field and the index efficiency.
Data distribution (e.g., the repetition rate of join keys).
System parameters (such as join_buffer_size, hash_join_buffer_size, etc.).

2.2 Access Method Selection

Example: When the driven table has an index, the optimizer typically selects:

-- Create an index for the driven table
CREATE INDEX idx_order_id ON order_items(order_id);

-- The optimizer may choose:
1. Use the small table (orders) as the driving table (reduce the number of outer loop iterations).
2. Use the Index Nested-Loop Join algorithm.
3. Utilize the idx_order_id index of the order_items table to speed up matching.

3. JOIN Optimization Practices

3.1 Index Optimization

Create indexes on the join fields of the driven table (prioritize supporting Index Nested-Loop Join).
For complex scenarios, composite indexes can be used (including join keys and fields required for the query).

-- Single-field indexCREATE INDEX idx_order_id ON order_items(order_id);-- Composite index (covers join keys and query fields)CREATE INDEX idx_order_product ON order_items(order_id, product_id);

3.2 Small Table Drives Large Table

Prefer to select a small table as the driving table (reduce the number of outer loop iterations and reduce total time consumption).
The join order can be forced through STRAIGHT_JOIN (use with caution, only when it is confirmed that the optimizer’s selection is unreasonable).

-- Example of small table driving large table
SELECT * FROM small_table s
INNER JOIN big_table b ON s.id = b.small_id;

-- Force join order
SELECT STRAIGHT_JOIN * FROM small_table s
INNER JOIN big_table b ON s.id = b.small_id;

3.3 JOIN Buffer Optimization

Adjust the join_buffer_size (default 256KB, maximum not exceeding 50% of physical memory).
Monitor buffer usage:

-- Set buffer size
SET join_buffer_size = 4194304;  -- 4MB

-- View buffer usage status
SHOW STATUS LIKE 'Join%';  -- Pay attention to indicators such as Join_buffer_use

3.4 Pagination Optimization

For large data volume join queries, reduce the scanning range through primary key range restrictions:

-- Optimize pagination query (avoid full table scan)
SELECT o.*, oi.*
FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id
WHERE o.id > last_id  -- Start from the ID of the last record of the previous page
LIMIT 100;

4. Common Issues and Solutions

4.1 Avoid Cartesian Products

Problem: When no join condition is specified, the two tables will generate M*N invalid results.

-- Wrong example (no join condition)
SELECT * FROM orders, order_items;

-- Correct example (specify join condition)
SELECT * FROM orders o
INNER JOIN order_items oi ON o.id = oi.order_id;

4.2 Handling NULL Values

Problem: NULL cannot be matched through =, so IS NULL should be used.

-- Find orders without order items (data exists in the left table, no match in the right table)
SELECT * FROM orders o
LEFT JOIN order_items oi ON o.id = oi.order_id
WHERE oi.id IS NULL;

4.3 Multi-table JOIN Optimization

Problem: Directly joining multiple tables may lead to a sharp increase in complexity, and queries can be split.

-- Split complex queries (simplify intermediate results with CTE first)
WITH order_info AS (  -- Subquery/CTE splits the first step of joining
    SELECT o.id, o.order_no, oi.product_id
    FROM orders o
    INNER JOIN order_items oi ON o.id = oi.order_id
)
-- Then join other tables
SELECT oi.*, p.name
FROM order_info oi
INNER JOIN products p ON oi.product_id = p.id;

Underlying Principles of MySQL Join Queries

1. Overview of Join Algorithms

1.1 Nested-Loop Join

1.1.1 Simple Nested-Loop Join

1.1.2 Index Nested-Loop Join

1.1.3 Block Nested-Loop Join

1.2 Hash Join

1.2.1 Advantages

1.2.2 Limitations

1.3 Sort-Merge Join

2. Optimizer’s Selection Strategy

2.1 Cost Estimation

2.2 Access Method Selection

3. JOIN Optimization Practices

3.1 Index Optimization

3.2 Small Table Drives Large Table

3.3 JOIN Buffer Optimization

3.4 Pagination Optimization

4. Common Issues and Solutions

4.1 Avoid Cartesian Products

4.2 Handling NULL Values

4.3 Multi-table JOIN Optimization

By BytePilot

Leave a Reply Cancel reply

You Missed

Key Points for Java Practical Application Practices

Key Knowledge for Java Interviews

Underlying Principles of MySQL Join Queries

MySQL transaction isolation levels: Read Uncommitted, Read Committed, Repeatable Read, and Serializable

1. Overview of Join Algorithms

1.1 Nested-Loop Join

1.1.1 Simple Nested-Loop Join

1.1.2 Index Nested-Loop Join

1.1.3 Block Nested-Loop Join

1.2 Hash Join

1.2.1 Advantages

1.2.2 Limitations

1.3 Sort-Merge Join

2. Optimizer’s Selection Strategy

2.1 Cost Estimation

2.2 Access Method Selection

3. JOIN Optimization Practices

3.1 Index Optimization

3.2 Small Table Drives Large Table

3.3 JOIN Buffer Optimization

3.4 Pagination Optimization

4. Common Issues and Solutions

4.1 Avoid Cartesian Products

4.2 Handling NULL Values

4.3 Multi-table JOIN Optimization

By BytePilot

Related Post

Leave a Reply Cancel reply

You Missed