MySQL 8.0新特性之統(tǒng)計(jì)直方圖
概覽
MySQL8.0實(shí)現(xiàn)了統(tǒng)計(jì)直方圖。利用直方圖,用戶可以對(duì)一張表的一列做數(shù)據(jù)分布的統(tǒng)計(jì),特別是針對(duì)沒有索引的字段。這可以幫助查詢優(yōu)化器找到更優(yōu)的執(zhí)行計(jì)劃。統(tǒng)計(jì)直方圖的主要使用場(chǎng)景是用來計(jì)算字段選擇性,即過濾效率。
可以通過以下方式來創(chuàng)建或者刪除直方圖:
- ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS;
- ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
buckets默認(rèn)是100。統(tǒng)計(jì)直方圖的信息存儲(chǔ)在數(shù)據(jù)字典表"column_statistcs"中,可以通過視圖information_schema.COLUMN_STATISTICS訪問。直方圖以靈活的JSON的格式存儲(chǔ)。ANALYZE TABLE會(huì)基于表大小自動(dòng)判斷是否要進(jìn)行取樣操作。ANALYZE TABLE也會(huì)基于表中列的數(shù)據(jù)分布情況以及bucket的數(shù)量來決定是否要建立等寬直方圖(singleton)還是等高直方圖(equi-height)。
什么是直方圖
數(shù)據(jù)庫(kù)中,查詢優(yōu)化器負(fù)責(zé)將SQL轉(zhuǎn)換成最有效的執(zhí)行計(jì)劃。有時(shí)候,查詢優(yōu)化器會(huì)走不到最優(yōu)的執(zhí)行計(jì)劃,導(dǎo)致花費(fèi)了更多不必要的時(shí)間。造成這種情況的主要原因是,查詢優(yōu)化器有時(shí)無法準(zhǔn)確的知道以下幾個(gè)問題的答案:
- 每個(gè)表有多少行?
- 每一列有多少不同的值?
- 每一列的數(shù)據(jù)分布情況?
舉例說明:一張簡(jiǎn)單的表,兩個(gè)字段,一個(gè)字段是person_id,另一個(gè)字段是time_of_day,表示睡覺時(shí)間
- CREATE TABLE bedtime (
- person_id INT,
- time_of_day TIME);
對(duì)于time_of_day列,大部分人上床時(shí)間會(huì)在晚上11:00左右。所以下面第一個(gè)查詢會(huì)比第二個(gè)查詢返回更多的行數(shù):
- 1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00"
- 2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"
如果沒有統(tǒng)計(jì)數(shù)據(jù),優(yōu)化器會(huì)假設(shè)time_of_day的值是均勻分配的,即一個(gè)人的上床時(shí)間在下午3點(diǎn)和晚上11點(diǎn)的概率差不多。如何才能使查詢優(yōu)化器知道數(shù)據(jù)的分布情況?一個(gè)解決方法就是在列上建立統(tǒng)計(jì)直方圖。
直方圖能近似獲得一列的數(shù)據(jù)分布情況,從而讓數(shù)據(jù)庫(kù)知道它含有哪些數(shù)據(jù)。直方圖有多種形式,MySQL支持了兩種:等寬直方圖(singleton)、等高直方圖(equi-height)。直方圖的共同點(diǎn)是,它們都將數(shù)據(jù)分到了一系列的buckets中去。MySQL會(huì)自動(dòng)將數(shù)據(jù)劃到不同的buckets中,也會(huì)自動(dòng)決定創(chuàng)建哪種類型的直方圖。
如何創(chuàng)建和刪除統(tǒng)計(jì)直方圖
為了管理統(tǒng)計(jì)直方圖,ANALYZE TABLE命令新增了兩個(gè)子句:
- ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS;
- ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
第一個(gè)表示一次可以為一個(gè)或多個(gè)列創(chuàng)建統(tǒng)計(jì)直方圖:
- mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS;
- +----------------+-----------+----------+---------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+---------------------------------------------------+
- | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. |
- +----------------+-----------+----------+---------------------------------------------------+
- 1 row in set (0.27 sec)
- mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS;
- +----------------+-----------+----------+---------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+---------------------------------------------------------+
- | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. |
- | sakila.payment | histogram | status | Histogram statistics created for column 'payment_date'. |
- +----------------+-----------+----------+---------------------------------------------------------+
buckets的值必須指定,可以設(shè)置為1到1024,默認(rèn)值是100。
對(duì)于不同的數(shù)據(jù)集合,buckets的值取決于以下幾個(gè)因素:
- 這列有多少不同的值
- 數(shù)據(jù)的分布情況
- 需要多高的準(zhǔn)確性
但是,某些buckets的值能提升的關(guān)于數(shù)據(jù)分布情況的準(zhǔn)確性相當(dāng)?shù)?。所以,建議的做法是,開始的時(shí)候?qū)uckets的值設(shè)的低一點(diǎn),比如32,然后如果沒有滿足期望,再往上增大。
上面這個(gè)例子中,我們對(duì)于amount列建立了兩次直方圖。第一個(gè)語(yǔ)句,建立了一個(gè)新的直方圖;第二個(gè)語(yǔ)句,amount列的直方圖被重寫了。
如果需要?jiǎng)h除已經(jīng)創(chuàng)建的直方圖,用DROP HISTOGRAM就可以實(shí)現(xiàn):
- mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date;
- +----------------+-----------+----------+---------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+---------------------------------------------------------+
- | sakila.payment | histogram | status | Histogram statistics removed for column 'payment_date'. |
- +----------------+-----------+----------+---------------------------------------------------------+
UPDATE HISTOGRAM可以一次性為多個(gè)列創(chuàng)建直方圖。如果命令中間寫錯(cuò),ANALYZE TABLE仍然會(huì)起作用。比如,你指定了三列,但第二列不存在。MySQL仍然會(huì)為第一列和第三列創(chuàng)建直方圖。
- mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS;
- +----------------+-----------+----------+----------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+----------------------------------------------------------+
- | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. |
- | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_month'. |
- | tpcds.customer | histogram | Error | The column 'c_foobar' does not exist. |
- +----------------+-----------+----------+----------------------------------------------------------+
- 3 rows in set (0.15 sec)
數(shù)據(jù)庫(kù)內(nèi)部發(fā)生了什么
當(dāng)你讀過MySQL手冊(cè),你可能已經(jīng)注意到新的系統(tǒng)變量histogram_generation_max_mem_size。當(dāng)用戶建立統(tǒng)計(jì)直方圖,這個(gè)值是用來控制大約多少內(nèi)存能允許被使用。那么,為什么要控制這個(gè)呢?
當(dāng)你在建立直方圖的時(shí)候,MySQL server會(huì)將所有數(shù)據(jù)讀到內(nèi)存中,然后在內(nèi)存中進(jìn)行操作,包括排序。如果對(duì)一個(gè)很大的表建立直方圖,可能會(huì)有風(fēng)險(xiǎn)將幾百M(fèi)的數(shù)據(jù)都讀到內(nèi)存中,但這是不明智的。為了規(guī)避這個(gè)風(fēng)險(xiǎn),MySQL會(huì)根據(jù)給定的histogram_generation_max_mem_size的值計(jì)算該將多少行數(shù)據(jù)讀到內(nèi)存中。如果根據(jù)當(dāng)前histogram_generation_max_mem_size的限制,MySQL認(rèn)為只能讀一部分?jǐn)?shù)據(jù),那么MySQL會(huì)進(jìn)行取樣。通過“sampling-rate”屬性,可以觀察到取樣比率。
- mysql> SET histogram_generation_max_mem_size = 1000000;
- Query OK, 0 rows affected (0.00 sec)
- mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS;
- +----------------+-----------+----------+------------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+------------------------------------------------------------+
- | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_country'. |
- +----------------+-----------+----------+------------------------------------------------------------+
- 1 row in set (0.22 sec)
- mysql> SELECT histogram->>'$."sampling-rate"'
- -> FROM information_schema.column_statistics
- -> WHERE table_name = "customer"
- -> AND column_name = "c_birth_country";
- +---------------------------------+
- | histogram->>'$."sampling-rate"' |
- +---------------------------------+
- | 0.048743243211626014 |
- +---------------------------------+
- 1 row in set (0.00 sec)
優(yōu)化器創(chuàng)建了一個(gè)直方圖,大約讀了c_birth_country列4.8%的數(shù)據(jù)。取樣是不確定的,因此意義不大。同樣的數(shù)據(jù),同樣的兩條語(yǔ)句‘‘ANALYZE TABLE tbl UPDATE HISTOGRAM …’’,如果用了取樣,得到的直方圖可能就不一樣。
查詢案例
統(tǒng)計(jì)直方圖可以帶來些什么?我們可以看個(gè)例子,這個(gè)例子中用了直方圖,在執(zhí)行時(shí)間上會(huì)有很大的不同。
環(huán)境:
- TPC-DS Benchmark with scale factor of 1
- Intel Core i7-4770
- Debian Stretch
- MySQL 8.0 RC1
- innodb_buffer_pool_size = 2G
- optimizer_switch = "condition_fanout_filter=on"
Query 90
查詢?nèi)缦拢荷衔缡圪u的數(shù)量與晚上售賣的數(shù)量的比率。
- mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio
- -> FROM (SELECT COUNT(*) amc
- -> FROM web_sales,
- -> household_demographics,
- -> time_dim,
- -> web_page
- -> WHERE ws_sold_time_sk = time_dim.t_time_sk
- -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
- -> AND ws_web_page_sk = web_page.wp_web_page_sk
- -> AND time_dim.t_hour BETWEEN 9 AND 9 + 1
- -> AND household_demographics.hd_dep_count = 2
- -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) at,
- -> (SELECT COUNT(*) pmc
- -> FROM web_sales,
- -> household_demographics,
- -> time_dim,
- -> web_page
- -> WHERE ws_sold_time_sk = time_dim.t_time_sk
- -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk
- -> AND ws_web_page_sk = web_page.wp_web_page_sk
- -> AND time_dim.t_hour BETWEEN 15 AND 15 + 1
- -> AND household_demographics.hd_dep_count = 2
- -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt
- -> ORDER BY am_pm_ratio
- -> LIMIT 100;
- +-------------+
- | am_pm_ratio |
- +-------------+
- | 1.27619048 |
- +-------------+
- 1 row in set (1.48 sec)
可以看到,查詢花費(fèi)了1.5秒左右??雌饋聿凰愣?,但是通過在一列上建立直方圖,可以讓執(zhí)行速度快三倍。
- mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS;
- +----------------+-----------+----------+----------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+----------------------------------------------------------+
- | tpcds.web_page | histogram | status | Histogram statistics created for column 'wp_char_count'. |
- +----------------+-----------+----------+----------------------------------------------------------+
- 1 row in set (0.06 sec)
- mysql> SELECT ...
- +-------------+
- | am_pm_ratio |
- +-------------+
- | 1.27619048 |
- +-------------+
- 1 row in set (0.50 sec)
通過這個(gè)直方圖,查詢花費(fèi)了0.5秒左右。原因呢?主要的原因是,查詢語(yǔ)句中的謂詞“web_page.wp_char_count BETWEEN 5000 AND 5200”。沒有直方圖的時(shí)候,優(yōu)化器會(huì)假設(shè)web_page表中符合謂詞“web_page.wp_char_count BETWEEN 5000 AND 5200”的數(shù)據(jù)占到總數(shù)據(jù)11.11%左右。但,這是錯(cuò)誤的。用下面的查詢語(yǔ)句,可以看到實(shí)際上滿足條件的數(shù)據(jù)只有1.6%。
- mysql> SELECT
- -> (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200)
- -> /
- -> (SELECT COUNT(*) FROM web_page) AS ratio;
- +--------+
- | ratio |
- +--------+
- | 0.0167 |
- +--------+
- 1 row in set (0.00 sec)
通過直方圖,優(yōu)化器會(huì)知道這個(gè)信息,并且更早進(jìn)行表join,因此執(zhí)行時(shí)間快了三倍。
Query 61
查詢?nèi)缦拢涸诮o定的年份和月份,有和沒有廣告宣傳的情況下貨物的售賣比率。
- mysql> SELECT promotions, -> total,
- -> CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100
- -> FROM (SELECT SUM(ss_ext_sales_price) promotions
- -> FROM store_sales,
- -> store,
- -> promotion,
- -> date_dim,
- -> customer,
- -> customer_address,
- -> item
- -> WHERE ss_sold_date_sk = d_date_sk
- -> AND ss_store_sk = s_store_sk
- -> AND ss_promo_sk = p_promo_sk
- -> AND ss_customer_sk = c_customer_sk
- -> AND ca_address_sk = c_current_addr_sk
- -> AND ss_item_sk = i_item_sk
- -> AND ca_gmt_offset = -5
- -> AND i_category = 'Home'
- -> AND ( p_channel_dmail = 'Y'
- -> OR p_channel_email = 'Y'
- -> OR p_channel_tv = 'Y' )
- -> AND s_gmt_offset = -5
- -> AND d_year = 2000
- -> AND d_moy = 12) promotional_sales,
- -> (SELECT SUM(ss_ext_sales_price) total
- -> FROM store_sales,
- -> store,
- -> date_dim,
- -> customer,
- -> customer_address,
- -> item
- -> WHERE ss_sold_date_sk = d_date_sk
- -> AND ss_store_sk = s_store_sk
- -> AND ss_customer_sk = c_customer_sk
- -> AND ca_address_sk = c_current_addr_sk
- -> AND ss_item_sk = i_item_sk
- -> AND ca_gmt_offset = -5
- -> AND i_category = 'Home'
- -> AND s_gmt_offset = -5
- -> AND d_year = 2000
- -> AND d_moy = 12) all_sales
- -> ORDER BY promotions,
- -> total
- -> LIMIT 100;
- +------------+------------+--------------------------------------------------------------------------+
- | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
- +------------+------------+--------------------------------------------------------------------------+
- | 3213210.07 | 5966836.78 | 53.85114741 |
- +------------+------------+--------------------------------------------------------------------------+
- 1 row in set (2.78 sec)
可以看到,查詢花費(fèi)了2.8秒左右。但是,查詢優(yōu)化器不知道s_gmt_offset列只有一個(gè)不同的值。沒有統(tǒng)計(jì)數(shù)據(jù)的情況下,優(yōu)化器會(huì)用所謂的“hard-coded guesstimates”,會(huì)假設(shè)10%的數(shù)據(jù)符合條件“ca_gmt_offset = -5“。如果在這個(gè)列上增加一個(gè)直方圖,優(yōu)化器會(huì)知道所有的數(shù)據(jù)都符合條件,因此會(huì)走一個(gè)更好的執(zhí)行計(jì)劃。
- mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS;
- +-------------+-----------+----------+---------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +-------------+-----------+----------+---------------------------------------------------------+
- | tpcds.store | histogram | status | Histogram statistics created for column 's_gmt_offset'. |
- +-------------+-----------+----------+---------------------------------------------------------+
- 1 row in set (0.06 sec)
- mysql> SELECT ...
- +------------+------------+--------------------------------------------------------------------------+
- | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
- +------------+------------+--------------------------------------------------------------------------+
- | 3213210.07 | 5966836.78 | 53.85114741 |
- +------------+------------+--------------------------------------------------------------------------+
- 1 row in set (1.37 sec)
有了直方圖,查詢花了不到1.4秒,差不多提升了2倍。原因是:
- 第一個(gè)執(zhí)行計(jì)劃,優(yōu)化器選擇了第一個(gè)派生表在store表上做了全表掃描,然后對(duì)表item, store_sales, date_dim, customer,customer_address分別做了主鍵查找。
- 但是,當(dāng)MySQL意識(shí)到store表會(huì)比它猜測(cè)的返回更多的數(shù)據(jù)時(shí),優(yōu)化器會(huì)在item表上做全表掃描,然后對(duì)store_sales, store, date_dim, customer,customer_address 分別做主鍵查找。
為什么不用索引?
索引往往也能做上述工作,比如:
- mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset);
- Query OK, 0 rows affected (0.53 sec)
- Records: 0 Duplicates: 0 Warnings: 0
- mysql> SELECT ...
- +------------+------------+--------------------------------------------------------------------------+
- | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 |
- +------------+------------+--------------------------------------------------------------------------+
- | 3213210.07 | 5966836.78 | 53.85114741 |
- +------------+------------+--------------------------------------------------------------------------+
- 1 row in set (1.41 sec)
但是,用直方圖而不是索引有以下兩個(gè)原因:
- 維護(hù)一個(gè)索引有代價(jià)。每一次的insert、update、delete都會(huì)需要更新索引,會(huì)對(duì)性能有一定的影響。而直方圖一次創(chuàng)建永不更新,除非明確去更新它。所以不會(huì)影響insert、update、delete的性能。
- 如果有索引,優(yōu)化器用使用index dives技術(shù)來估算符合條件范圍的記錄數(shù)量。這種方式也是有代價(jià)的,特別是查詢語(yǔ)句條件中有很長(zhǎng)的IN列表。直方圖相對(duì)而言代價(jià)小,因此可能更合適。
檢索統(tǒng)計(jì)直方圖
統(tǒng)計(jì)直方圖以JSON的形式存在數(shù)據(jù)字典中??梢杂脙?nèi)建的JSON函數(shù)built-in JSON functions從直方圖獲取一些信息。舉例來說,如果需要知道amount列的直方圖的創(chuàng)建或者更新時(shí)間,可以用JSON unquoting extraction operator來獲取信息:
- mysql> SELECT
- -> HISTOGRAM->>'$."last-updated"' AS last_updated
- -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
- -> WHERE
- -> SCHEMA_NAME = "sakila"
- -> AND TABLE_NAME = "payment"
- -> AND COLUMN_NAME = "amount";
- +----------------------------+
- | last_updated |
- +----------------------------+
- | 2017-09-15 11:54:25.000000 |
- +----------------------------+
如果要查找實(shí)際有多少個(gè)buckets,以及用analyze table時(shí)指定了多少個(gè)buckets,可以如下:
- mysql> SELECT
- -> TABLE_NAME,
- -> COLUMN_NAME,
- -> HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified,
- -> JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created
- -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
- -> WHERE
- -> SCHEMA_NAME = "sakila";
- +------------+--------------+-----------------------+---------------------+
- | TABLE_NAME | COLUMN_NAME | num_buckets_specified | num_buckets_created |
- +------------+--------------+-----------------------+---------------------+
- | payment | amount | 32 | 19 |
- | payment | payment_date | 32 | 32 |
- +------------+--------------+-----------------------+---------------------+
經(jīng)測(cè)試,num_buckets_created與字段的distinct值很接近,近似相等;但是num_buckets_created不會(huì)大于num_buckets_specified。如果num_buckets_created與num_buckets_specified相等,那么存在可能,在創(chuàng)建直方圖的時(shí)候指定的buckets不夠多,那么此時(shí)可以通過增加buckets的數(shù)量,來提高直方圖的準(zhǔn)確性。
buckets可以設(shè)置為1到1024
優(yōu)化器trace
如果你想要知道直方圖做了什么,最簡(jiǎn)單的方式就是看一下執(zhí)行計(jì)劃:
- mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 11.11 | Using where |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- 1 row in set, 1 warning (0.00 sec)
- mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS;
- +----------------+-----------+----------+--------------------------------------------------------+
- | Table | Op | Msg_type | Msg_text |
- +----------------+-----------+----------+--------------------------------------------------------+
- | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. |
- +----------------+-----------+----------+--------------------------------------------------------+
- 1 row in set (0.10 sec)
- mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10;
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 32.12 | Using where |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- 1 row in set, 1 warning (0.00 sec)
可以看到filtered列,從默認(rèn)的11.11%變成了更精確的32.12%。但是,如果有多個(gè)條件,有些有直方圖,有些沒有,就比較難判斷優(yōu)化器做了什么改進(jìn):
- mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 6.38 | Using where |
- +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+
- 1 row in set, 1 warning (0.00 sec)
如果想要知道更多關(guān)于直方圖統(tǒng)計(jì)的細(xì)節(jié),可以使用trace:
- mysql> SET OPTIMIZER_TRACE = "enabled=on";
- Query OK, 0 rows affected (0.00 sec)
- mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000;
- Query OK, 0 rows affected (0.00 sec)
- mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967;
- mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
- +----------------------------------------------------------------------------------------+
- | JSON_EXTRACT(TRACE, "$**.filtering_effect") |
- +----------------------------------------------------------------------------------------+
- | [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]] |
- +----------------------------------------------------------------------------------------+
- 1 row in set (0.00 sec)
這里用了JSON_EXTRACT從trace里取出相關(guān)的部分。對(duì)于每個(gè)條件,直方圖被使用的話,就會(huì)看到估算過的字段的選擇性。在這個(gè)例子里,通過直方圖,對(duì)“c_birth_day <= 20”條件,估算出63.76%的數(shù)據(jù)滿足條件。事實(shí)上,與實(shí)際的數(shù)據(jù)分布情況基本一致:
- mysql> SELECT
- -> (SELECT count(*) FROM customer WHERE c_birth_day <= 20)
- -> /
- -> (SELECT COUNT(*) FROM customer) AS ratio;
- +--------+
- | ratio |
- +--------+
- | 0.6376 |
- +--------+
- 1 row in set (0.03 sec)




























