PostgreSQL 行评估算法

了解本文需要先了解一下PostgreSQL的统计信息表pg_class,pg_stats。可以查看我前面写的

PostgreSQL Statistics and Query Explain Introduction

接下来先看一个PostgreSQL EXPLAIN语句的输出:

注意看图中的rows=114688,PostgreSQL是怎么来算这个值的，这里简单的写一下我对PostgreSQL行评估的理解，有兴趣的朋友可以线下交流,同时这对了解PostgreSQL的代码也是有帮助的。

创建测试表，写入测试记录：

create table tbl_test1 (id serial8 primary key,name varchar(32));

create table tbl_test2 (id serial8 primary key,name varchar(32));

插入测试记录

insert into tbl_test1 (name) values (‘a’),(‘b’),(‘b’),(‘c’),(‘c’),(‘c’),(‘d’),(‘d’),(‘d’),(‘d’),(‘e’),(‘e’),(‘e’),(‘e’),(‘e’),(‘f’),(‘f’),(‘f’),(‘f’),(‘f’),(‘f’),(‘g’),(‘g’),(‘g’),(‘g’),(‘g’),(‘g’),(‘g’);

insert into tbl_test1(name) select name from tbl_test1;

…………….

bj_dsm_0=> insert into tbl_test2(name) select name from tbl_test1;

INSERT 0 14336

bj_dsm_0=> insert into tbl_test2(name) select name from tbl_test1;

INSERT 0 14336

bj_dsm_0=> select count(*) from tbl_test1;

count

——-

14336

(1 row)

bj_dsm_0=> select count(*) from tbl_test2;

count

——-

28672

(1 row)

再插入几条稀有记录

bj_dsm_0=> insert into tbl_test1 (name) values (‘digoal’);

INSERT 0 1

bj_dsm_0=> insert into tbl_test2 (name) values (‘digoal’);

INSERT 0 1

查看测试记录

bj_dsm_0=> select * from tbl_Test1 limit 29;

id | name

—-+——

1 | a

2 | b

3 | b

4 | c

5 | c

6 | c

7 | d

8 | d

9 | d

10 | d

11 | e

12 | e

13 | e

14 | e

15 | e

16 | f

17 | f

18 | f

19 | f

20 | f

21 | f

22 | g

23 | g

24 | g

25 | g

26 | g

27 | g

28 | g

29 | a

(29 rows)

更新统计信息

analyze tbl_test1;

analyze tbl_test2;

例一：

bj_dsm_0=> explain select * from tbl_test1;

QUERY PLAN

—————————————————————-

Seq Scan on tbl_test1  (cost=0.00..221.36 rows=14336 width=10)

(1 row)

PostgreSQL要得出rows=14336，首先得到真实的pages,然后去比较 pg_class中的relpages，如果一致的话直接取出pg_class中的reltuples.如果不一致,PostgreSQL会根据真实的 pages值自动计算出reltuples。得到的rows=reltuples。

我们来通过查看pg_class验证一下:

bj_dsm_0=> select relname,relpages,reltuples from pg_class where relname=’tbl_test1′;

relname  | relpages | reltuples

———–+———-+———–

tbl_test1 |    78 |    14336

完全正确。

例二:

bj_dsm_0=> explain select * from tbl_test1 where id < 1000;

QUERY PLAN

————————————————————————————-

Index Scan using tbl_test1_pkey on tbl_test1  (cost=0.00..43.52 rows=1044 width=10)

Index Cond: (id < 1000)

(2 rows)

我们查看一下pg_stats的统计信息

bj_dsm_0=> select attname,histogram_bounds from pg_stats where tablename = ‘tbl_test1′ and attname=’id’;

attname |                   histogram_bounds

———+————————————————————

id    | {17,1367,2838,4328,5704,7162,8621,10028,11467,12903,14332}

(1 row)

ID总共被划分为10个区段，id<1000落在第一个区段.

rows=

round(14336.0*(((1000.0-17.0)/(1367.0-17.0))/10.0))

=1044

验证正确。

例三：

bj_dsm_0=> explain select * from tbl_test1 where name=’a';

QUERY PLAN

————————————————————–

Seq Scan on tbl_test1  (cost=0.00..257.20 rows=487 width=10)

Filter: ((name)::text = ‘a’::text)

(2 rows)

查看一下统计信息：

bj_dsm_0=> select attname,most_common_vals,most_common_freqs from pg_stats where tablename = ‘tbl_test1′ and attname=’name’;

attname | most_common_vals |                   most_common_freqs

———+——————+———————————————————-

name | {g,f,e,d,c,b,a}  | {0.239667,0.229,0.174333,0.145333,0.103,0.0746667,0.034}

(1 row)

刚好‘a’落在了MCV里面,通过统计信息可以看出频率为0.034

rows=487

验证正确。

bj_dsm_0=> select count(*) from tbl_test1 where name=’a';

count

——-

512

(1 row)

与真实记录差别不大

例四：

bj_dsm_0=> explain select * from tbl_test1 where name=’digoal’;

QUERY PLAN

————————————————————

Seq Scan on tbl_test1  (cost=0.00..257.20 rows=1 width=10)

Filter: ((name)::text = ‘digoal’::text)

(2 rows)

查看统计信息

bj_dsm_0=> select n_distinct,most_common_vals,most_common_freqs,histogram_bounds from pg_stats where tablename=’tbl_test1′ and attname=’name’;

n_distinct | most_common_vals |                   most_common_freqs                      | histogram_bounds

————+——————+————————————————————–+——————

7 | {g,f,e,d,c,b,a}  | {0.243,0.225667,0.184333,0.147,0.103333,0.0633333,0.0333333} |

(1 row)

bj_dsm_0=> select 0.243+0.225667+0.184333+0.147+0.103333+0.0633333+0.0333333;

?column?

———–

0.9999996

(1 row)

bj_dsm_0=> select (1-0.9999996)*14336;

?column?

———–

0.0057344

(1 row)

显然round(0.0057344)=0,但是PostgreSQL取了rows=1。

所以估计rows返回应该都是>=1。

例五:

bj_dsm_0=> explain select * from tbl_test1 where name=’a’ and id<1000;

QUERY PLAN

———————————————————————————–

Index Scan using tbl_test1_pkey on tbl_test1  (cost=0.00..45.85 rows=34 width=10)

Index Cond: (id < 1000)

Filter: ((name)::text = ‘a’::text)

(3 rows)

查看统计信息

bj_dsm_0=> select n_distinct,most_common_vals,most_common_freqs,histogram_bounds from pg_stats where tablename=’tbl_test1′;

n_distinct | most_common_vals |                   most_common_freqs                      |                   histogram_bounds

————+——————+————————————————————–+———————————————————-

-1 |                |                                                             | {2,1391,2882,4271,5690,7170,8664,9989,11426,12920,14335}

7 | {g,f,e,d,c,b,a}  | {0.243,0.225667,0.184333,0.147,0.103333,0.0633333,0.0333333} |

(2 rows)

概率应该是两个相乘的结果

bj_dsm_0=> select 14336*(((1000.0-2.0)/(1391.0-2.0))/10.0)*0.0333333;

?column?

——————————–

34.334806077926565874258692096

(1 row)

验证正确。

例六：

bj_dsm_0=> explain select * from tbl_test1 t1,tbl_test2 t2 where t1.id<1000 and t1.id=t2.id;

QUERY PLAN

————————————————————————————————

Merge Join  (cost=0.00..545.78 rows=1052 width=20)

Merge Cond: (t1.id = t2.id)

->  Index Scan using tbl_test1_pkey on tbl_test1 t1  (cost=0.00..43.66 rows=1052 width=10)

Index Cond: (id < 1000)

->  Index Scan using tbl_test2_pkey on tbl_test2 t2  (cost=0.00..912.36 rows=28674 width=10)

(5 rows)

查看统计信息

bj_dsm_0=> select tablename,n_distinct,most_common_vals,most_common_freqs,histogram_bounds from pg_stats where tablename ~ ‘tbl_test’;

tablename | n_distinct | most_common_vals |                   most_common_freqs                   |                      histogram_bounds

———–+————+——————+———————————————————+————————————————————–

tbl_test1 |       -1 |                |                                                       | {2,1362,2855,4237,5700,7116,8494,9989,11405,12815,14336}

tbl_test1 |       7 | {g,f,e,d,c,b,a}  | {0.244,0.215,0.183,0.144,0.109667,0.0626667,0.0416667}  |

tbl_test2 |       -1 |                |                                                       | {5,2831,5541,8355,11671,14421,17172,20137,22954,25755,28668}

tbl_test2 |       7 | {g,f,e,d,c,b,a}  | {0.243667,0.217333,0.18,0.142333,0.105,0.0736667,0.038} |

(4 rows)

Merge两个PK取min(rows)

验证正确。

先说到这里，后面还有例子继续补充。

作者: digoal 发布时间: 2011-02-22

学习学习。先收藏起来。

作者: renxiao2003 发布时间: 2011-02-28

PostgreSQL 行评估算法

热门频道