表膨胀

表膨胀的原因

udw的存储实现(MVCC-多版本并发控制)来自于Postgres。根据MVCC的原理,没有办法直接更新数据(更新操作(update)是通过先删除(delete)再插入(insert)实现的),被更新之前的行数据仍然在数据文件中。

如何避免表膨胀

方法一: vacuum full table 注意,vacuum full不能回收索引的膨胀空间。vacuum full 加载的锁与DDL锁类似,是排它锁。建议在没有业务的时候执行,不要堵塞业务。 使用vacuum full回收垃圾的建议操作流程: 1 记录下表的索引
2 删除索引
3 vacuum full 表
4 重建索引
示例:

dev=# create table test(id int , name text); 
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE

dev=# insert into test select generate_series(1,100000000), 'test';
INSERT 0 100000000

dev=# create index idx_test on test(id);
CREATE INDEX

dev=# update test set name='nice';
UPDATE 100000000  

查看表格数据:
dev=# select pg_size_pretty(pg_relation_size('test'));  
 pg_size_pretty 
----------------
 8401 MB
(1 row)

查看索引数据:
dev=# select pg_size_pretty(pg_relation_size('idx_test'));  
 pg_size_pretty 
----------------
 6377 MB
(1 row)

先回收表数据(此方法不能回收索引数据):
dev=# vacuum full test ;
VACUUM

dev=# select pg_size_pretty(pg_relation_size('test'));
 pg_size_pretty 
----------------
 4200 MB
(1 row)

Time: 4.278 ms
dev=# select pg_size_pretty(pg_relation_size('idx_test'));
 pg_size_pretty 
----------------
 6377 MB
(1 row)


回收索引和表的数据:

dev=# drop index idx_test ;  
DROP INDEX

dev=# vacuum full test ;  
VACUUM

dev=# create index idx_test on test(id);  
CREATE INDEX

dev=# select pg_size_pretty(pg_relation_size('test'));
 pg_size_pretty 
----------------
 4200 MB
(1 row)

dev=# select pg_size_pretty(pg_relation_size('idx_test'));
 pg_size_pretty 
----------------
 2126 MB
(1 row)

方法二:通过修改分布键释放空间

修改分布键可以回收索引的膨胀空间。修改分布键加载的锁与DDL锁类似,是排它锁。建议在没有业务的时候执行,不要影响业务。

alter table test set with (reorganize=true) distributed randomly;
alter table test set with (reorganize=true) distributed by (id);

实例:
dev=# update test set name='back';
UPDATE 100000000

查看数据
dev=# select pg_size_pretty(pg_relation_size('test'));
 pg_size_pretty 
----------------
 8401 MB
(1 row)

dev=# select pg_size_pretty(pg_relation_size('idx_test'));
 pg_size_pretty 
----------------
 4251 MB
(1 row)

查看表格的分布键,如下所示Distributed by: (id),分布键为id
dev=# \d+ test 
                  Table "public.test"
 Column |  Type   | Modifiers | Storage  | Description 
--------+---------+-----------+----------+-------------
 id     | integer |           | plain    | 
 name   | text    |           | extended | 
Indexes:
    "idx_test" btree (id)
Has OIDs: no
Distributed by: (id)

按照原有的分布键重新分布
dev=# alter table test set with (reorganize=true) distributed by (id);
ALTER TABLE
查看数据
dev=# select pg_size_pretty(pg_relation_size('test'));
 pg_size_pretty 
----------------
 4200 MB
(1 row)

dev=# select pg_size_pretty(pg_relation_size('idx_test'));
 pg_size_pretty 
----------------
 2126 MB

方法三:创建新表,导入数据

CREATE TABLE…AS SELECT命令把该表拷贝为一个新表,新建的表将不会出现膨胀现象。然后删除原始表并且重命名拷贝的表。

参考: https://gp-docs-cn.github.io/docs/best_practices/bloat.html https://docs.ucloud.cn/analysis/udw/developer(选择数据分布策略)

如有疑问请联系 技术支持