表膨胀

表膨胀的原因

udw的存储实现(MVCC-多版本并发控制)来自于Postgres。根据MVCC的原理,没有办法直接更新数据(更新操作(update)是通过先删除(delete)再插入(insert)实现的),被更新之前的行数据仍然在数据文件中。

如何避免表膨胀

方法一: vacuum full table 注意,vacuum full不能回收索引的膨胀空间。vacuum full 加载的锁与DDL锁类似,是排它锁。建议在没有业务的时候执行,不要堵塞业务。 使用vacuum full回收垃圾的建议操作流程: 1 记录下表的索引 2 删除索引 3 vacuum full 表 4 重建索引 示例:

dev=# create table test(id int , name text); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE dev=# insert into test select generate_series(1,100000000), 'test'; INSERT 0 100000000 dev=# create index idx_test on test(id); CREATE INDEX dev=# update test set name='nice'; UPDATE 100000000 查看表格数据: dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 8401 MB (1 row) 查看索引数据: dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 6377 MB (1 row) 先回收表数据(此方法不能回收索引数据): dev=# vacuum full test ; VACUUM dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) Time: 4.278 ms dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 6377 MB (1 row) 回收索引和表的数据: dev=# drop index idx_test ; DROP INDEX dev=# vacuum full test ; VACUUM dev=# create index idx_test on test(id); CREATE INDEX dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 2126 MB (1 row)

方法二:通过修改分布键释放空间

修改分布键可以回收索引的膨胀空间。修改分布键加载的锁与DDL锁类似,是排它锁。建议在没有业务的时候执行,不要影响业务。

alter table test set with (reorganize=true) distributed randomly; alter table test set with (reorganize=true) distributed by (id); 实例: dev=# update test set name='back'; UPDATE 100000000 查看数据 dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 8401 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 4251 MB (1 row) 查看表格的分布键,如下所示Distributed by: (id),分布键为id dev=# \d+ test Table "public.test" Column | Type | Modifiers | Storage | Description --------+---------+-----------+----------+------------- id | integer | | plain | name | text | | extended | Indexes: "idx_test" btree (id) Has OIDs: no Distributed by: (id) 按照原有的分布键重新分布 dev=# alter table test set with (reorganize=true) distributed by (id); ALTER TABLE 查看数据 dev=# select pg_size_pretty(pg_relation_size('test')); pg_size_pretty ---------------- 4200 MB (1 row) dev=# select pg_size_pretty(pg_relation_size('idx_test')); pg_size_pretty ---------------- 2126 MB

方法三:创建新表,导入数据

CREATE TABLE…AS SELECT命令把该表拷贝为一个新表,新建的表将不会出现膨胀现象。然后删除原始表并且重命名拷贝的表。

参考: https://gp-docs-cn.github.io/docs/best_practices/bloat.html https://docs.ucloud.cn/analysis/udw/developer(选择数据分布策略