www.cs.cmu.edu
Yes, PostgreSQL Has Problems. But We're Sticking With It!
Our previous blog article, “The Part of PostgreSQL We Hate the Most,” discussed the problems caused by everyone’s favorite street-strength DBMS multi-version concurrency control (MVCC) implementation. These include version copying, table bloat, index maintenance, and vacuum management. This article will explore ways to optimize PostgreSQL for each problem. Although PostgreSQL’s MVCC implementation is the __worst__ among other widely used databases like Oracle and MySQL, it remains our favorite DBMS, and we still love it! By sharing our insights, we hope to help users unlock the full potential of this powerful database system. ... ## Problem #1: Version Copying When a query modifies a tuple, regardless of whether it updates one or all of its columns, PostgreSQL creates a new version by copying all of its columns. This copying can result in significant data duplication and increased storage demands, particularly for tables with many columns and large row sizes. Optimization: Unfortunately, there are no workarounds to address this issue without a significant rewrite of PostgreSQL’s internals that would be disruptive. It’s not like replacing a character on a sitcom that nobody notices. ... ## Problem #2: Table Bloat PostgreSQL stores expired versions (dead tuples) and live tuples on the same pages. Although PostgreSQL’s autovacuum worker eventually removes these dead tuples, write-heavy workloads can cause them to accumulate faster than the vacuum can keep up. Additionally, the autovacuum only removes dead tuples for reuse (e.g., to store new versions) and does not reclaim unused storage space. During query execution, PostgreSQL loads dead tuples into memory (since the DBMS intermixes them on pages with live tuples), increasing disk IO and hurting performance because the DBMS retrieves useless data. If you are running Amazon’s PostgreSQL Aurora, this will increase the DBMS’s IOPS and cause you to give more money to Jeff Bezos! Optimization: We recommend monitoring PostgreSQL’s table bloat and then periodically reclaiming unused space. The pgstattuple built-in module accurately calculates the free space in a database but it requires full table scans, which is not practical for large tables in production environments. ``` $ psql -c "CREATE EXTENSION pgstattuple" -d $DB_NAME $ psql -c "SELECT * FROM pgstattuple('$TABLE_NAME')" -d $DB_NAME ``` … ## Problem #3: Secondary Index Maintenance When an application executes an `UPDATE` query on a table, PostgreSQL must also update all the indexes for that table to add entries to the new version. These index updates increase the DBMS’s memory pressure and disk I/O, especially for tables with numerous indexes (one OtterTune customer has **90** indexes on a single table!). As the number of indexes in a table increases, the overhead incurred when updating a tuple increases. PostgreSQL avoids updating indexes for Heap-Only Tuples (HOT) updates, where the DBMS stores the new version on the same page as the previous version. But as we mentioned in our last article, OtterTune customers’ PostgreSQL databases only use the HOT optimization for 46% of update operations. … `DROP INDEX` command. ## Problem #4: Vacuum Management PostgreSQL’s performance heavily depends on the effectiveness of its autovacuum to clean up obsolete data and prune version chains in its MVCC scheme. However, configuring the autovacuum to operate correctly and remove this data in a timely manner is challenging due to its complexity. The default global autovacuum settings are inappropriate for large tables (millions to billions of tuples), as it may take too long before triggering vacuums. Additionally, if each autovacuum invocation takes too long to complete or gets blocked by long-running transactions, the DBMS will accumulate dead tuples and suffer from stale statistics. Delaying the autovacuum for too long results in queries getting gradually slower over time, requiring manual intervention to address the problem. Optimization: Although having to vacuum tables in PostgreSQL is a pain, the good news is that it is manageable. But as we now discuss, there are a lot of steps to this and a lot of information you need to track.
Related Pain Points4件
Excessive or duplicate indexes degrade write performance and storage
6Unused or duplicate indexes cause every database modification to unnecessarily update those indexes, resulting in high storage utilization and IOPs consumption. This creates a silent performance drag that isn't immediately obvious.
MVCC version copying creates excessive data duplication
6PostgreSQL's MVCC implementation copies all columns of a tuple when any column is modified, regardless of which columns change. This causes significant data duplication and increased storage demands, especially for large tables with many columns. No practical workaround exists without major PostgreSQL rewrites.
Complex and error-prone autovacuum configuration
6Configuring autovacuum correctly is challenging due to its complexity. Default global settings are inappropriate for large tables with millions/billions of tuples. If autovacuum invocations take too long or are blocked, dead tuples accumulate and statistics become stale, causing gradual query slowdown. Manual intervention is often required.
Table Bloat from Lack of Regular VACUUM
5Failing to run VACUUM regularly can lead to bloated tables and degraded performance. Without VACUUM, PostgreSQL cannot reclaim space from deleted rows, accumulating dead tuples that consume disk space and slow down query performance over time.