Troubleshooting PostgreSQL Performance: Slow Queries, Locks, Index Scans, VACUUM Issues, and Dead Tuples

Diagnose and resolve common PostgreSQL performance bottlenecks including transaction locks, inefficient index usage, VACUUM issues, and dead tuple accumulation. Optimize your database for speed and stability.

As an experienced systems administrator, you’ve likely encountered the elusive and frustrating scenario where your PostgreSQL database, once a paragon of speed, begins to buckle under pressure. Applications become unresponsive, API requests timeout, and users report excruciatingly slow page loads. This guide delves into a common, multifaceted performance issue encompassing slow queries, transaction locks, inefficient index usage (especially index scans), autovacuum struggles, and the accumulation of “dead tuples” leading to table bloat. Understanding and systematically addressing these interconnected problems is crucial for restoring database health and application responsiveness.

Symptom & Error Signature

The symptoms are often observed at the application level before database logs reveal the underlying issues. Users might experience:

Application Timeouts: Web applications or services fail with HTTP 504 Gateway Timeout (Nginx), database connection timeouts, or query execution timeouts.
Slow Application Responses: Pages load slowly, data retrieval is sluggish.
High Server Load: Elevated CPU utilization, I/O wait, and memory consumption on the database server.
Increased Database Connection Pool Usage: More connections stay open longer, potentially exhausting the pool.

Typical observations from monitoring and logs:

Nginx Error Log (Example Timeout):

2023/10/26 14:35:01 [error] 12345#12345: *67890 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.1.100, server: example.com, request: "GET /api/v1/data HTTP/1.1", upstream: "http://unix:/var/run/php/php8.2-fpm.sock:/api/v1/data", host: "example.com"

PostgreSQL Log Entries (Possible clues):

LOG:  duration: 15432.123 ms  statement: SELECT id, data, status FROM large_table WHERE created_at < '2023-01-01' ORDER BY id DESC LIMIT 100 OFFSET 10000;
LOG:  statement: SELECT pg_backend_pid();
WARNING:  autovacuum: VACUUM of table "mydb.public.huge_table": index "huge_table_pkey" would wrap around

pg_stat_activity showing blocked queries:

SELECT pid, state, backend_type, query_start, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'waiting' AND wait_event_type = 'Lock';

Output might show:

  pid  |  state  |  backend_type  |         query_start         |             query             | wait_event_type | wait_event
-------+---------+----------------+-----------------------------+-------------------------------+-----------------+------------
 12345 | waiting | client backend | 2023-10-26 14:35:05.123456+00 | UPDATE my_table SET status=1... | Lock            | tuple
 12346 | active  | client backend | 2023-10-26 14:35:00.000000+00 | UPDATE my_table SET value=... |                 |

pg_stat_user_tables indicating high dead tuples:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 5;

Output might show:

    relname    | n_live_tup | n_dead_tup |       last_autovacuum       |      last_autoanalyze
---------------+------------+------------+-----------------------------+-----------------------------
 large_data    | 15000000   | 5000000    | 2023-10-26 08:00:00.000000+00 | 2023-10-26 09:00:00.000000+00
 another_table | 200000     | 100000     | 2023-10-26 10:30:00.000000+00 | 2023-10-26 10:45:00.000000+00

Root Cause Analysis

The issue “PostgreSQL slow queries locks index scan vacuum dead tuples table” points to a complex interplay of factors:

Slow Queries:
- Inefficient Query Plans: Lack of suitable indexes, outdated statistics (ANALYZE not run), or poorly written SQL (e.g., SELECT *, excessive JOINs, subqueries that could be optimized).
- Resource Contention: Queries waiting for locks or I/O.
- Bloat: Queries need to scan more data blocks due to dead tuples.
Locks:
- Long-Running Transactions: Transactions that stay open for extended periods, especially UPDATE or DELETE operations on many rows, can hold exclusive or row-level locks, blocking other queries.
- Application Design Flaws: Holding transactions open while performing external API calls or user interactions.
- Deadlocks: Two or more transactions mutually waiting for locks held by each other. PostgreSQL typically detects and resolves these by aborting one transaction, but it’s a symptom of contention.
Index Scan vs. Index-Only Scan:
- An Index Scan means PostgreSQL uses an index to find the row (tuple ID/TID), then has to visit the actual table page to retrieve the full row data and check its visibility (MVCC rules). This involves more disk I/O and CPU than an Index-Only Scan.
- An Index-Only Scan retrieves all required data directly from the index, avoiding a trip to the table. This is much faster.
- Dead Tuples Impact: When a table (or its corresponding index pages) has many dead tuples, PostgreSQL’s MVCC (Multi-Version Concurrency Control) mechanism has to do more work. Even if an index could provide all the necessary columns for an Index-Only Scan, if the heap (table) pages are heavily bloated with dead tuples, PostgreSQL might choose an Index Scan instead, or the Index-Only Scan might fail to find a visible tuple version in the index without a heap fetch, falling back to a regular Index Scan or performing additional heap lookups. This is often reflected as high “rows removed by MVCC” in EXPLAIN ANALYZE output.
- Missing/Inefficient Indexes: Incorrectly chosen or absent indexes force sequential scans or less optimal index scans.
VACUUM Issues & Dead Tuples:
- Dead Tuples: PostgreSQL’s MVCC design means UPDATE and DELETE operations don’t immediately remove old row versions (dead tuples). Instead, they are marked for deletion. These dead tuples consume disk space and can bloat tables and indexes.
- Autovacuum Lag: The autovacuum daemon is responsible for cleaning up dead tuples and updating table statistics. If autovacuum cannot keep up with the rate of UPDATE/DELETE operations, dead tuples accumulate.
- Autovacuum Configuration: Default autovacuum settings might be too conservative for high-transaction workloads. Factors like autovacuum_vacuum_scale_factor, autovacuum_vacuum_threshold, autovacuum_vacuum_cost_delay, and autovacuum_vacuum_cost_limit can hinder autovacuum’s efficiency.
- Table Bloat: The physical size of tables and indexes grows beyond what is strictly necessary due to accumulated dead tuples. This leads to:
  - More disk I/O to read relevant data.
  - Reduced cache hit rates.
  - Slower sequential and index scans.
  - Increased backup sizes.
  - Potential for transaction ID wraparound if not addressed, leading to database shutdown.

These issues create a vicious cycle: slow queries lead to longer-running transactions, increasing lock contention, which then exacerbates autovacuum’s ability to clean up dead tuples, leading to more bloat, which in turn makes queries even slower.

Step-by-Step Resolution

Solving these issues requires a systematic approach, combining monitoring, configuration tuning, and query optimization.

1. Monitor and Diagnose the Current State

Before making any changes, accurately identify the primary bottlenecks.

Active and Waiting Queries:

-- Top 20 active queries by duration
SELECT pid, datname, usename, client_addr, application_name, backend_start,
       state, wait_event_type, wait_event, query_start,
       (now() - query_start) AS query_duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_duration DESC
LIMIT 20;

-- Queries currently waiting for locks
SELECT
    pa.pid AS blocked_pid,
    pa.query AS blocked_query,
    pa.state AS blocked_state,
    pa.query_start AS blocked_query_start,
    pl.locktype, pl.mode,
    pb.pid AS blocking_pid,
    pb.query AS blocking_query,
    pb.state AS blocking_state,
    pb.query_start AS blocking_query_start
FROM pg_catalog.pg_locks pl
JOIN pg_catalog.pg_stat_activity pa ON pl.pid = pa.pid
LEFT JOIN pg_catalog.pg_stat_activity pb ON pl.granted = false AND pl.pid = pb.pid
WHERE pa.wait_event_type = 'Lock' AND pa.state = 'waiting';

Identify Bloat and Autovacuum Status:

-- Top 10 tables by dead tuples
SELECT
    relname,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    n_live_tup,
    n_dead_tup,
    CASE
        WHEN n_live_tup > 0 THEN round(n_dead_tup::numeric / n_live_tup * 100)
        ELSE 0
    END AS dead_tup_ratio_pct,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Check autovacuum activity
SELECT
    pid,
    age(backend_xid) AS xid_age,
    current_setting('autovacuum_freeze_max_age') AS freeze_max_age,
    query,
    state,
    query_start
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';

Enable pg_stat_statements: This extension tracks execution statistics of all queries executed by a server.

-- On Ubuntu/Debian, edit postgresql.conf:
sudo vim /etc/postgresql/$(pg_version_num)/main/postgresql.conf
# Add/uncomment:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000 # Increase if you have many unique queries
pg_stat_statements.track = all # track all statements

[!IMPORTANT] Changes to shared_preload_libraries require a PostgreSQL service restart.

sudo systemctl restart postgresql.service

Then, connect to your database and enable the extension:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries by total time
SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    (total_exec_time / calls) AS avg_exec_time,
    rows,
    blocks_hit,
    blocks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

2. Optimize Slow Queries and Indexing

Focus on the queries identified in step 1.

Analyze Query Plans: Use EXPLAIN ANALYZE to understand why a query is slow. Pay attention to:
- rows removed by MVCC: Indicates dead tuples impacting visibility checks, hinting at VACUUM issues.
- Index Scan vs Index Only Scan: If Index Scan is used and an Index Only Scan is possible (all columns are in the index), it suggests bloat or visibility map issues.
- High cost values for certain operations (e.g., sequential scan on a large table).
```
EXPLAIN (ANALYZE, BUFFERS, FORMAT YAML)
SELECT id, data FROM large_table WHERE status = 'active' AND created_at > '2023-10-01' ORDER BY id LIMIT 100;
```
Look for Seq Scan on large tables where an index could be used, or Index Scan performing many heap fetches.
Create or Refine Indexes: Based on EXPLAIN ANALYZE, add indexes that cover the WHERE, ORDER BY, and JOIN clauses. For Index Only Scan potential, consider covering indexes (PostgreSQL 11+).
```
-- Example for a query using status and created_at
CREATE INDEX CONCURRENTLY idx_large_table_status_created_at
ON large_table (status, created_at);

-- Example for a covering index (if 'data' is also needed frequently)
CREATE INDEX CONCURRENTLY idx_large_table_status_created_at_data
ON large_table (status, created_at) INCLUDE (data);
```
[!IMPORTANT] Always use CREATE INDEX CONCURRENTLY for production environments to avoid exclusive locks that block DML operations. It takes longer but allows concurrent work.
Rewrite Inefficient Queries:
- Avoid SELECT * where only a few columns are needed.
- Break down complex queries into simpler ones if possible.
- Review JOIN conditions.
- Ensure appropriate use of LIMIT/OFFSET (large offsets can be slow). Consider cursor-based pagination for very large datasets.

3. Address Lock Contention

Identify Blocking Sessions:

SELECT
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocking_activity.application_name AS blocking_app,
    blocking_activity.query AS blocking_query,
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocked_activity.application_name AS blocked_app,
    blocked_activity.query AS blocked_query,
    blocked_activity.wait_event_type,
    blocked_activity.wait_event
FROM pg_catalog.pg_locks AS blocked_locks
JOIN pg_catalog.pg_stat_activity AS blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks AS blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity AS blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Review Application Transaction Logic:
- Ensure transactions are as short-lived as possible. Commit frequently.
- Avoid performing external I/O (API calls, file system operations) within an open transaction.
- Use explicit BEGIN; ... COMMIT; or ROLLBACK; rather than relying solely on auto-commit for critical operations.

Set Statement and Lock Timeouts: Prevent indefinitely running queries or locks.

-- In postgresql.conf
statement_timeout = '60s'    # Terminate any statement that takes longer than 60s
lock_timeout = '5s'          # Abort if a lock cannot be acquired within 5s

These can also be set per session or per transaction.

Consider Connection Pooling: Tools like PgBouncer can manage connections efficiently, reducing overhead and improving contention handling by allowing applications to quickly get a fresh connection for each transaction.
Kill Blocking Sessions (Last Resort): If a session is irreversibly stuck and causing severe blocking, you may need to terminate it.
```
SELECT pg_terminate_backend(pid);
```
[!WARNING] Killing backend processes can lead to incomplete transactions and data inconsistencies if not handled carefully by the application. Use with extreme caution and understanding of the implications.

4. Tune Autovacuum and Manage Dead Tuples

Effective autovacuum configuration is paramount for preventing bloat and maintaining performance.

PostgreSQL Configuration (postgresql.conf):

# Ensure autovacuum is enabled
autovacuum = on
log_autovacuum_min_duration = 0 # Log all autovacuum actions for debugging (set to -1 for production once tuned)

# Number of autovacuum worker processes
autovacuum_max_workers = 3     # Default is 3, increase for busy systems (e.g., 5-8)

# Delay between autovacuum runs (lower for faster cleanup)
autovacuum_nap_time = 1min     # How often to check for work

# Autovacuum cost-based delay (lower for faster, more aggressive vacuum)
autovacuum_vacuum_cost_delay = 10ms # Default 2ms (PostgreSQL 16), 10ms (older). Lower means more aggressive I/O.
autovacuum_analyze_cost_delay = 10ms

# Cost limit (higher for more work per vacuum, but more I/O)
autovacuum_vacuum_cost_limit = 2000 # Default 200. Increase for more powerful I/O systems (e.g., 500-1000 or higher)

# Autovacuum thresholds (adjust these carefully)
# autovacuum_vacuum_scale_factor = 0.2  # Default. Percentage of table size for VACUUM trigger.
# autovacuum_analyze_scale_factor = 0.1 # Default. Percentage of table size for ANALYZE trigger.
# autovacuum_vacuum_threshold = 50      # Default. Min tuples for VACUUM trigger.
# autovacuum_analyze_threshold = 50     # Default. Min tuples for ANALYZE trigger.

[!IMPORTANT] Restart PostgreSQL service after changing postgresql.conf.

sudo systemctl restart postgresql.service

Per-Table Autovacuum Settings: For highly volatile tables, you can override global settings.

ALTER TABLE large_data SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- Trigger vacuum sooner (5% dead tuples)
    autovacuum_vacuum_threshold = 1000,     -- Minimum 1000 dead tuples
    autovacuum_analyze_scale_factor = 0.02, -- Analyze sooner (2% changes)
    autovacuum_analyze_threshold = 500
);

Manual VACUUM/ANALYZE: If autovacuum lags significantly, a manual run might be necessary.

-- To clean up dead tuples and update statistics (non-blocking)
VACUUM (ANALYZE, VERBOSE) large_data;
-- To re-scan all indexes and rebuild visibility map (can be slow but non-blocking)
VACUUM (FULL, ANALYZE, VERBOSE) large_data; -- Careful: VACUUM FULL *IS* blocking. Use pg_repack instead for large tables.

REINDEX Bloated Indexes: Indexes can also get bloated.

-- Rebuild a specific index concurrently (non-blocking)
REINDEX INDEX CONCURRENTLY idx_large_table_status_created_at;

-- Rebuild all indexes for a table concurrently (non-blocking)
REINDEX TABLE CONCURRENTLY large_data;

[!IMPORTANT] REINDEX CONCURRENTLY is the preferred method as it avoids exclusive locks.

Address Table Bloat with pg_repack: For severe table bloat, VACUUM FULL requires an exclusive lock and downtime. pg_repack is a powerful tool that can perform online defragmentation without exclusive locks.

Install pg_repack:

# First, find your PostgreSQL major version (e.g., 16)
pg_version_num=$(psql -V | grep -oE '[0-9]+\.[0-9]+' | head -1 | cut -d'.' -f1)
sudo apt update
sudo apt install postgresql-$pg_version_num-repack

Enable extension in your database:

CREATE EXTENSION IF NOT EXISTS pg_repack;

Run pg_repack:

# To repack a specific table
pg_repack -d your_database_name -t your_table_name
# To repack all tables in a database
pg_repack -d your_database_name
# To repack specific index
pg_repack -d your_database_name -i your_index_name

[!WARNING] pg_repack creates a copy of the table, requires sufficient disk space (typically 2x the table size temporarily), and can add I/O load during the operation. Monitor carefully.

5. Database & System-Level Configuration

Review overall PostgreSQL and OS settings.

shared_buffers: Controls how much memory PostgreSQL uses for caching data pages. A common recommendation is 25% of total system RAM, but can go up to 40% on dedicated DB servers.
```
# In postgresql.conf
shared_buffers = 4GB # Example for a 16GB RAM server
```
work_mem: Amount of memory used by internal sort operations and hash tables before writing to temporary disk files. Set this carefully, as it’s per operation per session.
```
# In postgresql.conf
work_mem = 64MB # Adjust based on your workload, common values 16MB-256MB
```
maintenance_work_mem: Memory dedicated to maintenance operations like VACUUM, REINDEX, CREATE INDEX. Set this higher than work_mem for faster maintenance.
```
# In postgresql.conf
maintenance_work_mem = 1GB # Example for larger databases
```
effective_cache_size: The planner’s estimate of the total amount of memory available for disk caching by the operating system and within PostgreSQL itself. Helps in making good planning decisions.
```
# In postgresql.conf
effective_cache_size = 12GB # Example for a 16GB RAM server with 4GB shared_buffers
```

WAL Settings:

# In postgresql.conf
wal_buffers = 16MB       # Default 16MB, increase if needed for heavy writes
max_wal_size = 4GB       # Controls how often checkpoints occur
checkpoint_timeout = 30min # Controls how often checkpoints occur

Operating System Tuning (Ubuntu/Debian):
- Swappiness: Set vm.swappiness to a low value (e.g., 1-10) to minimize swapping database pages to disk.
```
sudo sysctl vm.swappiness=10
echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.conf
```
- Huge Pages: Can improve performance by reducing TLB misses for large memory allocations. Configuration is more complex and depends on your kernel and PostgreSQL version.
- Disk I/O: Ensure your storage system (SSD, appropriate RAID levels) provides adequate read/write performance for your workload.

6. Application-Level Optimizations

Beyond database configurations, application code plays a vital role.

Batch Operations: Instead of single-row INSERTs/UPDATEs in a loop, use INSERT ... VALUES (...), (...), ...; or UPDATE ... WHERE id IN (...); for multiple rows.
Caching: Implement application-level caching (e.g., Redis, Memcached) for frequently accessed, slow-changing data.
Read Replicas: For read-heavy workloads, consider offloading reads to a PostgreSQL read replica.
Asynchronous Processing: Use message queues or background jobs for non-critical, long-running operations that don’t need immediate results.