Resolving High CPU & Disk I/O: systemd-journald and Log Rotation Issues on Linux

Address systemd-journald consuming excessive CPU and disk I/O due to aggressive logging or misconfigured log rotation, improving system performance and stability.


When managing production Linux servers, a common performance bottleneck can arise from excessive logging, particularly when systemd-journald begins consuming an inordinate amount of CPU cycles or disk I/O. This situation often manifests as system slowdowns, high load averages, and unresponsive applications, directly impacting user experience and service availability. This guide provides a comprehensive, step-by-step approach to diagnose and resolve such issues.

Symptom & Error Signature

The primary symptom is a noticeable degradation in system performance, often accompanied by alerts from monitoring systems about high CPU utilization or disk I/O wait times. When inspecting the system, you’ll typically observe the systemd-journald process consuming significant resources.

Typical top/htop output showing high CPU:

Tasks: 201 total,   1 running, 200 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  5.7 sy,  0.0 ni, 78.4 id,  0.6 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7969.5 total,   3489.6 free,   1234.4 used,   3245.5 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.   6091.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    354 systemd-+  20   0 1373516 116840  24760 S 100.0   1.4  13:45.12 systemd-journal
   1123 root      20   0  268388  18376  13304 S   0.7   0.2   0:15.21 sshd
   1456 www-data  20   0  468968  24508   9408 S   0.3   0.3   0:02.11 nginx

Typical iotop output showing high disk write activity:

Total DISK READ:       0.00 B/s | Total DISK WRITE:   203.49 M/s
Current DISK READ:     0.00 B/s | Current DISK WRITE: 203.49 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
    354 be/4 systemd-+    0.00 B/s   203.49 M/s  0.00 % 99.99 % systemd-journald
    987 be/4 root         0.00 B/s     0.00 B/s  0.00 %  0.00 % [kworker/u16:0-events]
   1001 be/4 www-data     0.00 B/s     0.00 B/s  0.00 %  0.00 % nginx: worker process

You might also observe journalctl --disk-usage reporting an unexpectedly large journal size:

journalctl --disk-usage
Journals take up 3.7G on disk.

Root Cause Analysis

The root causes for systemd-journald consuming excessive resources are typically multifaceted, often stemming from a combination of aggressive logging, misconfiguration, or underlying system issues.

  1. Excessive Log Verbosity:
    • Application Misconfiguration: Debug-level logging enabled in production for applications (e.g., Nginx, Apache, PHP-FPM, Docker containers, custom applications), resulting in thousands of log entries per second.
    • Rapid Error Loops: An application or service crashing and restarting repeatedly, flooding the journal with error messages and startup sequences.
    • Security Events: Brute-force attacks or misconfigured security tools generating a high volume of authentication failures or firewall rejections.
  2. Misconfigured systemd-journald Limits: By default journald caps the persistent journal at 10% of the file system (up to 4G). That default can still be too generous for servers with limited disk space, and limits that were raised or loosened without care can let journal files grow large and rotate inefficiently.
  3. Inefficient Log Rotation: While systemd-journald manages its own log rotation, if other services are writing directly to /var/log (outside of systemd’s direct control) and logrotate is misconfigured or failing, this can exacerbate disk I/O issues or disk full conditions, indirectly affecting journald’s ability to prune.
  4. Disk Full Conditions: If the /var/log partition (or root partition) is full, journald may struggle to write new entries or prune old ones efficiently, leading to resource contention.
  5. Kernel/Systemd Bugs: While less common, specific versions of systemd or the Linux kernel can occasionally exhibit performance regressions or bugs related to journald’s I/O handling, especially under heavy load.
  6. Underlying Disk I/O Performance: A slow or failing disk subsystem can make journald’s legitimate logging activity appear disproportionately resource-intensive.

Step-by-Step Resolution

Addressing this issue requires a systematic approach, starting with identifying the source of logging and then configuring journald and other applications appropriately.

1. Identify the Source of Excessive Logging

The first step is to pinpoint which application or service is generating the log spam.

Monitor live log output:

journalctl -f

This command streams new log entries. Look for recurring patterns, specific service names, or IP addresses that appear frequently.

Analyze recent errors/warnings:

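journalctl -p warning -b

The -p option takes a single priority or a range, so one level is enough: this shows messages of priority warning and more severe (err, crit, alert, emerg) from the current boot. A high count might indicate a looping failure.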
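To tell a short burst from a sustained flood, it can help to bucket entries by minute. The following is a minimal sketch, assuming the input comes from journalctl -o short-iso (one entry per line, starting with an ISO 8601 timestamp). rate_per_minute is an illustrative helper name, not a journalctl feature.

```shell
#!/bin/sh
# rate_per_minute: read `journalctl -o short-iso` lines on stdin and print
# "<YYYY-MM-DDTHH:MM> <entry count>" for every minute that has entries.
rate_per_minute() {
    awk '
        # Only count lines that start with a timestamp; this skips
        # separators such as "-- Boot ... --" and continuation lines.
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ {
            # The first 16 characters are the date plus hour and minute.
            count[substr($1, 1, 16)]++
        }
        END { for (m in count) print m, count[m] }
    ' | sort
}

# Example (hypothetical invocation):
#   journalctl -p warning -b -o short-iso | rate_per_minute | tail -n 20
```

Minutes with counts far above the rest point at the time window to inspect more closely.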

Check disk usage by journal files:

sudo du -sh /var/log/journal/

This confirms the current size of the journal files.

Identify top journal contributors (advanced): journalctl has no built-in per-unit summary, but you can approximate one with text processing. The default output format does not print field names such as _SYSTEMD_UNIT, so request the export format and count that field over a recent period:

journalctl --since "1 hour ago" -o export | grep -a '^_SYSTEMD_UNIT=' | cut -d'=' -f2- | sort | uniq -c | sort -nr | head -n 10

This command helps you identify which systemd units (services) are generating the most log entries in the last hour.
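Not every entry carries a unit name (kernel messages, for example), so it can be useful to count other journal fields the same way. The sketch below generalizes the idea to any field. It assumes export-format input on stdin, and count_field is a hypothetical helper name.

```shell
#!/bin/sh
# count_field FIELD [N]: read journal entries in export format on stdin and
# print the N most frequent values of FIELD (default N=10), most frequent first.
count_field() {
    field=$1
    limit=${2:-10}
    # -a makes grep treat the stream as text even if it contains binary fields.
    grep -a "^${field}=" \
        | cut -d= -f2- \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "$limit"
}

# Examples (hypothetical invocations):
#   journalctl --since "1 hour ago" -o export | count_field _SYSTEMD_UNIT
#   journalctl --since "1 hour ago" -o export | count_field SYSLOG_IDENTIFIER 5
```

Grouping by SYSLOG_IDENTIFIER or _COMM often identifies a noisy process even when it does not run as its own unit.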

2. Configure systemd-journald Log Retention Policies

The journald configuration controls how much disk space the journal uses and how long entries are kept.

Edit the journald configuration file:

sudo vim /etc/systemd/journald.conf

Uncomment and set appropriate values for the following parameters. Here are recommended settings for a typical production server:

[Journal]
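# Store the journal on disk in /var/log/journal. The default is "auto",
# which persists only if that directory already exists.
Storage=persistent
# Maximum size of all journal files on disk.
SystemMaxUse=1G
# Keep at least this much disk space free. Takes an absolute size
# (K, M, G suffixes), not a percentage; adjust to your disk.
SystemKeepFree=2G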
# Maximum individual journal file size.
SystemMaxFileSize=100M
# Max size of journal files in /run/log/journal (volatile, for boot).
RuntimeMaxUse=100M
# Retain journal entries for a maximum of 30 days.
MaxRetentionSec=30day

[!IMPORTANT] The SystemMaxUse and SystemKeepFree directives are crucial. SystemMaxUse sets an absolute maximum for the total size of all journal files. SystemKeepFree sets how much disk space journald must leave free for other uses. journald honors both limits and applies whichever is more restrictive. Adjust SystemMaxUse based on your disk capacity and logging needs (e.g., 1G-5G is common).
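As an alternative to editing the main file, systemd also reads drop-in files from /etc/systemd/journald.conf.d/, which keeps local overrides separate from the packaged defaults. The following is a minimal sketch: the function name, the file name 90-size-limits.conf, and the values are illustrative assumptions, not requirements.

```shell
#!/bin/sh
# write_journald_dropin [DIR]: write a journald drop-in with size and
# retention limits. DIR defaults to the standard drop-in directory; passing
# another directory is handy for a dry run.
write_journald_dropin() {
    dir=${1:-/etc/systemd/journald.conf.d}
    mkdir -p "$dir" &&
    cat > "$dir/90-size-limits.conf" <<'EOF'
[Journal]
SystemMaxUse=1G
SystemMaxFileSize=100M
MaxRetentionSec=30day
EOF
}

# Example (as root, hypothetical invocation):
#   write_journald_dropin && systemctl restart systemd-journald
```

Settings in a drop-in override the same keys in /etc/systemd/journald.conf.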

Apply changes by restarting systemd-journald:

sudo systemctl restart systemd-journald

Manually purge old journal entries (optional but recommended initially):

# Trim journal files to a maximum of 1GB
sudo journalctl --vacuum-size=1G

# Trim journal files older than 7 days
sudo journalctl --vacuum-time=7d
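Vacuuming removes archived journal files only, so rotating first lets the currently active files be considered as well. The sketch below wraps that in a size check based on du. The helper names and the 1 GiB threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# journal_kib DIR: total size of DIR in KiB, as reported by du.
journal_kib() {
    du -sk "$1" | cut -f1
}

# needs_vacuum DIR LIMIT_KIB: succeed when DIR is larger than LIMIT_KIB.
needs_vacuum() {
    [ "$(journal_kib "$1")" -gt "$2" ]
}

# Example (as root, hypothetical invocation; 1048576 KiB = 1 GiB):
#   if needs_vacuum /var/log/journal 1048576; then
#       journalctl --rotate && journalctl --vacuum-size=1G
#   fi
```

Once the limits in journald.conf are in effect, journald enforces them itself, so a check like this is mainly useful for the initial cleanup.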

3. Adjust Application Logging Verbosity

Once you’ve identified the source, adjust the logging level for those applications.

For Nginx:

If Nginx access logs are flooding your system, consider disabling them for specific static assets or reducing their verbosity.

To disable access logs for specific locations (e.g., static files):

# /etc/nginx/nginx.conf or a site-specific conf file
server {
    listen 80;
    server_name example.com;

    access_log /var/log/nginx/example.com_access.log; # Main access log

    location ~* \.(jpg|jpeg|gif|png|ico|css|js)$ {
        access_log off; # Disable access logs for static assets
        expires 30d;
        add_header Cache-Control "public";
        root /var/www/example.com/html;
    }

    # ... other configurations
}

To reduce Nginx error log verbosity:

# /etc/nginx/nginx.conf
error_log /var/log/nginx/error.log warn; # Change 'info' or 'notice' to 'warn' or 'error'

After modifying Nginx configuration, test and reload:

sudo nginx -t
sudo systemctl reload nginx

For Docker Containers:

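Docker containers can generate a huge volume of logs. With the default json-file driver these are written to files under /var/lib/docker/containers/ rather than to the journal; they reach journald when a container uses the journald log driver or, typically, the syslog driver on hosts where journald receives syslog traffic.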

Configure Docker daemon-wide logging limits:

Edit /etc/docker/daemon.json (create if it doesn’t exist):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

This sets a default log limit of 10MB per file, keeping 3 files, for a total of 30MB per container.

[!IMPORTANT] After modifying daemon.json, you must restart the Docker daemon:

sudo systemctl restart docker

This applies only to containers created after the restart. Existing containers keep the logging configuration they were created with, so restarting them is not enough; they must be removed and recreated for the new logging options to take effect.
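To see which containers still run with an older logging configuration, you can list the log driver recorded for each one. This is a minimal sketch using docker ps and docker inspect; list_log_drivers is a hypothetical helper name.

```shell
#!/bin/sh
# list_log_drivers: print "<container name> <log driver>" for every
# container on the host, running or stopped.
list_log_drivers() {
    docker ps -aq | while read -r id; do
        docker inspect -f '{{.Name}} {{.HostConfig.LogConfig.Type}}' "$id"
    done
}

# Example (hypothetical invocation): show containers that log to journald
#   list_log_drivers | grep ' journald$'
```

Containers that report journald or syslog are the ones whose output can end up in the journal.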

Configure logging for specific containers:

For individual containers, you can override the daemon defaults:

docker run -d \
  --log-opt max-size=5m \
  --log-opt max-file=2 \
  your-image-name

For Custom Applications:

Check the configuration of your custom applications (e.g., Python, Node.js, Java) and ensure they are not logging at DEBUG or INFO level in production unless absolutely necessary. Switch to WARN or ERROR levels.
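When an application's own log level cannot be changed quickly, systemd can filter at the unit level instead: the LogLevelMax= setting (available in systemd 236 and later) drops messages less severe than the given level before they are stored. The sketch below writes such a drop-in. The unit name myapp.service, the function name, and the file name 10-log-level.conf are assumptions for illustration.

```shell
#!/bin/sh
# write_loglevel_dropin UNIT [LEVEL] [ROOT]: create a unit drop-in that sets
# LogLevelMax. LEVEL defaults to "warning"; ROOT is an optional prefix
# directory, useful for a dry run outside /etc.
write_loglevel_dropin() {
    unit=$1
    level=${2:-warning}
    root=${3:-}
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" &&
    printf '[Service]\nLogLevelMax=%s\n' "$level" > "$dir/10-log-level.conf"
}

# Example (as root, hypothetical unit name):
#   write_loglevel_dropin myapp.service warning &&
#       systemctl daemon-reload && systemctl restart myapp.service
```

This filters by the priority attached to each message. Plain stdout output is usually recorded at info level, so a limit of warning would discard it entirely; fixing the application's own log level remains the cleaner solution.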

4. Verify logrotate for Non-Journald Logs

While systemd-journald manages its own logs, many applications still write directly to files in /var/log (e.g., Nginx access/error logs if configured directly, database logs, older applications). Ensure logrotate is correctly configured and running for these.

Check logrotate configuration:

ls -l /etc/logrotate.d/

Review the configuration files for your services (e.g., nginx, mysql, apache2).

Example logrotate config for Nginx (/etc/logrotate.d/nginx):

/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    prerotate
        if [ -d /etc/logrotate.d/httpd-prerotate ]; then \
            run-parts /etc/logrotate.d/httpd-prerotate; \
        fi \
    endscript
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}

Ensure rotate and daily/weekly directives are set to appropriate values.
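A quick way to spot logs that are escaping rotation is to look for unusually large files under /var/log. The following is a minimal sketch; large_logs is a hypothetical helper name and the 100 MiB default is an arbitrary threshold.

```shell
#!/bin/sh
# large_logs [DIR] [MIN_MIB]: list regular files under DIR (default /var/log)
# larger than MIN_MIB mebibytes (default 100), biggest first, with the size
# in KiB in the first column.
large_logs() {
    dir=${1:-/var/log}
    min_mib=${2:-100}
    # -xdev keeps find on one file system; -size with a "k" suffix counts KiB.
    find "$dir" -xdev -type f -size +"$((min_mib * 1024))"k -exec du -k {} + \
        | sort -rn
}

# Example (hypothetical invocation):
#   sudo sh -c '. ./large_logs.sh; large_logs /var/log 200'
```

Any large file outside /var/log/journal that has no matching entry in /etc/logrotate.d/ is a candidate for a new rotation rule.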

Manually run logrotate in debug mode to test:

sudo logrotate -d /etc/logrotate.conf

This command will simulate a rotation without making changes, showing you what would happen.

Force logrotate to run:

sudo logrotate -f /etc/logrotate.conf

This can be useful to immediately clean up old logs if disk space is critical.

[!WARNING] Forcing logrotate can sometimes cause issues if a service is actively writing to a log file and not properly signaled to reopen its log file after rotation. Always ensure the postrotate script correctly reloads/restarts the service.

5. Check Disk Space and I/O Performance

Ensure there are no underlying disk issues or full partitions preventing efficient log management.

Check disk space:

df -h

Look for any partitions at 90% or higher, especially / or /var.
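The same check can be scripted for monitoring. This sketch filters the POSIX-format output of df -P. full_filesystems is a hypothetical helper name, and it assumes mount points without spaces.

```shell
#!/bin/sh
# full_filesystems [LIMIT]: read `df -P` output on stdin and print
# "<mount point> <use%>" for file systems at or above LIMIT percent
# (default 90).
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                    # skip the header line
            use = $5
            sub(/%$/, "", use)      # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Example (hypothetical invocation):
#   df -P | full_filesystems 90
```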

Monitor disk I/O performance:

iostat -x 1 10

This command provides detailed I/O statistics every second for 10 iterations. Look at %util (percentage of time the device was busy) and await (average time for I/O requests, reported as r_await and w_await in newer sysstat versions) for your disk device (e.g., sda, nvme0n1). High values indicate I/O bottlenecks.

vmstat 1 10

Focus on the wa (wait for I/O) column under cpu. A persistently high wa percentage indicates the CPU is spending a lot of time waiting for disk operations to complete.
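Because the column layout of iostat -x differs between sysstat versions, a script that reads it is safer locating %util by its header name than by position. The sketch below does that. busy_devices is a hypothetical helper name and the 80% default is an arbitrary threshold.

```shell
#!/bin/sh
# busy_devices [LIMIT]: read `iostat -x` output on stdin and print
# "<device> <%util>" for every sample where %util exceeds LIMIT (default 80).
busy_devices() {
    threshold=${1:-80}
    awk -v limit="$threshold" '
        # A blank line ends a report block; forget the column position.
        NF == 0 { col = 0; next }
        # Header row: find which column is labeled %util.
        /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Data row inside a device block.
        col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
    '
}

# Example (hypothetical invocation; LC_ALL=C keeps "." as the decimal mark):
#   LC_ALL=C iostat -x 1 10 | busy_devices 80
```

A device that shows up in most samples is saturated for the whole interval, which points at the storage layer and not only at journald.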

If these tools indicate a struggling disk, consider upgrading your storage, optimizing database I/O, or investigating hardware faults.

6. Upgrade systemd (If Applicable)

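In rare cases, a specific version of systemd might have a bug contributing to the issue. If all other steps fail and you suspect a software bug, consider upgrading systemd to the latest stable version available for your distribution. On Debian or Ubuntu, the following upgrades only the systemd package (a plain apt upgrade would upgrade every package on the system):

sudo apt update
sudo apt install --only-upgrade systemd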

Always review release notes for systemd updates for any breaking changes or known issues before upgrading in a production environment.

By systematically applying these troubleshooting steps, you can identify the root cause of systemd-journald’s high resource consumption and implement lasting solutions to maintain your system’s performance and stability.