Troubleshooting: Redis Connection Refused Error in Cluster Configuration (Node Down)

Resolve 'Redis connection refused' when a cluster node is down. A technical guide for sysadmins to diagnose and fix connectivity issues in Redis clusters.

In a high-availability Redis cluster, a “Redis connection refused” error that coincides with a node marked as down in the cluster configuration demands immediate attention. It typically surfaces as application downtime, lost sessions for session stores, or stale cache data, all of which directly affect user experience and system stability. Resolving it quickly means checking each layer the cluster depends on: the Redis process, the host, the network, and the configuration.

Symptom & Error Signature

Users will typically experience application failures where Redis is a dependency. This can range from an “Internal Server Error” on a web application to services failing to start. The underlying error in logs will point to a network connection refusal when attempting to reach a specific Redis instance.

Here are common manifestations of the error signature:

From an Application Log (e.g., Node.js with ioredis):

Error: connect ECONNREFUSED <Redis_Node_IP>:<Redis_Port>
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '<Redis_Node_IP>',
  port: <Redis_Port>
}

From a Ruby on Rails Application Log:

Redis::CannotConnectError: Connection refused - connect(2) for "<Redis_Node_IP>" port <Redis_Port>
    from /usr/local/bundle/gems/redis-4.8.1/lib/redis/client.rb:397:in `_connect_socket'
    from /usr/local/bundle/gems/redis-4.8.1/lib/redis/client.rb:355:in `connect'
    ...

From a redis-cli attempt to a faulty node:

$ redis-cli -h <Redis_Node_IP> -p <Redis_Port>
Could not connect to Redis at <Redis_Node_IP>:<Redis_Port>: Connection refused

From a redis-cli within a healthy cluster node, inspecting the cluster state:

$ redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES
<node-id> <Healthy_Node_IP>:<Healthy_Port>@<Cluster_Bus_Port> master - 0 1678825700000 1 connected 0-5460
<node-id> <Another_Healthy_Node_IP>:<Another_Healthy_Port>@<Cluster_Bus_Port> master - 0 1678825700000 2 connected 5461-10922
<node-id> <Suspect_Node_IP>:<Suspect_Port>@<Cluster_Bus_Port> master,fail - 1678825700000 1678825600000 3 disconnected 10923-16383

Notice the fail flag and the disconnected link state for the suspect node.

Root Cause Analysis

“Connection refused” means the TCP connection attempt was actively rejected: typically the host is reachable but no process is listening on the requested address and port, or a firewall is rejecting the traffic. When a cluster node is also reported as down, the cause usually falls into one of the following groups.

Redis Process Failure:
- Process Crash: The redis-server process might have crashed due to an internal error (a bug or failed assertion) or corrupt data.
- OOM Killer: The Operating System’s Out-Of-Memory (OOM) killer might have terminated the Redis process due to excessive memory consumption on the node.
- Manual Stop: The service might have been manually stopped and not restarted.
- Failed Startup: The Redis process failed to start after a system reboot or manual attempt due to misconfiguration or resource issues.
Network Inaccessibility:
- Firewall Rules: iptables, ufw, or cloud provider security groups/firewall rules (e.g., AWS Security Groups, GCP Firewall Rules) are blocking incoming connections to the Redis port on the suspect node.
- Network Partition: A network issue, such as a faulty switch, router, or virtual network misconfiguration, prevents traffic from reaching the node.
- Incorrect bind Directive: The bind directive in redis.conf might be set to 127.0.0.1 (localhost only) while clients/other cluster nodes try to connect using the node’s external IP address.
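- Incorrect port or cluster-port: The Redis instance is listening on a different port than what clients or cluster members are attempting to connect to. Cluster nodes also talk to each other over a separate cluster bus port (by default the client port plus 10000; configurable with cluster-port in recent Redis versions), which must be reachable between nodes.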
Node System Failure:
- Operating System Crash: The entire virtual machine or physical server hosting the Redis node might be offline or rebooting.
- Resource Exhaustion:
  - CPU: Extremely high CPU utilization could make the system unresponsive and prevent Redis from accepting connections.
  - Disk I/O: Severe disk latency or a full disk (especially with RDB/AOF persistence enabled) can cause Redis to hang or fail.
Redis Configuration Mismatch/Errors:
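- protected-mode enabled: If protected-mode yes is set and no requirepass (password) is configured, Redis only serves clients connecting from 127.0.0.1 and ::1 (depending on the version, this may apply only when no bind address is set explicitly). Remote clients can still open the TCP connection but receive a DENIED error, so this usually shows up as a protected-mode error rather than a literal “connection refused.” It is still a common pitfall.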
- Corrupted nodes.conf: The nodes.conf file, which stores the cluster’s state, might be corrupted, causing the node to fail joining or communicating correctly with the cluster.
- Incorrect cluster-config-file: If the path to nodes.conf is wrong or the file is inaccessible.

Step-by-Step Resolution

Follow these steps meticulously to diagnose and resolve the “Redis connection refused” error in your cluster.

1. Initial Cluster Health Check

First, identify which specific node(s) are problematic from a healthy cluster member.

# Connect to a healthy Redis cluster node
redis-cli -h <Healthy_Node_IP> -p <Healthy_Port>

Once connected, run the CLUSTER NODES command:

<Healthy_Node_IP>:<Healthy_Port>> CLUSTER NODES

Analyze the output. Look for nodes whose flags include fail or fail? (PFAIL, possibly failing), or whose link state is disconnected. Note the IP address and port of each problematic node. Each line lists the node ID, IP:port@cluster-bus-port, flags (the master/slave role plus any failure state), the master ID for replicas, ping/pong timestamps, the config epoch, the link state (connected/disconnected), and the assigned slots.
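
On a large cluster the failing entries are easy to miss by eye. As a minimal sketch, the helper below reads CLUSTER NODES output on standard input and prints the address, flags, and link state of every node that is flagged fail or fail? or whose link state is disconnected. The function name unhealthy_nodes is hypothetical (it is not part of redis-cli), and the script assumes the standard field order, with flags in the third field and link state in the eighth.

```shell
# Reads `CLUSTER NODES` output on stdin and prints "<address> <flags> <link-state>"
# for each node flagged fail / fail? or whose cluster-bus link is disconnected.
# Hypothetical helper; assumes the standard CLUSTER NODES field order.
unhealthy_nodes() {
  awk '{
    sub(/\r$/, "")   # tolerate CRLF line endings
    if ($3 ~ /(^|,)fail[?]?(,|$)/ || $8 == "disconnected") {
      print $2, $3, $8
    }
  }'
}
```

It can be used as `redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES | unhealthy_nodes`; empty output means no node is currently flagged.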

Also, check the overall cluster information:

<Healthy_Node_IP>:<Healthy_Port>> CLUSTER INFO

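This will show cluster_state:fail when the cluster cannot serve requests, for example when hash slots are left uncovered because a master is down with no replica to promote, or when the queried node cannot reach a majority of masters.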
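
CLUSTER INFO returns more fields than are needed during an outage. The sketch below keeps only the state and slot counters, which show at a glance whether slots are uncovered. The function name cluster_summary is hypothetical; the field names are the ones CLUSTER INFO reports. Carriage returns are stripped because the reply may use CRLF line endings.

```shell
# Reads `CLUSTER INFO` output on stdin and prints only the state and slot counters.
# Hypothetical helper; strips carriage returns before filtering.
cluster_summary() {
  tr -d '\r' |
    grep -E '^(cluster_state|cluster_slots_assigned|cluster_slots_ok|cluster_slots_pfail|cluster_slots_fail|cluster_known_nodes):'
}
```

A typical invocation would be `redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER INFO | cluster_summary`. A non-zero cluster_slots_fail alongside cluster_state:fail suggests a master went down without a replica taking over its slots.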

2. Verify Network Connectivity to the Suspect Node

SSH into a server that should be able to connect to the suspect Redis node (e.g., your application server or another Redis cluster member).

Ping the IP:
```
ping <Suspect_Node_IP>
```
If ping fails, the host might be down, there may be a fundamental network routing issue, or ICMP may simply be blocked, so confirm with the port test below before drawing conclusions.
Test Port Accessibility:
```
# Using telnet (if installed)
telnet <Suspect_Node_IP> <Redis_Port>

# Using netcat (nc) for a quick check (more common on modern systems)
nc -vz <Suspect_Node_IP> <Redis_Port>
```
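If telnet or nc reports “Connection refused,” the host is reachable but nothing is accepting connections on that address and port: the Redis process is not running, it is listening on a different port or interface, or a firewall is actively rejecting the traffic. “No route to host” or a timeout points instead to a routing problem or a firewall silently dropping packets. If the connection succeeds, the network path is fine and the issue is likely within the Redis configuration.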
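
Because a refusal and a timeout point to different causes, it can help to classify the result explicitly. The following is a rough sketch, assuming bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available; probe_port is a hypothetical helper name and the 3-second limit is an arbitrary choice.

```shell
# Classify a TCP connect attempt as open / refused / timeout / unreachable.
# Hypothetical helper; relies on bash's /dev/tcp and coreutils `timeout`.
probe_port() {
  local host=$1 port=$2 err rc
  # Run the connect in a child bash so its error text can be captured.
  err=$(LC_ALL=C timeout 3 bash -c ': < "/dev/tcp/$1/$2"' _ "$host" "$port" 2>&1)
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "open"          # something accepted the connection
  elif [ "$rc" -eq 124 ]; then
    echo "timeout"       # no answer at all: dropped packets or host down
  elif printf '%s' "$err" | grep -q 'Connection refused'; then
    echo "refused"       # host answered, nothing listening (or firewall reject)
  else
    echo "unreachable"   # e.g. no route to host, name resolution failure
  fi
}
```

For example, `probe_port <Suspect_Node_IP> <Redis_Port>` printing refused suggests checking the Redis process and its bind/port settings first, while timeout suggests starting with firewalls and routing.
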
Check Firewall Rules on the Suspect Node: SSH directly into the suspect Redis node.

For UFW (Uncomplicated Firewall) on Ubuntu/Debian:
```
sudo ufw status verbose
```
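Ensure that the Redis port (<Redis_Port>, typically 6379) is allowed for incoming connections from your application servers and other cluster nodes, and that the cluster bus port (by default <Redis_Port> + 10000, e.g., 16379) is allowed from the other cluster nodes.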
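
A cluster node needs two ports open: the client port and the cluster bus port, which defaults to the client port plus 10000. As a sketch, the helper below only prints the two ufw rules for review rather than applying them. The name redis_ufw_rules and the source CIDR argument are assumptions for illustration; if cluster-port is set explicitly in redis.conf, use that value instead of the computed one.

```shell
# Print (do not apply) the ufw rules for a cluster node's client port and
# cluster bus port. Hypothetical helper; assumes the default bus-port offset.
redis_ufw_rules() {
  local port=$1 source_cidr=$2
  local bus_port=$((port + 10000))   # default cluster bus port
  echo "ufw allow from $source_cidr to any port $port proto tcp"
  echo "ufw allow from $source_cidr to any port $bus_port proto tcp"
}
```

Running `redis_ufw_rules 6379 10.0.0.0/24` prints the two commands; after checking them, each can be run with sudo.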

For iptables:
```
sudo iptables -L -n -v | grep <Redis_Port>
```
Look for ACCEPT rules for the Redis port on the INPUT chain.

[!IMPORTANT] If using a cloud provider (AWS, GCP, Azure), verify the Network Security Groups (NSGs), Security Groups, or Firewall Rules are correctly configured to allow traffic on the Redis port from relevant sources (e.g., application servers, other cluster nodes, management IPs).

3. Inspect the Suspect Node Directly (SSH)

SSH into the problematic Redis node.

Check Redis Service Status:

For Systemd-managed Redis:
```
sudo systemctl status redis-server
```
Look for “Active: active (running)” status. If it’s inactive (dead), failed, or activating, the service is not running correctly.

For Docker Containerized Redis:
```
sudo docker ps -a | grep redis
```
Check if the Redis container is listed and its STATUS is Up. If it’s Exited or not shown, the container is not running. Check sudo docker logs <container_id_or_name> for its output.
Review Redis Logs: The Redis logs are crucial for understanding why the process might have stopped or refused connections.

For Systemd-managed Redis:
```
sudo journalctl -u redis-server.service -xn 50
# Or to follow live logs:
sudo journalctl -u redis-server.service -f
```
For custom Redis logging (check your redis.conf for logfile directive):
```
sudo tail -f /var/log/redis/redis-server.log
```
Look for messages indicating:
- Startup failures.
- Configuration errors.
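- Out-of-memory conditions. Note that an error such as OOM command not allowed when used memory > 'maxmemory' comes from Redis itself when its maxmemory limit is reached; it does not mean the kernel OOM killer ran. A process killed by the kernel usually leaves no final message in the Redis log, so check the system logs as well.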
- Internal errors or crashes.
Check System Logs for OOM Killer: If Redis mysteriously stopped, the OOM killer is a common culprit.
```
dmesg -T | grep -i oom
# Or search the kernel messages recorded in the journal
sudo journalctl -k | grep -i oom
```
If OOM-killer messages appear, increase the node’s RAM, optimize Redis memory usage, or set maxmemory appropriately.
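
The kernel log can contain OOM events for unrelated processes, so it is worth confirming that Redis was the victim. The sketch below filters kernel log text on standard input for kill records naming a given process. The function name oom_kills_of is hypothetical, and the pattern assumes the common “Killed process <pid> (<name>)” wording (older kernels print “Kill process”); the exact message varies by kernel version.

```shell
# Reads kernel log text on stdin and prints OOM kill records for one process name.
# Hypothetical helper; message wording differs between kernel versions.
oom_kills_of() {
  grep -E "Kill(ed)? process [0-9]+ \($1\)"
}
```

For example: `sudo dmesg -T | oom_kills_of redis-server`. No output means the kernel log holds no kill record for that process name, although older entries may already have rotated out of the ring buffer.
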
Check Resource Utilization: High CPU, memory, or disk I/O can make a node unresponsive.
```
top      # Or htop for a more visual interface
free -h  # Check available memory
df -h    # Check disk space
```
Address any resource bottlenecks (e.g., add more CPU/RAM, free up disk space).

4. Validate Redis Configuration

A common source of “connection refused” is an incorrectly configured redis.conf file. Locate your redis.conf (typically /etc/redis/redis.conf or /etc/redis/6379.conf).

sudo cat /etc/redis/redis.conf

Pay close attention to these directives:

bind:
- If bind 127.0.0.1 is set, Redis will only accept connections from the local machine. bind takes addresses of this node’s own network interfaces, not client addresses: either set it to 0.0.0.0 to listen on all interfaces (less secure without a password and firewall) or list the specific private IP(s) of this node that clients and other cluster members connect to.
- Example for listening on all interfaces:
```
bind 0.0.0.0
```
- Example for binding to specific interfaces (replace with this node’s own IPs):
```
bind 192.168.1.100 10.0.0.5
```
port: Ensure this matches the port your clients are trying to connect to (default is 6379).
protected-mode:
- If protected-mode yes is set and no requirepass (password) is configured, Redis still listens on its bound interfaces but answers clients from non-loopback addresses with a DENIED error instead of serving them. If you need external access without a password, set protected-mode no.
- [!WARNING]
Setting protected-mode no without requirepass exposes your Redis instance to the network. Ensure proper firewall rules are in place.
```
protected-mode no
```
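- Alternatively, configure a strong password. In a cluster, use the same password on every node and also set masterauth to it so that replicas can authenticate to their masters: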
```
requirepass your_strong_password_here
```
cluster-enabled: Must be yes for cluster nodes.
cluster-config-file: Path to nodes.conf. Ensure it’s correct and Redis has permissions to write to it.
daemonize: Set to yes if running as a background service without Systemd/Docker managing it. For Systemd, no is often preferred as Systemd handles daemonization.

After any configuration changes, save the file.
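
A full redis.conf is long and mostly comments, which makes the directives above hard to spot. As a minimal sketch, the helper below prints only the connection-related directives from a config read on standard input and masks the password. The name show_conn_directives and the exact directive list are choices made for illustration.

```shell
# Reads a redis.conf on stdin and prints only connection-related directives,
# skipping comments and masking the requirepass value. Hypothetical helper.
show_conn_directives() {
  awk '
    { key = tolower($1) }
    key ~ /^(bind|port|protected-mode|cluster-enabled|cluster-config-file|cluster-port|daemonize|logfile|dir)$/ { print; next }
    key == "requirepass" { print "requirepass ********" }
  '
}
```

For example: `sudo cat /etc/redis/redis.conf | show_conn_directives`. Directives that do not appear in the output are not set in the file and fall back to Redis defaults.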

5. Restart Redis Service and Monitor

If you’ve identified and fixed a problem (e.g., incorrect bind directive, resource issue, service stopped), restart the Redis service.

For Systemd-managed Redis:

sudo systemctl restart redis-server
sudo systemctl status redis-server

Immediately after restarting, monitor the logs to ensure it starts without errors:

sudo journalctl -u redis-server.service -f

For Docker Containerized Redis:

sudo docker restart <redis_container_id_or_name>
sudo docker logs -f <redis_container_id_or_name>

After the node has restarted successfully and is logging properly, re-check the cluster status from a healthy node:

redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES

Verify that the previously failing node now shows connected and no longer carries the fail flag.
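
Failure flags can take a few seconds to clear after a restart, so a single check may be misleading. The sketch below turns the verification into an exit status that can be polled: it succeeds only if the given address appears in CLUSTER NODES output with link state connected and without a fail or fail? flag. The name node_is_healthy is hypothetical, and the address is matched against the ip:port part of the second field.

```shell
# Reads `CLUSTER NODES` output on stdin. Exits 0 only if the node at <ip:port>
# is listed, its link state is "connected", and it has no fail / fail? flag.
# Hypothetical helper.
node_is_healthy() {
  awk -v addr="$1" '
    { sub(/\r$/, "") }
    index($2, addr "@") == 1 {
      found = 1
      if ($8 != "connected" || $3 ~ /(^|,)fail[?]?(,|$)/) bad = 1
    }
    END { if (found && !bad) exit 0; exit 1 }
  '
}
```

One way to wait for recovery is `until redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES | node_is_healthy <Suspect_Node_IP>:<Redis_Port>; do sleep 2; done`.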

6. Handle Corrupted `nodes.conf` (Advanced)

In rare cases, the nodes.conf file on a specific node might become corrupted, preventing it from correctly joining or communicating with the cluster, even if the Redis process itself starts.

[!WARNING] This step is highly destructive for a single node’s cluster state and should only be performed as a last resort if the node persistently fails to re-join the cluster, and all other troubleshooting steps have been exhausted. Always back up your nodes.conf file before proceeding.

Stop the Redis service on the problematic node.
```
sudo systemctl stop redis-server
```
Back up the existing nodes.conf file. The location is defined by cluster-config-file in redis.conf (relative to the dir directive); the path below is an example.
```
sudo cp /var/lib/redis/nodes-6379.conf /var/lib/redis/nodes-6379.conf.bak
```
Delete the nodes.conf file.
```
sudo rm /var/lib/redis/nodes-6379.conf
```
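Start the Redis service. It will generate a new nodes.conf with a fresh node ID. The other nodes still list the old node ID as failed; that stale entry can be removed by running CLUSTER FORGET <old-node-id> on each remaining node.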
```
sudo systemctl start redis-server
```
Re-integrate the node into the cluster. This needs to be done from a healthy node in the cluster.
- If the node was a replica, you can add it as a replica to an existing master:
```
redis-cli --cluster add-node <Suspect_Node_IP>:<Redis_Port> <Healthy_Master_IP>:<Healthy_Master_Port> --cluster-slave --cluster-master-id <Master_Node_ID_for_Replica>
```
- If it was a master whose data is lost and you need to add it back as a new, empty master (less common for “connection refused”):
```
redis-cli --cluster add-node <Suspect_Node_IP>:<Redis_Port> <Healthy_Node_IP>:<Healthy_Port>
```
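  The new master starts with no hash slots. Assign slots to it with redis-cli --cluster rebalance <Healthy_Node_IP>:<Healthy_Port> --cluster-use-empty-masters, running redis-cli --cluster fix <Healthy_Node_IP>:<Healthy_Port> first if any slots were left uncovered.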
[!NOTE] Identifying the Master_Node_ID_for_Replica comes from the CLUSTER NODES output from a healthy node.
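
Copying a 40-character node ID by hand is error-prone. As a small sketch, the helper below prints the node ID for a given ip:port from CLUSTER NODES output on standard input. The name node_id_of is hypothetical.

```shell
# Reads `CLUSTER NODES` output on stdin and prints the node ID whose address
# starts with the given <ip:port>. Hypothetical helper.
node_id_of() {
  awk -v addr="$1" '
    { sub(/\r$/, "") }
    index($2, addr "@") == 1 { print $1 }
  '
}
```

For example, `redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES | node_id_of <Healthy_Master_IP>:<Healthy_Master_Port>` prints the ID to pass to --cluster-master-id.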

7. Client-Side Verification

After the Redis cluster node is fully operational and connected, ensure your application clients are correctly configured.

Cluster-Aware Clients: Verify that your application is using a Redis client library that supports Redis Cluster and is configured with the correct seed nodes.
Stale Cache: If the client’s internal view of the cluster became stale, restarting the application or refreshing its connection pool might be necessary to pick up the updated cluster configuration.

By systematically working through these steps, you can effectively diagnose and resolve “Redis connection refused” errors caused by a down or inaccessible cluster node, restoring your Redis cluster’s stability and your applications’ functionality.