Troubleshooting: Redis Connection Refused Error in Cluster Configuration (Node Down)
Resolve 'Redis connection refused' when a cluster node is down. A technical guide for sysadmins to diagnose and fix connectivity issues in Redis clusters.
When managing high-availability Redis clusters, a “Redis connection refused” error that coincides with a cluster node being marked as down is a critical event that demands immediate attention. It typically manifests as application downtime, lost sessions in session stores, or stale cache data, directly impacting user experience and system stability. Resolving it quickly requires understanding the layers of a Redis cluster and its dependencies.
Symptom & Error Signature
Users will typically experience application failures where Redis is a dependency. This can range from an “Internal Server Error” on a web application to services failing to start. The underlying error in logs will point to a network connection refusal when attempting to reach a specific Redis instance.
Here are common manifestations of the error signature:
From an Application Log (e.g., Node.js with ioredis):
Error: connect ECONNREFUSED <Redis_Node_IP>:<Redis_Port>
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1494:16) {
errno: -111,
code: 'ECONNREFUSED',
syscall: 'connect',
address: '<Redis_Node_IP>',
port: <Redis_Port>
}
From a Ruby on Rails Application Log:
Redis::CannotConnectError: Connection refused - connect(2) for "<Redis_Node_IP>" port <Redis_Port>
from /usr/local/bundle/gems/redis-4.8.1/lib/redis/client.rb:397:in `_connect_socket'
from /usr/local/bundle/gems/redis-4.8.1/lib/redis/client.rb:355:in `connect'
...
From a redis-cli attempt to a faulty node:
$ redis-cli -h <Redis_Node_IP> -p <Redis_Port>
Could not connect to Redis at <Redis_Node_IP>:<Redis_Port>: Connection refused
From a redis-cli within a healthy cluster node, inspecting the cluster state:
$ redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES
<node-id> <Healthy_Node_IP>:<Healthy_Port>@<Cluster_Bus_Port> master - 0 1678825700000 1 connected 0-5460
<node-id> <Another_Healthy_Node_IP>:<Another_Healthy_Port>@<Cluster_Bus_Port> master - 0 1678825700000 2 connected 5461-10922
<node-id> <Suspect_Node_IP>:<Suspect_Port>@<Cluster_Bus_Port> master,fail - 1678825700000 1678825600000 3 disconnected 10923-16383
Notice the fail flag and the disconnected link state for the suspect node.
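The CLUSTER NODES output above can also be checked programmatically. The following is a minimal sketch, assuming the space-separated field layout shown above (node ID, address, flags, master ID, ping-sent, pong-received, config epoch, link state, slots); the names ClusterNode, parse_cluster_nodes, and unhealthy, as well as the sample IPs and node IDs, are illustrative and not part of Redis.

```python
from typing import NamedTuple


class ClusterNode(NamedTuple):
    node_id: str
    address: str          # "ip:port", with the "@bus-port" suffix removed
    flags: frozenset      # e.g. {"master", "fail"}
    link_state: str       # "connected" or "disconnected"
    slots: tuple          # slot ranges as printed, e.g. ("0-5460",)


def parse_cluster_nodes(text):
    """Parse CLUSTER NODES text into ClusterNode records."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        address = fields[1].split("@", 1)[0]
        nodes.append(ClusterNode(
            node_id=fields[0],
            address=address,
            flags=frozenset(fields[2].split(",")),
            link_state=fields[7],
            slots=tuple(fields[8:]),
        ))
    return nodes


def unhealthy(nodes):
    """Return nodes flagged fail / fail? or whose link is not connected."""
    return [
        n for n in nodes
        if n.flags & {"fail", "fail?"} or n.link_state != "connected"
    ]


# Illustrative sample data (hypothetical IDs and IPs).
sample = """\
aaa 10.0.0.1:6379@16379 myself,master - 0 1678825700000 1 connected 0-5460
bbb 10.0.0.2:6379@16379 master - 0 1678825700000 2 connected 5461-10922
ccc 10.0.0.3:6379@16379 master,fail - 1678825700000 1678825600000 3 disconnected 10923-16383
"""

for node in unhealthy(parse_cluster_nodes(sample)):
    print(node.address, sorted(node.flags), node.link_state, node.slots)
```

In practice the text would come from running redis-cli ... CLUSTER NODES against a healthy node; the sample string only stands in for that output.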
Root Cause Analysis
A “connection refused” error in a Redis cluster, especially when a node is explicitly identified as “down,” can have several distinct causes. In every case, the target Redis process is unreachable or is not accepting connections.
- Redis Process Failure:
  - Process Crash: The redis-server process might have crashed due to an internal error, unhandled exception, or corrupt data.
  - OOM Killer: The Operating System’s Out-Of-Memory (OOM) killer might have terminated the Redis process due to excessive memory consumption on the node.
  - Manual Stop: The service might have been manually stopped and not restarted.
  - Failed Startup: The Redis process failed to start after a system reboot or manual attempt due to misconfiguration or resource issues.
- Network Inaccessibility:
  - Firewall Rules: iptables, ufw, or cloud provider security groups/firewall rules (e.g., AWS Security Groups, GCP Firewall Rules) are blocking incoming connections to the Redis port on the suspect node.
  - Network Partition: A network issue, such as a faulty switch, router, or virtual network misconfiguration, prevents traffic from reaching the node.
  - Incorrect bind Directive: The bind directive in redis.conf might be set to 127.0.0.1 (localhost only) while clients/other cluster nodes try to connect using the node’s external IP address.
  - Incorrect port or cluster-port: The Redis instance is listening on a different port than the one clients or cluster members are attempting to connect to. Cluster nodes also talk to each other over the cluster bus port, which defaults to the client port plus 10000 and must be reachable as well.
- Node System Failure:
  - Operating System Crash: The entire virtual machine or physical server hosting the Redis node might be offline or rebooting.
  - Resource Exhaustion:
    - CPU: Extremely high CPU utilization could make the system unresponsive and prevent Redis from accepting connections.
    - Disk I/O: Severe disk latency or a full disk (especially with RDB/AOF persistence enabled) can cause Redis to hang or fail.
- Redis Configuration Mismatch/Errors:
  - protected-mode enabled: If protected-mode yes is set and no requirepass (password) is configured, Redis only serves clients connecting from 127.0.0.1 and ::1. Remote clients can still open the TCP connection but receive a DENIED error rather than a refusal, so suspect this when commands are rejected instead of the socket failing with ECONNREFUSED. This is a common pitfall.
  - Corrupted nodes.conf: The nodes.conf file, which stores the cluster’s state, might be corrupted, causing the node to fail joining or communicating correctly with the cluster.
  - Incorrect cluster-config-file: The path to nodes.conf is wrong or the file is inaccessible.
Step-by-Step Resolution
Follow these steps meticulously to diagnose and resolve the “Redis connection refused” error in your cluster.
1. Initial Cluster Health Check
First, identify which specific node(s) are problematic from a healthy cluster member.
# Connect to a healthy Redis cluster node
redis-cli -h <Healthy_Node_IP> -p <Healthy_Port>
Once connected, run the CLUSTER NODES command:
<Healthy_Node_IP>:<Healthy_Port>> CLUSTER NODES
Analyze the output. Look for nodes flagged fail or fail? (the PFAIL state, a possible failure that other nodes have not yet confirmed), or whose link state is disconnected. Note down the IP address and port of the problematic node(s). Each line provides the node ID, IP:port, flags (including the master/slave role and any failure state), link state (connected/disconnected), and assigned slots.
Also, check the overall cluster information:
<Healthy_Node_IP>:<Healthy_Port>> CLUSTER INFO
This will show cluster_state:fail when the cluster cannot serve requests, for example because some hash slots are no longer covered by a reachable master or because this node cannot reach a majority of masters.
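As a rough illustration of reading CLUSTER INFO, the sketch below parses the key:value lines and reports the fields most relevant to a down node. The field names (cluster_state, cluster_slots_assigned, cluster_slots_pfail, cluster_slots_fail) are the ones Redis reports; the helper names parse_cluster_info and cluster_problems and the sample values are assumptions made for this example.

```python
TOTAL_SLOTS = 16384  # number of hash slots in a Redis cluster


def parse_cluster_info(text):
    """Turn CLUSTER INFO output ("key:value" per line) into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key] = value
    return info


def cluster_problems(info):
    """Return human-readable findings; an empty list means nothing obvious."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append("cluster_state is %s" % info.get("cluster_state", "missing"))
    for key in ("cluster_slots_pfail", "cluster_slots_fail"):
        count = int(info.get(key, "0"))
        if count > 0:
            problems.append("%s = %d" % (key, count))
    assigned = int(info.get("cluster_slots_assigned", "0"))
    if assigned < TOTAL_SLOTS:
        problems.append("only %d of %d slots assigned" % (assigned, TOTAL_SLOTS))
    return problems


# Illustrative output for a cluster that lost the master of slots 10923-16383.
sample = (
    "cluster_state:fail\r\n"
    "cluster_slots_assigned:16384\r\n"
    "cluster_slots_ok:10923\r\n"
    "cluster_slots_pfail:0\r\n"
    "cluster_slots_fail:5461\r\n"
    "cluster_known_nodes:3\r\n"
)

for finding in cluster_problems(parse_cluster_info(sample)):
    print(finding)
```

A non-empty result is only a hint about where to look next; the per-node detail still comes from CLUSTER NODES.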
2. Verify Network Connectivity to the Suspect Node
SSH into a server that should be able to connect to the suspect Redis node (e.g., your application server or another Redis cluster member).
- Ping the IP:
    ping <Suspect_Node_IP>
  If ping fails, the host might be down, ICMP might be blocked, or there’s a fundamental network routing issue.
- Test Port Accessibility:
    # Using telnet (if installed)
    telnet <Suspect_Node_IP> <Redis_Port>
    # Using netcat (nc) for a quick check (more common on modern systems)
    nc -vz <Suspect_Node_IP> <Redis_Port>
  If telnet or nc reports “Connection refused,” the host is reachable but nothing is accepting connections on that port: the Redis process is not running, is listening on a different address or port, or a firewall is actively rejecting the connection. “No route to host” or a timeout points instead to a firewall dropping packets or a routing problem. If it successfully connects or reports “Connected,” the issue is likely within Redis configuration.
- Check Firewall Rules on the Suspect Node: SSH directly into the suspect Redis node.
  For UFW (Uncomplicated Firewall) on Ubuntu/Debian:
    sudo ufw status verbose
  Ensure that the Redis port (<Redis_Port>, typically 6379) is allowed for incoming connections from your application servers and other cluster nodes. Other cluster nodes additionally need access to the cluster bus port.
  For iptables:
    sudo iptables -L -n -v | grep <Redis_Port>
  Look for ACCEPT rules for the Redis port on the INPUT chain.
  [!IMPORTANT] If using a cloud provider (AWS, GCP, Azure), verify the Network Security Groups (NSGs), Security Groups, or Firewall Rules are correctly configured to allow traffic on the Redis port from relevant sources (e.g., application servers, other cluster nodes, management IPs).
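The same port check can be scripted when telnet or nc is not available. This is a minimal sketch using only the Python standard library; the function names probe and probe_redis_node are hypothetical, and the bus port calculation assumes the default offset of 10000 (a node with a custom cluster-port would need that value instead).

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but nothing is listening (or a firewall sent a reset).
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        return "error: %s" % exc


def probe_redis_node(host, port=6379, timeout=3.0):
    """Probe the client port and the assumed default cluster bus port."""
    bus_port = port + 10000  # assumption: default cluster bus offset
    return {
        "client_port": probe(host, port, timeout),
        "bus_port": probe(host, bus_port, timeout),
    }


# Example (replace with the suspect node's address):
# print(probe_redis_node("10.0.0.3", 6379))
```

A result of "refused" on the client port with the host otherwise reachable suggests looking at the Redis process itself, while "timeout" on either port suggests looking at firewall rules first.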
3. Inspect the Suspect Node Directly (SSH)
SSH into the problematic Redis node.
- Check Redis Service Status:
  For Systemd-managed Redis:
    sudo systemctl status redis-server
  Look for “Active: active (running)” status. If it’s inactive (dead), failed, or activating, the service is not running correctly.
  For Docker Containerized Redis:
    sudo docker ps -a | grep redis
  Check if the Redis container is listed and its STATUS is Up. If it’s Exited or not shown, the container is not running. Check sudo docker logs <container_id_or_name> for its output.
- Review Redis Logs: The Redis logs are crucial for understanding why the process might have stopped or refused connections.
  For Systemd-managed Redis:
    sudo journalctl -u redis-server.service -xn 50
    # Or to follow live logs:
    sudo journalctl -u redis-server.service -f
  For custom Redis logging (check your redis.conf for the logfile directive):
    sudo tail -f /var/log/redis/redis-server.log
  Look for messages indicating:
  - Startup failures.
  - Configuration errors.
  - Memory pressure. Note that the kernel OOM killer does not write to the Redis log. The message OOM command not allowed when used memory > 'maxmemory' is an error Redis returns to clients when it reaches its own maxmemory limit, which means the process is still running.
  - Internal errors or crashes.
- Check System Logs for OOM Killer: If Redis mysteriously stopped, the OOM killer is a common culprit.
    dmesg -T | grep -i oom
    # Or search the kernel messages stored in the journal
    sudo journalctl -k | grep -i oom
  If OOM-killer messages appear, increase the node’s RAM, optimize Redis memory usage, or set maxmemory appropriately.
- Check Resource Utilization: High CPU, memory, or disk I/O can make a node unresponsive.
    top     # Or htop for a more visual interface
    free -h # Check available memory
    df -h   # Check disk space
  Address any resource bottlenecks (e.g., add more CPU/RAM, free up disk space).
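To make the OOM killer check repeatable, kernel log text can be scanned for kill records that name the Redis process. The sketch below assumes the common kernel wording "Killed process <pid> (<name>)"; the exact message varies between kernel versions, and the sample lines are illustrative rather than taken from a real host. The function name find_oom_kills is hypothetical.

```python
import re

# Assumed kernel wording; older and newer kernels prefix it differently.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text, process_name="redis-server"):
    """Return (pid, line) pairs for OOM kills of the given process name."""
    hits = []
    for line in log_text.splitlines():
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.strip()))
    return hits


# Illustrative log excerpt (hypothetical PIDs and sizes).
sample = """\
[Tue Mar 14 20:28:01 2023] redis-server invoked oom-killer: gfp_mask=0x100cca, order=0
[Tue Mar 14 20:28:01 2023] Out of memory: Killed process 2315 (redis-server) total-vm:9876543kB, anon-rss:8123456kB
[Tue Mar 14 20:31:44 2023] Out of memory: Killed process 4410 (java) total-vm:5555555kB, anon-rss:4444444kB
"""

for pid, line in find_oom_kills(sample):
    print("redis-server pid %d was killed: %s" % (pid, line))
```

The input would typically be the output of dmesg -T saved to a file or piped in; an empty result does not rule out other causes of a crash.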
4. Validate Redis Configuration
A common source of “connection refused” is an incorrectly configured redis.conf file.
Locate your redis.conf (typically /etc/redis/redis.conf or /etc/redis/6379.conf).
sudo cat /etc/redis/redis.conf
Pay close attention to these directives:
- bind:
  - If bind 127.0.0.1 is set, Redis will only accept connections from the local machine. Change it to 0.0.0.0 to listen on all interfaces (less secure without a password), or bind to the node’s own private IP address(es) that cluster members and application servers connect to.
  - Example for listening on all interfaces:
      bind 0.0.0.0
  - Example for binding to specific interfaces (replace with actual IPs):
      bind 192.168.1.100 10.0.0.5
- port: Ensure this matches the port your clients are trying to connect to (default is 6379).
- protected-mode:
  - If protected-mode yes is set and no requirepass (password) is configured, Redis rejects clients that do not connect over a loopback interface (like 127.0.0.1). If you need external access without a password, set protected-mode no:
      protected-mode no
    [!WARNING] Setting protected-mode no without requirepass exposes your Redis instance to the network. Ensure proper firewall rules are in place.
  - Alternatively, configure a strong password. In a cluster, replicas also need the same password in masterauth to authenticate to their master:
      requirepass your_strong_password_here
- cluster-enabled: Must be yes for cluster nodes.
- cluster-config-file: Path to nodes.conf. Ensure it’s correct and Redis has permissions to write to it.
- daemonize: Set to yes if running as a background service without Systemd/Docker managing it. For Systemd, no is often preferred as Systemd handles daemonization.
After any configuration changes, save the file.
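The directive checks described above can be sketched as a small linter. This is an illustrative example only: it handles plain "name value" lines, ignores include files, quoting, and ACL users, and the helper names parse_redis_conf and review_conf are assumptions. The defaults it applies (protected-mode yes, cluster-enabled no) match the Redis defaults.

```python
LOOPBACK = {"127.0.0.1", "::1"}


def parse_redis_conf(text):
    """Map each directive name to the list of values it was given."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        directives.setdefault(parts[0].lower(), []).append(value)
    return directives


def review_conf(directives):
    """Return (directive, message) pairs for settings worth a second look."""
    findings = []

    # bind may list several addresses; a leading "-" marks an optional one.
    addresses = [a.lstrip("-") for a in " ".join(directives.get("bind", [])).split()]
    if addresses and all(a in LOOPBACK for a in addresses):
        findings.append(("bind", "only loopback addresses are bound"))

    protected = directives.get("protected-mode", ["yes"])[-1].lower()
    has_password = any(v for v in directives.get("requirepass", []))
    if protected == "yes" and not has_password:
        findings.append(("protected-mode", "enabled with no requirepass set"))

    if directives.get("cluster-enabled", ["no"])[-1].lower() != "yes":
        findings.append(("cluster-enabled", "cluster mode is not enabled"))

    return findings


# Illustrative configuration fragment.
sample = """\
# Network
bind 127.0.0.1
port 6379
protected-mode yes
cluster-enabled yes
cluster-config-file nodes-6379.conf
"""

for directive, message in review_conf(parse_redis_conf(sample)):
    print("%s: %s" % (directive, message))
```

Running it against the contents of /etc/redis/redis.conf gives a quick first pass; any finding still needs to be weighed against how the node is meant to be reached.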
5. Restart Redis Service and Monitor
If you’ve identified and fixed a problem (e.g., incorrect bind directive, resource issue, service stopped), restart the Redis service.
For Systemd-managed Redis:
sudo systemctl restart redis-server
sudo systemctl status redis-server
Immediately after restarting, monitor the logs to ensure it starts without errors:
sudo journalctl -u redis-server.service -f
For Docker Containerized Redis:
sudo docker restart <redis_container_id_or_name>
sudo docker logs -f <redis_container_id_or_name>
After the node has restarted successfully and is logging properly, re-check the cluster status from a healthy node:
redis-cli -h <Healthy_Node_IP> -p <Healthy_Port> CLUSTER NODES
Verify that the problematic node now shows connected.
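Because a restarted node can take a few seconds to be seen as healthy again, the re-check is often repeated. The sketch below polls CLUSTER NODES text until the node shows a connected link and no failure flag. The function names node_is_healthy and wait_for_node are hypothetical, and fetch is any caller-supplied function that returns the current CLUSTER NODES output, for instance by invoking redis-cli through subprocess.

```python
import time


def node_is_healthy(cluster_nodes_text, address):
    """True if the node at "ip:port" is connected and not flagged fail/fail?."""
    for line in cluster_nodes_text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        if fields[1].split("@", 1)[0] != address:
            continue
        flags = set(fields[2].split(","))
        return fields[7] == "connected" and not flags & {"fail", "fail?"}
    return False  # node not listed at all


def wait_for_node(fetch, address, attempts=10, delay=2.0, sleep=time.sleep):
    """Poll fetch() until the node is healthy; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if node_is_healthy(fetch(), address):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


# Example wiring (assumes redis-cli is installed on this host):
# import subprocess
# fetch = lambda: subprocess.run(
#     ["redis-cli", "-h", "10.0.0.1", "-p", "6379", "CLUSTER", "NODES"],
#     capture_output=True, text=True, check=True,
# ).stdout
# print(wait_for_node(fetch, "10.0.0.3:6379"))
```

Passing fetch and sleep in as parameters keeps the polling logic independent of how the output is obtained, which also makes it easy to exercise with canned text.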
6. Handle Corrupted nodes.conf (Advanced)
In rare cases, the nodes.conf file on a specific node might become corrupted, preventing it from correctly joining or communicating with the cluster, even if the Redis process itself starts.
[!WARNING] This step is highly destructive for a single node’s cluster state and should only be performed as a last resort if the node persistently fails to re-join the cluster, and all other troubleshooting steps have been exhausted. Always back up your nodes.conf file before proceeding.
- Stop the Redis service on the problematic node.
    sudo systemctl stop redis-server
- Back up the existing nodes.conf file. The location is defined by cluster-config-file in redis.conf.
    sudo cp /var/lib/redis/nodes-6379.conf /var/lib/redis/nodes-6379.conf.bak
- Delete the nodes.conf file.
    sudo rm /var/lib/redis/nodes-6379.conf
- Start the Redis service. It will generate a new nodes.conf with a fresh node ID.
    sudo systemctl start redis-server
- Re-integrate the node into the cluster. This needs to be done from a healthy node in the cluster.
  - If the node was a replica, you can add it as a replica to an existing master:
      redis-cli --cluster add-node <Suspect_Node_IP>:<Redis_Port> <Healthy_Master_IP>:<Healthy_Master_Port> --cluster-slave --cluster-master-id <Master_Node_ID_for_Replica>
    [!NOTE] The Master_Node_ID_for_Replica comes from the CLUSTER NODES output of a healthy node.
  - If it was a master and its data is lost, and you need to replace it as a new master (less common for “connection refused”):
      redis-cli --cluster add-node <Suspect_Node_IP>:<Redis_Port> <Healthy_Node_IP>:<Healthy_Port>
    Then, you might need to rebalance the cluster slots.
  Because the node now has a new ID, the remaining nodes may still list the old ID as failed. That stale entry can be removed by running CLUSTER FORGET <old-node-id> on each remaining node.
7. Client-Side Verification
After the Redis cluster node is fully operational and connected, ensure your application clients are correctly configured.
- Cluster-Aware Clients: Verify that your application is using a Redis client library that supports Redis Cluster and is configured with the correct seed nodes.
- Stale Cache: If the client’s internal view of the cluster became stale, restarting the application or refreshing its connection pool might be necessary to pick up the updated cluster configuration.
By systematically working through these steps, you can effectively diagnose and resolve “Redis connection refused” errors caused by a down or inaccessible cluster node, restoring your Redis cluster’s stability and your applications’ functionality.
