Last night, while working on my systems, I suddenly saw a lot of Nagios alerts on the dashboard showing a "Socket Timeout" error. I logged in to the server that was raising the alerts and checked the status of the nrpe service, which turned out not to be running. I started troubleshooting, and I am putting the steps in this article in case you are facing the same issue.
nrpe.service: main process exited, code=exited, status=1/FAILURE
Also Read: Solved: nrpe.service: main process exited, code=exited, status=2/INVALIDARGUMENT
You might have seen the "main process exited, code=exited, status=1/FAILURE" error while trying to start or restart the nrpe service on a Linux system. When you run the systemctl status nrpe command to check the status, the output looks something like this.
[root@localhost ~]# systemctl status nrpe
● nrpe.service - Nagios Remote Program Executor
   Loaded: loaded (/usr/lib/systemd/system/nrpe.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-09-21 21:45:35 UTC; 2s ago
     Docs: http://www.nagios.org/documentation
  Process: 27745 ExecStopPost=/bin/rm -f /var/run/nrpe/nrpe.pid (code=exited, status=0/SUCCESS)
  Process: 27743 ExecStart=/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f $NRPE_SSL_OPT (code=exited, status=1/FAILURE)
 Main PID: 27743 (code=exited, status=1/FAILURE)

Sep 21 21:45:35 localhost systemd[1]: Started Nagios Remote Program Executor.
Sep 21 21:45:35 localhost systemd[1]: Starting Nagios Remote Program Executor...
Sep 21 21:45:35 localhost systemd[1]: nrpe.service: main process exited, code=exited, status=1/FAILURE
Sep 21 21:45:35 localhost systemd[1]: Unit nrpe.service entered failed state.
Sep 21 21:45:35 localhost systemd[1]: nrpe.service failed.
NOTE: I am using the root user to run all the commands below. You can use any user with sudo access to run them. For more information, please check Step by Step: How to Add User to Sudoers to provide sudo access to the user.

This error can have several possible causes, so it is worth going through the likely scenarios one by one. A common one is a permission problem when nrpe tries to create its nrpe.pid file. You can check the path of the nrpe.pid file configured in /etc/nagios/nrpe.cfg using the command below.
[root@localhost ~]# cat /etc/nagios/nrpe.cfg | grep nrpe.pid
pid_file=/etc/nagios/nrpe.pid
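The plain grep above also matches commented-out lines. As a minimal sketch, the helper below prints only the value of active pid_file lines. The function name get_pid_file is hypothetical and not part of NRPE, and the default path is simply the one used in this article.

```shell
# Print the value of every active (non-commented) pid_file directive
# in an NRPE config file. get_pid_file is a hypothetical helper name.
get_pid_file() {
    # Default to the config path used in this article.
    cfg=${1:-/etc/nagios/nrpe.cfg}
    # Lines starting with '#' do not match, so comments are skipped.
    sed -n 's/^[[:space:]]*pid_file[[:space:]]*=[[:space:]]*//p' "$cfg"
}
```

Calling get_pid_file with no argument reads /etc/nagios/nrpe.cfg; pass a different path if your configuration lives elsewhere.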
To check for a permission problem, go to the /etc/nagios directory and check the permissions of the nrpe.pid file using the ls -lrt nrpe.pid command. Make sure the file has the correct ownership and permissions, so that the nrpe service can create and update it.
[root@localhost ~]# cd /etc/nagios/
[root@localhost nagios]# ls -lrt nrpe.pid
-rw-r--r--. 1 root root 5 Sep 21 21:51 nrpe.pid
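Besides the file itself, the directory that holds the PID file has to exist and be writable. The sketch below reports that for a given PID file path. The function name check_pid_dir is hypothetical, and the check applies to whichever user runs it, so it is only meaningful when run as the account nrpe runs under (see nrpe_user in nrpe.cfg); root will normally see every directory as writable.

```shell
# Report whether the directory that would hold the PID file exists and
# is writable by the user running this check.
# check_pid_dir is a hypothetical helper name.
check_pid_dir() {
    pid_path=$1
    pid_dir=$(dirname "$pid_path")
    if [ ! -d "$pid_dir" ]; then
        echo "missing: $pid_dir"
        return 1
    fi
    if [ ! -w "$pid_dir" ]; then
        echo "not writable: $pid_dir"
        return 1
    fi
    echo "ok: $pid_dir"
}
```

For example, check_pid_dir /etc/nagios/nrpe.pid checks the /etc/nagios directory used in this article.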
If the file has the correct permissions, you can try changing the pid_file path to a location under /var/run/nagios and then restarting the service to see if that helps. Open the file in the vi editor by running the vi /etc/nagios/nrpe.cfg command as shown below. After editing the file, save and exit by pressing Esc and then typing :wq!
[root@localhost ~]# vi /etc/nagios/nrpe.cfg
pid_file=/var/run/nagios/nrpe.pid
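If you prefer not to edit the file interactively, the same change can be sketched with sed. The function name set_pid_file is hypothetical. It assumes the new path contains no | or & characters, and it keeps a copy of the original file with a .bak suffix so the change can be reverted.

```shell
# Rewrite the pid_file directive in an NRPE config file, keeping a
# backup copy next to it. set_pid_file is a hypothetical helper name.
set_pid_file() {
    cfg=$1
    new_path=$2
    # Keep the original so the change can be reverted.
    cp -p "$cfg" "$cfg.bak" || return 1
    # '|' is the sed delimiter, so slashes in the path need no escaping.
    sed "s|^[[:space:]]*pid_file[[:space:]]*=.*|pid_file=$new_path|" "$cfg.bak" > "$cfg"
}
```

For the change made in this article, the call would be set_pid_file /etc/nagios/nrpe.cfg /var/run/nagios/nrpe.pid. Commented-out pid_file lines are left untouched.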
Then restart the nrpe service by using the systemctl restart nrpe command as shown below.
[root@localhost ~]# systemctl restart nrpe
Then check the status again by using the systemctl status nrpe command as shown below.
[root@localhost ~]# systemctl status nrpe
● nrpe.service - Nagios Remote Program Executor
   Loaded: loaded (/usr/lib/systemd/system/nrpe.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2020-09-21 21:45:35 UTC; 2s ago
     Docs: http://www.nagios.org/documentation
  Process: 27745 ExecStopPost=/bin/rm -f /var/run/nrpe/nrpe.pid (code=exited, status=0/SUCCESS)
  Process: 27743 ExecStart=/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f $NRPE_SSL_OPT (code=exited, status=1/FAILURE)
 Main PID: 27743 (code=exited, status=1/FAILURE)

Sep 21 21:45:35 localhost systemd[1]: Started Nagios Remote Program Executor.
Sep 21 21:45:35 localhost systemd[1]: Starting Nagios Remote Program Executor...
Sep 21 21:45:35 localhost systemd[1]: nrpe.service: main process exited, code=exited, status=1/FAILURE
Sep 21 21:45:35 localhost systemd[1]: Unit nrpe.service entered failed state.
Sep 21 21:45:35 localhost systemd[1]: nrpe.service failed.
If that still does not help, check the journal for the root cause by using the journalctl -xfeu nrpe command as shown below. See the journalctl man page for all the available options.
[root@localhost ~]# journalctl -xfeu nrpe
-- Logs begin at Fri 2020-03-13 04:24:07 UTC, end at Mon 2020-09-21 21:22:01 UTC. --
Sep 21 21:20:48 localhost systemd[1]: Started Nagios Remote Program Executor.
-- Subject: Unit nrpe.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nrpe.service has finished starting up.
--
-- The start-up result is done.
Sep 21 21:20:48 localhost systemd[1]: Starting Nagios Remote Program Executor...
-- Subject: Unit nrpe.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit nrpe.service has begun starting up.
Sep 21 21:20:48 localhost systemd[1]: nrpe.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 21 21:20:48 localhost systemd[1]: Unit nrpe.service entered failed state.
Sep 21 21:20:48 localhost systemd[1]: nrpe.service failed.
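The journal can be long, and the exit status is the part that matters here: the outputs in this article show both status=1/FAILURE and status=2/INVALIDARGUMENT at different times. As a rough sketch, the filter below reads log text on standard input and prints each distinct exit status it finds. The function name summarize_exit_status is hypothetical.

```shell
# Read systemd log lines on stdin and print each distinct
# "status=N/NAME" exit status once, sorted.
# summarize_exit_status is a hypothetical helper name.
summarize_exit_status() {
    grep -o 'status=[0-9]*/[A-Z_]*' | sort -u
}
```

One way to feed it is journalctl -u nrpe --no-pager | summarize_exit_status, which reads the stored log for the unit instead of following it live.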
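You can also check how long the unit took to initialize by using the systemd-analyze blame | grep -i nrpe command as shown below. This lists startup time per unit rather than error messages, so treat it as a hint about slow startup and not as the root cause.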
[root@localhost ~]# systemd-analyze blame | grep -i nrpe
If you are running a Kubernetes cluster on the system, also check the node status by running the kubectl get nodes command as shown below.
[root@localhost ~]# kubectl get nodes
NAME            STATUS     ROLES    AGE   VERSION
192.168.0.103   NotReady   master   29d   v1.14.5
In my case the node was in the "NotReady" state, so I checked the status of the kubelet service and found some errors there. I restarted the service by using the systemctl restart kubelet command, then checked the node status again with kubectl get nodes and found that it had come back to the "Ready" state.
[root@localhost ~]# kubectl get nodes
NAME            STATUS   ROLES    AGE   VERSION
192.168.0.103   Ready    master   29d   v1.14.5
Once the node was back in the Ready state, I checked my Nagios server again and saw that all the alerts had cleared. I then tried starting the nrpe service again, and this time it started successfully.
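On a cluster with more than one node, reading the STATUS column by eye gets tedious. The sketch below filters the kubectl get nodes output shown above and prints only the nodes that are not Ready. The function name not_ready_nodes is hypothetical. It compares the STATUS column with the exact string "Ready", so a combined status such as "Ready,SchedulingDisabled" is also listed.

```shell
# Read `kubectl get nodes` output on stdin and print the names of
# nodes whose STATUS column is not exactly "Ready".
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
    # NR > 1 skips the header row; $1 is NAME and $2 is STATUS.
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

It can be used as kubectl get nodes | not_ready_nodes. Empty output means every node reported Ready.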
In your case, if you still cannot find the root cause after checking all these logs, my recommendation is to work out what changed on the system just before nrpe stopped starting, assuming it was running fine before. Finally, if everything else fails, the simplest option is to reboot the system once and see if that helps.
I hope you enjoyed this debugging session on the "nrpe.service: main process exited, code=exited, status=1/FAILURE" error. Please share your feedback in the comment box.