Cluster Monitoring And Alert

Cluster Monitoring

The Cluster Manager provides an option to monitor multiple clusters. The Monitoring page shows an overview of a running cluster's metrics and insight into each node's memory, network and HDFS usage, disk space, JVM properties, and more.

Job Details

The Job Details page shows Resource Manager metrics and job details.

Alert

Alert is a centralized notification system for all clusters maintained by the Cluster Manager. It makes you aware of critical issues that commonly lead to data loss and job failure.

The following test cases are run across clusters based on alert settings.

Test case | Description | Solution
HA notification | The cluster has no proper active and standby name node to maintain high availability. | Check whether the name node machine and Hadoop services are running. If the machine is permanently unavailable, add a new name node to replace the dead one.
Agent status | The Big Data agent is not running, or the Cluster Manager is unable to communicate with it. | Ensure the Big Data agent is running properly on the corresponding nodes, and restart it if it is not. See the section 'Big Data agent is not installed or running' on the Troubleshooting page for more details.
Cluster safe mode | The name node has entered safe mode. This can happen for many reasons, most likely when data nodes contain corrupted files or when the fsimage and edit logs are not loaded properly during initialization. | Run the following command manually on any cluster node to leave safe mode:

hdfs dfsadmin -safemode leave

To move or delete corrupted files, run the following commands, respectively:

hdfs fsck / -move

hdfs fsck / -delete

HDD - Free space check | No free disk space is available for HDFS. | Free up the required disk space on the cluster nodes.
Corrupted files check | Data nodes contain corrupted files. | To move or delete corrupted files, run the following commands, respectively:

hdfs fsck / -move

hdfs fsck / -delete

Missing replica | There are not enough data nodes to maintain the configured replication factor. | Add the required number of data nodes to maintain the configured replication factor.
Container space check | A job is running out of container space. | Increase the physical memory allocated to the container, or kill some jobs that can be run later.
Hadoop log file memory check | The size of Hadoop’s log files exceeds 10% of the total hard disk on the cluster nodes. | Back up the logs and remove them to save disk space. Log location:

drive:\Syncfusion\HadoopNode\version\SDK\Hadoop\logs

Live Data node status | All data nodes are dead. | Add the required data nodes to the cluster.
Hadoop services status | Hadoop services such as Resource Manager, Job History Server, Journal Node, or Node Manager have failed in a cluster. | Check the logs and restart the services if required.
Big Data agent version is not matched | The build versions of the Cluster Manager and the agent on a node are not the same. | Ensure that the build versions are the same.
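
The safe mode and corrupted files checks can also be approximated by hand from any cluster node. The following Python sketch is only an illustration, not how the Cluster Manager runs its test cases. It assumes that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is ON" or "Safe mode is OFF", and that the `hdfs fsck /` report contains a summary line ending in "is HEALTHY" or "is CORRUPT". The function names are hypothetical.

```python
import subprocess


def safe_mode_is_on(report):
    """Return True if any line of `hdfs dfsadmin -safemode get` output reports ON."""
    return any("safe mode is on" in line.lower() for line in report.splitlines())


def filesystem_is_corrupt(fsck_report):
    """Return True if a line of the `hdfs fsck /` report ends with 'is CORRUPT'."""
    return any(line.rstrip().endswith("is CORRUPT") for line in fsck_report.splitlines())


def run(command):
    """Run a command and return its standard output as text.

    check=False because a failing check may be signalled through a
    non-zero exit code while the report is still printed (assumption).
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


def check_cluster(run_command=run):
    """Return the commands from the table that the current state suggests."""
    advice = []
    if safe_mode_is_on(run_command(["hdfs", "dfsadmin", "-safemode", "get"])):
        advice.append("hdfs dfsadmin -safemode leave")
    if filesystem_is_corrupt(run_command(["hdfs", "fsck", "/"])):
        advice.append("hdfs fsck / -move")
    return advice
```

Calling `check_cluster()` on a node where the `hdfs` command is on the path returns a list of suggested commands. The sketch only suggests them and never runs `-safemode leave`, `-move`, or `-delete` itself, because those change cluster state.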

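The Hadoop log file check compares the size of the log folder with 10% of the total hard disk. A rough way to reproduce that arithmetic on a single node is sketched below in Python. The helper names are hypothetical, and the sketch assumes the log folder sits on the disk whose capacity should be measured.

```python
import os
import shutil

LOG_SHARE_LIMIT = 0.10  # the 10% threshold quoted in the table


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # skips broken symbolic links
                total += os.path.getsize(file_path)
    return total


def logs_exceed_limit(log_bytes, disk_bytes, limit=LOG_SHARE_LIMIT):
    """Return True if the logs take more than `limit` of the disk capacity."""
    return log_bytes > disk_bytes * limit


def check_log_directory(log_dir):
    """Compare the size of log_dir with the capacity of the disk that holds it."""
    disk_bytes = shutil.disk_usage(log_dir).total
    return logs_exceed_limit(directory_size(log_dir), disk_bytes)
```

Passing the log location from the table to `check_log_directory` returns `True` when the logs should be backed up and removed.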
Settings

You can edit the list of test cases and how frequently each test case is run on the Alert Settings page.

Alert notification through mail

To enable email alerts, select the Mail Alert checkbox on the Alert Settings page. Configure the mail settings with the following steps:

  1. Click the Email Settings button.
  2. Fill in the following details:
    • Your mail server details (SMTP server, SMTP port number).
    • Sender name.
    • Email user.
    • The username and password of the email user that will log in to the mail server as the “sender” of the alert emails.
    • Comma-separated list of email addresses that will receive the alert emails.
  3. Click the Save button.
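
Before saving, it can help to confirm that the mail server accepts the details entered above. The Python sketch below shows one way to send a test message with the standard `smtplib` module. It is not the Cluster Manager's own mail code. The function names are hypothetical, and the use of STARTTLS is an assumption about the mail server.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr


def parse_recipients(raw):
    """Split a comma-separated recipient string into clean addresses."""
    return [address.strip() for address in raw.split(",") if address.strip()]


def build_alert_message(sender_name, sender_address, recipients, subject, body):
    """Build an email with the sender name shown next to the sender address."""
    message = EmailMessage()
    message["From"] = formataddr((sender_name, sender_address))
    message["To"] = ", ".join(recipients)
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(message, smtp_server, smtp_port, username, password):
    """Log in to the mail server as the sender and deliver the message."""
    with smtplib.SMTP(smtp_server, smtp_port, timeout=30) as connection:
        connection.starttls()  # assumption: the server supports STARTTLS
        connection.login(username, password)
        connection.send_message(message)
```

If `send_alert` completes without raising an exception, the SMTP server, port, username, and password are likely correct for the settings dialog as well.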