RealOpInsight Concepts Guide
This document covers the concepts and the technology that underlie RealOpInsight.
Why RealOpInsight
Monitoring in operations environments such as Network Operations Centers (NOC) and large data centers is highly demanding and requires reacting quickly when incidents arise:
- On the one hand, you need to quickly evaluate the impact of incidents on business processes, so that the incidents with the highest impact are recovered first.
- On the other hand, you need to report specific incidents to the operators who are able to address them quickly.
Dealing with these issues from the native dashboards provided by monitoring systems such as Nagios, Zabbix and Zenoss is not easy, and it becomes especially challenging as the number of monitored devices increases.
That's where RealOpInsight comes in.
Before moving on to how it works, let's introduce some concepts you need to understand.
Terminology
- Device: any physical component of an IT infrastructure, e.g. servers, routers, switches.
- Incident: any disruption to the normal operation of a device or of a service that the device enables, e.g. the failure of a hard drive, the crash of a process, a network failure.
- Monitoring item: a probe that reports the status of a device with respect to a given type of incident. E.g. a monitoring item can check the availability of the HTTP service on a web server.
  - For Nagios, Shinken, Icinga, Groundwork, op5, Centreon and other Nagios-derived systems, a monitoring item corresponds to a 'check'. E.g. check_mysql monitors the status of the MySQL daemon.
  - For Zabbix, it corresponds to a 'trigger'. E.g. the ping trigger monitors the availability of a device.
  - For Zenoss, it corresponds to a 'component'. E.g. the httpd component checks the status of the httpd daemon.
- Data point: defines the relationship between a monitoring item and a device. In the RealOpInsight context, each data point is uniquely identified by an identifier of the form device_id/item_id. Hence:
  - For Nagios, Shinken, Icinga, Groundwork, op5, Centreon and other Nagios-derived systems, a data point is identified with the pattern host_name/service_description.
  - For Zabbix, it is identified with the pattern host/trigger_name.
  - For Zenoss, it is identified with the pattern device_name/component_name.
- Severity of incident: the level of impact an incident has on the health of a device or of a service enabled by the device. For example:
  - Nagios, Shinken, Icinga, Groundwork, op5, Centreon and other Nagios-derived systems define four severity levels, also known as states: OK, WARNING, CRITICAL and UNKNOWN.
  - Zabbix defines six severity levels: NOT CLASSIFIED, INFORMATION, WARNING, AVERAGE, HIGH and DISASTER.
  - Zenoss also defines six severity levels: CLEAR, DEBUG, INFO, WARNING, ERROR and CRITICAL.
- Service: in the RealOpInsight context, we distinguish two kinds of services:
  - Native services, also called native checks: all the services directly related to data points. E.g. a service related to an application process check.
  - Business process services: high-level services not directly related to data points. E.g. a backup service or a hosting service.
- Monitored platform: a set of monitored services. In the RealOpInsight context, the services are related to one another within a hierarchy of services, as described in the next section.
- Severity propagation rule: describes how the severity of a service is propagated to its parent within a hierarchy.
- Severity calculation rule: describes how the severity of a service is calculated according to the severities of its sub services.
- Dashboard: a visualization interface offering a simple, summarized view of a monitored platform.
- Nagios-based system and Nagios-derived system: both refer to Nagios itself, Shinken, Centreon, Icinga, GroundWork or any other monitoring system that relies on the same concepts as Nagios.
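
To make the data point identifier pattern concrete, here is a minimal Python sketch. The `DataPoint` class and the `parse_data_point` helper are hypothetical names introduced for illustration; they are not part of RealOpInsight. The parser also assumes that device names never contain a `/`.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    """A (device, monitoring item) pair, as defined in the terminology."""
    device_id: str
    item_id: str

    @property
    def identifier(self) -> str:
        # Pattern from the guide: device_id/item_id
        return f"{self.device_id}/{self.item_id}"


def parse_data_point(identifier: str) -> DataPoint:
    """Split an identifier on its first '/'.

    Assumption: the device part never contains '/', so everything
    after the first '/' belongs to the monitoring item.
    """
    device_id, sep, item_id = identifier.partition("/")
    if not sep or not device_id or not item_id:
        raise ValueError(f"not a device_id/item_id identifier: {identifier!r}")
    return DataPoint(device_id, item_id)


# Nagios-style data point: host_name/service_description
nagios_dp = DataPoint("mysql-server", "Root Partition")
print(nagios_dp.identifier)
```

The same structure covers Zabbix (host/trigger_name) and Zenoss (device_name/component_name); only the meaning of the two parts changes.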
RealOpInsight Dashboard Management Concepts
This section introduces:
- The organisation of a monitoring hierarchy
- How the severity of incidents is managed
- How alarm messages are handled
Dashboard as a Hierarchy of Services
The hierarchy of a monitored platform comprises two kinds of services:
- Native Checks: represented by chk in the figure, these services are associated with data points. The status of a native check service therefore depends on the status of the related data point. The statuses of data points are updated periodically with data retrieved from the monitoring server.
- Business Processes: abstractions defining high-level services such as operating systems, network devices, applications, database systems, storage areas/devices, web engines, a hosting service, a downloading service, and so forth.
NOTE: To ensure the consistency of the hierarchy, every leaf node should be a native check, and a business process should never be a leaf node. A business process node can have one or more child nodes, which can be native checks or other business processes.
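
The consistency rule in the note can be checked mechanically. The following Python sketch models a hierarchy as a tree and reports every business process that ends up as a leaf. `ServiceNode`, `find_inconsistencies` and the two kind constants are hypothetical names chosen for this example, not RealOpInsight data structures.

```python
from dataclasses import dataclass, field
from typing import List

NATIVE_CHECK = "native_check"
BUSINESS_PROCESS = "business_process"


@dataclass
class ServiceNode:
    name: str
    kind: str  # NATIVE_CHECK or BUSINESS_PROCESS
    children: List["ServiceNode"] = field(default_factory=list)


def find_inconsistencies(node: ServiceNode) -> List[str]:
    """Return one message per business process that has no child node."""
    problems = []
    if node.kind == BUSINESS_PROCESS and not node.children:
        problems.append(f"{node.name}: a business process must not be a leaf node")
    for child in node.children:
        problems.extend(find_inconsistencies(child))
    return problems


# A small hierarchy where "Backup service" was left without any native check.
hosting = ServiceNode("Hosting service", BUSINESS_PROCESS, [
    ServiceNode("Database", BUSINESS_PROCESS, [
        ServiceNode("mysql-server/Root Partition", NATIVE_CHECK),
    ]),
    ServiceNode("Backup service", BUSINESS_PROCESS),
])

for problem in find_inconsistencies(hosting):
    print(problem)
```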

Severity Model and Incident Management
To deal with the various models of severity used by the different underlying monitoring systems, RealOpInsight relies on a unified severity model. This model comprises five severity levels: NORMAL, MINOR, MAJOR, CRITICAL and UNKNOWN. The following table describes the relationship among the different models.
| Severity | Nagios State | Zabbix Severity | Zenoss Severity |
|---|---|---|---|
| NORMAL | OK | CLEAR | CLEAR |
| MINOR | - | INFORMATION, WARNING | DEBUG |
| MAJOR | WARNING | AVERAGE | WARNING |
| CRITICAL | CRITICAL | HIGH, DISASTER | ERROR, CRITICAL |
| UNKNOWN | UNKNOWN | NOT CLASSIFIED | - |
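
As an illustration, the mapping in the table can be written as a lookup structure. This Python sketch is not RealOpInsight code: the `Severity` enum, its numeric values, the `to_unified` helper, and the choice to treat any native value missing from the table as UNKNOWN are all assumptions made for the example.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric values are an assumption; the guide only names the levels.
    NORMAL = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3
    UNKNOWN = 4


# One entry per cell of the table above.
NATIVE_TO_UNIFIED = {
    "nagios": {
        "OK": Severity.NORMAL,
        "WARNING": Severity.MAJOR,
        "CRITICAL": Severity.CRITICAL,
        "UNKNOWN": Severity.UNKNOWN,
    },
    "zabbix": {
        "CLEAR": Severity.NORMAL,
        "INFORMATION": Severity.MINOR,
        "WARNING": Severity.MINOR,
        "AVERAGE": Severity.MAJOR,
        "HIGH": Severity.CRITICAL,
        "DISASTER": Severity.CRITICAL,
        "NOT CLASSIFIED": Severity.UNKNOWN,
    },
    "zenoss": {
        "CLEAR": Severity.NORMAL,
        "DEBUG": Severity.MINOR,
        "WARNING": Severity.MAJOR,
        "ERROR": Severity.CRITICAL,
        "CRITICAL": Severity.CRITICAL,
    },
}


def to_unified(system: str, native_severity: str) -> Severity:
    """Translate a native severity name into the unified model."""
    try:
        return NATIVE_TO_UNIFIED[system.lower()][native_severity.upper()]
    except KeyError:
        # Assumption: anything the table does not list is reported as UNKNOWN.
        return Severity.UNKNOWN


print(to_unified("nagios", "WARNING").name)
print(to_unified("zabbix", "Disaster").name)
```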
When an incident enters the RealOpInsight Engine, its severity is processed and propagated according to well-defined rules. These rules allow you to:
- Aggregate the severities of two or more incidents to calculate the status of the related business process.
- Decrease or increase the severity of incidents according to your monitoring needs.
To better understand how these rules can be used, let's take an example. Assume you have a storage service that relies on two mirrored disks. If one of the disks fails, the storage service will still operate, albeit in a degraded mode. In such a case, we would not want to report the status of the storage area as critical. Instead, since it still operates, we would like to set its status to major.
RealOpInsight combines five rules, which enable various kinds of advanced incident management.
| Rule | Type | Description |
|---|---|---|
| HIGH SEVERITY | Calculation rule | The severity of the related service is determined by the severity of its sub service having the highest severity |
| AVERAGE SEVERITY | Calculation rule | The severity of the related service is determined by aggregating the severities of its sub services |
| DECREASE | Propagation rule | The severity of the related service is decreased before being propagated to its parent service |
| INCREASE | Propagation rule | The severity of the related service is increased before being propagated to its parent service |
| UNCHANGED | Propagation rule | The severity of the related service is propagated as is to its parent service |
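
The five rules can be sketched in Python and applied to the mirrored-disk example. This is a simplified model rather than the RealOpInsight Engine itself. It makes three assumptions: UNKNOWN is left out because the guide does not say how it orders against the other levels, DECREASE and INCREASE shift the severity by exactly one level, and AVERAGE SEVERITY is read as the arithmetic mean of the sub-service levels rounded half up. The function names are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    NORMAL = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def propagate(severity: Severity, rule: str) -> Severity:
    """Apply a propagation rule before the severity reaches the parent."""
    if rule == "DECREASE":
        return Severity(max(severity - 1, Severity.NORMAL))
    if rule == "INCREASE":
        return Severity(min(severity + 1, Severity.CRITICAL))
    if rule == "UNCHANGED":
        return severity
    raise ValueError(f"unknown propagation rule: {rule}")


def calculate(child_severities: Iterable[Severity], rule: str) -> Severity:
    """Apply a calculation rule to the severities received from sub services."""
    values = list(child_severities)
    if not values:
        raise ValueError("a business process needs at least one sub service")
    if rule == "HIGH SEVERITY":
        return max(values)
    if rule == "AVERAGE SEVERITY":
        # Assumption: arithmetic mean of the levels, rounded half up.
        return Severity(int(sum(values) / len(values) + 0.5))
    raise ValueError(f"unknown calculation rule: {rule}")


# Mirrored disks: disk 1 has failed, disk 2 is healthy.
# Each disk check uses DECREASE so that a single failure reaches the parent as MAJOR.
disk_states = [Severity.CRITICAL, Severity.NORMAL]
propagated = [propagate(state, "DECREASE") for state in disk_states]
storage_severity = calculate(propagated, "HIGH SEVERITY")
print(storage_severity.name)
```

With DECREASE on the disk checks and HIGH SEVERITY on the storage service, one failed disk surfaces as MAJOR on the storage service instead of CRITICAL, which matches the intent of the example.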
Alarm Messages and Contextualization
Message customization is one of the key features enabled by RealOpInsight. To better understand how that works, we'll proceed with an example.
Illustration Example
Assume that, to monitor the root partition of a database server, we have the following definition in our Nagios configuration:
define service{
    use                     local-service
    host_name               mysql-server
    service_description     Root Partition
    check_command           check_local_disk!30%!10%!/
}
If the free space on that partition drops below 30% (the warning threshold) of the total disk space:
- Nagios reports an alert such as "DISK WARNING - free space: / 58483 MB (28% inode=67%)".
- Instead of this raw message, RealOpInsight allows you to have a more human-readable message such as "The free space in the root partition of the machine mysql-server is less than 30%".
In the RealOpInsight Editor, you just need to set a message of the form "The free space in the root partition of the machine {hostname} is less than {threshold}". Here {hostname} and {threshold} are tags that are automatically replaced at runtime with contextual information before the message is printed on the Message Console.
Supported Contextualization Tags
Here is the list of supported tags (the curly braces are required):
- {hostname}: replaced with the hostname of the machine to which the incident is related.
- {threshold}: replaced with the threshold defined in the check command. Currently this tag is only supported for Nagios.
- {plugin_output}: replaced with the native message returned by the command (e.g. PING ok - Packet loss = 0%, RTA = 0.80 ms).
- {check_name}: replaced with the name of the check component. E.g. check_local_disk.
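
The tag substitution described above can be sketched as follows. This is an illustration in Python, not the RealOpInsight implementation. `render_message` and `context_from_nagios` are hypothetical helpers, and the way the threshold is picked (first `!`-separated argument for WARNING, second for CRITICAL) is an assumption that happens to fit the check_local_disk example.

```python
import re

# Only the four tags listed above are recognized; anything else is left as is.
TAG_PATTERN = re.compile(r"\{(hostname|threshold|plugin_output|check_name)\}")


def render_message(template: str, context: dict) -> str:
    """Replace each supported tag in a single pass; unknown values stay untouched."""
    return TAG_PATTERN.sub(
        lambda match: str(context.get(match.group(1), match.group(0))),
        template,
    )


def context_from_nagios(host_name: str, check_command: str,
                        state: str, plugin_output: str) -> dict:
    """Build the tag values from a Nagios service definition and its last result."""
    check_name, *args = check_command.split("!")
    context = {
        "hostname": host_name,
        "check_name": check_name,
        "plugin_output": plugin_output,
    }
    # Assumption: warning threshold is the first argument, critical the second.
    index = {"WARNING": 0, "CRITICAL": 1}.get(state)
    if index is not None and index < len(args):
        context["threshold"] = args[index]
    return context


context = context_from_nagios(
    host_name="mysql-server",
    check_command="check_local_disk!30%!10%!/",
    state="WARNING",
    plugin_output="DISK WARNING - free space: / 58483 MB (28% inode=67%)",
)
template = ("The free space in the root partition of the machine "
            "{hostname} is less than {threshold}")
message = render_message(template, context)
print(message)
```

Doing the replacement in a single regular-expression pass avoids re-expanding tags that might appear inside a substituted value such as the plugin output.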
How Does It Work?
To ease its integration, even into existing monitoring environments, RealOpInsight has been designed to be loosely coupled with the underlying monitoring system. The resulting architecture, illustrated in the figure below, makes the monitoring of your business processes easier.

- An Operations Console on each operator workstation shows the monitoring dashboard to the operators/administrators.
- The Operations Console is updated periodically with data retrieved from the monitoring server.
- The way of retrieving the monitoring data differs according to the monitoring system:
  - For Nagios, Shinken, Icinga, Groundwork, op5, Centreon and their derived systems, a specific daemon must be enabled on the monitoring server to retrieve data from the status.dat file. Since version 2.3.0, RealOpInsight also supports networked Livestatus APIs (e.g. Shinken Livestatus API or MK Livestatus over a TCP socket) as alternatives to the Daemon Service.
  - For Zabbix and Zenoss, no daemon is necessary. RealOpInsight relies on their RPC APIs.
- In all cases, the interactions between the Operations Consoles and the monitoring servers rely on a messaging mechanism.
- Every request from the Operations Console to the monitoring server must be authenticated before the server takes it into account. This protects data against unauthorized access:
  - For Nagios, Shinken, Icinga, Groundwork, op5, Centreon and their derived systems, message exchanges between the Daemon and the Operations Console are authenticated through a token enabled on the Daemon Service.
  - For Zabbix and Zenoss, you just need a user account with suitable permissions to access their respective APIs.
- Access to the RealOpInsight GUIs requires local credentials.
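
To give an idea of what retrieving data over a networked Livestatus socket involves, here is a minimal Python sketch using the Livestatus query language (`GET`, `Columns`, `OutputFormat`). It is not necessarily how RealOpInsight issues its queries. The function names, the selected columns and the host and port passed to `fetch_services` are assumptions for illustration.

```python
import json
import socket


def build_services_query() -> str:
    """Livestatus query for the status of every service; a blank line ends the query."""
    return (
        "GET services\n"
        "Columns: host_name description state plugin_output\n"
        "OutputFormat: json\n"
        "\n"
    )


def parse_services_response(raw: str) -> dict:
    """Index the returned rows by data point identifier (host_name/service_description)."""
    rows = json.loads(raw)
    return {
        f"{host}/{description}": {"state": state, "plugin_output": output}
        for host, description, state, output in rows
    }


def fetch_services(host: str, port: int, timeout: float = 5.0) -> dict:
    """Send the query over TCP and read the reply until the server closes the socket."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_services_query().encode("utf-8"))
        conn.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_services_response(b"".join(chunks).decode("utf-8"))


# Parsing a canned reply, so the example runs without a monitoring server.
sample = '[["mysql-server", "Root Partition", 1, "DISK WARNING - free space: / 58483 MB"]]'
statuses = parse_services_response(sample)
print(statuses["mysql-server/Root Partition"]["state"])
```

Against a real server you would call `fetch_services` with the address and port on which Livestatus listens; both depend on your installation.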