Eliatra Suite

2023-05-11

Introducing Alerting Plus

The Eliatra Suite extends the capabilities of OpenSearch by adding new features and functionalities that make running OpenSearch on an enterprise level much more effortless. This article introduces Alerting Plus, part of the Eliatra Suite.

Eliatra Suite Alerting Plus

Alerting Plus is an advanced notification system that helps you ingest, monitor, examine, and correlate data from various sources, including OpenSearch and other systems. By setting up watches and conditions, Alerting Plus can send notifications if anomalies are detected in your data. These can be simple notifications via Email, Slack, PagerDuty, etc. You can also set up more sophisticated escalation models that use different channels depending on the severity of the issue at hand.
Let’s look at the building blocks of Alerting Plus and how we can use them to set up a typical use case, namely log file monitoring.

Use Case: Application Log File Monitoring

In our use case, we use OpenSearch to ingest log files from a customer-facing web application. At ingest time, each log line is transformed into a document with several fields like the timestamp, the log message, and the HTTP response code. The goal is to get notified if anomalies regarding the application’s error rate are detected.
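For illustration, a single ingested log document could look roughly like this (the message field and the concrete values are made up for this example; the timestamp and response_code fields are the ones the watch below relies on):
{
  "timestamp": "2023-05-11T09:42:17Z",
  "message": "Internal server error while processing request",
  "response_code": 500
}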

Watches

First, we need to set up a watch. A watch fetches data from one or more input sources at defined intervals and sends notifications if one or more conditions/checks are met. It can optionally also perform calculations, transformations, and correlations on the data. The basic structure of a watch looks like this:
{
  "trigger": {
    ...
  },
  "checks": [
    ...
  ],
  "actions": [
    ...
  ]
}
Sounds too abstract? Let’s break it down.
Since we want to check the application logs regularly, we first set up a trigger that executes the watch at regular intervals. Alerting Plus comes with a range of predefined triggers, such as hourly or daily triggers, as well as triggers based on cron expressions. In our case, we set up a 10-minute interval trigger:
"trigger": {
  "schedule": {
    "interval": "10m"
   }
}
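If you need more control over the schedule, you can use a cron-based trigger instead. As a rough sketch (the exact field name and the supported cron format are assumptions, so check the documentation), a trigger that fires every 10 minutes during business hours could look like this:
"trigger": {
  "schedule": {
    "cron": "0 */10 9-17 * * ?"
  }
}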
Next, we need to access the data we want to examine. Since our data is already stored in OpenSearch, we define a Search Input for our watch. A search input executes an OpenSearch query on one or more indices and makes the result available to the subsequent processing steps. Here, we define a simple query against an index called “logs” that selects all log entries from the last 10 minutes with an HTTP response code of 500.
"checks": [
  {
	"type": "search",
	...
	"request": {
		"indices": [
			"logs"
		],
		"body": {
		  "query": {
		    "bool": {
		      "must": [
		        {
		          "range": {
		            "timestamp": {
		              "gte": "now-10m"
		            }
		          }
		        },
		        {
		          "match": {
		            "response_code": 500
		          }
		        }
		      ]
		    }
		  }
        }
      }
    }
  ] 
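The elided part of the check above would also give the search result a name; the condition in the next section refers to it as searchresult. Under that assumption, the data available to the subsequent steps contains the regular OpenSearch search response, roughly like this (the hit count of 7 is just an example):
"data": {
  "searchresult": {
    "hits": {
      "total": {
        "value": 7
      },
      ...
    }
  }
}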

Conditions

Conditions control the execution flow. A condition checks whether a specific value or threshold is reached and decides whether the watch should continue execution. We want to send out a notification if the number of errors is above 5, which corresponds to the number of documents returned by the query above.
{
  "type": "condition",
  "name": "error rate above 5",
  "source": "data.searchresult.hits.total.value > 5"
}
If the number of documents found is 5 or fewer, everything seems okay, and the execution stops here. Otherwise, Alerting Plus executes the defined actions.

Actions

Actions can be used to send notifications by e-mail or via messaging services such as Slack or PagerDuty. Actions can also write data back to OpenSearch indices. A general-purpose mechanism for invoking external services is the webhook action, which makes HTTP requests to configurable endpoints. An example email action is sketched after the list of supported action types below.
Alerting Plus supports the following action types:
    Email
    Slack
    PagerDuty
    JIRA
    Webhooks
    Index - to write data back to OpenSearch
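As a rough illustration, an email action could look similar to the following sketch. The field names account, to, subject, and text_body are assumptions made for this example; consult the Alerting Plus documentation for the exact schema.
{
  "type": "email",
  "name": "notify_devops",
  "account": "default_mail",
  "to": ["devops@example.com"],
  "subject": "Elevated error rate detected",
  "text_body": "More than 5 errors were logged in the last 10 minutes."
}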

Escalation Model: Severity Levels

In a real-world scenario, you probably want more fine-grained control over the notifications sent out. Not all anomalies are created equal. If we detect a slight increase in error rates, we may want to email the DevOps team. If we see a sudden spike, we want to inform the person on duty immediately via PagerDuty.
This is where the Alerting Plus severity levels come into play. You can associate any metric a watch observes with a severity level and assign actions to these severity levels. The actions will then be executed only if these severity levels are reached.
In our use case, we can define two severity levels based on the error count: “warning”, reached at a threshold of 5, and “error”, reached at a threshold of 10.
"severity": {
    "value": "data.searchresult.hits.total.value",
    "order": "ascending",
    "mapping": [
        {
            "threshold": 5,
            "level": "warning"
        },
        {
            "threshold": 10,
            "level": "error"            	  
        }           
    ]
}
We can then map our actions to severity levels:
"actions": [
    {
        "type": "email",
        "severity": ["warning"],
        ...
    },
    {
        "type": "pagerduty",
        "severity": ["error"],
        ...
    }
]

Throttling and Acknowledgement

Usually, it takes a while until someone has identified and fixed the problem with our application. So that you do not get flooded with notifications in the meantime, Alerting Plus lets you throttle or acknowledge actions.
A throttle limits how often an action is executed within a given period.
{
  "actions": [
    {
      "type": "email",
      "throttle_period": "1h",
      ...
    }
  ]
}
By acknowledging an action, you silence it until it is unacknowledged again. Suppose a technician is already aware of the issue. To avoid being flooded with notifications while analyzing and fixing the problem, the technician can acknowledge the corresponding action. This can be done using the REST API or the Alerting Plus Dashboards UI. It is also possible to add acknowledgement links to any notification via template variables.

Resolve Actions

While getting notified when an abnormal error level is detected is crucial, knowing when things are back to normal is equally important. This is the job of resolve actions. A resolve action is executed when one of its severity levels was active before but is no longer active.
Let’s say we want to send out a message on Slack that the error level of our application has decreased from severity level “error” to “warning”. To do so, we can define a resolve action like:
 "resolve_actions": [
     {
         "type": "slack",
         "resolves_severity": [ "error" ],
         "text": "Severity level is now ; it was before: . The error rate has decreased from  to "
     }
 ]

Access Control: Multi-Tenancy Integration

Alerting Plus is fully integrated with our Eliatra Suite security solution: you can define which users can access watches and notification channels by leveraging the multi-tenancy capabilities of Eliatra Suite Security Plus. That way, you always have control over who can create, change, or delete a watch or a notification channel.

Next Steps

We will follow up this overview article with posts explaining each Alerting Plus feature in more detail. In the meantime, download and install the Eliatra Suite and give it a spin. We appreciate any questions or feedback on the Eliatra Forum.