Eliatra Suite Alerting Plus
The Eliatra Suite extends the capabilities of OpenSearch with features that make running OpenSearch at an enterprise level considerably easier. This article introduces Alerting Plus, part of the Eliatra Suite.
Alerting Plus is an advanced notification system that helps you ingest, monitor, examine, and correlate data from various sources, including OpenSearch and other systems. By setting up watches and conditions, Alerting Plus can send notifications if anomalies are detected in your data. These can be simple notifications via Email, Slack, PagerDuty, etc. You can also set up more sophisticated escalation models that use different channels depending on the severity of the issue at hand.
Let’s look at the building blocks of Alerting Plus and how we can use them to set up a typical use case, namely log file monitoring.
Use Case: Application Log File Monitoring
In our use case, we use OpenSearch to ingest log files from a customer-facing web application. At ingest time, each log line is transformed into a document with several fields like the timestamp, the log message, and the HTTP response code. The goal is to get notified if anomalies regarding the application’s error rate are detected.
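To make this concrete, here is a sketch of what a single ingested log line could look like as an OpenSearch document. The field names timestamp, message, and response_code follow the description above; the values are purely illustrative.
{
  "timestamp": "2023-06-01T14:32:10Z",
  "message": "Internal Server Error while processing /checkout",
  "response_code": 500
}
The watch we build below queries the timestamp and response_code fields of these documents.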
Watches
First, we need to set up a watch. A watch fetches data from one or more input sources at defined intervals and sends notifications if one or more conditions/checks are met. It can optionally also perform calculations, transformations, and correlations on the data. The basic structure of a watch looks like this:
{
  "trigger": {
    ...
  },
  "checks": [
    ...
  ],
  "actions": [
    ...
  ]
}
Sounds too abstract? Let’s break it down.
Since we want to check the application logs regularly, we first set up a trigger that executes in 10-minute intervals. Alerting Plus comes with a range of predefined triggers, such as hourly or daily triggers and triggers based on cron expressions. In our case, we set up a 10-minute interval trigger:
copy"trigger": {
"schedule": {
"interval": "10m"
}
}
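If you prefer a cron-based schedule instead of a fixed interval, the trigger could look roughly like the sketch below. Note that both the cron key and the Quartz-style expression format are assumptions made for illustration; check the Alerting Plus documentation for the exact syntax it accepts.
"trigger": {
  "schedule": {
    "cron": "0 */10 * * * ?"
  }
}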
Next, we need to access the data we want to examine. Since our data is already stored in OpenSearch, we define a Search Input for our watch. A search input executes an OpenSearch query on one or more indices and makes the result available to the subsequent processing steps. We define a simple query that selects all log file entries with error code 500 in the last 10 minutes from an index called “logs”.
copy"checks": [
{
"type": "search",
...
"request": {
"indices": [
"logs"
],
"body": {
"query": {
"bool": {
"must": [
{
"range": {
"timestamp": {
"gte": "now-10m"
}
}
},
{
"match": {
"response_code": 500
}
}
]
}
}
}
}
}
]
Conditions
Conditions control the execution flow: a condition checks whether a specific value or threshold is reached and decides whether the watch should continue execution. We want to send out a notification if the number of errors is above 5, which corresponds to the number of documents found by the query above.
{
  "type": "condition",
  "name": "error rate above 5",
  "source": "data.searchresult.hits.total.value > 5"
}
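For context, the expression data.searchresult.hits.total.value points into the runtime data that the search input stores for the subsequent steps, assuming the input above was given the name "searchresult" (the name was elided in the snippet). The relevant part of the standard OpenSearch search response stored there might look like this, with illustrative values:
"searchresult": {
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "hits": [ ... ]
  }
}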
If the number of documents found is 5 or below, everything seems okay, and the execution stops here. Otherwise, Alerting Plus executes the defined actions.
Actions
Actions can be used to send notifications by email or other messaging services such as Slack or PagerDuty. Actions can also write data back to OpenSearch indices. A general-purpose mechanism for invoking external services is the webhook action, which makes HTTP requests to configurable endpoints (see the sketch after the list below).
Alerting Plus supports the following action types:
Email
Slack
PagerDuty
JIRA
Webhooks
Index - to write data back to OpenSearch
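As an illustration of how such an action could be configured, here is a sketch of a webhook action that posts a short message to an internal endpoint. The attribute names inside the request object (method, url, body), the action name, and the endpoint URL are assumptions for illustration only; the actual Alerting Plus attribute names may differ.
"actions": [
  {
    "type": "webhook",
    "name": "notify-ops-endpoint",
    "request": {
      "method": "POST",
      "url": "https://ops.example.com/hooks/alerting",
      "body": "More than 5 errors in the last 10 minutes in the logs index"
    }
  }
]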
Escalation Model: Severity Levels
In a real-world scenario, you probably want more fine-grained control over the notifications sent out. Not all anomalies are created equal. If we detect a slight increase in error rates, we may want to email the DevOps team. If we see a sudden spike, we want to inform the person on duty immediately via PagerDuty.
This is where the Alerting Plus severity levels come into play. You can associate any metric a watch observes with a severity level and assign actions to these severity levels. The actions are then executed only if their severity level is reached.
In our use case, we can define two severity levels, one for error rates between 5 and 10, and one for error rates of 10 or more.
copy"severity": {
"value": "data.searchresult.hits.total.value",
"order": "ascending",
"mapping": [
{
"threshold": 5,
"level": "warning"
},
{
"threshold": 10,
"level": "error"
}
]
}
We can then map our actions to severity levels:
copy"actions": [
{
"type": "email",
"severity": ["warning"],
...
},
{
"type": "pagerduty",
"severity": ["error"],
...
}
]
Throttling and Acknowledgement
Usually, it takes a while until someone has identified and fixed the problem with our application. To avoid being flooded with notifications in the meantime, Alerting Plus offers the capability to throttle or acknowledge actions. A throttle limits the number of times an action is executed within a specific period.
{
  "actions": [
    {
      "type": "email",
      "throttle_period": "1h",
      ...
    }
  ]
}
By acknowledging an action, you silence it until it is unacknowledged again. Suppose a technician is already aware of the issue. To avoid being flooded with further notifications, the technician can acknowledge the corresponding action while analyzing and fixing the problem. This can be done using the REST API or the Alerting Plus Dashboards UI. It is also possible to add acknowledge links to any notification via template variables.
Resolve Actions
While getting notified when an abnormal error level is detected is crucial, knowing when things are back to normal is equally important. This is the job of resolve actions. A resolve action is executed when one of the severity levels it is mapped to was active before but is not active anymore.
Let’s say we want to send out a message on Slack that the error level of our application has decreased from severity level “error” to “warning”. To do so, we can define a resolve action like:
copy "resolve_actions": [
{
"type": "slack",
"resolves_severity": [ "error" ],
"text": "Severity level is now ; it was before: . The error rate has decreased from to "
}
]
Access Control: Multi-Tenancy Integration
Alerting Plus is fully integrated with our Eliatra Suite security solution: you can define which users can access watches and notification channels by leveraging the multi-tenancy capabilities of Eliatra Suite Security Plus. That way, you always have control over who can create, change, or delete a watch or a notification channel.
Next Steps
We will follow up this overview article with posts explaining each Alerting Plus feature in more detail. In the meantime, download and install the Eliatra Suite and give it a spin. We appreciate any questions or feedback on the Eliatra Forum.