Introduction

Monitoring is the key service needed to gain insights into an infrastructure. It needs to be continuous and on-demand to quickly detect, correlate, and analyse data for a fast reaction to anomalous behaviour. The challenge of this type of monitoring is how to quickly identify and correlate problems before they affect end-users and ultimately the productivity of the organisation. Management teams can monitor the availability and reliability of the services from a high level view down to individual system metrics and monitor the conformance of multiple SLAs. The key functional requirements are:

Monitoring of services
Reporting of availability and reliability
Visualisation of service status
Provision of dashboard interfaces
Sending of real-time alerts

The dashboard design should enable easy access to and visualisation of data for end-users. APIs should also be supported so that third parties can gather monitoring data from the system.

The key requirements of a monitoring system are:

Support for multiple entry points (different types of systems can work together)
Interoperable
High availability of the different components of the system
Loosely coupled: support APIs in the full stack so that components are independent in their development cycles
Support for multiple tenants, configurations, metrics and profiles to add flexibility and ease of customisation.

For EOSC there are two monitoring services already in place: the EOSC Core Monitoring Service and the EOSC-Exchange Monitoring Service. The former monitors the Core services and the latter monitors the services onboarded to the Marketplace.

High-level Service Reference Architecture

The service collects status (metric) results from one or more monitoring engines and delivers daily and/or monthly availability (A) and reliability (R) results for distributed services. Both status results and A/R metrics are presented through a Web UI, where a user can drill down from the availability of a site to the individual test results that contributed to the computed figure.
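As an illustration of how A/R figures can be derived from status results, the following is a minimal sketch. It assumes the commonly used definitions (availability = up time over known time, reliability = up time over known time minus scheduled downtime); this page does not state the exact formulas, and the DayTimeline structure and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DayTimeline:
    """Hours spent in each state during one day (hypothetical structure)."""
    up: float
    down: float
    unknown: float
    scheduled_downtime: float

    def known(self) -> float:
        # Time for which the status is known: everything except UNKNOWN.
        return self.up + self.down + self.scheduled_downtime


def availability(t: DayTimeline) -> float:
    """Assumed definition: up time divided by the known period."""
    known = t.known()
    return t.up / known if known > 0 else 0.0


def reliability(t: DayTimeline) -> float:
    """Assumed definition: as availability, but scheduled downtime is excluded."""
    denominator = t.known() - t.scheduled_downtime
    return t.up / denominator if denominator > 0 else 0.0


day = DayTimeline(up=20.0, down=2.0, unknown=1.0, scheduled_downtime=1.0)
print(f"A = {availability(day):.4f}, R = {reliability(day):.4f}")
```

With these assumptions, scheduled downtime lowers availability but not reliability, which is why the two figures are reported separately.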

Figure 1. High level architecture of a Monitoring service

The main components of a monitoring service are depicted in the high-level architecture diagram and described below.

Monitoring Engine(s): This service component executes the service checks against the infrastructure and delivers the metric data (probe check results) to the Messaging Service.

Sources of Truth: The Monitoring system should support a number of connector plugins that are able to fetch topology, Metrics and Factors from various sources such as the CMDB and the Providers Portal. It also offers a Metric and Profile Management Component which is used in order to define checks (probes) and associate them to service types. Each grouping of checks and service types forms a profile.

Messaging: The monitoring system depends on a Pub/Sub Messaging Service to be in place, in order to facilitate the communication between its components.

Computations & Analytics: This component of the system should include computational job definitions for ingesting data, calculating status and availability/reliability and a management service to automatically configure, deploy and execute those jobs on a distributed processing engine for stateful computations. At the same time this component analyzes the monitoring results and sends notifications based on a set of rules, to inform the users (operators, NGIs) about the status of their services.

WEB API: REST-like HTTP API service that provides access to status and availability/reliability results. It supports token-based authentication and authorization with established roles. Results are provided in JSON format.

WEB UI: The Web UI is the component used to store, consolidate and “feed” data into the web application. The global information from the primary and heterogeneous data sources is retrieved by means of the different plugins. The collected information is structured and organized within configuration files in the service and, finally, made available to the web application without the need for any further computations. This modular architecture is conceived in order to make it easy to add new data sources and to use cached information if a primary source is unavailable. The resulting data is exposed through a RESTful web service interface.

Definitions

In this section we explain the basic concepts of EOSC Monitoring.

A Tenant is an isolated instance of the ARGO Monitor service that relies on common components and provides the user with its own environment.

ARGO provides default UI and POEM URLs in the following form:

UI: https://<tenant_name>.ui.argo.grnet.gr
POEM: https://<tenant_name>.ui.argo.grnet.gr

In case custom ones are to be used, the customer is responsible for providing valid certificates and DNS aliases.

The ARGO Monitoring service requires the following topology information in order to monitor services:

the services and service endpoints they are running,
the way they are organised (e.g. groups of sites, groups of services),
the service actors (owners, admins, contact points).

Topology can be further extended with attributes needed for individual probes (e.g. service port or URL, or the path to be used in the case of storage services).

Supported topology sources are:

EOSC Resource Registry (Providers Portal):

Needs to be extended to hold the following information for Monitoring:

Service unique Id
the service endpoints,
the way they are organised (service, service components),
the service actors (owners, admins, contact points).

EGI Configuration Database (GOCDB)
EUDAT DPMT
JSON feed in the predefined format.

A Metric is a chunk of code that checks specific functionality of a given service. For example a metric such as Portal-WebCheck runs on a site and checks if the HTTP connection responds correctly or not.

A Probe is a piece of code that implements single or multiple tests. The probe must comply with the guidelines for monitoring probes.

ARGO provides a registry of probes and metrics. New probes and metrics can be added to the registry with the support of the ARGO monitoring team.

A Metric Profile is used to associate a Service with the corresponding metrics.

An Aggregation Profile defines how to aggregate service statuses into status results for a higher hierarchical grouping (e.g. a service_group). It specifies the logical rules used to aggregate individual service status computations into groups.
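A minimal sketch of such aggregation rules follows. It assumes that rules are expressed as AND/OR operations over statuses and that statuses can be ranked by severity; the ranking, the aggregate function and the example group are illustrative assumptions, not the service's actual implementation.

```python
# Assumed severity order (least to most severe); the real service may rank
# statuses differently.
SEVERITY = {"OK": 0, "WARNING": 1, "UNKNOWN": 2, "CRITICAL": 3}


def aggregate(statuses, operation):
    """Combine statuses: AND keeps the worst one, OR keeps the best one."""
    if not statuses:
        raise ValueError("at least one status is required")
    if operation == "AND":
        return max(statuses, key=SEVERITY.__getitem__)
    if operation == "OR":
        return min(statuses, key=SEVERITY.__getitem__)
    raise ValueError(f"unknown operation: {operation}")


# Hypothetical service_group: two redundant web endpoints (OR) plus one
# storage endpoint, and both service types must work for the group (AND).
web = aggregate(["CRITICAL", "OK"], "OR")
storage = aggregate(["WARNING"], "AND")
group_status = aggregate([web, storage], "AND")
print(group_status)
```

Using OR for redundant endpoints means one healthy instance is enough, while AND across service types means the group is only as healthy as its weakest required service.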

The results of the metrics are fed into the EOSC Monitoring Service computations to determine the operational state of a service during a specific period. To determine that state, all or part of the metrics that check the service's functionality are taken into account. A Metric Profile lists, for each service, the metrics whose results count towards the computation of the service's state. For example, a service WebSite runs on host1.example.com. The WebSite service should operate properly, be accessible, and offer actions such as downloading or uploading material (documents, images, etc.). Three metrics can be applied to the service to check its functionality:

Portal-WebCheck is a metric to check if the HTTP connection responds
http.download is a metric to check if download functionality operates well
http.upload is a metric to check if upload functionality operates well

The service is assumed to operate properly if it is accessible and can support downloading material. Uploading material does not affect the state of the service (whether it is working properly or not). So in the Metrics Profile, the metrics Portal-WebCheck and http.download will be defined in order to be taken into account for concluding the status of the service.
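The WebSite example can be sketched as follows. This is a simplified illustration: the METRIC_PROFILE dictionary and the service_status function are hypothetical, and the rule "OK only if every profiled metric is OK" is an assumption standing in for the real status computation.

```python
# Hypothetical metric profile: service type -> metrics that count for its state.
METRIC_PROFILE = {
    "WebSite": {"Portal-WebCheck", "http.download"},
}


def service_status(service, metric_results):
    """Derive a service status using only the metrics listed in the profile."""
    profiled = METRIC_PROFILE[service]
    relevant = {m: s for m, s in metric_results.items() if m in profiled}
    if set(relevant) != profiled:
        # At least one profiled metric has no result yet.
        return "UNKNOWN"
    return "OK" if all(s == "OK" for s in relevant.values()) else "CRITICAL"


results = {
    "Portal-WebCheck": "OK",
    "http.download": "OK",
    "http.upload": "CRITICAL",  # not in the profile, so it is ignored
}
print(service_status("WebSite", results))
```

Because http.upload is not part of the profile, its failure does not change the computed state of the WebSite service.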

Adopted Standards

REST (https://www.ics.uci.edu/~fielding/pubs/)
SAML2 (https://wiki.oasis-open.org/security/FrontPage)
X509 (https://www.rfc-editor.org/info/rfc5280)
Apache Avro (http://avro.apache.org/)

Adopted Protocols

HTTPS (https://tools.ietf.org/html/rfc2818)
Nagios Plugin API (https://nagios-plugins.org/doc/guidelines.html)
ARGO API over REST API (http://argoeu.github.io/guides/api/)

Integration Options

Use Case 1: Monitor an Onboarded Service (central one)

Introduction

This use case covers the scenario to monitor a service Onboarded to EOSC via the Providers Portal. The results of this process will become available via the EOSC Exchange Monitoring WebUI (https://argo.eosc-portal.eu). In order to start monitoring an onboarded service, several requirements should be met. In addition to the basic information provided during the onboarding process, the service provider needs to provide some extra information needed by the ARGO monitoring service, described in the section below.

Solution

In order to start monitoring a service, a customer should follow the steps described below.

Step 1 Onboard the service

Before the service can be monitored, it should be onboarded into the EOSC Portal. The procedure for service on-boarding is described in detail in the EOSC portal onboarding process wiki page.

Step 2 Provide additional info for monitoring

When the service has been successfully onboarded, the ARGO monitoring service requires some additional information. First and foremost, the monitoring service requires the probes and metrics to be associated with the service.

Once the service provider has decided on the probes/metrics they wish to use, the metrics should be mapped to the service they wish to monitor in the EOSC-Exchange Metric Profile. After the metric profile, the aggregation and thresholds profiles should also be updated.

Step 3 Start monitoring

Once all the information has been provided, the monitoring of the service starts and the ARGO monitoring Computation and Analytics component calculates availability and reliability of the service, and creates a report.

The Service Provider can have a look at the A/R and status results from the EOSC-Exchange Monitoring UI.

Use Case 2: Monitor an Infrastructure (community).

Introduction

Use case 2 covers the scenario in which the infrastructure monitoring requirements cannot be met by EOSC-Exchange Monitoring. For example, one or more of the following are required:

defining custom topology and aggregation of monitored endpoints
selecting from existing range of probes and adding custom ones
managing profiles and metrics for different services

Solution

In order to start monitoring an infrastructure, an Infrastructure Manager should follow the steps described below.

Step 1 EOSC helpdesk request

The Infrastructure Manager opens a ticket on the EOSC Helpdesk requesting the creation of an ARGO Monitoring instance for monitoring a new infrastructure. The ticket should provide at least the following information:

Infrastructure topology
Personnel responsible for managing profiles
URLs for POEM and UI components

Step 2 ARGO team initial actions

The ARGO team will create a new tenant based on the provided information and reply to the initial request once all instances are ready for use.

Step 3 Define initial monitoring profiles

The minimum set of profiles that must be defined before monitoring can start is:

List of metrics, selected from the metric repository
Metric Profile
Aggregation Profile

Step 4 Start monitoring

Once all the information has been provided, the monitoring of the service starts and the ARGO monitoring Computation and Analytics component calculates availability and reliability of the service, and creates a report. The Infrastructure Manager can have a look at the A/R and status results from the dedicated UI. Monitoring new services is described in Use Case 1.

Use Case 3: Integrate External Monitoring service.

Introduction

In order to be able to scale-out and take advantage of existing Monitoring systems, the EOSC Monitoring service is capable of accepting data from external sources. When referring to external sources we mean other monitoring engines that want to connect with the EOSC Monitoring Service. This use case is split in two different sections as follows:

Case 3.1: Supported Monitoring Engine and Operating System (Nagios on CentOS 7 or Debian 8)
Case 3.2: Other Monitoring Engine and Operating System

Solution

The connection of a monitoring system with EOSC is based mainly on the data that carry the information needed to create the final report. In this use case an external monitoring system replaces the internal monitoring engine and is thus responsible for the validity of the monitoring data that is published.

Step 1: EOSC helpdesk

The interested party opens a ticket on the EOSC Helpdesk requesting to start the process of connecting to the EOSC Monitoring Service. While preparing the request, they need to make sure their systems can provide the following information:

The type of system used
Infrastructure topology
Personnel responsible for managing the necessary profiles
URLs for POEM and UI components.

Step 2: The Monitoring team creates a new Tenant.

The monitoring team creates a new tenant on the monitoring service and at the same time asks the messaging team to create the necessary configuration on the EOSC Messaging service. The team then sends the customer the necessary instructions and access tokens to connect to the Monitoring Service.

Step 3: The monitoring team assists the interested party to create the necessary profiles.

The profiles that need to be defined are:

Metric Profile
Aggregation Profile.

Step 4: Publish Metric Data

The customer will need to configure their monitoring engine so that it starts publishing metric data via the EOSC messaging service. The EOSC Monitoring Service supports two options:

Case 3.1: Supported monitoring Engine and Operating System (Nagios on CentOS 7 or Debian 8).

If the customer uses Nagios as their monitoring tool, EOSC Monitoring offers the argo-nagios-ams-publisher tool, which is currently supported on CentOS 7 and Debian 8. argo-nagios-ams-publisher is a component acting as a bridge from Nagios to the ARGO Messaging system and finally to the ARGO Monitoring Engine. It is responsible for forming and dispatching messages that wrap up results from the monitoring engine. In order to use this solution the customer will need to:

Install argo-nagios-ams-publisher and ams-library
Configure argo-nagios-ams-publisher
Enable OCSP in Nagios:

In /etc/nagios/nagios.cfg add this configuration:

obsess_over_services=1
ocsp_command=argo_service_check
ocsp_timeout=15

Add OCSP command:

Add the following OCSP command to /etc/nagios/objects/commands.cfg:

define command {
    command_name argo_service_check
    command_line /usr/bin/ams-metric-to-queue --queue /var/spool/argo-nagios-ams-publisher/metrics/ --hostname "$HOSTNAME$" --status "$SERVICESTATE$" --summary "$SERVICEOUTPUT$" --message "$LONGSERVICEOUTPUT$" --servicestatetype "$SERVICESTATETYPE$" --actual_data "$SERVICEPERFDATA$" --service "$_SERVICESERVICE_FLAVOUR$" --metric "$_SERVICEMETRIC_NAME$"
}

All the services to be published must have the following attributes set:

define service {
    use generic-service ; Name of service template to use
    host_name grnet.gr
    service_description HTTP
    check_command check_http
    check_interval 5
    _service_flavour WebPortal ; the service
    _metric_name org.nagios.WebCheck
}

Start argo-nagios-ams-publisher by executing

service ams-publisherd start

Case 3.2 Other monitoring systems

In this case the customer cannot or does not want to use the solution described in Case 3.1. The external monitoring system must then send the monitoring data (metric data) to the EOSC Monitoring Service by its own means. The data must follow a predefined format.

The data should be stamped with their source and timestamp. Every metric should be prefixed with [source_type], following the metric naming best practices. Every metric is also labelled with the hostname and service description. These predefined messages should be sent to the EOSC Messaging service, which passes them to the computations engine, which in turn performs the calculations needed to produce the reports.

{
    "hostname": "host101.example.com",
    "service": "eu.eosc.portal.services.url",
    "metric": "org.nagios.WebCheck",
    "timestamp": "2022-01-02T00:24:38Z",
    "status": "OK",
    "tags": {
        "endpoint_group": "GroupA"
    },
    "summary": "200 OK",
    "actual_data": "time=0.085796s;;;0.000000 size=1126B;;;0",
    "monitoring_host": "monbox.example.com",
    "message": "a more detailed message about the monitoring result"
}

Here monitoring_host is the name of the external monitoring box.
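As a rough sketch, an external monitoring system could assemble such a message as shown here. The build_metric_message helper and its parameter names are hypothetical; only the field names and the timestamp layout are taken from the example message. How the message is then handed to the EOSC Messaging service is not shown, since this page does not describe that step.

```python
import json
from datetime import datetime, timezone


def build_metric_message(hostname, service, metric, status, summary,
                         monitoring_host, endpoint_group, when=None,
                         actual_data="", message=""):
    """Build one metric-data message using the field names of the example."""
    # Timestamps are rendered in UTC with a trailing "Z", as in the example.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "hostname": hostname,
        "service": service,
        "metric": metric,
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": status,
        "tags": {"endpoint_group": endpoint_group},
        "summary": summary,
        "actual_data": actual_data,
        "monitoring_host": monitoring_host,
        "message": message,
    }


msg = build_metric_message(
    hostname="host101.example.com",
    service="eu.eosc.portal.services.url",
    metric="org.nagios.WebCheck",
    status="OK",
    summary="200 OK",
    monitoring_host="monbox.example.com",
    endpoint_group="GroupA",
    when=datetime(2022, 1, 2, 0, 24, 38, tzinfo=timezone.utc),
)
print(json.dumps(msg, indent=4))
```

Passing an explicit, timezone-aware value for when keeps the output reproducible; without it the current UTC time is used.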

Metric data comes in the form of Avro files (support for JSON files is currently in development) and contains timestamped status information about the hostname, service and specific checks (metrics) that are being monitored. A typical item of metric data contains the fields listed in the schema below.

The argo.avro namespace is the type currently supported.

{"namespace": "argo.avro",
 "type": "record",
 "name": "metric_data",
 "fields": [
    {"name": "timestamp", "type": "string"},
    {"name": "service", "type": "string"},
    {"name": "hostname", "type": "string"},
    {"name": "metric", "type": "string"},
    {"name": "status", "type": "string"},
    {"name": "monitoring_host", "type": ["null", "string"]},
    {"name": "summary", "type": ["null", "string"]},
    {"name": "message", "type": ["null", "string"]},
    {"name": "tags", "type": ["null", {"name": "Tags",
                                       "type": "map",
                                       "values": ["null", "string"]
                                      }]
    }
 ]
}

Table x: The accepted format of the schema.
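Before publishing, a record can be checked against the fields of this schema. The sketch below is a simplified structural check written with the standard library only; it is not a full Avro validator, the check_record function is hypothetical, and the FIELDS table is a hand-written summary of the schema (field name mapped to whether null is allowed).

```python
# Field name -> True if the schema allows null for that field.
FIELDS = {
    "timestamp": False,
    "service": False,
    "hostname": False,
    "metric": False,
    "status": False,
    "monitoring_host": True,
    "summary": True,
    "message": True,
    "tags": True,
}


def check_record(record):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name, nullable in FIELDS.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                problems.append(f"missing required field: {name}")
        elif name == "tags":
            # tags is a map whose values are strings or null.
            if not isinstance(value, dict) or not all(
                v is None or isinstance(v, str) for v in value.values()
            ):
                problems.append("tags must map strings to strings or null")
        elif not isinstance(value, str):
            problems.append(f"{name} must be a string")
    return problems


good = {
    "timestamp": "2022-01-02T00:24:38Z",
    "service": "eu.eosc.portal.services.url",
    "hostname": "host101.example.com",
    "metric": "org.nagios.WebCheck",
    "status": "OK",
    "tags": {"endpoint_group": "GroupA"},
}
print(check_record(good))
print(check_record({"hostname": "host101.example.com"}))
```

A check of this kind only catches missing or mistyped fields early; the monitoring team's validation against the supplied topology remains the authoritative step.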

The monitoring team will validate the published metric data against the supplied topology and perform a number of dry runs to ensure that there is no issue with the supplied data. Once the metric data has been validated by the monitoring team, it becomes the main data used to compute A/R and status results.

Step 5: Start Monitoring

Once information has been provided, the monitoring of the service starts and the ARGO monitoring Computation and Analytics component calculates availability and reliability of the service, and creates a report. The Infrastructure Manager can have a look at the A/R and status results from the dedicated UI. Monitoring new services is described in Use Case 1.

Use Case 4: Combine Results of existing ARGO Tenants.

Introduction

This use case covers the scenarios where the topology and the results of multiple tenants need to be combined in a number of reports.

Prerequisites

In order to combine results from tenants A, B (example names), those tenants should be already monitored by ARGO Monitoring service complete with the following definitions for each tenant:

Latest data available: each tenant should be checked to confirm that it has an active stream of incoming monitoring data.

Topology: Each tenant should already have a well defined source of topology that includes lists of groups, endpoints and services.
Metric Profile: In simple terms, a list of all services to be checked along with all relevant metrics per service

Solution

Step 1: Open a ticket to helpdesk

In order to have results, the customer should create a ticket on the helpdesk describing:

Tenants to be used in the combined report
Services and metrics
Aggregation profile.

For each tenant that is going to take part in producing the combined results check that all of the prerequisites (mentioned in the previous section) do apply.

Step 2: Creation of the Combined Tenant.

Create a new tenant that will host the combined report. This tenant will act as a host tenant for the combined results and will rely on the data of the other tenants as input for the computations of the availability, reliability and status results.

Step 3 Start monitoring

Once all the information has been provided, the monitoring of the service starts and the ARGO monitoring Computation and Analytics component calculates availability and reliability of the services, and creates a report.

The User can have a look at the A/R and status results from the combined reports from the UI.

Use Case 5: Third-party services exploiting EOSC Monitoring data

Introduction

This use case covers the scenario in which the customer needs to use the results of the EOSC Monitoring Service in an external service/dashboard.

The customer can access the following information via an API:

A/R information about the service and its service components
Status information about the service and its service components
The topology and grouping of the service

Solution

Step 1: EOSC helpdesk

A user who wants access to this type of monitoring information will get a token with read-only access to the A/R and status results. The user sends the request to the monitoring team via the EOSC helpdesk, providing:

The name of the service that needs the information
An email address for creating the user
The type of information (A/R results, status results, both)

Step 2: Start Ingesting the data.

The monitoring team will provide the required token, together with information and guidance on how to retrieve the data.

Example used

In this example we show how the user can get the availability and reliability values and the status of the AMS (Messaging Service) (endpoint: https://msg.argo.grnet.gr) of the organisation GRNET.

The Monitoring Service checks the services at regular intervals. It runs explicit tests (checks) in order to assess the status of a service, and the results of the checks determine that status. To display status information it uses reports, where it keeps all the necessary information.

At the same time it draws useful conclusions about the monitored item via the monitoring analytics engine. One very useful conclusion is whether the item is available for use and whether it is considered reliable. To achieve this, availability/reliability values (hourly, daily, monthly) are calculated. These different types of information are also encapsulated in a report.

The EOSC monitoring service monitors the Messaging Service and performs the following checks:

cert_validity_check : a metric that checks the validity of the certificate used by the service
ams_check: a metric that checks a list of functionalities provided by the messaging service.

Based on the explanation provided above, the information about the service follows:

Definition	Value	Description
GROUP	GRNET	A collection of services
SERVICE	AMS	The type of one of the services of the collection
SERVICE endpoint	msg.argo.grnet.gr(AMS)	is defined as the combination of a hostname and Service Type. (a Service Type of AMS listening on port/s <ams-port/s> on the host msg.argo.grnet.gr is a service endpoint)
Grouping used in the report	SERVICEGROUPS	the way the services are organized (e.g. in groups of sites, in groups of services) in the monitoring engine
A/R report	Default	The place where the A/R results are provided.
Status report	Default	The place where status results are provided.

This is the configuration that the user will need in order to make the API calls.

API call examples for A/R reports

The API authenticates the user using the API key supplied in the x-api-key header. Users can specify the time granularity (monthly or daily) of the retrieved results and also the format, using the Accept header. Depending on the form of the request, the user can request a group, a service or a service endpoint.

Detailed documentation: https://argoeu.github.io/api/v3/results/

Example

For the AMS, the corresponding API call to get the A/R of the service group GRNET is:

Request for A/R results for service group GRNET

$ curl -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "x-api-key: secret-token" "https://api.argo.grnet.gr/api/v3/results/Default/SERVICEGROUPS/GRNET?start_time=2021-08-05T00:00:00Z&end_time=2021-08-05T23:59:59Z"

API call examples for status reports

The API authenticates the user using the API key supplied in the x-api-key header. Users can specify the time granularity (monthly or daily) of the retrieved results and also the format, using the Accept header. Depending on the form of the request, the user can request a group, a service or a service endpoint.

Detailed documentation: https://argoeu.github.io/api/v3/status/

Example

For the AMS, the corresponding API call to get the status of the service group GRNET is:

Request for status results for service group GRNET

$ curl -X GET -H "Accept: application/json" -H "Content-Type: application/json" -H "x-api-key: secret-token" "https://api.argo.grnet.gr/api/v3/status/Default/SERVICEGROUPS/GRNET?start_time=2021-08-05T00:00:00Z&end_time=2021-08-05T23:59:59Z"
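A third-party service would typically issue these calls from code rather than from the command line. The following sketch builds the same kind of request for both A/R results and status results with the Python standard library. The build_request helper and its parameter names are hypothetical; the URL layout and headers are copied from the curl examples, and "secret-token" is a placeholder for the token issued by the monitoring team.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

API_ROOT = "https://api.argo.grnet.gr/api/v3"


def build_request(kind, report, group_type, group, start_time, end_time, api_key):
    """Build a GET request for A/R results (kind="results") or status results."""
    if kind not in ("results", "status"):
        raise ValueError("kind must be 'results' or 'status'")
    # Percent-encode each path segment so unusual group names stay valid.
    path = "/".join(quote(part, safe="") for part in (kind, report, group_type, group))
    # Keep ":" unescaped so the URL matches the curl examples exactly.
    query = urlencode({"start_time": start_time, "end_time": end_time}, safe=":")
    return Request(
        f"{API_ROOT}/{path}?{query}",
        headers={"Accept": "application/json", "x-api-key": api_key},
        method="GET",
    )


req = build_request(
    "results", "Default", "SERVICEGROUPS", "GRNET",
    "2021-08-05T00:00:00Z", "2021-08-05T23:59:59Z", "secret-token",
)
print(req.get_method(), req.full_url)
```

The request is only constructed here; passing it to urllib.request.urlopen would send it, and the JSON body could then be parsed with json.load. Switching kind to "status" yields the status call for the same group and period.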
