# Colmet - Collecting metrics about jobs running in a distributed environment
## Introduction:
Colmet is a monitoring tool that collects metrics about jobs running in a
distributed environment, especially on clusters and grids. It currently
provides several backends:
- Input backends:
- taskstats: fetch task metrics from the Linux kernel
- rapl: real-time power consumption metrics from Intel processors
- perfhw: perf_event counters
- jobproc: get process information from /proc
- ipmipower: get power metrics from IPMI
- temperature: get temperatures from /sys/class/thermal
- infiniband: get InfiniBand/Omni-Path network metrics
- lustre: get Lustre filesystem statistics
- Output backends:
- elasticsearch: store the metrics in Elasticsearch indices
- hdf5: store the metrics in HDF5 files on the filesystem
- stdout: display the metrics on the terminal
It uses ZeroMQ to transport the metrics across the network.
It is currently bound to the [OAR](http://oar.imag.fr) RJMS.
A Grafana [sample dashboard](./graph/grafana) is provided for the Elasticsearch backend.


## Installation:
### Requirements
- a Linux kernel that supports
- Taskstats
- intel_rapl (for the RAPL backend)
- perf_event (for the perfhw backend)
- ipmi_devintf (for the ipmipower backend)
- Python Version 2.7 or newer
- python-zmq 2.2.0 or newer
- python-tables 3.3.0 or newer
- python-pyinotify 0.9.3-2 or newer
- python-requests
- For the Elasticsearch output backend (recommended for sites with > 50 nodes)
- An Elasticsearch server
- A Grafana server (for visualization)
- For the RAPL input backend:
- libpowercap, powercap-utils (https://github.com/powercap/powercap)
- For the infiniband backend:
- `perfquery` command line tool
- For the ipmipower backend:
- `ipmi-oem` command line tool (freeipmi) or other configurable command
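Before installing, a quick way to see which of these optional prerequisites are visible on a node is to probe the usual sysfs/procfs paths and tool names. This is only an informal sketch; the paths checked below are assumptions about where each feature typically appears, and a missing entry simply means the matching backend cannot be enabled:

```python
import glob
import os
import shutil

def check_prerequisites():
    """Report which optional Colmet prerequisites are visible on this node.

    The probed paths are the usual locations of each kernel feature; they
    are assumptions, not something Colmet itself checks.
    """
    return {
        "intel_rapl (powercap)": os.path.isdir("/sys/class/powercap/intel-rapl"),
        "perf_event": os.path.exists("/proc/sys/kernel/perf_event_paranoid"),
        "ipmi_devintf": os.path.exists("/dev/ipmi0"),
        "thermal zones": bool(glob.glob("/sys/class/thermal/thermal_zone*")),
        "perfquery tool": shutil.which("perfquery") is not None,
        "ipmi-oem tool": shutil.which("ipmi-oem") is not None,
    }

for name, ok in check_prerequisites().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```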
### Installation
You can install, upgrade, or uninstall colmet with these commands:
```
$ pip install [--user] colmet
$ pip install [--user] --upgrade colmet
$ pip uninstall colmet
```
Or from git (latest development version):
```
$ pip install [--user] git+https://github.com/oar-team/colmet.git
```
Or if you have already pulled the sources:
```
$ pip install [--user] path/to/sources
```
### Usage:
For the nodes:
```
sudo colmet-node -vvv --zeromq-uri tcp://127.0.0.1:5556
```
For the collector:
```
# Simple local HDF5 file collect:
colmet-collector -vvv --zeromq-bind-uri tcp://127.0.0.1:5556 --hdf5-filepath /data/colmet.hdf5 --hdf5-complevel 9
```
```
# Collector with an Elasticsearch backend:
colmet-collector -vvv \
--zeromq-bind-uri tcp://192.168.0.1:5556 \
--buffer-size 5000 \
--sample-period 3 \
--elastic-host http://192.168.0.2:9200 \
--elastic-index-prefix colmet_dahu_ 2>>/var/log/colmet_err.log >> /var/log/colmet.log
```
You will see the number of counters retrieved in the debug log.
For more information, please refer to the help of these scripts (`--help`).
### Notes about backends
Some input backends need external libraries that must be compiled and installed first:
```
# For the perfhw backend:
cd colmet/node/backends/lib_perf_hw/ && make && cp lib_perf_hw.so /usr/local/lib/
# For the rapl backend:
cd colmet/node/backends/lib_rapl/ && make && cp lib_rapl.so /usr/local/lib/
```
Here's a complete colmet-node start-up example, with the perfhw, rapl and more backends enabled:
```
export LIB_PERFHW_PATH=/usr/local/lib/lib_perf_hw.so
export LIB_RAPL_PATH=/applis/site/colmet/lib_rapl.so
colmet-node -vvv --zeromq-uri tcp://192.168.0.1:5556 \
--cpuset_rootpath /dev/cpuset/oar \
--enable-infiniband --omnipath \
--enable-lustre \
--enable-perfhw --perfhw-list instructions cache_misses page_faults cpu_cycles cache_references \
--enable-RAPL \
--enable-jobproc \
--enable-ipmipower >> /var/log/colmet.log 2>&1
```
#### RAPL - Running Average Power Limit (Intel)
RAPL is a feature of recent Intel processors that makes it possible to measure the power consumption of the CPU in real time.
Usage: start colmet-node with the option `--enable-RAPL`.
A file named RAPL_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of the metric, as well as the package and zone (core / uncore / dram) of the processor the metric refers to.
If a given counter is not supported by the hardware, the metric name will be "`counter_not_supported_by_hardware`" and `0` values will appear in the collected data; `-1` values in the collected data mean there is no counter mapped to the column.
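Each row of the mapping file ties a generic counter column to a concrete metric. As an illustration, here is a minimal sketch of how such a file could be parsed; the column names in `sample` are hypothetical, so adapt them to the header of the file colmet actually produces:

```python
import csv
import io

# Hypothetical excerpt of a RAPL_mapping.[timestamp].csv file; the real
# column layout may differ, so treat this parser as a sketch to adapt.
sample = """\
counter,metric_name,package,zone
counter_1,energy_uj,package-0,core
counter_2,energy_uj,package-0,dram
counter_3,counter_not_supported_by_hardware,package-0,uncore
"""

def load_rapl_mapping(fileobj):
    """Map counter column names to (metric, package, zone) tuples,
    skipping counters the hardware does not support."""
    mapping = {}
    for row in csv.DictReader(fileobj):
        if row["metric_name"] == "counter_not_supported_by_hardware":
            continue  # that column will only ever contain 0 values
        mapping[row["counter"]] = (row["metric_name"], row["package"], row["zone"])
    return mapping

mapping = load_rapl_mapping(io.StringIO(sample))
print(mapping["counter_1"])  # ('energy_uj', 'package-0', 'core')
```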
#### Perfhw
This backend provides metrics collected using the [perf_event_open](http://man7.org/linux/man-pages/man2/perf_event_open.2.html) interface.
Usage: start colmet-node with the option `--enable-perfhw`.
Optionally choose the metrics you want (max 5 metrics) using the option `--perfhw-list` followed by a space-separated list of metrics.
Example: `--enable-perfhw --perfhw-list instructions cpu_cycles cache_misses`
A file named perfhw_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of the metric.
Available metrics (refer to the perf_event_open documentation for their meaning):
```
cpu_cycles
instructions
cache_references
cache_misses
branch_instructions
branch_misses
bus_cycles
ref_cpu_cycles
cache_l1d
cache_ll
cache_dtlb
cache_itlb
cache_bpu
cache_node
cache_op_read
cache_op_prefetch
cache_result_access
cpu_clock
task_clock
page_faults
context_switches
cpu_migrations
page_faults_min
page_faults_maj
alignment_faults
emulation_faults
dummy
bpf_output
```
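Note that access to perf_event counters is governed by the kernel's `perf_event_paranoid` setting (running colmet-node as root bypasses it). The helper below is a small sketch, not part of colmet, interpreting that setting with the usual level semantics:

```python
def perf_event_allowed(paranoid_level, privileged=False):
    """Interpret /proc/sys/kernel/perf_event_paranoid:
        2  - unprivileged users may take user-space measurements only
        1  - kernel and user measurements allowed
        0  - CPU-wide measurements also allowed
       -1  - no restrictions
    Root (or CAP_PERFMON on recent kernels) bypasses the restriction."""
    return privileged or paranoid_level <= 1

# On a real node the current level can be read with:
#   level = int(open("/proc/sys/kernel/perf_event_paranoid").read())
print(perf_event_allowed(2))                   # False: need root or a lower level
print(perf_event_allowed(2, privileged=True))  # True
```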
#### Temperature
This backend gets temperatures from `/sys/class/thermal/thermal_zone*/temp`.
Usage: start colmet-node with the option `--enable-temperature`.
A file named temperature_mapping.[timestamp].csv is created in the working directory. It establishes the correspondence between `counter_1`, `counter_2`, etc. in the collected data and the actual name of the metric.
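The backend's reading logic can be sketched as follows; `read_temperatures` is an illustrative helper (not part of colmet) that mirrors the sysfs layout named above, demonstrated against a fake tree so the sketch runs anywhere:

```python
import pathlib
import tempfile

def read_temperatures(sysfs_root="/sys/class/thermal"):
    """Return {zone_type: degrees_celsius} for every thermal_zone* entry,
    mirroring the files the temperature backend reads."""
    temps = {}
    for zone in sorted(pathlib.Path(sysfs_root).glob("thermal_zone*")):
        zone_type = (zone / "type").read_text().strip()
        millideg = int((zone / "temp").read_text().strip())
        temps[zone_type] = millideg / 1000.0  # the kernel reports millidegrees
    return temps

# Demonstrate on a fake sysfs tree so the sketch works on any machine.
with tempfile.TemporaryDirectory() as root:
    zone = pathlib.Path(root) / "thermal_zone0"
    zone.mkdir()
    (zone / "type").write_text("x86_pkg_temp\n")
    (zone / "temp").write_text("42000\n")
    print(read_temperatures(root))  # {'x86_pkg_temp': 42.0}
```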
Colmet CHANGELOG
================
Version 0.6.8
-------------
- Added Nvidia GPU support
Version 0.6.7
-------------
- Bugfix: missing glob import in procstats
Version 0.6.6
-------------
- Added --no-check-certificates option for the Elasticsearch backend
- Added involved jobs and new metrics in jobprocstats
Version 0.6.4
-------------
- Added HTTP auth support for the Elasticsearch backend
Version 0.6.3
-------------
Released on September 4th 2020
- Bugfixes in the lustrestats and jobprocstats backends
Version 0.6.2
-------------
Released on September 3rd 2020
- Python package fix
Version 0.6.1
-------------
Released on September 3rd 2020
- New input backends: lustre, infiniband, temperature, rapl, perfhw, ipmipower, jobproc
- New output backend: elasticsearch
- Example Grafana dashboard for Elasticsearch backend
- Added "involved_jobs" value for metrics that are global to a node (job 0)
- Bugfix for "dictionary changed size during iteration"
Version 0.5.4
-------------
Released on January 19th 2018
- hdf5 extractor script for OAR RESTFUL API
- Added infiniband backend
- Added lustre backend
- Fixed cpuset_rootpath default always appended
Version 0.5.3
-------------
Released on April 29th 2015
- Removed an unnecessary lock from the collector to avoid colmet waiting forever
- Removed (async) zmq eventloop and added ``--sample-period`` to the collector.
- Fixed some bugs about hdf file
Version 0.5.2
-------------
Released on Apr 2nd 2015
- Fixed python syntax error
Version 0.5.1
-------------
Released on Apr 2nd 2015
- Fixed error about missing ``requirements.txt`` file in the sdist package
Version 0.5.0
-------------
Released on Apr 2nd 2015
- Don't run colmet as a daemon anymore
- Maintained compatibility with zmq 3.x/4.x
- Dropped ``--zeromq-swap`` (swap was dropped from zmq 3.x)
- Handled zmq name change from HWM to SNDHWM and RCVHWM
- Fixed requirements
- Dropped python 2.6 support
Version 0.4.0
-------------
- Saved metrics in new HDF5 file if colmet is reloaded in order to avoid HDF5 data corruption
- Handled HUP signal to reload ``colmet-collector``
- Removed ``hiwater_rss`` and ``hiwater_vm`` collected metrics.
Version 0.3.1
-------------
- New metrics ``hiwater_rss`` and ``hiwater_vm`` for taskstats
- Worked with pyinotify 0.8
- Added ``--disable-procstats`` option to disable procstats backend.
Version 0.3.0
-------------
- Divided colmet package into three parts
- colmet-node : Retrieve data from taskstats and procstats and send to
collectors with ZeroMQ
- colmet-collector : A collector that stores data received by ZeroMQ in a
hdf5 file
- colmet-common : Common colmet part.
- Added some parameters of ZeroMQ backend to prevent a memory overflow
- Simplified the command line interface
- Dropped rrd backend because it is not yet working
- Added ``--buffer-size`` option for the collector to define the maximum number of
counters that colmet should queue in memory before pushing them to the output
backend
- Handled SIGTERM and SIGINT to terminate colmet properly
Version 0.2.0
-------------
- Added options to enable hdf5 compression
- Support for multiple jobs by cgroup path scanning
- Used Inotify events for job list update
- Don't filter packets if no job_id range was specified, especially with zeromq
backend
- Waited for the cgroup_path folder creation before scanning the list of jobs
- Added procstat for node monitoring through a fictive job with 0 as identifier
- Used absolute measurement times rather than delays between measurements, to
avoid drift of the measurement time
- Added a workaround for when a new cgroup is created with no process in it
(monitoring is suspended until a process is launched)
Version 0.0.1
-------------
- Initial design