Telemetry, Observability
Fleeting
I tried to set up a telemetry stack at my company, and this is what I think I have understood so far.
vocabulary
- observability
  - the practice of looking at a running system to find out what happens, how it happens and why it happens,
- telemetry
  - the technical setup enabling observability; three kinds of signals are commonly considered part of telemetry,
    - logging: the aggregation of lines output by the services under observation,
    - metrics: data points conveying knowledge about the state of the observed system,
    - traces: DAGs of nested timelines triggered by a single root cause, for instance a user opening a web page. Each DAG is called a trace and the timelines it contains are called spans.
- profiling
  - a description of resource usage inside a running process (think valgrind, flame graphs, eBPF etc). It is starting to become the fourth common telemetry signal, yet it is still presented as an outlier for now.
- opentelemetry
  - born from the merge of opencensus and opentracing in 2019, a standardisation of telemetry elements so as to enable interoperability. As long as a product “speaks” opentelemetry, it can be substituted for another one.
- otel collector
  - a program that receives data from several sources following the opentelemetry standard, enriches it and sends it to the appropriate databases (see the sketch after this list).
- otlp (OpenTelemetry Protocol)
  - the protocol used to send data in the opentelemetry format. It can be carried over grpc or http, the latter being easier to pass through http ingresses and the former being more efficient.
- correlation
  - links between telemetry data of different natures. To understand a system, one needs to navigate from metrics/logging/traces to other metrics/logging/traces. Following the same standard and adding the appropriate metadata to make those jumps possible is called correlation. Traces and logging already capture dynamic data, so correlating them is mostly a matter of using consistent labels.
    - exemplar: metadata attached to a metric data point containing a trace_id, so as to get from metrics to logging and tracing.
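To make the otel collector and otlp entries more concrete, here is a minimal sketch of an OpenTelemetry Collector configuration that receives all three signals over otlp and fans them out to the databases of the grafana stack below. The hostnames, ports and endpoints are assumptions made up for the example; only the component names (the otlp receiver, the batch processor, the otlp, otlphttp and prometheusremotewrite exporters) are standard collector building blocks.

receivers:
  otlp:                        # services send their telemetry here, over otlp
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                       # buffer and batch data before exporting

exporters:
  otlp/tempo:                  # traces, forwarded over grpc (hostname assumed)
    endpoint: tempo:4317
    tls:
      insecure: true
  otlphttp/loki:               # logs, loki accepts otlp over http under /otlp (hostname assumed)
    endpoint: http://loki:3100/otlp
  prometheusremotewrite:       # metrics, pushed with remote write (hostname assumed)
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]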
the grafana stack
- metrics
  - prometheus
- logs
  - loki
- traces
  - tempo
- otel collector
  - alloy
So far, I only use alloy to gather logs, so maybe it would be better suited to enrich the data as well. For now, I have tried configuring grafana itself to correlate the databases.
what I learned
I was surprised by how much had to be done manually.
Prometheus needed several command line options so that it would work with the
telemetry stack: --web.enable-remote-write-receiver --enable-feature=exemplar-storage --enable-feature=native-histograms --web.enable-otlp-receiver
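For reference, in a containerized setup those flags end up in the service definition, roughly like this (the image name and port mapping are assumptions for the example):

services:
  prometheus:
    image: prom/prometheus                     # assumed image
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver     # accept metrics pushed via remote write
      - --enable-feature=exemplar-storage      # store exemplars so metrics can link to traces
      - --enable-feature=native-histograms     # accept native histograms
      - --web.enable-otlp-receiver             # accept metrics over otlp
    ports:
      - "9090:9090"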
Also, it took a lot of trial and error to add manual correlations in grafana so that the datasources link to one another. Here is an example of my datasource provisioning.
apiVersion: 1
datasources:
  - name: Tempo
    url: ...
    type: tempo
    uid: tempo
    jsonData:
      # tracesToLogsV2: from a span in tempo, build the loki query that finds the matching logs
      tracesToLogsV2:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: [{ key: 'service.name', value: 'service_name' }]
        filterByTraceID: true
        filterBySpanID: false
        customQuery: true
        query: '{$$__tags} |~ `(?i)"TraceId":"$$${__trace.traceId}` or `(?i)trace_id=$$${__trace.traceId}`'
      # tracesToMetrics: from a span in tempo, link to related metrics in prometheus
      tracesToMetrics:
        datasourceUid: 'prometheus'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: [{ key: 'service.name', value: 'task' }]
        queries:
          - name: 'memory'
            query: 'nomad_client_allocs_memory_usage{$$__tags}'
          - name: 'cpu'
            query: 'konix:nomad_client_allocs_cpu_percent{$$__tags}'
      httpMethod: GET
      serviceMap:
        datasourceUid: 'prometheus'
      nodeGraph:
        enabled: true
  - name: Loki
    url: ...
    type: loki
    uid: loki
    jsonData:
      maxLines: 1000
      # derivedFields: extract trace ids from log lines so they link back to tempo
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "trace_id=(\\w+)"
          name: found_trace_id_by_regex
          urlDisplayLabel: 'View Trace'
          url: "$$${__value.raw}"
        - datasourceUid: tempo
          matcherRegex: "TraceId.:.(\\w+)"
          name: found_trace_id_by_regex_json
          urlDisplayLabel: 'View Trace'
          url: "$$${__value.raw}"
        - datasourceUid: tempo
          matcherRegex: "trace_id"
          matcherType: label
          name: found_trace_id_by_label
          urlDisplayLabel: 'View Trace'
  - name: Prometheus
    url: ...
    type: prometheus
    uid: prometheus
    jsonData:
      httpMethod: GET
      # exemplarTraceIdDestinations: jump from an exemplar on a metric to the trace in tempo
      exemplarTraceIdDestinations:
        - datasourceUid: tempo
          name: trace_id
I could not create alerts from traces; grafana does not offer it in the alerting panel.
Tempo has a lot of configuration options. They are hard to understand because they imply deeper knowledge of what happens under the hood: things like distributors, metrics generators, overrides, compactors etc. With a default setup, it happens to work, but it requires at least 200MB of RAM and spikes at 1GB when queried. This is huge considering how little I capture for now.
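As a note to myself, this is roughly the shape of a tempo.yaml and where those sections live, as I currently understand it; the ports, paths and retention values are made up and the exact field names should be double checked against the tempo documentation.

server:
  http_listen_port: 3200              # assumed port

distributor:                          # receives spans, here over otlp
  receivers:
    otlp:
      protocols:
        grpc:
        http:

compactor:                            # merges blocks and applies retention
  compaction:
    block_retention: 48h              # assumed retention

metrics_generator:                    # derives metrics (service graphs, span metrics) from spans
  storage:
    path: /var/tempo/generator/wal    # assumed path
    remote_write:
      - url: http://prometheus:9090/api/v1/write

storage:
  trace:
    backend: local                    # assumed local backend
    local:
      path: /var/tempo/blocks
    wal:
      path: /var/tempo/wal

overrides:                            # per-tenant limits and feature toggles
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]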
Prometheus is historically pull based (it scrapes /metrics style endpoints), which makes it ill suited for highly correlated data. It can now also be fed through the otlp protocol, but then it becomes hard to decide which path to use in which situation. Should I simply avoid the prometheus scraping feature and do the scraping in alloy?
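As a sketch, the two ingestion paths look like this; the job name, target and hostnames are made up for the example.

# option 1: classic pull, declared in prometheus.yml
scrape_configs:
  - job_name: 'my-service'              # assumed job name
    static_configs:
      - targets: ['my-service:8080']    # assumed /metrics endpoint

# option 2: push over otlp from alloy (or any collector) instead,
# pointing its otlp http exporter at prometheus' receiver,
# which is enabled by --web.enable-otlp-receiver:
#   http://prometheus:9090/api/v1/otlp/v1/metrics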