Telemetry, Observability
Fleeting
I tried to set up a telemetry stack at my company, and this is what I think I have understood so far.
vocabulary
- observability
  - the practice of looking at a running system to find out what happens, how it happens and why it happens,
- telemetry
  - the technical setup enabling observability; three kinds of signals are commonly considered part of telemetry,
    - logging: the aggregation of lines output by the services under observation,
    - metrics: data points conveying knowledge about the state of the observed system,
    - traces: DAGs of nested timelines triggered by a single root cause, for instance a user opening a web page. Each DAG is called a trace and the timelines it contains are called spans.
- profiling
  - a description of resource usage inside a running process (think valgrind, flame graphs, eBPF etc). It is starting to become the fourth common telemetry signal, yet it is still presented as an outlier for now.
- opentelemetry
  - born from the merge of opencensus and opentracing in 2019, a standardisation of telemetry elements so as to enable interoperability. As long as a product “speaks” opentelemetry, it can be substituted for another one.
- otel collector
  - a program that receives data from several sources following the opentelemetry standard, enriches it and sends it to the appropriate databases (see the sketch after this list).
- otlp (OpenTelemetry Protocol)
  - the protocol used to send data in the opentelemetry format. It can be carried over grpc or http, the latter being easier to pass through http ingresses and the former being more efficient.
- correlation
  - links between telemetry data of different natures. To understand a system, one needs to navigate from metrics/logging/traces to other metrics/logging/traces. Following the same standard and adding the appropriate metadata to make those jumps possible is called correlation. Traces and logging already capture dynamic data, so correlating them is mostly a matter of using consistent labels.
    - exemplar: metadata attached to a metric data point containing a trace_id, so as to get from metrics to logging and tracing.
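To make the otel collector and otlp entries more concrete, here is a minimal sketch of an OpenTelemetry Collector configuration that receives all three signals over otlp and fans them out to the databases of the grafana stack below. The hostnames, ports and endpoints are assumptions made up for the example; only the component names (the otlp receiver, the batch processor, the otlp, otlphttp and prometheusremotewrite exporters) are standard collector building blocks.

receivers:
  otlp:                        # services send their telemetry here, over otlp
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                       # buffer and batch data before exporting

exporters:
  otlp/tempo:                  # traces, forwarded over grpc (hostname assumed)
    endpoint: tempo:4317
    tls:
      insecure: true
  otlphttp/loki:               # logs, loki accepts otlp over http under /otlp (hostname assumed)
    endpoint: http://loki:3100/otlp
  prometheusremotewrite:       # metrics, pushed with remote write (hostname assumed)
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]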
the grafana stack
- metrics
  - prometheus
- logs
  - loki
- traces
  - tempo
- otel collector
  - alloy
So far, I only use alloy to gather logs, so maybe it would be better suited to enrich the data as well. For now, I have tried configuring grafana itself to correlate the databases.
what I learned
I was surprised by how much had to be done manually.
Prometheus needed several command line options so that it would work with the
telemetry stack: --web.enable-remote-write-receiver --enable-feature=exemplar-storage --enable-feature=native-histograms --web.enable-otlp-receiver
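For reference, in a containerized setup those flags end up in the service definition, roughly like this (the image name and port mapping are assumptions for the example):

services:
  prometheus:
    image: prom/prometheus                     # assumed image
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver     # accept metrics pushed via remote write
      - --enable-feature=exemplar-storage      # store exemplars so metrics can link to traces
      - --enable-feature=native-histograms     # accept native histograms
      - --web.enable-otlp-receiver             # accept metrics over otlp
    ports:
      - "9090:9090"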
Also, it took a lot of trial and error to add manual correlations in grafana so that the datasources link to one another. Here is an example of my datasource provisioning.
apiVersion: 1
datasources:
  - name: Tempo
    url: ...
    type: tempo
    uid: tempo
    jsonData:
      # tracesToLogsV2: from a span in tempo, build the loki query that finds the matching logs
      tracesToLogsV2:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: [{ key: 'service.name', value: 'service_name' }]
        filterByTraceID: true
        filterBySpanID: false
        customQuery: true
        query: '{$$__tags} |~ `(?i)"TraceId":"$$${__trace.traceId}` or `(?i)trace_id=$$${__trace.traceId}`'
      # tracesToMetrics: from a span in tempo, link to related metrics in prometheus
      tracesToMetrics:
        datasourceUid: 'prometheus'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: [{ key: 'service.name', value: 'task' }]
        queries:
          - name: 'memory'
            query: 'nomad_client_allocs_memory_usage{$$__tags}'
          - name: 'cpu'
            query: 'konix:nomad_client_allocs_cpu_percent{$$__tags}'
      httpMethod: GET
      serviceMap:
        datasourceUid: 'prometheus'
      nodeGraph:
        enabled: true
  - name: Loki
    url: ...
    type: loki
    uid: loki
    jsonData:
      maxLines: 1000
      # derivedFields: extract trace ids from log lines so they link back to tempo
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "trace_id=(\\w+)"
          name: found_trace_id_by_regex
          urlDisplayLabel: 'View Trace'
          url: "$$${__value.raw}"
        - datasourceUid: tempo
          matcherRegex: "TraceId.:.(\\w+)"
          name: found_trace_id_by_regex_json
          urlDisplayLabel: 'View Trace'
          url: "$$${__value.raw}"
        - datasourceUid: tempo
          matcherRegex: "trace_id"
          matcherType: label
          name: found_trace_id_by_label
          urlDisplayLabel: 'View Trace'
  - name: Prometheus
    url: ...
    type: prometheus
    uid: prometheus
    jsonData:
      httpMethod: GET
      # exemplarTraceIdDestinations: jump from an exemplar on a metric to the trace in tempo
      exemplarTraceIdDestinations:
        - datasourceUid: tempo
          name: trace_id
I could not create alerts from traces; grafana does not offer it in the alerting panel.
Tempo has a lot of configuration options. They are hard to understand because they imply deeper knowledge of what happens under the hood: things like distributors, metrics generators, overrides, compactors etc. With a default setup, it happens to work, but it requires at least 200MB of RAM and spikes at 1GB when queried. This is huge considering how little I capture for now.
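As a note to myself, this is roughly the shape of a tempo.yaml and where those sections live, as I currently understand it; the ports, paths and retention values are made up and the exact field names should be double checked against the tempo documentation.

server:
  http_listen_port: 3200              # assumed port

distributor:                          # receives spans, here over otlp
  receivers:
    otlp:
      protocols:
        grpc:
        http:

compactor:                            # merges blocks and applies retention
  compaction:
    block_retention: 48h              # assumed retention

metrics_generator:                    # derives metrics (service graphs, span metrics) from spans
  storage:
    path: /var/tempo/generator/wal    # assumed path
    remote_write:
      - url: http://prometheus:9090/api/v1/write

storage:
  trace:
    backend: local                    # assumed local backend
    local:
      path: /var/tempo/blocks
    wal:
      path: /var/tempo/wal

overrides:                            # per-tenant limits and feature toggles
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]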
Prometheus is historically pull based (it scrapes /metrics style endpoints), which makes it ill suited for highly correlated data. It can now also be fed through the otlp protocol, but then it becomes hard to decide which path to use in which situation. Should I simply avoid the prometheus scraping feature and do the scraping in alloy?
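As a sketch, the two ingestion paths look like this; the job name, target and hostnames are made up for the example.

# option 1: classic pull, declared in prometheus.yml
scrape_configs:
  - job_name: 'my-service'              # assumed job name
    static_configs:
      - targets: ['my-service:8080']    # assumed /metrics endpoint

# option 2: push over otlp from alloy (or any collector) instead,
# pointing its otlp http exporter at prometheus' receiver,
# which is enabled by --web.enable-otlp-receiver:
#   http://prometheus:9090/api/v1/otlp/v1/metrics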