File Reporter

The Ziti Controller event system can write metrics to a file for direct viewing or post-processing into a metrics system of your choice.

Ziti Configuration

There are two independent configurations that need to be set up for metric event reporting to work:

  1. Report Interval: The controller and routers need to have a metric reporting interval set. The interval determines how often metrics are sent to the controller and, depending on the event subscription, written out to file.
  2. Event Subscription: The subscription is configured in the controller and determines which of the reported metrics are written.

Metrics Reporting

The reporting interval is defined in the metrics.reportInterval property. While the controller and each router can have a different reporting interval, we suggest that you set them all the same to make lining up metrics across the cluster easier.

Example:

metrics:
  reportInterval: 20s

Routers support an additional configuration parameter, messageQueueSize, which is the number of unsent metrics messages that can sit in the router metrics queue before metrics gathering is paused.

Example:

metrics:
  reportInterval: 20s
  messageQueueSize: 20 # Default 10

Metrics Writing

Writing of metrics is broken into two pieces:

  1. subscription: Which metrics will be written
  2. handler: How the metrics will be written

Metrics Subscription

The metrics subscription defines which metrics will be written.

In addition to type: metrics, a subscription has two filter components:

  1. sourceFilter: Which components to write metrics for. This is a regular expression.
    • ctrl_client: Special marker for the controller
    • Router ID: Get metrics for one and only one router
    • .*: Get metrics from the controller and all routers
  2. metricFilter: Which metrics to report. This is a regular expression.
    • .*pool.*: Report only pool metrics
    • .*: Report all metrics

Example:

events:
  allControllerMetrics:
    subscriptions:
      - type: metrics
        sourceFilter: ctrl_client
        metricFilter: .*
  justEdgeRouterPoolMetrics:
    subscriptions:
      - type: metrics
        sourceFilter: .*
        metricFilter: .*pool.*
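
To make the filter behavior concrete, the following is a minimal Python sketch of how the two regular expressions could select metrics. It is not the controller's implementation: it assumes an unanchored regular expression search, and the router ID used in the usage note is a hypothetical placeholder.

```python
import re

# Filters copied from the subscription example above.
SUBSCRIPTIONS = {
    "allControllerMetrics": {"sourceFilter": r"ctrl_client", "metricFilter": r".*"},
    "justEdgeRouterPoolMetrics": {"sourceFilter": r".*", "metricFilter": r".*pool.*"},
}


def selected_by(subscription, source_id, metric):
    """Return True when both filters match.

    Assumption: filters are applied as an unanchored search. The
    controller's exact matching rules are not stated in this document.
    """
    return (
        re.search(subscription["sourceFilter"], source_id) is not None
        and re.search(subscription["metricFilter"], metric) is not None
    )


def subscriptions_for(source_id, metric):
    """List the names of the subscriptions that would select this metric."""
    return [
        name
        for name, subscription in SUBSCRIPTIONS.items()
        if selected_by(subscription, source_id, metric)
    ]
```

Under these assumptions, subscriptions_for("ctrl_client", "pool.listener.mgmt.worker_count") selects both subscriptions, while a pool metric from a router (for instance the made-up ID "hypothetical-router-id") is selected only by justEdgeRouterPoolMetrics.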

Metrics Handling

The metrics handler defines how metrics are written. It is composed of:

  1. type: The type of handler. Supported types are:
    • file: Metrics will be written to a file
  2. maxsizemb: File rolling. Log files are rolled when they reach this size. Default size is 10mb.
  3. maxbackups: File rolling. The number of files to keep. Default is 0 (unlimited).
  4. format: The format of the metric. Supported formats are: json
  5. path: The name of the file to write metrics to

File Rolling

Files are rolled when they reach a size of maxsizemb. Files are renamed from name.log to name-iso8601.log.

For example, name-2022-06-07T18-50-44.568.log
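
When post-processing rolled files, it can help to predict or recognize the rolled file names. The sketch below rebuilds the naming pattern from the single example above; the helper name rolled_name is hypothetical, and the timestamp layout (dashes instead of colons, millisecond precision) is inferred from that one example rather than taken from the handler's source.

```python
from datetime import datetime
from pathlib import Path


def rolled_name(path, rolled_at):
    """Build the rolled file name for `path` at the time `rolled_at`.

    Assumption: the layout YYYY-MM-DDTHH-MM-SS.mmm is inferred from the
    documented example name-2022-06-07T18-50-44.568.log.
    """
    original = Path(path)
    millis = rolled_at.microsecond // 1000
    stamp = rolled_at.strftime("%Y-%m-%dT%H-%M-%S") + f".{millis:03d}"
    return str(original.with_name(f"{original.stem}-{stamp}{original.suffix}"))
```

Calling rolled_name("name.log", datetime(2022, 6, 7, 18, 50, 44, 568000)) reproduces the file name in the example above.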

Example

Write 100mb files, saving 2 of them.

handler:
  type: file
  format: json
  maxsizemb: 100
  maxbackups: 2
  path: /tmp/controller-metrics.log

Putting it all together

Controller configuration file:

metrics:
  reportInterval: 20s

events:
  allControllerMetrics:
    subscriptions:
      - type: metrics
        sourceFilter: ctrl_client
        metricFilter: .*
    handler:
      type: file
      format: json
      maxsizemb: 50
      maxbackups: 2
      path: /tmp/controller-metrics.log

  justPoolMetrics:
    subscriptions:
      - type: metrics
        sourceFilter: .*
        metricFilter: .*pool.*
    handler:
      type: file
      format: json
      maxsizemb: 100
      maxbackups: 5
      path: /tmp/router-pool-metrics.log

Router configuration files:

metrics:
  reportInterval: 20s

Types of Metrics Reported

OpenZiti is instrumenting more code and adding metrics all the time. This section describes the different types of metrics that OpenZiti reports, not individual metric names.

intValue/floatValue

A gauge of a single value. The value is the current metric value and can go up and down over time.

Histogram

Histogram metrics use the Go metrics module and are set to a 128-sample exponentially decaying bucket with an alpha value of .015. This is important to understand, especially in reference to minimum and maximum values. The bucket is sample bound, not time bound. In practice, this means a maximum or minimum value will often carry on for several time samples; this is expected behavior. For example, link latency is measured every 10 seconds by default, so a maximum value can stay in place for 21 minutes 20 seconds (128 * 10s = 1280s). Keep this in mind when viewing the measurements: if you are used to thinking in time-bound buckets, a low or high value may appear more prevalent than it actually is.

The histogram implementation allows for extremely fast and memory-efficient data collection. Because some of the metrics are multiplied by multiple levels of cardinality, this efficiency is critical to keeping the software operating.
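
The time a single extreme value can linger in a sample-bound window is simple arithmetic, sketched below in Python. The function names are hypothetical; the 128-sample window and the 10-second link latency interval are the values given above.

```python
def window_span_seconds(samples, interval_seconds):
    """Seconds needed for `samples` measurements taken every `interval_seconds`."""
    return samples * interval_seconds


def as_minutes_seconds(total_seconds):
    """Split a duration in seconds into (minutes, seconds)."""
    return divmod(total_seconds, 60)


# 128 samples, one every 10 seconds: 1280 seconds in total.
link_latency_span = as_minutes_seconds(window_span_seconds(128, 10))
```

Here link_latency_span is (21, 20), that is, 21 minutes and 20 seconds. A metric sampled at a different interval fills the same 128-sample window over a proportionally different span.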

An exponentially decaying histogram means that as the samples age across the 128 sample window, they are weighted less than the newer samples. This makes functions, such as the mean, which is often used, able to respond more quickly to changes than a straight sliding window. An alpha value of .015 means that the sample weights range from 1 (the newest sample) to approximately .93. This means that when calculating the mean, the oldest sample in the window is weighted to 93%, reducing its contribution to the function.

A simple weighting exercise: given 3 samples, 10, 5, and 5, how do the weighting and order affect the mean function? (This is not the actual function used by the histogram, but it is intended to help explain the decaying function and the impact of the age of a sample on the measurements.)

With the 10 as the newest sample:

| Sample | Weight     | Weighted Value |
|--------|------------|----------------|
| 10     | 1.0        | 10.0           |
| 5      | .95        | 4.75           |
| 5      | .90        | 4.5            |
| Sum    | 2.85       | 19.25          |
| Mean   | 19.25/2.85 | 6.75           |

With the 10 as the oldest sample:

| Sample | Weight     | Weighted Value |
|--------|------------|----------------|
| 5      | 1.0        | 5.0            |
| 5      | .95        | 4.75           |
| 10     | .90        | 9.0            |
| Sum    | 2.85       | 18.75          |
| Mean   | 18.75/2.85 | 6.58           |
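
The weighting exercise can be reproduced with a few lines of Python. As with the exercise itself, this is only an illustration of how sample age shifts a weighted mean, not the histogram's actual decay function; the weights are the illustrative ones from the exercise.

```python
def weighted_mean(samples, weights):
    """Weighted mean: sum of sample * weight divided by the sum of weights."""
    weighted_sum = sum(sample * weight for sample, weight in zip(samples, weights))
    return weighted_sum / sum(weights)


# Weights run from the newest sample (1.0) to the oldest (.90).
WEIGHTS = [1.0, 0.95, 0.90]

newest_is_high = weighted_mean([10, 5, 5], WEIGHTS)  # the 10 is the newest sample
oldest_is_high = weighted_mean([5, 5, 10], WEIGHTS)  # the 10 is the oldest sample
```

Rounded to two decimals, newest_is_high is 6.75 and oldest_is_high is 6.58: the same three values produce a lower mean once the large sample has aged.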

Standard histograms provide:

  • min
  • max
  • mean
  • std_dev
  • variance
  • percentiles
    • p50
    • p75
    • p95
    • p99
    • p999
    • p9999

Note that with a sample size of 128, the more specific percentiles (such as p999 and p9999) are computed from the same actual values, so they may be repetitive.
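
The following Python sketch shows why the high percentiles repeat with a small window. It assumes a rank-interpolation rule of the kind commonly used by Go metrics libraries (position = p * (n + 1), clamped to the first and last sample); treat that rule as an assumption for illustration, not a statement of exactly what OpenZiti runs.

```python
import math


def percentile(sorted_values, p):
    """Percentile of an ascending list using rank interpolation.

    Assumption: position = p * (n + 1), clamped to the first and last
    sample, with linear interpolation between neighbouring samples.
    """
    size = len(sorted_values)
    if size == 0:
        raise ValueError("percentile of an empty sample is undefined")
    pos = p * (size + 1)
    if pos < 1:
        return float(sorted_values[0])
    if pos >= size:
        return float(sorted_values[-1])
    lower = sorted_values[int(pos) - 1]
    upper = sorted_values[int(pos)]
    return lower + (pos - math.floor(pos)) * (upper - lower)


# A full window of 128 distinct samples: 1, 2, ..., 128.
window = list(range(1, 129))
p999 = percentile(window, 0.999)
p9999 = percentile(window, 0.9999)
```

With 128 samples, both p999 and p9999 fall past the last rank, so both resolve to the maximum value in the window (128.0 here).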

Meter

Meters are used for rate measurements: how much of something happened per unit time. The samples are exponentially decayed, similar to the histogram; however, the values are bound to specific time intervals, such as 1, 5, and 15 minutes. They can also provide statistical values similar to those of histograms.

Meters provide:

  • count
  • m1_rate
  • m5_rate
  • m15_rate
  • min
  • max
  • mean
  • std_dev
  • variance
  • percentiles
    • p50
    • p75
    • p95
    • p99
    • p999
    • p9999
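
To give a feel for how a time-bound, exponentially decayed rate such as m1_rate behaves, here is a minimal Python sketch of an exponentially weighted moving average. The 5-second tick and the alpha formula are assumptions borrowed from the common load-average style of meter; the class name DecayingRate is hypothetical and this is not taken from the OpenZiti source.

```python
import math

TICK_SECONDS = 5.0  # assumed update interval; not stated in this document


class DecayingRate:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_minutes):
        # Assumed smoothing factor for the given window length.
        self.alpha = 1.0 - math.exp(-TICK_SECONDS / 60.0 / window_minutes)
        self.rate = None

    def tick(self, events_in_tick):
        """Fold the events counted during one tick into the running rate."""
        instant = events_in_tick / TICK_SECONDS
        if self.rate is None:
            self.rate = instant  # the first tick seeds the average
        else:
            self.rate += self.alpha * (instant - self.rate)
        return self.rate


m1, m15 = DecayingRate(1), DecayingRate(15)
for _ in range(3):       # 10 events per tick: a steady 2 events/second
    m1.tick(10)
    m15.tick(10)
for _ in range(12):      # then one minute with no events at all
    m1.tick(0)
    m15.tick(0)
```

After the quiet minute, the 1-minute rate has dropped well below 2 events/second while the 15-minute rate has barely moved, which is the practical difference between m1_rate and m15_rate.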

Timer

Timers provide statistical samples of timed events.

  • min
  • max
  • mean
  • std_dev
  • variance
  • percentiles
    • p50
    • p75
    • p95
    • p99
    • p999
    • p9999

Gauge

Gauges present a point in time measurement of a metric. For example, the number of open database transactions at a given moment.

Metric Examples

intValue

{
  "metric_type": "intValue",
  "namespace": "metrics",
  "source_id": "ctrl_client",
  "version": 2,
  "timestamp": {
    "seconds": 1654625684,
    "nanos": 479708609
  },
  "metric": "pool.listener.mgmt.worker_count",
  "metrics": {
    "value": 1
  },
  "source_event_id": "acb85925-0e17-4ca0-90cb-9a2498b33bc8"
}

Histogram

{
  "metric_type": "histogram",
  "namespace": "metrics",
  "source_id": "ctrl_client",
  "source_entity_id": "xpw7BEDAk",
  "version": 2,
  "timestamp": {
    "seconds": 1654625684,
    "nanos": 479708609
  },
  "metric": "ctrl.queue_time",
  "metrics": {
    "count": 57,
    "max": 21647,
    "mean": 15266.508771929824,
    "min": 5753,
    "p50": 15670,
    "p75": 16558.5,
    "p95": 18362.699999999997,
    "p99": 21647,
    "p999": 21647,
    "p9999": 21647,
    "std_dev": 2604.8795245927113,
    "variance": 6785397.337642349
  },
  "source_event_id": "acb85925-0e17-4ca0-90cb-9a2498b33bc8"
}

Timer

{
  "metric_type": "timer",
  "namespace": "metrics",
  "source_id": "ctrl_client",
  "version": 2,
  "timestamp": {
    "seconds": 1654625684,
    "nanos": 479708609
  },
  "metric": "api.session.enforcer.run",
  "metrics": {
    "count": 11,
    "m15_rate": 0.2,
    "m1_rate": 0.2,
    "m5_rate": 0.2,
    "max": 6842849,
    "mean": 1096126.3636363635,
    "mean_rate": 0.20374652060865114,
    "min": 254514,
    "p50": 335348,
    "p75": 1212318,
    "p95": 6842849,
    "p99": 6842849,
    "p999": 6842849,
    "p9999": 6842849,
    "std_dev": 1858257.4031879376,
    "variance": 3453120576502.777
  },
  "source_event_id": "acb85925-0e17-4ca0-90cb-9a2498b33bc8"
}
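
For post-processing, events like the ones above can be read with any JSON parser. The Python sketch below parses the intValue example and converts its seconds/nanos timestamp to a UTC datetime. The helper names are hypothetical, and how events are delimited within the metrics file is not covered here; the sketch handles a single event string.

```python
import json
from datetime import datetime, timedelta, timezone

# The intValue example from this page, as a single JSON document.
SAMPLE_EVENT = """
{
  "metric_type": "intValue",
  "namespace": "metrics",
  "source_id": "ctrl_client",
  "version": 2,
  "timestamp": {"seconds": 1654625684, "nanos": 479708609},
  "metric": "pool.listener.mgmt.worker_count",
  "metrics": {"value": 1},
  "source_event_id": "acb85925-0e17-4ca0-90cb-9a2498b33bc8"
}
"""


def event_time(event):
    """Convert the seconds/nanos timestamp to a timezone-aware UTC datetime.

    Python datetimes carry microseconds, so the nanos are truncated.
    """
    stamp = event["timestamp"]
    whole_seconds = datetime.fromtimestamp(stamp["seconds"], tz=timezone.utc)
    return whole_seconds + timedelta(microseconds=stamp["nanos"] // 1000)


def summarize(event):
    """Keep only the fields most post-processing pipelines need."""
    return {
        "source": event["source_id"],
        "metric": event["metric"],
        "type": event["metric_type"],
        "time": event_time(event).isoformat(),
        "values": event["metrics"],
    }


summary = summarize(json.loads(SAMPLE_EVENT))
```

The same summarize helper works for the histogram and timer examples, since the values of every metric type are carried in the metrics object.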