Service meshes like Istio have changed the world of observability. They provide the fastest path to generating the critical metrics and traces that enable software reliability teams to find bugs and bottlenecks in a system. Zipkin and Jaeger are distributed tracing systems, and Istio can use either of them to provide observability into requests as they travel through a system of microservices. Let’s explore what distributed tracing is, why you would want it, how Istio uses it, and the differences between Zipkin and Jaeger as backends for your traces.
Distributed Tracing
In a monolithic application, when an error occurs, you usually already have a trace to follow: a stack trace. Because the entire lifetime of a request or transaction is owned by a single application, the stack trace gives you a full view of what happened in that transaction. In addition, profiling libraries can tell you how long a particular function or database call took.
But what about in a distributed system? Let’s say your request has to pass through three microservices, taking the following steps to complete:
- You send your request to App A
- App A calls App B, App B checks a Redis cache
- App B calls App C, App C queries a database
If you get a 403 status code back, which service generated the error? If your request normally takes 100ms but is now taking 3 seconds, which application or database slowed down? Distributed tracing systems provide the data to answer these kinds of questions.
How Distributed Tracing Works
A distributed trace is composed of a trace, the parent object, and spans, its children. In our three-service example:
- App A generates a trace ID. Internally, it creates child spans of function calls, and finally makes a request to App B. In the headers of that request, it sends the trace ID.
- App B takes the trace ID and makes its own spans of its internal calls, including around the call to Redis. It sends a request to App C, also sending the trace ID in the request headers.
- App C makes its own spans and encounters an error. It adds a key value of “error = 403” to one of those spans and responds with a 403 status code.
In each application, there is a library for a tracing backend, like Zipkin or Jaeger. All of the trace and span information is sent to the backend, where a developer can analyze it.
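To make this flow concrete, here is a minimal sketch of what App C’s side might look like in Go. It uses the OpenTelemetry API (covered later in this article) purely as an illustration of such a library; the service name `app-c`, the span name, and the `queryDatabase` helper are hypothetical, and the exporter that actually ships finished spans to Zipkin or Jaeger is assumed to be configured elsewhere at startup.

```go
package main

import (
	"context"
	"errors"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// handler plays the role of App C: it does its work inside a span and,
// when that work fails, tags the span before responding with a 403.
func handler(w http.ResponseWriter, r *http.Request) {
	// otel.Tracer returns a tracer from the globally configured provider.
	// The provider and the Zipkin/Jaeger exporter are set up once at
	// startup and are omitted from this sketch.
	tracer := otel.Tracer("app-c")

	// Start a span. The parent trace context arrives in the request
	// headers and is assumed to have been extracted into r.Context()
	// by middleware (not shown).
	ctx, span := tracer.Start(r.Context(), "query-database")
	defer span.End()

	if err := queryDatabase(ctx); err != nil {
		// Tag the span so the failure is visible in the trace view,
		// mirroring the "error = 403" key value described above.
		span.SetAttributes(attribute.Int("http.status_code", http.StatusForbidden))
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, err.Error(), http.StatusForbidden)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// queryDatabase is a hypothetical stand-in for App C's real database call.
func queryDatabase(ctx context.Context) error {
	return errors.New("permission denied")
}

func main() {
	http.HandleFunc("/", handler)
	_ = http.ListenAndServe(":8080", nil)
}
```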
Istio and Distributed Tracing
Istio, like other service meshes, provides conveniences around distributed tracing. It can be configured to recognize tracing headers and to automatically generate a span for each service in the mesh, giving you a view of your system at the service entry and exit level. This is a powerful way to get observability into a large system with minimal effort from developers.
Limitations
Since the Envoy sidecars that Istio deploys are unaware of an application’s business logic, the spans that Istio automatically creates are at the entry and exit of a request through that application. Instrumenting database calls or function calls still has to be done by the developer.
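As an illustration of that remaining work, here is a minimal sketch of App B wrapping its Redis lookup in a hand-made child span, again using the OpenTelemetry API and the go-redis client as assumed libraries; `checkCache` and the `app-b` tracer name are hypothetical.

```go
package appb

import (
	"context"

	"github.com/redis/go-redis/v9"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

// checkCache wraps a Redis GET in its own child span. Istio's automatic
// spans stop at the service boundary, so without this the time spent in
// Redis would be invisible in the trace.
func checkCache(ctx context.Context, client *redis.Client, key string) (string, error) {
	ctx, span := otel.Tracer("app-b").Start(ctx, "redis.get")
	defer span.End()

	val, err := client.Get(ctx, key).Result()
	if err != nil {
		// A real implementation might treat redis.Nil (a cache miss)
		// separately rather than marking the span as failed.
		span.SetStatus(codes.Error, err.Error())
	}
	return val, err
}
```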
Developers still have to add a little bit of code to forward the headers that tracing relies on. Because trace context travels in request headers, the application itself must pass it from each incoming request to any outgoing requests it makes; the Envoy sidecar cannot correlate the two on its own. This forwarding can be done in middleware without adding a full tracing library, and it could be wrapped in a common library your organization adds as a dependency to all apps. Istio provides a helpful list of the headers that are required (a sketch of such forwarding follows the list):
- x-request-id
- x-b3-traceid
- x-b3-spanid
- x-b3-parentspanid
- x-b3-sampled
- x-b3-flags
- b3
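For example, here is a minimal sketch, using only Go's standard library, of an App A handler that copies these headers from its inbound request onto the outbound call it makes. The downstream address `http://app-b/` is hypothetical, and in practice this copying would usually live in shared middleware or a common HTTP client wrapper rather than in every handler.

```go
package main

import (
	"io"
	"net/http"
)

// b3Headers lists the headers Istio asks applications to forward (see above).
var b3Headers = []string{
	"x-request-id",
	"x-b3-traceid",
	"x-b3-spanid",
	"x-b3-parentspanid",
	"x-b3-sampled",
	"x-b3-flags",
	"b3",
}

// copyTraceHeaders copies any tracing headers present on the inbound request
// onto an outbound request, so Envoy can stitch both hops into one trace.
func copyTraceHeaders(inbound, outbound *http.Request) {
	for _, h := range b3Headers {
		if v := inbound.Header.Get(h); v != "" {
			outbound.Header.Set(h, v)
		}
	}
}

// handler plays the role of App A: it calls the next service downstream and
// forwards the trace context it received.
func handler(w http.ResponseWriter, r *http.Request) {
	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://app-b/", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	copyTraceHeaders(r, req)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	w.WriteHeader(resp.StatusCode)
	_, _ = io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", handler)
	_ = http.ListenAndServe(":8080", nil)
}
```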
Zipkin vs Jaeger
Once you have your applications instrumented, either with tracing libraries or simply by placing them in a mesh, you need a place to collect and analyze the traces they produce. Zipkin and Jaeger provide backends that collate all the traces and spans and allow users to view them. Examples of the analysis view are available for Zipkin and Jaeger.
Choosing between these options used to be a harder decision. Zipkin and Jaeger are not just backends; they also define their own formats and protocols for trace data. However, as the distributed tracing ecosystem has matured, Jaeger has adopted compatibility with Zipkin’s protocol. Both are supported by the OpenTelemetry project. This support means teams can send traces of almost any common format to almost any major backend, and have any necessary translations done in the middle. Both projects are open source, and have similar architectures:
- A collector or multiple collectors that receive traces and spans
- A storage backend (both support Cassandra and Elasticsearch; Zipkin also supports MySQL)
- A UI for querying traces
When a team is choosing which to run as the backend for their organization’s traces, the most important consideration is ease of deployment and maintenance, which will differ from team to team. Zipkin offers easy deployment via Docker Compose, which might suit teams working directly on VM instances, while Jaeger has a Kubernetes Operator for convenient deployment to a Kubernetes cluster.
When designing your deployment, a key thing to keep in mind is that the query UI for both Jaeger and Zipkin must be kept inside a VPN-accessible network, as neither has any security on its frontend. They are designed for private networks where any developer can view the data. This might mean multiple deployments into cordoned-off networks, depending on your security posture or organizational structure.
Whichever you choose, your developers are going to be delighted to go...
- from a world where they are ssh’ing to a VM and exec’ing into a container to curl the next upstream service to see if the connection is alive,
- to a world where they are looking at a trace chart and seeing where errors are occurring.
And with the rate at which this ecosystem is maturing, you soon won’t have to choose between Jaeger and Zipkin at all, as the OpenTelemetry Collector replaces the current tracing Tower of Babel.