<img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=1110556&amp;fmt=gif">
Skip to content
    April 29, 2021

    What are Zipkin and Jaeger, and How Does Istio Use Them?

    Key Takeaways
    • Leverage Istio to gain a powerful way to establish observability of large systems with minimal development effort.
    • Implement tools like Zipkin and Jaeger for distributed tracing to visualize and analyze application performance issues effectively.
    • Factor in ease of deployability when choosing between Jaeger and Zipkin.

    Service meshes like Istio have changed the world of observability. They provide the fastest path to generating the critical metrics and traces that enable software reliability teams to find bugs and bottlenecks in a system. Zipkin and Jaeger are implementations of distributed tracing, and Istio uses them to provide observability into requests throughout a system of microservices. Let’s explore what distributed tracing is, why you would want it, how Istio uses it, and the differences between Zipkin and Jaeger as backends for your traces.

    Distributed Tracing

    In a monolithic application, when an error occurs, you usually already have a trace to follow: the stack trace. Because the entire lifetime of a request or transaction is owned by the one application, you get a full view of what happened in that transaction. In addition, profiling libraries can tell you how long a particular function or database call took.

    But what about in a distributed system? Let’s say your request has to pass through three microservices, taking the following steps to complete:

    1. You send your request to App A
    2. App A calls App B, App B checks a Redis cache
    3. App B calls App C, App C queries a database

    If you get a 403 status code back, which service generated the error? If your request normally takes 100ms but is now taking 3 seconds, which application or database slowed down? Distributed tracing systems provide the data to answer these kinds of questions.

    How Distributed Tracing Works

    A distributed trace is composed of a parent object, the trace, and child objects, the spans. In our three-service example:

    1. App A generates a trace ID. Internally, it creates child spans of function calls, and finally makes a request to App B. In the headers of that request, it sends the trace ID.
    2. App B takes the trace ID and makes its own spans of its internal calls, including around the call to Redis. It sends a request to App C, also sending the trace ID in the request headers.
    3. App C makes its own spans, and encounters an error. To one of its spans, it adds a key-value pair of “error = 403” and responds with the 403 status code.

    In each application, there is a library for a tracing backend, like Zipkin or Jaeger. All of the trace and span information is sent to the backend, where a developer can analyze it.
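    To make step 3 concrete, here is a minimal sketch, assuming the OpenTelemetry Go SDK (which can export to both Zipkin and Jaeger) has already been configured with an exporter; the tracer name, span name, and runQuery stub are illustrative stand-ins, not code from any particular backend:

```go
package appc

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// queryDatabase is roughly what App C does in step 3: it starts a child span
// from the incoming context (which carries the trace ID propagated from App B),
// and records the failure on the span before closing it.
func queryDatabase(ctx context.Context) error {
	ctx, span := otel.Tracer("app-c").Start(ctx, "db.query")
	defer span.End()

	if err := runQuery(ctx); err != nil {
		span.SetAttributes(attribute.Int("error", 403)) // the “error = 403” tag
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

// runQuery stands in for the real database call (hypothetical).
func runQuery(ctx context.Context) error {
	return errors.New("permission denied")
}
```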

    Istio and Distributed Tracing

    Istio, like other service meshes, provides convenience around distributed tracing, minimizing the work developers have to put in to get the benefits. Istio can be configured to recognize tracing headers, and automatically generate a span for each service in the mesh, giving you a view of your system at the service entry/exit level. This functionality offers a powerful way to get observability of a large system with minimal effort from developers.

    Limitations

    Since the Envoy sidecars that Istio deploys are unaware of an application’s business logic, the spans that Istio automatically creates are at the entry and exit of a request through that application. Instrumenting database calls or function calls still has to be done by the developer.

    There is still a little bit of code developers have to add: because tracing relies on request headers, the trace context must be forwarded from each inbound request to every outbound request, even with an Envoy sidecar in place. This forwarding can be done in middleware without adding a full tracing library, and could be wrapped in a common library your organization adds as a dependency to all apps, for example (see the sketch after this list). Istio provides a helpful list of the headers that must be forwarded:

    • x-request-id
    • x-b3-traceid
    • x-b3-spanid
    • x-b3-parentspanid
    • x-b3-sampled
    • x-b3-flags
    • b3
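    Here is a minimal sketch of that forwarding in Go; the handler, service names, and upstream URL are hypothetical, and real middleware would apply this to every outbound call rather than a single hardcoded one:

```go
package traceprop

import (
	"io"
	"net/http"
)

// b3Headers are the tracing headers Istio asks each service to propagate.
var b3Headers = []string{
	"x-request-id", "x-b3-traceid", "x-b3-spanid",
	"x-b3-parentspanid", "x-b3-sampled", "x-b3-flags", "b3",
}

// callUpstream calls the next service, copying any tracing headers from the
// inbound request so Envoy can stitch both services into the same trace.
func callUpstream(inbound *http.Request, upstreamURL string) (*http.Response, error) {
	out, err := http.NewRequestWithContext(inbound.Context(), http.MethodGet, upstreamURL, nil)
	if err != nil {
		return nil, err
	}
	for _, h := range b3Headers {
		if v := inbound.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
	return http.DefaultClient.Do(out)
}

// Handler shows the forwarding in use: App B receiving a request and
// calling App C (a hypothetical upstream).
func Handler(w http.ResponseWriter, r *http.Request) {
	resp, err := callUpstream(r, "http://app-c:8080/query")
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}
```

    Because Envoy reads these headers on both sides of the call, this handful of lines is all the application itself needs for Istio to join App B and App C into the same trace.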

    Zipkin vs Jaeger

    Once you have your applications instrumented, either with tracing libraries or simply by placing them in a mesh, you need somewhere to collect and analyze the traces. Zipkin and Jaeger provide backends that collate all the traces and spans and allow users to view them. Examples of the analysis view are available for Zipkin and Jaeger.

    Choosing between these options used to be a harder decision. Zipkin and Jaeger are not just backends; they also define their own formats and protocols for trace data. However, as the distributed tracing ecosystem has matured, Jaeger has adopted compatibility with Zipkin’s protocol. Both are supported by the OpenTelemetry project. This support means teams can send traces of almost any common format to almost any major backend, and have any necessary translations done in the middle. Both projects are open source, and have similar architectures:

    • A collector or multiple collectors that receive traces and spans
    • A storage backend (both support Cassandra and Elasticsearch; Zipkin also supports MySQL)
    • A UI for querying traces
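    As a sketch of that interchangeability, here is how an application using the OpenTelemetry Go SDK might point its traces at a Zipkin collector; the endpoint assumes a local Zipkin instance on its default port, and a Jaeger exporter could be swapped in without touching any instrumentation code:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/zipkin"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Export to a Zipkin collector on its default port; only this exporter
	// line changes if the team later moves to Jaeger.
	exp, err := zipkin.New("http://localhost:9411/api/v2/spans")
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	defer tp.Shutdown(context.Background())
	// ... create tracers and spans as usual ...
}
```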

    When a team is choosing which to run as a backend for their organization’s traces, the most important consideration is ease of deployment and maintenance, which will differ from team to team. Zipkin offers easy deployment via Docker Compose, which might suit teams working directly on VM instances, while Jaeger has a Kubernetes Operator for convenient deployment to a Kubernetes cluster.

    When designing your deployment, a key thing to keep in mind is that the query UIs for both Jaeger and Zipkin must be kept inside a VPN-accessible network, as neither has any security on its frontend. They are designed for private networks where any developer can view the data. This might mean multiple deployments into cordoned-off networks, depending on your security posture or organizational structure.

    Whichever you choose, your developers are going to be delighted to go...

    • from a world where they are ssh’ing to a VM and exec’ing into a container to curl the next upstream service to see if the connection is alive,
    • to a world where they are looking at a trace chart and seeing where errors are occurring.

    And at the rate this ecosystem is maturing, you soon won’t have to choose between Jaeger and Zipkin at all, as the OpenTelemetry Collector replaces the current tracing Tower of Babel.

    Tag(s): AIOps

    David Sudia

    Dave Sudia is an educator, turned developer, turned DevOps engineer. He's passionate about supporting other developers in doing their best work by making sure they have the right tools and environments. In his day-to-day, he's responsible for managing Kubernetes clusters, deploying databases, writing utility apps, and...
