
Add Jaeger proposal #42

Merged: 7 commits into cncf:master on Sep 13, 2017

Conversation

yurishkuro

No description provided.

@caniszczyk
Contributor

Thanks @yurishkuro, requesting that the @cncf/toc and wider CNCF community look at the proposal before we formally call for a vote at the end of this month.

@bassam
Contributor

bassam commented Aug 24, 2017

@yurishkuro et al., nice to see this work being proposed to the CNCF. I have a few questions and comments:

  • is it fair to say that the scalability limit of Jaeger is a function of the backend store? Are there any other scalability limits?
  • in looking at the Cassandra schema, are any of the secondary indexes (duration_index, service_name_index, etc.) used on the collection path, or are they there to support the UI?
  • can one use a stateless load balancer in front of the collectors? are there any consistency issues with doing so? are there any inefficiencies for having traces spread across many collectors?
  • how are old traces removed? does Jaeger include any facilities for garbage collection or aging of traces?
  • can the agent be deployed at a node level and shared by multiple applications on that node?
  • how are the http endpoints for the agent and collector secured?
  • have you considered using prometheus as a backend for Jaeger? is prometheus' indexing compatible with jaeger's?
  • have you considered using gRPC instead of thrift for app->agent and agent->collector?
  • is the zipkin support legacy at this point? are there plans to remove it?

@yurishkuro
Author

great questions, @bassam

is it fair to say that the scalability limit of Jaeger is a function of the backend store? Are there any other scalability limits?

Backend storage is the primary scalability limit. Of course, if you replaced it with /dev/null there would still be some overhead from the collectors themselves processing the messages, but collectors are stateless and horizontally scalable.

in looking at the Cassandra schema, are any of the secondary indexes (duration_index, service_name_index, etc.) used on the collection path, or are they there to support the UI?

In the current version all indices are written at the same time as the main span record.
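
To illustrate (a simplified sketch only; the table and column names are illustrative, not the exact Jaeger Cassandra schema), the collection write path writes the span row and its index row(s) in the same call, with no separate or asynchronous indexing stage:

```go
package main

import (
    "log"
    "time"

    "github.com/gocql/gocql"
)

// Span is a stripped-down span model, for illustration only.
type Span struct {
    TraceID        string
    SpanID         int64
    ParentID       int64
    ServiceName    string
    OperationName  string
    StartTime      time.Time
    DurationMicros int64
}

// writeSpan sketches the collection path: the span row and its index row(s)
// are written together; indexing does not lag behind the span write.
func writeSpan(session *gocql.Session, span *Span) error {
    if err := session.Query(
        `INSERT INTO traces (trace_id, span_id, parent_id, operation_name, start_time, duration)
         VALUES (?, ?, ?, ?, ?, ?)`,
        span.TraceID, span.SpanID, span.ParentID, span.OperationName,
        span.StartTime, span.DurationMicros,
    ).Exec(); err != nil {
        return err
    }
    // The lookup index is written inline with the span itself; it exists to
    // serve queries/UI later, but the write happens on the collection path.
    return session.Query(
        `INSERT INTO service_name_index (service_name, start_time, trace_id)
         VALUES (?, ?, ?)`,
        span.ServiceName, span.StartTime, span.TraceID,
    ).Exec()
}

func main() {
    cluster := gocql.NewCluster("127.0.0.1")
    cluster.Keyspace = "jaeger" // illustrative keyspace name
    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    err = writeSpan(session, &Span{
        TraceID: "id7", SpanID: 1, ParentID: 0,
        ServiceName: "frontend", OperationName: "GET /home",
        StartTime: time.Now(), DurationMicros: 1200,
    })
    if err != nil {
        log.Fatal(err)
    }
}
```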

can one use a stateless load balancer in front of the collectors? are there any consistency issues with doing so? are there any inefficiencies for having traces spread across many collectors?

Collectors are stateless. In order to do aggregations there needs to be a stateful processing layer, which in our case is Kafka + Flink sitting downstream of the collectors.

how are old traces removed? does Jaeger include any facilities for garbage collection or aging of traces?

For Cassandra we simply rely on the TTL it supports natively. For Elasticsearch we have a script that purges old indices and needs to run on a schedule (e.g. once a day).
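
On the Cassandra side this is just the database's native TTL, so expired span rows disappear without any external job. For Elasticsearch, here is a minimal sketch of what such a scheduled purge job can look like (the daily index naming and the 7-day retention are assumptions for illustration, not the exact script we run):

```go
package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

func main() {
    // Assumptions for illustration: spans live in daily indices named
    // jaeger-span-YYYY-MM-DD and we keep 7 days of data. Run once a day.
    const esURL = "http://localhost:9200"
    const retentionDays = 7

    // Delete the index that just fell out of the retention window.
    // (A real job would also sweep up any older indices still present.)
    expired := time.Now().AddDate(0, 0, -retentionDays)
    index := fmt.Sprintf("jaeger-span-%s", expired.Format("2006-01-02"))

    req, err := http.NewRequest(http.MethodDelete, esURL+"/"+index, nil)
    if err != nil {
        log.Fatal(err)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Printf("DELETE /%s -> %s", index, resp.Status)
}
```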

can the agent be deployed at a node level and shared by multiple applications on that node?

Yes, that's precisely what it is designed for and how we run things internally. Under Kubernetes, agents run as a DaemonSet.

how are the http endpoints for the agent and collector secured?

They are not. Our decision was that there are many other ways to secure access (e.g. this blog post), so it did not need to be built into the product.

have you considered using prometheus as a backend for Jaeger? is prometheus' indexing compatible with jaeger's?

It has been suggested before, but it was not a priority for us to replace Cassandra (we've been using it for 2yrs now). It would be interesting if Prometheus could support the indexing/search use case for traces.

have you considered using gRPC instead of thrift for app->agent and agent->collector?

It's on the roadmap for agent->collector. The client libs in the app are primarily built to use UDP to avoid extra dependencies (even though they still pull in Thrift, I know).
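
For context, this is roughly what the app->agent side looks like with the Go client today (a sketch; the sampling rate and agent address are just example values): the app reports finished spans over UDP to the node-local agent, usually localhost:6831, so there is no heavy transport dependency inside the process.

```go
package main

import (
    "log"

    "github.com/opentracing/opentracing-go"
    jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Illustrative config: sample 1 in 1000 requests and report finished
    // spans over UDP to the node-local agent (6831 is the compact-Thrift
    // port). Values here are examples, not recommendations.
    cfg := jaegercfg.Configuration{
        Sampler: &jaegercfg.SamplerConfig{
            Type:  "probabilistic",
            Param: 0.001,
        },
        Reporter: &jaegercfg.ReporterConfig{
            LocalAgentHostPort: "localhost:6831",
        },
    }

    tracer, closer, err := cfg.New("my-service")
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()

    opentracing.SetGlobalTracer(tracer)

    span := tracer.StartSpan("do-something")
    span.Finish()
}
```

With the agent deployed per node (e.g. as a DaemonSet, as mentioned above), localhost or the node's host IP is all the application needs to know about the tracing backend.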

is the zipkin support legacy at this point? are there plans to remove it?

It's actually brand new, added as a migration path. To clarify, the support only covers accepting spans in Zipkin Thrift or JSON format; they are immediately converted to the Jaeger format, which is modelled after the OpenTracing spec.
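
Conceptually the conversion is just a field mapping performed at ingest time; a toy sketch (both structs below are heavily simplified stand-ins, not the real Thrift/JSON-generated types):

```go
package zipkinconv

import "time"

// zipkinSpan and jaegerSpan are simplified stand-ins that only exist to
// illustrate the shape of the mapping.
type zipkinSpan struct {
    TraceID   uint64
    ID        uint64
    ParentID  uint64
    Name      string
    Timestamp int64 // microseconds since epoch
    Duration  int64 // microseconds
}

type jaegerSpan struct {
    TraceID       uint64
    SpanID        uint64
    ParentSpanID  uint64
    OperationName string
    StartTime     time.Time
    Duration      time.Duration
}

// fromZipkin converts an incoming Zipkin-format span into the Jaeger model
// as soon as it is received; nothing is stored in Zipkin format.
func fromZipkin(z zipkinSpan) jaegerSpan {
    return jaegerSpan{
        TraceID:       z.TraceID,
        SpanID:        z.ID,
        ParentSpanID:  z.ParentID,
        OperationName: z.Name,
        StartTime:     time.Unix(0, z.Timestamp*int64(time.Microsecond)),
        Duration:      time.Duration(z.Duration) * time.Microsecond,
    }
}
```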

@bassam
Contributor

bassam commented Aug 24, 2017

In the current version all indices are written at the same time as the main span record.

I'm trying to understand which tables/indices are required for collection, vs. the ones needed to support the UI/CLI (and which can lag behind, or be built as needed). Can collection just write to the main span record?

Collectors are stateless. In order to do aggregations there needs to be a stateful processing layer, which in our case is Kafka + Flink that sit after the collectors.

Sorry, I'm not sure I understood. Consider the following case: a given trace (id7) has 3 spans (a, b, c) being sent concurrently by one agent to one or more collectors. Independent collectors might receive a, b, c out of order and might write to the backend store at different times. The schema seems to have some implicit ordering (for example, I assume the parent must be written before the child for parent_id to be correct). Who ensures this ordering when the traces and spans are being spread across stateless collectors?

@yurishkuro
Author

I'm trying to understand which tables/indices are required for collection

traces is the only table required for collection.

Who ensures this ordering when the traces and spans are being spread to stateless collectors?

Spans can be written to storage in any order. The full trace is assembled at query time by retrieving all spans with a given trace ID and arranging them in-memory according to parent-child references.
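
To make that concrete, here is a minimal sketch of the read-time assembly (simplified; the real query service also handles span references, missing spans, etc.):

```go
package main

import "fmt"

// span is a minimal stand-in for a stored span row.
type span struct {
    SpanID   string
    ParentID string // empty for the root span
    Name     string
}

// node is a span plus its resolved children.
type node struct {
    span     span
    children []*node
}

// assembleTrace takes the spans of one trace, in whatever order they were
// written, and links them into a tree using parent-child references.
// Write order never matters because linking happens only at read time.
func assembleTrace(spans []span) *node {
    nodes := make(map[string]*node, len(spans))
    for _, s := range spans {
        nodes[s.SpanID] = &node{span: s}
    }
    var root *node
    for _, n := range nodes {
        if n.span.ParentID == "" {
            root = n
            continue
        }
        if parent, ok := nodes[n.span.ParentID]; ok {
            parent.children = append(parent.children, n)
        }
        // A parent that is missing (not yet written, dropped, etc.) simply
        // leaves this span dangling; the real query service handles that.
    }
    return root
}

func main() {
    // Spans a, b, c of trace id7 arriving "out of order" - children first.
    spans := []span{
        {SpanID: "c", ParentID: "b", Name: "db-query"},
        {SpanID: "b", ParentID: "a", Name: "handler"},
        {SpanID: "a", ParentID: "", Name: "http-request"},
    }
    root := assembleTrace(spans)
    fmt.Println(root.span.Name, "->", root.children[0].span.Name)
}
```

In the id7 example above, spans a, b, c can land in storage in any order from any collector; the tree is only linked when someone queries the trace.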

@bassam
Contributor

bassam commented Aug 25, 2017

@yurishkuro thanks!

@yurishkuro changed the title from "Create Jaeger proposal" to "Add Jaeger proposal" on Aug 25, 2017
@felixbarny

We are using Jaeger at stagemonitor and we support Jaeger becoming a CNCF member project

@frankgreco

We use Jaeger at Northwestern Mutual along with OpenTracing to trace 100s of services across 1000s of developers. Jaeger has been an essential tool for us and I fully support the project joining the CNCF :)

@omeid

omeid commented Aug 26, 2017

We are using Jaeger at microcloud and support the project joining CNCF.

@caniszczyk
Contributor

Request from @monadic: can you update the project proposal with a new section that discusses how Jaeger fits in with Zipkin? Thanks!

@Dieterbe

👍 from GrafanaLabs. We've been happily using Jaeger in production for a few weeks now; it seems like the best open source distributed tracing system out there.

@mabn

mabn commented Aug 31, 2017

At Base CRM we're also using Jaeger and would be quite happy if it joined CNCF.

@tanner-bruce

At FarmersEdge we are just rolling out Jaeger alongside the other CNCF projects we use

@dmueller2001

Currently, we are testing and using Jaeger with OpenShift, our Kubernetes-based Container Platform. We have it on our road map to use it as part of the monitoring tooling to be provided with OpenShift. We chose it because it fully supports OpenTracing and has been used in production at high scale/performance within Uber for two years. We look forward to the continued evolution of distributed tracing and will continue to support and collaborate with the CNCF community on Jaeger.

@bharat-p

bharat-p commented Sep 2, 2017

We are slowly adopting Jaeger at Symantec (Elastica) for a couple of services to start, and it would be great if Jaeger becomes a CNCF member project.

CNCF efforts like the OpenTracing specification came to existence to help unify existing tracing implementations out there.

Jaeger is a battle tested distributing system that takes advantage of OpenTracing and advances the state

@bradtopol commented Sep 5, 2017


Can you provide more detail on what you mean by battle tested? At what level of scale has this tracing been tested, and what are the performance implications? In other open source cloud infrastructure communities, a monitoring project was accepted for inclusion before it could prove it could scale and before it could prove that it could work for both private and public clouds. This was a huge mistake, as that monitoring project's architecture prohibited it from working at very large scale, and the project, though accepted, was never able to meet the requirements of the overall community. For a project like this I would really recommend that you keep the bar very high and make sure the community needs are well satisfied before a project like this goes beyond incubator status.

@yurishkuro
Author

Regarding "battle tested", Jaeger has been running in production at Uber for two years, as of today tracing more than 1200 microservices. We're using fairly aggressive sampling rates (typically between 1/1000 and 1/10000 for top tier services), which make performance overhead negligible. Based on anecdotal evidence from other commercial vendors, tracing 100% of traffic may incur up to 5% of performance overhead, which can be tuned by less extensive instrumentation and further perf optimization in the tracing libraries.

Regarding the scalability of the Jaeger architecture, all layers in the Jaeger backend are horizontally scalable. Typically the bottleneck is in the storage backends (Cassandra, ES), which are also horizontally scalable. There are many avenues for improving storage backend throughput and indexing capabilities, as well as reducing the amount of data that needs to be stored in the first place, by investing in techniques like on-demand sampling, post-trace sampling, and trace discovery through aggregates rather than heavy indexing.


1200 services: does that mean 1200 different software projects, or 1200 running daemons (possibly multiple copies of the same code)? Though, does it matter? Can you shed some light on the volume of spans and traces being generated, or the volume of log lines, tags, etc., whatever it is that affects workload.

anecdotal evidence from other commercial vendors, tracing 100% of traffic may incur up to 5% of performance overhead

can you clarify whether this means commercial vendors selling jaeger-based solutions, or alternatives to jaeger? if so, do we have stats on the overhead of jaeger?

@yurishkuro
Author

1200 different services / applications / products, not just different instances; the instance count is probably in the 500,000 ballpark. Our tracing backend receives ~50K spans per second; median span size is ~500 bytes, max ~60 KB.

can you clarify whether this means commercial vendors selling jaeger-based solutions, or alternatives to jaeger? if so, do we have stats on the overhead of jaeger?

I meant commercial alternatives to Jaeger, like Lightstep, but my point was that the actual instrumentation libraries are very similar in functionality/architecture, so I am fairly comfortable extrapolating their perf overhead numbers. We don't have stats on the perf overhead of Jaeger libs at 100% sampling; it's actually a non-linear problem with many contributing factors, so raw numbers like 5% are not very meaningful without describing the exact methodology and conditions of measurement. I guess what I'm getting at is that it's always possible to limit perf overhead by applying sampling, and even with low sampling rates like ours we get such rich data about application behavior that the overhead is totally worth it.

@caniszczyk
Contributor

The project is now out for official @cncf/toc vote: https://lists.cncf.io/pipermail/cncf-toc/2017-September/001149.html

Thanks everyone for participating in the community vetting and due diligence process.

@caniszczyk merged commit 41c0735 into cncf:master on Sep 13, 2017
@caniszczyk
Contributor

Good news, the @cncf/toc has accepted Jaeger as our 12th CNCF project! :)

https://lists.cncf.io/pipermail/cncf-toc/2017-September/001204.html

Welcome!
