[中英对照]Uber 的分布式跟踪

By | 2018年10月31日

Distributed tracing is quickly becoming a must-have component in the tools that organizations use to monitor their complex, microservice-based architectures. At Uber Engineering, our open source distributed tracing system Jaeger saw large-scale internal adoption throughout 2016, integrated into hundreds of microservices and now recording thousands of traces every second. As we start the new year, here is the story of how we got here, from investigating off-the-shelf solutions like Zipkin, to why we switched from pull to push architecture, and how distributed tracing will continue to evolve in 2017.

From Monolith to Microservices

As Uber’s business has grown exponentially, so has our software architecture complexity. A little over a year ago, in fall 2015, we had around five hundred microservices. As of early 2017, we have over two thousand. This is in part due to the increasing number of business features—user-facing ones like UberEATS and UberRUSH—as well as internal functions like fraud detection, data mining, and maps processing. The other reason complexity increased was a move away from large monolithic applications to a distributed microservices architecture.

分布式跟踪正迅速成为许多组织用于监视复杂的基于微服务的架构的工具中必不可少的组件。在 Uber 工程团队中,我们的开源分布式跟踪系统 Jaeger 在整个 2016 年都实现了大规模的内部采用,集成到数百个微服务中,现在每秒能记录数千条记录。随着新一年的开始,这篇文章讲述我们如何得到下面的内容,从调查像 Zipkin 这样的现成解决方案,到为什么我们从 pull 架构切换到 push 架构,以及分布式跟踪将如何在 2017 年继续发展。

从 Monolith 到 Microservices

随着 Uber 的业务成倍增长,我们的软件架构复杂性也在增长。一年多以前的2015年秋天,我们有大约 500 个微服务。截至2017年初,我们有超过两千个微服务。这部分是由于越来越多的业务功能 —— 面向用户的业务功能,如 UberEATSUberRUSH —— 以及像欺诈检测、数据挖掘和地图处理等这样的内部功能。复杂性增加的另一个原因是从大型单片应用程序转向分布式微服务架构。

As it often happens, moving into a microservices ecosystem brings its own challenges. Among them is the loss of visibility into the system, and the complex interactions now occurring between services. Engineers at Uber know that our technology has a direct impact on people’s livelihoods. The reliability of the system is paramount, yet it is not possible without observability. Traditional monitoring tools such as metrics and distributed logging still have their place, but they often fail to provide visibility across services. This is where distributed tracing thrives.

Tracing Uber’s Beginnings

The first widely used tracing system at Uber was called Merckx, named after the fastest cyclist in the world during his time. Merckx quickly answered complex queries about Uber’s monolithic Python backend. It made queries like “find me requests where the user was logged in and the request took more than two seconds and only certain databases were used and a transaction was held open for more than 500 ms” possible. The profiling data was organized into a tree of blocks, with each block representing a certain operation or a remote call, similar to the notion of “span” in the OpenTracing API. Users could run ad hoc queries against the data stream in Kafka using command-line tools. They could also use a web UI to view predefined digests that summarized the high-level behavior of API endpoints and Celery tasks.

Merckx modeled the call graph as a tree of blocks, with each block representing an operation within the application, such as a database call, an RPC, or even a library function like parsing JSON.



首个在Uber上广泛运用的跟踪系统名为Merckx,以当时世界上最快的自行车命名。Merckx可以快速回应Uber的整个Python后台的复杂查询。他可能会处理这样的查询:“查找请求超过两秒,只使用了特定数据库,而且事务持续打开了超过500ms的请求用户所登录的位置”。产生的分析数据会组织成一个由块组成的树,每个块代表了一个特定的操作或者远程调用,类似于OpenTracing API中“span”的概念。用户可以使用命令行工具对Kafka中的数据流进行ad hoc查询。他们也可以使用web UI界面来查看预定义摘要,它概括了API终端和Celery任务的高级行为。


Merckx instrumentation was automatically applied to a number of infrastructure libraries in Python, including HTTP clients and servers, SQL queries, Redis calls, and even JSON serialization. The instrumentation recorded certain performance metrics and metadata about each operation, such as the URL for an HTTP call, or SQL query for database calls. It also captured information like how long database transactions have remained open, and which database shards and replicas were accessed.

Merckx architecture is a pull model from a stream of instrumentation data in Kafka.

The major shortcoming with Merckx was its design for the days of a monolithic API at Uber. Merckx lacked any concept of distributed context propagation. It recorded SQL queries, Redis calls, and even calls to other services, but there was no way to go more than one level deep. One other interesting Merckx limitation was that many advanced features like database transaction tracking really only worked under uWSGI, since Merckx data was stored in a global, thread-local storage. Once Uber started adopting Tornado, an asynchronous application framework for Python services, the thread-local storage was unable to represent many concurrent requests running in the same thread on Tornado’s IOLoop. We began to realize how important it was to have a solid story for keeping request state around and propagating it correctly, without relying on global variables or global state.

Merckx 自动应用于 Python 中的许多基础架构库,包括 HTTP 客户端和服务端、SQL 查询、Redis 调用,甚至 JSON 序列化。检测记录了每个操作的某些性能指标和元数据,例如 HTTP 调用的 URL 或数据库调用的 SQL 查询。它还捕获了诸如数据库事务保持开放多长时间、访问了哪些数据库碎片和副本等信息。

Merckx 架构是一个对接到 Kafka 中的数据流的拉模型。

Merckx 的主要缺点在于,它的设计是针对优步单一 API 的时代。Merckx 缺乏任何分布式上下文的概念。它记录了 SQL 查询、Redis 调用,甚至是对其他服务的调用,但无法深入到一层级别以上。另一个有趣的 Merckx 限制是,许多高级特性(如数据库事务跟踪)实际上只在 uWSGI 下工作,因为 Merckx 数据存储在全局的、线程本地的存储中。一旦 Uber 开始采用 Tornado (用于 Python 服务的异步应用程序框架),线程本地存储就无法表示在 Tornado 的 IOLoop 上同一线程中运行的许多并发请求。我们开始意识到,在不依赖全局变量或全局状态的情况下,有一个可靠的机制来保持请求状态并正确传播请求状态是多么重要。

Next, Tracing in TChannel

At the beginning of 2015, we started the development of TChannel, a network multiplexing and framing protocol for RPC. One of the design goals of the protocol was to have Dapper-style distributed tracing built into the protocol as a first-class citizen. Toward that goal, the TChannel protocol specification defined tracing fields as part of the binary format.

spanid:8 parentid:8 traceid:8 traceflags:1

field type description


int64 that identifies the current span


int64 of the previous span


int64 assigned by the original requestor


uint8 bit flags field

Tracing fields appear as part of the binary format in TChannel protocol specification.

In addition to the protocol specification, we released several open-source client libraries that implement the protocol in different languages. One of the design principles for those libraries was to have the notion of a request context that the application was expected to pass through from the server endpoints to the downstream call sites. For example, in tchannel-go, the signature to make an outbound call with JSON encoding required the context as the first argument:

func (c *Client) Call(ctx Context, method string, arg, resp interface{}) error {..}

接下来,在 TChannel 中跟踪

2015年初,我们开始开发一种用于 RPC 的网络多路复用和帧结构协议 TChannel 。协议的设计目标之一是将 Dapper 风格的分布式跟踪作为一等公民构建到协议中。为了实现这个目标,TChannel 协议规范跟踪字段定义为二进制格式的一部分。

spanid:8 parentid:8 traceid:8 traceflags:1

field type description


int64 当前 span的ID


int64 前一个 span的ID


int64 原始请求者赋予的ID


uint8 标识位

作为二进制格式的一部分出现在 TChannel 协议规范中的跟踪字段。

除了协议规范之外,我们还发布了几个开源客户端库,它们用不同的语言实现了协议。这些库的设计原则之一是具有请求上下文的概念,应用程序希望通过请求上下文从服务器端点传递到下游调用站点。例如,在 tchannel-go 中,使用 JSON 编码进行出站调用的签名需要上下文作为第一个参数:

func (c *Client) Call(ctx Context, method string, arg, resp interface{}) error {..}

The TChannel libraries encouraged application developers to write their code with distributed context propagation in mind.

The client libraries had built-in support for distributed tracing by marshalling the tracing context between the wire representation and the in-memory context object, and by creating tracing spans around service handlers and the outbound calls. Internally, the spans were represented in a format nearly identical to the Zipkin tracing system, including the use of Zipkin-specific annotations, such as “cs” (Client Send) and “cr” (Client Receive). TChannel used a tracing reporter interface to send the collected tracing spans out of process to the tracing system’s backend. The libraries came with a default reporter implementation that used TChannel itself and Hyperbahn, the discovery and routing layer, to send the spans in Thrift format to a cluster of collectors.

TChannel client libraries got us close to the working distributing tracing system Uber needed, providing the following building blocks:

  • Interprocess propagation of tracing context, in-band with the requests

  • Instrumentation API to record tracing spans

  • In-process propagation of the tracing context

  • Format and mechanism for reporting tracing data out of process to the tracing backend

TChannel 库鼓励应用程序开发人员在编写代码时考虑到分布式上下文传播。

客户端库通过在线表示和内存上下文对象之间编组跟踪上下文,以及通过围绕服务处理程序和出站调用创建跟踪跨度,为分布式跟踪提供了内置支持。在内部,span 以一种几乎与 Zipkin 跟踪系统相同的格式表示,包括使用 Zipkin 特有的注释,例如 “cs” (客户机发送)和 “cr” (客户机接收)。TChannel 使用跟踪报告器接口将收集到的跟踪跨越进程发送到跟踪系统的后端。这些库附带了一个默认的报告器实现,它使用 TChannel 本身和 Hyperbahn (发现和路由层)以 Thrift 格式将跨度发送到收集器集群。

TChannel 客户端库让我们更加接近 Uber 需要的工作分发跟踪系统,提供了以下构建模块:

  • 进程间传播跟踪上下文,在请求内部管理

  • 记录跟踪跨度的工具 API

  • 跟踪上下文的进程内传播

  • 用于将流程外的跟踪数据报告到跟踪后端的格式和机制

The only missing piece was the tracing backend itself. Both the wire format of the tracing context and the default Thrift format used by the reporter have been designed to make it very straightforward to integrate TChannel with a Zipkin backend. However, at the time the only way to send spans to Zipkin was via Scribe, and the only performant data store that Zipkin supported was Cassandra. Back then, we had no direct operational experience for either of those technologies, so we built a prototype backend that combined some custom components with the Zipkin UI to form a complete tracing system.

The architecture of the prototype backend for TChannel-generated traces was a push model with custom collectors, custom storage, and the open source Zipkin UI.

The success of distributed tracing systems at other major tech companies such as Google and Twitter was predicated on the availability of RPC frameworks, Stubby and Finagle respectively, widely used at those companies.

唯一缺失的部分是跟踪后端本身。跟踪上下文的格式和报告器使用的默认 Thift 格式都被设计成将 TChannel 与 Zipkin 后端集成起来非常简单。然而,当时将 span 发送到 Zipkin 的唯一方法是通过 Scribe ,而 Zipkin 支持的唯一性能数据存储是 Cassandra 。当时,我们对这两种技术都没有直接的操作经验,因此我们构建了一个原型后端,将一些定制组件与 Zipkin UI 结合起来,形成了一个完整的跟踪系统。

Tchannel 生成跟踪的原型后端体系结构是一个带有自定义收集器、自定义存储和开源 Zipkin UI 的 push 模型。

其他大科技公司如谷歌和 Twitter 的分布式跟踪系统的成功取决于 RPC 框架的可用性,Stubby 和 Finagle 分别在这些公司广泛使用。

Similarly, out-of-the-box tracing capabilities in TChannel were a big step forward. The deployed backend prototype started receiving traces from several dozen services right away. More services were being built using TChannel, but full-scale production rollout and widespread adoption were still problematic. The prototype backend and its Riak/Solr based storage had some issues scaling up to Uber’s traffic, and several query capabilities were missing to properly interoperate with the Zipkin UI. And despite the rapid adoption of TChannel by new services, Uber still had a large number of services not using TChannel for RPC; in fact, most of the services responsible for running the core business functions ran without TChannel. These services were implemented in four major programming languages (Node.js, Python, Go, and Java), using a variety of different frameworks for interprocess communication. This heterogeneity of the technology landscape made deploying distributed tracing at Uber a much more difficult task than at places like Google and Twitter.

Building Jaeger in New York City

The Uber NYC Engineering organization began in early 2015, with two primary teams: Observability on the infrastructure side and Uber Everything on the product side (including UberEATS and UberRUSH). Since distributed tracing is a form of production monitoring, it was a good fit for Observability.

同样,在 TChannel 中开箱即用的跟踪功能是向前迈出的一大步。部署的后端原型立即开始接收来自几十个服务的跟踪。正在使用 TChannel 构建更多的服务,但是全面的生产部署和广泛的采用仍然存在问题。原型后端和基于 Riak/Solr 的存储在 Uber 的流量上存在一些问题,并且缺少一些查询功能来与 Zipkin UI 进行适当的互操作。尽管新服务迅速采用了 TChannel ,但 Uber 仍有大量的服务未将 TChannel 用于 RPC ;事实上,大多数负责运行核心业务功能的服务在没有 TChannel 的情况下运行。这些服务是用四种主要的编程语言 (Node) 实现的。使用各种不同的框架进行进程间通信。这种技术领域的异质性使得在 Uber 部署分布式跟踪比在谷歌和 Twitter 这样的地方更难。

在纽约建造 Jaeger

Uber NYC 工程组织成立于2015年初,有两个主要团队:基础设施方面的可观察性和产品方面的一切(包括 UberEATS 和 UberRUSH )。由于分布式跟踪是生产监视的一种形式,因此它非常适合于可观察性。

We formed the Distributed Tracing team with two engineers and two objectives: transform the existing prototype into a full-scale production system, and make distributed tracing available to and adopted by all Uber microservices. We also needed a code name for the project. Naming things is one of the two hard problems in computer science, so it took us a couple weeks of brainstorming words with the themes of tracing, detectives, and hunting, until we settled on the name Jaeger (?yā-g?r), German for hunter or hunting attendant.

The NYC team already had the operational experience of running Cassandra clusters, which was the database directly supported by the Zipkin backend, so we decided to abandon the Riak/Solr based prototype. We reimplemented the collectors in Go to accept TChannel traffic and store it in Cassandra in the binary format compatible with Zipkin. This allowed us to use Zipkin web and query services without any modifications, and also provided the missing functionality of searching traces by custom tags. We have also built in a dynamically configurable multiplication factor into each collector to multiply the inbound traffic n times for the purpose of stress testing the backend with production data.

The early Jaeger architecture still relied on Zipkin UI and Zipkin storage format.

我们组建了分布式跟踪团队,有两个工程师和两个目标:将现有的原型转化为完整的生产系统,让所有的 Uber 微服务都可以使用和采用分布式跟踪。我们还需要项目的代码名。命名事物是计算机科学的两大难题中的一个,因此我们花了几周的头脑风暴和跟踪的主题,侦探,和打猎,直到我们选定了这个名字 Jaeger(?yā-g?r),德国的猎人或狩猎服务员。

NYC 团队已经有了运行 Cassandra cluster 的操作经验,而 Cassandra cluster 是 Zipkin 后端直接支持的数据库,因此我们决定放弃基于 Riak/Solr 的原型。我们重新实现了 Go 中的收集器,以接受 TChannel 通信量,并将其以与 Zipkin 兼容的二进制格式存储在 Cassandra 中。这使我们可以使用 Zipkin web 和查询服务,而无需进行任何修改,还提供了通过自定义标记搜索跟踪的缺失功能。我们还在每个收集器中构建了一个动态可配置的乘法因子,以便将入站流量乘以n倍,以便用生产数据对后端进行压力测试。

早期的 Jaeger 架构仍然依赖于 Zipkin UI 和 Zipkin 存储格式。

The second order of business was to make tracing available to all the existing services that were not using TChannel for RPC. We spent the next few months building client side libraries in Go, Java, Python, and Node.js to support instrumentation of arbitrary services, including HTTP-based ones. Even though the Zipkin backend was fairly well known and popular, it lacked a good story on the instrumentation side, especially outside of the Java/Scala ecosystem. We considered various open source instrumentation libraries, but they were maintained by different people with no guarantee of interoperability on the wire, often with completely different APIs, and most requiring Scribe or Kafka as the transport for reporting spans. We ultimately decided to write our own libraries that would be integration tested for interoperability, support the transport that we needed, and, most importantly, provide a consistent instrumentation API in different languages. All our client libraries have been build to support the OpenTracing API from inception.

在业务上排第二的是让跟踪有效,这样,所有存在的服务就不能使用由 RPC 支持的 T通道。在这之后,我们花费了几个月的时间来建立了客户端侧的库,这些库支持 Go, Java,Python 和 Node.js 来监控任意的服务(包括基于 HTTP 的服务)。尽管 Zipkin 后端相当知名且受欢迎,但它在监控方面缺乏一个好的案例,特别是在Java / Scala 生态系统之外。我们还考虑过各种开源的监控库,但是他们是由不同的人维护的,无法保证线上的互操作性,还通常使用完全不同的API,并且大多数要求使用 Scribe 或 Kafka 作为中介传输。我们最终决定编写自己的库,这些库将进行集成测试,以实现互操作性,支持我们所需的传输,最重要的是,提供不同语言的一致的监控API。我们构建了所有客户端库,以便从一开始就支持 OpenTracing API。

Another novel feature that we built into the very first versions of the client libraries was the ability to poll the tracing backend for the sampling strategy. When a service receives a request that has no tracing metadata, the tracing instrumentation usually starts a new trace for that request by generating a new random trace ID. However, most production tracing systems, especially those that have to deal with the scale of Uber, do not profile every single trace or record it in storage. Doing so would create a prohibitively large volume of traffic from the services to the tracing backend, possibly orders of magnitude larger than the actual business traffic handled by the services. Instead, most tracing systems sample only a small percentage of traces and only profile and record those sampled traces. The exact algorithm for making a sampling decision is what we call a sampling strategy. Examples of sampling strategies include:

  • Sample everything. This is useful for testing, but expensive in production!

  • A probabilistic approach, where a given trace is sampled randomly with a certain fixed probability.

  • A rate limiting approach, where X number of traces are sampled per time unit. For example, a variant of the leaky bucket algorithm might be used.


  • 采样一切。这对于测试很有用,但是在生产中很昂贵!

  • 一种基于概率的方法,对给定的轨迹以一定的固定概率随机抽样。

  • 一种速率限制方法,其中每时间单元采样X条轨迹。例如,可以使用漏桶算法的变体。

div.column{width:49.5%;display:table-cell;border:1px solid #d4d4d5;}



电子邮件地址不会被公开。 必填项已用*标注