Go Microservices, Part 12: Distributed Tracing With Zipkin
In this part of the Go microservices blog series, we’ll explore the concept of distributed tracing and how to add it to our Go microservices.
Contents
- Overview
- Distributed tracing
- Zipkin
- Edge server - Netflix Zuul
- Go code - adding distributed tracing
- Deploy and run
- Summary
Source code
The finished source can be cloned from GitHub:
> git clone https://github.com/callistaenterprise/goblog.git
> git checkout p12
1. Overview
Here's the state of the microservices landscape when we’re finished with this part of the blog series:
Figure 1: Landscape overview for part 12.
Marked with red boxes, we see our two new supporting components: the Zuul edge server and Zipkin. We also see small boxes labeled “ta”, indicating the services where we’ve added distributed tracing.
2. Distributed tracing
Keeping track of the life of a request passing through a system (and back) isn’t exactly new. We’ve been adding request IDs, thread identifiers and user IDs to log statements for ages. However, during the transition from monolithic applications to fine-grained microservices, the complexity increases when requests are passed between microservices, to storage backends, and via messages that spawn new requests - all belonging to one and the same business transaction. How do we identify performance bottlenecks when a request is served by a large number of services, possibly relying in part on asynchronous operations completing?
While logs are very useful for this purpose, distributed tracing has emerged as an important part of a maintainable and production-ready microservices operations model. For a more in-depth explanation of the rationale and basics of distributed tracing, I suggest reading my co-worker Magnus's blog post about Spring Cloud Sleuth and distributed tracing with Zipkin.
3. Zipkin
Zipkin is an application for visualizing traces between and within services, their supporting components and even messaging. Zipkin originates from Twitter and is currently an open source project on GitHub. Zipkin provides a user-friendly GUI, while the backend takes care of collecting tracing data and aggregating it into something we humans can make sense of.
I’m using an external pre-baked container image for Zipkin that exposes port 9411 for the admin GUI. Of course, you can also build Zipkin from source, configure different storage backends etc.
A docker service create can look like this:
> docker service create --constraint node.role==manager --replicas 1 \
-p 9411:9411 --name zipkin --network my_network \
--update-delay 10s --with-registry-auth \
--update-parallelism 1 openzipkin/zipkin
The visualization of a distributed trace can look like this (borrowed from my colleague's blog post):
We’ll dig a little bit deeper into the possibilities of Zipkin when we’ve gotten our own tracing up and running.
4. Edge server - Netflix Zuul
In order to showcase how we can trace a request from start to finish, we’ll introduce an edge server capable of adding Zipkin-compatible tracing information to HTTP requests out of the box. Netflix Zuul is the default edge server of Spring Cloud / Netflix OSS.
If you’re wondering what the difference is between an edge server, a reverse proxy and a load balancer such as nginx or HAProxy, you’re probably in good company. From my point of view, an edge server such as Netflix Zuul can act as a reverse proxy, a load balancer and to some extent a security gateway. It can support your applications by, for example, routing requests to the appropriate internal service, adding correlation IDs to inbound requests or relaying certain HTTP headers such as OAuth tokens.
The “edge” in the name stems from the fact that these servers usually reside where your internal network connects to the public internet or a DMZ net - or even between different applications within your own enterprise, acting as the single way traffic may enter the logical internal network of your application.
I’ve prepared a container image pre-configured for our sample landscape with Spring Sleuth / Zipkin enabled, as well as a simple routing rule that provides an HTTPS endpoint at /api/accounts/{accountid} and forwards requests to our good ol’ underlying “accountservice” at port 6767.
Just a glimpse of some of the Zuul configuration:
# Disable Eureka service discovery, we're on Docker Swarm mode.
eureka:
  client:
    enabled: false
# Enable Zipkin support, sample all requests
spring:
  zipkin:
    baseUrl: http://zipkin:9411
  sleuth:
    sampler:
      percentage: 1.0
sample:
  zipkin:
    enabled: true
# Zuul routing rules, will create /api/accounts/ mapping to http://accountservice:6767/accounts
zuul:
  ignoredServices: "*"
  prefix: /api
  routes:
    accountservice:
      path: /accounts/**
      url: http://accountservice:6767/accounts
You can deploy a pre-baked Zuul image for this blog series using the following docker service create:
docker service create --replicas 1 --name edge-server -p 8765:8765 \
--network my_network --update-delay 10s --with-registry-auth \
--update-parallelism 1 eriklupander/edge-server
5. Go code - OpenTracing
5.1 On tracing without thread-local storage
All of this sounds great - but how do we actually add this cool tracing stuff to our Go-based microservices, and how will Zipkin get hold of our traces?
Conveniently enough, there’s a ready-to-use tracing library for us Go-nuts named opentracing-go, based on the OpenTracing standard. It works well with Spring Cloud Sleuth, which is used by Zuul and the other Spring Cloud-based support services we’re already using in this blog series.
In all honesty, Go isn’t the ideal language for this kind of thing, since there is no (usable) notion of thread-local storage in Go. Also, interceptors and AOP-style programming, which are well suited for transparently adding functionality such as distributed tracing to a call stack, aren’t natively available in Go.
However, with some careful use of the Go middleware pattern and the Go contexts introduced in Go 1.7, we can add tracing to our Go microservices in a somewhat developer-friendly way. I must admit I frown a bit upon the context pattern, where the idiomatic use is to always pass context.Context as the first parameter of each func in the call stack. Google themselves say the following in the official docs:
"At Google, we require that Go programmers pass a Context parameter as the first argument to every function on the call path between incoming and outgoing requests."
This is a somewhat controversial thing within the Go community. I know thread-locals are considered evil too, though they are very useful at times for keeping track of request-scoped information such as security tokens, user principals and of course logging/tracing IDs.
Oh well - enough of this “I dislike it but I use it anyway” stuff. Let’s start coding!
5.2 Our tracing library
Well - I wouldn’t necessarily call our little tracing.go file a library. It basically wraps some functionality of the opentracing-go library and provides a somewhat clean abstraction with a declarative API to start, stop and parse traces.
There are a few typical use cases where we need to concern ourselves with tracing info:
- Incoming HTTP requests: We look for OpenTracing correlation IDs in the HTTP headers and, if found, start a trace and put the required data structure into a Go context.
- Outbound HTTP requests: Basically the reverse of the above. We check for tracing data in our context and add it as HTTP headers to outgoing requests.
- Sending a message with AMQP: More or less the same as above, i.e. if our context contains OpenTracing IDs we stuff them into headers along with the message. Instead of HTTP headers we’re using the header abstraction provided by the AMQP protocol.
- Receiving a message over AMQP: As you’ve probably figured out already - check if there’s tracing data in a message header and, if so, extract it and start a new trace.
- Internal tracing: Talking to an external database? Spawn a child span to keep track of the amount of time used for that action. Performing a CPU-intensive operation on some data? Track this using a child span as well. There are many occasions where it makes sense to keep track of what’s going on using tracing, even within services.
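To make the first two use cases a bit more concrete, here is a minimal standard-library sketch of what "looking for correlation IDs in headers" and "adding them to outgoing requests" boils down to. It is not the code from tracing.go (the real work is done by the OpenTracing and Zipkin libraries). The traceInfo type and the extractTrace/injectTrace helpers are made up for illustration; the header names are the B3 headers that Zipkin-compatible tracers use.

```go
package main

import (
	"fmt"
	"net/http"
)

// traceInfo is a hypothetical, minimal stand-in for a span context.
type traceInfo struct {
	TraceID string
	SpanID  string
}

// B3 header names used by Zipkin-compatible tracers.
const (
	headerTraceID = "X-B3-TraceId"
	headerSpanID  = "X-B3-SpanId"
)

// extractTrace reads tracing IDs from inbound headers.
// The boolean is false when the caller did not send a trace ID.
func extractTrace(h http.Header) (traceInfo, bool) {
	traceID := h.Get(headerTraceID)
	if traceID == "" {
		return traceInfo{}, false
	}
	return traceInfo{TraceID: traceID, SpanID: h.Get(headerSpanID)}, true
}

// injectTrace writes tracing IDs onto outbound headers.
// Simplification: a real tracer would generate a fresh span ID for the
// outgoing call and send the current span ID as the parent.
func injectTrace(h http.Header, info traceInfo) {
	h.Set(headerTraceID, info.TraceID)
	h.Set(headerSpanID, info.SpanID)
}

func main() {
	// Pretend this is the request Zuul forwarded to us.
	inbound := http.Header{}
	inbound.Set(headerTraceID, "463ac35c9f6413ad")
	inbound.Set(headerSpanID, "a2fb4a1d1a96d312")

	// Pretend this is the request we are about to send to another service.
	outbound := http.Header{}
	if info, ok := extractTrace(inbound); ok {
		injectTrace(outbound, info)
	}
	fmt.Println(outbound.Get(headerTraceID))
}
```

The AMQP cases follow the same idea, with the message header table taking the place of http.Header.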
5.2.1 Initialization
Each microservice that wants to transmit tracing results to Zipkin needs to be configured to do so. For that purpose, we’re going to use zipkin-go-opentracing. The code to set this up is very simple:
var tracer opentracing.Tracer

// InitTracing connects the calling service to Zipkin
func InitTracing(zipkinURL string, serviceName string) {
	collector, err := zipkin.NewHTTPCollector(fmt.Sprintf("%s/api/v1/spans", zipkinURL))
	if err != nil {
		panic("Error connecting to zipkin server at " +
			fmt.Sprintf("%s/api/v1/spans", zipkinURL) + ". Error: " + err.Error())
	}
	tracer, err = zipkin.NewTracer(
		zipkin.NewRecorder(collector, false, "127.0.0.1:0", serviceName))
	if err != nil {
		panic("Error starting new zipkin tracer. Error: " + err.Error())
	}
}
Note the initialization of the package-scoped tracer opentracing.Tracer variable, which is the object we’ll be doing all our tracing work with. The zipkinURL actually comes from our YAML-based config files stored on GitHub and served to us over Spring Cloud Config and Viper:
zipkin_server_url: http://zipkin:9411
As you might notice, we’re using the HTTP protocol for uploading traces to Zipkin. It is probably not the most efficient protocol for this purpose; Zipkin also supports consumption of AMQP (e.g. RabbitMQ) messages.
5.2.2 Incoming HTTP requests
As previously stated, we’re going to use the middleware pattern and context.Context to work with tracing data in incoming HTTP requests. In /accountservice/services/router.go:
func NewRouter() *mux.Router {
	.... other code above ....
	router.Methods(route.Method).
		Path(route.Pattern).
		Name(route.Name).
		Handler(loadTracing(route.HandlerFunc)) // Look here!
	.... other code below ....
}

func loadTracing(next http.Handler) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		span := tracing.StartHTTPTrace(req, "GetAccount") // Start the span
		ctx := tracing.UpdateContext(req.Context(), span) // Add span to context
		next.ServeHTTP(rw, req.WithContext(ctx))          // Note next-based chaining and copy of context!!
		span.Finish()                                     // Finish the span
	})
}
What’s going on above? In NewRouter() we pass the result of loadTracing() to the Handler func of the router builder API. As the argument to loadTracing(), we’re passing the func defined in the route. This is actually the “GetAccount” func from handlers.go where we do the actual work.
This looks a lot like the interceptors and filter chains familiar from other languages and frameworks, where we “wrap” the call to a function in another function, allowing us to do stuff before and after the actual call - in this case starting a span and then finishing it once the “next” func is done. We’ll probably be adding more chaining of handlers in a later blog post where we’ll add security and auth checking to our microservices.
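If the wrapping still feels a bit abstract, the following standard-library sketch shows the before/after behaviour in isolation. The withLogging and runChain funcs are hypothetical helpers made up for this illustration; they just record the order in which things happen when two middlewares are nested around a handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withLogging is a hypothetical middleware. It runs code before and after
// the wrapped handler, the same way loadTracing starts and finishes a span.
func withLogging(name string, events *[]string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		*events = append(*events, "before "+name)
		next.ServeHTTP(rw, req)
		*events = append(*events, "after "+name)
	})
}

// runChain sends one request through two nested middlewares and
// returns the order in which the pieces were executed.
func runChain() []string {
	var events []string
	handler := http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		events = append(events, "handler")
	})
	chain := withLogging("outer", &events, withLogging("inner", &events, handler))
	chain.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/accounts/10000", nil))
	return events
}

func main() {
	for _, e := range runChain() {
		fmt.Println(e)
	}
}
```

The outermost wrapper runs first on the way in and last on the way out, which is why a tracing wrapper placed outermost measures the whole request.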
The code to start a new HTTP trace looks like this, i.e. our wrapping of opentracing-go code:
func StartHTTPTrace(r *http.Request, opName string) opentracing.Span {
	carrier := opentracing.HTTPHeadersCarrier(r.Header)                    // 1. Get hold of HTTP headers for tracing from the request.
	clientContext, err := tracer.Extract(opentracing.HTTPHeaders, carrier) // 2. Extract into a tracing context.
	if err == nil { // 3. If there was a tracing context...
		return tracer.StartSpan( // 3.1 start and return a child span of the ongoing one
			opName, ext.RPCServerOption(clientContext))
	} else {
		return tracer.StartSpan(opName) // 3.2 otherwise, start a new one from scratch
	}
}
How are we using the context to store the “tracing info”, i.e. some correlation IDs and such?
func UpdateContext(ctx context.Context, span opentracing.Span) context.Context {
	return context.WithValue(ctx, "opentracing-span", span)
}
Since contexts are immutable, we’re using context.WithValue to add the supplied span to our existing context, returning the new context. Note the ugly use of “opentracing-span” as key. I don’t particularly like this pattern with hard-coded keys, but at least it’s only the “tracing.go” code that knows about the key we’re using to fetch the current tracing span from our thread-local substitute, i.e. the context we’re passing around.
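One common way to soften the hard-coded string key is to use an unexported key type instead, so the key cannot collide with a key set by another package. The sketch below is an alternative I'm suggesting, not what tracing.go does. The fakeSpan type is a hypothetical stand-in for opentracing.Span, and withSpan, spanFrom and spanName are illustrative names.

```go
package main

import (
	"context"
	"fmt"
)

// spanKey is an unexported type. Only this package can construct it,
// so no other package can accidentally read or overwrite our value.
type spanKey struct{}

// fakeSpan is a hypothetical stand-in for opentracing.Span.
type fakeSpan struct{ Name string }

// withSpan returns a copy of ctx that carries the span.
func withSpan(ctx context.Context, s *fakeSpan) context.Context {
	return context.WithValue(ctx, spanKey{}, s)
}

// spanFrom returns the span stored in ctx, or false if there is none.
// The comma-ok assertion means a missing value never causes a panic.
func spanFrom(ctx context.Context) (*fakeSpan, bool) {
	s, ok := ctx.Value(spanKey{}).(*fakeSpan)
	return s, ok
}

// spanName stores a span in a fresh context and reads its name back.
func spanName(name string) (string, bool) {
	ctx := withSpan(context.Background(), &fakeSpan{Name: name})
	s, ok := spanFrom(ctx)
	if !ok {
		return "", false
	}
	return s.Name, true
}

func main() {
	name, ok := spanName("GetAccount")
	fmt.Println(name, ok)

	_, found := spanFrom(context.Background()) // no span stored here
	fmt.Println(found)
}
```

The callers still only see two small funcs, so the key stays an implementation detail just like before.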
5.2.3 Outgoing HTTP requests
So - let’s say our “accountservice” got tracing info when Zuul routed a request to /accounts/{accountid}. Now, we want to continue that trace when the “accountservice” performs an HTTP call to the “imageservice”. This code is quite intermingled with the circuit-breaker and retry code from the last part, but I hope it makes sense anyway:
// Note how we pass the context as 1st param and are passing the HTTP req object as a parameter too.
func PerformHTTPRequestCircuitBreaker(ctx context.Context, breakerName string, req *http.Request) ([]byte, error) {
	output := make(chan []byte, 1)                   // Hystrix stuff...
	errors := hystrix.Go(breakerName, func() error { // Hystrix stuff...
		tracing.AddTracingToReqFromContext(ctx, req) // Here!!!
		err := callWithRetries(req, output)
		return err // For hystrix, forward the err from the retrier. It's nil if OK.
	}, func(err error) error {
		return err
	})
	... some more code ...
We see that we’re calling tracing.AddTracingToReqFromContext(ctx, req):
func AddTracingToReqFromContext(ctx context.Context, req *http.Request) {
	if ctx.Value("opentracing-span") == nil { // Do nothing if no tracing context is available
		return
	}
	carrier := opentracing.HTTPHeadersCarrier(req.Header) // Create an HTTP carrier for passing tracing data on the passed request.
	err := tracer.Inject( // Inject passes span data into the HTTP headers of the request
		ctx.Value("opentracing-span").(opentracing.Span).Context(), // Note the ugly type assertion here and the use of the hard-coded key...
		opentracing.HTTPHeaders,
		carrier)
	if err != nil {
		panic("Unable to inject tracing context: " + err.Error()) // Here be dragons.
	}
}
Well - I guess the code above isn’t my finest hour, but it basically fetches the tracing span from the passed context (our substitute for thread-local storage) and writes it into the request object as HTTP headers.
5.2.4 Internal tracing
We can of course add child spans without dealing with HTTP headers - we could even pass opentracing.Span values around as parameters instead of using that ugly context.Context. A really simple use case is when we’re calling our BoltDB to fetch the account instance. It looks like this:
// Note that we're passing the context as 1st param, just as Google asks us to!
func (bc *BoltClient) QueryAccount(ctx context.Context, accountId string) (model.Account, error) {
	// Tracing code.
	span := tracing.StartChildSpanFromContext(ctx, "QueryAccount") // Start a child span of the current one,
	                                                                // named QueryAccount
	defer span.Finish() // Note the use of defer, i.e. the span won't be finished and uploaded to Zipkin until
	                    // the ongoing func has finished. (We could also put span.Finish() at the very last
	                    // line of this func.)
	account := model.Account{}
	err := bc.boltDB.View(func(tx *bolt.Tx) error {
		......... more code .........
}
See the comments for details.
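The reason defer is handy here is that it runs on every return path, including early returns on errors. Here is a small standard-library sketch of that behaviour. The timedSpan type, startSpan and this simplified queryAccount are all made up for illustration and only mimic what a real span does.

```go
package main

import (
	"fmt"
	"time"
)

// timedSpan is a hypothetical, minimal span. It only remembers its name,
// its start time and, once finished, how long it took.
type timedSpan struct {
	Name     string
	start    time.Time
	Duration time.Duration
	Finished bool
}

func startSpan(name string) *timedSpan {
	return &timedSpan{Name: name, start: time.Now()}
}

// Finish records the elapsed time. A real tracer would also hand the
// finished span over to its collector at this point.
func (s *timedSpan) Finish() {
	s.Duration = time.Since(s.start)
	s.Finished = true
}

// queryAccount simulates a database lookup wrapped in a span.
// The span is returned only so that the caller can inspect it.
func queryAccount(accountID string) (*timedSpan, error) {
	span := startSpan("QueryAccount")
	defer span.Finish() // runs on every return path, including the error one
	if accountID == "" {
		return span, fmt.Errorf("empty account id")
	}
	time.Sleep(2 * time.Millisecond) // pretend to read from the database
	return span, nil
}

func main() {
	span, err := queryAccount("10000")
	fmt.Println(span.Name, span.Finished, err)

	span, err = queryAccount("")
	fmt.Println(span.Name, span.Finished, err)
}
```

Putting span.Finish() on the last line instead would work for the happy path, but any early return added later would silently leave the span unfinished.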
A quick peek at this particular trace in Zipkin:
Yes, it’s the tiny one using 33 microseconds just below the middle, with “accountservice” as its parent. We’ll look more closely at Zipkin very soon.
There are a number of other code changes for this part of the blog series. The key changes are the introduction of context.Context as 1st parameter, the passing of trace IDs (i.e. spans) across microservices using HTTP or AMQP headers, and each microservice uploading traces to Zipkin using HTTP.
6. Deploy and run
Let’s get this show on the road, shall we? We’ve already covered deployment of Netflix Zuul and Zipkin. Also make sure you’ve checked out branch p12 of the source code repo. Given that we’ve got a working Go environment (remember GOPATH) and Docker running (don’t forget to eval “$(docker-machine env swarm-manager-0)” etc.), we can continue by rebuilding all our Go microservices using the “./copyall.sh” shell script:
> ./copyall.sh
built /users/myuser/goblog/src/github.com/callistaenterprise/goblog/accountservice
built /users/myuser/goblog/src/github.com/callistaenterprise/goblog/vipservice
... and so on ...
This should build all Go-based microservices and deploy them to our swarm. Let’s take a look at dvizz on http://192.168.99.100:6969:
Quite a few services! Time to do a few requests using curl to the edge server and see if we can get some traces into Zipkin!
6.1 Produce some traces
We’ll use curl to request /api/accounts/10000, which is the endpoint served by our Zuul edge server. Internally, the flow of requests should be like this:
- Our HTTP client only knows about the edge server and requests /api/accounts/{accountid} over HTTPS.
- Zuul routes this request to the accountservice using the logical service name “accountservice” over HTTP.
- The accountservice internally loads an Account object from its BoltDB database and then sends a message to the vipservice using AMQP.
- Next, the accountservice requests a “quote of the day” from the quotes-service.
- Finally, the accountservice requests an image URL from the imageservice.
Run a few calls using curl:
curl -k https://192.168.99.100:8765/api/accounts/10000
(The -k flag tells curl to skip certificate verification; I’m running Zuul with a self-signed cert.)
Open the Zipkin GUI at http://192.168.99.100:9411 and click the “Find traces” button:
Cool! The traces are there right away. We see that the longest request needed about 45 ms from start (in the edge server) to finish (when the edge server responded to curl). The 45 ms trace is made up of 9 spans of varying length. By clicking on the topmost trace, we can examine it in more detail:
Examine the trace above closely. If we are troubleshooting performance issues, it should be relatively straightforward to spot the most likely culprit for most of the 45 ms duration.
Remember, when reading the trace, that the topmost spans usually spend most of their time waiting for sub-services to finish. We should pay special attention to leaf operations taking a lot of time. Let’s see:
- getaccount uses 32 ms.
- getquote uses 30 ms.
- queryaccount, using 29 μs, is the BoltDB query.
- getimageurl uses 1.1 ms (and 17 μs internally), so that call is also quite cheap.
- vipservice#onmessage uses about 11 ms, but remember that we’re just sending an asynchronous message, so that execution isn’t blocking anything else.
Since the getquote span makes up 30 ms of the total 32 ms of the getaccount span, we can say with reasonable confidence that the quotes-service is the culprit.
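The same reasoning can be written down as a few lines of Go. This is only a sketch of the arithmetic, not something Zipkin exposes: the span type and the selfMicros and slowestLeaf funcs are made up, the durations are the rounded numbers from the list above, getquote is treated as a leaf for simplicity, and I'm assuming the child calls run one after another without overlapping.

```go
package main

import "fmt"

// span is a hypothetical, simplified view of a trace span: a name,
// a duration in microseconds and the spans it called synchronously.
type span struct {
	Name     string
	Micros   int64
	Children []span
}

// selfMicros returns the time spent in the span itself, i.e. its duration
// minus the durations of its direct children. It assumes the children run
// sequentially and never overlap.
func selfMicros(s span) int64 {
	self := s.Micros
	for _, c := range s.Children {
		self -= c.Micros
	}
	return self
}

// slowestLeaf walks the tree and returns the leaf span with the longest duration.
func slowestLeaf(s span) span {
	if len(s.Children) == 0 {
		return s
	}
	worst := slowestLeaf(s.Children[0])
	for _, c := range s.Children[1:] {
		if leaf := slowestLeaf(c); leaf.Micros > worst.Micros {
			worst = leaf
		}
	}
	return worst
}

// sampleTrace builds the synchronous part of the trace discussed above.
// The asynchronous vipservice#onmessage span is left out on purpose.
func sampleTrace() span {
	return span{Name: "getaccount", Micros: 32000, Children: []span{
		{Name: "queryaccount", Micros: 29},
		{Name: "getquote", Micros: 30000},
		{Name: "getimageurl", Micros: 1100},
	}}
}

func main() {
	trace := sampleTrace()
	fmt.Println("slowest leaf:", slowestLeaf(trace).Name)
	fmt.Println("self time of getaccount (μs):", selfMicros(trace))
}
```

With these rounded numbers, less than a millisecond of the 32 ms is spent in getaccount itself, which matches what the Zipkin GUI shows visually.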
(In this case, we shouldn’t blame Java. You might remember from a few blog posts back that we send a ?strength=4 query param to the quotes-service that makes it use CPU cycles artificially to simulate work.)
Needless to say, a tool such as Zipkin can be invaluable for identifying both which services are invoked when your microservices serve a request and where time is being spent.
It’s also possible to click on the individual spans for even more detail. A cool thing about OpenTracing and Zipkin is that you can attach both arbitrary key-value pairs and “log events” to spans, which then end up in Zipkin. Here we see Zuul providing some extra info for us:
Of course, we can add this kind of stuff in Go code too.
6.2 Resource usage
Let’s take a quick peek at resource usage. We’ve added quite a bit of code in the last parts for circuit breakers, tracing, configuration, logging, etc.:
CONTAINER                                    CPU %    MEM USAGE
imageservice.1.fcaax3b2coexljqs82l72sw6q     2.13%    4.496MiB
accountservice.3.ma5x5r9wzkkfippr5lg1rucce   0.22%    4.445MiB
vipservice.1.ydi9g7qg5fx6841dznzhlynk1       1.93%    3.418MiB
Our Go services are still lean.
In a few blog posts, I plan to deploy all of the above using AWS CloudFormation and Docker stack to an Amazon EC2 cluster made up of t2.micro instances. There, we will really start to notice the impact of resource-friendly services when we start to scale stuff.
7. Summary
In this part of the blog series we’ve added distributed tracing to our Go microservices and added an edge server (Zuul) and Zipkin for collecting and viewing traces.
In part 13 we’ll take a look at using Go with CockroachDB and the O/R-mapper GORM.