We currently use quite a few AWS services and rely on AWS X-Ray for distributed tracing. It has worked very well for us: it provides an easy mechanism to record traces, visualise calls across services and analyse issues across distributed applications. It has SDKs for several languages, which makes integration straightforward.
X-Ray lets us filter traces by latency, status code, fault/error state and a lot more.
Trace Inbound HTTP Calls
We can easily configure the SDK middleware to intercept inbound requests. This adds basic information to the trace, including the HTTP method, timing and a few HTTP headers. The segment name can be derived dynamically from the Host header; the alternative is to specify a static name for the node.
We can also force sampling for a request by sending the X-Amzn-Trace-Id header with the value Sampled=1. This feature isn’t documented, but the SDK libraries typically look for this header in the incoming request.
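To make the header check concrete, here is a minimal sketch of how a middleware might read the header. It is not the SDK's own code; the function names parse_trace_header and is_force_sampled are ours, and the only assumption is that the header value is a semicolon-separated list of key=value pairs.

```python
def parse_trace_header(value: str) -> dict:
    """Split an X-Amzn-Trace-Id value such as 'Root=...;Sampled=1' into a dict."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:  # skip empty or malformed parts
            fields[key] = val
    return fields


def is_force_sampled(headers: dict) -> bool:
    """Return True when the incoming request explicitly asks to be sampled."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    fields = parse_trace_header(lowered.get("x-amzn-trace-id", ""))
    return fields.get("Sampled") == "1"


print(is_force_sampled({"X-Amzn-Trace-Id": "Sampled=1"}))
print(is_force_sampled({"Accept": "application/json"}))
```

In practice this just means adding the header to a request from curl or a test client when we want a guaranteed trace for it.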
Override Sampling Rules
By default, the first request every second and 5% of any additional requests are sampled. However, if we have routes which are triggered less often but are important to monitor, we can tweak the sampling rules based on URL pattern, service_name and HTTP method. We use this to increase the sampling rate to 100% for important low-traffic integration endpoints. See https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-nodejs-configuration.html#xray-sdk-nodejs-configuration-sampling for an example in Node.js.
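As a rough illustration, the rule lookup can be modelled as below. The dict follows the shape of a local sampling-rules document as we understand it (version 1, keyed on service_name); the /integrations/* path and the pick_rule helper are hypothetical, and the glob matching via fnmatch is a simplification rather than the SDK's actual matcher.

```python
from fnmatch import fnmatchcase

# Rules shaped like a local sampling-rules document. The path is hypothetical.
SAMPLING_RULES = {
    "version": 1,
    "rules": [
        {
            "description": "Low-traffic integration endpoints: sample everything",
            "service_name": "*",
            "http_method": "POST",
            "url_path": "/integrations/*",
            "fixed_target": 1,
            "rate": 1.0,
        }
    ],
    # Mirrors the default: first request each second, then 5% of the rest.
    "default": {"fixed_target": 1, "rate": 0.05},
}


def pick_rule(service_name, http_method, url_path, rules=SAMPLING_RULES):
    """Return the first rule whose patterns all match, else the default rule."""
    for rule in rules["rules"]:
        if (
            fnmatchcase(service_name, rule["service_name"])
            and fnmatchcase(http_method, rule["http_method"])
            and fnmatchcase(url_path, rule["url_path"])
        ):
            return rule
    return rules["default"]


print(pick_rule("api", "POST", "/integrations/partner/callback")["rate"])
print(pick_rule("api", "GET", "/health")["rate"])
```

The first matching rule wins in this sketch, so more specific rules would go before broader ones.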
With version 2.0.0 of the X-Ray SDKs, the sampling rules can be managed from the web console itself, which lets us tweak the rules at runtime without redeploying the application.
The list view shows a maximum of 1000 traces at a time. The data is retained for one month, and traces can be viewed for a time window of between 5 minutes and 6 hours.
Trace Outbound HTTP Calls
Outbound HTTP calls from the application can be traced as sub-segments. This allows us to easily find requests which exceed a specified duration or fail with a particular status code. It usually involves replacing the HTTP client with an equivalent instrumented client. See https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-nodejs-httpclients.html
Capture Custom Metadata Per Trace
Using the AWS X-Ray SDK, we can add custom metadata and annotations.
The default X-Ray HTTP instrumentation doesn’t log the query parameters. They can easily be captured in application code by using a middleware or interceptor that adds them as metadata attributes.
We use a custom requestId annotation which lets us map requests from X-Ray to those in our logs. This helps us investigate an issue further.
Annotations are more powerful since they are searchable via filter expressions and also allow us to group traces by an annotation. Metadata, however, is only displayed with each trace and cannot be searched.
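The split between the two can be sketched as follows. This is not SDK code: trace_fields and annotation_filter are hypothetical helpers, the "default" metadata namespace is an assumption about how the segment document is laid out, and the URL and request id are made up.

```python
from urllib.parse import parse_qs, urlsplit


def trace_fields(url: str, request_id: str) -> dict:
    """Split request details into searchable annotations and display-only metadata."""
    query = parse_qs(urlsplit(url).query)
    return {
        # Indexed, so usable in filter expressions and for grouping.
        "annotations": {"requestId": request_id},
        # Not indexed; only shown when a single trace is opened.
        "metadata": {"default": {"query": query}},
    }


def annotation_filter(key: str, value: str) -> str:
    """Build a filter expression that finds traces by annotation value."""
    return f'annotation.{key} = "{value}"'


fields = trace_fields("https://example.com/orders?status=open&page=2", "req-42")
print(fields["metadata"]["default"]["query"])
print(annotation_filter("requestId", "req-42"))
```

With a request id taken from a log line, the generated expression can be pasted into the trace search box to jump straight to the matching trace.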
Capture Custom Sub-Segments
We use custom sub-segments to capture any background tasks executed in a separate thread. This makes it easy to visualise the time taken for background processing.
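The idea is roughly the one below. This is a stand-in that only records timings in a sub-segment-shaped dict; it does not talk to X-Ray, and the subsegment context manager, the recorded list and the rebuild-cache task are all hypothetical. With the real SDK, the equivalent begin/end sub-segment calls would take the place of the context manager.

```python
import secrets
import threading
import time
from contextlib import contextmanager

recorded = []  # stand-in for sub-segments that would be sent to X-Ray


@contextmanager
def subsegment(name: str):
    """Time a block of work and record it in a sub-segment-shaped dict."""
    doc = {"name": name, "id": secrets.token_hex(8), "start_time": time.time()}
    try:
        yield doc
    except Exception:
        doc["fault"] = True  # mark the sub-segment as failed, then re-raise
        raise
    finally:
        doc["end_time"] = time.time()
        recorded.append(doc)


def rebuild_cache():
    with subsegment("rebuild-cache"):
        time.sleep(0.01)  # placeholder for the real background work


worker = threading.Thread(target=rebuild_cache)
worker.start()
worker.join()
print(recorded[0]["name"], recorded[0]["end_time"] - recorded[0]["start_time"])
```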
Trace Calls to AWS Services
The X-Ray SDK automatically captures calls to other AWS services. The attributes reported can be customised using a whitelist file. In our case, knowing the path to the S3 file is useful, and the whitelist lets us add it. For an example of the format, see the Node.js SDK's default whitelist at https://github.com/aws/aws-xray-sdk-node/blob/master/packages/core/lib/resources/aws_whitelist.json; in the Java SDK, a custom whitelist can be set from a TracingHandler. With the 2.x versions of the X-Ray Java SDK, the S3 file path is tracked automatically using the same mechanism.
SQL queries associated with an application can be traced by using an instrumented database client. We don’t use this currently, but it could be quite useful.
Colorise the X-Ray Service Map
Lastly, the default X-Ray service map only highlights nodes with throttled or failed requests. We use a lot of AWS components, and this makes it difficult to identify hot spots. I wanted a color scale based on requests/sec per node, so I ended up writing a Chrome extension for it.
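A color scale of this kind can be sketched as a simple linear blend from green to red. This is not the extension's actual code; rate_to_color, the node names and the request rates below are all hypothetical.

```python
def rate_to_color(requests_per_sec: float, max_rate: float) -> str:
    """Map a node's request rate to a hex color from green (idle) to red (hot)."""
    if max_rate <= 0:
        return "#00ff00"
    # Clamp to [0, 1] so outliers cannot produce invalid channel values.
    ratio = min(max(requests_per_sec / max_rate, 0.0), 1.0)
    red = round(255 * ratio)
    green = round(255 * (1 - ratio))
    return f"#{red:02x}{green:02x}00"


# Hypothetical requests/sec per service map node.
node_rates = {"api-gateway": 120.0, "orders-lambda": 45.0, "s3-bucket": 3.0}
busiest = max(node_rates.values())
for node, rate in node_rates.items():
    print(node, rate_to_color(rate, busiest))
```

Scaling against the busiest node keeps the hottest component fully red regardless of absolute traffic; a logarithmic scale might work better when rates differ by orders of magnitude.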
The service map currently cannot be combined with filter queries. That would have been quite useful for showing only the nodes invoked in a particular flow or application.
Overall, AWS X-Ray provides us with insights into our services without the hassle of maintaining the corresponding infrastructure.