[MB-10364] Document Open Telemetry data available in AWS #51
Conversation
- StringBasedCmdsLatency
- SwapUsage

### Resources
Can you link to OTel official docs here? And/or some of the other existing documentation that was linked to above
Added some general links
#### Database SQL Statements
What're these used for?
sql.conn.query
- This is the more useful span. It is added to the request trace and records the SQL query and its duration, so we can identify long-running database queries.

sql.rows
- I don't see much data here besides the duration. My guess is that this span is reported when a query returns a result set as a cursor that is then looped through, as with paginated data. I don't think we do that in many places in the codebase, although Pop may be doing it under the hood.
You can see some of the real data in the AWS load test environment now: https://console.amazonaws-us-gov.com/cloudwatch/home?region=us-gov-west-1#servicelens:traces/1-61a67e8c-21e5141bacfa25f710da1a5b?~(query~(filter~(node~'*7e*28name*7e*27sql.conn.exec*7etype*7e*27Database*2a3a*2a3aSQL*7edataSource*7e*27xray*29~groupFilters~()))~context~())
Thanks for this question. I realized I was missing sql.conn.exec, which is for updates/inserts; sql.conn.query is apparently only for SELECT statements 🎉
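To make the split concrete, here is a toy Go sketch (not mymove code, and not how the instrumented driver actually decides) that mimics the naming described above: SELECTs run through QueryContext and show up as sql.conn.query spans, while writes run through ExecContext and show up as sql.conn.exec spans.

```go
package main

import (
	"fmt"
	"strings"
)

// spanName returns the span name an instrumented database/sql driver
// would emit for a statement, per the split discussed above.
// Illustrative helper only.
func spanName(stmt string) string {
	if strings.HasPrefix(strings.ToUpper(strings.TrimSpace(stmt)), "SELECT") {
		return "sql.conn.query" // read path: QueryContext
	}
	return "sql.conn.exec" // write path: ExecContext
}

func main() {
	fmt.Println(spanName("SELECT id FROM moves"))
	fmt.Println(spanName("UPDATE moves SET status = 'APPROVED'"))
}
```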
## Data collection

Data from the Milmove app is sent to the Open Telemetry collector, which processes the trace, metric, and log data and exports it to the services of our choice. In AWS the collector exports to the Cloudwatch and AWS X-Ray services for storage and analysis. When run locally, the data is exported to the Elastic APM Server running Elasticsearch and Kibana.
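That flow can be sketched as a collector pipeline configuration. This is a minimal illustrative sketch assuming the AWS Distro for OpenTelemetry collector and its awsxray and awsemf exporters; the actual mymove collector config may differ.

```yaml
receivers:
  otlp:            # the app sends traces/metrics here over OTLP
    protocols:
      grpc:

exporters:
  awsxray:         # traces -> AWS X-Ray
  awsemf:          # metrics -> CloudWatch (embedded metric format)

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [awsemf]
```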
Thanks for this context
Thanks for adding this! LGTM. As discussed, I don't think it's necessary to add the new trace information as part of this story PR, but we should add a story to track this.
@@ -0,0 +1,251 @@

# Open Telemetry

The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.
Nit: I think there should be a comma
Suggested change:

```diff
-The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.
+The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously, but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.
```
Awesome, thanks for putting this together. I have a few comments, though they aren't blocking. Also, not sure if it would be good to mention that we have a fork of sorts, specifically for the otelhttp package in https://github.com/trussworks/otelhttp
#### HTTP Requests

```
http.client_ip
```
I'm not sure where in the doc would be best since there's several places that mention them, but it might be good to indicate somewhere that most of these are semantic conventions defined by open telemetry:
Trace: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/
Metrics: https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/
Great idea, their specs are very readable too.
- Inuse
- Idle
- Wait Duration
Along the lines of mentioning that the others are semantic conventions, I don't know if there's value in also mentioning that these three are custom ones for us.
I did add the bit about the fork under the instrumentation section https://github.com/transcom/mymove-docs/pull/51/files#diff-9ced70abf83582d83ee306cf0105f767926ef47dfcacbc04029b563e14b424b7R59
x.x sorry, I somehow missed that...
I had a discovery story to identify what telemetry data is currently being reported in AWS. I quickly realized we had no quick-start docs for understanding our Open Telemetry implementation, so as part of my own learning I tried to write up what I discovered.
There are still PRs in draft, such as transcom/mymove#7762, that reflect different things than what is on the main branch in the mymove repo. However, this is what has been deployed to the load testing environment and is generating the data you will see in the AWS console logs.
I think we have most of the data we need to proceed with load testing. There are missing dimensions when you go to query certain metrics, such as no HTTP method to distinguish between endpoints that accept GET/POST/PUT requests. We will want to create custom traces/spans for anything external or long-running: possibly Syncada, AWS, PDF generation, etc. I added more thoughts here: https://dp3.atlassian.net/browse/MB-10364?focusedCommentId=18307
To preview this PR:
Logging into AWS to see some of this data:

```
aws-vault login transcom-gov-milmove-loadtest
```
The metrics are reported under the ECS/AWSOTel/Application namespace. Some metrics are reported under the ECS/ContainerInsights namespace instead. Due to a permissions issue, the only trace data currently being reported is for requests to health endpoints. The traces are being logged to Cloudwatch, however, even when they fail to be sent to X-Ray as proper traces. I am manually searching for these in Cloudwatch Insights for now, if folks want help with that.
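Since the trace data is currently only reaching Cloudwatch logs, a CloudWatch Logs Insights query along these lines can surface the span records. This is a starting-point sketch; the log group to select and the exact message contents depend on how the collector writes its log output.

```
fields @timestamp, @message
| filter @message like /health/
| sort @timestamp desc
| limit 20
```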
Screenshots attached to this PR: Open Telemetry Page, Cloudwatch Metrics, HTTP Trace Span, Database SQL Trace Span.