Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MB-10364] Document Open Telemetry data available in AWS #51

Conversation

duncan-truss
Copy link
Contributor

@duncan-truss duncan-truss commented Nov 30, 2021

I had a discovery story to identify what telemetry data is currently being reported in AWS. I quickly realized we had no quick start docs to understanding our Open Telemetry implementation so as part of my own learning I had tried to write up what I discovered.

There are still PRs in draft, such as transcom/mymove#7762 that reflect different things than are on the main branch in the mymove repo. However this is what has been deployed to the load testing environment and is generating the data you will see in the AWS console logs.

I think we have most of the data we need to proceed with load testing. There are missing dimensions when you go to query certain metrics such as no http method to distinguish between endpoints that accept GET/POST/PUT requests. We will want to create custom traces/spans for anything external or long running possibly Syncada, AWS, PDF Generation, etc ... I added more thoughts here https://dp3.atlassian.net/browse/MB-10364?focusedCommentId=18307

To preview this PR:

yarn start
  1. Visit localhost:4000
  2. Find the backend guides section and click on the Open Telemetry page.

Logging into AWS to see some of this data

  1. From your terminal log into the AWS console with aws-vault login transcom-gov-milmove-loadtest
  2. Select the Cloudwatch Service
  3. From here you can select Metrics from the sidebar
  4. For Milmove application metrics you will want to view the ECS/AWSOTel/Application namespace.
  5. To view ContainerInsight metrics you will want the ECS/ContainerInsights namespace instead.

The only trace data being reported currently are requests to health endpoints due to a permissions issue. However they are being logged to Cloudwatch, even if they fail to be sent to X-Ray as proper traces. I am manually searching for these in Cloudwatch insights currently if folks want help with that.

Open Telemetry Page
Screen Shot 2021-11-30 at 11 00 22 AM

Cloudwatch Metrics
Screen Shot 2021-11-30 at 10 36 55 AM

HTTP Trace Span
Screen Shot 2021-11-30 at 10 49 29 AM

Database SQL Trace Span
Screen Shot 2021-11-30 at 10 50 10 AM

@duncan-truss duncan-truss self-assigned this Nov 30, 2021
@duncan-truss duncan-truss changed the title [mb-10364] Document Open Telemetry data available in AWS [MB-10364] Document Open Telemetry data available in AWS Nov 30, 2021
- StringBasedCmdsLatency
- SwapUsage

### Resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to OTel official docs here? And/or some of the other existing documentation that was linked to above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some general links

```

#### Database SQL Statements

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What're these used for?

Copy link
Contributor Author

@duncan-truss duncan-truss Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sql.conn.query - This is the more useful span that will be added to the trace request that logs the sql query and the duration so we could identify long running database queries.

sql.rows - I don't see much data here besides the duration. My guess is that this reports when you return a result set of records from a query as a cursor then loop through them like for paginated data. I don't think we're doing that too many places in the codebase although Pop may be doing it under the hood.

You can see some of the real data in AWS loadtest now https://console.amazonaws-us-gov.com/cloudwatch/home?region=us-gov-west-1#servicelens:traces/1-61a67e8c-21e5141bacfa25f710da1a5b?~(query~(filter~(node~'*7e*28name*7e*27sql.conn.exec*7etype*7e*27Database*2a3a*2a3aSQL*7edataSource*7e*27xray*29~groupFilters~()))~context~())

Screen Shot 2021-11-30 at 4 18 31 PM

Copy link
Contributor Author

@duncan-truss duncan-truss Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this question, I realized I was missing sql.conn.exec which are for updates/inserts and sql.conn.query are only selects statements apparently 🎉

## Data collection

Data from the Milmove app is sent to the Open Telemetry collector, which processes the trace, metric, and log data and exports it to the services of our choice. In AWS the collector exports to the Cloudwatch and AWS X-Ray services for storage and analysis. When run locally the data is exported to the Elastic APM Server running Elasticsearch and Kibana.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this context

Copy link
Contributor

@ronaktruss ronaktruss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! LGTM. As discussed, I don't think it's necessary to add the new trace information as part of this story PR, but we should add a story to track this.

@@ -0,0 +1,251 @@
# Open Telemetry

The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think there should be a comma

Suggested change
The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.
The Milmove app has had aspects of logging (Zap), tracing (trace middleware), and monitoring (AWS infra dashboards) previously, but there is now an [ADR to use the Open Telemetry library](https://github.com/transcom/mymove/blob/master/docs/adr/0061-use-opentelemetry-for-distributed-tracing.md) to standardize our efforts. While not solely useful just for load testing, it did expose our need for better insight into the performance of the Milmove app and it's services.

Copy link
Contributor

@felipe-lee felipe-lee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for putting this together. I have a few comments, though they aren't blocking. Also, not sure if it would be good to mention that we have a fork of sorts, specifically for the otelhttp package in https://github.com/trussworks/otelhttp

#### HTTP Requests

```
http.client_ip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where in the doc would be best since there's several places that mention them, but it might be good to indicate somewhere that most of these are semantic conventions defined by open telemetry:

Trace: https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/
Metrics: https://opentelemetry.io/docs/reference/specification/metrics/semantic_conventions/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, their specs are very readable too.

Comment on lines +155 to +157
- Inuse
- Idle
- Wait Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Along the lines of mentioning the others are semantic conventions, idk if there's value in also mentioning these three are custom ones for us.

@duncan-truss
Copy link
Contributor Author

Awesome, thanks for putting this together. I have a few comments, though they aren't blocking. Also, not sure if it would be good to mention that we have a fork of sorts, specifically for the otelhttp package in https://github.com/trussworks/otelhttp

I did add the bit about the fork under the instrumentation section https://github.com/transcom/mymove-docs/pull/51/files#diff-9ced70abf83582d83ee306cf0105f767926ef47dfcacbc04029b563e14b424b7R59

@felipe-lee
Copy link
Contributor

Awesome, thanks for putting this together. I have a few comments, though they aren't blocking. Also, not sure if it would be good to mention that we have a fork of sorts, specifically for the otelhttp package in https://github.com/trussworks/otelhttp

I did add the bit about the fork under the instrumentation section https://github.com/transcom/mymove-docs/pull/51/files#diff-9ced70abf83582d83ee306cf0105f767926ef47dfcacbc04029b563e14b424b7R59

x.x sorry, I somehow missed that...

@duncan-truss duncan-truss merged commit 6275e1a into main Dec 1, 2021
@duncan-truss duncan-truss deleted the ds-mb-MB-10364-document-open-telemetry-data-available-in-aws branch December 1, 2021 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants