[Feature Request] Move carbon cost calculations to backend #17044
Comments
@Renni771 is currently looking into exactly that as part of his thesis :-)
I don't think we should persist this on the backend; this is stuff you can easily do on demand and it has nothing to do with Galaxy. If this is part of some client tooling we ship with Galaxy that is fine, but otherwise we should only record raw data.
We will record raw data and try to persist it in the DB, e.g. overall CPU runtime, overall memory consumption, etc. @Renni771, maybe you can lay out your plan here once you have it, instead of creating a new issue. Thanks.
@mvdbeek, @bgruening the preliminary plan is as follows: the thesis focuses on raising awareness of green computing. The main motivation for moving the carbon emissions estimation logic to the backend is that we are considering storing an "all time" carbon emissions rating for a user. Additional features, such as calculating the carbon emissions of a history or of an entire workflow, are also planned. These totals are something we're considering persisting in the DB so they don't always need to be re-calculated "from history" on the fly, most particularly in the case of workflows.

Constantly calculating emissions values on demand isn't an issue for single jobs or histories. It only becomes one once we consider emissions for workflows, because that relies on runtime metrics which, as far as I can tell, are only available after someone has actually run the jobs at least once, so that we have access to data like CPU usage and runtime.

I understand that this logic doesn't necessarily need to live in the backend, since it can be done on demand. In the case of histories we don't need to move the logic at all, since the job metrics on which the emissions estimations are based are already on the Galaxy client. This may be a motivation to encapsulate the logic in an API endpoint so that any user can calculate estimations for "anything", really. That would decouple the logic from both the backend and the client, so we wouldn't have to ship the feature as part of Galaxy, and it would solve the issue of users not having access to the logic, which currently lives on the client.
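(For reference, the per-job estimate discussed above only needs a handful of raw metrics: runtime, allocated cores and allocated memory. Below is a minimal sketch of such a calculation, assuming a Green Algorithms-style model like the one the client implementation appears to follow; the function name and all coefficients are illustrative placeholders, not the values Galaxy actually uses.)

```python
def estimate_carbon_grams(
    runtime_seconds: float,
    cores_allocated: float,
    memory_gb: float,
    power_per_core_watts: float = 12.0,          # placeholder per-core power draw
    power_per_gb_watts: float = 0.375,           # placeholder memory power draw
    pue: float = 1.67,                           # placeholder data-centre PUE
    carbon_intensity_g_per_kwh: float = 475.0,   # placeholder grid carbon intensity
) -> float:
    """Rough gCO2e estimate for a single job from its raw runtime metrics."""
    runtime_hours = runtime_seconds / 3600.0
    power_watts = cores_allocated * power_per_core_watts + memory_gb * power_per_gb_watts
    energy_kwh = runtime_hours * power_watts * pue / 1000.0
    return energy_kwh * carbon_intensity_g_per_kwh
```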
This is a little vague. Do you agree this can be a separate service?
@mvdbeek Yes, as I understand it, exposing the logic through an API endpoint is, in that sense, providing it as a service. Unless we're misunderstanding each other: what exactly do you mean here?
I mean this doesn't have to be part of core Galaxy; this can be an external script you run.
Forgive me, I'm not familiar with the entire Galaxy architecture, but yes, I agree that we can provide the carbon estimation logic as a service in an external script. Where would this actually live in the code base?
You can create a new project on GitHub; I don't think this needs to live within the Galaxy codebase either.
As @Renni771 said, it's about creating awareness, so we think this needs to be exposed in the account of a user. Independent of carbon emissions or carbon cost, we think we need aggregated numbers of CPU-hours, memory usage, storage usage ... others? Those metrics are meaningful for users who need to estimate future resources, and also for admins and PIs. On top of those aggregated numbers we can then do some interesting things; carbon cost is just one of them.
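(A rough sketch of the kind of per-user aggregation described above, computed on demand from raw job metrics. It assumes Galaxy-style `job` and `job_metric_numeric` tables and a `runtime_seconds` metric; the exact table, column and metric names are assumptions and may not match the actual schema.)

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///galaxy")  # placeholder connection string

# Total wall-clock runtime hours per user; multiplying by an allocated-cores
# metric (e.g. 'galaxy_slots', if recorded) would give true CPU-hours.
AGGREGATE_SQL = text(
    """
    SELECT j.user_id, SUM(m.metric_value) / 3600.0 AS runtime_hours
    FROM job j
    JOIN job_metric_numeric m ON m.job_id = j.id
    WHERE m.metric_name = 'runtime_seconds'
    GROUP BY j.user_id
    ORDER BY runtime_hours DESC
    """
)

with engine.connect() as conn:
    for user_id, runtime_hours in conn.execute(AGGREGATE_SQL):
        print(user_id, round(runtime_hours, 2))
```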
I am happy if you want to improve the API and services, recording and management of resource data, etc. I don't know that I fundamentally agree with exposing this in the account of a user, and I am skeptical that we want to store aggregate data in Galaxy's database; this doesn't seem like the best fit. An aggregated number of CPU-hours, memory usage and storage usage seems like a good start that can feed any sort of external service.
I'm partial to Marius' suggestion. This could be an "add-on" to Galaxy which lives in a separate repository, like TPV does. The main difference would be that it would also have some database tables that could technically be created in the same database as Galaxy, have its own migration scripts, etc., but the code for it doesn't need to live in the same repo. If it does, is it OK to have referential integrity with the existing Job table, for example? I've not really looked into how that kind of thing is generally modelled in SQLAlchemy. This could also be a nice opportunity to explore these kinds of add-ons in general.
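(On the referential-integrity question: one way an add-on living in a separate repository could still reference Galaxy's `job` table in SQLAlchemy is to declare the foreign key by table/column name rather than by importing Galaxy's model classes. A sketch only; the add-on table and column names below are hypothetical.)

```python
from sqlalchemy import Column, ForeignKey, Integer, Numeric
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class JobCarbonEstimate(Base):
    """Hypothetical add-on table, created by the add-on's own migrations."""

    __tablename__ = "carbon_addon_job_estimate"

    id = Column(Integer, primary_key=True)
    # Referential integrity with Galaxy's existing table is expressed by naming
    # "job.id" directly, so the add-on does not need to import Galaxy's models.
    job_id = Column(Integer, ForeignKey("job.id"), index=True, nullable=False)
    estimated_co2e_grams = Column(Numeric(precision=15, scale=3))
```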
Regardless of performance, consumption information can only be calculated on demand if jobs are never deleted. Otherwise an aggregate is necessary for the information to be reliable.
Jobs are never deleted at this point, and we don't do aggregates at all in the current app, so that's something to figure out were one to go down that path.
So, if I understand this correctly, the actual calculation of carbon emissions, AWS costs, or any other processed metric should be an external service (an API outside Galaxy), and what Galaxy provides is just the raw computation data (CPU time, memory, cores, etc.), i.e. whatever Galaxy already provides (or an improved version of it, if that makes sense). Then on the client you can have "plugins" that request the raw computation metrics from Galaxy and pass them to those external services to get a result and show it to the user. I guess there is also value in "not aggregating" the raw values in any database, and instead allowing particular date ranges, etc., to be queried and aggregated on demand.
I'm not sure I understood this. I thought this could be done in the backend somehow? Having the client deal with this would mean that no other system can query this same information without repeating effort. Some separate, independent backend could do it, and how that service obtains information from Galaxy and populates its internal data (or not) would be an implementation decision, but I think it should just offer some REST API that anyone can consume?
The raw data is what you get from Galaxy in the client, for example, for a particular user, all the metrics of all the jobs run in the last month, year, etc. This raw data is now in the client store. You can then aggregate it and store the aggregation in the store too. Then you can use this data to query different external services, like the carbon emissions and AWS cost ones, to name a few (notice there is no duplication of effort other than that each service needs to use the same aggregated data from the store and return a different result), and then render the result in the UI. Just an idea; it might not be the ideal solution.
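(A sketch of the "query raw values and aggregate on demand" idea for a given date range, here from an external script rather than the client store. The endpoint paths, parameter names, header name and metric name are assumptions about Galaxy's jobs API, not verified against it.)

```python
import requests

GALAXY_URL = "https://usegalaxy.example"  # placeholder instance URL
API_KEY = "..."                           # placeholder API key


def total_runtime_seconds(date_min: str, date_max: str) -> float:
    """Sum runtime over all of a user's jobs in [date_min, date_max]."""
    headers = {"x-api-key": API_KEY}  # assumed header name
    jobs = requests.get(
        f"{GALAXY_URL}/api/jobs",
        params={"date_range_min": date_min, "date_range_max": date_max},  # assumed params
        headers=headers,
    ).json()
    total = 0.0
    for job in jobs:
        metrics = requests.get(
            f"{GALAXY_URL}/api/jobs/{job['id']}/metrics", headers=headers
        ).json()
        for metric in metrics:
            if metric.get("name") == "runtime_seconds":  # assumed metric name
                total += float(metric.get("raw_value", 0))
    return total
```

The aggregated number could then be handed to any external estimation service (carbon emissions, AWS cost, ...) without Galaxy storing any aggregate itself.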
I understand this idea, since the current carbon emissions implementation does literally this on the client, just like the AWS estimates. The general data flow is:

Job metrics endpoint (backend) <--> Job metrics plugin --> job metrics store (client) --> carbon emissions component (the CO2 emissions logic lives here)

I'm actually not opposed to the idea of querying and processing raw metrics data for histories and workflows from stores, as this is currently what Galaxy does: the metrics data is already on the client in a store. This almost suggests that keeping this logic on the client is the way to go. What do you think @davelopez, @bgruening?

Would we agree that encapsulating the carbon emissions logic as an external service outside of the client (since it has nothing to do with Galaxy core) is the preferred alternative approach? If so, what benefit does this give us, besides being able to run carbon emission estimates outside of client environments?
IMHO for now it's fine to keep the logic in the client but we should strive to make it an external service if possible. Maybe a tiny FastAPI application that admins can run as a micro-service would be enough?
Exactly those two benefits 😄
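(A minimal sketch of such a "tiny FastAPI application": a stateless micro-service that accepts raw job metrics and returns an emissions estimate. The route, field names and coefficients are illustrative placeholders, not existing Galaxy code; it could be served with e.g. uvicorn.)

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="carbon-estimate")


class JobMetrics(BaseModel):
    runtime_seconds: float
    cores_allocated: float
    memory_gb: float


class Estimate(BaseModel):
    energy_kwh: float
    co2e_grams: float


@app.post("/estimate", response_model=Estimate)
def estimate(metrics: JobMetrics) -> Estimate:
    # Placeholder coefficients: per-core power, per-GB memory power, PUE, grid intensity.
    power_watts = metrics.cores_allocated * 12.0 + metrics.memory_gb * 0.375
    energy_kwh = metrics.runtime_seconds / 3600.0 * power_watts * 1.67 / 1000.0
    return Estimate(energy_kwh=energy_kwh, co2e_grams=energy_kwh * 475.0)
```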
The carbon cost calculations are really great and very nicely done. Recently, I was trying to calculate the carbon cost of some jobs and realized that the calculations were being done on the front end. I managed to extract the code with some effort, but I think it would be really great to consider moving this to the backend, and also include a gxadmin query that could be used to figure out the carbon cost of a job.
Feature request
Related issues: #15046