Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of java/lang/System entry points in CG generation #241

Draft
wants to merge 10 commits into
base: develop
Choose a base branch
from

Conversation

johannesduesing
Copy link
Collaborator

@johannesduesing johannesduesing commented Nov 29, 2024

This PR resulted from issues discussed in #230, in particular the fact that calls to System.out.println are not found by OPAL with multiple CG algorithms when the RTJar is loaded (either interfaces-only or fully). The reasons for this are depending on the CG algorithm and configuration, but boil down to:

  • RTA cannot find an instantiation of PrintWriter (the type of System.out) when RTJar is loaded as interfaces only - it is simple never instantiated
  • Somewhere between JDK9 and JDK11, they switched from using System.initializeSystemClass (which we model as an initial entrypoint in the config) to using System.initPhase1, System.initPhase2 and System.initPhase3 - all of which are called from native code and currently not configered to be initial entry points. This is another reason why RTA - even when the RTJar is loaded fully - does not find calls to println.
  • Propagation-Based algorithms never found a write to System.out that initializes the field (even when RTJar is loaded fully) - the reason for this is that the fields are set in native code

Side-Note: This was never a problem when the RTJar was not loaded, because CallGraphAnalysis.unknownLibraryCall caught any calls to virtual methods (like println in this scenario).

Changes in this PR
This PR introduces some changes to tackle the problems mentioned above. This includes:

  • Declare initPhase1, initPhase2 and initPhase3 of type java/lang/System to be initial entry points in the config - see the OpenJDK implementation for details on this.
  • Always consider java/lang/Thread, java/lang/ClassLoader, java/io/PrintStream and java/io/InputStream to be instantiated via config
  • Improve propagation-based cg algorithms - they now always assume that System.out, System.in and System.err are instantiated with their declared types by adding them to the respective typeSetEntity - furthermore, we now always assume that the configured instantiated types are available for the ExternalWorld type set entity.

With these changes CHA, RTA and all Propagation-Based CG algorithms will now find calls to System.out.println, regardless of whether the RTJar is loaded at all, as interfaces only, or fully.

What's still to do

  • Find more types that should always be considered instantiated - if possible define a method for identifying those types and add them to the config
  • Analyze why the CFA-Algorithms both do not find calls to println and fix the issue
  • There is still an issue with library fields if the library is loaded as interfaces only. We will likely have no writes to those fields, as the library method bodys are not loaded, therefore propagation-based algorithms will never consider the fields instantiated. They are also not considered ExternalWorld because their definition is available, therefore they cannot be treated differently compared to "normal" project fields. Decide on whether those cases deserve special handling or not - we might just have to live with that fact. The only way of dealing with the would be something like if(isLibraryField(field) && librariesAreInterfacesOnly) considerInstatiated(typeSetEntity(field), field.fieldType)

@johannesduesing johannesduesing added the enhancement New feature or request label Nov 29, 2024
@johannesduesing johannesduesing self-assigned this Nov 29, 2024
@errt
Copy link
Collaborator

errt commented Nov 29, 2024

Thank you for the extensive investigation and the PR. I fully agree with the additions to the config and with adding instantiated types to ExternalWorld.

I'm a little wary however with System.{in,out,err} special handling for the propagation-based CGs. Could this not lead to scenarios where RTA is less sound than the propagation-based CGs because RTA doesn't have this modeling? Also, different to changes to the config, this can never be overridden for an analysis of a Java System that might behave differently (Android maybe?). Is there another way this could be achieved, e.g., modeling the native methods that set the fields?
If we decide to keep your approach, I'd suggest to make it more general by not hardcoding the System class into it. It could also be made configurable via the config files which fields should be initialized.

For the last future work item: That is a fair point. I think we could deal with it like you suggested, though maybe rather than setting them to the field type, they could access the ExternalWorld like they would without loading the interfaces? That would also solve the issue above with System.{in,out,err}, right?

@johannesduesing
Copy link
Collaborator Author

I think you make a good point. I restructured parts of this PR so that now:

  • There is an InitialInstantiatedFieldsKey and corresponding InstantiatedFieldsFinder to detect fields that shall always be considered instantiated
  • Very similar to the InstantiatedTypesFinder it has a configurable aspect so that users can specify fields to be considered instantiated. By default the field type will be considered only, but specific types and subtype-trees can be specified in the config.
  • The SoundLibraryInstantiatedFieldsFinder (implemented but not used by default) would consider all library fields to be instantiated with their declared type if libraries are interfaces only - this could address the last future work item

I think this might be a good solution to the problem - analyses for android can just remove the three lines from the configuration, and they also don't introduce any overhead of libraries are not loaded at all.

Regarding your suggestion for associating library fields with ExternalWorld if libraries are interfaces only: I had that exact solution implemented earlier, but i does not really scale to any library besides the RTJar - users would have to manually specify the types of their library fields every time they use a new / different library.

What do you think of this idea, does that work for you?

@johannesduesing
Copy link
Collaborator Author

Oh i forgot to mention: The rta.InstantiatedTypesAnalysis and xta.InstantiatedTypesAnalysis now both use the InitialInstantiatedFieldsKey for their respective initialization, thus (hopefully) fixing the issue that RTA could sometimes be less sound than XTA.

@errt
Copy link
Collaborator

errt commented Dec 3, 2024

This solution sounds reasonable. Thank you for your work!

About the ExternalWorld: true for manual specification of types. A suitable InitialInstantiatedTypesFinder would do this automatically however.

@johannesduesing
Copy link
Collaborator Author

My latest commit adds support for the initial intantiated fields to CFA-1-0 and CFA-1-1. For the latter i had to create dummy allocation sites as the fields are defined in native code - i don't know if that is the way to go, but it seems to work as expected.

So regarding the library fields and the ExternalWorld we now have two options:

  • Leave it as it currently is. We have the SoundLibraryInstantiatedFieldsFinder that can detect those fields and mark them as instantiated. All CG algorithms respect the initial instantiated fields in their respective initialization phases.
  • Explicitly associate library fields (if libraries are interfaces only) with the ExternalWorld set entity. This will really only work with Propagation-Based algorithms, since RTA and CFA-1-* do not work on set entities (at least i did not see them using ExternalWorld anywhere). It would require implementing a suitable InitialInstantiatedTypesFinder that would maybe rely on the InitialInstatiatedFieldsFinder. The downside would be that we would have to adapt the CG algorithm(s) when it comes to finding instantiated fields in order to encode this specific case.

Is there any specific reason you think we should use the second option? I'm currently thinking the first one should be less overhead, but maybe i'm missing something .. 🤓

@errt
Copy link
Collaborator

errt commented Dec 3, 2024

You can have a look at ConfiguredMethodsPointsToAnalysis, it needs to do the same for allocation sites in native or otherwise unavailable methods.

The main reason for choosing the second option would be to have fewer necessary modifications. For RTA, it would work immediately, as RTA just uses instantiated types and thus the proper InitialInstantiatedTypesFinder will automatically make it work. For the propagation-based ones it would be just the single line that adds the InitialInstantiatedTypes to the ExternalWorld that we probably want to have anyway. The InstantiatedFieldsFinders would not be necessary. For the points-to-based algorithms, it should be similar: one handling for instantiated types (that we probably need anyway if we don't yet have it) instead of that and the additional one for the fields.
So my question is, where do you see the benefit of the first option? The one that I see is that it could be more specific, in adding types only to individual fields instead of a (potentially large) set of initially instantiated types being available through all (matching - but that could still be many, e.g., Object fields) fields.

@johannesduesing
Copy link
Collaborator Author

There are two points why i think the first option may be the better one:

  • Like you said, it's more specific and should keep loss of precision to a minimum for Propagation-Based algorithms and CFA - if we load large libraries as interfaces, the ExternalWorld could become very large.
  • The second option does not work for fields being set in native code if the library contents are actually loaded. If we go with the second option, at least for System.{out, err, in} the callgraphs will not capture println if the RTJar is actually loaded - it works if libraries are interfaces only.

So in my mind, if we want to capture fields being set in native code - and i'm open to discussing whether or not we want that - then we need an InitialInstatiatedFieldsFinder, otherwise it won't work in the case that libraries are actually fully loaded.

That being said, if you want me to implement the second option i'll do that - justed wanted to point that out 😄

@errt
Copy link
Collaborator

errt commented Dec 3, 2024

Shouldn't the second point already be addressed by the configured native methods analyses? I thought it was.

@johannesduesing
Copy link
Collaborator Author

After discussion, we will proceed as follows:

  • Remove InitialInstantiatedFieldsFinder
  • Use configured native methods to model effects of setOut0 & others
  • Add more instantiated types, see here
  • Also model effects of initPhase1 and others so calls can be detected if RTJar is interface only

@johannesduesing
Copy link
Collaborator Author

After some further inspection i think this is not going to be as easy as we thought it would. I found that:

  • setOut0 and others are already modeled with their field-setting effects in the config - with the new entrypoints in the config as well, RTA out of the Box finds calls to System.out.println if the RTJar is loaded
  • I was confused that Propagation-Based Algorithms did not find those calls with the RTJar loaded. I found that the PropagationBasedCallGraphKeys use the rta.ConfiguredNativeMethodsInstantiatedTypesAnalysisScheduler (see here). This does not make sense to me, this analysis was specifically written for RTA and attaches all configured instantiated types to the Project entity - as far as my understanding of any of the propagation-based CG algorithms goes, this is conceptually not compatible with XTA & co, which use typeSetEntities to attach types to.
  • When loading the RTJar as interfaces only, i am facing another issue: The EntryPointsFinder does not consider methods without bodies. This means that even though i carefully configured the effects of System.initPhase1 with its instantiations and invocations, it is simply never considered by the CG framework, which expects entrypoints to have a body.

If my assessment is correct, i think the following steps would be sensible to take next:

  • Write a new ConfiguredNativeMethodsInstantiatedTypesAnalysis for XTA & co. I started doing this, but found that because the configuration can state that "This fields is assigned the value of the first parameter" (which is what we have for setOut0), it needs to be a lot more complex than the one for RTA, as it would need continuations and dependencies on its caller property - maybe it would be sensible to integrate it into the existing TypePropagationAnalysis for XTA? (It would be structurally very similar and use a lot of the same constructs)
  • Consider entrypoints even if they are empty, provided that native method configurations exist for them - unfortunately the correpsonding configuration (plus deserialization utilities) are in the TAC project while entrypoints are in BR, meaning i can't access them directly. I removed the emptiness check for debugging purposes and it did not work as expected out of the box, so this would likely require more debugging

What do you think of this? Is this the right way to go?

@errt
Copy link
Collaborator

errt commented Dec 10, 2024

You are right, they are configured to use the RTA class there, which doesn't seem to make much sense. Not sure why that is, I thought there already was a variant for XTA, but apparently, there isn't. So yes, it looks like we need to add that.

For the other issue, it looks like entry points should always be considered, even without a body. I don't think it is necessary that they have configuration attached, they could still just exist without any callees. They should probably exist (so you don't get spurious methods in the CG that are configured as entry point for a different setup than you analyze), but bodies should not be necessary. I wonder why removing the emptiness check didn't just work, so yes, that should be looked into.

Thank you for your investigation!

…nalysis to work for empty methods as well - same for configured call graph analysis. Currently this breaks CFA, needs to be adapted next
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants