Improve handling of `java/lang/System` entry points in CG generation #241

johannesduesing · 2024-11-29T15:50:25Z

This PR resulted from issues discussed in #230, in particular the fact that calls to System.out.println are not found by OPAL with multiple CG algorithms when the RTJar is loaded (either interfaces-only or fully). The reasons for this are depending on the CG algorithm and configuration, but boil down to:

RTA cannot find an instantiation of PrintWriter (the type of System.out) when RTJar is loaded as interfaces only - it is simple never instantiated
Somewhere between JDK9 and JDK11, they switched from using System.initializeSystemClass (which we model as an initial entrypoint in the config) to using System.initPhase1, System.initPhase2 and System.initPhase3 - all of which are called from native code and currently not configered to be initial entry points. This is another reason why RTA - even when the RTJar is loaded fully - does not find calls to println.
Propagation-Based algorithms never found a write to System.out that initializes the field (even when RTJar is loaded fully) - the reason for this is that the fields are set in native code

Side-Note: This was never a problem when the RTJar was not loaded, because CallGraphAnalysis.unknownLibraryCall caught any calls to virtual methods (like println in this scenario).

Changes in this PR
This PR introduces some changes to tackle the problems mentioned above. This includes:

Declare initPhase1, initPhase2 and initPhase3 of type java/lang/System to be initial entry points in the config - see the OpenJDK implementation for details on this.
Always consider java/lang/Thread, java/lang/ClassLoader, java/io/PrintStream and java/io/InputStream to be instantiated via config
Improve propagation-based cg algorithms - they now always assume that System.out, System.in and System.err are instantiated with their declared types by adding them to the respective typeSetEntity - furthermore, we now always assume that the configured instantiated types are available for the ExternalWorld type set entity.

With these changes CHA, RTA and all Propagation-Based CG algorithms will now find calls to System.out.println, regardless of whether the RTJar is loaded at all, as interfaces only, or fully.

What's still to do

Find more types that should always be considered instantiated - if possible define a method for identifying those types and add them to the config
Analyze why the CFA-Algorithms both do not find calls to println and fix the issue
There is still an issue with library fields if the library is loaded as interfaces only. We will likely have no writes to those fields, as the library method bodys are not loaded, therefore propagation-based algorithms will never consider the fields instantiated. They are also not considered ExternalWorld because their definition is available, therefore they cannot be treated differently compared to "normal" project fields. Decide on whether those cases deserve special handling or not - we might just have to live with that fact. The only way of dealing with the would be something like if(isLibraryField(field) && librariesAreInterfacesOnly) considerInstatiated(typeSetEntity(field), field.fieldType)

…treams as instantiated by default

errt · 2024-11-29T16:22:44Z

Thank you for the extensive investigation and the PR. I fully agree with the additions to the config and with adding instantiated types to ExternalWorld.

I'm a little wary however with System.{in,out,err} special handling for the propagation-based CGs. Could this not lead to scenarios where RTA is less sound than the propagation-based CGs because RTA doesn't have this modeling? Also, different to changes to the config, this can never be overridden for an analysis of a Java System that might behave differently (Android maybe?). Is there another way this could be achieved, e.g., modeling the native methods that set the fields?
If we decide to keep your approach, I'd suggest to make it more general by not hardcoding the System class into it. It could also be made configurable via the config files which fields should be initialized.

For the last future work item: That is a fair point. I think we could deal with it like you suggested, though maybe rather than setting them to the field type, they could access the ExternalWorld like they would without loading the interfaces? That would also solve the issue above with System.{in,out,err}, right?

…actor config so that instantiated fields can be user supplied

johannesduesing · 2024-12-02T16:17:57Z

I think you make a good point. I restructured parts of this PR so that now:

There is an InitialInstantiatedFieldsKey and corresponding InstantiatedFieldsFinder to detect fields that shall always be considered instantiated
Very similar to the InstantiatedTypesFinder it has a configurable aspect so that users can specify fields to be considered instantiated. By default the field type will be considered only, but specific types and subtype-trees can be specified in the config.
The SoundLibraryInstantiatedFieldsFinder (implemented but not used by default) would consider all library fields to be instantiated with their declared type if libraries are interfaces only - this could address the last future work item

I think this might be a good solution to the problem - analyses for android can just remove the three lines from the configuration, and they also don't introduce any overhead of libraries are not loaded at all.

Regarding your suggestion for associating library fields with ExternalWorld if libraries are interfaces only: I had that exact solution implemented earlier, but i does not really scale to any library besides the RTJar - users would have to manually specify the types of their library fields every time they use a new / different library.

What do you think of this idea, does that work for you?

johannesduesing · 2024-12-02T16:20:04Z

Oh i forgot to mention: The rta.InstantiatedTypesAnalysis and xta.InstantiatedTypesAnalysis now both use the InitialInstantiatedFieldsKey for their respective initialization, thus (hopefully) fixing the issue that RTA could sometimes be less sound than XTA.

errt · 2024-12-03T09:16:47Z

This solution sounds reasonable. Thank you for your work!

About the ExternalWorld: true for manual specification of types. A suitable InitialInstantiatedTypesFinder would do this automatically however.

johannesduesing · 2024-12-03T13:08:05Z

My latest commit adds support for the initial intantiated fields to CFA-1-0 and CFA-1-1. For the latter i had to create dummy allocation sites as the fields are defined in native code - i don't know if that is the way to go, but it seems to work as expected.

So regarding the library fields and the ExternalWorld we now have two options:

Leave it as it currently is. We have the SoundLibraryInstantiatedFieldsFinder that can detect those fields and mark them as instantiated. All CG algorithms respect the initial instantiated fields in their respective initialization phases.
Explicitly associate library fields (if libraries are interfaces only) with the ExternalWorld set entity. This will really only work with Propagation-Based algorithms, since RTA and CFA-1-* do not work on set entities (at least i did not see them using ExternalWorld anywhere). It would require implementing a suitable InitialInstantiatedTypesFinder that would maybe rely on the InitialInstatiatedFieldsFinder. The downside would be that we would have to adapt the CG algorithm(s) when it comes to finding instantiated fields in order to encode this specific case.

Is there any specific reason you think we should use the second option? I'm currently thinking the first one should be less overhead, but maybe i'm missing something .. 🤓

errt · 2024-12-03T15:49:04Z

You can have a look at ConfiguredMethodsPointsToAnalysis, it needs to do the same for allocation sites in native or otherwise unavailable methods.

The main reason for choosing the second option would be to have fewer necessary modifications. For RTA, it would work immediately, as RTA just uses instantiated types and thus the proper InitialInstantiatedTypesFinder will automatically make it work. For the propagation-based ones it would be just the single line that adds the InitialInstantiatedTypes to the ExternalWorld that we probably want to have anyway. The InstantiatedFieldsFinders would not be necessary. For the points-to-based algorithms, it should be similar: one handling for instantiated types (that we probably need anyway if we don't yet have it) instead of that and the additional one for the fields.
So my question is, where do you see the benefit of the first option? The one that I see is that it could be more specific, in adding types only to individual fields instead of a (potentially large) set of initially instantiated types being available through all (matching - but that could still be many, e.g., Object fields) fields.

OPAL/br/src/main/scala/org/opalj/br/analyses/cg/InitialInstantiatedFieldsKey.scala

johannesduesing · 2024-12-03T16:04:21Z

There are two points why i think the first option may be the better one:

Like you said, it's more specific and should keep loss of precision to a minimum for Propagation-Based algorithms and CFA - if we load large libraries as interfaces, the ExternalWorld could become very large.
The second option does not work for fields being set in native code if the library contents are actually loaded. If we go with the second option, at least for System.{out, err, in} the callgraphs will not capture println if the RTJar is actually loaded - it works if libraries are interfaces only.

So in my mind, if we want to capture fields being set in native code - and i'm open to discussing whether or not we want that - then we need an InitialInstatiatedFieldsFinder, otherwise it won't work in the case that libraries are actually fully loaded.

That being said, if you want me to implement the second option i'll do that - justed wanted to point that out 😄

errt · 2024-12-03T16:08:48Z

Shouldn't the second point already be addressed by the configured native methods analyses? I thought it was.

johannesduesing · 2024-12-05T14:18:14Z

After discussion, we will proceed as follows:

Remove InitialInstantiatedFieldsFinder
Use configured native methods to model effects of setOut0 & others
Add more instantiated types, see here
Also model effects of initPhase1 and others so calls can be detected if RTJar is interface only

johannesduesing · 2024-12-10T10:20:04Z

After some further inspection i think this is not going to be as easy as we thought it would. I found that:

setOut0 and others are already modeled with their field-setting effects in the config - with the new entrypoints in the config as well, RTA out of the Box finds calls to System.out.println if the RTJar is loaded
I was confused that Propagation-Based Algorithms did not find those calls with the RTJar loaded. I found that the PropagationBasedCallGraphKeys use the rta.ConfiguredNativeMethodsInstantiatedTypesAnalysisScheduler (see here). This does not make sense to me, this analysis was specifically written for RTA and attaches all configured instantiated types to the Project entity - as far as my understanding of any of the propagation-based CG algorithms goes, this is conceptually not compatible with XTA & co, which use typeSetEntities to attach types to.
When loading the RTJar as interfaces only, i am facing another issue: The EntryPointsFinder does not consider methods without bodies. This means that even though i carefully configured the effects of System.initPhase1 with its instantiations and invocations, it is simply never considered by the CG framework, which expects entrypoints to have a body.

If my assessment is correct, i think the following steps would be sensible to take next:

Write a new ConfiguredNativeMethodsInstantiatedTypesAnalysis for XTA & co. I started doing this, but found that because the configuration can state that "This fields is assigned the value of the first parameter" (which is what we have for setOut0), it needs to be a lot more complex than the one for RTA, as it would need continuations and dependencies on its caller property - maybe it would be sensible to integrate it into the existing TypePropagationAnalysis for XTA? (It would be structurally very similar and use a lot of the same constructs)
Consider entrypoints even if they are empty, provided that native method configurations exist for them - unfortunately the correpsonding configuration (plus deserialization utilities) are in the TAC project while entrypoints are in BR, meaning i can't access them directly. I removed the emptiness check for debugging purposes and it did not work as expected out of the box, so this would likely require more debugging

What do you think of this? Is this the right way to go?

errt · 2024-12-10T10:35:53Z

You are right, they are configured to use the RTA class there, which doesn't seem to make much sense. Not sure why that is, I thought there already was a variant for XTA, but apparently, there isn't. So yes, it looks like we need to add that.

For the other issue, it looks like entry points should always be considered, even without a body. I don't think it is necessary that they have configuration attached, they could still just exist without any callees. They should probably exist (so you don't get spurious methods in the CG that are configured as entry point for a different setup than you analyze), but bodies should not be necessary. I wonder why removing the emptiness check didn't just work, so yes, that should be looked into.

Thank you for your investigation!

…d types analysis for XTA family

…nalysis to work for empty methods as well - same for configured call graph analysis. Currently this breaks CFA, needs to be adapted next

Add JDK 11+ initial entrypoints and instantiated types. Mark System s…

556fc2b

…treams as instantiated by default

johannesduesing added the enhancement New feature or request label Nov 29, 2024

johannesduesing self-assigned this Nov 29, 2024

Fix formatting issue

9d29f26

johannesduesing added 2 commits December 2, 2024 17:07

Introduce instantiated fields finder and project information key, ref…

b702a01

…actor config so that instantiated fields can be user supplied

Formatting, file headers

532bd64

Add initial instantiated fields to CFA_1_0 and CFA_1_1 CG analyses

990cc31

errt requested changes Dec 3, 2024

View reviewed changes

OPAL/br/src/main/scala/org/opalj/br/analyses/cg/InitialInstantiatedFieldsKey.scala Outdated Show resolved Hide resolved

Fix package declaration format

5950967

johannesduesing added 4 commits December 17, 2024 15:41

Remove InstantiatedFieldsFinder, add (first shot at) proper configure…

a8f0db1

…d types analysis for XTA family

Formatting

9f1bd02

Add headers to new files, remove unwanted extra whitespace

175f914

Make CG framework process empty entry points. Adapt XTA propagation a…

98f2fa5

…nalysis to work for empty methods as well - same for configured call graph analysis. Currently this breaks CFA, needs to be adapted next

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of `java/lang/System` entry points in CG generation #241

Improve handling of `java/lang/System` entry points in CG generation #241

johannesduesing commented Nov 29, 2024 •

edited

Loading

errt commented Nov 29, 2024

johannesduesing commented Dec 2, 2024

johannesduesing commented Dec 2, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 3, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 3, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 5, 2024

johannesduesing commented Dec 10, 2024

errt commented Dec 10, 2024

Improve handling of java/lang/System entry points in CG generation #241

Are you sure you want to change the base?

Improve handling of java/lang/System entry points in CG generation #241

Conversation

johannesduesing commented Nov 29, 2024 • edited Loading

errt commented Nov 29, 2024

johannesduesing commented Dec 2, 2024

johannesduesing commented Dec 2, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 3, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 3, 2024

errt commented Dec 3, 2024

johannesduesing commented Dec 5, 2024

johannesduesing commented Dec 10, 2024

errt commented Dec 10, 2024

Improve handling of `java/lang/System` entry points in CG generation #241

Improve handling of `java/lang/System` entry points in CG generation #241

johannesduesing commented Nov 29, 2024 •

edited

Loading