Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epoch and lfc become unusable for programs with large (>500) banks and multiports #433

Closed
cmnrd opened this issue Jul 26, 2021 · 17 comments
Closed
Assignees
Labels
bug Something isn't working compiler diagrams Problems with diagram synthesis

Comments

@cmnrd
Copy link
Collaborator

cmnrd commented Jul 26, 2021

While experimenting with the banking benchmark, which uses by default 1000 instances of a reactor, I noticed that the IDE becomes unusable for such a large number of reactor instances. The IDE simply hangs and after a while shows an error indicating that there is no more memory. lfc shows the same behavior... I assume that this originates from the fact that we unroll the complete instantiation graph and create an object for every single reactor instance, but I am still wondering why the memory footprint is so high. Maybe there is a memory leak of some sort.

The error can be reproduced using the following simple program:

target Cpp;

reactor Node(bank_index: size_t(0), num_nodes: size_t(4)) {
    input[num_nodes] in: size_t
    output[num_nodes] out: size_t
}

main reactor(num_nodes: size_t(1000)) {
    nodes = new[num_nodes] Node(num_nodes=num_nodes);
    nodes.out -> nodes.in;
}

Note that depending on your available memory, you might need to increase num_nodes to trigger the error. While experimenting, I had the impression that the memory footprint does not scale linearly with the number of nodes.

@cmnrd cmnrd added bug Something isn't working compiler diagrams Problems with diagram synthesis labels Jul 26, 2021
@edwardalee
Copy link
Collaborator

In looking at the graph classes that we are using, e.g. for detecting cycles, there are quite a few inefficiencies there. For example, to obtain a list of nodes in the graph, the code currently creates a new list on the heap and copies two lists into it to make one list. It does this fresh each time you obtain the list. Also, the graph instances duplicate information that is already in the ReactorInstance, ReactionInstance, etc. classes. I saw quite a few other potential problems. I'm not sure that is the culprit, but graphs of size 1000 are not big, so I think we have a problem if we can't handle those.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jul 27, 2021

Do we need to expand the graph in order to do the cycle analysis and other steps? I find it interesting that this works fine for num_nodes=10, but not for num_nodes=1000 although the visual representation is exactly the same. Maybe we should treat a bank of reactors not as N instances, but as one 'virtual instance'. For cycle analysis, it should be sufficient to use one representative. There is probably no need to repeat the analysis for every instance.

@edwardalee
Copy link
Collaborator

Interesting conjecture. Unfortunately, below is a counterexample. As given below, there is no cycle. However, if you change this line:

    s2 = new Split(delay = 2, nodelay = 2);

to

    s2 = new Split(delay = 1, nodelay = 3);

then you get a cycle.

This counterexample also shows a bug in the validator, which fails to find the cycle. Instead, the cycle gets found later and an internal error is thrown (an exception that a user should never see).

Screen Shot 2021-07-27 at 4 02 57 AM

target C;
reactor Base {
    input in:int;
    output out:int;
}
reactor NoDelay extends Base {
    reaction(in) -> out {= =}
}
reactor Delay extends Base {
    a = new NoDelay();
    b = new NoDelay();
    in -> a.in;
    a.out -> b.in after 1 msec;
    b.out -> out;
}
reactor Split(delay:int(2), nodelay:int(2)) {
    input[4] in:int;
    output[4] out:int;
    b1 = new[delay] Delay();
    b2 = new[nodelay] NoDelay();
    in -> b1.in, b2.in;
    b1.out, b2.out -> out;
}
reactor Source {
    input[2] in1:int;
    input[2] in2:int;
    output[4] out:int;
    
    reaction(in1) -> out {= =}
    reaction(in2) {= =}
}
main reactor {
    s1 = new Source();
    s2 = new Split(delay = 2, nodelay = 2);
    s1.out -> s2.in;
    s2.out -> s1.in1, s1.in2;
}

This counterexample also shows a downside of allowing target code to specify widths. It means that cycles will not be detected at compile time.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jul 27, 2021

Interesting example! Perhaps we should simply forbid such programs (where there could be a cycle). The example would be much clearer if Complicated would have two separate output ports, one for the upper and one for the lower part, which would also avoid the cycle.

I am not 100% sure, but I think we could do the cycle analysis simply on the graph shown in the diagram, where all multiport connections are treated as a single line, and all banks as a singe instance. Then the logic would detect a cycle in your first example, even though it is correct under our current rules. However, I don't think we would lose generality by imposing this additional restriction, as cycles could be resolved by introducing additional ports and being clearer about the individual connections.

@edwardalee
Copy link
Collaborator

edwardalee commented Jul 27, 2021 via email

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jul 27, 2021

My point wasn't so much about forbidding programs with a large number of nodes. It was more to simplify the cycle analysis such that it works with an arbitrary number of nodes (could be even > 1,000,000). But I agree, we need to handle this carefully.

By the way, the C++ runtime implements its own dependency analysis. This analysis works fine for 1000 nodes, although it introduces a small but noticeable delay to the program startup. This indicates that the cycle analysis generally is possible with a reasonable overhead for a graph of this size. However, I am still a bit concerned about the initialization delay that the cycle detection introduces. While it works for 1,000 nodes, I don't think it will work for 1,000,000 nodes. So we probably also need to consider ways to simplify the analysis.

@oowekyala
Copy link
Collaborator

oowekyala commented Jul 29, 2021

So I've fiddled with [VisualVM] for a bit and got to test the example program in the OP.

First I have to mention that the graph doesn't have 1000 nodes. We have 1000 nodes which each have 1000 inputs and 1000 outputs so we have 2 million ports total. Each instance is a node in TopologyGraph. I don't think this is a small or even regular size graph, though I could be wrong. I'm mentioning this because it's not clear from the discussion that you're aware of this.

Here's a heap sample during the IDE freeze:

Capture d’écran de 2021-07-29 02-35-06

LinkedHashSet is implemented with LinkedHashMap internally so there are only around 1 million LinkedHashMaps, but there are 3 million LinkedHashSets! The 2 million PortInstance are expected, and as we can see, they're not really a problem for the heap (they add up to 12% usage). The problem is the HashSets.

Does PortInstance use LinkedHashSet somehow?

https://github.com/icyphy/lingua-franca/blob/b88468c7fe3ce774889e5cc056cb3179bb2fb553/org.lflang/src/org/lflang/generator/PortInstance.xtend#L70-L73

This should account for 2 million of the sets.

Also ports are stored into a TopologyGraph instance. Looking into this graph hierarchy I find that the graph maintains set of successors and a set of predecessors for the entire graph.

https://github.com/icyphy/lingua-franca/blob/b88468c7fe3ce774889e5cc056cb3179bb2fb553/org.lflang/src/org/lflang/graph/DirectedGraph.xtend#L39-L47

This means that each edge between ports creates 2 sets in that data structure...

One glaring problem here is that each of these sets has only few elements (maybe most of them have 1). But even so, since we use LinkedHashSet everywhere, each corresponds to at least

  • 1 LinkedHashSet
  • 1 LinkedHashMap instance (implem of LinkedHashSet)
  • 1 HashMap$Node[] with length 16 (the default in HashMaps)
  • 1 LinkedHashMap$Node

This holds for LinkedHashSets with a single element, which is basically the worst case.

We should consider using more specialized set implementations for small sets, and only inflate Sets to a LinkedHashSet when we need to add many elements.

The entire codebase uses LinkedHashSet as a type everywhere... This is really bad practice, we should refactor everything to just require Set.

Anyway I'll start refactoring and measure the difference.

Full data, loadable into your own VVM instance: ide-vm-snapshot.zip


Also I think there's a memory leak since I've seen the number of PortInstance double when editing the file then saving. Possibly, the topology graph is stored in some field somewhere which is not cleaned up. I still have to investigate this...

@oowekyala
Copy link
Collaborator

oowekyala commented Jul 29, 2021

So by using specialized collections I managed to bring memory usage down to a point where the file is editable:

Capture d’écran de 2021-07-29 14-38-01

Capture d’écran de 2021-07-29 14-41-30

This is still a lot though and maybe we can track down more inefficiencies

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jul 29, 2021

This sounds like a good start! Did you push your changes somewhere?

@edwardalee
Copy link
Collaborator

The reason for using LinkedHashSet and LinkedHashMap is for repeatability. In particular, if we change these to some other implementation of Set and Map, then we will get random test failures because items get iterated over in a random order. This is because test results have to match a pre-specified pattern, and it isn't easy to express "don't care about order." In some cases, the ordering has semantic meaning (e.g. reactions).

I have a bit of a disagreement with @lhstrh on the way we are currently handling construction and analysis of graphs. The way that I would do it would be to define graph algorithms to operate on interfaces, and then to make ReactorInstance and related classes implement those interfaces. Instead, what we do now is to replicate the information in ReactorInstance and related classes in separate graph data structures. There are some algorithms where this is a reasonable way to do it (particularly algorithms that are based on removing nodes or edges from the graph), but I suspect these algorithms can be replaced with non-destructive ones.

@oowekyala
Copy link
Collaborator

oowekyala commented Jul 29, 2021

The reason for using LinkedHashSet and LinkedHashMap is for repeatability.

I understand that, and the "specialized" collections I'm referring to also preserve iteration ordering (they're those Set.of, Map.of overloads).

Unfortunately in Java, the collections framework works a lot by contract. The only reasonable common denominator between collection types is very general interfaces like Set or List or Map. Whether they are modifiable or read-only, whether they preserve insertion order or not, whether they are thread-safe or not, all of that is up to the programmer to guarantee (or leave unspecified) and document. There are no interfaces for these more precise characteristics.

(Side note, any List of course preserves insertion order, and using LinkedList over ArrayList is wasteful in nearly all cases).

In my experience though this doesn't cause any problem. We already have the discipline never to create a HashSet over a LinkedHashSet, so iteration order will be reproducible. But there's nothing wrong with creating immutable singleton sets or empty sets, especially when they save so much memory. The only way to make this work is to only couple to interfaces like Set or Map.

We can talk about this during the code review meeting I hope

I have a bit of a disagreement with @lhstrh on the way we are currently handling construction and analysis of graphs. The way that I would do it would be to define graph algorithms to operate on interfaces, and then to make ReactorInstance and related classes implement those interfaces. Instead, what we do now is to replicate the information in ReactorInstance and related classes in separate graph data structures. There are some algorithms where this is a reasonable way to do it (particularly algorithms that are based on removing nodes or edges from the graph), but I suspect these algorithms can be replaced with non-destructive ones.

We could have a middle ground by keeping separate graph classes, but having the logic to implement an edge be abstract. So the directed graph here wouldn't have directly sets to store succ and pred. Only abstract methods like Set<V> getSuccs(V vertex) and addSucc(V vertex, V succ).

Our implementation of getSuccs could then just say if (vertex instanceof PortInstance) ((PortInstance) vertex).dependentSet or something like that.

This way we avoid duplicating the graph, we keep graph algorithms tidy and outside our domain classes, and we don't introduce interfaces that are only rarely needed. (This is basically an adapter that adapts our domain classes to what the graph classes expect, but we would still program to interfaces)

@cmnrd
Copy link
Collaborator Author

cmnrd commented Oct 20, 2021

#442 helped with the issue, but I am not sure if we can consider it resolved just yet. Compiling the Banking benchmark with 1000 accounts is now possible without error, but it still takes about 2 minutes on my machine (spend in lfc, not the target compiler) and consumes more than 3GB of memory.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Oct 28, 2021

Apparently this got worse again... with current master I get a OutOfMemoryError when trying to compile the banking benchmark with 1000 accounts. My default heap size is 4GB. I had to increase it to 12GB to be able to compile the program, which took about 6 minutes.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Oct 28, 2021

Actually I spoke too soon... Since gcc also takes a considerable amount of memory and most of the memory allocated in our compiler is not freed before invoking the target compiler, I can not compile this on a machine with 16GB of memory (+Swap).

@cmnrd
Copy link
Collaborator Author

cmnrd commented Oct 28, 2021

As it turns out, even 32GB of memory are not enough. Due to #635, gcc also consumes a huge amount of memory.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Oct 29, 2021

As @Soroosh129 suspected yesterday, the compilation indeed works for the C++ banking benchmark (consuming about 3GB of memory) and does not work for the C version (consuming more than 8GB). This explains why I was so confused about the compilation working or not; I simply used the C++ version when I tried before.

So it seems that the C code generator has a significantly larger memory footprint than the C++ generator. This is likely in part due to the use of graph data structures for generating in C, which we don't use in C++. However, I think the instance graph is created nonetheless in GeneratorBase. So there might be another issue lurking in the C generator apart from the graphs.

@cmnrd
Copy link
Collaborator Author

cmnrd commented Jan 27, 2022

Looks like we can finally close this (#875).

@cmnrd cmnrd closed this as completed Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working compiler diagrams Problems with diagram synthesis
Projects
None yet
Development

No branches or pull requests

4 participants