The Memory Subsystem: Stacks, Heaps and Garbage Collection

Before we dive into the memory subsystem of ERTS, we need some basic vocabulary and an understanding of the general memory layout of a program in a modern operating system. In this review section I will assume the program is compiled to an ELF executable and running on Linux on something like an IA-32/AMD64 architecture. The layout and terminology are basically the same for all operating systems that ERTS compiles on.

A program’s memory layout looks something like this:

 high
 addresses
        +--------------+
        |   Arguments  |
        |     ENV      |
        +--------------+
        |    Stack     | --+
        |      |       |   | Can grow
        |      v       |   | dynamically
        |              | --+
        +--------------+
        |              | -------------------------------+
        +--------------+                                |
        |    memory    |                                |
        |      map     | -- files or anonymous          |
        |    segment   |                                |
        +--------------+                                |
        |              |                                | Memory
        +--------------+                                | Mapping
        | Thread Stack | --+                            | Region
        |      |       |   | Statically allocated       |
        |      v       |   | on thread start.           |
        |              | --+                            |
        +--------------+                                |
        |              |                                |
        +--------------+                                |
        | Thread Stack | --+                            |
        |      |       |   | Statically allocated       |
        |      v       |   | on thread start.           |
        |              | --+                            |
        +--------------+                                |
        |              | -------------------------------+
        +--------------+ brk
        |              | --+
        |      ^       |   | Can grow
        |      |       |   | dynamically
        |     Heap     | --+
        +--------------+ start_brk
        |     BSS      | --  Static variables initialized to zero
        +--------------+
        |     Data     | --+
        +--------------+   | Binary (disk image)
        |     Code     | --+
        +--------------+
 low
 addresses

Even though this picture might look daunting, it is still a simplification. (For a full understanding of the memory subsystem, read a book like "Understanding the Linux Kernel" or "Linux System Programming".) What I want you to take away from this is that there are two types of dynamically allocatable memory: the heap and memory mapped segments. I will try to call this heap the C-heap from now on, to distinguish it from an Erlang process heap. I will call a memory mapped segment just a segment, and any of the stacks in this picture the C-stack.

The C-heap is allocated through malloc and a segment is allocated with mmap.

A note on pictures of memory

Note

When drawing overview pictures of system memory and stacks we will follow the convention that memory addresses grow upward. That is, low memory addresses are at the bottom of the page and high memory addresses are at the top of the page. (Stacks most often grow downward starting at high addresses, so that new elements are pushed at the lowest address.)

However, when we draw a C-structure we will draw the fields from the top and down, even though the first field of the structure will be at the lowest address and the following fields at higher addresses. So pictures of structures have the low address at the top of the page and the high address at the bottom of the page.

This means that a picture of a C-structure and a picture of a memory area will have their address positions on the page mirrored. This becomes somewhat confusing when we try to picture structures and heaps in the same picture.

The memory subsystem

Now that we dive into the memory subsystem it will once again be apparent that ERTS is more like an operating system than just a programming language environment. Not only does ERTS provide a garbage collector for Erlang terms on the Erlang process level, but it also provides a plethora of low level memory allocators and memory allocation strategies.

For an overview of memory allocators see the erts_alloc documentation at: http://www.erlang.org/doc/man/erts_alloc.html

All these allocators also come with a number of parameters that can be used to tweak their behavior, and this is probably one of the most important areas from an operational point of view. This is where we can configure the system behavior to fit anything from a small embedded control system (like a Raspberry Pi) to an Internet scale 2TB database server.

There are currently eleven different allocators, seven different allocation strategies, and more than 18 other settings, some of which take arbitrary numerical values. This means that there basically is an infinite number of possible configurations. (OK, strictly speaking it is not infinite, since each number is bounded, but there are more configurations than you can shake a stick at.)

In order to be able to use these settings in any meaningful way we will have to understand how these allocators work and how each setting impacts the performance of the allocator.

The erts_alloc manual goes as far as to give the following warning:

Warning

Only use these flags if you are absolutely sure what you are doing. Unsuitable settings may cause serious performance degradation and even a system crash at any time during operation.

Making you absolutely sure that you know what you are doing, that is what this chapter is about.

Oh yes, we will also go into details of how the garbage collector works.

Different types of memory allocators

The Erlang run-time system tries its best to handle memory in all situations and under all types of loads, but there are always corner cases. In this chapter we will look at the details of how memory is allocated and how the different allocators work. With this knowledge and some tools that we will look at later you should be able to detect and fix problems if your system ends up in one of these corner cases.

For a nice story about the troubles the system might get into and how to analyze and correct the behavior read Fred Hébert’s essay "Troubleshooting Down the Logplex Rabbit Hole".

When we are talking about a memory allocator in this book we have a specific meaning in mind. Each memory allocator manages allocations and deallocations of memory of a certain type. Each allocator is intended for a specific type of data and is often specialized for one size of data.

Each memory allocator implements the allocator interface but can use different algorithms and settings for the actual memory allocation.

The goal with having different allocators is to reduce fragmentation, by grouping allocations of the same size, and to increase performance, by making frequent allocations cheap.

There are two special, fundamental or generic, memory allocator types: sys_alloc and mseg_alloc. There are also nine specific allocators implemented through the alloc_util framework.

In the following sections we will go through the different allocators, with a little detour into the general framework for allocators (alloc_util).

Each allocator has several names used in the documentation and in the C code. See List of memory allocators for a short list of all allocators and their names. The C-name is used in the C-code to refer to the allocator. The Type-name is used in erl_alloc.types to bind allocation types to an allocator. The Flag is the letter used for setting parameters of that allocator when starting Erlang.

Table 1. List of memory allocators.

Name                      Description            C-name        Type-name    Flag
------------------------  ---------------------  ------------  -----------  ----
Basic allocator           malloc interface       sys_alloc     SYSTEM       Y
Memory segment allocator  mmap interface         mseg_alloc    -            M
Temporary allocator       Temporary allocations  temp_alloc    TEMPORARY    T
Heap allocator            Erlang heap data       eheap_alloc   EHEAP        H
Binary allocator          Binary data            binary_alloc  BINARY       B
ETS allocator             ETS data               ets_alloc     ETS          E
Driver allocator          Driver data            driver_alloc  DRIVER       R
Short lived allocator     Short lived memory     sl_alloc      SHORT_LIVED  S
Long lived allocator      Long lived memory      ll_alloc      LONG_LIVED   L
Fixed allocator           Fixed size data        fix_alloc     FIXED_SIZE   F
Standard allocator        For most other data    std_alloc     STANDARD     D
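To see which of these allocators a running node actually has, you can ask the runtime itself. The following is a minimal sketch: the module name alloc_peek and the function instances/0 are made up for this illustration, while erlang:system_info(alloc_util_allocators) and erlang:system_info({allocator, Name}) are the documented calls. The exact list of names depends on the ERTS version.

```erlang
-module(alloc_peek).
-export([instances/0]).

%% Returns [{AllocatorName, InstanceCount | disabled}] for every
%% allocator that is built on the alloc_util framework.
instances() ->
    [{Name, count(erlang:system_info({allocator, Name}))}
     || Name <- erlang:system_info(alloc_util_allocators)].

%% system_info({allocator, Name}) returns false for a disabled
%% allocator and undefined for an unknown one; otherwise it returns
%% a list of {instance, N, Info} tuples.
count(false) -> disabled;
count(undefined) -> disabled;
count(Info) when is_list(Info) ->
    length([N || {instance, N, _} <- Info]).
```

Calling alloc_peek:instances() in a shell gives one entry per allocator; on an SMP node most of them should report more than one instance.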

The basic allocator: sys_alloc

The allocator sys_alloc can not be disabled, and is basically a straight mapping to the underlying OS malloc implementation in libc.

If a specific allocator is disabled then sys_alloc is used instead.

All specific allocators use either sys_alloc or mseg_alloc to allocate memory from the operating system as needed.

When memory is allocated from the OS, sys_alloc can add (pad) a fixed number of kilobytes to the requested number. This can reduce the number of system calls by over-allocating memory. The default padding is zero.

When memory is freed, sys_alloc will keep some free memory allocated in the process. The size of this free memory is called the trim threshold, and the default is 128 kilobytes. This also reduces the number of system calls at the cost of a higher memory footprint. This means that if you are running the system with the default settings you may observe that the BEAM process does not give memory back to the OS immediately when memory is freed.

Memory areas allocated by sys_alloc are stored in the C-heap of the beam process which will grow as needed through system calls to brk.

The memory segment allocator: mseg_alloc

If the underlying operating system supports mmap a specific memory allocator can use mseg_alloc instead of sys_alloc to allocate memory from the operating system.

Memory areas allocated through mseg_alloc are called segments. When a segment is freed it is not immediately returned to the OS, instead it is kept in a segment cache.

When a new segment is allocated a cached segment is reused if possible, i.e. if it is the same size or larger than the requested size but not too large. The value of absolute max cache bad fit determines the number of kilobytes of extra size which is considered not too large. The default is 4096 kilobytes.

In order not to reuse a 4096 kilobyte segment for really small allocations there is also a relative_max_cache_bad_fit value which states that a cached segment may not be used if it is more than that many percent larger. The default value is 20 percent. That is, a 12 KB segment may be used when asked for a 10 KB segment.

The number of entries in the cache defaults to 10 but can be set to any value from zero to thirty.
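The two bad fit limits can be read as a simple predicate over the cached and the requested size. The sketch below only restates the rule as it is described above, it is not the code in mseg_alloc; the module name mseg_fit and the function reusable are hypothetical, and the defaults 4096 KB and 20 percent are the ones quoted in the text.

```erlang
-module(mseg_fit).
-export([reusable/2, reusable/4]).

%% Use the defaults quoted in the text:
%% 4096 KB absolute limit and 20 percent relative limit.
reusable(CachedKB, WantedKB) ->
    reusable(CachedKB, WantedKB, 4096, 20).

%% A cached segment may be reused if it is large enough and the
%% extra space is within both the absolute and the relative limit.
reusable(CachedKB, WantedKB, AbsMaxKB, RelMaxPct) ->
    Extra = CachedKB - WantedKB,
    Extra >= 0
        andalso Extra =< AbsMaxKB
        andalso Extra * 100 =< WantedKB * RelMaxPct.
```

With the defaults, mseg_fit:reusable(12, 10) is true while mseg_fit:reusable(13, 10) is false, since 13 KB is 30 percent larger than the request.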

The memory allocator framework: alloc_util

Building on top of the two generic allocators (sys_alloc and mseg_alloc) is a framework called alloc_util which is used to implement specific memory allocators for different types of usage and data.

The framework is implemented in erl_alloc_util.[ch] and the different allocators used by ERTS are defined in erl_alloc.types in the directory "erts/emulator/beam/".

In an SMP system there is usually one allocator of each type per scheduler thread.

The smallest unit of memory that an allocator works with is called a block. When you call an allocator to allocate a certain amount of memory what you get back is a block. It is also blocks that you give as an argument to the allocator when you want to deallocate memory.

The allocator does not allocate blocks from the operating system directly though. Instead the allocator allocates a carrier from the operating system, either through sys_alloc or through mseg_alloc, which in turn uses malloc or mmap. If sys_alloc is used the carrier is placed on the C-heap and if mseg_alloc is used the carrier is placed in a segment.

Small blocks are placed in a multiblock carrier. A multiblock carrier can, as the name suggests, contain many blocks. Larger blocks are placed in a singleblock carrier, which as the name implies only contains one block.

What’s considered a small and a large block is determined by the parameter singleblock carrier threshold (sbct), see the list of system flags below.

Most allocators also have one "main multiblock carrier" which is never deallocated.

 high
 addresses
           |FREE OS MEMORY |
           +---------------+ brk
           |   FREE HEAP   |       | less than MYtt kb
           +---------------+
     /     |  Unused PAD   |  | multiple of Muycs
    |      |---------------|  |
    S      |               |  |    |
singleblock|               |  |    |
 carrier 1 |     Block     |  |    | larger than MSsbct kb
    |      |               |  |    |
     \     |               |  |    |
           +---------------+
     /     |Free in Carrier|       |
    |      |---------------|       |
    S      |               |       |
  main     |               |       |
multiblock |     Block 2   |       | MSmmbcs kb
 carrier   |---------------|       |
    |      |               |       |
     \     |     Block 1   |       |
           +---------------+
           |               |
           |    U S E D    |
           |               |
           +---------------+ start_brk
               C-Heap
 low
 addresses

Memory allocation strategies

To find a free block of memory in a multiblock carrier an allocation strategy is used. Each type of allocator has a default allocation strategy, but you can also set the allocation strategy with the as flag.

The Erlang Run-Time System Application Reference Manual lists the following allocation strategies:

Best fit: Find the smallest block that satisfies the requested block size. (bf)

Address order best fit: Find the smallest block that satisfies the requested block size. If multiple blocks are found, choose the one with the lowest address. (aobf)

Address order first fit: Find the block with the lowest address that satisfies the requested block size. (aoff)

Address order first fit carrier best fit: Find the carrier with the lowest address that can satisfy the requested block size, then find a block within that carrier using the "best fit" strategy. (aoffcbf)

Address order first fit carrier address order best fit: Find the carrier with the lowest address that can satisfy the requested block size, then find a block within that carrier using the "address order best fit" strategy. (aoffcaobf)

Good fit: Try to find the best fit, but settle for the best fit found during a limited search. (gf)

A fit: Do not search for a fit, inspect only one free block to see if it satisfies the request. This strategy is only intended to be used for temporary allocations. (af)
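To make the difference between some of the strategies concrete, here is a small sketch over a list of free blocks, each modelled as an {Address, Size} tuple. It only illustrates the selection rules as listed above. The module fit_sketch and its functions are hypothetical, and sorting a plain list is a simplification chosen for this example, not how ERTS stores free blocks.

```erlang
-module(fit_sketch).
-export([aobf/2, aoff/2, af/2]).

%% Address order best fit: smallest block that is large enough,
%% ties broken by the lowest address.
aobf(Wanted, Free) ->
    Fits = [{Size, Addr} || {Addr, Size} <- Free, Size >= Wanted],
    case lists:sort(Fits) of
        [] -> none;
        [{Size, Addr} | _] -> {Addr, Size}
    end.

%% Address order first fit: lowest address that is large enough.
aoff(Wanted, Free) ->
    case lists:sort([B || {_, Size} = B <- Free, Size >= Wanted]) of
        [] -> none;
        [Block | _] -> Block
    end.

%% A fit: inspect only one free block (here: the first in the list).
af(Wanted, [{_, Size} = Block | _]) when Size >= Wanted -> Block;
af(_, _) -> none.
```

For the free list [{300,64},{100,128},{200,64},{400,32}] and a request of 48, aobf picks {200,64}, aoff picks {100,128}, and af picks {300,64} because that is the only block it looks at.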

The temporary allocator: temp_alloc

The allocator temp_alloc is used for temporary allocations, that is, very short lived allocations. Memory allocated by temp_alloc must not be held across an Erlang process context switch.

You can use temp_alloc as a small scratch or working area while doing some work within a function. Look at it as an extension of the C-stack and free it in the same way. That is, to be on the safe side, free memory allocated by temp_alloc before returning from the function that did the allocation. There is a note in erl_alloc.types saying that you should free a temp_alloc block before the emulator starts executing Erlang code.

Note that no Erlang process running on the same scheduler as the allocator may start executing Erlang code before the block is freed. This means that you can not use a temporary allocation over a bif or nif trap (yield).

In a default R16 SMP system there are N+1 temp_alloc allocators where N is the number of schedulers. The temp_alloc uses the "A fit" (af) strategy. Since the allocation pattern of the temp_alloc basically is that of a stack (mostly of size 0 or 1), this strategy works fine.

The temporary allocator is, in R16, used by the following types of data: TMP_HEAP, MSG_ROOTS, ROOTSET, LOADER_TEMP, NC_TMP, TMP, DCTRL_BUF, TMP_DIST_BUF, ESTACK, DB_TMP, DB_MC_STK, DB_MS_CMPL_HEAP, LOGGER_DSBUF, TMP_DSBUF, DDLL_TMP_BUF, TEMP_TERM, SYS_READ_BUF, ENVIRONMENT, CON_VPRINT_BUF.

For an up to date list of allocation types allocated with each allocator, see erl_alloc.types (e.g. grep TEMPORARY erts/emulator/beam/erl_alloc.types).

I will not go through each of these different types, but in general as you can guess by their names, they are temporary buffers or work stacks.

The heap allocator: eheap_alloc

The heap allocator is used for allocating memory blocks where tagged Erlang terms are stored, such as Erlang process heaps (all generations), heap fragments, and the beam_registers.

These are probably the memory areas you are most interested in as an Erlang developer or when tuning an Erlang system. We will talk more about how these areas are managed in the upcoming sections on garbage collection and process memory. There we will also cover what a heap fragment is.

The binary allocator: binary_alloc

The binary allocator is used for, yes you guessed it, binaries. Binaries can be of quite varying sizes and have varying life spans. This allocator uses the best fit allocation strategy by default.

The ETS allocator: ets_alloc

The ETS allocator is used for most ETS related data, except for some short lived or temporary data used by ETS tables.

The driver allocator: driver_alloc

The driver allocator is used for ports, linked in drivers and nifs.

The short lived allocator: sl_alloc

The short lived allocator is used for lists and buffers that are expected to be short lived. Short lived data can live longer than temporary data.

The long lived allocator: ll_alloc

The long lived allocator is used for long lived data, such as atoms, modules, funs and long lived tables.

The fixed size allocator: fix_alloc

The fixed allocator is used for objects of a fixed size, such as PCBs, message refs and a few others. The fixed size allocator uses the address order best fit allocation strategy by default.

The standard allocator: std_alloc

The standard allocator is used for the other types of data. (active_procs alloc_info_request arg_reg bif_timer_ll bits_buf bpd calls_buf db_heir_data db_named_table_entry dcache ddll_handle ddll_processes dist_entry dist_tab driver_lock ethread_standard fd_entry_buf fun_tab gc_info_request io_queue line_buf link_lh module_refs monitor_lh monitor_sh nlink_lh nlink_sh node_entry node_tab nodes_monitor port_data_heap port_lock port_report_exit port_specific_data proc_dict process_specific_data ptimer_ll re_heap reg_proc reg_tab sched_wall_time_request stack suspend_monitor thr_q_element thr_queue zlib)

TODO: system flags for memory

TODO

Process Memory

As we saw in [CH-Processes] a process is really just a number of memory areas. In this chapter we will look a bit closer at how the stack, the heap and the mailbox are managed.

The default size of the stack and heap is 233 words. This default size can be changed globally when starting Erlang through the +h flag. You can also set the minimum heap size when starting a process with spawn_opt by setting min_heap_size.
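As a small illustration of the per-process setting, the sketch below spawns a process with a larger minimum heap and reads the value back. The module heap_opts and the 1000 word figure are arbitrary choices for this example. Note that the runtime may round the requested size up to the next size in its internal heap size table, so the value reported is at least, but not necessarily exactly, what was asked for.

```erlang
-module(heap_opts).
-export([demo/0]).

%% Returns {NodeDefaultMinHeap, MinHeapOfSpawnedProcess} in words.
demo() ->
    {min_heap_size, Default} = erlang:system_info(min_heap_size),
    Pid = spawn_opt(fun() -> receive stop -> ok end end,
                    [{min_heap_size, 1000}]),
    {min_heap_size, Min} = erlang:process_info(Pid, min_heap_size),
    Pid ! stop,
    {Default, Min}.
```

On a node started with default settings the first element should be the 233 words mentioned above.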

Erlang terms are tagged as we saw in [CH-TypeSystem], and when they are stored on the heap they are either cons cells or boxed objects.

Term sharing

Objects on the heap are passed by reference within the context of one process. If you call one function with a tuple as an argument, then only a tagged reference to that tuple is passed to the called function. When you build new terms you will also only use references to sub terms.

For example if you have the string "hello" (which is the same as this list of integers: [104,101,108,108,111]) you would get a memory layout similar to:

         ADR                               BINARY  VALUE  +  DESCRIPTION
 hend ->     +-------- -------- -------- --------+
             |              ...                  |
             |              ...                  |
             |00000000 00000000 00000000 10000001| 128 + list tag  ---------------+
 stop ->     |                                   |                                |
                                                                                  |
 htop ->     |                                   |                                |
         132 |00000000 00000000 00000000 01111001| 120 + list tag  -------------- | -+
         128 |00000000 00000000 00000110 10001111| (H) 104 bsl 4 + small int tag <+  |
         124 |00000000 00000000 00000000 01110001| 112 + list tag  ----------------- | -+
         120 |00000000 00000000 00000110 01011111| (e) 101 bsl 4 + small int tag <---+  |
         116 |00000000 00000000 00000000 01101001| 104 + list tag  -------------------- | -+
         112 |00000000 00000000 00000110 11001111| (l) 108 bsl 4 + small int tag <------+  |
         108 |00000000 00000000 00000000 01100001|  96 + list tag  ----------------------- | -+
         104 |00000000 00000000 00000110 11001111| (l) 108 bsl 4 + small int tag <---------+  |
         100 |11111111 11111111 11111111 11111011| NIL                                        |
          96 |00000000 00000000 00000110 11111111| (o) 111 bsl 4 + small int tag <------------+
             |                ...                |
 heap ->     +-----------------------------------+
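The tagged words in such a picture can be recomputed with a little bit arithmetic. This is a sketch that assumes the 32-bit layout used in the picture (small integers shifted four bits with tag 1111, list pointers with tag 01); the module tag_sketch and its function names are made up for the illustration.

```erlang
-module(tag_sketch).
-export([small/1, cons_ptr/1, bits/1]).

%% A small integer: the value shifted left 4 bits plus the tag 1111.
small(N) -> (N bsl 4) bor 2#1111.

%% A list (cons) pointer: a word aligned address plus the tag 01.
cons_ptr(Addr) -> Addr bor 2#01.

%% Render a word as a 32 character string of binary digits.
bits(Word) -> lists:flatten(io_lib:format("~32.2.0B", [Word])).
```

For example, tag_sketch:small(104) is 1679 and tag_sketch:bits(1679) gives "00000000000000000000011010001111", which is the word stored at address 128. Likewise tag_sketch:cons_ptr(128) is 129, the tagged pointer to the head of the list.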

If you then create a tuple with two instances of the list, all that is repeated is the tagged pointer to the list: 00000000000000000000000010000001. The code

L = [104, 101, 108, 108, 111],
T = {L, L}.

would result in a memory layout as seen below. That is, a boxed header saying that this is a tuple of size 2 and then two pointers to the same list.

ADR VALUE                            DESCRIPTION
144 00000000000000000000000010000001 128+CONS
140 00000000000000000000000010000001 128+CONS
136 00000000000000000000000010000000 2+ARITYVAL

This is nice, since it is cheap to do and uses very little space. But if you send the tuple to another process or do any other type of IO, or any operation which results in something called a deep copy, then the data structure is expanded. So if we send the tuple T to another process P2 (P2 ! T), then the heap of P2 will look like this:

 ..

You can quickly bring down your Erlang node by expanding a highly shared term, see share.erl.

-module(share).

-export([share/2, size/0]).

share(0, Y) -> {Y,Y};
share(N, Y) -> [share(N-1, [N|Y]) || _ <- Y].

size() ->
    T = share:share(5,[a,b,c]),
    {{size, erts_debug:size(T)},
     {flat_size, erts_debug:flat_size(T)}}.



 1> timer:tc(fun() -> share:share(10,[a,b,c]), ok end).
 {1131,ok}

 2> share:share(10,[a,b,c]), ok.
 ok

 3> byte_size(list_to_binary(share:share(10,[a,b,c]))), ok.
 HUGE size (13695500364)
 Abort trap: 6

You can calculate the memory size of a shared term and the expanded size of the term with the functions erts_debug:size/1 and erts_debug:flat_size/1.

> share:size().
{{size,19386},{flat_size,94110}}

For most applications this is not a problem, but you should be aware of the problem, which can come up in many situations. A deep copy is used for IO, ETS tables, binary_to_term, and message passing.
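The same two functions can be applied to the small {L, L} tuple from the previous section. The sketch below uses a hypothetical module sharing_sketch; the word counts in the usage note follow from the layout described earlier (a tuple of arity 2 is three words, each cons cell is two words).

```erlang
-module(sharing_sketch).
-export([sizes/1]).

%% Returns {SharedSize, FlatSize} in words for the tuple {L, L}.
sizes(L) ->
    T = {L, L},
    {erts_debug:size(T), erts_debug:flat_size(T)}.
```

sharing_sketch:sizes("hello") returns {13,23}: with sharing the five cons cells are counted once (3 + 10 words), while the flat size counts them twice (3 + 20 words).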

Let us look in more detail at how message passing works.

Message passing

When a process P1 sends a message M to another (local) process P2, the process P1 first calculates the flat size of M. Then it allocates a new message buffer of that size by doing a heap_alloc of a heap_frag in the local scheduler context.

Given the code in send.erl the state of the system could look like this just before the send in p1/1:

    x0       |00000000 00000000 00000000 00100011| Pid 2
    x1       |00000000 00000000 00000000 10001010| 136 + boxed tag -----------+
                                                                              |
                                                                              |
         ADR                               BINARY  VALUE  +  DESCRIPTION      |
 hend ->     +-------- -------- -------- --------+                            |
             |              ...                  |                            |
             |              ...                  |                            |
 stop ->     |                                   |                            |
                                                                              |
 htop ->     |                                   |                            |
         144 |00000000 00000000 00000000 10000001| 128+CONS        ---------------+
         140 |00000000 00000000 00000000 10000001| 128+CONS        ---------------+
         136 |00000000 00000000 00000000 10000000| 2+ARITYVAL             <---+   |
         132 |00000000 00000000 00000000 01111001| 120+CONS        -------------- | -+
         128 |00000000 00000000 00000110 10001111| (H) 104 bsl 4 + small int tag <+  |
         124 |00000000 00000000 00000000 01110001| 112+CONS        ----------------- | -+
         120 |00000000 00000000 00000110 01011111| (e) 101 bsl 4 + small int tag <---+  |
         116 |00000000 00000000 00000000 01101001| 104+CONS        -------------------- | -+
         112 |00000000 00000000 00000110 11001111| (l) 108 bsl 4 + small int tag <------+  |
         108 |00000000 00000000 00000000 01100001|  96+CONS        ----------------------- | -+
         104 |00000000 00000000 00000110 11001111| (l) 108 bsl 4 + small int tag <---------+  |
         100 |11111111 11111111 11111111 11111011| NIL                                        |
          96 |00000000 00000000 00000110 11111111| (o) 111 bsl 4 + small int tag <------------+
             |                ...                |
 heap ->     +-----------------------------------+


P2

When P1 starts sending the message M to P2, it (through the code in erl_message.c) first calculates the flat size of M (which in our example is 23 words)[1]. Then (in an SMP system) if it can take a lock on P2 and there is enough room on the heap of P2, it will copy the message to the heap of P2.

If P2 is running (or exiting) or there isn’t enough space on the heap, then a new heap fragment is allocated (of sizeof(ErlHeapFragment) - sizeof(Eterm) + 23*sizeof(Eterm)) [2] which after initialization will look like:

erl_heap_fragment:
    ErlHeapFragment* next;	  NULL
    ErlOffHeap off_heap:
      erl_off_heap_header* first; NULL
      Uint64 overhead;               0
    unsigned alloc_size;	    23
    unsigned used_size;             23
    Eterm mem[1];		     ?
      ... 22 free words

Then the message is copied into the heap fragment:

erl_heap_fragment:
    ErlHeapFragment* next;        NULL
    ErlOffHeap off_heap:
      erl_off_heap_header* first; Boxed tag+&mem+2*WS-+
      Uint64 overhead;               0                |
    unsigned alloc_size;            23                |
    unsigned used_size;             23                |
    Eterm mem:                    2+ARITYVAL   <------+
                                  &mem+3*WS+1  ---+
                                  &mem+13*WS+1 ------+
                                  (H*16)+15    <--+  |
                                  &mem+5*WS+1  --+   |
                                  (e*16)+15    <-+   |
                                  &mem+7*WS+1  ----| |
                                  (l*16)+15    <---+ |
                                  &mem+9*WS+1  ---+  |
                                  (l*16)+15    <--+  |
                                  &mem+11*WS+1 ----+ |
                                  (o*16)+15    <---+ |
                                  NIL                |
                                  (H*16)+15    <-----+
                                  &mem+15*WS+1 --+
                                  (e*16)+15    <-+
                                  &mem+17*WS+1 ----|
                                  (l*16)+15    <---+
                                  &mem+19*WS+1 ---+
                                  (l*16)+15    <--+
                                  &mem+21*WS+1 ----+
                                  (o*16)+15    <---+
                                  NIL

In either case a new mbox (ErlMessage) is allocated, a lock (ERTS_PROC_LOCK_MSGQ) is taken on the receiver and the message on the heap or in the new heap fragment is linked into the mbox.

 erl_mesg {
    struct erl_mesg* next = NULL;
    data:  ErlHeapFragment *heap_frag = bp;
    Eterm m[0]            = message;
 } ErlMessage;

Then the mbox is linked into the in message queue (msg_inq) of the receiver, and the lock is released. Note that msg_inq.last points to the next field of the last message in the queue. When a new mbox is linked in this next pointer is updated to point to the new mbox, and the last pointer is updated to point to the next field of the new mbox.
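You can observe the effect of this copy from Erlang by measuring the same term on both sides of a send. The sketch below is an illustration only: the module send_sketch is hypothetical, and it assumes a default build of the emulator, where the copy does not preserve sharing. The list should be built at run time (for example with lists:seq/2), since terms that are literals in compiled code may be treated differently when sent.

```erlang
-module(send_sketch).
-export([sizes/1]).

%% Returns {SizeAtSender, SizeAtReceiver} in words for the tuple {L, L}.
sizes(L) ->
    T = {L, L},
    Parent = self(),
    Ref = make_ref(),
    Pid = spawn(fun() ->
                        receive
                            {Ref, Msg} ->
                                %% Measure the copy that arrived here.
                                Parent ! {Ref, erts_debug:size(Msg)}
                        end
                end),
    Pid ! {Ref, T},
    receive
        {Ref, Remote} -> {erts_debug:size(T), Remote}
    after 5000 ->
        timeout
    end.
```

With send_sketch:sizes(lists:seq(1, 5)) the sender side reports 13 words. Under the assumption above, the receiver side should report the flat size, 23 words, because the shared list is copied twice.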

Binaries

As we saw in [CH-TypeSystem] there are four types of binaries internally. Three of these types, heap binaries, sub binaries and match contexts, are stored on the local heap and handled by the garbage collector and message passing as any other object, copied as needed.

Reference Counting

The fourth type, large binaries or refc binaries, is on the other hand partially stored outside of the process heap and is reference counted.

The payload of a refc binary is stored in memory allocated by the binary allocator. There is also a small reference to the payload called a ProcBin which is stored on the process heap. This reference is copied by message passing and by the GC, but the payload is untouched. This makes it relatively cheap to send large binaries to other processes since the whole binary doesn’t need to be copied.

Each reference through a ProcBin to a refc binary increases the reference count of the binary by one. All ProcBin objects on a process heap are linked together in a linked list. After a GC pass this linked list is traversed and the reference count of the binary is decreased by one for each ProcBin that has deceased. If the reference count of the refc binary reaches zero that binary is deallocated.

Having large binaries reference counted and not copied by send or garbage collection is a big win, but there is one problem with having a mixed environment of garbage collection and reference counting. In a pure reference counted implementation the reference count would be reduced as soon as a reference to the object dies, and when the reference count reaches zero the object is freed. In the ERTS mixed environment a reference to a reference counted object does not die until a garbage collection detects that the reference is dead.

This means that binaries, which have a tendency to be large or even huge, can hang around for a long time after all references to the binary are dead. Note that since binaries are allocated globally, all references from all processes need to be dead, that is, all processes that have seen a binary need to do a GC.

Unfortunately it is not always easy, as a developer, to see which processes have seen a binary in the GC sense of the word seen. Imagine for example that you have a load balancer that receives work items and dispatches them to workers.

In this code there is an example of a loop which doesn’t need to do GC. (See listing lb for a full example.)

loop(Workers, N) ->
  receive
    WorkItem ->
       Worker = lists:nth(N+1, Workers),
       Worker ! WorkItem,
       loop(Workers, (N+1) rem length(Workers))
  end.

This server will just keep on grabbing references to binaries and never free them, eventually using up all system memory.

When one is aware of the problem it is easy to fix: one can either do a garbage_collect on each iteration of loop, or do it every five seconds or so by adding an after clause to the receive (after 5000 -> garbage_collect(), loop(Workers, N)).
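Put together, the loop with the after clause could look like the following sketch. The module name lb_sketch and the start/1 helper are additions for this illustration; the loop body is the one shown above with the timeout added.

```erlang
-module(lb_sketch).
-export([start/1, loop/2]).

%% Start a dispatcher for a non-empty list of worker pids.
start(Workers) ->
    spawn(?MODULE, loop, [Workers, 0]).

loop(Workers, N) ->
    receive
        WorkItem ->
            Worker = lists:nth(N + 1, Workers),
            Worker ! WorkItem,
            loop(Workers, (N + 1) rem length(Workers))
    after 5000 ->
        %% Idle for five seconds: drop dead references to binaries.
        garbage_collect(),
        loop(Workers, N)
    end.
```

Note that the after clause only fires when no message has arrived for five seconds. A dispatcher that is never idle for that long would need the other variant, with a garbage_collect() call made on some or all iterations.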

Sub Binaries and Matching

When you match out a part of a binary you get a sub binary. This sub binary will be a small structure just containing pointers into the real binary. This increases the reference count for the binary but uses very little extra space.

If a match created a new copy of the matched part of the binary, it would cost both space and time. So in most cases doing a pattern match on a binary and getting a sub binary to work on is just what you want.

There are some degenerate cases. Imagine for example that you load a huge file, like a book, into memory and then match out a small part, like a chapter, to work on. The problem is then that the whole of the rest of the book is kept in memory until you are done processing the chapter. If you do this for many books, perhaps because you want the introduction of every book in your file system, then you will keep the whole of each book in memory and not just the introductory chapter. This might lead to huge memory usage.

The solution in this case, when you know you only want one small part of a large binary and you want to have the small part hanging around for some time, is to use binary:copy/1. This function returns a fresh copy of the sub binary that no longer references the larger binary, which hopefully lets the larger binary be garbage collected.
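A minimal sketch of the difference, using binary:referenced_byte_size/1 to show how much memory a binary keeps alive. The module name subbin_sketch, the 100-byte "introduction" and the generated stand-in for a book are assumptions for illustration.

```erlang
-module(subbin_sketch).
-export([intro/1, demo/0]).

%% Keep only the first 100 bytes of Book without pinning the rest of it.
intro(Book) when byte_size(Book) >= 100 ->
    <<Intro:100/binary, _/binary>> = Book,   % Intro is a sub binary of Book
    binary:copy(Intro).                      % detach it from Book

demo() ->
    Book = binary:copy(<<"x">>, 100000),     % stand-in for a large file
    <<Sub:100/binary, _/binary>> = Book,     % still references all of Book
    Copy = intro(Book),                      % independent 100-byte binary
    {binary:referenced_byte_size(Sub),
     binary:referenced_byte_size(Copy)}.
```

Calling subbin_sketch:demo() should return {100000,100}: the sub binary keeps the whole 100000-byte binary alive, while the copy only accounts for its own 100 bytes.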

There is a pretty thorough explanation of how binary construction and matching is done in the Erlang documentation: http://www.erlang.org/doc/efficiency_guide/binaryhandling.html.

Garbage Collection

When a process runs out of space on the stack and heap the process will try to reclaim space by doing a minor garbage collection. The code for this can be found in erl_gc.c.

ERTS uses a generational copying garbage collector. A copying collector means that during garbage collection all live young terms are copied from the current heap (from space) to a new heap (to space). Then the from space is discarded. A generational collector works on the principle that most terms die young: they are temporary terms that are created, used, and thrown away. Older terms are promoted to the old generation, which is collected less often, with the rationale that once a term has become old it will probably live for a long time.

Conceptually a garbage collection cycle works as follows:

  • First you collect all roots (e.g. the stack).

  • Then for each root, if the root points to a heap allocated object which doesn’t have a forwarding pointer you copy the object to the new heap. For each copied object update the original with a forwarding pointer to the new copy.

  • Now go through the new heap and do the same as for the roots.
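The three steps above can be sketched as a small model in Erlang. This is not how erl_gc.c is written. It is a toy model under my own assumptions: heaps are maps from an address to a cons cell, values are either {imm, Term} or {ptr, Addr}, and a moved cell is replaced by {moved, NewAddr}, where the real collector instead writes a marker into the head and the forwarding pointer into the tail. The module name cheney_sketch and all function names are hypothetical.

```erlang
-module(cheney_sketch).
-export([collect/2, demo/0]).

%% collect(Roots, FromSpace) -> {NewRoots, ToSpace}
collect(Roots, From0) ->
    State0 = {From0, #{}, 0},                 % {From, To, NextFreeAddr}
    %% Step 1 and 2: move every object a root points to.
    {NewRoots, State1} = lists:mapfoldl(fun move/2, State0, Roots),
    %% Step 3: scan to space and treat each copied cell like a root.
    {_From, To, _Next} = scan(0, State1),
    {NewRoots, To}.

%% Immediates are left alone. Pointers are followed once; afterwards the
%% forwarding entry is reused, which is what preserves sharing.
move({imm, _} = Value, State) ->
    {Value, State};
move({ptr, Addr}, {From, To, Next} = State) ->
    case maps:get(Addr, From) of
        {moved, NewAddr} ->
            {{ptr, NewAddr}, State};
        {cons, _, _} = Cell ->
            {{ptr, Next},
             {From#{Addr := {moved, Next}}, To#{Next => Cell}, Next + 1}}
    end.

%% Scan plays the role of the "unseen" pointer, Next the top of to space.
scan(Scan, {_From, _To, Next} = State) when Scan >= Next ->
    State;
scan(Scan, {_From, To, _Next} = State0) ->
    {cons, Head0, Tail0} = maps:get(Scan, To),
    {Head, State1} = move(Head0, State0),
    {Tail, {From, To1, Next}} = move(Tail0, State1),
    scan(Scan + 1, {From, To1#{Scan := {cons, Head, Tail}}, Next}).

demo() ->
    From = #{100 => {cons, {imm, $i}, {imm, nil}},
             101 => {cons, {imm, $H}, {ptr, 100}},    % "Hi"
             102 => {cons, {ptr, 101}, {imm, nil}},
             103 => {cons, {ptr, 101}, {ptr, 102}},   % ["Hi","Hi"]
             104 => {cons, {ptr, 103}, {ptr, 101}}},  % dead cell
    collect([{ptr, 103}], From).
```

In demo/0 the only root is the list ["Hi","Hi"]. The result has four cells in to space: two for the outer list and two for the single shared string. The dead cell at address 104 is never copied.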

We will go through an example to see how this is done in detail. We will go through a minor collection without an old generation, and we will only use the stack as the root set. In reality the process dictionary, trace data and probe data, among other things, are also included in the root set.

Let us look at how the call to garbage_collect in the gc_example behaves. The code will generate a string which is shared by two elements of a cons and a tuple. The tuple will then be eliminated, resulting in garbage. After the GC there should only be one string on the heap. That is, first we generate the term {["Hello","Hello"], "Hello"} (sharing the same string "Hello" in all instances). Then we just keep the term ["Hello","Hello"] when triggering a GC.

Note
We will take the opportunity to go through how you, on a Linux system, can use gdb to examine the behavior of ERTS. You can of course use the debugger of your choice. If you already know how to use gdb or if you have no interest in going into the debugger you can just ignore the meta text about how to inspect the system and just look at the diagrams and the explanations of how the GC works.
link:code/memory_chapter/src/gc_example.erl[role=include]
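The listing itself lives in gc_example.erl and is not reproduced here. As a rough idea of the shape of such a program, here is a sketch under a different, hypothetical module name. It follows the description above (build the tuple, keep only the list, then collect), but the real file may well differ in its details.

```erlang
-module(gc_sketch).
-export([example/0]).

%% Build {["Hello","Hello"], "Hello"}, keep only the first element and
%% force a collection, which leaves the tuple behind as garbage.
example() ->
    T = gen_data(),
    S = element(1, T),
    erlang:garbage_collect(),
    S.

%% The string is built once at runtime and then referenced three times,
%% so all three occurrences share the same cons cells on the heap.
gen_data() ->
    S = gen_string($H, $e, $l, $l, $o),
    gen_tuple([S, S], S).

gen_string(A, B, C, D, E) ->
    [A, B, C, D, E].

gen_tuple(A, B) ->
    {A, B}.
```

Calling gc_sketch:example() returns ["Hello","Hello"], just like the call shown in the shell session that follows.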

After compiling the example I start an Erlang shell, test the call and prepare for a new call to the example (without hitting return):

1> gc_example:example().
["Hello","Hello"]
2> spawn(gc_example,example,[]).

Then I use gdb to attach to my Erlang node (OS PID: 2955 in this case):

$ gdb /home/happi/otp/lib/erlang/erts-6.0/bin/beam.smp 2955
Note
Depending on your settings for ptrace_scope you might have to precede the gdb invocation with 'sudo'.

Then in gdb I set a breakpoint at the start of the main GC function and let the node continue:

(gdb) break garbage_collect_0
(gdb) cont
Continuing.

Now I hit enter in the Erlang shell and execution stops at the breakpoint:

Breakpoint 1, garbage_collect_0 (A__p=0x7f673d085f88, BIF__ARGS=0x7f673da90340) at beam/bif.c:3771
3771	    FLAGS(BIF_P) |= F_NEED_FULLSWEEP;

Now we can inspect the PCB of the process:

(gdb) p *(Process *) A__p
$1 = {common = {id = 1408749273747, refc = {counter = 1}, tracer_proc = 18446744073709551611, trace_flags = 0, u = {alive = {
        started_interval = 0, reg = 0x0, links = 0x0, monitors = 0x0, ptimer = 0x0}, release = {later = 0, func = 0x0, data = 0x0,
        next = 0x0}}}, htop = 0x7f6737145950, stop = 0x7f6737146000, heap = 0x7f67371458c8, hend = 0x7f6737146010, heap_sz = 233,
  min_heap_size = 233, min_vheap_size = 46422, fp_exception = 0, hipe = {nsp = 0x0, nstack = 0x0, nstend = 0x0, ncallee = 0x7f673d080000,
    closure = 0, nstgraylim = 0x0, nstblacklim = 0x0, ngra = 0x0, ncsp = 0x7f673d0863e8, narity = 0, float_result = 0}, arity = 0,
  arg_reg = 0x7f673d086080, max_arg_reg = 6, def_arg_reg = {393227, 457419, 18446744073709551611, 233, 46422, 2000}, cp = 0x7f673686ac40,
  i = 0x7f673be17748, catches = 0, fcalls = 1994, rcount = 0, schedule_count = 0, reds = 0, group_leader = 893353197987, flags = 0,
  fvalue = 18446744073709551611, freason = 0, ftrace = 18446744073709551611, next = 0x7f673d084cc0, nodes_monitors = 0x0,
  suspend_monitors = 0x0, msg = {first = 0x0, last = 0x7f673d086120, save = 0x7f673d086120, len = 0, mark = 0x0, saved_last = 0x7d0}, u = {
    bif_timers = 0x0, terminate = 0x0}, dictionary = 0x0, seq_trace_clock = 0, seq_trace_lastcnt = 0,
  seq_trace_token = 18446744073709551611, initial = {393227, 457419, 0}, current = 0x7f673be17730, parent = 1133871366675,
  approx_started = 1407857804, high_water = 0x7f67371458c8, old_hend = 0x0, old_htop = 0x0, old_heap = 0x0, gen_gcs = 0,
  max_gen_gcs = 65535, off_heap = {first = 0x0, overhead = 0}, mbuf = 0x0, mbuf_sz = 0, psd = 0x0, bin_vheap_sz = 46422,
  bin_vheap_mature = 0, bin_old_vheap_sz = 46422, bin_old_vheap = 0, sys_task_qs = 0x0, state = {counter = 41002}, msg_inq = {first = 0x0,
    last = 0x7f673d086228, len = 0}, pending_exit = {reason = 0, bp = 0x0}, lock = {flags = {counter = 1}, queue = {0x0, 0x0, 0x0, 0x0},
    refc = {counter = 1}}, scheduler_data = 0x7f673bd6c080, suspendee = 18446744073709551611, pending_suspenders = 0x0, run_queue = {
    counter = 140081362118912}, hipe_smp = {have_receive_locks = 0}}

Wow, that was a lot of information. The interesting part is about the stack and the heap:

hend = 0x7f6737146010,
stop = 0x7f6737146000,
htop = 0x7f6737145950,
heap = 0x7f67371458c8,

By using some helper scripts we can inspect the stack and the heap in a meaningful way. (see [AP-listings] for the definitions of the scripts in gdb_script.)

(gdb) source gdb_scripts
(gdb) print_p_stack A__p
0x00007f6737146008 [0x00007f6737145929] cons -> 0x00007f6737145928
(gdb) print_p_heap A__p
0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f6737145919] cons -> 0x00007f6737145918
0x00007f6737145928 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145920 [0xfffffffffffffffb] NIL
0x00007f6737145918 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145910 [0x00007f67371458f9] cons -> 0x00007f67371458f8
0x00007f6737145908 [0x000000000000048f] 72
0x00007f6737145900 [0x00007f67371458e9] cons -> 0x00007f67371458e8
0x00007f67371458f8 [0x000000000000065f] 101
0x00007f67371458f0 [0x00007f67371458d9] cons -> 0x00007f67371458d8
0x00007f67371458e8 [0x00000000000006cf] 108
0x00007f67371458e0 [0x00007f67371458c9] cons -> 0x00007f67371458c8
0x00007f67371458d8 [0x00000000000006cf] 108
0x00007f67371458d0 [0xfffffffffffffffb] NIL
0x00007f67371458c8 [0x00000000000006ff] 111

Here we can see the heap of the process after it has allocated the list "Hello" on the heap and the cons containing that list twice, and the tuple containing the cons and the list. The root set, in this case the stack, contains a pointer to the cons containing two copies of the list. The tuple is dead, that is, there are no references to it.

The garbage collection starts by calculating the root set and by allocating a new heap (to space). By stepping into the gc code in the debugger you can see how this is done. I will not go through the details here. After a number of steps the execution will reach the point where all terms in the root set are copied to the new heap. This starts around (depending on version) line 1272 with a while loop in erl_gc.c.

In our case the root is a cons pointing to address 0x00007f6737145928, whose head in turn points to the string starting with the letter (integer) 'H'. When a cons cell is moved from the current heap, called from space, to the new heap, called to space, the value in the head (or car) is overwritten with a moved cons tag (the value 0).

After the first step, where the root set is moved, the from space and the to space look like this:

from space:

(gdb) print_p_heap p
0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f67371445b1] cons -> 0x00007f67371445b0
0x00007f6737145928 [0x0000000000000000] Tuple size 0
0x00007f6737145920 [0xfffffffffffffffb] NIL
0x00007f6737145918 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145910 [0x00007f67371458f9] cons -> 0x00007f67371458f8
0x00007f6737145908 [0x000000000000048f] 72
0x00007f6737145900 [0x00007f67371458e9] cons -> 0x00007f67371458e8
0x00007f67371458f8 [0x000000000000065f] 101
0x00007f67371458f0 [0x00007f67371458d9] cons -> 0x00007f67371458d8
0x00007f67371458e8 [0x00000000000006cf] 108
0x00007f67371458e0 [0x00007f67371458c9] cons -> 0x00007f67371458c8
0x00007f67371458d8 [0x00000000000006cf] 108
0x00007f67371458d0 [0xfffffffffffffffb] NIL
0x00007f67371458c8 [0x00000000000006ff] 111

to space:

(gdb) print_heap n_htop-1 n_htop-2
0x00007f67371445b8 [0x00007f6737145919] cons -> 0x00007f6737145918
0x00007f67371445b0 [0x00007f6737145909] cons -> 0x00007f6737145908

In from space the head of the first cons cell has been overwritten with 0 (looks like a tuple of size 0) and the tail has been overwritten with a forwarding pointer pointing to the new cons cell in the to space. In to space we now have the first cons cell with two backward pointers to the head and the tail of the cons in the from space.

When the collector is done with the root set the to space contains backward pointers to all still live terms. At this point the collector starts sweeping the to space. It uses two pointers: n_hp, pointing to the bottom of the unseen heap, and n_htop, pointing to the top of the heap.

n_htop:
        0x00007f67371445b8 [0x00007f6737145919] cons -> 0x00007f6737145918
n_hp    0x00007f67371445b0 [0x00007f6737145909] cons -> 0x00007f6737145908

The GC will then look at the value pointed to by n_hp, in this case a cons pointing back to the from space. So it moves that cons to the to space, incrementing n_htop to make room for the new cons, and incrementing n_hp to indicate that the first cons is seen.

from space:

0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f67371445b1] cons -> 0x00007f67371445b0
0x00007f6737145928 [0x0000000000000000] Tuple size 0
0x00007f6737145920 [0xfffffffffffffffb] NIL
0x00007f6737145918 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145910 [0x00007f67371445c1] cons -> 0x00007f67371445c0
0x00007f6737145908 [0x0000000000000000] Tuple size 0
0x00007f6737145900 [0x00007f67371458e9] cons -> 0x00007f67371458e8
0x00007f67371458f8 [0x000000000000065f] 101
0x00007f67371458f0 [0x00007f67371458d9] cons -> 0x00007f67371458d8
0x00007f67371458e8 [0x00000000000006cf] 108
0x00007f67371458e0 [0x00007f67371458c9] cons -> 0x00007f67371458c8
0x00007f67371458d8 [0x00000000000006cf] 108
0x00007f67371458d0 [0xfffffffffffffffb] NIL
0x00007f67371458c8 [0x00000000000006ff] 111

to space:

n_htop:
        0x00007f67371445c8 [0x00007f67371458f9] cons -> 0x00007f67371458f8
        0x00007f67371445c0 [0x000000000000048f] 72
n_hp    0x00007f67371445b8 [0x00007f6737145919] cons -> 0x00007f6737145918
SEEN    0x00007f67371445b0 [0x00007f67371445c1] cons -> 0x00007f67371445c0

The same thing then happens with the second cons.

from space:

0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f67371445b1] cons -> 0x00007f67371445b0
0x00007f6737145928 [0x0000000000000000] Tuple size 0
0x00007f6737145920 [0x00007f67371445d1] cons -> 0x00007f67371445d0
0x00007f6737145918 [0x0000000000000000] Tuple size 0
0x00007f6737145910 [0x00007f67371445c1] cons -> 0x00007f67371445c0
0x00007f6737145908 [0x0000000000000000] Tuple size 0
0x00007f6737145900 [0x00007f67371458e9] cons -> 0x00007f67371458e8
0x00007f67371458f8 [0x000000000000065f] 101
0x00007f67371458f0 [0x00007f67371458d9] cons -> 0x00007f67371458d8
0x00007f67371458e8 [0x00000000000006cf] 108
0x00007f67371458e0 [0x00007f67371458c9] cons -> 0x00007f67371458c8
0x00007f67371458d8 [0x00000000000006cf] 108
0x00007f67371458d0 [0xfffffffffffffffb] NIL
0x00007f67371458c8 [0x00000000000006ff] 111

to space:

n_htop:
        0x00007f67371445d8 [0xfffffffffffffffb] NIL
        0x00007f67371445d0 [0x00007f6737145909] cons -> 0x00007f6737145908
        0x00007f67371445c8 [0x00007f67371458f9] cons -> 0x00007f67371458f8
n_hp    0x00007f67371445c0 [0x000000000000048f] 72
SEEN    0x00007f67371445b8 [0x00007f67371445d1] cons -> 0x00007f67371445d0
SEEN    0x00007f67371445b0 [0x00007f67371445c1] cons -> 0x00007f67371445c0

The next element in to space is the immediate 72, which is only stepped over (with n_hp++). Then there is another cons which is moved.

from space:

0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f67371445b1] cons -> 0x00007f67371445b0
0x00007f6737145928 [0x0000000000000000] Tuple size 0
0x00007f6737145920 [0x00007f67371445d1] cons -> 0x00007f67371445d0
0x00007f6737145918 [0x0000000000000000] Tuple size 0
0x00007f6737145910 [0x00007f67371445c1] cons -> 0x00007f67371445c0
0x00007f6737145908 [0x0000000000000000] Tuple size 0
0x00007f6737145900 [0x00007f67371445e1] cons -> 0x00007f67371445e0
0x00007f67371458f8 [0x0000000000000000] Tuple size 0
0x00007f67371458f0 [0x00007f67371458d9] cons -> 0x00007f67371458d8
0x00007f67371458e8 [0x00000000000006cf] 108
0x00007f67371458e0 [0x00007f67371458c9] cons -> 0x00007f67371458c8
0x00007f67371458d8 [0x00000000000006cf] 108
0x00007f67371458d0 [0xfffffffffffffffb] NIL
0x00007f67371458c8 [0x00000000000006ff] 111

to space:

n_htop:
        0x00007f67371445e8 [0x00007f67371458e9] cons -> 0x00007f67371458e8
        0x00007f67371445e0 [0x000000000000065f] 101
        0x00007f67371445d8 [0xfffffffffffffffb] NIL
n_hp    0x00007f67371445d0 [0x00007f6737145909] cons -> 0x00007f6737145908
SEEN    0x00007f67371445c8 [0x00007f67371445e1] cons -> 0x00007f67371445e0
SEEN    0x00007f67371445c0 [0x000000000000048f] 72
SEEN    0x00007f67371445b8 [0x00007f67371445d1] cons -> 0x00007f67371445d0
SEEN    0x00007f67371445b0 [0x00007f67371445c1] cons -> 0x00007f67371445c0

Now we come to a cons that points to a cell that has already been moved. The GC sees the IS_MOVED_CONS tag at 0x00007f6737145908 and copies the destination of the moved cell from the tail (*n_hp++ = ptr[1];). This way sharing is preserved during GC. This step does not affect from space, but the backward pointer in to space is rewritten.

to space:

n_htop:
        0x00007f67371445e8 [0x00007f67371458e9] cons -> 0x00007f67371458e8
        0x00007f67371445e0 [0x000000000000065f] 101
n_hp    0x00007f67371445d8 [0xfffffffffffffffb] NIL
SEEN    0x00007f67371445d0 [0x00007f67371445c1] cons -> 0x00007f67371445c0
SEEN    0x00007f67371445c8 [0x00007f67371445e1] cons -> 0x00007f67371445e0
SEEN    0x00007f67371445c0 [0x000000000000048f] 72
SEEN    0x00007f67371445b8 [0x00007f67371445d1] cons -> 0x00007f67371445d0
SEEN    0x00007f67371445b0 [0x00007f67371445c1] cons -> 0x00007f67371445c0

Then the rest of the list (the string) is moved.

from space:

0x00007f6737145948 [0x00007f6737145909] cons -> 0x00007f6737145908
0x00007f6737145940 [0x00007f6737145929] cons -> 0x00007f6737145928
0x00007f6737145938 [0x0000000000000080] Tuple size 2
0x00007f6737145930 [0x00007f67371445b1] cons -> 0x00007f67371445b0
0x00007f6737145928 [0x0000000000000000] Tuple size 0
0x00007f6737145920 [0x00007f67371445d1] cons -> 0x00007f67371445d0
0x00007f6737145918 [0x0000000000000000] Tuple size 0
0x00007f6737145910 [0x00007f67371445c1] cons -> 0x00007f67371445c0
0x00007f6737145908 [0x0000000000000000] Tuple size 0
0x00007f6737145900 [0x00007f67371445e1] cons -> 0x00007f67371445e0
0x00007f67371458f8 [0x0000000000000000] Tuple size 0
0x00007f67371458f0 [0x00007f67371445f1] cons -> 0x00007f67371445f0
0x00007f67371458e8 [0x0000000000000000] Tuple size 0
0x00007f67371458e0 [0x00007f6737144601] cons -> 0x00007f6737144600
0x00007f67371458d8 [0x0000000000000000] Tuple size 0
0x00007f67371458d0 [0x00007f6737144611] cons -> 0x00007f6737144610
0x00007f67371458c8 [0x0000000000000000] Tuple size 0

to space:

n_htop:
n_hp
SEEN    0x00007f6737144618 [0xfffffffffffffffb] NIL
SEEN    0x00007f6737144610 [0x00000000000006ff] 111
SEEN    0x00007f6737144608 [0x00007f6737144611] cons -> 0x00007f6737144610
SEEN    0x00007f6737144600 [0x00000000000006cf] 108
SEEN    0x00007f67371445f8 [0x00007f6737144601] cons -> 0x00007f6737144600
SEEN    0x00007f67371445f0 [0x00000000000006cf] 108
SEEN    0x00007f67371445e8 [0x00007f67371445f1] cons -> 0x00007f67371445f0
SEEN    0x00007f67371445e0 [0x000000000000065f] 101
SEEN    0x00007f67371445d8 [0xfffffffffffffffb] NIL
SEEN    0x00007f67371445d0 [0x00007f67371445c1] cons -> 0x00007f67371445c0
SEEN    0x00007f67371445c8 [0x00007f67371445e1] cons -> 0x00007f67371445e0
SEEN    0x00007f67371445c0 [0x000000000000048f] 72
SEEN    0x00007f67371445b8 [0x00007f67371445d1] cons -> 0x00007f67371445d0
SEEN    0x00007f67371445b0 [0x00007f67371445c1] cons -> 0x00007f67371445c0

There are some things to note from this example. When terms are created in Erlang they are created bottom up, starting with the elements. The garbage collector works top down, starting with the top level structure and then copying the elements. This means that the direction of the pointers changes after the first GC. This has no real implications but it is good to know when looking at actual heaps. You cannot assume that structures are laid out bottom up.

Also note that the GC does a breadth first traversal. This means that locality for one term most often is worse after a GC. With the size of modern caches this should not be a problem. You could of course create a pathological example where it becomes a problem, but you can also create a pathological example where a depth first approach would cause problems.

The third thing to note is that sharing is preserved, which is really important: otherwise we might end up using more space after a GC than before.
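You can get a feeling for how much sharing is worth without a debugger. The sketch below compares the size of a term with sharing taken into account against its size if every occurrence were copied separately. It relies on erts_debug:size/1 and erts_debug:flat_size/1, which belong to an undocumented debugging module, so take it as an illustration only. The module name sharing_sketch is my own.

```erlang
-module(sharing_sketch).
-export([sizes/1]).

%% Returns {WordsWithSharing, WordsIfFlattened} for the list [S, S].
%% S is passed in as an argument so that the list is built at runtime
%% and both elements really point to the same term.
sizes(S) ->
    L = [S, S],
    {erts_debug:size(L), erts_debug:flat_size(L)}.
```

For a five character string, for example sharing_sketch:sizes(lists:seq(1, 5)), I would expect {14,24}: ten words for the five shared cons cells plus four words for the outer list, against twenty-four words if the string were duplicated. The 14 words match the number of words in the final to space in the example above.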

Generations…​

  hend ->  +----+
           |....|
  stop ->  |    |
           |    |    +----+ old_hend
           |    |    |    |
  htop ->  |    |    |    |
           |....|    |    | old_htop
           |....|    |....|
  heap ->  +----+    +----+ old_heap
          The Heap   Old Heap
high_water, old_hend, old_htop, old_heap, gen_gcs, max_gen_gcs, off_heap, mbuf, mbuf_sz, psd, bin_vheap_sz, bin_vheap_mature, bin_old_vheap_sz, bin_old_vheap.
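Some of these settings can be inspected from Erlang with process_info/2, without attaching a debugger. The sketch below picks a few entries out of the garbage_collection item. My reading of the runtime is that minor_gcs reflects gen_gcs and fullsweep_after reflects max_gen_gcs, but treat that mapping as an assumption. The module name gen_info_sketch is hypothetical.

```erlang
-module(gen_info_sketch).
-export([info/0]).

%% Collect a few generational GC settings and counters for the caller.
info() ->
    {garbage_collection, GC} =
        erlang:process_info(self(), garbage_collection),
    #{minor_gcs          => proplists:get_value(minor_gcs, GC),
      fullsweep_after    => proplists:get_value(fullsweep_after, GC),
      min_heap_size      => proplists:get_value(min_heap_size, GC),
      min_bin_vheap_size => proplists:get_value(min_bin_vheap_size, GC)}.
```

In the PCB dump earlier in this section max_gen_gcs is 65535, min_heap_size is 233 and min_vheap_size is 46422, so on a node with default settings the map returned here should show corresponding values.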

Other interesting memory areas

The atom table.

TODO

Code

TODO

Constants

TODO


1. We ignore tracing here which will add a trace token to the size of the message, and always use a heap fragment.
2. The -sizeof(Eterm) comes from mem in ErlHeapFragment already having the size of 1 Eterm