Memory management is the central problem for heavily loaded Java applications that operate on huge amounts of data. We solve this problem with a new, byte-based memory management scheme that will debut in the upcoming Stardog 5 release.
The new scheme will make Stardog more efficient with respect to memory, more stable under very heavy workloads, and in some cases more performant. The first use of the new scheme will be in query evaluation—the part of Stardog most vulnerable to memory pressure. Over the 5.x release cycle we will integrate the new approach into other parts of the system.
In this post we describe that scheme and its advantages. In next week’s post we will look at some micro-benchmarks to get an idea of the overhead of the new approach.
Background & Inspiration
The JVM’s garbage collector has two fundamental problems:
- Resource consumption during the garbage collection process (stop-the-world pauses, CPU consumption, etc.) can be considerable.
- OutOfMemoryException can be thrown if there is insufficient memory.
How do we solve this without rewriting Stardog in C++?
From Objects to Bytes
Consider the main problem with the usual Java approach. When a heavily loaded application uses plain Java objects, it allocates memory for each object, uses that memory, and then drops the strong reference to the object. When the number of such objects is large, this inevitably puts heavy pressure on the GC, and we cannot control memory overflow.
To prevent this problem we don’t use Java objects to store data. Instead we use a byte-based representation of the data: we operate directly on serialized data and control memory allocation and usage directly, byte by byte.
If there is no more memory, we can always use disk. Because we control every byte used for our data, we can prevent OutOfMemoryException.
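As a minimal illustration of the idea (not Stardog’s actual API), an element can be written into a plain byte buffer as a length-prefixed record and read back directly from the bytes, so no per-element object graph is ever retained:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedDemo {
    public static void main(String[] args) {
        // Write a value as a length-prefixed byte record instead of keeping it as an object.
        byte[] value = "hello, bytes".getBytes(StandardCharsets.UTF_8);
        ByteBuffer block = ByteBuffer.allocate(32 * 1024);   // one fixed-size "block"
        block.putInt(value.length);                          // length prefix
        block.put(value);                                    // payload bytes

        // Read it back straight from the serialized form.
        block.flip();
        int length = block.getInt();
        byte[] out = new byte[length];
        block.get(out);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}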
We were inspired by the Apache Flink approach as described in Juggling with Bits and Bytes. The same technique is used in Apache Drill, Apache Ignite (incubating), Apache Geode (incubating), and Apache Spark (Project Tungsten) to name a few.
Byte-based Memory Management
Here are some design basics and constraints we’ve implemented:
- We allocate memory in atomic blocks.
- Each block of memory is represented by the MemoryBlock interface and can be on or off heap (a minimal sketch of such an interface follows this list).
- If the block is on heap, it is backed by a Java byte[] array.
- If the block is off heap, it is backed by native memory.
- The block size is 32KB by default but can be modified.
- Blocks are allocated and never released, but they can be reused.
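To make the block abstraction concrete, here is a minimal sketch of what such an interface might look like; the names below are illustrative and do not reproduce Stardog’s actual MemoryBlock API:

import java.nio.ByteBuffer;

// Hypothetical sketch of a fixed-size memory block that may live on or off heap.
interface MemoryBlock {
    int capacity();   // block size, 32KB by default
    void write(int offset, byte[] src, int srcOffset, int length);
    void read(int offset, byte[] dst, int dstOffset, int length);
}

// On-heap variant backed by a plain byte[] array.
class HeapMemoryBlock implements MemoryBlock {
    private final byte[] data;
    HeapMemoryBlock(int capacity) { this.data = new byte[capacity]; }
    public int capacity() { return data.length; }
    public void write(int offset, byte[] src, int srcOffset, int length) {
        System.arraycopy(src, srcOffset, data, offset, length);
    }
    public void read(int offset, byte[] dst, int dstOffset, int length) {
        System.arraycopy(data, offset, dst, dstOffset, length);
    }
}

// Off-heap variant backed by native memory through a direct ByteBuffer.
class OffHeapMemoryBlock implements MemoryBlock {
    private final ByteBuffer data;
    OffHeapMemoryBlock(int capacity) { this.data = ByteBuffer.allocateDirect(capacity); }
    public int capacity() { return data.capacity(); }
    public void write(int offset, byte[] src, int srcOffset, int length) {
        ByteBuffer view = data.duplicate();
        view.position(offset);
        view.put(src, srcOffset, length);
    }
    public void read(int offset, byte[] dst, int dstOffset, int length) {
        ByteBuffer view = data.duplicate();
        view.position(offset);
        view.get(dst, dstOffset, length);
    }
}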
The main advantages of this approach:
- Resilience. We can use disk on demand and handle difficult cases in a principled fashion: everything keeps working even when no memory is left in the pool; new elements simply flow to disk.
- Reduced memory pressure. Because blocks are long-lived, the garbage collector rarely needs to run, so GC pressure drops considerably.
- Better memory efficiency. Every Java object carries byte overhead for its header; with a huge number of objects this overhead becomes considerable (see the worked example after this list).
- Performance. Good cache locality, and all functional objects are long-lived flyweights.
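To put the memory-efficiency point in rough numbers: on a typical 64-bit JVM with compressed object pointers, a boxed java.lang.Long occupies roughly 24 bytes (a 12-byte header plus the 8-byte value, padded to an 8-byte boundary) to carry 8 bytes of payload, whereas the byte-based representation stores little more than the 8 value bytes themselves.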
We think it’s obvious how each of these benefits will improve Stardog.
Implementation
So how does this approach get implemented and then integrated into the existing codebase? Primarily through the implementation of various collections: a simple array, a sorted array, an optimized sorted array, a hash table, and an aggregator.
Each collection is implemented using the atomic memory block scheme described above. Each collection has a main chain of blocks where data are written. An element can span blocks, starting in one block and finishing in another, so elements of any length can be written.
Simple Array
All elements are written sequentially to the chain of memory blocks. If no free blocks remain to write to, all used blocks are spilled to disk and the memory blocks are reused.
The data layout of a simple array:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
A file on disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
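A rough sketch of this append-and-spill logic, with the blocks simplified to plain byte[] arrays; this is an illustration under assumed names, not Stardog’s implementation:

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative simple array: length-prefixed elements appended to a chain of fixed-size blocks.
class SimpleByteArray {
    private static final int BLOCK_SIZE = 32 * 1024;
    private final List<byte[]> blocks = new ArrayList<>();
    private final int maxBlocks;                 // size of the reusable in-memory pool
    private final DataOutputStream spillFile;    // destination for spilled elements
    private int blockIndex = 0;
    private int offset = 0;

    SimpleByteArray(int maxBlocks, String spillPath) throws IOException {
        this.maxBlocks = maxBlocks;
        this.spillFile = new DataOutputStream(new FileOutputStream(spillPath));
        blocks.add(new byte[BLOCK_SIZE]);
    }

    // Appends one length-prefixed element; an element may span several blocks.
    void add(byte[] element) throws IOException {
        int needed = 4 + element.length;
        int remaining = (maxBlocks - blockIndex) * BLOCK_SIZE - offset;
        if (needed > remaining) {
            spillAndReuseBlocks();               // pool exhausted: spill used blocks, then reuse them
        }
        writeInt(element.length);
        writeBytes(element, 0, element.length);
    }

    private void spillAndReuseBlocks() throws IOException {
        for (int i = 0; i < blockIndex; i++) {
            spillFile.write(blocks.get(i));                       // full blocks
        }
        spillFile.write(blocks.get(blockIndex), 0, offset);       // partially filled last block
        blockIndex = 0;
        offset = 0;
    }

    private void writeInt(int v) {
        byte[] prefix = { (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
        writeBytes(prefix, 0, 4);
    }

    private void writeBytes(byte[] src, int from, int len) {
        while (len > 0) {
            if (offset == BLOCK_SIZE) {          // move to the next block in the chain
                if (blockIndex + 1 == blocks.size()) blocks.add(new byte[BLOCK_SIZE]);
                blockIndex++;
                offset = 0;
            }
            int n = Math.min(len, BLOCK_SIZE - offset);
            System.arraycopy(src, from, blocks.get(blockIndex), offset, n);
            offset += n; from += n; len -= n;
        }
    }
}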
Sorted Array
All elements are written to the chain of memory blocks. The index and offset of each element in the data layout are written to the blocks of the address layout.
If no more blocks are available to write a pointer into the address layout, the <index, offset> pairs are sorted with a binary comparator and the data from the data layout are written to disk in sorted order using merge sort. On iteration, data are sorted via the address layout and emitted as output using merge sort. If search is required, we build a solid sorted index in memory or on disk; using binary search, elements can be found in the collection with O(log(N)) time complexity.
The address layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|index1 |offset1| |index5 | offset5| |index9 |offset9 |
----------------- ------------------ ------------------
|index2 |offset2| |index6 | offset6| .... |index10|offset10|
----------------- ------------------ ------------------
|index3 |offset3| |index7 | offset7| |index11|offset11|
----------------- ------------------ ------------------
|index4 |offset4| |index8 | offset8| |index12|offset12|
────────────────- ────────────────-- ──────────────────
The data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
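To illustrate why the address layout enables O(log(N)) lookups, here is a hedged sketch: the small, fixed-size address entries are sorted by a comparator that dereferences each offset into the data bytes, and lookups binary-search those entries. The class and field names are hypothetical, and the data layout is simplified to one flat byte array:

import java.util.Arrays;
import java.util.Comparator;

// Illustrative lookup over a sorted address layout.
class SortedAddressLookup {
    private final byte[] data;        // stand-in for the chained data blocks
    private final long[] addresses;   // stand-in for the <index, offset> entries; here just offsets

    SortedAddressLookup(byte[] data, long[] offsets, Comparator<byte[]> cmp) {
        this.data = data;
        this.addresses = offsets.clone();
        // Sort the small fixed-size address entries instead of moving the variable-length data.
        Long[] boxed = Arrays.stream(addresses).boxed().toArray(Long[]::new);
        Arrays.sort(boxed, (a, b) -> cmp.compare(elementAt(a), elementAt(b)));
        for (int i = 0; i < boxed.length; i++) addresses[i] = boxed[i];
    }

    // Binary search: O(log N) comparisons, each one dereferencing into the data bytes.
    int indexOf(byte[] key, Comparator<byte[]> cmp) {
        int lo = 0, hi = addresses.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = cmp.compare(elementAt(addresses[mid]), key);
            if (c == 0) return mid;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    // Reads one length-prefixed element starting at the given offset.
    private byte[] elementAt(long offset) {
        int off = (int) offset;
        int len = ((data[off] & 0xFF) << 24) | ((data[off + 1] & 0xFF) << 16)
                | ((data[off + 2] & 0xFF) << 8) | (data[off + 3] & 0xFF);
        return Arrays.copyOfRange(data, off + 4, off + 4 + len);
    }
}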
Optimized Sorted Array
When values can be compared using a single long value, we write that long value for each element into the address layout alongside its index and offset, which gives a significant performance improvement because of good cache locality.
Address layout:
__________________________ __________________________
| MemoryBlock1 | | MemoryBlockN |
|------------------------| |------------------------|
|index1 |offset1| long_1 | | |
|------------------------| | |
|index2 |offset2| long_2 | ... | .... |
|------------------------| | |
|index3 |offset3| long_3 | | |
|------------------------| |------------------------|
|index4 |offset4| long_4 | |indexK | offsetK |long_K|
|────────────────--------| |────────────────--------|
And the data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with previously spilled bytes in the sorted order:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
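A small illustrative sketch (again with hypothetical names) of why keeping the comparison long inline pays off: sorting touches only the compact address entries and never dereferences into the data blocks:

import java.util.Arrays;

// Illustrative address entry for the optimized sorted array: the sort key is stored inline.
class LongKeyedEntry implements Comparable<LongKeyedEntry> {
    final int index;      // element index
    final long offset;    // offset into the data layout
    final long sortKey;   // single long value used for ordering

    LongKeyedEntry(int index, long offset, long sortKey) {
        this.index = index;
        this.offset = offset;
        this.sortKey = sortKey;
    }

    @Override
    public int compareTo(LongKeyedEntry other) {
        // Comparison reads only the inline long: no dereference into the data blocks,
        // so sorting stays cache-friendly.
        return Long.compare(this.sortKey, other.sortKey);
    }

    static void sortAddresses(LongKeyedEntry[] addressLayout) {
        Arrays.sort(addressLayout);
    }
}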
HashTable
All elements are written to the chain of memory blocks. We calculate the partition hash code of an element to obtain its partition. Using the current block of that partition, we insert the element into the block’s open addressing table. If an element with the same key already exists, the new element references the tail of the previously written elements, so all duplicate elements are stored as a linked list in the data layout. If only unique keys are required (HashSet keys), only one element is stored in the address block.
When we look up by key, the partition is calculated first. After that, all blocks inside the partition, plus any additional blocks on disk, are probed. We will extend this to use a special kind of disk index to achieve logarithmic complexity when finding blocks.
Address layout with open addressing tables in each block:
Partition1
____________________ _____________________ ________________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
-------------------- --------------------- ------------------------
|hashCode1|address1| |hashCode5|address5 | |hashCode13| address13 |
-------------------- --------------------- ------------------------
|hashCode2|address2| |hashCode6|address6 | .... |hashCode14| address14 |
-------------------- --------------------- ------------------------
|hashCode3|address3| |hashCode7|address7 | |hashCode15| address15 |
-------------------- -------------------- ------------------------
|hashCode4|address4| |hashCode8|address8 | |hashCode16| address16 |
──────────────────── ───────────────────── ────────────────────────
...
PartitionK
______________________ ______________________ ________________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock3 |
---------------------- ---------------------- ------------------------
|hashCode30|address30| |hashCode34|address34| |hashCode38| address38 |
---------------------- ---------------------- ------------------------
|hashCode31|address31| |hashCode35|address35| |hashCode39| address39 |
---------------------- ---------------------- ------------------------
|hashCode32|address32| |hashCode36|address36| |hashCode40| address40 |
---------------------- ---------------------- ------------------------
|hashCode33|address33| |hashCode37|address37| |hashCode41| address41 |
────────────────────── ────────────────────── ────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk with the spilled hash tables:
File_partition1:
-----------------------------------------------------------------
|open_addressing_table_1 | open_addressing_table_2 | .... | ... |
-----------------------------------------------------------------
...
File_partitionK:
-----------------------------------------------------------------
|open_addressing_table_1 | open_addressing_table_2 | .... | ... |
-----------------------------------------------------------------
A file on disk with previously spilled elements:
----------------------------------------
|length_1|data_1|length_2|data_2 .... |
----------------------------------------
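A hedged sketch of the probing idea, covering partition selection and open addressing within a single block; the structures below are simplified stand-ins (for brevity they treat equal hash codes as equal keys and assume the table never fills up), not Stardog’s classes:

// Illustrative open-addressing table for one block of a partition.
class OpenAddressingBlock {
    private final int[] hashCodes;
    private final long[] addresses;   // address of the newest element with this key in the data layout

    OpenAddressingBlock(int capacity) {
        hashCodes = new int[capacity];
        addresses = new long[capacity];
        java.util.Arrays.fill(addresses, -1L);   // -1 marks an empty slot
    }

    // Selects a partition from the element's hash code.
    static int partitionOf(int hashCode, int partitionCount) {
        return Math.floorMod(hashCode, partitionCount);
    }

    // Inserts (hashCode -> address) with linear probing. If the key already exists the slot is
    // updated to point at the new element, whose record in the data layout links back to the
    // previous one, so duplicates form a linked list of data-layout addresses.
    long put(int hashCode, long address) {
        int slot = Math.floorMod(hashCode, hashCodes.length);
        while (addresses[slot] != -1L) {
            if (hashCodes[slot] == hashCode) {   // simplification: real code would also compare key bytes
                long previous = addresses[slot];
                addresses[slot] = address;
                return previous;                 // caller stores this back-pointer with the new element
            }
            slot = (slot + 1) % hashCodes.length;
        }
        hashCodes[slot] = hashCode;
        addresses[slot] = address;
        return -1L;                              // no previous element with this key
    }
}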
Aggregator
The aggregator’s structure is very similar to the HashTable; the difference occurs during spilling to disk, when data are written to the file in pre-aggregated format. Additionally, we support a Functor for the aggregator.
For example, when we need to calculate a function over all elements in a group, we don’t store the elements as a sequence. Rather, we calculate the function’s value directly in the memory block. That reduces memory consumption, and it is the pre-calculated function values that get written to disk.
Further, we support a second level of aggregation, where the elements of the first level can be aggregated using second-level aggregation strategies.
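As an illustration of the functor idea, under assumed names rather than the actual Stardog interface, the aggregator can fold each incoming element into one running value per group instead of retaining the elements themselves; spilling would then write these pre-aggregated values to disk:

import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregation functor: folds an element into the running value of its group.
interface AggregationFunctor<V> {
    V identity();
    V apply(V accumulated, long element);
}

// Illustrative per-group aggregation: only one accumulated value per key is retained.
class FunctorAggregator<V> {
    private final AggregationFunctor<V> functor;
    private final Map<Integer, V> groups = new HashMap<>();   // stand-in for the hash table blocks

    FunctorAggregator(AggregationFunctor<V> functor) { this.functor = functor; }

    void add(int groupKey, long element) {
        V current = groups.getOrDefault(groupKey, functor.identity());
        groups.put(groupKey, functor.apply(current, element));   // pre-aggregated in memory
    }

    V valueOf(int groupKey) { return groups.get(groupKey); }

    public static void main(String[] args) {
        // Example: SUM over each group.
        FunctorAggregator<Long> sum = new FunctorAggregator<>(new AggregationFunctor<Long>() {
            public Long identity() { return 0L; }
            public Long apply(Long acc, long element) { return acc + element; }
        });
        sum.add(1, 10); sum.add(1, 32); sum.add(2, 7);
        System.out.println(sum.valueOf(1));   // 42
    }
}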
Address layout with open addressing tables in each block:
Partition1
____________________ _____________________ ________________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
-------------------- --------------------- ------------------------
|hashCode1|address1| |hashCode5|address5 | |hashCode13| address13 |
-------------------- --------------------- ------------------------
|hashCode2|address2| |hashCode6|address6 | .... |hashCode14| address14 |
-------------------- --------------------- ------------------------
|hashCode3|address3| |hashCode7|address7 | |hashCode15| address15 |
-------------------- -------------------- ------------------------
|hashCode4|address4| |hashCode8|address8 | |hashCode16| address16 |
──────────────────── ───────────────────── ────────────────────────
...
PartitionK
______________________ ______________________ ________________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock3 |
---------------------- ---------------------- ------------------------
|hashCode30|address30| |hashCode34|address34| |hashCode38| address38 |
---------------------- ---------------------- ------------------------
|hashCode31|address31| |hashCode35|address35| |hashCode39| address39 |
---------------------- ---------------------- ------------------------
|hashCode32|address32| |hashCode36|address36| |hashCode40| address40 |
---------------------- ---------------------- ------------------------
|hashCode33|address33| |hashCode37|address37| |hashCode41| address41 |
────────────────────── ────────────────────── ────────────────────────
Data layout:
_________________ __________________ __________________
| MemoryBlock1 | | MemoryBlock2 | | MemoryBlock_N |
----------------- ------------------ ------------------
|length1 | data1| | data3 | | ... |
----------------- ------------------ | |
|length2 | data2| |length4 | data4 | .... | |
----------------- ------------------ | data_K |
|length3 | data3| | | | |
----------------- | .... | | ... |
| data3 | | | | |
────────────────- ────────────────-- ──────────────────
On disk, values are stored in pre-aggregated format:
--------------------------------------------------------------------------------------
|partition_1|hash_code_1|elements_count_1| length_1|data_1|length_2|data_2| .... |
--------------------------------------------------------------------------------------
|partition_2|hash_code_2|elements_count_2| length_3|data_3|length_4|data_4 | .... |
--------------------------------------------------------------------------------------
...
--------------------------------------------------------------------------------------
|partition_k|hash_code_k|elements_count_k|length_(L-1)|data_(L-1)|length_(L)|data_(L)|
--------------------------------------------------------------------------------------
Preliminary Conclusions
We’ve described Stardog’s new approach to memory management, which has several advantages over the conventional GC approach in the JVM. In particular, the byte-based approach gives us greater control over memory management in Stardog, which will lead to better performance, improved stability, and a more predictable system at scale.
In next week’s post we will look at the performance overhead of this approach compared to standard Java collections with small, medium, and large numbers of elements.
Download Stardog today to start your free 30-day evaluation.
Alexander Toktarev
31 January 2017