Melting Bits

When writing multi-threaded code, you face different challenges.

One of them is Memory Management.

0. A little story.

You need to protect your shared variables.
So you lock around them. With no deadlock, cause you are a hero.
Easy peasy.

Then you need better performances, because locking can kill you. Well, at least your perf.
So you compare different locking solutions. [cf previous post]
And you use the best available, according to your context.
Cool.

But you still need to share variables. And that costs a lot.
So you avoid shared variables as much as you can. You rethink your algorithms.
OK.

But you STILL need to share variables, at least some of them.
So you try atomic operations and lockfree code.
Great.
Not to mention waitfree code.
Duh!
And you try the nifty lock free list, among others.
Yeay.

But then you need to allocate/free nodes from multiple threads, and memory management in a multithread environment can kill your perf.
So you investigate a little more, and try different solutions, in different contexts, and get some numbers.
You love numbers.

Here are some of those tests...

1. Memory allocation methods

We will compare 5 different solutions.

a. Description of the solutions

Solution 1 : the default new / delete

That's it.

Solution 2 : a lockfree memory pool - the holy grail ?

An array of data blocks.
An array of integers to index the blocks : 0 is free, 1 is used;
When we search a free block index, we start searching after the latest used index - round robin like.
Atomic operations are used to modify the indices, with loops in order to retry if the state of the index has been changed before we could complete the transaction - typical lockfree stuff.

Solution 3 : locking a naive memory pool - to compare the lockfree one

An array of data blocks.
An array of integers to index the blocks : 0 is free, 1 is used;
When we search a free block index, we start searching after the latest used index - round robin like.
We lock the array of indices when we access it (allocate/free)

Solution 4 : locking a memory pool - lower foot print, faster access

An array of data blocks.
An array of bitfields to index the blocks : a bitfield is an 32 bits integer that indexes 32 blocks; a bit at 0 is "used", 1 means "free".
When we search a free block index bit, we start searching after the latest fully used bitfield index - round robin like.
We lock the array of bitfields when we access it (allocate/free)

Solution 5 : locking a constant access time memory pool - fastest access ?

An array of data blocks.
An hierarchy of bitfields, stored in a flat array : each bitfield at a given level of the hierarchy indicates the state of its corresponding lower level, 0 means "used", 1 means "free or partially free"
The lowest level works just like in solution 4.
We lock the hierarchy of bitfields when we access it (allocate/free)
We lock the array of bitfields when we access it (allocate/free)

b. Pros (+) and cons (-) of these solutions

Under our lense : locking, memory footprint, searching a free index, fragmentation.

Solution 1 : the default new / delete

... is here as a reference.

Solution 2 : lockfree memory pool

locking : (+) has no locks
(-) but retries can be expensive
memory footprint : (-) 32bits integer per index is required for atomic operations (where 1 bit is required to encode the free/used state)
searching a free index : (+) it can be fast if the "next indices" are free
(-) but it can/will be slow otherwise, since we will have frequent cache misses while iterating over 32bits integers

especially if the pool is almost full; worst case : we will go through all indices

fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan

in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 3 : locking a naive memory pool - to compare to the lockfree one

Just like Solution 2, except for the lockfree retries that are replaced by a raw "lock the whole indices array" that will work on the first try.

locking : (+) no retries !(-) has locks
memory footprint : (-) 32bits integer per index (where 1 bit is required to encode the free/used state of 1 block)
searching a free index : (+) it can be fast if the "next indices" are free
(-) but it can/will be slow otherwise, since we iterate over the 32bits indices, which triggers some frequent cache misses

especially if the pool is almost full; worst case : we will go through all indices

fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan

in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 4 : locking a memory pool - lower foot print, faster access

Just like Solution 3, except that we use 1 bit to encode the "free"/"used" state, leading to minimal footprint & minimized cache misses.

locking : (+) no retries !
(-) has locks
memory footprint : (+) 1bit per index is optimal
searching a free index : (+) we search through 32 indices at the same time by looking at an 32bits integer (bitfield)
(+) it can be fast if the "next indices" are free
(-) but it can be can/slow otherwise, since we iterate over the 32bits bitfields, which triggers some cache misses

especially if the pool is almost full; worst case : we will go through all indices

fragmentation : (-) it will fragment the pool , unless the allocations all have the same lifespan

in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 5 : locking a constant access time memory pool - fastest access ?

locking : (+) no retries !
(-) has locks
memory footprint : (+) > 1bit per index is great
searching a free index :
(+) alwayse in constant time, whatever the state of the pool : we always have to go through the whole hierarchy of bitfields once.
(-) the bigger the pool, the more levels in the hierarchy
(-) constant time is great for the general case, but can still be slower than Solution 4 when "the pool would behave like a nicely balanced producer-consumer queue".
fragmentation :
(+) it will keep the pool fragmentation to a minimum, each time reusing the first free index

2. Tests & results

System w/ 8 CPU cores, on Win7 64bits
Nb threads : 8
Nb allocations / frees per thread : 1 000 000
Allocated data block size : 12 bytes.

We will run the tests in 2 different "memory contexts" :

Context (F) : in a clean & fresh "memory environment" :

no allocations have been made prior to the test

Context (P) : in a polluted context :

for memory pools, we set to 'used' half of the indices, interleaving a 'used' slot with a 'free' slot along the whole pool.
for the default allocator, we allocate half of the targeted max allocation count.

Test 1 : short life allocations - each thread : N times (allocate and free immediately)

Solution	Total time (s)	Time per task (s)
(F)S2 - lockfree	1.117134	0.000000
(P)S2 - lockfree	1.500297	0.000000
(P)S3 - lock naive pool	3.002592	0.000000
(F)S3 - lock naive pool	3.050532	0.000000
(F)S4 - lock 1bit pool	3.080593	0.000000
(P)S4 - lock 1bit pool	3.126127	0.000000
(P)S5 - lock constant access pool	4.081311	0.000001
(F)S5 - lock constant access pool	4.191293	0.000001
(F)S1 - default	8.033740	0.000001
(P)S1 - default	14.308695	0.000002

Test 2 : long life allocations - each thread : N times (allocate) then N times(free)

Solution	Total time (s)	Time per task (s)
(F) S2 - lockfree	1.158107	0.000000
(P) S2 - lockfree	1.430778	0.000000
(P) S3 - lock naive pool	3.120570	0.000000
(F) S3 - lock naive pool	3.155554	0.000000
(F) S4 - lock 1bit pool	3.376785	0.000000
(P) S4 - lock 1bit pool	3.382694	0.000000
(F) S5 - lock constant access pool	4.130811	0.000001
(P) S5 - lock constant access pool	5.545604	0.000001
(F) S1 - default	249.586365	0.000036
(P) S1 - default	263.046783	0.000038

Without the default allocator :

Test 3 : like Test 2 but on a single thread - no locking
- just because we might still have single thread context here and there !

Solution	Total time (s)	Time per task (s)
(F) S3 - naive pool	0.027042	0.000000
(F) S4 - 1bit pool	0.030131	0.000000
(P) S3 - naive pool	0.030818	0.000000
(P) S4 - 1bit pool	0.031371	0.000000
(F) S5 - constant access pool	0.070410	0.000000
(P) S5 - constant access pool	0.070472	0.000000
(F) S2 - lockfree	0.079383	0.000000
(P) S2 - lockfree	0.096421	0.000000
(F) S1 - default	64.339546	0.000009
(P) S1 - default	67.744881	0.000010

Without the default allocator :

3. Quick analyze

In the multi-threaded Tests, we have a few obvious facts :

Interestingly, whatever the number of allocations tested, the results are pretty stable.
The LockFreeMemoryPool outperformed any other solution, in all situations.
And that's a great win \o/
The default alloc/free is always the slower, by far - and that's normal, it fits a general purpose, so you can't expect it to compete with specific solutions like the various pool allocators
But it also cannot handle huge amount of allocations properly : it totally sank with the 8 times 1 000 000 operations in polluted environment.
The HBFIMemoryPool generally works with constant allocation times indeed, but it's always the worst pool solution in these tests.
... I guess the implementation could be optimized.

Moreover, the NaiveMemoryPool did pretty good. Usually better than the (1bit) MemoryPool.
Since the MemoryPool and the HBFIMemoryPool both do bits manipulation, I think this can be the bottleneck.

Of course, had I provided another type of pollution, such as 'the first half of the pool is already used, and we start searching for free indices at the beginning', we could have seen different results.
In such case, there would be much more cache misses for some of the memory pools, and maybe the HBFI would have been a little more powerful.

When looking at the single threaded test, we clearly see the cost of the atomic operations involved in the LockFreeMemoryPool, especially when compared to its unsafe equivalent, the NaiveMemoryPool.
I'd wish the LockFreeMemoryPool had outperformed any other solutions, in all contexts, even when running on a single thread alone.
But that's not how it works.

It really proves, if that was still unclear to some of you, how important the context is to choose the right tool.
Always make a decision with the context in mind.
Always.
Period.

PS : I would like to try a TLS based solution some day... (who said hazard pointers ?)
PPS : Damn, this is a long post >.<

Quick test on a simple scenario.

Description

I have a shared integer, that every thread tries to increment multiple times.
The information I need is the final value, sum of all these increments.

I prevent over subscription and under subscription by having as many software threads as logical cores.
I use basic thread functions, so there is no particular overhead (no task scheduling or whatsoever).

Proposed solutions

solution 1 : use a CRITICAL_SECTION to protect the increment operation
solution 2 : use a SRWLOCK to protect the increment operation
solution 3 : use a custom SpinLock to protect the increment operation
solution 4 : use an atomic operation and no lock
solution 5 : use TLS (thread local storage) to have a local counter per thread, and sync the results with an atomic operation in the end

Results

I ran the test multiple times, and always had about the same numbers.
Here are the results.

#######################################################################
Infos summary :
System w/ 8 CPU cores
Nb 'tasks' : 80 000 000
Nb (worker) threads : 8
Average work load per thread : 10 000 000 tasks
=============================================================

Results summary :

TEST_TLS_ATOMIC 0.295073 s

TEST_SRWLOCK 2.548695 s

TEST_ATOMIC 2.701041 s

TEST_CRITICAL_SECTION 7.383443 s

TEST_SPINLOCK 18.363033 s

#######################################################################

Explanations

With no surprise, solution 5 (TLS) wins, since there is almost no contention.

On the contrary, I was quite surprised to see that the SWRLOCK solution was about the same speed as the atomic increment one (sometimes slightly faster, sometimes slightly slower).
The only explanation I have is that each atomic operation inccurs a cache update & flush, so here we have 80M of these.
On the other hand, using SRWLOCK might lead to fewer cache update & flush, e.g. if the same thread increments the shared value multiple times before another thread manages to grab the lock.

And the SpinLock... damn, this is so slow. I wonder if my (naive) implementation is right >.<.

Melting Bits

vendredi 6 juillet 2012

Graphics - SSS gone cheap

lundi 23 avril 2012

MultiCore - Benchmarks - MemoryAllocation

0. A little story.

1. Memory allocation methods

a. Description of the solutions

Solution 1 : the default new / delete

Solution 2 : a lockfree memory pool - the holy grail ?

Solution 3 : locking a naive memory pool - to compare the lockfree one

Solution 4 : locking a memory pool - lower foot print, faster access

Solution 5 : locking a constant access time memory pool - fastest access ?

Solution 1 : the default new / delete

Solution 2 : lockfree memory pool

Solution 3 : locking a naive memory pool - to compare to the lockfree one

Solution 4 : locking a memory pool - lower foot print, faster access

Solution 5 : locking a constant access time memory pool - fastest access ?

2. Tests & results

3. Quick analyze

mardi 17 avril 2012

MultiCore -- Benchmark 01 - Increment shared variable

Description

Proposed solutions

Results

Explanations

jeudi 12 avril 2012

MultiCore -- CRITICAL_SECTION are damn fast !

Welcome !

Qui êtes-vous ?

TEST_TLS_ATOMIC	0.295073 s
TEST_SRWLOCK	2.548695 s
TEST_ATOMIC	2.701041 s
TEST_CRITICAL_SECTION	7.383443 s
TEST_SPINLOCK	18.363033 s