Monday, April 23, 2012

MultiCore - Benchmarks - MemoryAllocation

When writing multi-threaded code, you face a number of challenges.


One of them is Memory Management.


0. A little story.



You need to protect your shared variables.
So you lock around them. With no deadlocks, because you are a hero.
Easy peasy.



Then you need better performance, because locking can kill you. Well, at least your perf.
So you compare different locking solutions. [cf previous post]
And you use the best available, according to your context.
Cool.



But you still need to share variables. And that costs a lot.
So you avoid shared variables as much as you can. You rethink your algorithms.
OK.



But you STILL need to share variables, at least some of them.
So you try atomic operations and lockfree code.
Great.

Not to mention waitfree code.
Duh!

And you try the nifty lockfree list, among others.
Yay.



But then you need to allocate/free nodes from multiple threads, and memory management in a multithreaded environment can kill your perf.
So you investigate a little more, and try different solutions, in different contexts, and get some numbers.
You love numbers.


Here are some of those tests...




1. Memory allocation methods


We will compare 5 different solutions.


a. Description of the solutions


Solution 1 : the default new / delete

  • That's it.

Solution 2 : a lockfree memory pool - the holy grail ?

  • An array of data blocks.
  • An array of integers to index the blocks : 0 is free, 1 is used.
  • When we search for a free block index, we start searching after the last used index - round robin like.
  • Atomic operations are used to modify the indices, with loops to retry if the state of the index has changed before we could complete the transaction - typical lockfree stuff. See the sketch below.
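
Here is a minimal sketch of that allocation path, using C++11 atomics. The class name, block size and pool size are illustrative choices for this post, not the actual benchmark code :

#include <atomic>
#include <cstddef>

// Sketch of the lockfree pool (Solution 2), using C++11 atomics.
class LockFreeMemoryPool
{
    static const size_t BLOCK_SIZE  = 12;
    static const size_t BLOCK_COUNT = 1024;

    char                m_blocks[BLOCK_COUNT][BLOCK_SIZE];
    std::atomic<int>    m_state[BLOCK_COUNT];  // 0 is free, 1 is used
    std::atomic<size_t> m_hint;                // where the next search starts

public:
    LockFreeMemoryPool() : m_hint(0)
    {
        for (size_t i = 0; i < BLOCK_COUNT; ++i)
            m_state[i].store(0, std::memory_order_relaxed);
    }

    void* Allocate()
    {
        const size_t start = m_hint.load(std::memory_order_relaxed);
        for (size_t n = 0; n < BLOCK_COUNT; ++n)  // round robin scan
        {
            const size_t i = (start + n) % BLOCK_COUNT;
            int expected = 0;
            // Try to flip the index from free to used atomically; if another
            // thread changed it first, move on to the next index and retry.
            if (m_state[i].compare_exchange_strong(expected, 1))
            {
                m_hint.store(i + 1, std::memory_order_relaxed);
                return m_blocks[i];
            }
        }
        return 0;  // pool exhausted
    }

    void Free(void* p)
    {
        const size_t i = (static_cast<char*>(p) - m_blocks[0]) / BLOCK_SIZE;
        m_state[i].store(0, std::memory_order_release);
    }
};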

Solution 3 : locking a naive memory pool - to compare with the lockfree one

  • An array of data blocks.
  • An array of integers to index the blocks : 0 is free, 1 is used.
  • When we search for a free block index, we start searching after the last used index - round robin like.
  • We lock the array of indices when we access it (allocate/free) - see the sketch below.
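
For comparison, a sketch of the naive locking variant : the same round-robin scan, but the whole indices array is protected by a single lock (std::mutex here; any of the locks from the previous post would do). Again, names and sizes are illustrative :

#include <mutex>
#include <cstddef>

// Sketch of the naive locking pool (Solution 3) : every state change
// works on the first try, at the cost of taking a lock.
class NaiveMemoryPool
{
    static const size_t BLOCK_SIZE  = 12;
    static const size_t BLOCK_COUNT = 1024;

    char       m_blocks[BLOCK_COUNT][BLOCK_SIZE];
    int        m_state[BLOCK_COUNT];  // 0 is free, 1 is used
    size_t     m_hint;
    std::mutex m_lock;

public:
    NaiveMemoryPool() : m_hint(0)
    {
        for (size_t i = 0; i < BLOCK_COUNT; ++i)
            m_state[i] = 0;
    }

    void* Allocate()
    {
        std::lock_guard<std::mutex> guard(m_lock);  // lock the whole indices array
        for (size_t n = 0; n < BLOCK_COUNT; ++n)
        {
            const size_t i = (m_hint + n) % BLOCK_COUNT;
            if (m_state[i] == 0)
            {
                m_state[i] = 1;
                m_hint = i + 1;
                return m_blocks[i];
            }
        }
        return 0;  // pool exhausted
    }

    void Free(void* p)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        m_state[(static_cast<char*>(p) - m_blocks[0]) / BLOCK_SIZE] = 0;
    }
};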

Solution 4 : locking a memory pool - lower footprint, faster access

  • An array of data blocks.
  • An array of bitfields to index the blocks : a bitfield is a 32-bit integer that indexes 32 blocks; a bit at 0 means "used", a bit at 1 means "free".
  • When we search for a free block index bit, we start searching after the last fully used bitfield index - round robin like.
  • We lock the array of bitfields when we access it (allocate/free) - see the sketch below.
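
A sketch of the bitfield variant. Note the inverted convention (a set bit means "free"), which lets us test 32 blocks against zero with a single integer comparison. Names and sizes are illustrative, and the search hint is simplified :

#include <mutex>
#include <cstddef>
#include <stdint.h>

// Sketch of the 1-bit-per-block pool (Solution 4) : one uint32_t
// indexes 32 blocks, a set bit means "free".
class BitfieldMemoryPool
{
    static const size_t BLOCK_SIZE     = 12;
    static const size_t BLOCK_COUNT    = 1024;
    static const size_t BITFIELD_COUNT = BLOCK_COUNT / 32;

    char       m_blocks[BLOCK_COUNT][BLOCK_SIZE];
    uint32_t   m_bits[BITFIELD_COUNT];  // 1 bit per block, 1 means free
    size_t     m_hint;                  // first bitfield that may have a free bit
    std::mutex m_lock;

public:
    BitfieldMemoryPool() : m_hint(0)
    {
        for (size_t i = 0; i < BITFIELD_COUNT; ++i)
            m_bits[i] = 0xFFFFFFFFu;  // everything free
    }

    void* Allocate()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        for (size_t n = 0; n < BITFIELD_COUNT; ++n)  // 32 blocks tested per iteration
        {
            const size_t f = (m_hint + n) % BITFIELD_COUNT;
            if (m_bits[f] != 0)  // at least one free block in this bitfield
            {
                uint32_t bit = 0;
                while (((m_bits[f] >> bit) & 1) == 0)  // lowest set bit
                    ++bit;
                m_bits[f] &= ~(1u << bit);  // mark it used
                m_hint = f;
                return m_blocks[f * 32 + bit];
            }
        }
        return 0;  // pool exhausted
    }

    void Free(void* p)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        const size_t i = (static_cast<char*>(p) - m_blocks[0]) / BLOCK_SIZE;
        m_bits[i / 32] |= 1u << (i % 32);  // mark it free
    }
};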

Solution 5 : locking a constant access time memory pool - fastest access ?

  • An array of data blocks.
  • A hierarchy of bitfields, stored in a flat array : each bitfield at a given level of the hierarchy indicates the state of its corresponding lower level; 0 means "used", 1 means "free or partially free".
  • The lowest level works just like in solution 4.
  • We lock the hierarchy of bitfields when we access it (allocate/free) - see the sketch below.
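
The lookup itself can be sketched like this. The bit propagation up the hierarchy after an allocate/free is omitted, and I use one array per level for clarity where the actual pool stores the hierarchy flat; all names are illustrative :

#include <cstddef>
#include <stdint.h>

// Sketch of the hierarchical lookup (Solution 5). levels[0] is the root
// (a single bitfield), levels[depth-1] is the per-block level from
// Solution 4; a set bit means "free or partially free".
size_t FindFreeBlock(uint32_t* const* levels, size_t depth)
{
    size_t node = 0;  // index of the current bitfield at the current level
    for (size_t level = 0; level < depth; ++level)
    {
        const uint32_t bits = levels[level][node];
        if (bits == 0)
            return size_t(-1);  // fully used (can only happen at the root)
        uint32_t bit = 0;
        while (((bits >> bit) & 1) == 0)  // lowest "free" bit
            ++bit;
        node = node * 32 + bit;  // descend into the matching child
    }
    return node;  // block index, reached in one constant-time pass
}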



b. Pros (+) and cons (-) of these solutions


Under our lens : locking, memory footprint, searching for a free index, fragmentation.


Solution 1 : the default new / delete

  • ... is here as a reference.

Solution 2 : lockfree memory pool

  • locking : (+) has no locks
    (-) but retries can be expensive 
  • memory footprint : (-) a 32-bit integer per index is required for the atomic operations (where 1 bit would be enough to encode the free/used state)
  • searching a free index : (+) it can be fast if the "next" indices are free
    (-) but it can/will be slow otherwise, since we will have frequent cache misses while iterating over 32-bit integers
    • especially if the pool is almost full; worst case : we go through all the indices
  • fragmentation : (-) it will fragment the pool, unless the allocations all have the same lifespan
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 3 : locking a naive memory pool - to compare to the lockfree one

Just like Solution 2, except that the lockfree retries are replaced by a raw "lock the whole indices array" that works on the first try.
  • locking : (+) no retries !
    (-) has locks
  • memory footprint : (-) a 32-bit integer per index (where 1 bit would be enough to encode the free/used state of 1 block)
  • searching a free index : (+) it can be fast if the "next" indices are free
    (-) but it can/will be slow otherwise, since we iterate over the 32-bit indices, which triggers frequent cache misses
    • especially if the pool is almost full; worst case : we go through all the indices
  • fragmentation : (-) it will fragment the pool, unless the allocations all have the same lifespan
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 4 : locking a memory pool - lower foot print, faster access

Just like Solution 3, except that we use 1 bit to encode the "free"/"used" state, leading to a minimal footprint & fewer cache misses.
  • locking : (+) no retries !
    (-) has locks
  • memory footprint : (+) 1 bit per index is optimal
  • searching a free index : (+) we search through 32 indices at a time by looking at a 32-bit integer (bitfield)
    (+) it can be fast if the "next" indices are free
    (-) but it can/will be slow otherwise, since we iterate over the 32-bit bitfields, which triggers some cache misses
    • especially if the pool is almost full; worst case : we go through all the indices
  • fragmentation : (-) it will fragment the pool, unless the allocations all have the same lifespan
    • in this best-case scenario, the pool would behave like a nicely balanced producer-consumer queue

Solution 5 : locking a constant access time memory pool - fastest access ?

  • locking : (+) no retries !
    (-) has locks
  • memory footprint : (+) slightly more than 1 bit per index, which is still great
  • searching a free index :
    (+) always in constant time, whatever the state of the pool : we always go through the whole hierarchy of bitfields exactly once.
    (-) the bigger the pool, the more levels in the hierarchy
    (-) constant time is great for the general case, but it can still be slower than Solution 4 when the pool behaves like a nicely balanced producer-consumer queue.
  • fragmentation :
    (+) it keeps pool fragmentation to a minimum, by always reusing the first free index





2. Tests & results



System w/ 8 CPU cores, running Win7 64-bit
Nb threads : 8
Nb allocations / frees per thread : 1 000 000
Allocated data block size : 12 bytes.

We will run the tests in 2 different "memory contexts" :

Context (F) : a clean & fresh "memory environment" :
  • no allocations have been made prior to the test
Context (P) : a polluted context :
  • for memory pools, we mark half of the indices as 'used', interleaving a 'used' slot with a 'free' slot along the whole pool.
  • for the default allocator, we allocate half of the targeted max allocation count.
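
As a rough idea of how the two multi-threaded patterns below can be driven, here is a sketch assuming an Allocator type with Allocate()/Free() like the pools above. The pool must of course be sized to hold all live blocks in Test 2; this harness is illustrative, not the actual benchmark code :

#include <thread>
#include <vector>
#include <functional>

// Sketch of the two multi-threaded test patterns. NB_THREADS and
// NB_ALLOCS match the setup above.
const int NB_THREADS = 8;
const int NB_ALLOCS  = 1000000;

template <class Allocator>
void TestShortLife(Allocator& a)  // Test 1 : allocate and free immediately
{
    for (int i = 0; i < NB_ALLOCS; ++i)
    {
        void* p = a.Allocate();
        a.Free(p);
    }
}

template <class Allocator>
void TestLongLife(Allocator& a)  // Test 2 : N allocations, then N frees
{
    std::vector<void*> ptrs(NB_ALLOCS);
    for (int i = 0; i < NB_ALLOCS; ++i)
        ptrs[i] = a.Allocate();
    for (int i = 0; i < NB_ALLOCS; ++i)
        a.Free(ptrs[i]);
}

template <class Allocator>
void RunTest(Allocator& a, void (*test)(Allocator&))
{
    std::vector<std::thread> threads;
    for (int t = 0; t < NB_THREADS; ++t)
        threads.push_back(std::thread(test, std::ref(a)));
    for (size_t t = 0; t < threads.size(); ++t)
        threads[t].join();
}

// e.g. : LockFreeMemoryPool pool; RunTest(pool, &TestShortLife<LockFreeMemoryPool>);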



Test 1 : short-life allocations - each thread : N times (allocate and free immediately)


Solution                             Total time (s)   Time per task (s)
(F) S2 - lockfree                    1.117134         0.000000
(P) S2 - lockfree                    1.500297         0.000000
(P) S3 - lock naive pool             3.002592         0.000000
(F) S3 - lock naive pool             3.050532         0.000000
(F) S4 - lock 1bit pool              3.080593         0.000000
(P) S4 - lock 1bit pool              3.126127         0.000000
(P) S5 - lock constant access pool   4.081311         0.000001
(F) S5 - lock constant access pool   4.191293         0.000001
(F) S1 - default                     8.033740         0.000001
(P) S1 - default                     14.308695        0.000002







Test 2 : long-life allocations - each thread : N times (allocate) then N times (free)


Solution                             Total time (s)   Time per task (s)
(F) S2 - lockfree                    1.158107         0.000000
(P) S2 - lockfree                    1.430778         0.000000
(P) S3 - lock naive pool             3.120570         0.000000
(F) S3 - lock naive pool             3.155554         0.000000
(F) S4 - lock 1bit pool              3.376785         0.000000
(P) S4 - lock 1bit pool              3.382694         0.000000
(F) S5 - lock constant access pool   4.130811         0.000001
(P) S5 - lock constant access pool   5.545604         0.000001
(F) S1 - default                     249.586365       0.000036
(P) S1 - default                     263.046783       0.000038






Without the default allocator : [chart omitted]




Test 3 : like Test 2 but on a single thread - no locking
- just because we might still have single-threaded contexts here and there !


Solution                             Total time (s)   Time per task (s)
(F) S3 - naive pool                  0.027042         0.000000
(F) S4 - 1bit pool                   0.030131         0.000000
(P) S3 - naive pool                  0.030818         0.000000
(P) S4 - 1bit pool                   0.031371         0.000000
(F) S5 - constant access pool        0.070410         0.000000
(P) S5 - constant access pool        0.070472         0.000000
(F) S2 - lockfree                    0.079383         0.000000
(P) S2 - lockfree                    0.096421         0.000000
(F) S1 - default                     64.339546        0.000009
(P) S1 - default                     67.744881        0.000010









Without the default allocator : [chart omitted]






3. Quick analysis




In the multi-threaded tests, a few facts stand out :

  1. Interestingly, whatever the number of allocations tested, the results are pretty stable.
  2. The LockFreeMemoryPool outperformed every other solution, in all situations.
    And that's a great win \o/
  3. The default alloc/free is always the slowest, by far - and that's normal : it serves a general purpose, so you can't expect it to compete with specialized solutions like the various pool allocators.
    But it also cannot handle huge amounts of allocations properly : it totally sank with the 8 times 1 000 000 operations in the polluted environment.
  4. The HBFIMemoryPool does indeed deliver constant allocation times, but it's always the worst pool solution in these tests.
    ... I guess the implementation could be optimized.


Moreover, the NaiveMemoryPool did pretty well - usually better than the (1bit) MemoryPool.
Since the MemoryPool and the HBFIMemoryPool both do bit manipulation, I suspect that's the bottleneck.


Of course, had I used another type of pollution, such as 'the first half of the pool is already used, and we start searching for free indices at the beginning', we could have seen different results.
In such a case, there would be many more cache misses for some of the memory pools, and maybe the HBFI would have been a little more competitive.




When looking at the single-threaded test, we clearly see the cost of the atomic operations involved in the LockFreeMemoryPool, especially compared to its unsafe equivalent, the NaiveMemoryPool.
I wish the LockFreeMemoryPool had outperformed every other solution, in all contexts, even when running alone on a single thread.
But that's not how it works.




It really proves, if that was still unclear to some of you, how important context is when choosing the right tool.
Always make a decision with the context in mind.
Always.
Period.


PS : I would like to try a TLS-based solution some day... (who said hazard pointers ?)
PPS : Damn, this is a long post >.<

