While looking into upgrading the FastMM4 version in Embarcadero Dev-C++, I ran across a fork of FastMM4 called FastMM4-AVX. It uses newer CPU instruction sets (AVX, AVX2, AVX-512 and ERMS) to achieve better results than the original FastMM4. Developer maximmasiutin has added a number of enhancements to the original FastMM4, specifically “AVX support and multi-threaded enhancements (faster locking)”. The project claims that “Under some multi-threading scenarios, the FastMM4-AVX branch is more than twice as fast comparing to the Original FastMM4.” That sounds pretty good. The takeaway about memory managers seems to be that some are better in some scenarios than others, so if you need more speed it could be worth your while to test different ones in your specific application. Take a look at all the items which were added on top of FastMM4.
“What was added to FastMM4-AVX in comparison to the original FastMM4:
- if the CPU supports AVX or AVX2, use the 32-byte YMM registers for faster memory copy, and if the CPU supports AVX-512, use the 64-byte ZMM registers for even faster memory copy; use DisableAVX to turn AVX off completely or use DisableAVX1/DisableAVX2/DisableAVX512 to disable separately certain AVX-related instruction set from being compiled into FastMM4;
- if EnableAVX is defined, all memory blocks are aligned by 32 bytes, but you can also use Align32Bytes define without AVX; please note that the memory overhead is higher when the blocks are aligned by 32 bytes, because some memory is lost by padding;
- with AVX, memory copy is secure – all XMM/YMM/ZMM registers used to copy memory are cleared by vxorps/vpxor, so the leftovers of the copied memory are not exposed in the XMM/YMM/ZMM registers;
- properly handle AVX-SSE transitions to not incur the transition penalties, only call vzeroupper under AVX1, but not under AVX2 since it slows down subsequent SSE code under Skylake / Kaby Lake;
- improved multi-threading locking mechanism – so the testing application (from the FastCode challenge) works up to twice faster when the number of threads is the same or larger than the number of physical cores;
- if the CPU supports Enhanced REP MOVSB/STOSB (ERMS), use this feature for faster memory copy (under 32 bit or 64-bit) (see the EnableERMS define, on by default, use DisableERMS to turn it off);
- proper branch target alignment in assembly routines;
- compare instructions + conditional jump instructions are put together to allow macro-op fusion (which happens since Core2 processors, when the first instruction is a CMP or TEST instruction and the second instruction is a conditional jump instruction);
- names assigned to some constants that used to be “magic constants”, i.e. unnamed numerical constants – plenty of them were present throughout the whole code.
- multiplication and division by constant which is a power of 2 replaced to shl/shr, because Delphi64 compiler doesn’t replace such multiplications and divisions to shl/shr processor instructions, and, according to the Intel Optimization Guide, shl/shr is much faster than imul/idiv, especially on Knights Landing processors;
- the compiler environment is more flexible now: you can now compile FastMM4 with, for example, typed “@” operator or any other option. Almost all externally-set compiler directives are honored by FastMM except a few (currently just one) – look for the “Compiler options for FastMM4” section below to see what options cannot be externally set and are always redefined by FastMM4 for itself – even if you set up these compiler options differently outside FastMM4, they will be silently redefined, and the new values will be used for FastMM4 only;
- those fixed-block-size memory move procedures that are not needed (under the current bitness and alignment combinations) are explicitly excluded from compiling, to not rely on the compiler that is supposed to remove these functions after compilation;
- added length parameter to what were dangerous nul-terminated string operations via PAnsiChar, to prevent potential stack buffer overruns (or maybe even stack-based exploitation?); there are also some Pascal functions left where the argument is not yet checked, see the “todo” comments to figure out where the length is not yet checked. Anyway, since these memory functions are only used in Debug mode, i.e. in development environment, not in Release (production), the impact of this “vulnerability” is minimal (albeit this is a questionable statement);
- removed some typecasts; the code is stricter to let the compiler do the job, check everything and mitigate probable error. You can even compile the code with “integer overflow checking” and “range checking”, as well as with “typed @ operator” – for safer code. Also added round bracket in the places where the typed @ operator was used, to better emphasize on whose address is taken;
- one-byte data types of memory areas used for locking (“lock cmpxchg” or “lock xchg”) replaced from Boolean to Byte for stricter type checking;
- used simpler lock instructions: “lock xchg” rather than “lock cmpxchg”;
- implemented dedicated lock and unlock procedures; before that, locking operations were scattered throughout the code; now the locking functions have meaningful names: AcquireLockByte and ReleaseLockByte; the value of the lock byte is now checked for validity when “FullDebugMode” or “DEBUG” is on, to detect cases when the same lock is released twice, and other improper use of the lock bytes;
- added compile-time options (SmallBlocksLockedCriticalSection/ MediumBlocksLockedCriticalSection/LargeBlocksLockedCriticalSection) that remove spin-loops of Sleep(0) or (Sleep(InitialSleepTime)) and Sleep(1) (or Sleep(AdditionalSleepTime)) and replaced them with EnterCriticalSection/LeaveCriticalSection to save valuable CPU cycles wasted by Sleep(0) and to improve speed that was affected each time by at least 1 millisecond by Sleep(1); on the other hand, the CriticalSections are much more CPU-friendly and have definitely lower latency than Sleep(1); besides that, it checks if the CPU supports SSE2 and thus the “pause” instruction, it uses “pause” spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections; if a CPU doesn’t have the “pause” instruction or Windows doesn’t have the SwitchToThread() API function, it will use EnterCriticalSection/LeaveCriticalSection.
- removed all non-US-ASCII characters, to avoid using UTF-8 BOM, for better compatibility with very early versions of Delphi (e.g. Delphi 5), thanks to Valts Silaputnins.”
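To make the locking items above a bit more concrete, here is a minimal C++ sketch of the general idea: a one-byte lock taken with an unconditional atomic exchange, a bounded “pause” spin-loop, and a yield to the scheduler once the spin budget runs out. This is not FastMM4-AVX's code (that is Pascal and assembly). The names `acquire_lock_byte`, `release_lock_byte`, `kPauseSpinLimit` and `demo_locked_counter` are made up for illustration, and `std::this_thread::yield()` stands in for SwitchToThread(). Only the 5000-iteration figure comes from the list above.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Use the SSE2 "pause" hint when available; otherwise spin without it.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define CPU_PAUSE() _mm_pause()
#else
#define CPU_PAUSE() ((void)0)
#endif

// Hypothetical names, chosen only to echo AcquireLockByte/ReleaseLockByte.
constexpr std::uint8_t kUnlocked = 0;
constexpr std::uint8_t kLocked = 1;
constexpr int kPauseSpinLimit = 5000;  // spin count quoted in the list above

inline void acquire_lock_byte(std::atomic<std::uint8_t>& lock) {
    for (;;) {
        // Unconditional atomic exchange: we own the lock if the old value
        // was "unlocked". On x86 this typically compiles to xchg.
        if (lock.exchange(kLocked, std::memory_order_acquire) == kUnlocked) {
            return;
        }
        // Wait with plain loads plus "pause" until the lock looks free
        // or the spin budget is used up.
        for (int i = 0; i < kPauseSpinLimit; ++i) {
            if (lock.load(std::memory_order_relaxed) == kUnlocked) break;
            CPU_PAUSE();
        }
        // Still busy after spinning: give up the time slice, then retry.
        if (lock.load(std::memory_order_relaxed) != kUnlocked) {
            std::this_thread::yield();
        }
    }
}

// Returns false if the byte was not locked, i.e. a double release.
// This loosely mirrors the debug-mode validity check described above.
inline bool release_lock_byte(std::atomic<std::uint8_t>& lock) {
    return lock.exchange(kUnlocked, std::memory_order_release) == kLocked;
}

// Small demo: several threads bump a plain counter under the lock byte.
inline int demo_locked_counter(int threads, int iterations) {
    std::atomic<std::uint8_t> lock{kUnlocked};
    int counter = 0;  // protected by the lock byte
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                acquire_lock_byte(lock);
                ++counter;
                release_lock_byte(lock);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `demo_locked_counter(4, 10000)` should leave the counter at exactly 40000, since every increment happens while the lock byte is held. Spinning briefly before yielding is a common trade-off: short waits stay cheap, while long waits stop burning CPU.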
There are also some additional benchmarks between FastMM4 and FastMM4-AVX available from Embarcadero MVP Radek Červinka. Check out that blog post.
Interested in trying out FastMM4-AVX in your own project? Head over to GitHub and download the full source code for FastMM4-AVX. Plus check out all of the benchmarks.