25/09/2020

When you’re done, join the discussion on Hacker News.

When youre done, join the discussion on Hacker News.
Lets say I ask you to fill a char array of size n with zeros. I dont know why, exactly, but please play along for now.
If this were C, we would probably reach for memset, but lets pretend we are trying to write idiomatic C++ instead.
You might come up with something like1:
voidfill1(char*p,size_tn){std::fill(p,p+n,0);}
Id give this solution full marks. In fact, Id call it more or less the canonical modern C++ solution to the problem.
What if told you there was a solution that was up to about 29 times faster? It doesnt even requiring sacrificing any goats to the C++ gods, either: just adding three characters:
voidfill2(char*p,size_tn){std::fill(p,p+n,’\0′);}
Yes, switching 0 to ‘\0’ speeds this up by nearly a factor of thirty on my SKL box2, at least with my default compiler (gcc) and optimization level (-O2):

Function Bytes / Cycle
fill1 1.0
fill2 29.1

The why is becomes obvious if you look at the assembly:
fill1:
fill1(char*,unsignedlong):addrsi,rdicmprsi,rdije.L1.L3:movBYTEPTR[rdi],0; store 0 into memory at [rdi] addrdi,1; increment rdicmprsi,rdi; compare rdi to size jne.L3; keep going if rdi < size.L1:ret
This version is using a byte-by-byte copy loop, which Ive annotated it is more or less a 1:1 translation of how youd imagine std::fill is written. The result of 1 cycle per byte is exactly what wed expect using speed limit analysis: it is simultaneously limited by two different bottlenecks: 1 taken branch per cycle, and 1 store per cycle.
The fill2 version doesnt have a loop at all:
fill2:
fill2(char*,unsignedlong):testrsi,rsijne.L8; skip the memcpy call if size == 0ret.L8:movrdx,rsixoresi,esijmpmemset; tailcall to memset
Rather, it simply defers immediately to memset. We arent going to dig into the assembly for memset here, but the fastest possible memset would run at 32 bytes/cycle, limited by 1 store/cycle and maximum vector the width of 32 bytes on my machine, so the measured value of 29 bytes/cycle indicates its using an implementation something along those lines.
So thats the why, but whats the why of the why (second order why)?
I thought this had something to do with the optimizer. After all, at -O3 even the fill1 version using the plain 0 constant calls memset.
I was wrong, however. The answer actually lies in the implementation of the C++ standard library (there are various, gcc is using libstdc++ in this case). Lets take a look at the implementation of std::fill (Ive reformatted the code for clarity and removed some compile-time concept checks):
/*
* …
*
* This function fills a range with copies of the same value. For char
* types filling contiguous areas of memory, this becomes an inline call
* to @c memset or @c wmemset.
*/template<typename_ForwardIterator,typename_Tp>inlinevoidfill(_ForwardIterator__first,_ForwardIterator__last,const_Tp&__value){std::__fill_a(std::__niter_base(__first),std::__niter_base(__last),__value);}
The included part of the comment3 already hints at what is to come: the implementor of std::fill has apparently considered specifically optimizing the call to a memset in some scenarios. So we keep following the trail, which brings us to the helper method std::__fill_a. There are two overloads that are relevant here, the general method and an overload which handles the special case:
template<typename_ForwardIterator,typename_Tp>inlinetypename__gnu_cxx::__enable_if<!__is_scalar<_Tp>::__value,void>::__type__fill_a(_ForwardIterator__first,_ForwardIterator__last,const_Tp&__value){for(;__first!=__last;++__first)*__first=__value;}// Specialization: for char types we can use memset.template<typename_Tp>inlinetypename__gnu_cxx::__enable_if<__is_byte<_Tp>::__value,void>::__type__fill_a(_Tp*__first,_Tp*__last,const_Tp&__c){const_Tp__tmp=__c;if(constsize_t__len=__last-__first)__builtin_memset(__first,static_cast<unsignedchar>(__tmp),__len);}
Now we see how the memset appears. It is called explicitly by the second implementation shown above, selected by enable_if when the SFINAE condition __is_byte<_Tp> is true. Note, however, that unlike the general function, this variant has a single template argument: template<typename _Tp>, and the function signature is:
__fill_a(_Tp*__first,_Tp*__last,const_Tp&__c)
Hence, it will only be considered when the __first and __last pointers which delimit the range have the exact same type as the value being filled. When when you write std::fill(p, p + n, 0) where p is char *, you rely on template type deduction for the parameters, which ends up deducing char * and int for the iterator type and value-to-fill type, because 0 is an integer constant.
That is, it is if you had written:
std::fill<char*,int>(p,p+n,0);
This prevents the clever memset optimization from taking place: the overload that does it is never called because the iterator value type is different than the value-to-fill type.
This suggests a fix: we can simply force the template argument types rather than rely on type deduction:
voidfill3(char*p,size_tn){std::fill<char*,char>(p,p+n,0);}
This way, we get the memset version.
Finally, why does fill2 using ‘\0’ get the fast version, without forcing the template arguments? Well ‘\0’ is a char constant, so the value-to-assign type is char. You could achieve the same effect with a cast, e.g., static_cast<char>(0) and for buffers which have types like unsigned char this is necessary because ‘\0’ does not have the same type as unsigned char (at least on gcc).
One might reasonably ask if this could be fixed in the standard library. I think so.
One idea would be to keying off of only the type of the first and last pointers, like this:
template<typename_Tp,typename_Tvalue>inlinetypename__gnu_cxx::__enable_if<__is_byte<_Tp>::__value,void>::__type__fill_a(_Tp*__first,_Tp*__last,const_Tvalue&__c){const_Tvalue__tmp=__c;if(constsize_t__len=__last-__first)__builtin_memset(__first,static_cast<unsignedchar>(__tmp),__len);}
This says: who cares about the type of the value, it is going to get converted during assignment to the value type of the pointer anyways, so just look at the pointer type. E.g., if the type of the value-to-assign _Tvalue is int, but _Tp is char then this expands to this version, which is totally equivalent:
__fill_a(char*__first,char*__last,constint&__c){constint__tmp=__c;if(constsize_t__len=__last-__first)__builtin_memset(__first,static_cast<unsignedchar>(__tmp),__len);}
This works for simple types like int. Where it fails is if the value to fill has a tricky, non-primitive type, like this:
struct conv_counting_int {
int v_;
mutable size_t count_ = 0;
operator char() const {
count_++;
return (char)v_;
}
};
size_t fill5(char *p, size_t n) {
conv_counting_int zero{0};
std::fill(p, p + n, zero);
return zero.count_;
}
Here, the pointer type passed to std::fill is char, but you cannot safely apply the memset optimization above, since the conv_counting_int counts the number of types it is converted to char, and this value will be wrong (in particular, it will be 1, not n) if you perform the above optimization.
This can be fixed. You could limit the optimization to the case where the pointer type is char-like and the value-to-assign type is simple in the sense that it wont notice how many times it has been converted. A sufficient check would be that the type is scalar, i.e. std::is_scalar<T> although there is probably a less conservative check possible. So something like this for the SNIFAE check:
template<typename_Tpointer,typename_Tp>inlinetypename__gnu_cxx::__enable_if<__is_byte<_Tpointer>::__value&&__is_scalar<_Tp>::__value,void>::__type__fill_a(_Tpointer*__first,_Tpointer*__last,const_Tp&__value){…
Heres an example of how that would work. Its not fully fleshed out but shows the idea.
Finally, one might ask why memsetis used when gcc is run at -O3 or when clang is used (like this). The answer is the optimizer. Even if the compile-time semantics of the language select what appears to a byte-by-byte copy loop, the compiler itself can transform that into memset, or something else like a vectorized loop, if it can prove it is as-if equivalent. That recognition happens at -O3 for gcc but at -O2 for clang.
What Does It Mean
So what does it all mean? Is there a moral to this story?
Some use this as evidence that the somehow C++ and/or the STL are irreparably broken. I dont agree. Some other languages, even fast ones, will never give you the memset speed, although many will – but many of those that do (e.g., java.util.Arrays.fill()) do it via special recognition or handling of the function by the compiler or runtime. In the C++ standard library, the optimization the library writers have done is available to anyone, which is a big advantage. That the optimization fails, perhaps unexpectedly, in some cases is unfortunate but its nice that you can fix it yourself.
Also, C++ gets two shots at this one: many other languages rely on the compiler to optimize these patterns, and this also occurs in C++. Its just a bit of a quirk of gcc that optimization doesnt help here: it doesnt vectorize at -O2, nor does it do idiom recogniztion. Both of those result in much faster code: weve seen the effect of idiom recognition already: it results in a memset. Even if idiom recognition wasnt enabled or didnt work, vectorization would help a lot: heres [gcc at -O3], but with idiom recognition disabled. It uses 32-byte stores (vmovdqu YMMWORD PTR [rax], ymm0) which will be close to memcpy speed. In many other languages it would only be up to the compiler: there wouldnt be a chance to get memset even with no optimization as there is in C++.
Do we throw out modern C++ idioms, at least where performance matters, for example by replacing std::fill with memset? I dont think so. It is far from clear where memset can even be used safely in C++. Unlike say memcpy and trivially copyable, there is no type trait for memset is equivalent to zeroing. Its probably OK for byte-like types, and is widely used for other primitive types (which we can be sure are trivial, but cant always be sure of the representation), but even that may not be safe. Once you introduced even simple structures or classes, the footguns multiply. I recommend std::fill and more generally sticking to modern idioms, except in very rare cases where profiling has identified a hotspot, and even then you should take the safest approach that still provides the performance you need (e.g., by passing (char)0 in this case).
Source
The source for my benchmark is available on GitHub.
Thanks
Thanks to Matt Godbolt for creating Godbolt, without which this type of investigation would be much more painful and often wouldnt happen at all.
tc for finding a typo.
Discuss
I am still working on my comments system (no, I dont want Disqus), but in the meantime you can discuss this post on Hacker News.
If you liked this post, check out the homepage for others you might enjoy4.