![]() Here is what IACA says about the innermost loop: It is somewhat tricky to use with Visual C++ 64-bit compiler: you have to copy-paste the piece of code from assembly listing, surround it with markers (which are a bit wrong in iacaMarks.h by the way), then compile it with ml64.exe, and pass the resulting object file into iaca.exe. It assumes that the code piece is run in the infinite loop, that all branches are well-predicted and not taken, and that all memory accesses hit L1 cache (plus maybe some other assumptions). We can take closer look by using Intel Architecture Code Analyzer: IACA is a small static analysis tool, which analyzes a snippet of code at instruction level. Data access is scalar (point 3): you cannot vectorize it even if several keys are searched simultaneously. ![]() But it is inherently sequental (point 2): you cannot know which element to load on the next step until the element on the current step has been loaded and compared completely. And it really happens (unless your compiler thinks that you compile for 486), according to assembly listing of the innermost lea rax, QWORD PTR cmp DWORD PTR, r8d cmovl rdx, rax sar rcx, 1 test rcx, rcx jg SHORT good is this binary search implementation? It has no branches (point 1 from the above list), which is great. The comparison itself is included in the ternary operator, which is supposed to compile into cmovXX instruction. The best approach for it is described in this blog post from Paul Khuong, which is: make the very first iteration special by using step = n+1 - 2^logstep instead of just 2^logstep, so that regardless of comparison result the next search interval would have length 2^logstep, including either the beginning of array or the end of array (note that two such possible intervals overlap, and this is not a problem). In order to support arbitrary size of input array, some sort of modification is required. This code only works properly when N+1 is power of two. Int binary_search_branchless ( const int * arr, int n, int key ) Here is a branchless implementation that we will use in the comparison later: You can read about it for instance in this demofox's blog post. Of course, not all implementations of binary search are created equal: for small arrays branchless implementation is preferred. The problem is typically solved with binary search in O(log N) time, but it might easily happen so that for small N simple linear algorithm would be faster. And this investigation soon developed into a full-length standalone article. I decided to start investigating a simpler problem first, which is solved by std::lower_bound: given a sorted array of elements and a key, find index of the first array element greater or equal than the key.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |