Abstract
Applications often require a fast, single-threaded search algorithm over sorted data, typical in table-lookup operations. We explore various search algorithms for a large number of search candidates over a relatively small array of logarithmically distributed sorted data. These include an innovative hash-based search that takes advantage of floating point representation to bin data by the exponent. Algorithms that can be optimized to take advantage of simd vector instructions are of particular interest. We then conduct a case study applying our results and analyzing algorithmic performance with the eospac package. eospac is a table lookup library for manipulation and interpolation of SESAME equation-of-state data. Our investigation yields a couple of algorithms with better performance, including a best case 8× speedup over the original eospac Hunt-and-Locate implementation. Our techniques are generalizable to other instances of search algorithms seeking to get a performance boost from vectorization.
1 Introduction
Fast search over a sorted array is a common problem in computer science. Many algorithms developed for this purpose have been optimized for serial processing with a simple (single) processor model.
These algorithms often cannot be compiled to make use of specialized simd instructions (a process called vectorization). For example, the hunt-and-locate algorithm uses each index found as the starting index for the subsequent search iteration. This loop-carried dependency prevents vectorization. In conventional algorithm analysis, algorithms are often judged by their asymptotic behavior. However, in real-life applications, the constants in the scaling behavior and suitability for vectorization matter and are not necessarily addressed by an asymptotic analysis.
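To make the dependency concrete, consider the following minimal sketch (illustrative C, not eospac's implementation, and simplified to hunt forward only): each search begins where the previous one ended, so iteration i depends on iteration i − 1 and the loop cannot be mapped onto simd lanes.

/* Illustrative only: a simplified serial hunt where each search
 * starts from the previous result. The variable 'start' is carried
 * across iterations, which prevents vectorization. */
void dependent_searches(const double *X, int n, const double *Y, int m,
                        int *jlower)
{
    int start = 0;                        /* carried across iterations */
    for (int i = 0; i < m; i++) {
        int j = start;
        while (j + 1 < n && X[j + 1] <= Y[i])
            j++;                          /* hunt forward from last result */
        jlower[i] = j;
        start = j;                        /* feeds the next iteration */
    }
}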
In this article, we conduct a case study of new and existing algorithms adapted for vectorization, which find the index of the largest element of an array less than a given target value. We will implement and test the algorithms within an equation of state library, eospac [1], which is employed by several large physics codes to interpolate equation of state and other material property data. The searched data are density and temperature arrays of a particular SESAME [2] material that is loaded and used by eospac. eospac provides interpolation mechanisms, which involve searching for values nearest to a sequence of targets within a small table. Applications that utilize this routine often pass a much larger number of target values than there are values in the table. They also often use all threads of the host machine, so parallelism through multi-threading is not an option for us. Due to these constraints, and the fact that user applications typically run on simd-enabled architectures, we determined that a more optimized approach was possible. We will investigate how the algorithms’ performance depends on array size, cache usage, portability, and other factors.
2 Background
We first quickly review the applicable historical background of search algorithms. Then, we give a short overview of the application, eospac, that we used in this study.
2.1 Search Algorithms.
Search algorithms are some of the oldest algorithms in computer science, dating back to the early days of digital computers. In this article, we are concerned with searching an array that is already sorted. Knuth gives an overview of searching ordered data in The Art of Computer Programming, Volume 3: Sorting and Searching [3]. We briefly discuss here, in general terms, the search algorithms we compared.
The simplest form of search is a linear search, where every element of the array is compared against the target value. Linear search provides a good lower bound for search efficiency.
Binary (bisection) search is an improvement on linear search that takes advantage of a sorted array. Knuth [3] credits John Mauchly in 1946 for the earliest reference to binary searching [4]. The mid-point of the array is selected, a comparison with the target value is made, and then the search is narrowed by half: either the right or left half of the array is selected for the next iteration. The division of the search space in half at each step results in O(log₂ n) operations for every search, where n is the number of elements in the data array.
The hunt-and-locate [5] search algorithm utilizes the binary search technique in the second of its two steps. In the first step, an exponential jumping pattern is used to bound the target value within the array. It then performs a binary search within these bounds, locating the target. If the search is successful, these bounds are used as a starting point for the next search. If the next target is near or within these bounds, the next search will be faster than a plain binary search.
The skiplist search algorithm [6] further builds on the idea of intelligently partitioning the array to decrease the number of operations needed to locate the target value. The algorithm involves creating a “skiplist,” which is a reduced copy of the array including only every ith element, skipping the others. The skiplist can be thought of as a linear-array version of a search tree, but with potentially better cache performance. When a search is performed, the skiplist is first searched with a binary search, providing the bounds between which to search for the target in the full array. Multiple skiplists can be combined in this way to further improve search times.
Finally, for some arrays, a spatial hash or binning approach can perform very well. The values in the array are separated by a hash function into “bins,” partitioning the array. The same hash function is then used on the target, indicating which bin must be searched. An implementation involves analyzing the array to build an analytic hash function as demonstrated in the study by Robey et al. [7]. Since a hash function does not need a comparison operation, it can approach constant-time performance.
2.2 Domain Problem: Accessing Tabular SESAME Data Via eospac.
The eospac package [1] is a collection of C routines that can be used to access the SESAME data library [2], which contains tabulated data representing various equation-of-state (EOS) properties for a diverse collection of materials spanning more than 50 years of experimentation and modeling.
An equation of state is a formula describing the interconnections between various macroscopically measurable properties of a system. In physics and thermodynamics, an EOS is a constitutive equation describing the state of matter under a given set of physical conditions. It provides a mathematical relationship between two or more of that matter's state functions, such as its temperature, pressure, volume, or internal energy. Equations of state are useful in describing properties of fluids, mixtures of fluids, and solids.
The eospac package is designed to be used by physics codes that may be written in multiple languages and on multiple platforms. eospac has a long history with its Fortran predecessors dating back 40 years.
eospac accesses and loads SESAME data tables and provides interpolation mechanisms for use by a host application.
We chose the eospac software package as an exemplar for our analysis of search algorithms for several reasons:
– eospac provides a mature software and development framework that is dependent on tabular data search.
– Many other software packages are dependent on eospac and require it to be as efficient as possible on various high performance computing systems.
– The SESAME tables are relatively small, and they contain sorted, irregularly spaced data grids that must be efficiently searched to bound arbitrarily spaced and arbitrarily large lists of interpolator input values.
– Large numbers (i.e., millions/billions) of searches typically create an execution bottleneck that may degrade code performance.
Historically, eospac has utilized the Numerical Recipes [5] Hunt-and-Locate algorithm described earlier. Since eospac has been historically a serial software package, the selected search algorithm has provided sufficient performance for supported platforms and host applications. With the emergence of vectorized processors and threading on GPUs, the current hunt-and-locate serial search is no longer appropriate because of the loop-carried dependency.
3 Methodology
In this section, we discuss how we apply concepts from Robey and Zamora [8] to search algorithms with the goal of better taking advantage of vectorization. Each algorithm takes a targets array (Y, of size m) and a data array (X, of size n). They perform a lower-bounds search, returning an array of size m containing, for each element of Y, the index of the last element of X that is less than or equal to it. If X contains no such lower bound for a given y ∈ Y, index 0 is chosen; if y is greater than the largest element of X, index n − 1 is chosen.
The data array (X) is monotonically increasing, approximately logarithmically distributed, and much smaller than the search array (Y). For our tests, m = 5,000,000 and n = 110. The way in which the algorithms are vectorized takes advantage of these data characteristics. By using openmp simd pragmas, we configure the outer loop over Y to be the vectorized loop. This is optimal because the outer loop runs for many more iterations than the inner loop over X. The simd pragmas were added to the openmp 4.0 standard released in 2013 to provide a more portable syntax than the ad hoc directives provided by some vendors.
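For illustration, here is a minimal sketch of this loop structure (hypothetical C, with a plain binary search as the per-target body; the branchless variants appear in the listings below):

/* Sketch: the outer loop over the m targets carries no state between
 * iterations, so the compiler can assign consecutive targets to simd
 * lanes. The short inner loop over X runs inside each lane. */
void search_batch(const double *X, int n, const double *Y, int m,
                  int *jlower)
{
    #pragma omp simd
    for (int i = 0; i < m; i++) {
        int lower = 0, upper = n;
        while (upper - lower > 1) {
            int mid = (lower + upper) / 2;
            if (Y[i] < X[mid]) upper = mid; else lower = mid;
        }
        jlower[i] = lower;
    }
}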
A common optimization we utilized was the removal of branching. When vectorizing, compilers will sometimes fail in the presence of complex branching. We removed branching in some places and left it in others: the compiler sometimes succeeds in vectorizing branch statements by using masking, and we found the resulting masked code to be faster in those cases than removing the branching manually. Where we did remove branching, we used a “branchless choice” inlined function to replace if/else statements:
Branchless choice function: returns vTrue if c is not 0, vFalse otherwise. Arguments must be integers
bchoice(c, vTrue, vFalse)
    return (!!c * vTrue) | (!c * vFalse)
This sort of technique is also used for branchless min and max functions. It only works for integers, as it makes use of a bitwise operation.
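In C, these helpers might look as follows (a sketch; the appendix code may differ). The !! normalizes any nonzero condition to 1, so exactly one operand is selected:

#include <stdint.h>

/* Branchless choice: returns vTrue if c is nonzero, vFalse otherwise. */
static inline int64_t bchoice(int64_t c, int64_t vTrue, int64_t vFalse)
{
    return (!!c * vTrue) | (!c * vFalse);
}

/* Branchless min and max built from the same idea. */
static inline int64_t bmin(int64_t a, int64_t b) { return bchoice(a < b, a, b); }
static inline int64_t bmax(int64_t a, int64_t b) { return bchoice(a > b, a, b); }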
We have made additional, minor optimizations in the C code [A], such as storing the minimum and maximum of the data array in constant variables outside the loop, where needed. These minor optimizations are not included in the following pseudocode for the purpose of clarity and are not as impactful as the major optimizations, such as the removal of branching.
3.1 Hunt and Locate.
Modified Hunt and Locate
hunt_n_locate(X, n, Y, m)
    lowbounds[m]
    for(int i = 0; i < m; i++):
        if Y[i] < X[0]:
            lowbounds[i] = 0
        else if Y[i] > X[n−1]:
            lowbounds[i] = n−1
        else:
            // Hunt: grow the upper bound exponentially
            start = 0
            end = 1
            while end < n && Y[i] > X[end]:
                start = end
                end <<= 1
            // Locate: binary search within bounds
            end = min(end, n−1)
            lowbounds[i] = binary_search(X, start, end, Y[i])
    return lowbounds
We modified Hunt and Locate to take advantage of vectorization by removing the saving of search bounds between iterations of the loop over the target array [Listing 2]. Simplifying the logic in this way, together with removing branching for bounds checking, makes a performance increase possible through vectorization. Also, since the search array is not sorted, there is little advantage to maintaining search bounds between iterations anyway.
3.2 Branchless Binary Search.
The following is a simple binary search where branching has been removed from the search-bound selection and from the bounds checking using the branchless choice function (Listing 1). As we will discuss later, this was done to improve performance when simd is enabled during compilation. However, the algorithm still contains a loop whose length is unknown at compile time, which makes it a nonideal candidate for simd optimization.
Branchless binary search
binary_search(X, n, Y, m)
    lowbounds[m]
    for(int i = 0; i < m; i++):
        lower = 0
        upper = n
        // Search
        while (upper - lower) > 1:
            mid = (lower + upper) / 2
            condition = Y[i] < X[mid]
            upper = bchoice(condition, mid, upper)
            lower = bchoice(!condition, mid, lower)
        // Bounds checking
        lower = bchoice(Y[i] < X[0], 0, lower)
        lower = bchoice(Y[i] > X[n−1], n−1, lower)
        lowbounds[i] = lower
    return lowbounds
3.3 Skiplist Search.
Skiplist Search generates a reduced table (skplst) holding every eighth value of the data array. We chose this stride so that each skiplist entry corresponds to one cache line of the data array (eight doubles per 64-byte line). When searching for a target, a binary search is first performed over the table. The resultant value is then used to bound a subsequent linear search. The only improvement made to Skiplist Search for vectorization was to eliminate branching in the bounds checking.
A skiplist implementation of a binary search
skiplist_search(X, n, Y, m)
    skplst[n/8], lowbounds[m]
    for(i = 0; i < n/8; i++):
        skplst[i] = X[8 * i]
    for(i = 0; i < m; i++):
        j = binary_search(skplst, n/8, Y[i])
        upper = n
        if j < n/8 − 1:
            upper = (j + 1) * 8
        lower = linear_search(X, 8 * j, upper, Y[i])
        // Bounds checking
        lower = bchoice(Y[i] < X[0], 0, lower)
        lower = bchoice(Y[i] > X[n−1], n−1, lower)
        lowbounds[i] = lower
    return lowbounds
Note that the setup cost for Skiplist Search is O(n/8) in time and n + n/8 in storage. This is larger than the setup cost for Binary Search, which requires no setup at all.
3.4 Logarithm Hash Search.
Logarithm Hash Search uses a hash table H, the entries of which are indices into the data array, X. Each hash table entry Hi contains the smallest index j in the set {j | hash(Xj) = i}, where Xj is the jth element of the data array X. To perform a search for Yk in X, the hash of the target value, v = hash(Yk), is used to index into the hash table at Hv. Then, an interior search (linear, in Listing 5) is performed between Hv and Hv+1. The handling of boundary conditions is shown in the bounds-checking step of Listing 5.
If there is no Xj where hash(Xj) = i, then Hi = Hi+1 − 1, or n − 1 for trailing entries above the largest hashed value.
A hash-based search approach
hash(x, base, offset)
    return log(x) / log(base) + offset

log_hash_search(X, n, Y, m)
    lowbounds[m], H[H_size]
    base = pow(10, (log(X[n−1]) - log(X[0])) / H_size)
    offset = -floor(log(X[0]) / log(base))
    for(i = 0; i < H_size; i++):
        H[i] = n−1
    for(i = 0; i < n; i++):
        index = min(hash(X[i], base, offset), H_size−1)
        H[index] = min(H[index], i)
    // Backfill empty bins so that H[i−1] = H[i] − 1
    for(i = H_size−1; i > 0; i--):
        if H[i−1] > H[i]:
            H[i−1] = H[i] − 1
    for(i = 0; i < m; i++):
        j = min(max(hash(Y[i], base, offset), 0), H_size−1)
        start = H[j]
        end = min(n−1, H[min(j + 1, H_size−1)])
        lower = linear_search(X, start, end, Y[i])
        // Bounds checking
        lower = bchoice(Y[i] < X[0], 0, lower)
        lower = bchoice(Y[i] > X[n−1], n−1, lower)
        lowbounds[i] = lower
    return lowbounds
In Listing 5, linear search is used to search between bounds calculated using the hash table. The arguments are as follows: Array, start_index, end_index, searchFor. We tested hash search with both linear and binary search as the interior search algorithm. Linear search generally performs better with vectorization as each lane has equally divided work and a more predictable control flow. However, this is only true when the index distance between hash table entries is small. Choosing a large value for H_size (|H|) typically ensures this if the data are somewhat logarithmically distributed.
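A branch-light C sketch of such an interior linear search (assuming an inclusive end index; not necessarily the appendix implementation) accumulates the comparison results instead of breaking out early, which keeps per-lane work uniform under simd:

/* Returns the index of the last element in [start, end] that is <= target,
 * or start - 1 if there is none (the callers' bounds checks handle that). */
static inline int linear_search(const double *X, int start, int end,
                                double target)
{
    int idx = start;
    for (int j = start; j <= end; j++)
        idx += (X[j] <= target);   /* advance once per element <= target */
    return idx - 1;
}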
3.5 Exponent Hash Search.
To better take advantage of vectorization, we also explored a hash search that extracts the base 2 exponent from the exponent bits of double-precision IEEE 754 floating point numbers. Aside from the hash function itself, Exponent Hash Search is the same as Logarithm Hash Search. The trade-off is between increased vectorized performance and potentially longer linear searches. With Exponent Hash Search, the number of data array accesses increases with the number of data array values that share a base 2 exponent: the more elements that bin into a given power of 2, the longer the search range for a target in that bin. In contrast, the base used for binning in Logarithm Hash Search is unconstrained, so in some cases it can perform better.
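A minimal sketch of such a hash function, assuming IEEE 754 binary64 doubles (exp_hash is an illustrative name, not the eospac implementation):

#include <stdint.h>
#include <string.h>

/* Extracts the biased base-2 exponent from bits 52..62 of the double.
 * Subtracting 'offset' (the exponent of the smallest table value,
 * precomputed at setup) maps it to a bin index. */
static inline int exp_hash(double v, int offset)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);        /* well-defined type pun */
    int e = (int)((bits >> 52) & 0x7FF);   /* biased exponent, 0..2047 */
    return e - offset;
}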
4 Methods
We evaluated the performance and portability of the different search algorithms with and without vectorization on the Darwin computing cluster at Los Alamos National Laboratory [9].
The Darwin system is a heterogeneous cluster designed to be a test bed for new technologies and architectures. In this study, we included the following CPU models: Intel Skylake Gold,3 Intel KNL Xeon Phi,4 and AMD EPYC.5
We used the gcc/9.3.0 and Intel/20.0.2 compilers. The compiler flags (listed in the Appendix) are those recommended by Robey and Zamora [8] to get optimal vectorization while maintaining numerical correctness as close as possible to the IEEE floating point standard [10]. All the tests were run on a single core.
The table used as the data array is the density axis of the SESAME [2] table 301 for material 3720 (distributed with the release of eospac [1]). There are 111 density values ranging from 0 g/cc to 54,000 g/cc. We used a batch size (m) of 5 million and averaged over 20 trials. The target values (Y) were generated randomly from a uniform distribution ranging from 10⁻¹⁰ to 10¹⁰. We also experimented with a Gaussian distribution of target values, but it had no effect on the results.
To study the cache performance of the algorithms, we used likwid [11], which employs hardware counters for performance analysis. In particular, we measured the L2 cache bandwidth, L2 cache miss rate, L3 cache bandwidth, and L3 cache miss rate. The likwid measurements were made on an Intel Skylake Gold node.
5 Results
To demonstrate the performance of our algorithms, we measured their running time with m = 5,000,000 (targets array size) and n = 110 (data array size), averaged over 20 trials, using configurations of both GCC and Intel compilers on several CPU models, with vectorization both enabled and disabled via compiler flags. The shorthand labels for the CPU models used in the performance plots are presented in Table 1. Figure 1 shows the average run time for each configuration of CPU and compiler. GCC was omitted except for a single configuration because no significant performance increases were observed from vectorization. A refinement of the performance comparisons is shown in Fig. 2, where each search algorithm tested is compared to the baseline search algorithm, Hunt and Locate. For each CPU model, the performance of GCC was comparable to unvectorized Intel regardless of whether vectorization was enabled.
5.1 Array Sizes.
For testing, we chose to use a large targets array size, m, relative to the data array size, n, because it is the most relevant use case for table-lookups in general and for our specific implementation in eospac. The hash-based search algorithms perform best with these parameters. Hashtable generation time is O(n), so when m ≫ n, hashtable generation time becomes insignificant.
m ≫ n also has a significant impact on the viability of vectorization. Since the outermost loop (over Y, of length m) must be the vectorized loop, the amount of work in its body can vary by a factor of n. If n is small, there is little penalty due to masking. When n is large, the impact may be disproportionately large, sometimes even preventing vectorization from being viable at all.
The effect of this is shown in Fig. 3. As m decreases toward 10², the performance lead of Exponent Hash Search is lost to Skiplist Search due to the latter's much cheaper setup cost (O(n/8)). The figure uses the Pennycook formula shown in Fig. 4, which measures the performance portability across different configurations of CPU model and compiler.
5.2 Data Array Distribution.
For hashtable generation, performance is primarily affected by the way the data array is distributed. In our tests, the data array roughly follows a logarithmic distribution with a value range of 3 × 10⁻⁶ to 5.4 × 10⁴. This is close to optimal for Exponent Hash Search (Sec. 3.5), which is faster the closer each value in X is to being a unique power of 2. This is because the hash function extracts the power of 2 directly from the exponent bits of each target value. In this case, there are about 34 powers of 2 between the smallest and largest value, causing the average number of indices between hashtable entries to be 110/34 ≈ 3.2. If the data had a much narrower or wider range, Logarithm Hash Search (Sec. 3.4) would perform better, as it adjusts the base of its hash function to fit the range of values in the data array.
Skiplist Search, Binary Search, and Hunt and Locate are less sensitive to the data distribution, although they perform badly in the worst-case scenario where the target element is at either the beginning or end of the data array.
5.3 Search Call Frequency.
The number of search calls has a significant impact on performance for algorithms with a nonzero setup cost. In our experimental setup, only one call to search was made. When using algorithms that incur a setup cost for a table lookup library such as eospac, where many search calls are made, a more involved design decision must be made. For eospac, the arrays that are to be searched over are loaded only once at the beginning of the user program. This allows the search setup to also be performed only once. In situations where there is not a known set of data arrays to be searched over, a different approach must be taken, possibly introducing greater complexity than the search algorithm alone. In such cases, algorithms with zero setup cost may be a better choice.
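One possible shape for such a design is to bundle the precomputed search state with each loaded table (hypothetical names and layout, not the eospac API):

/* Built once when a table is loaded, then reused by every search call. */
#define HASH_SIZE 64

typedef struct {
    const double *X;              /* data array */
    int           n;
    int           offset;         /* exponent of X[0], for the hash */
    int           H[HASH_SIZE];   /* hash table of indices into X */
} SearchState;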
5.4 Vectorization Considerations.
Directing a compiler to effectively vectorize code is challenging. We tried many different combinations of flags for both GCC and Intel. The flags that worked best for performance are documented in the Appendix. While we were unable to get GCC to vectorize the experimental setup, we were successful in getting Intel to vectorize effectively. This is likely due to differences in the maturity of the vectorization implementations between the two compilers.
Vectorization had a significant impact on performance when it was enabled through the Intel compiler, as shown in Fig. 4, with varying results depending on the CPU model. The design of search algorithms must be constrained in particular ways to benefit from vectorization. For the compiler to generate simd instructions, memory access must be linear, and there can be no dependence between loop iterations. A linear control flow is also necessary for simd instructions to be generated. If the code's control flow is not linear, the compiler will compensate by generating masked simd instructions. This forces the computation of all possible branches to be performed, masking out the results of the branches not taken. The performance bottleneck then becomes the slowest branch. The Intel compiler was able to vectorize even when all branching was left in place, but performance suffered due to masking. To address this issue, branching was removed as much as possible so that fewer masked instructions were generated. However, in some cases, this caused a loss in performance when vectorization was disabled due to the loss of branch prediction and other compiler optimizations relating to control flow. The resulting performance due to these effects is shown in Fig. 5. This trade-off must be taken into account when integrating these algorithms into an application.
When vectorization is possible, Logarithm Hash Search (Sec. 3.4) typically outperforms Exponent Hash Search (Sec. 3.5) when the value range in the data array is narrow. When vectorization is not possible, Skiplist Search (Sec. 3.3) becomes a more viable alternative if the data are not distributed logarithmically. In general, we found that Exponent Hash Search performs the best overall when vectorization is possible as the hash function is composed of very cheap operations. Logarithm Hash Search's hash function uses the C library logarithm function, which depends on glibc to be decomposed into simd instructions. In our case, GCC was unable to perform this transformation, but with newer versions of glibc and GCC, it may be possible to do so. The Intel compiler was able to generate simd instructions for logarithm, but they are still more computationally expensive than those of Exponent Hash Search.
The algorithms that responded best to vectorization were Binary Search (Sec. 3.2), the hash-based searches (Secs. 3.5 and 3.4), and Hunt and Locate (Sec. 3.1). These successes are correlated with consistent workload and linear data access pattern between searches. Since the completion of each concurrent batch of operations performed by simd vector instructions is constrained by the longest running operation in the batch, improvements to the operations’ workload consistency will waste fewer computation cycles. A linear data access pattern is important because it leads to better cache performance.
5.5 Search Array Distribution.
The search array for testing consisted of values sampled from a uniform random distribution. Approximately half of the values were outside the bounds of the data array. This best represents the worst case of a typical problem, the best case being a sorted search array with no out-of-bounds values. A sorted search array would be much better suited to algorithms such as the original eospac Hunt and Locate, which uses the result of each search as the starting point for the next. However, this advantage may be superseded by vectorization, which cannot occur if there is a dependency between search iterations.
5.6 Compiler Flags.
We found that the most effective method of vectorization for the Intel compiler was explicitly enabling openmp vectorization through the -qopenmp-simd flag. This enables extra optimizations that are unavailable when auto-vectorizing, as well as extra available configuration (such as array alignment) that is not available through auto-vectorization alone. The flags that work best for auto-vectorization should also be enabled alongside the -qopenmp-simd flag. See Appendix for a list of these flags.
5.7 Cache Performance.
We used likwid [11] to measure L1 Instruction cache TLB miss rate, L2 cache bandwidth, L2 cache miss rate, L3 cache bandwidth, and L3 cache miss rate. These values are useful for determining the cache performance of each algorithm. The following plots, Figs. 6–8, present these results. For these tests, we used a search array size m = 5,000,000.
The cache performance of Linear Search serves as an optimal baseline. Skiplist Search (Sec. 3.3) has good cache performance because the skiplist array is exactly the size of a single cache line. The hash-based searches do not perform as well because of their hashtables’ unpredictable data access patterns. However, Logarithm Hash Search (Sec. 3.4) often performs fewer searches, so it has slightly better cache performance than Exponent Hash Search (Sec. 3.5).
6 Conclusion
Several different search algorithms were analyzed to determine how best to improve the overall performance of search within the eospac software package. Due to implementation constraints within the software that uses eospac, we investigated performance gains from vectorizing search as opposed to using other parallelization techniques like threading or MPI.6 When compared to the original Hunt and Locate algorithm, we saw a speedup of as much as 800% with a search array of size 5,000,000 using the vectorized (e.g., openmp simd) Exponent Hash Search. It was also shown that explicitly combining openmp vectorization and auto-vectorization compiler flags provided further performance enhancements beyond those of auto-vectorization alone by as much as 480%. Even when vectorization is not possible, Exponent Hash Search outperforms Hunt and Locate by more than 100%. The most consistent search algorithm performance improvement, according to the Pennycook metric, comes from the Exponent Hash Search algorithm for search array sizes greater than 1000. The Skiplist Search algorithm demonstrated the best performance below this threshold due to its much smaller setup cost.
While cache utilization is important, it is not the only factor for vectorization and the resulting algorithm performance. We note that the worst performing algorithm (Linear Search) exhibited the best cache performance. Good performance is dependent on a complex set of factors, including the effects of hardware and compilers.
Several search algorithms have been investigated, and the performance improvements of openmp vectorization have been determined to be beneficial as long as vectorization is adequately implemented by the chosen compiler.
Footnotes
3. Linux; 3.10.0-1127.19.1.el7.x86_64; #1 SMP Tue Aug 25 17:23:54 UTC 2020; x86_64; GNU/Linux; avx512f.
4. Linux; 3.10.0-1127.19.1.el7.x86_64; #1 SMP Tue Aug 25 17:23:54 UTC 2020; x86_64; GNU/Linux; avx512pf.
5. Linux; 3.10.0-1127.19.1.el7.x86_64; #1 SMP Tue Aug 25 17:23:54 UTC 2020; x86_64; GNU/Linux; avx2.
6. The Message Passing Interface (MPI) is an open library standard for distributed memory parallelization.
Acknowledgment
This collaboration was made possible thanks to the Parallel Computing Summer Research Internship program, which is run out of the Information Science and Technology Institute, and the Equation of State Project, which is run out of the Advanced Simulation and Computing Program (ASC). Additional technical support and collaboration was also provided by the members of the ASC Beyond Moore's Law Inexact Computing Project at Los Alamos National Laboratory. This work was funded by the U.S. Department of Energy through the Los Alamos National Laboratory. Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of U.S. Department of Energy (Contract No. 89233218CNA000001).
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The data and information that support the findings of this article are freely available at: https://github.com/lanl/eospac/tree/main/simd_search. Data provided by a third party listed in Acknowledgment.
Appendix
C code for the algorithms explored in this paper can be found at https://github.com/lanl/eospac/tree/main/simd_search
Compiler Flags for Vectorization
GCC. -fstrict-aliasing -ftree-vectorize -march=native -mtune=native -fopt-info-vec-all=gcc_optrprt -fopenmp-simd -O3
Intel. -ansi-alias -fp-model:precise -xHOST -vecabi=cmdtarget -qopt-zmm-usage=high -qopt-report=5 -qopt-report-phase=openmp,loop,vec -qopenmp-simd -march=native -O3
Compiler Flags for No Vectorization
GCC. -fno-tree-vectorize -march=native -O3
Intel. -no-vec -qno-openmp-simd -march=native -O3
Preprocessor openmp Pragma
This pragma was used to enable openmp simd functionality on the search loops in the algorithms tested:
#pragma omp simd aligned(x, y, jlower)
jlower refers to the output lower-bounds array.
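For completeness, here is a sketch of how the aligned clause can be backed by genuinely aligned storage (posix_memalign here; 64 bytes covers both a cache line and an avx-512 vector). Note that the pragma's default alignment is implementation-defined; an explicit form would be aligned(x, y, jlower: 64).

#include <stdlib.h>

/* Allocate a 64-byte-aligned array of doubles; returns NULL on failure. */
double *alloc_aligned_doubles(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, count * sizeof(double)) != 0)
        return NULL;
    return p;
}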