knncolle
Collection of KNN methods in C++
knncolle is a header-only C++ library that collects a variety of different k-nearest neighbor algorithms under a consistent interface. The aim is to enable downstream libraries to easily switch between different methods with a single runtime flag, or by just swapping out the relevant constructors at compile time.
The core library implements various interfaces along with the following methods:
Additional libraries extend the knncolle framework to more algorithms:
Most of the code in this library is derived from the BiocNeighbors R package.
Given a matrix with dimensions in the rows and observations in the columns, we can do:
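For instance, a typical session might look like the sketch below. This assumes a v3-style API where the data are wrapped in a `SimpleMatrix` and the `VptreeBuilder` accepts a pointer to a distance metric; the `SimpleMatrix` and `EuclideanDistance` names, the `build_unique()` method and the template parameter order are assumptions that may differ between knncolle versions.

```cpp
#include "knncolle/knncolle.hpp"
#include <memory>
#include <vector>

// Wrap a column-major array in a lightweight matrix view, with
// 'ndim' dimensions (rows) and 'nobs' observations (columns).
knncolle::SimpleMatrix<int, double> mat(ndim, nobs, data.data());

// Build a VP-tree index with Euclidean distances.
knncolle::VptreeBuilder<int, double, double> builder(
    std::make_shared<knncolle::EuclideanDistance<double, double> >()
);
auto index = builder.build_unique(mat);

// Find the 10 nearest neighbors of every observation.
auto results = knncolle::find_nearest_neighbors(*index, 10);
```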
Check out the reference documentation for more details.
We can perform the search manually by constructing a `Searcher` instance and looping over the elements of interest. Continuing with the same variables defined in the previous section, we could replace the `find_nearest_neighbors()` call with:
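Something like the following sketch, assuming that `Prebuilt::initialize()` yields a `Searcher` and that `Searcher::search()` accepts pointers to output vectors (both assumptions on the exact signatures):

```cpp
auto searcher = index->initialize();
std::vector<int> output_indices;
std::vector<double> output_distances;

for (int o = 0; o < nobs; ++o) {
    // Search for the 10 nearest neighbors of observation 'o'.
    searcher->search(o, 10, &output_indices, &output_distances);
    // ... process the results for 'o' before the next iteration.
}
```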
Similarly, we can query the prebuilt index for the neighbors of an arbitrary vector. The code below searches for the nearest 5 neighbors to a query vector at the origin:
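Again as a sketch, assuming an overload of `Searcher::search()` that accepts a pointer to the query coordinates:

```cpp
std::vector<double> query(ndim); // zero-initialized, i.e., the origin.
searcher->search(query.data(), 5, &output_indices, &output_distances);
```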
To parallelize the loop, we just need to construct a separate `Searcher` (and the result vectors) for each thread. This is already implemented in `find_nearest_neighbors()` but is also easy to do by hand, e.g., with OpenMP:
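A sketch of the hand-rolled version, under the same assumed `search()` signature as before:

```cpp
#pragma omp parallel num_threads(4)
{
    // Each thread gets its own Searcher and result vectors.
    auto per_thread_searcher = index->initialize();
    std::vector<int> per_thread_indices;
    std::vector<double> per_thread_distances;

    #pragma omp for
    for (int o = 0; o < nobs; ++o) {
        per_thread_searcher->search(o, 10, &per_thread_indices, &per_thread_distances);
        // ... process the results for 'o' within this thread.
    }
}
```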
Either (or both) of `indices` and `distances` may be `NULL`, in which case the corresponding values are not reported. This allows implementations to skip the extraction of distances when only the identities of the neighbors are of interest.
A related problem involves finding all neighbors within a certain distance of an observation. This can be achieved using the `Searcher::search_all()` method:
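For example (a sketch; the argument order of `search_all()` is an assumption):

```cpp
std::vector<int> all_indices;
std::vector<double> all_distances;

if (searcher->can_search_all()) {
    // Find all observations within a distance of 10 from observation 0.
    searcher->search_all(0, 10.0, &all_indices, &all_distances);
}
```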
This method is optional, so developers of `Searcher` subclasses may choose not to implement it. Applications should check `Searcher::can_search_all()` before attempting a call, as shown above; otherwise, the default method will raise an exception.
All KNN search algorithms implement the `Builder`, `Prebuilt` and `Searcher` interfaces via inheritance. This means that users can swap algorithms at run-time:
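For example, something like the sketch below could choose an algorithm from a run-time string. The `BruteforceBuilder` name and the `Builder` template parameters are assumptions here:

```cpp
auto dist = std::make_shared<knncolle::EuclideanDistance<double, double> >();

std::unique_ptr<knncolle::Builder<int, double, double> > builder_ptr;
if (algorithm == "brute-force") {
    builder_ptr.reset(new knncolle::BruteforceBuilder<int, double, double>(dist));
} else {
    builder_ptr.reset(new knncolle::VptreeBuilder<int, double, double>(dist));
}

// Downstream code is agnostic to the chosen algorithm.
auto index = builder_ptr->build_unique(mat);
auto results = knncolle::find_nearest_neighbors(*index, 10);
```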
Similarly, for algorithms that accept a `DistanceMetric`, we can switch between distances at run-time:
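For instance (a sketch; the `ManhattanDistance` name and the `DistanceMetric` template parameters are assumptions):

```cpp
std::shared_ptr<knncolle::DistanceMetric<double, double> > dist;
if (metric == "manhattan") {
    dist.reset(new knncolle::ManhattanDistance<double, double>);
} else {
    dist.reset(new knncolle::EuclideanDistance<double, double>);
}

knncolle::VptreeBuilder<int, double, double> builder(std::move(dist));
```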
We can even switch between input matrix representations at run-time, as long as they follow the `Matrix` interface. This allows the various `Builder` classes to accept input data in other formats (e.g., sparse, file-backed). For example, knncolle implements the `L2NormalizedMatrix` subclass to apply on-the-fly L2 normalization to each observation's vector of coordinates. This is used inside the `L2NormalizedBuilder` class to transform an existing neighbor search method from Euclidean to cosine distances.
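As a rough sketch of this polymorphism, assuming that `build_unique()` accepts any reference satisfying the `Matrix` interface (the `SimpleMatrix` name is again illustrative):

```cpp
// Any implementation of the Matrix interface can be passed to a Builder;
// a sparse or file-backed class could be substituted here instead.
std::unique_ptr<knncolle::Matrix<int, double> > mat_ptr(
    new knncolle::SimpleMatrix<int, double>(ndim, nobs, data.data())
);
auto index = builder.build_unique(*mat_ptr);
```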
Check out the reference documentation for more details on these interfaces.
Each interface has a few template parameters to define its types. In general, we recommend using `int`s for the observation indices and `double`s for the data and distances. If precision is not a concern, we can achieve greater speed by swapping `double`s with `float`s. We may also need to swap `int` with `size_t` for larger datasets, e.g., more than 2 billion observations.
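For example, a single-precision builder might look like this, under the same assumed template parameter order as in the earlier sketches:

```cpp
// Trade some numerical precision for speed with single-precision floats.
knncolle::VptreeBuilder<int, float, float> fbuilder(
    std::make_shared<knncolle::EuclideanDistance<float, float> >()
);
```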
Advanced users can set up the templates to bypass virtual dispatch at the cost of more compile-time complexity. For example, we could parametrize the `VptreeBuilder` so that it is hard-coded to use Euclidean distances and to only accept column-major in-memory matrices. This gives the compiler an opportunity to devirtualize the relevant method calls for a potential performance improvement.
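A sketch of what this might look like; the trailing template parameters for the matrix and metric classes are assumptions and may not match the actual parametrization:

```cpp
// Fix the input representation and the distance metric at compile time,
// so calls through these types no longer need virtual dispatch.
knncolle::VptreeBuilder<
    int,                                         // observation index type
    double,                                      // input data type
    double,                                      // output distance type
    knncolle::SimpleMatrix<int, double>,         // column-major in-memory inputs only
    knncolle::EuclideanDistance<double, double>  // hard-coded Euclidean metric
> devirtualized_builder(
    std::make_shared<knncolle::EuclideanDistance<double, double> >()
);
```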
CMake with FetchContent
If you're using CMake, you just need to add something like this to your `CMakeLists.txt`:
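Something along these lines, with `GIT_TAG` pinned to a specific release rather than a branch:

```cmake
include(FetchContent)

FetchContent_Declare(
  knncolle
  GIT_REPOSITORY https://github.com/knncolle/knncolle
  GIT_TAG master # replace with a pinned release tag
)

FetchContent_MakeAvailable(knncolle)
```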
Then you can link to knncolle to make the headers available during compilation:
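For example, assuming the exported target is named `knncolle::knncolle`:

```cmake
# For executables:
target_link_libraries(myexe knncolle::knncolle)

# For libraries:
target_link_libraries(mylib INTERFACE knncolle::knncolle)
```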
By default, this will use `FetchContent` to fetch all external dependencies. Applications are advised to pin the versions of each dependency for stability; see `extern/CMakeLists.txt` for suggested versions. If you want to install dependencies manually, set `-DKNNCOLLE_FETCH_EXTERN=OFF` in the CMake configuration.
CMake with find_package()
If knncolle is already installed on the system, it can be discovered via:
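For example (the `knncolle_knncolle` package name follows the usual org_repo convention but is an assumption):

```cmake
find_package(knncolle_knncolle CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE knncolle::knncolle)
```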
To install the library, use:
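A typical out-of-source build might look like this; the `-DKNNCOLLE_TESTS=OFF` option for skipping the test suite is an assumption:

```sh
mkdir build && cd build
cmake .. -DKNNCOLLE_TESTS=OFF
cmake --build . --target install
```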
Again, this will automatically acquire all of its dependencies; see the recommendations above.
If you're not using CMake, the simple approach is to just copy the files in `include/` (either directly or with Git submodules) and include their path during compilation with, e.g., GCC's `-I`. This will also require the external dependencies listed in `extern/CMakeLists.txt`.
References

Hanov S (2011). VP trees: a data structure for finding stuff fast. http://stevehanov.ca/blog/index.php?id=130

Yianilos PN (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 311-321.