knncolle
Collection of KNN methods in C++
Loading...
Searching...
No Matches
knncolle Namespace Reference

Collection of KNN algorithms. More...

Classes

class  BruteforceBuilder
 Perform a brute-force nearest neighbor search. More...
 
class  BruteforcePrebuilt
 Index for a brute-force nearest neighbor search. More...
 
class  BruteforceSearcher
 Brute-force nearest neighbor searcher. More...
 
class  Builder
 Interface to build nearest-neighbor search indices. More...
 
class  DistanceMetric
 Interface for a distance metric. More...
 
class  EuclideanDistance
 Compute Euclidean distances between two input vectors. More...
 
class  L2NormalizedBuilder
 Wrapper around a builder with L2 normalization. More...
 
class  L2NormalizedMatrix
 Wrapper around a matrix with L2 normalization. More...
 
class  L2NormalizedMatrixExtractor
 Extractor for the L2NormalizedMatrix. More...
 
class  L2NormalizedPrebuilt
 Wrapper around a prebuilt index with L2 normalization. More...
 
class  L2NormalizedSearcher
 Wrapper around a search interface with L2 normalization. More...
 
class  ManhattanDistance
 Compute Manhattan distances between two input vectors. More...
 
class  Matrix
 Interface for matrix data. More...
 
class  MatrixExtractor
 Extractor interface for matrix data. More...
 
class  NeighborQueue
 Helper class to track nearest neighbors. More...
 
class  Prebuilt
 Interface for prebuilt nearest-neighbor search indices. More...
 
class  Searcher
 Interface for searching nearest-neighbor search indices. More...
 
class  SimpleMatrix
 Simple wrapper for an in-memory matrix. More...
 
class  SimpleMatrixExtractor
 Extractor for a SimpleMatrix. More...
 
class  VptreeBuilder
 Perform a nearest neighbor search based on a vantage point (VP) tree. More...
 
class  VptreePrebuilt
 Index for a VP-tree search. More...
 
class  VptreeSearcher
 VP-tree searcher. More...
 

Typedefs

template<typename Index_ , typename Distance_ >
using NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > >
 

Functions

template<typename Task_ , class Run_ >
void parallelize (int num_workers, Task_ num_tasks, Run_ run_task_range)
 
template<typename Index_ >
int cap_k (int k, Index_ num_observations)
 
template<typename Index_ , typename Data_ , typename Distance_ >
NeighborList< Index_, Distance_ > find_nearest_neighbors (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1)
 
template<typename Index_ , typename Data_ , typename Distance_ >
std::vector< std::vector< Index_ > > find_nearest_neighbors_index_only (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1)
 
template<typename Index_ >
Index_ count_all_neighbors_without_self (Index_ count)
 
template<typename Distance_ , typename Index_ >
void report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances, Index_ self)
 
template<typename Distance_ , typename Index_ >
void report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances)
 

Detailed Description

Collection of KNN algorithms.

Typedef Documentation

◆ NeighborList

template<typename Index_ , typename Distance_ >
using knncolle::NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > >

List of nearest neighbors for multiple observations. Each entry corresponds to an observation and contains a nested list (i.e., vector) of its neighbors. Each entry of the nested vector is a pair that contains the identity of the neighbor as an observation index (first) and the distance from the observation to the neighbor (second), sorted by increasing distance.

Template Parameters
Index_Integer type for the indices.
Distance_Floating point type for the distances.

Function Documentation

◆ cap_k()

template<typename Index_ >
int knncolle::cap_k ( int k,
Index_ num_observations )

Cap the number of neighbors to use in Searcher::search() with an index i.

Template Parameters
Index_Integer type for the number of observations.
Parameters
kNumber of nearest neighbors, should be non-negative.
num_observationsNumber of observations in the dataset.
Returns
Capped number of neighbors to search for. This is equal to k if it is less than num_observations; otherwise it is equal to num_observations - 1 if num_observations > 0; otherwise it is equal to zero.

◆ count_all_neighbors_without_self()

template<typename Index_ >
Index_ knncolle::count_all_neighbors_without_self ( Index_ count)

Count the number of neighbors from a range-based search, after removing the observation being searched. This is intended for developer use in implementations of Searcher::search_all(), and protects against pathological situations where an observation fails to be detected as its own neighbor.

Parameters
countNumber of neighbors within range of the specified observation, including the observation itself.
Returns
count - 1 in most cases, otherwise zero.

◆ find_nearest_neighbors()

template<typename Index_ , typename Data_ , typename Distance_ >
NeighborList< Index_, Distance_ > knncolle::find_nearest_neighbors ( const Prebuilt< Index_, Data_, Distance_ > & index,
int k,
int num_threads = 1 )

Find the nearest neighbors within a pre-built index. This is a convenient wrapper around Searcher::search that saves the caller the trouble of writing a loop.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the input data.
Distance_Floating point type for the distances.
Parameters
indexA Prebuilt index.
kNumber of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function.
num_threadsNumber of threads to use. The parallelization scheme is defined by parallelize().
Returns
A NeighborList of length equal to the number of observations in index. Each entry contains (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself.

◆ find_nearest_neighbors_index_only()

template<typename Index_ , typename Data_ , typename Distance_ >
std::vector< std::vector< Index_ > > knncolle::find_nearest_neighbors_index_only ( const Prebuilt< Index_, Data_, Distance_ > & index,
int k,
int num_threads = 1 )

Find the nearest neighbors within a pre-built search index. Here, only the neighbor indices are returned, not the distances.

Template Parameters
Index_Integer type for the indices.
Data_Numeric type for the input and query data.
Distance_Floating point type for the distances.
Parameters
indexA Prebuilt index.
kNumber of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function.
num_threadsNumber of threads to use. The parallelization scheme is defined by parallelize().
Returns
A vector of vectors of length equal to the number of observations in index. Each vector contains the indices of (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself.

◆ parallelize()

template<typename Task_ , class Run_ >
void knncolle::parallelize ( int num_workers,
Task_ num_tasks,
Run_ run_task_range )
Template Parameters
Task_Integer type for the number of tasks.
Run_Function to execute a range of tasks.
Parameters
num_workersNumber of workers.
num_tasksNumber of tasks.
run_task_rangeFunction to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the KNNCOLLE_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().

◆ report_all_neighbors() [1/2]

template<typename Distance_ , typename Index_ >
void knncolle::report_all_neighbors ( std::vector< std::pair< Distance_, Index_ > > & all_neighbors,
std::vector< Index_ > * output_indices,
std::vector< Distance_ > * output_distances )

Report the indices and distances of all neighbors in range of an observation of interest. It is assumed that the observation of interest is not detected as its own neighbor, presumably as it does not exist in the original input dataset.

Template Parameters
Distance_Floating point type for the distances.
Index_Integer type for the observation indices.
Parameters
[in]all_neighborsVector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output.
[out]output_indicesPointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported.
[out]output_distancesPointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors.

◆ report_all_neighbors() [2/2]

template<typename Distance_ , typename Index_ >
void knncolle::report_all_neighbors ( std::vector< std::pair< Distance_, Index_ > > & all_neighbors,
std::vector< Index_ > * output_indices,
std::vector< Distance_ > * output_distances,
Index_ self )

Report the indices and distances of all neighbors in range of an observation of interest. If the observation of interest is detected as its own neighbor, it will be removed from the output vectors.

Template Parameters
Distance_Floating point type for the distances.
Index_Integer type for the observation indices.
Parameters
[in]all_neighborsVector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output.
[out]output_indicesPointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported.
[out]output_distancesPointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors.
selfIndex of the observation of interest, i.e., for which neighbors are to be identified. If present in all_neighbors, this will be removed from output_indices and output_distances.