knncolle
Collection of KNN methods in C++
Loading...
Searching...
No Matches
knncolle Namespace Reference

Collection of KNN algorithms. More...

Classes

class  BruteforceBuilder
 Perform a brute-force nearest neighbor search. More...
 
class  Builder
 Interface to build nearest-neighbor search indices. More...
 
class  DistanceMetric
 Interface for a distance metric. More...
 
class  EuclideanDistance
 Compute Euclidean distances between two input vectors. More...
 
class  L2NormalizedBuilder
 Wrapper around a builder with L2 normalization. More...
 
class  L2NormalizedMatrix
 Wrapper around a matrix with L2 normalization. More...
 
class  ManhattanDistance
 Compute Manhattan distances between two input vectors. More...
 
class  Matrix
 Interface for matrix data. More...
 
class  MatrixExtractor
 Extractor interface for matrix data. More...
 
class  NeighborQueue
 Helper class to track nearest neighbors. More...
 
class  Prebuilt
 Interface for prebuilt nearest-neighbor search indices. More...
 
class  Searcher
 Interface for searching nearest-neighbor search indices. More...
 
class  SimpleMatrix
 Simple wrapper for an in-memory matrix. More...
 
class  VptreeBuilder
 Perform a nearest neighbor search based on a vantage point (VP) tree. More...
 

Typedefs

template<typename Data_ , typename Distance_ >
using LoadDistanceMetricFunction = std::function<DistanceMetric<Data_, Distance_>* (const std::string&)>
 
template<typename Index_ , typename Distance_ >
using NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > >
 
template<typename Index_ , typename Data_ , typename Distance_ >
using LoadPrebuiltFunction = std::function<Prebuilt<Index_, Data_, Distance_>* (const std::string&)>
 

Functions

template<typename Data_ , typename Distance_ >
std::unordered_map< std::string, LoadDistanceMetricFunction< Data_, Distance_ > > & load_distance_metric_registry ()
 
template<typename Data_ , typename Distance_ >
DistanceMetric< Data_, Distance_ > * load_distance_metric_raw (const std::string &prefix)
 
template<typename Task_ , class Run_ >
void parallelize (int num_workers, Task_ num_tasks, Run_ run_task_range)
 
template<typename Index_ >
int cap_k (int k, Index_ num_observations)
 
template<typename Index_ >
int cap_k_query (int k, Index_ num_observations)
 
template<typename Index_ , typename Data_ , typename Distance_ >
NeighborList< Index_, Distance_ > find_nearest_neighbors (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1)
 
template<typename Index_ , typename Data_ , typename Distance_ >
std::vector< std::vector< Index_ > > find_nearest_neighbors_index_only (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1)
 
template<typename Index_ , typename Data_ , typename Distance_ >
std::unordered_map< std::string, LoadPrebuiltFunction< Index_, Data_, Distance_ > > & load_prebuilt_registry ()
 
template<typename Index_ , typename Data_ , typename Distance_ >
Prebuilt< Index_, Data_, Distance_ > * load_prebuilt_raw (const std::string &prefix)
 
template<typename Index_ , typename Data_ , typename Distance_ >
std::unique_ptr< Prebuilt< Index_, Data_, Distance_ > > load_prebuilt_unique (const std::string &prefix)
 
template<typename Index_ , typename Data_ , typename Distance_ >
std::shared_ptr< Prebuilt< Index_, Data_, Distance_ > > load_prebuilt_shared (const std::string &prefix)
 
template<typename Index_ >
Index_ count_all_neighbors_without_self (Index_ count)
 
template<typename Distance_ , typename Index_ >
void report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances, Index_ self)
 
template<typename Distance_ , typename Index_ >
void report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances)
 
template<typename Input_ , typename Length_ >
void quick_save (const std::string &path, const Input_ *const contents, const Length_ length)
 
template<typename Input_ , typename Length_ >
void quick_load (const std::string &path, Input_ *const contents, const Length_ length)
 

Detailed Description

Collection of KNN algorithms.

Typedef Documentation

◆ LoadDistanceMetricFunction

template<typename Data_ , typename Distance_ >
using knncolle::LoadDistanceMetricFunction = std::function<DistanceMetric<Data_, Distance_>* (const std::string&)>

Distance loading function. This accepts a file path prefix (see DistanceMetric::save()) and returns a pointer to a Distance instance.

Template Parameters
Data_Numeric type for the input data.
Distance_Floating-point type for the output distance.

◆ LoadPrebuiltFunction

template<typename Index_ , typename Data_ , typename Distance_ >
using knncolle::LoadPrebuiltFunction = std::function<Prebuilt<Index_, Data_, Distance_>* (const std::string&)>

Prebuilt loading function. This accepts a file path prefix (see Prebuilt::save()) and returns a pointer to a Prebuilt instance.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the query data.
Distance_Floating point type for the distances.

◆ NeighborList

template<typename Index_ , typename Distance_ >
using knncolle::NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > >

List of nearest neighbors for multiple observations. Each entry corresponds to an observation and contains a nested list (i.e., vector) of its neighbors. Each entry of the nested vector is a pair that contains the identity of the neighbor as an observation index (first) and the distance from the observation to the neighbor (second), sorted by increasing distance.

Template Parameters
Index_Integer type for the indices.
Distance_Floating point type for the distances.

Function Documentation

◆ cap_k()

template<typename Index_ >
int knncolle::cap_k ( int k,
Index_ num_observations )

Cap the number of neighbors to use in Searcher::search() with an index i.

Template Parameters
Index_Integer type for the number of observations.
Parameters
kNumber of nearest neighbors, should be non-negative.
num_observationsNumber of observations in the dataset.
Returns
Capped number of neighbors to search for. This is equal to k if it is less than num_observations; otherwise it is equal to num_observations - 1 if num_observations > 0; otherwise it is equal to zero.

◆ cap_k_query()

template<typename Index_ >
int knncolle::cap_k_query ( int k,
Index_ num_observations )

Cap the number of neighbors to use in Searcher::search() with a pointer query.

Template Parameters
Index_Integer type for the number of observations.
Parameters
kNumber of nearest neighbors, should be non-negative.
num_observationsNumber of observations in the dataset.
Returns
Capped number of neighbors to query. This is equal to the smaller of k and num_observations.

◆ count_all_neighbors_without_self()

template<typename Index_ >
Index_ knncolle::count_all_neighbors_without_self ( Index_ count)

Count the number of neighbors from a range-based search, after removing the observation being searched. This is intended for developer use in implementations of Searcher::search_all(), and protects against pathological situations where an observation fails to be detected as its own neighbor.

Parameters
countNumber of neighbors within range of the specified observation, including the observation itself.
Returns
count - 1 in most cases, otherwise zero.

◆ find_nearest_neighbors()

template<typename Index_ , typename Data_ , typename Distance_ >
NeighborList< Index_, Distance_ > knncolle::find_nearest_neighbors ( const Prebuilt< Index_, Data_, Distance_ > & index,
int k,
int num_threads = 1 )

Find the nearest neighbors within a pre-built index. This is a convenient wrapper around Searcher::search that saves the caller the trouble of writing a loop.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the input data.
Distance_Floating point type for the distances.
Parameters
indexA Prebuilt index.
kNumber of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function.
num_threadsNumber of threads to use. The parallelization scheme is defined by parallelize().
Returns
A NeighborList of length equal to the number of observations in index. Each entry contains (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself.

◆ find_nearest_neighbors_index_only()

template<typename Index_ , typename Data_ , typename Distance_ >
std::vector< std::vector< Index_ > > knncolle::find_nearest_neighbors_index_only ( const Prebuilt< Index_, Data_, Distance_ > & index,
int k,
int num_threads = 1 )

Find the nearest neighbors within a pre-built search index. Here, only the neighbor indices are returned, not the distances.

Template Parameters
Index_Integer type for the indices.
Data_Numeric type for the input and query data.
Distance_Floating point type for the distances.
Parameters
indexA Prebuilt index.
kNumber of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function.
num_threadsNumber of threads to use. The parallelization scheme is defined by parallelize().
Returns
A vector of vectors of length equal to the number of observations in index. Each vector contains the indices of (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself.

◆ load_distance_metric_raw()

template<typename Data_ , typename Distance_ >
DistanceMetric< Data_, Distance_ > * knncolle::load_distance_metric_raw ( const std::string & prefix)

Load a distance metric from disk into a Distance object.

Template Parameters
Data_Numeric type for the input data.
Distance_Floating-point type for the output distance.
Parameters
prefixFile path prefix for a distance index that was saved to disk by DistanceMetric::save().
Returns
Pointer to a Distance instance, created from the files at prefix.

◆ load_distance_metric_registry()

template<typename Data_ , typename Distance_ >
std::unordered_map< std::string, LoadDistanceMetricFunction< Data_, Distance_ > > & knncolle::load_distance_metric_registry ( )
inline
Template Parameters
Data_Numeric type for the input data.
Distance_Floating-point type for the output distance.
Returns
Reference to a global map of method names (see DistanceMetric::save()) to loading functions.

◆ load_prebuilt_raw()

template<typename Index_ , typename Data_ , typename Distance_ >
Prebuilt< Index_, Data_, Distance_ > * knncolle::load_prebuilt_raw ( const std::string & prefix)

Load a neighbor search index from disk into a Prebuilt object.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the query data.
Distance_Floating point type for the distances.
Parameters
prefixFile path prefix for a prebuilt index that was saved to disk by Prebuilt::save().
Returns
Pointer to a Prebuilt instance, created from the files at prefix.

◆ load_prebuilt_registry()

template<typename Index_ , typename Data_ , typename Distance_ >
std::unordered_map< std::string, LoadPrebuiltFunction< Index_, Data_, Distance_ > > & knncolle::load_prebuilt_registry ( )
inline
Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the query data.
Distance_Floating point type for the distances.
Returns
Reference to a global map of method names (see Prebuilt::save()) to loading functions.

◆ load_prebuilt_shared()

template<typename Index_ , typename Data_ , typename Distance_ >
std::shared_ptr< Prebuilt< Index_, Data_, Distance_ > > knncolle::load_prebuilt_shared ( const std::string & prefix)

Load a neighbor search index from disk into a Prebuilt object. This calls load_prebuilt_raw() internally.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the query data.
Distance_Floating point type for the distances.
Parameters
prefixFile path prefix for a prebuilt index that was saved to disk by Prebuilt::save().
Returns
Shared pointer to a Prebuilt instance, created from the files at prefix.

◆ load_prebuilt_unique()

template<typename Index_ , typename Data_ , typename Distance_ >
std::unique_ptr< Prebuilt< Index_, Data_, Distance_ > > knncolle::load_prebuilt_unique ( const std::string & prefix)

Load a neighbor search index from disk into a Prebuilt object. This calls load_prebuilt_raw() internally.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the query data.
Distance_Floating point type for the distances.
Parameters
prefixFile path prefix for a prebuilt index that was saved to disk by Prebuilt::save().
Returns
Unique pointer to a Prebuilt instance, created from the files at prefix.

◆ parallelize()

template<typename Task_ , class Run_ >
void knncolle::parallelize ( int num_workers,
Task_ num_tasks,
Run_ run_task_range )
Template Parameters
Task_Integer type for the number of tasks.
Run_Function to execute a range of tasks.
Parameters
num_workersNumber of workers.
num_tasksNumber of tasks.
run_task_rangeFunction to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the KNNCOLLE_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().

◆ quick_load()

template<typename Input_ , typename Length_ >
void knncolle::quick_load ( const std::string & path,
Input_ *const contents,
const Length_ length )

Read an array from a binary file at path. This is intended for developer use in load_prebuilt() functions.

Template Parameters
Input_Type of values to be read.
Length_Integer type of the length of the array.
Parameters
pathFile path to read from.
contentsPointer to an array in which to store the contents of path. This should have at least length addressable elements.
lengthNumber of elements (not bytes) to be read from path. This should be non-negative.

◆ quick_save()

template<typename Input_ , typename Length_ >
void knncolle::quick_save ( const std::string & path,
const Input_ *const contents,
const Length_ length )

Saves an array to a binary file at path. This is intended for developer use in implementations of Prebuilt::save().

Template Parameters
Input_Type of values to be saved.
Length_Integer type of the length of the array.
Parameters
pathFile path to save to. Any directories in the path should already exist.
contentsPointer to an array of contents to be saved.
lengthLength of the array, in terms of the number of elements (not the number of bytes). This should be non-negative.

◆ report_all_neighbors() [1/2]

template<typename Distance_ , typename Index_ >
void knncolle::report_all_neighbors ( std::vector< std::pair< Distance_, Index_ > > & all_neighbors,
std::vector< Index_ > * output_indices,
std::vector< Distance_ > * output_distances )

Report the indices and distances of all neighbors in range of an observation of interest. It is assumed that the observation of interest is not detected as its own neighbor, presumably as it does not exist in the original input dataset.

Template Parameters
Distance_Floating point type for the distances.
Index_Integer type for the observation indices.
Parameters
[in]all_neighborsVector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output.
[out]output_indicesPointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported.
[out]output_distancesPointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors.

◆ report_all_neighbors() [2/2]

template<typename Distance_ , typename Index_ >
void knncolle::report_all_neighbors ( std::vector< std::pair< Distance_, Index_ > > & all_neighbors,
std::vector< Index_ > * output_indices,
std::vector< Distance_ > * output_distances,
Index_ self )

Report the indices and distances of all neighbors in range of an observation of interest. If the observation of interest is detected as its own neighbor, it will be removed from the output vectors.

Template Parameters
Distance_Floating point type for the distances.
Index_Integer type for the observation indices.
Parameters
[in]all_neighborsVector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output.
[out]output_indicesPointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported.
[out]output_distancesPointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors.
selfIndex of the observation of interest, i.e., for which neighbors are to be identified. If present in all_neighbors, this will be removed from output_indices and output_distances.