|
knncolle
Collection of KNN methods in C++
|
Collection of KNN algorithms. More...
Classes | |
| class | BruteforceBuilder |
| Perform a brute-force nearest neighbor search. More... | |
| class | Builder |
| Interface to build nearest-neighbor search indices. More... | |
| class | DistanceMetric |
| Interface for a distance metric. More... | |
| class | EuclideanDistance |
| Compute Euclidean distances between two input vectors. More... | |
| class | L2NormalizedBuilder |
| Wrapper around a builder with L2 normalization. More... | |
| class | L2NormalizedMatrix |
| Wrapper around a matrix with L2 normalization. More... | |
| struct | L2NormalizedPrebuiltTypes |
| Template type of a saved L2-normalized index. More... | |
| class | LoadDistanceMetricNotFoundError |
Exception for unknown distance metrics in load_distance_metric_raw(). More... | |
| class | LoadPrebuiltNotFoundError |
Exception for unknown search algorithms in load_prebuilt_raw(). More... | |
| class | ManhattanDistance |
| Compute Manhattan distances between two input vectors. More... | |
| class | Matrix |
| Interface for matrix data. More... | |
| class | MatrixExtractor |
| Extractor interface for matrix data. More... | |
| class | NeighborQueue |
| Helper class to track nearest neighbors. More... | |
| class | Prebuilt |
| Interface for prebuilt nearest-neighbor search indices. More... | |
| class | Searcher |
| Interface for searching nearest-neighbor search indices. More... | |
| class | SimpleMatrix |
| Simple wrapper for an in-memory matrix. More... | |
| class | VptreeBuilder |
| Perform a nearest neighbor search based on a vantage point (VP) tree. More... | |
| struct | VptreeOptions |
Options for VptreeBuilder construction. More... | |
Typedefs | |
| template<typename Data_ , typename Distance_ > | |
| using | LoadDistanceMetricFunction = std::function<DistanceMetric<Data_, Distance_>* (const std::filesystem::path&)> |
| template<typename Index_ , typename Distance_ > | |
| using | NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > > |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| using | LoadPrebuiltFunction = std::function<Prebuilt<Index_, Data_, Distance_>* (const std::filesystem::path&)> |
Enumerations | |
| enum class | NumericType : char { UINT8_T , INT8_T , UINT16_T , INT16_T , UINT32_T , INT32_T , UINT64_T , INT64_T , UNSIGNED_CHAR , SIGNED_CHAR , CHAR , UNSIGNED_SHORT , SHORT , UNSIGNED_INT , INT , UNSIGNED_LONG , LONG , UNSIGNED_LONG_LONG , LONG_LONG , SIZE_T , PTRDIFF_T , FLOAT , DOUBLE , UNKNOWN } |
Functions | |
| template<typename Data_ , typename Distance_ > | |
| std::unordered_map< std::string, LoadDistanceMetricFunction< Data_, Distance_ > > & | load_distance_metric_registry () |
| template<typename Data_ , typename Distance_ > | |
| void | register_load_euclidean_distance () |
| template<typename Data_ , typename Distance_ > | |
| void | register_load_manhattan_distance () |
| template<typename Data_ , typename Distance_ > | |
| DistanceMetric< Data_, Distance_ > * | load_distance_metric_raw (const std::filesystem::path &dir) |
| template<typename Task_ , class Run_ > | |
| void | parallelize (int num_workers, Task_ num_tasks, Run_ run_task_range) |
| template<typename Index_ > | |
| int | cap_k (int k, Index_ num_observations) |
| template<typename Index_ > | |
| int | cap_k_query (int k, Index_ num_observations) |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| NeighborList< Index_, Distance_ > | find_nearest_neighbors (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1) |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| std::vector< std::vector< Index_ > > | find_nearest_neighbors_index_only (const Prebuilt< Index_, Data_, Distance_ > &index, int k, int num_threads=1) |
| template<typename Normalized_ > | |
| std::function< void(const std::filesystem::path &)> & | custom_save_for_l2normalized_normalized () |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| std::unordered_map< std::string, LoadPrebuiltFunction< Index_, Data_, Distance_ > > & | load_prebuilt_registry () |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| Prebuilt< Index_, Data_, Distance_ > * | load_prebuilt_raw (const std::filesystem::path &dir) |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| std::unique_ptr< Prebuilt< Index_, Data_, Distance_ > > | load_prebuilt_unique (const std::filesystem::path &dir) |
| template<typename Index_ , typename Data_ , typename Distance_ > | |
| std::shared_ptr< Prebuilt< Index_, Data_, Distance_ > > | load_prebuilt_shared (const std::filesystem::path &dir) |
| L2NormalizedPrebuiltTypes | load_l2normalized_prebuilt_types (const std::filesystem::path &dir) |
| template<typename Index_ , typename Data_ , typename Distance_ , typename Normalized_ > | |
| Prebuilt< Index_, Data_, Distance_ > * | load_l2normalized_prebuilt (const std::filesystem::path &dir) |
| template<typename Index_ , typename Data_ , typename Distance_ , class DistanceMetric_ = DistanceMetric<Data_, Distance_>> | |
| Prebuilt< Index_, Data_, Distance_ > * | load_bruteforce_prebuilt (const std::filesystem::path &dir) |
| template<typename Index_ , typename Data_ , typename Distance_ , class DistanceMetric_ = DistanceMetric<Data_, Distance_>> | |
| Prebuilt< Index_, Data_, Distance_ > * | load_vptree_prebuilt (const std::filesystem::path &dir) |
| template<typename Type_ > | |
| NumericType | get_numeric_type () |
| template<typename Index_ > | |
| Index_ | count_all_neighbors_without_self (Index_ count) |
| template<typename Distance_ , typename Index_ > | |
| void | report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances, Index_ self) |
| template<typename Distance_ , typename Index_ > | |
| void | report_all_neighbors (std::vector< std::pair< Distance_, Index_ > > &all_neighbors, std::vector< Index_ > *output_indices, std::vector< Distance_ > *output_distances) |
| template<typename Input_ , typename Length_ > | |
| void | quick_save (const std::filesystem::path &path, const Input_ *const contents, const Length_ length) |
| template<typename Input_ , typename Length_ > | |
| void | quick_load (const std::filesystem::path &path, Input_ *const contents, const Length_ length) |
| std::string | quick_load_as_string (const std::filesystem::path &path) |
Collection of KNN algorithms.
| using knncolle::LoadDistanceMetricFunction = std::function<DistanceMetric<Data_, Distance_>* (const std::filesystem::path&)> |
Distance loading function. This accepts a path to a directory (see DistanceMetric::save()) and returns a pointer to a DistanceMetric instance.
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the output distance, usually floating-point. |
| using knncolle::LoadPrebuiltFunction = std::function<Prebuilt<Index_, Data_, Distance_>* (const std::filesystem::path&)> |
Prebuilt loading function. This accepts a directory (see Prebuilt::save()) and returns a pointer to a Prebuilt instance.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| using knncolle::NeighborList = std::vector<std::vector<std::pair<Index_, Distance_> > > |
List of nearest neighbors for multiple observations. Each entry corresponds to an observation and contains a nested list (i.e., vector) of its neighbors. Each entry of the nested vector is a pair that contains the identity of the neighbor as an observation index (first) and the distance from the observation to the neighbor (second), sorted by increasing distance.
| Index_ | Integer type for the indices. |
| Distance_ | Numeric type for the distances, usually floating-point. |
|
strong |
Standard numeric types, typically returned by get_numeric_type().
| int knncolle::cap_k | ( | int | k, |
| Index_ | num_observations ) |
Cap the number of neighbors to use in Searcher::search() with an index i.
| Index_ | Integer type for the number of observations. |
| k | Number of nearest neighbors, should be non-negative. |
| num_observations | Number of observations in the dataset. |
k if it is less than num_observations; otherwise it is equal to num_observations - 1 if num_observations > 0; otherwise it is equal to zero. | int knncolle::cap_k_query | ( | int | k, |
| Index_ | num_observations ) |
Cap the number of neighbors to use in Searcher::search() with a pointer query.
| Index_ | Integer type for the number of observations. |
| k | Number of nearest neighbors, should be non-negative. |
| num_observations | Number of observations in the dataset. |
k and num_observations. | Index_ knncolle::count_all_neighbors_without_self | ( | Index_ | count | ) |
Count the number of neighbors from a range-based search, after removing the observation being searched. This is intended for developer use in implementations of Searcher::search_all(), and protects against pathological situations where an observation fails to be detected as its own neighbor.
| count | Number of neighbors within range of the specified observation, including the observation itself. |
count - 1 in most cases, otherwise zero. | std::function< void(const std::filesystem::path &)> & knncolle::custom_save_for_l2normalized_normalized | ( | ) |
Define a global function to preserve Normalized_ type information when saving a prebuilt L2-normalized index in Prebuilt::save(). Users should define their own function here to handle an Normalized_ type that is unknown to get_numeric_type(). The action of setting/unsetting the global function is not thread-safe and should be done in a serial section.
The sole argument of the global function is the same dir provided to Prebuilt::save(). If a global function is provided, it is generally expected to write information about Normalized_ to files inside dir. It is recommended that the names of these files should not start with upper-case letters to avoid conflicts with files generated by save().
| Normalized_ | Floating-point type for the L2-normalized data. |
Normalized_. By default, no global function is defined. If set, the global function will be called by the Prebuilt::save() method for the L2-normalized Prebuilt subclass. | NeighborList< Index_, Distance_ > knncolle::find_nearest_neighbors | ( | const Prebuilt< Index_, Data_, Distance_ > & | index, |
| int | k, | ||
| int | num_threads = 1 ) |
Find the nearest neighbors within a pre-built index. This is a convenient wrapper around Searcher::search that saves the caller the trouble of writing a loop.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| index | A Prebuilt index. |
| k | Number of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function. |
| num_threads | Number of threads to use. The parallelization scheme is defined by parallelize(). |
NeighborList of length equal to the number of observations in index. Each entry contains (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself. | std::vector< std::vector< Index_ > > knncolle::find_nearest_neighbors_index_only | ( | const Prebuilt< Index_, Data_, Distance_ > & | index, |
| int | k, | ||
| int | num_threads = 1 ) |
Find the nearest neighbors within a pre-built search index. Here, only the neighbor indices are returned, not the distances.
| Index_ | Integer type for the indices. |
| Data_ | Numeric type for the input and query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| index | A Prebuilt index. |
| k | Number of nearest neighbors. This should be non-negative. Explicitly calling cap_k() is not necessary as this is done automatically inside this function. |
| num_threads | Number of threads to use. The parallelization scheme is defined by parallelize(). |
index. Each vector contains the indices of (up to) the k nearest neighbors for each observation, sorted by increasing distance. The i-th entry is guaranteed to not contain i itself. | NumericType knncolle::get_numeric_type | ( | ) |
| Type_ | Some integer or floating-point type. |
This function is intended for developers writing their own Prebuilt::save() methods, where a subclass may have additional template types beyond those required by the Prebuilt template. In such cases, developers can convert the type into a NumericType that can be saved to disk. The corresponding loader function can then read this type information to accurately reconstitute the original Prebuilt object.
| Prebuilt< Index_, Data_, Distance_ > * knncolle::load_bruteforce_prebuilt | ( | const std::filesystem::path & | dir | ) |
Load an brute-force search index (i.e., a Prebuilt created by BruteforceBuilder) from its on-disk representation. This is not provided in the registry by default as its depends on the application's choice of DistanceMetric_.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Floating-point type for the distances. |
| DistanceMetric_ | Class implementing the distance metric calculation. This should satisfy the DistanceMetric interface. |
| dir | Path to a directory in which a prebuilt brute-force index was saved. Files should have been generated by the Prebuilt::save() method of the brute-force Prebuilt subclass instance. |
Prebuilt instance. | DistanceMetric< Data_, Distance_ > * knncolle::load_distance_metric_raw | ( | const std::filesystem::path & | dir | ) |
Load a distance metric from disk into a Distance object.
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the output distance, usually floating-point. |
| dir | Path to a directory containing a distance metric that was saved to disk by DistanceMetric::save(). |
Distance instance, created from the files at dir. If no loading function is available for the saved distance, a LoadDistanceMetricNotFoundError is thrown.
|
inline |
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the output distance, usually floating-point. |
DistanceMetric::save()) and the values are distance loading functions.No loading functions are available when the global map is first initialized. Users should call register_load_euclidean_distance() and/or register_load_manhattan_distance() to populate the map with loaders for distances they intend to support.
| Prebuilt< Index_, Data_, Distance_ > * knncolle::load_l2normalized_prebuilt | ( | const std::filesystem::path & | dir | ) |
Load an L2-normalized search index (i.e., a Prebuilt created by L2NormalizedBuilder) from its on-disk representation. This is not provided in the registry by default as its depends on the application's set of the acceptable Normalized_ types. The Normalized_ type in the saved index can be retrived by load_l2normalized_prebuilt_types().
| Index_ | Integer type for the indices. |
| Data_ | Numeric type for the input and query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| Normalized_ | Floating-point type for the L2-normalized data. |
| dir | Path to a directory in which a prebuilt L2-normalized index was saved. Files should have been generated by the Prebuilt::save() method of the L2-normalized Prebuilt subclass instance. |
Prebuilt instance.
|
inline |
| dir | Path to a directory in which a prebuilt L2-normalized index was saved. Files should have been generated by the Prebuilt::save() method of the L2-normalized Prebuilt subclass instance. |
Prebuilt L2-normalized subclass. This is typically used to choose template parameters for load_l2normalized_prebuilt(). | Prebuilt< Index_, Data_, Distance_ > * knncolle::load_prebuilt_raw | ( | const std::filesystem::path & | dir | ) |
Load a neighbor search index from disk into a Prebuilt object. This should be called with the same template parameters as the Prebuilt interface from which Prebuilt::save() was called. It is expected that load_prebuilt_raw() should create an object that is "equivalent" to the object that was saved with save(), i.e., any neighbor search results should be the same across the original and reloaded Prebuilt object.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| dir | Path to a directory containing a prebuilt index that was saved to disk by Prebuilt::save(). |
Prebuilt instance, created from the files at dir. If no loading function can be found for the search algorithm, a LoadPrebuiltNotFoundError is thrown.
|
inline |
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
Prebuilt::save()) and the values are the prebuilt loading functions.No loading functions are available when the global is first initialized. Users should register LoadPrebuiltFunction functions for each algorithm they intend to support. This is facilitated by the load_bruteforce_prebuilt(), load_vptree_prebuilt() and load_l2normalized_prebuilt() helper functions.
| std::shared_ptr< Prebuilt< Index_, Data_, Distance_ > > knncolle::load_prebuilt_shared | ( | const std::filesystem::path & | dir | ) |
Load a neighbor search index from disk into a Prebuilt object. This should be called with the same template parameters as the Prebuilt interface from which Prebuilt::save() was called.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| dir | Path to a directory containing a prebuilt index that was saved to disk by Prebuilt::save(). |
Prebuilt instance, created from the files at dir. This uses the return value of load_prebuilt_raw(). | std::unique_ptr< Prebuilt< Index_, Data_, Distance_ > > knncolle::load_prebuilt_unique | ( | const std::filesystem::path & | dir | ) |
Load a neighbor search index from disk into a Prebuilt object. This should be called with the same template parameters as the Prebuilt interface from which Prebuilt::save() was called.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| dir | Path to a directory containing a prebuilt index that was saved to disk by Prebuilt::save(). |
Prebuilt instance, created from the files at dir. This uses the return value of load_prebuilt_raw(). | Prebuilt< Index_, Data_, Distance_ > * knncolle::load_vptree_prebuilt | ( | const std::filesystem::path & | dir | ) |
Load a VP-tree search index (i.e., a Prebuilt created by VptreeBuilder) from its on-disk representation. This is not provided in the registry by default as its depends on the application's choice of DistanceMetric_.
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the query data. |
| Distance_ | Floating-point type for the distances. |
| DistanceMetric_ | Class implementing the distance metric calculation. This should satisfy the DistanceMetric interface. |
| dir | Path to a directory in which a prebuilt VP-tree index was saved. Files should have been generated by the Prebuilt::save() method of the VP-tree Prebuilt subclass instance. |
Prebuilt instance. | void knncolle::parallelize | ( | int | num_workers, |
| Task_ | num_tasks, | ||
| Run_ | run_task_range ) |
| Task_ | Integer type for the number of tasks. |
| Run_ | Function to execute a range of tasks. |
| num_workers | Number of workers. |
| num_tasks | Number of tasks. |
| run_task_range | Function to iterate over a range of tasks within a worker. |
By default, this is an alias to subpar::parallelize_range(). However, if the KNNCOLLE_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().
| void knncolle::quick_load | ( | const std::filesystem::path & | path, |
| Input_ *const | contents, | ||
| const Length_ | length ) |
Read an array from a binary file at path. This is intended for developer use when defining LoadPrebuiltFunction functions for load_prebuilt_registry().
| Input_ | Type of values to be read. |
| Length_ | Integer type of the length of the array. |
| path | File path to read from. |
| contents | Pointer to an array in which to store the contents of path. This should have at least length addressable elements. |
| length | Number of elements (not bytes) to be read from path. This should be non-negative. |
|
inline |
Read the contents of a binary file as a string. This is intended for developer use when defining LoadPrebuiltFunction functions for load_prebuilt_registry().
| path | File path to read from. |
path as a string. | void knncolle::quick_save | ( | const std::filesystem::path & | path, |
| const Input_ *const | contents, | ||
| const Length_ | length ) |
Saves an array to a binary file at path. This is intended for developer use in implementations of Prebuilt::save().
| Input_ | Type of values to be saved. |
| Length_ | Integer type of the length of the array. |
| path | File path to save to. Any directories in the path should already exist. |
| contents | Pointer to an array of contents to be saved. |
| length | Length of the array, in terms of the number of elements (not the number of bytes). This should be non-negative. |
| void knncolle::register_load_euclidean_distance | ( | ) |
Register a loading function for EuclideanDistance using euclidean_distance_save_name.
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the output distance, usually floating-point. |
| void knncolle::register_load_manhattan_distance | ( | ) |
Register a loading function for ManhattanDistance using manhattan_distance_save_name.
| Data_ | Numeric type for the input data. |
| Distance_ | Numeric type for the output distance, usually floating-point. |
| void knncolle::report_all_neighbors | ( | std::vector< std::pair< Distance_, Index_ > > & | all_neighbors, |
| std::vector< Index_ > * | output_indices, | ||
| std::vector< Distance_ > * | output_distances ) |
Report the indices and distances of all neighbors in range of an observation of interest. It is assumed that the observation of interest is not detected as its own neighbor, presumably as it does not exist in the original input dataset.
| Distance_ | Numeric type for the distances, usually floating-point. |
| Index_ | Integer type for the observation indices. |
| [in] | all_neighbors | Vector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output. |
| [out] | output_indices | Pointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported. |
| [out] | output_distances | Pointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors. |
| void knncolle::report_all_neighbors | ( | std::vector< std::pair< Distance_, Index_ > > & | all_neighbors, |
| std::vector< Index_ > * | output_indices, | ||
| std::vector< Distance_ > * | output_distances, | ||
| Index_ | self ) |
Report the indices and distances of all neighbors in range of an observation of interest. If the observation of interest is detected as its own neighbor, it will be removed from the output vectors.
| Distance_ | Numeric type for the distances, usually floating-point. |
| Index_ | Integer type for the observation indices. |
| [in] | all_neighbors | Vector of (distance, index) pairs for all neighbors within range of the observation of interest. This may include the observation itself. Note that this will be sorted on output. |
| [out] | output_indices | Pointer to a vector in which to store the indices of all neighbors in range, sorted by distance. If NULL, the indices will not be reported. |
| [out] | output_distances | Pointer to a vector in which to store the (sorted) distances to all neighbors in range. If NULL, the distances will not be reported. Otherwise, on output, this will have the same length as *output_indices and contain distances to each of those neighbors. |
| self | Index of the observation of interest, i.e., for which neighbors are to be identified. If present in all_neighbors, this will be removed from output_indices and output_distances. |