template<typename Index_, typename Data_, typename Distance_, class Matrix_ = Matrix<Index_, Data_>, class DistanceMetric_ = DistanceMetric<Data_, Distance_>>
class knncolle::VptreeBuilder< Index_, Data_, Distance_, Matrix_, DistanceMetric_ >
Perform a nearest neighbor search based on a vantage point (VP) tree.
In a vantage point tree (Yianilos, 1993), each node contains a subset of points that is split into two partitions. The split is determined by picking an arbitrary point inside that subset as the node center, computing the distance from the center to all other points, and using the median distance as the "radius" of a hypersphere. The left child of the node contains all points within that hypersphere while the right child contains the remaining points. This procedure is applied recursively until all points resolve to individual nodes, yielding a VP tree. During a search, the algorithm traverses the tree and exploits the triangle inequality between the query point and each node center to prune subtrees that cannot contain closer neighbors.
The major advantage of VP trees over the more conventional KD-trees or ball trees is that the former do not need to construct intermediate nodes, instead using the data points themselves as the node centers. This reduces the memory usage of the tree and the total number of distance calculations for any search. It can also be very useful when the concept of an intermediate point is not well-defined (e.g., for non-numeric data), though this is not particularly relevant for knncolle.
Note that the VP tree search is theoretically "exact", but the behavior of the implementation will be affected by round-off error for floating-point inputs. Search decisions based on the triangle inequality may be incorrect in some edge cases involving tied distances. This manifests as a different selection of neighbors compared to an exhaustive search (e.g., by BruteforceBuilder), typically when (i) an observation is equidistant to multiple other observations that are not duplicates of each other and (ii) the tied distances occur at the k-th nearest neighbor for Searcher::search() or are tied with the threshold for Searcher::search_all(). In practice, such errors are very rare and can probably be ignored for most applications. If more accuracy is required, a partial mitigation is to increase k or the threshold to reduce the risk of incorrect search decisions, and then filter the results down to the desired set of neighbors.
- Template Parameters
| Index_ | Integer type for the observation indices. |
| Data_ | Numeric type for the input and query data. |
| Distance_ | Numeric type for the distances, usually floating-point. |
| Matrix_ | Class of the input data matrix. This should satisfy the Matrix interface. |
| DistanceMetric_ | Class implementing the distance metric calculation. This should satisfy the DistanceMetric interface. |
- See also
- Yianilos PN (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 311-321.
- Hanov S (2011). VP trees: A data structure for finding stuff fast. http://stevehanov.ca/blog/index.php?id=130