knncolle package

knncolle.includes()[source]

Provides access to knncolle_py.h C++ header.

Return type:

str

Returns:

Path to a directory containing the header.

Submodules

knncolle.annoy module

class knncolle.annoy.AnnoyIndex(ptr)[source]

Bases: GenericIndex

A prebuilt index for the Approximate Nearest Neighbors Oh Yeah (Annoy) algorithm, created by define_builder() with a AnnoyParameters object.

__init__(ptr)[source]
Parameters:

ptr (int) – Address of a knncolle_py::WrappedPrebuilt containing an Annoy search index, allocated in C++.

class knncolle.annoy.AnnoyParameters(num_trees=50, search_mult=None, distance='Euclidean')[source]

Bases: Parameters

Parameters for the Approximate Nearest Neighbors Oh Yeah (Annoy) algorithm, see here for details.

__init__(num_trees=50, search_mult=None, distance='Euclidean')[source]
Parameters:
  • num_trees (int) – Number of trees to use to generate the search index. More trees increase accuracy at the cost of more computational work, in terms of both the indexing time and search speed.

  • search_mult (Optional[float]) – Multiplier for the number of observations to search. Specifically, the product of k and search_mult is used to define the number of points to search exhaustively and dictates the balance between search speed and accuracy. If None, this defaults to the value of num_trees.

  • distance (Literal['Euclidean', 'Manhattan', 'Cosine']) – Distance metric for index construction and search. This should be one of Euclidean, Manhattan or Cosine.

property distance: str

Distance metric, see __init__().

property num_trees: int

Number of trees, see __init__().

property search_mult: int

Search multiplier, see __init__().

knncolle.build_index module

knncolle.build_index.build_index(param, x, **kwargs)[source]

Build a search index for a given nearest neighbor search algorithm. The default method calls define_builder() to obtain an algorithm-specific factory that builds the index from x.

Parameters:
  • param (Parameters) – Parameters for a particular search algorithm.

  • x (ndarray) – Matrix of coordinates for the observations to be searched. This should be a two-dimensional double-precision NumPy array in Fortran order where the rows are dimensions and columns are observations.

  • kwargs – Additional arguments to be passed to individual methods.

Return type:

Index

Returns:

Instance of a Index subclass.

knncolle.classes module

class knncolle.classes.Builder(ptr)[source]

Bases: object

Pointer to a search index builder, i.e., knncolle_py::WrappedBuilder, for use in C++ to build new neighbor search indices. The associated memory is automatically freed upon garbage collection.

__init__(ptr)[source]
Parameters:

ptr (int) – Address of a knncolle_py::WrappedBuilder.

property ptr

Address of a knncolle_py::WrappedBuilder, to be passed into C++ as a uintptr_t; see knncolle_py.h for details.

class knncolle.classes.GenericIndex(ptr)[source]

Bases: Index

Abstract base class for a prebuilt nearest neighbor-search index that holds an address to a knncolle_py::WrappedPrebuilt instance in C++. The associated memory is automatically freed upon garbage collection.

__init__(ptr)[source]
Parameters:

ptr (int) – Address of a knncolle_py::WrappedPrebuilt.

num_dimensions()[source]
Return type:

int

Returns:

Number of dimensions in this index.

num_observations()[source]
Return type:

int

Returns:

Number of observations in this index.

property ptr: int

Address of a knncolle_py::WrappedPrebuilt, to be passed into C++ as a uintptr_t; see knncolle_py.h for details.

class knncolle.classes.Index[source]

Bases: ABC

Abstract base class for a prebuilt nearest neighbor-search index. Each search algorithm should implement their own subclasses, but are free to use any data structure to represent their search indices.

class knncolle.classes.Parameters[source]

Bases: ABC

Abstract base class for the parameters of a nearest neighbor search. Each search algorithm should implement a subclass that contains the relevant parameters for controlling index construction or search.

knncolle.define_builder module

knncolle.define_builder.define_builder(param)[source]

Create a builder instance for a given nearest neighbor search algorithm. The builder can be used in build_index() to create a search index from a matrix of observations.

Parameters:

param (Parameters) – Parameters for a particular search algorithm.

Return type:

Tuple

Returns:

Tuple where the first element is a Builder and the second element is a GenericIndex.

knncolle.exhaustive module

class knncolle.exhaustive.ExhaustiveIndex(ptr)[source]

Bases: GenericIndex

A prebuilt index for an exhaustive search, created by define_builder() with a ExhaustiveParameters instance.

__init__(ptr)[source]
Parameters:

ptr – Address of a knncolle_py::WrappedPrebuilt containing an exhaustive search index, allocated in C++.

class knncolle.exhaustive.ExhaustiveParameters(distance='Euclidean')[source]

Bases: Parameters

Parameters for an exhaustive search.

__init__(distance='Euclidean')[source]
Parameters:

distance (Literal['Euclidean', 'Manhattan', 'Cosine']) – Distance metric for index construction and search. This should be one of Euclidean, Manhattan or Cosine.

property distance: str

Distance metric, see __init__().

knncolle.find_distance module

knncolle.find_distance.find_distance(X, num_neighbors, num_threads=1, subset=None, **kwargs)[source]

Find the distance to the k-th closest point for each observation.

Parameters:
  • X (Index) – A prebuilt search index.

  • num_neighbors (Union[int, Sequence]) –

    Number of nearest neighbors at which to compute the distance from each observation in X, i.e., k. This is automatically capped at the number of observations minus 1.

    Alternatively, this may be a sequence of non-negative integers of length equal to the number of observations in X, specifying the neighbor at which to compute the distance for each observation.

    If subset is supplied and num_neighbors is a sequence, it should have length equal to subset instead, and should specify the neighbor for each observation in the subset.

  • num_threads (int) – Number of threads to use for the search.

  • subset (Optional[Sequence]) – Sequence of integers containing the indices of the observations for which to compute the distances. All indices should be non-negative and less than the total number of observations.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

ndarray

Returns:

A NumPy array of length equal to the number of observations in X (or subset, if provided) containing the distance to the num_neighbors-th point for each observation.

knncolle.find_knn module

class knncolle.find_knn.FindKnnResults(index, distance)[source]

Bases: object

Results of find_knn().

If num_neighbors is an integer, index and distance are both matrices. Each row corresponds to an observation in X and each column corresponds to one of its neighbors. index contains the indices of the nearest neighbors while distance contains the distance to those neighbors. In each row, neighbors are guaranteed to be sorted in order of increasing distance. Each row of index is guaranteed to not contain the index of the corresponding observation.

If num_neighbors is a sequence, index and distance are lists instead. Each list element corresponds to an observation in X and is a NumPy array containing the indices (for index) or distances (for distance) to the requested number of neighbors for that observation. For each observation, the neighbors are guaranteed to be sorted in order of increasing distance. Each element of index is guaranteed to not contain the index of the corresponding observation.

If get_index = False, index is set to None.

If get_distance = False, distance is set to None.

If subset is provided, the number of rows in index and distance (if num_neighbors is an integer) or their length (otherwise) is instead equal to the length of the subset. Each row or list entry corresponds to one of the observations in the subset.

__init__(index, distance)
distance: Optional[ndarray]
index: Optional[ndarray]
knncolle.find_knn.find_knn(X, num_neighbors, num_threads=1, subset=None, get_index=True, get_distance=True, **kwargs)[source]

Find the k-nearest neighbors for each observation.

Parameters:
  • X (Index) – A prebuilt search index.

  • num_neighbors (Union[int, Sequence]) –

    Number of nearest neighbors to identify for each observation in X. This is automatically capped at the number of observations minus 1.

    Alternatively, this may be a sequence of non-negative integers of length equal to the number of observations in X, specifying the number of neighbors to find for each observation.

    If subset is supplied and num_neighbors is a sequence, it should have length equal to subset instead, and should specify the number of neighbors for each observation in the subset.

  • num_threads (int) – Number of threads to use for the search.

  • subset (Optional[Sequence]) – Sequence of integers containing the indices of the observations for which to identify neighbors. All indices should be non-negative and less than the total number of observations.

  • get_index (bool) – Whether to report the indices of each nearest neighbor.

  • get_distance (bool) – Whether to report the distances to each nearest neighbor.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

FindKnnResults

Returns:

Results of the nearest-neighbor search.

knncolle.find_neighbors module

class knncolle.find_neighbors.FindNeighborsResults(index, distance)[source]

Bases: object

Results of find_neighbors().

index and distance are lists where each element corresponds to an observation in X. Each element is a NumPy array containing the indices of (for index) or distances to (for distance) the neighbors of the corresponding observation within the specified threshold distance. For each observation, neighbors are guaranteed to be sorted in order of increasing distance. Each element of index is guaranteed to not contain the index of the corresponding observation.

If get_index = False, index is set to None.

If get_distance = False, distance is set to None.

If subset is provided, the length of index and distance is instead equal to the length of the subset. Each row or list entry corresponds to one of the observations in the subset.

__init__(index, distance)
distance: Optional[list]
index: Optional[list]
knncolle.find_neighbors.find_neighbors(X, threshold, num_threads=1, subset=None, get_index=True, get_distance=True, **kwargs)[source]

Find all neighbors within a certain distance for each observation.

Parameters:
  • X (Index) – A prebuilt search index.

  • threshold (Union[float, Sequence]) –

    Distance threshold at which to identify neighbors for each observation in X.

    Alternatively, this may be a sequence of non-negative floats of length equal to the number of observations in X, specifying the distance threshold to search for each observation.

    If subset is supplied and threshold is a sequence, it should have length equal to subset instead, and should specify the distance threshold for each observation in the subset.

  • num_threads (int) – Number of threads to use for the search.

  • subset (Optional[Sequence]) – Sequence of integers containing the indices of the observations for which to identify neighbors. All indices should be non-negative and less than the total number of observations.

  • get_index (bool) – Whether to report the indices of each nearest neighbor.

  • get_distance (bool) – Whether to report the distances to each nearest neighbor.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

FindNeighborsResults

Returns:

Results of the neighbor search.

knncolle.hnsw module

class knncolle.hnsw.HnswIndex(ptr)[source]

Bases: GenericIndex

A prebuilt index for the hierarchical navigable small worlds (HNSW) algorithm, created by define_builder() with a HnswParameters object.

__init__(ptr)[source]
Parameters:

ptr – Address of a knncolle_py::WrappedPrebuilt containing a HNSW search index, allocated in C++.

class knncolle.hnsw.HnswParameters(num_links=16, ef_construction=200, ef_search=10, distance='Euclidean')[source]

Bases: Parameters

Parameters for the hierarchical navigable small worlds (HNSW) algorithm, see here for details.

__init__(num_links=16, ef_construction=200, ef_search=10, distance='Euclidean')[source]
Parameters:
  • num_links (int) – Number of bi-directional links to create per observation during index construction. Larger values improve accuracy at the expense of speed and memory usage.

  • ef_construction (int) – Size of the dynamic list for index generation. Larger values improve the quality of the index at the expense of time.

  • ef_search (int) – Size of the dynamic list for neighbor searching. Larger values improve accuracy at the expense of a slower search.

  • distance (Literal['Euclidean', 'Manhattan', 'Cosine']) – Distance metric for index construction and search. This should be one of Euclidean, Manhattan or Cosine.

property distance: str

Distance metric, see __init__().

property ef_construction: int

Size of the dynamic list during index construction, see __init__().

Size of the dynamic list during search, see __init__().

Number of links, see __init__().

knncolle.kmknn module

class knncolle.kmknn.KmknnIndex(ptr)[source]

Bases: GenericIndex

A prebuilt index for the k-means k-nearest neighbors algorithm, created by define_builder() with a KmknnParameters instance.

__init__(ptr)[source]
Parameters:

ptr – Address of a knncolle_py::WrappedPrebuilt containing a KMKNN search index, allocated in C++.

class knncolle.kmknn.KmknnParameters(distance='Euclidean')[source]

Bases: Parameters

Parameters for the k-means k-nearest neighbors (KMKNN) algorithm.

__init__(distance='Euclidean')[source]
Parameters:

distance (Literal['Euclidean', 'Manhattan', 'Cosine']) – Distance metric for index construction and search. This should be one of Euclidean, Manhattan or Cosine.

property distance: str

Distance metric, see __init__().

knncolle.query_distance module

knncolle.query_distance.query_distance(X, query, num_neighbors, num_threads=1, **kwargs)[source]

Find the distance to the k-th nearest neighbor in the search index for each observation in the query dataset.

Parameters:
  • X (Index) – A prebuilt search index.

  • query (ndarray) – Matrix of coordinates for the query observations. This should be a two-dimensional double-precision NumPy array in Fortran order where the rows are dimensions and columns are observations. The number of dimensions should be consistent with that in X.

  • num_neighbors (Union[int, Sequence]) –

    Number of nearest neighbors in X at which to compute the distance from each observation in query, i.e., k. This is automatically capped at the total number of observations in X.

    Alternatively, this may be a sequence of integers of length equal to the number of observations in query, specifying the neighbor at which to compute the distance for each observation.

  • num_threads (int) – Number of threads to use for the search.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

ndarray

Returns:

A NumPy array of length equal to the number of observations in query containing the distance to the num_neighbors-th point in X for each observation.

knncolle.query_knn module

class knncolle.query_knn.QueryKnnResults(index, distance)[source]

Bases: object

Results of query_knn().

If num_neighbors is an integer, index and distance are both matrices. Each row corresponds to an observation in query and each column corresponds to one of its neighbors in X. index contains the indices of the nearest neighbors while distance contains the distance to those neighbors. In each row, neighbors are guaranteed to be sorted in order of increasing distance.

If num_neighbors is a sequence, index and distance are lists instead. Each list element corresponds to an observation in X and is a NumPy array containing the indices (for index) or distances (for distance) to the requested number of neighbors for that observation. For each observation, the neighbors are guaranteed to be sorted in order of increasing distance.

If get_index = False, index is set to None.

If get_distance = False, distance is set to None.

__init__(index, distance)
distance: Optional[ndarray]
index: Optional[ndarray]
knncolle.query_knn.query_knn(X, query, num_neighbors, num_threads=1, get_index=True, get_distance=True, **kwargs)[source]

Find the k-nearest neighbors in the search index for each observation in the query matrix.

Parameters:
  • X (Index) – A prebuilt search index.

  • query (ndarray) – Matrix of coordinates for the query observations. This should be a two-dimensional double-precision NumPy array in Fortran order where the rows are dimensions and columns are observations. The number of dimensions should be consistent with that in X.

  • num_neighbors (Union[int, Sequence]) –

    Number of nearest neighbors in X to identify for each observation in query, i.e., k. This is automatically capped at the total number of observations in X.

    Alternatively, this may be a sequence of non-negative integers of length equal to the number of observations in query, specifying the number of neighbors to find for each observation.

  • num_threads (int) – Number of threads to use for the search.

  • get_index (bool) – Whether to report the indices of each nearest neighbor.

  • get_distance (bool) – Whether to report the distances to each nearest neighbor.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

QueryKnnResults

Returns:

Results of the nearest-neighbor search.

knncolle.query_neighbors module

class knncolle.query_neighbors.QueryNeighborsResults(index, distance)[source]

Bases: object

Results of query_neighbors().

index and distance are lists where each element is a NumPy array that corresponds to an observation in query. Each array contains the indices of (for index) or distances to (for distance) the observations of X that neighbor the corresponding observation within the specified threshold distance. For each query observation, neighbors are guaranteed to be sorted in order of increasing distance.

If get_index = False, index is set to None.

If get_distance = False, distance is set to None.

__init__(index, distance)
distance: Optional[list]
index: Optional[list]
knncolle.query_neighbors.query_neighbors(X, threshold, num_threads=1, subset=None, get_index=True, get_distance=True, **kwargs)[source]

Find all observations in the search index that lie within a threshold distance of each observation in the query dataset.

Parameters:
  • X (Index) – A prebuilt search index.

  • query – Matrix of coordinates for the query observations. This should be a two-dimensional double-precision NumPy array in Fortran order where the rows are dimensions and columns are observations. The number of dimensions should be consistent with that in X.

  • threshold (Union[float, Sequence]) –

    Distance threshold at which to identify neighbors for each observation in X.

    Alternatively, this may be a sequence of non-negative floats of length equal to the number of observations in X, specifying the distance threshold to search for each observation.

  • num_threads (int) – Number of threads to use for the search.

  • get_index (bool) – Whether to report the indices of each nearest neighbor.

  • get_distance (bool) – Whether to report the distances to each nearest neighbor.

  • kwargs – Additional arguments to pass to specific methods.

Return type:

QueryNeighborsResults

Returns:

Results of the neighbor search.

knncolle.vptree module

class knncolle.vptree.VptreeIndex(ptr)[source]

Bases: GenericIndex

A prebuilt index for the vantage point tree algorithm, created by define_builder() with a VptreeParameters instance.

__init__(ptr)[source]
Parameters:

ptr – Address of a knncolle_py::WrappedPrebuilt containing a VP tree search index, allocated in C++.

class knncolle.vptree.VptreeParameters(distance='Euclidean')[source]

Bases: Parameters

Parameters for the vantage point (VP) tree algorithm.

__init__(distance='Euclidean')[source]
Parameters:

distance (Literal['Euclidean', 'Manhattan', 'Cosine']) – Distance metric for index construction and search. This should be one of Euclidean, Manhattan or Cosine.

property distance: str

Distance metric, see __init__().