esda.adbscan.ADBSCAN
- class esda.adbscan.ADBSCAN(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]
A-DBSCAN, as introduced in [].
A-DBSCAN is an extension of the original DBSCAN algorithm that creates an ensemble of solutions generated by running DBSCAN on a random subset and “extending” each solution to the rest of the sample through nearest-neighbor regression.
See the original reference ([]) for more details, or the notebook guide for an illustration.
- Parameters:
  - eps : float
    The maximum distance between two samples for them to be considered as in the same neighborhood.
  - min_samples : int
    The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
  - algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
    The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See the NearestNeighbors module documentation for details.
  - n_jobs : int
    [Optional. Default=1] The number of parallel jobs to run. If -1, the number of jobs is set to the number of CPU cores.
  - pct_exact : float
    [Optional. Default=0.1] Proportion of the entire dataset used to calculate DBSCAN in each draw.
  - reps : int
    [Optional. Default=100] Number of random samples to draw in order to build the final solution.
  - keep_solus : bool
    [Optional. Default=False] If True, the solus and solus_relabelled objects are kept; otherwise they are deleted to save memory.
  - pct_thr : float
    [Optional. Default=0.9] Minimum proportion of replications in which a non-noise label needs to be assigned to an observation for that observation to be labelled as such.
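The subsample-and-extend ensemble described above can be sketched with scikit-learn primitives. This is a simplification under stated assumptions: the actual implementation also relabels solutions to be consistent across draws before voting, and the variable names here are illustrative, not the library's internals.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
X = rng.random((200, 2))  # toy dataset of 200 points in the unit square

# Parameter names mirror the ADBSCAN signature
eps, min_samples, pct_exact, reps = 0.1, 3, 0.5, 10

n = len(X)
solus = np.empty((n, reps), dtype=int)
for r in range(reps):
    # 1. Run DBSCAN on a random subset containing pct_exact of the data
    idx = rng.choice(n, size=int(n * pct_exact), replace=False)
    sub_labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X[idx]).labels_
    # 2. Extend that solution to every point via 1-nearest-neighbor assignment
    nn = KNeighborsClassifier(n_neighbors=1).fit(X[idx], sub_labels)
    solus[:, r] = nn.predict(X)
# solus now holds one label column per replication, analogous to the
# solus attribute (before cross-solution relabelling and voting)
```

Each column of `solus` is one draw's solution; the final labels come from a vote across columns.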
Examples
>>> import pandas
>>> from esda.adbscan import ADBSCAN
>>> import numpy as np
>>> np.random.seed(10)
>>> db = pandas.DataFrame({'X': np.random.random(25), 'Y': np.random.random(25)})
ADBSCAN can be run following a scikit-learn-like API:
>>> np.random.seed(10)
>>> clusterer = ADBSCAN(0.03, 3, reps=10, keep_solus=True)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['-1', '-1', '-1', '0', '-1', '-1', '-1', '0', '-1', '-1', '-1',
       '-1', '-1', '-1', '0', '0', '0', '-1', '0', '-1', '0', '-1', '-1',
       '-1', '-1'], dtype=object)
We can inspect the winning label for each observation, as well as the proportion of votes:
>>> print(clusterer.votes.head().to_string())
  lbls  pct
0   -1  0.7
1   -1  0.5
2   -1  0.7
3    0  1.0
4   -1  0.7
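The winning label in `votes` only makes it into `labels_` when its share of replications reaches `pct_thr`. The voting rule can be sketched on a small hypothetical vote table (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical vote table: lbls = winning label, pct = share of replications
votes = pd.DataFrame({"lbls": ["-1", "-1", "0", "0"],
                      "pct": [0.7, 0.5, 1.0, 0.6]})
pct_thr = 0.9

# Keep a label only when its share reaches pct_thr; otherwise mark as noise
labels_ = votes["lbls"].where(votes["pct"] >= pct_thr, "-1")
# labels_ -> ['-1', '-1', '0', '-1']
```

Note how the last observation's winning label "0" is demoted to noise because only 60% of replications agreed on it.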
If you have set the option to keep them, you can even inspect each solution that makes up the ensemble:
>>> print(clusterer.solus.head().to_string())
  rep-00 rep-01 rep-02 rep-03 rep-04 rep-05 rep-06 rep-07 rep-08 rep-09
0      0      1      1      0      1      0      0      0      1      0
1      1      1      1      1      0      1      0      1      1      1
2      0      1      1      0      0      1      0      0      1      0
3      0      1      1      0      0      1      1      1      0      0
4      0      1      1      1      0      1      0      1      0      1
If we use a single replication and set the proportion of the dataset that is sampled to 100%, we recover a traditional DBSCAN:
>>> clusterer = ADBSCAN(0.2, 5, reps=1, pct_exact=1)
>>> np.random.seed(10)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['0', '-1', '0', '0', '0', '-1', '-1', '0', '-1', '-1', '0', '-1',
       '-1', '-1', '0', '0', '0', '-1', '0', '0', '0', '-1', '-1', '0',
       '-1'], dtype=object)
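As a cross-check, plain scikit-learn DBSCAN on the same 25 points should produce the same partition in this single-replication, full-sample setting (ADBSCAN returns labels as strings, scikit-learn as integers). This sketch rebuilds the points assuming `db` above was generated column by column from seed 10:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Rebuild the same 25 points used in the examples above
np.random.seed(10)
xy = np.column_stack([np.random.random(25), np.random.random(25)])

# Same eps and min_samples as the ADBSCAN call above
labels = DBSCAN(eps=0.2, min_samples=5).fit(xy).labels_
```

Here `labels` is an integer array with -1 marking noise, matching ADBSCAN's convention up to dtype.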
- Attributes:
  - labels_ : array
    [Only available after fit] Cluster labels for each point in the dataset given to fit(). Samples are labelled as noise (-1) if the proportion of their most common label is below pct_thr.
  - votes : DataFrame
    [Only available after fit] Table indexed on X.index with labels_ under the lbls column and the frequency of that label across draws under pct.
  - solus : DataFrame, shape = [n, reps]
    [Only available after fit] Label solution for each draw.
  - solus_relabelled : DataFrame, shape = [n, reps]
    [Only available after fit] Label solution for each draw, relabelled to be consistent across solutions.
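Cluster labels are arbitrary within each replication (draw A's cluster "0" may be draw B's cluster "1"), so solutions must be aligned before voting. One way such relabelling could work is to map each cluster in a replication to the reference label it overlaps with most. This greedy sketch is an assumption for illustration, not the library's exact procedure:

```python
import numpy as np

def relabel(reference, solution):
    """Map each cluster in `solution` to the reference label it overlaps most.

    Greedy sketch of cross-solution alignment; the actual remapping used to
    build solus_relabelled may differ.
    """
    out = solution.copy()
    for lab in np.unique(solution):
        if lab == -1:
            continue  # noise stays noise
        mask = solution == lab
        vals, counts = np.unique(reference[mask], return_counts=True)
        out[mask] = vals[np.argmax(counts)]
    return out

reference = np.array([0, 0, 1, 1, -1])
solution = np.array([1, 1, 0, 0, -1])  # same partition, labels swapped
aligned = relabel(reference, solution)
# aligned -> [0, 0, 1, 1, -1]
```

After alignment, the per-observation vote across replications becomes meaningful.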
- __init__(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]
Methods
- __init__(eps, min_samples[, algorithm, ...])
- fit(X[, y, sample_weight, xy]): Perform ADBSCAN clustering from features.
- fit_predict(X[, y]): Perform clustering on X and return cluster labels.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.