I want to use DBSCAN to find clusters in my dataset. First, I calculated a proximity matrix with an unsupervised random forest, which gives me an N x N matrix of size 516 x 516. I then pass it to DBSCAN as precomputed input.
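For readers unfamiliar with the setup, here is a minimal sketch of how such a proximity matrix could be produced and handed to DBSCAN. It is an illustration under assumptions, not necessarily the procedure used for the matrix above: `rf_proximity` is a hypothetical helper, the synthetic "contrast" data is made by permuting each column, the `1 - proximity` conversion is just one common choice, and `eps=0.5` is a placeholder.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier


def rf_proximity(X, n_estimators=100, random_state=0):
    """Hypothetical helper: proximity from an 'unsupervised' random forest.

    The forest is trained to separate the real rows from a synthetic copy
    (each column permuted independently). The proximity of two real rows is
    the fraction of trees in which they end up in the same leaf.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    # Synthetic copy: permuting columns independently breaks feature dependencies
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    X_all = np.vstack([X, X_synth])
    y_all = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]

    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    forest.fit(X_all, y_all)

    leaves = forest.apply(X)  # leaf index per sample and tree: (n_samples, n_estimators)
    prox = np.zeros((len(X), len(X)))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    return prox / leaves.shape[1]


# Toy data standing in for the real dataset (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))

prox_demo = rf_proximity(X_demo, n_estimators=50)
# Proximity is a similarity; metric="precomputed" expects a distance matrix
dist_demo = 1.0 - prox_demo
labels_demo = DBSCAN(eps=0.5, metric="precomputed").fit_predict(dist_demo)  # eps is a placeholder
```

Note that scikit-learn's `metric="precomputed"` interprets the input as distances, so a similarity-style proximity matrix would need some conversion like the one above before being passed in.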
It is good practice to optimise epsilon in DBSCAN to get meaningful results. I found multiple posts online where this is done on 2D data (x and y). However, my data has more dimensions, and I feel like the elbow plot doesn't make sense here.
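# Use NearestNeighbors to calculate distances between points
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
# Fit the nearest-neighbour model on the proximity matrix
neigh = NearestNeighbors(n_neighbors=2)
distance = neigh.fit(Prox_mat)
# Distances and indices of the 2 nearest neighbours (the first is the point itself)
distances, indices = distance.kneighbors(Prox_mat)
# Sort the distances in increasing order
sorting_distances = np.sort(distances, axis=0)
# Keep the distance to the nearest neighbour other than the point itself
sorted_distances = sorting_distances[:, 1]
# Plot the sorted nearest-neighbour distances (candidate epsilon values)
plt.plot(sorted_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Epsilon (nearest-neighbour distance)')
plt.show()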
The elbow plot looks something like this:
[Image: elbow plot]
I then use an epsilon of 1.3 as input to DBSCAN.
from sklearn.cluster import DBSCAN
clustering_model = DBSCAN(eps=1.3, metric="precomputed")
# Fit the model to the proximity matrix
clustering_model.fit(Prox_mat)
# Labels predicted by DBSCAN
predicted_labels = clustering_model.labels_
# Visualise the clusters on the first two PCA components
plt.scatter(Prox_mat_PCA.iloc[:, 0], Prox_mat_PCA.iloc[:, 1], c=predicted_labels, cmap='Paired')
plt.title("DBSCAN")
plt.show()
[Image: DBSCAN scatterplot]
Unfortunately, every instance is assigned the label 0, meaning that they all belong to the same cluster.
I was wondering: would it be a good idea to perform PCA on the proximity matrix (technically obtaining PCoAs) and then use the first 2 PCoAs as input to DBSCAN to find epsilon and the subsequent clusters?
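To make the idea concrete, here is a rough sketch of what that pipeline might look like. Everything in it is an assumption for illustration: `pcoa` and `k_distances` are hypothetical helpers, the toy proximity matrix stands in for the random-forest one, the proximity is converted to a distance as `1 - proximity` before the PCoA step, and the quantile used for `eps` is only a crude stand-in for reading the elbow off the plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def pcoa(dist, n_components=2):
    """Hypothetical helper: classical MDS / PCoA on a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    # eigh returns eigenvalues in ascending order, so pick the largest ones
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues, which can appear for non-Euclidean distances
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def k_distances(coords, k=4):
    """Hypothetical helper: sorted distance of each point to its k-th nearest neighbour."""
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    distances, _ = nn.kneighbors(coords)
    return np.sort(distances[:, k])


# Toy stand-in for a proximity matrix: values in (0, 1], 1 on the diagonal
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
toy_prox = np.exp(-d)
toy_dist = 1.0 - toy_prox  # one possible similarity-to-distance conversion

coords = pcoa(toy_dist, n_components=2)   # first 2 PCoA axes
kd = k_distances(coords, k=4)             # values to plot for the elbow
eps = float(np.quantile(kd, 0.9))         # crude placeholder for the elbow value
labels = DBSCAN(eps=eps, min_samples=5).fit_predict(coords)
```

Plotting `kd` with `plt.plot(kd)` would give the elbow curve in the reduced 2D space. Keeping only two axes discards whatever structure lives in the remaining ones, so the clusters found this way may differ from those in the full distance matrix.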