I am using CountVectorizer() to create a term-frequency matrix.
I want to remove from the vocabulary all terms that have a frequency of two or less.
Then I use TfidfTransformer() to create a tf-idf matrix:
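import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# docs is a list of raw text documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix_terms = np.array(vectorizer.get_feature_names_out())
matrix_freq = np.asarray(X.sum(axis=0)).ravel()  # total count of each term over all documents
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(X)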
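To make the goal concrete, here is a minimal sketch of the kind of filtering I have in mind, assuming "frequency" means the total count of a term over the whole corpus. The toy `docs` list and the names `keep`, `X_filtered` and `kept_terms` are made up for illustration. Note that the `min_df` parameter of CountVectorizer filters by document frequency (the number of documents containing a term), which is not the same thing as total count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, only for illustration
docs = [
    "apple banana apple",
    "apple cherry banana",
    "banana durian apple",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Total count of each term across all documents
matrix_freq = np.asarray(X.sum(axis=0)).ravel()

# Column indices of the terms whose total count is greater than two
keep = np.flatnonzero(matrix_freq > 2)
X_filtered = X[:, keep]

# Terms ordered by column index, read from the fitted vocabulary (term -> column)
terms_by_column = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept_terms = [terms_by_column[i] for i in keep]

# tf-idf computed only on the remaining columns
tfidf_matrix = TfidfTransformer().fit_transform(X_filtered)
```

Filtering the count matrix before TfidfTransformer means the tf-idf row normalization is computed only over the terms that are kept.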
Then I want to use the LSA algorithm for dimensionality reduction and k-means for clustering.
But I want to build the clusters without the terms that have a frequency of two or less.
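Roughly, the full pipeline I am aiming for might look like the sketch below. It is only an outline under assumptions: the toy corpus, `n_components=2` and `n_clusters=2` are placeholder values, LSA is taken to mean TruncatedSVD applied to the tf-idf matrix followed by row normalization, and "frequency" again means total count over the corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus with two obvious topics, only for illustration
docs = [
    "cat dog pet cat",
    "dog pet cat vet",
    "pet cat dog leash",
    "stock bond market stock",
    "bond market stock fund",
    "market stock bond broker",
]

X = CountVectorizer().fit_transform(docs)

# Drop terms whose total count is two or less before weighting and clustering
keep = np.flatnonzero(np.asarray(X.sum(axis=0)).ravel() > 2)
tfidf = TfidfTransformer().fit_transform(X[:, keep])

# LSA: truncated SVD on the tf-idf matrix, then L2-normalize each row
lsa = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(tfidf)

# k-means on the reduced representation; n_clusters=2 is an assumed value
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_lsa)
```

With this approach the rare terms never reach the SVD step, so they cannot influence the clusters.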
Can someone help me, please?
Asked by rootware (translated from Stack Overflow)