
python - What hashing function does Spark use for HashingTF and how do I duplicate it?

Spark MLlib has a HashingTF() function that computes document term frequencies based on a hash of each term.
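For context, a minimal usage sketch (this assumes pyspark is importable; the exact indices you get depend on the hashing discussed in the answer below):

from pyspark.mllib.feature import HashingTF

tf = HashingTF()  # defaults to 2**20 features
doc = "a a b c".split()
print(tf.transform(doc))  # SparseVector of term counts keyed by hashed index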

1) What function does it use to do the hashing?

2) How can I reproduce the same hashed value from Python?

3) If I want to compute the hashed output for a single input, without computing the term frequency, how can I do this?


1 Reply


If you're in doubt, it is usually a good idea to check the source. In pyspark.mllib.feature.HashingTF, the bucket for a given term is determined as follows:

def indexOf(self, term):
    """ Returns the index of the input term. """
    return hash(term) % self.numFeatures

As you can see, it is just a plain old hash modulo the number of buckets.
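So reproducing the same value from plain Python (question 2) takes nothing more than the built-in hash and a modulo. A minimal sketch, assuming the default of 2**20 buckets; keep in mind that Python 3 randomizes string hashes per process unless the PYTHONHASHSEED environment variable is set, so fix it if you need values that are stable across runs:

def index_of(term, num_features=1 << 20):
    # Same computation as HashingTF.indexOf: built-in hash modulo buckets.
    # num_features=1 << 20 mirrors the HashingTF default; adjust to match yours.
    return hash(term) % num_features

print(index_of("spark"))  # bucket index HashingTF would assign to "spark"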

The final result is just a vector of counts per bucket (I've omitted the docstring and the RDD case for brevity):

def transform(self, document):
    freq = {}
    for term in document:
        i = self.indexOf(term)
        freq[i] = freq.get(i, 0) + 1.0
    return Vectors.sparse(self.numFeatures, freq.items())
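Putting the two pieces together, here is a pure-Python sketch that reproduces transform for a single document without Spark (the function name is mine, not Spark API; the same PYTHONHASHSEED caveat applies):

from collections import defaultdict

def hashing_tf(document, num_features=1 << 20):
    # bucket index -> count: the same sparse mapping transform builds
    freq = defaultdict(float)
    for term in document:
        freq[hash(term) % num_features] += 1.0
    return dict(freq)

print(hashing_tf("hello world hello".split()))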

If you want to ignore frequencies, you can pass set(document) as the input, but I doubt there is much to gain here: to create the set you'll have to compute the hash of each element anyway.
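For question 3, if all you want is the set of bucket indices with no counts, a sketch along the same lines (again my own helper, not Spark API):

def hashed_indices(document, num_features=1 << 20):
    # Deduplicate first, then hash; frequencies are discarded.
    return {hash(term) % num_features for term in set(document)}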

