apache spark - How to merge multiple feature vectors in DataFrame?

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

Using Spark ML transformers I arrived at a DataFrame where each row looks like this:

Row(object_id, text_features, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element one-hot-encoded dense vector of colors, and type_features is also a one-hot-encoded dense vector of types.

What would be a good approach (using Spark's facilities) to merge these features into a single large vector, so that I can measure things like the cosine distance between any two objects?

1 Reply

深蓝 · Answer 1 · 2021-10-17T02:51:18+0000

You can use VectorAssembler, which concatenates the listed input columns (numeric or vector) into a single vector column:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

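// Placeholder: the DataFrame holding the three feature columns
val df: DataFrame = ???

// Concatenate the input columns, in the order given, into one vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

// Adds the "features" column; the original columns are kept
val transformed = assembler.transform(df)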
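To try the assembler end to end, here is a minimal sketch on toy data. The object name AssembleSketch, the helper names toyData and assemble, the vector sizes, and the local SparkSession settings are all assumptions made for illustration; only the column names come from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object AssembleSketch {
  // Hypothetical toy rows: a sparse 4-element text vector plus two
  // 2-element one-hot vectors (sizes chosen only for illustration).
  def toyData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (1L, Vectors.sparse(4, Array(0, 2), Array(0.5, 1.5)),
        Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 1.0)),
      (2L, Vectors.sparse(4, Array(1), Array(2.0)),
        Vectors.dense(0.0, 1.0), Vectors.dense(1.0, 0.0))
    )).toDF("object_id", "text_features", "color_features", "type_features")

  // Same assembler configuration as in the answer above.
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(Array("text_features", "color_features", "type_features"))
      .setOutputCol("features")
      .transform(df)

  def main(args: Array[String]): Unit = {
    // Local session, assumed here so the sketch runs without a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assemble-sketch")
      .getOrCreate()
    try {
      // Each "features" vector has 4 + 2 + 2 = 8 elements.
      assemble(toyData(spark)).select("object_id", "features").show(false)
    } finally {
      spark.stop()
    }
  }
}
```

The input columns are concatenated in the order passed to setInputCols, so the text weights come first, followed by the color and type indicators.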

For a PySpark example, see: Encode and assemble multiple features in PySpark
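Since the question's goal is cosine distance between objects, here is a rough sketch of how two assembled vectors might be compared. The object and function names are hypothetical, and converting to dense arrays is a simplification that can be wasteful for very large sparse vectors.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object CosineSketch {
  // Cosine similarity of two equal-length vectors.
  // Assumption: densifying with toArray is acceptable for the vector sizes
  // involved; a sparse-aware dot product would scale better.
  def cosineSimilarity(a: Vector, b: Vector): Double = {
    require(a.size == b.size, "vectors must have the same length")
    val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
    val denom = Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0)
    // Convention chosen for this sketch: similarity with a zero vector is 0.
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Cosine distance, defined here as 1 - similarity.
  def cosineDistance(a: Vector, b: Vector): Double =
    1.0 - cosineSimilarity(a, b)
}
```

One way to apply this to a DataFrame would be to wrap cosineDistance in a UDF and evaluate it over pairs of rows, though comparing every pair of objects grows quadratically with the number of rows.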
