Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

pandas - Add feature to countvectorizer from different column with FeatureUnion

I am trying to add an additional feature to a CountVectorizer matrix created with scikit-learn.

The workflow is as follows: I have a dataframe with one column of text and another column that contains an additional feature.

I first split my data into training and test dataframes. Then I apply the CountVectorizer to the text column of my training data and fit a RandomForest classifier with the CountVectorizer matrix as input.

What I am now trying to achieve is to run the RandomForest classifier on the matrix plus the additional feature, which is in another column of my dataframe.

What is the best way to do this? I have read about scikit-learn's FeatureUnion but could not get it working with a different column in my dataframe.

Here is a code example:

import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Split the data
x_train, x_test, y_train, y_test = train_test_split(df.drop(['gender'], axis=1), df['gender'], test_size=0.2)
df_x_train = pandas.DataFrame(x_train)
df_x_test = pandas.DataFrame(x_test)
df_y_train = pandas.DataFrame(y_train)
df_y_test  = pandas.DataFrame(y_test)

# Fit the vocabulary on the training text only, then reuse it for the test text
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(df_x_train['text']).toarray()
X_test = vectorizer.transform(df_x_test['text']).toarray()

# Now here I would like to add df['feature_new'] to my X_train and X_test

model = RandomForestClassifier()
model.fit(X_train, df_y_train['gender'])
...

question from:https://stackoverflow.com/questions/65876758/add-feature-to-countvectorizer-from-different-column-with-featureunion

1 Reply

You're looking for ColumnTransformer, not FeatureUnion. FeatureUnion applies each of its transformers to the whole input and concatenates the results, whereas ColumnTransformer lets you apply each transformer to specific columns.

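from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the 'text' column; pass all remaining columns through unchanged
preproc = ColumnTransformer(
    [('text_vect', CountVectorizer(), 'text')],
    remainder='passthrough',
)
x_train_preproc = preproc.fit_transform(x_train)
x_test_preproc = preproc.transform(x_test)

model.fit(x_train_preproc, y_train)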

You could add another transformer for the other column(s) instead of just passing them through with remainder. I'd also consider using a Pipeline to put the model into the same object as the preprocessing; it saves you from wrangling the "preprocessed" datasets yourself. Note that the column specification in ColumnTransformer is a little finicky about dimensionality: text preprocessors generally need one-dimensional input, so give the column as a plain string ('text') rather than a list (['text']).
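
A minimal sketch of the Pipeline suggestion, assuming a toy frame that reuses the question's column names (text, feature_new, gender). The data values, the StandardScaler step, the step names, and the RandomForestClassifier settings are illustrative assumptions, not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration; only the column names come from the question.
df = pd.DataFrame({
    "text": ["good morning all", "see you later", "good night all", "later then"],
    "feature_new": [1.0, 3.5, 0.5, 4.0],
    "gender": ["f", "m", "f", "m"],
})

preproc = ColumnTransformer([
    # Scalar column name: CountVectorizer receives a 1-D Series of strings.
    ("text_vect", CountVectorizer(), "text"),
    # List of column names: StandardScaler receives a 2-D frame.
    ("num", StandardScaler(), ["feature_new"]),
])

# Preprocessing and model live in one object, so there are no
# intermediate "preprocessed" datasets to keep track of.
clf = Pipeline([
    ("preproc", preproc),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

X = df.drop(columns=["gender"])
y = df["gender"]
clf.fit(X, y)
predictions = clf.predict(X)

# Number of columns the model actually sees: vocabulary size plus one numeric column.
n_features = clf.named_steps["preproc"].transform(X).shape[1]
```

With real data you would call clf.fit on the training split and clf.predict on the test split; the vectorizer's vocabulary is then learned from the training text only.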

The ColumnTransformer, at least as I've given it, needs dataframes as inputs (so that 'text' refers to the column name). The output of train_test_split will be frames if the inputs were, and sklearn methods all take frames as input just fine, so just drop the frame-casting (pandas.DataFrame(...)) and the array-casting (.toarray()).
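
To illustrate the point about frames, here is a small sketch on made-up data (the values and random_state are assumptions). It checks that the splits keep their pandas types and feeds them to the ColumnTransformer without any re-wrapping or .toarray() call.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration; column names follow the question.
df = pd.DataFrame({
    "text": ["aa bb", "bb cc", "cc dd", "dd ee", "ee aa"],
    "feature_new": [1, 2, 3, 4, 5],
    "gender": ["f", "m", "f", "m", "f"],
})

x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=["gender"]), df["gender"], test_size=0.2, random_state=0
)

# The splits are still a DataFrame and a Series, so column names keep working.
split_types = (type(x_train).__name__, type(y_train).__name__)

preproc = ColumnTransformer(
    [("text_vect", CountVectorizer(), "text")],
    remainder="passthrough",  # feature_new is appended after the token counts
)
x_train_preproc = preproc.fit_transform(x_train)
x_test_preproc = preproc.transform(x_test)
```

Both outputs have the same number of columns, because the vocabulary is fixed when fit_transform runs on the training frame.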

