You're looking for `ColumnTransformer`, not `FeatureUnion`. The latter applies multiple transformers to every column, whereas the former lets you apply transformers to specific columns.
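
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    # Vectorize the 'text' column; pass every other column through unchanged.
    preproc = ColumnTransformer(
        [('text_vect', CountVectorizer(), 'text')],
        remainder='passthrough',
    )
    # x_train, x_test, y_train and model are the objects from the question.
    x_train_preproc = preproc.fit_transform(x_train)
    x_test_preproc = preproc.transform(x_test)
    model.fit(x_train_preproc, y_train)
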
You could add another transformer for the other column(s) instead of just passing them through with `remainder`. And I'd consider using a `Pipeline` to put the model into the same object as the preprocessing; it saves you from wrangling the intermediate "preprocessed" datasets yourself. Note that the column specification in `ColumnTransformer` is a little finicky as to dimensionality: text preprocessors generally need one-dimensional input, so give the column as a bare string (`'text'`) rather than a list (`['text']`), which would select a two-dimensional frame.
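
As a rough sketch of both suggestions together (a second transformer plus a `Pipeline`), something like the following should work. The toy data, the `length` column, and the choice of `StandardScaler` and `LogisticRegression` are assumptions made purely for illustration; substitute your own columns and model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and one numeric column.
x_train = pd.DataFrame({
    'text': ['good movie', 'bad movie', 'great film', 'terrible film'],
    'length': [10.0, 9.0, 10.0, 13.0],
})
y_train = [1, 0, 1, 0]

preproc = ColumnTransformer([
    # A bare string selects a 1-D column, which CountVectorizer expects.
    ('text_vect', CountVectorizer(), 'text'),
    # A list selects a 2-D block, which StandardScaler expects.
    ('num_scale', StandardScaler(), ['length']),
])

# Preprocessing and model live in one object, so there are no
# intermediate "preprocessed" datasets to keep track of.
clf = Pipeline([
    ('preproc', preproc),
    ('model', LogisticRegression()),  # assumed model, for illustration only
])

clf.fit(x_train, y_train)
preds = clf.predict(x_train)
```

With this arrangement you call `clf.fit` and `clf.predict` on the raw frames, and the same fitted preprocessing is applied to train and test data automatically.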
The `ColumnTransformer`, at least as I've given it, needs dataframes as inputs (so that `'text'` refers to the column name). The output of `train_test_split` will be frames if the inputs were, and sklearn methods generally take frames as input just fine, so you can drop both the frame-casting and the array-casting `.toarray()` calls.
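
A quick way to convince yourself that `train_test_split` keeps frames as frames is a check along these lines. The toy frame and its `length` column are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    'text': ['a b', 'c d', 'e f', 'g h', 'i j', 'k l'],
    'length': [3, 3, 3, 3, 3, 3],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    df, y, test_size=0.5, random_state=0
)

# The splits are still DataFrames with the original column names,
# so 'text' can be used as a column selector downstream.
print(type(x_train).__name__, list(x_train.columns))
```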