How to perform feature selection with GridSearchCV in sklearn in Python

Question

I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.

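    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold

    X = df[my_features]       # all my features (my_features is a list of column names)
    y = df['gold_standard']   # labels

    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
    rfecv.fit(X, y)

    print("Optimal number of features : %d" % rfecv.n_features_)
    features = list(X.columns[rfecv.support_])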

I also run GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

    X = df[my_features]       # all my features (my_features is a list of column names)
    y = df['gold_standard']   # labels

    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
    param_grid = {
        'n_estimators': [200, 500],
        # 'auto' is only accepted by older scikit-learn releases
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth': [4, 5, 6, 7, 8],
        'criterion': ['gini', 'entropy'],
    }
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
    CV_rfc.fit(x_train, y_train)

    print(CV_rfc.best_params_)
    print(CV_rfc.best_score_)
    print(CV_rfc.best_estimator_)

    # Evaluate the tuned model on the held-out test split
    pred = CV_rfc.predict_proba(x_test)[:, 1]
    print(roc_auc_score(y_test, pred))

However, I am not sure how to combine the feature selection (RFECV) with GridSearchCV.

EDIT:

When I run the answer suggested by @Gambit, I get the following error:

ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False), estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced', criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False), min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1, verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

I was able to resolve the above issue by using the estimator__ prefix for the parameter names in param_grid.

My question now is: how do I use the selected features and parameters on x_test to check whether the model works well on unseen data? How can I obtain the best features and train the model on them with the optimal hyperparameters?

I am happy to provide more details if needed.
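As a quick illustration of why the prefix is needed, here is a minimal sketch that lists the parameter names an RFECV object exposes, as the error message suggests. The forest settings here are arbitrary choices for the example, not taken from the original code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# An unfitted RFECV wrapping a random forest (settings chosen only for illustration)
rfecv = RFECV(estimator=RandomForestClassifier(random_state=42), cv=3, scoring="roc_auc")

# GridSearchCV can only set names that appear in get_params()
param_names = sorted(rfecv.get_params().keys())

# Parameters of the wrapped forest are exposed under the "estimator__" prefix
forest_params = [name for name in param_names if name.startswith("estimator__")]

print("criterion" in param_names)             # False: not a parameter of RFECV itself
print("estimator__criterion" in param_names)  # True: reaches the wrapped forest
print(forest_params)
```

So a grid key such as 'criterion' is rejected, while 'estimator__criterion' is routed to the RandomForestClassifier inside the RFECV.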

Venkatachalam · Accepted Answer

Basically, you want to tune the hyperparameters of your classifier (with cross-validation) after selecting features with recursive feature elimination (also with cross-validation).

The Pipeline object is meant for exactly this purpose: chaining data transformations and then applying an estimator.

You may also want to use a different model (GradientBoostingClassifier, etc.) for your final classification. That is possible with the following approach:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.33,
                                                        random_state=42)

    # this is the classifier used for feature selection
    clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                            random_state=42,
                                            class_weight="balanced")
    rfecv = RFECV(estimator=clf_featr_sele, step=1, cv=5, scoring='roc_auc')

    # you can have a different classifier for your final classifier
    clf = RandomForestClassifier(n_estimators=10,
                                 random_state=42,
                                 class_weight="balanced")
    CV_rfc = GridSearchCV(clf, param_grid={'max_depth': [2, 3]},
                          cv=5, scoring='roc_auc')

    pipeline = Pipeline([('feature_sele', rfecv), ('clf_cv', CV_rfc)])
    pipeline.fit(X_train, y_train)
    pipeline.predict(X_test)

Now you can apply this pipeline (including the feature selection) to the test data.
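To check the fitted pipeline on unseen data and see what it chose, something like the following sketch should work. It rebuilds the same two-step pipeline with smaller settings (step=5, cv=3, 10 trees), which are my own choices to keep the run short, then reads the selected-feature mask and the best hyperparameters from the fitted steps via named_steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Step 1: feature selection (reduced settings, assumed only to keep this quick)
selector = RFECV(RandomForestClassifier(n_estimators=10, random_state=42),
                 step=5, cv=3, scoring="roc_auc")
# Step 2: hyperparameter search for the final classifier
search = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=42),
                      param_grid={"max_depth": [2, 3]}, cv=3, scoring="roc_auc")

pipeline = Pipeline([("feature_sele", selector), ("clf_cv", search)])
pipeline.fit(X_train, y_train)

# The steps are fitted in place, so they can be inspected after fit()
fitted_selector = pipeline.named_steps["feature_sele"]
fitted_search = pipeline.named_steps["clf_cv"]

selected_mask = fitted_selector.support_   # boolean mask over the input columns
best_params = fitted_search.best_params_   # best grid point for the final model

# predict_proba applies the feature selection first, then the tuned classifier
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

print("Number of selected features:", fitted_selector.n_features_)
print("Best parameters:", best_params)
print("Test ROC AUC:", test_auc)
```

Because the test rows pass through the same fitted selector before reaching the classifier, there is no need to subset X_test by hand.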

Mohammed Kashif · Answer

You just need to pass the recursive feature elimination estimator directly to the GridSearchCV object. Something like this should work:

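    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

    X = df[my_features]       # all my features
    y = df['gold_standard']   # labels
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')

    # The grid is applied to the RFECV object, so the forest's parameters need
    # the 'estimator__' prefix; unprefixed names raise the ValueError quoted
    # in the question.
    param_grid = {
        'estimator__n_estimators': [200, 500],
        # 'auto' is only accepted by older scikit-learn releases
        'estimator__max_features': ['auto', 'sqrt', 'log2'],
        'estimator__max_depth': [4, 5, 6, 7, 8],
        'estimator__criterion': ['gini', 'entropy'],
    }
    k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    # ---- Just pass your RFECV object as the estimator here directly ----
    CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')
    CV_rfc.fit(x_train, y_train)

    print(CV_rfc.best_params_)
    print(CV_rfc.best_score_)
    print(CV_rfc.best_estimator_)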

gmds · Answer

You can do what you want by prefixing the names of the parameters you want to pass to the estimator with 'estimator__'.

    X = df[my_features]       # my_features is a list of column names
    y = df['gold_standard']

    clf = RandomForestClassifier(random_state=0, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')

    param_grid = {
        'estimator__n_estimators': [200, 500],
        # 'auto' is only accepted by older scikit-learn releases
        'estimator__max_features': ['auto', 'sqrt', 'log2'],
        'estimator__max_depth': [4, 5, 6, 7, 8],
        'estimator__criterion': ['gini', 'entropy'],
    }
    k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc')

    X_train, X_test, y_train, y_test = train_test_split(X, y)
    CV_rfc.fit(X_train, y_train)

Output on some fake data I made:

    {'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
    0.5653035605690997
    RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
          estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
              criterion='entropy', max_depth=6, max_features='auto',
              max_leaf_nodes=None, min_impurity_decrease=0.0,
              min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
              oob_score=False, random_state=0, verbose=0, warm_start=False),
          min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
          verbose=0)
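As for the follow-up in the question (getting the best features and checking the model on the held-out data), here is a rough sketch of one way to do it. It uses synthetic data from make_classification, made-up feature names f0..f7, and a deliberately tiny grid, all of which are my assumptions for the sake of a quick example. With the default refit=True, best_estimator_ is an RFECV refitted on the whole training split with the best parameters, so its support_ gives the selected features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for df[my_features] / df['gold_standard'] (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0,
                             class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring="roc_auc")

# Small grid to keep the run short; note the 'estimator__' prefix
param_grid = {"estimator__max_depth": [3, 5]}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold,
                      scoring="roc_auc")
search.fit(X_train, y_train)

# best_estimator_ is the RFECV refitted with the best hyperparameters
best_rfecv = search.best_estimator_
best_features = feature_names[best_rfecv.support_].tolist()

# predict_proba reduces X_test to the selected features before predicting
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best parameters:", search.best_params_)
print("Selected features:", best_features)
print("Test ROC AUC:", test_auc)
```

The final forest trained on only the selected features is available as best_rfecv.estimator_, should you need it on its own.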