How do I determine the threshold for selecting features in SelectFromModel()?

Question:

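I am using a random forest classifier for feature selection. I have 70 features in total and want to select the most important ones. The code below fits the classifier and prints the features from most to least important.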

Code:

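import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of `data` except the first
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Rank the features from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))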

Now I am trying to use SelectFromModel from sklearn.feature_selection, but how do I decide the threshold for a given dataset?

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

When I use threshold=0.15 and then try to train my model, I get an error saying that the data is too noisy or the selection is too strict.

But with threshold=0.015 I am able to train my model on the newly selected features. So how do I decide this threshold?

Answer:

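I would try the following approach:

1. Start with a low threshold, for example 1e-4.
2. Use SelectFromModel's fit and transform to reduce the features.
3. Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
4. Increase the threshold and repeat the steps above.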
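The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the data is a synthetic stand-in generated with make_classification (70 features, like the question), the threshold grid and the use of 3-fold cross-validated accuracy as the metric are arbitrary choices, and the names results and best_threshold are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: synthetic stand-in for the asker's data (70 features).
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []  # one (threshold, n_selected, mean CV accuracy) tuple per step

# Step 1 and step 4: start low and increase the threshold each iteration.
for threshold in np.logspace(-4, -1, 6):
    # Step 2: fit the selector and reduce the features.
    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=threshold,
    )
    sfm.fit(X_train, y_train)
    n_selected = int(sfm.get_support().sum())
    if n_selected == 0:
        break  # threshold is too strict: nothing is left to train on
    X_selected = sfm.transform(X_train)

    # Step 3: score the estimator on the selected features only.
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X_selected, y_train, cv=3, scoring="accuracy",
    ).mean()
    results.append((threshold, n_selected, score))

for threshold, n_selected, score in results:
    print("threshold=%.4g  features=%2d  accuracy=%.3f"
          % (threshold, n_selected, score))

# Pick the threshold with the best cross-validated score.
best_threshold, best_n, best_score = max(results, key=lambda r: r[2])
print("best threshold:", best_threshold)
```

Because the selector here is fitted on the full training set before cross-validation, the scores may be slightly optimistic. A stricter variant would put the selector and the classifier in a Pipeline so the selection is refitted inside each fold.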

With this approach you can estimate which threshold works best for your particular data and estimator.
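It may also help to look at the scale of the importances before choosing a range of thresholds. A random forest's feature_importances_ are normalised to sum to 1, so with 70 features the average importance is about 0.014, which is likely why threshold=0.15 left nothing to train on while 0.015 worked. The sketch below checks this on assumed synthetic data and also shows the string threshold "mean", which SelectFromModel accepts in place of a number; the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: synthetic stand-in for the asker's data (70 features).
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=10, random_state=0)

# threshold="mean" keeps the features whose importance is at least the mean.
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
)
sfm.fit(X, y)

importances = sfm.estimator_.feature_importances_
total = importances.sum()              # importances are normalised
mean_importance = importances.mean()   # so the mean is 1 / n_features
n_above_015 = int((importances >= 0.15).sum())
n_above_mean = int(sfm.get_support().sum())

print("sum of importances:      %.3f" % total)
print("mean importance:         %.4f" % mean_importance)
print("resolved threshold_:     %.4f" % sfm.threshold_)
print("features with >= 0.15:   %d" % n_above_015)
print("features with >= mean:   %d" % n_above_mean)
```

Scaled strings such as "1.25*mean" or "median" are accepted as well, which gives a starting point that adapts to the number of features.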
