sklearn StandardScaler results differ from manual calculation
Question:
I used sklearn's StandardScaler (mean removal and variance scaling) to scale a DataFrame and compared the result with a DataFrame where I subtracted the mean and divided by the standard deviation "manually". The comparison shows small but consistent differences. Can someone explain why? (The dataset I am using is this one: http://archive.ics.uci.edu/ml/datasets/Wine)
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("~/DataSets/WineDataSetItaly/wine.data.txt", names=["Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline"])
cols = list(df.columns)[1:] # I didn't want to scale the "Class" column
std_scal = StandardScaler()
standardized = std_scal.fit_transform(df[cols])
df_standardized_fit = pd.DataFrame(standardized, index=df.index, columns=df.columns[1:])
df_standardized_manual = (df - df.mean()) / df.std()
df_standardized_manual.drop("Class", axis=1, inplace=True)
df_differences = df_standardized_fit - df_standardized_manual
df_differences.iloc[:,:5]
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 0.004272 -0.001582 0.000653 -0.003290 0.005384
1 0.000693 -0.001405 -0.002329 -0.007007 0.000051
2 0.000554 0.000060 0.003120 -0.000756 0.000249
3 0.004758 -0.000976 0.001373 -0.002276 0.002619
4 0.000832 0.000640 0.005177 0.001271 0.003606
5 0.004168 -0.001455 0.000858 -0.003628 0.002421
Answer:
scikit-learn uses np.std, which by default computes the population standard deviation (the sum of squared deviations divided by the number of observations), whereas pandas computes the sample standard deviation (where the denominator is the number of observations minus 1); see the Wikipedia article on standard deviation. The N-1 divisor is a correction factor, controlled by the delta degrees of freedom (ddof), that makes the estimate of the population variance unbiased. So by default numpy and scikit-learn use ddof=0, while pandas uses ddof=1 (docs):
DataFrame.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
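To see the difference concretely, here is a minimal sketch (the toy values are made up for illustration, not taken from the wine data) comparing numpy's default ddof=0 with pandas' default ddof=1:

import numpy as np
import pandas as pd

# toy values chosen only to make the arithmetic easy to follow
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
s = pd.Series(x)

print(np.std(x))        # 2.0      population std (ddof=0), what numpy and StandardScaler use
print(s.std())          # ~2.1381  sample std (ddof=1), the pandas default
print(s.std(ddof=0))    # 2.0      pandas matches numpy once ddof=0 is passed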
If you change the pandas version to:
df_standardized_manual = (df - df.mean()) / df.std(ddof=0)
the differences are effectively zero:
Alcohol Malic acid Ash Alcalinity of ash Magnesium
0 -8.215650e-15 -5.551115e-16 3.191891e-15 0.000000e+00 2.220446e-16
1 -8.715251e-15 -4.996004e-16 3.441691e-15 0.000000e+00 0.000000e+00
2 -8.715251e-15 -3.955170e-16 2.886580e-15 -5.551115e-17 1.387779e-17
3 -8.437695e-15 -4.440892e-16 3.164136e-15 -1.110223e-16 1.110223e-16
4 -8.659740e-15 -3.330669e-16 2.886580e-15 5.551115e-17 2.220446e-16
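The remaining entries are just floating-point round-off. As a sketch (assuming df, cols, and df_standardized_fit from the code above are still in scope), you can confirm that the two results agree to machine precision:

import numpy as np

manual = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
print(np.allclose(df_standardized_fit, manual))   # True: only rounding error remains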