Python负二项式回归-结果与R中的结果不匹配
问题内容:
我正在尝试使用Python进行负二项式回归。我发现使用R以及数据集的示例:
http://www.karlin.mff.cuni.cz/~pesta/NMFM404/NB.html
我尝试使用以下代码在网页上复制结果:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
df = pd.read_stata("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/nb_data.dta")
model = smf.glm(formula = "daysabs ~ math + prog", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
不幸的是,这没有给出相同的系数。它给出了以下内容:
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 3.4875 0.236 14.808 0.000 3.026 3.949
math -0.0067 0.003 -2.600 0.009 -0.012 -0.002
prog -0.6781 0.101 -6.683 0.000 -0.877 -0.479
这些甚至与网站上的内容不尽相同。假设R代码正确,我在做什么错?
问题答案:
差异的原因是,当您使用Pandas读取数据集时,默认情况下将prog
变量视为类型float
:
df.prog.head()
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
Name: prog, dtype: float32
另一方面,在R示例中,该prog
变量已显式转换为因子(类别)变量:
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
结果,当您查看R中的拟合摘要时,可以看到该prog
变量已被拆分为n-1个二进制编码的术语:
> summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
将此与prog
变量在您发布的Python适合摘要中的显示方式进行比较。
要解决此问题,您可以使用C()
函数将变量强制转换为statsmodels中的类别。这样,您将获得相同的结果:
model = smf.glm(formula = "daysabs ~ math + C(prog)", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: daysabs No. Observations: 314
Model: GLM Df Residuals: 310
Model Family: NegativeBinomial Df Model: 3
Link Function: log Scale: 1.06830885199
Method: IRLS Log-Likelihood: -865.68
Date: Thu, 16 Feb 2017 Deviance: 350.98
Time: 10:34:16 Pearson chi2: 331.
No. Iterations: 6
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 2.6150 0.207 12.630 0.000 2.209 3.021
C(prog)[T.2.0] -0.4408 0.192 -2.302 0.021 -0.816 -0.065
C(prog)[T.3.0] -1.2786 0.210 -6.079 0.000 -1.691 -0.866
math -0.0060 0.003 -2.281 0.023 -0.011 -0.001
==================================================================================
"""