Missing data#
Missing data are values that are absent for some records in a dataset. Simply dropping incomplete records (listwise deletion) discards information and can bias estimates, which motivates the imputation methods below.
Imputation methods#
Single imputation#
- Mean imputation: fill a missing value with the mean computed from the non-missing records of that variable.
- Regression imputation: fill a missing value with a prediction made from the record's other, non-missing variables (both are sketched below).
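A minimal sketch of both methods, assuming a small pandas DataFrame with a partially missing column x1 (the toy data and names here are only illustrative):

# Toy data: x1 has missing values, y is fully observed
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({"x1": [0.5, np.nan, 0.6, np.nan, 0.2],
                    "y": [1.0, 2.0, 3.0, 4.0, 0.5]})

# Mean imputation: fill NaN with the mean of the observed x1 values
toy_mean = toy.copy()
toy_mean["x1"] = toy_mean["x1"].fillna(toy_mean["x1"].mean())

# Regression imputation: predict x1 from y using the complete rows
toy_reg = toy.copy()
fit = smf.ols("x1 ~ y", data=toy_reg.dropna()).fit()
na = toy_reg["x1"].isna()
toy_reg.loc[na, "x1"] = fit.predict(toy_reg.loc[na])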
Multiple imputation#
- Create multiple (\(k\in \mathbb{N}\)) completed datasets by sampling the imputed values with pseudo-random draws.
- Run the analysis \(k\) times, once per completed dataset, and combine the \(k\) results (a sketch of the combining step follows this list).
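A minimal sketch of the combining step, assuming the standard Rubin's rules for pooling the \(k\) point estimates and their sampling variances:

# Rubin's rules: pool k point estimates and their sampling variances
import numpy as np

def pool_rubin(estimates, variances):
    q = np.asarray(estimates)   # point estimate from each imputed dataset
    u = np.asarray(variances)   # sampling variance from each imputed dataset
    k = len(q)
    q_bar = q.mean()            # pooled point estimate
    w = u.mean()                # within-imputation variance
    b = q.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / k) * b     # total variance of the pooled estimate
    return q_bar, t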
MICE#
Classical multiple imputation was based on MCMC, but MICE (multivariate imputation by chained equations), which uses the FCS (fully conditional specification) algorithm sketched below, has become the popular choice.
高橋将宜 & 伊藤孝之 (2014). 様々な多重代入法アルゴリズムの比較〜大規模経済系データを用いた分析〜. 統計研究彙報, 71(3), 39-82.
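A minimal sketch of one FCS sweep, assuming each incomplete column is re-imputed in turn by a linear regression on all the other columns (the initial mean fill, the linear model, and the Gaussian perturbation are simplifying assumptions; MICE repeats such sweeps many times):

# One chained-equations (FCS) sweep over the incomplete columns
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fcs_sweep(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    df = df.copy()
    na = df.isna()
    df = df.fillna(df.mean())   # crude start so every regressor is complete
    for col in df.columns[na.any()]:
        obs = ~na[col]
        X = sm.add_constant(df.drop(columns=col))
        model = sm.OLS(df.loc[obs, col], X[obs]).fit()  # fit on observed rows only
        pred = pd.Series(np.asarray(model.predict(X)), index=df.index)
        # Gaussian perturbation so imputations carry residual uncertainty
        noise = rng.normal(scale=np.sqrt(model.scale), size=na[col].sum())
        df.loc[na[col], col] = pred[na[col]] + noise
    return df

Calling fcs_sweep repeatedly and keeping each completed copy yields the \(k\) datasets described in the previous section.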
MICE in the statsmodels package#
Multiple Imputation with Chained Equations - statsmodels 0.14.1
An example run of MICE#
# Generate the complete (fully observed) dataset
import numpy as np
import pandas as pd

np.random.seed(0)
n = 20
x1 = np.random.uniform(size=n)
x2 = np.random.normal(size=n)
e = np.random.normal(size=n)
y = 2 + 5*x1 + 3*x2 + e

df_full = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
df_full
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.548814 | 1.494079 | 10.456595 |
| 1 | 0.715189 | -0.205158 | 6.162852 |
| 2 | 0.602763 | 0.313068 | 5.565693 |
| 3 | 0.544883 | -0.854096 | 1.859826 |
| 4 | 0.423655 | -2.552990 | -4.589248 |
| 5 | 0.645894 | 0.653619 | 5.770308 |
| 6 | 0.437587 | 0.864436 | 5.074974 |
| 7 | 0.891773 | -0.742165 | 6.183145 |
| 8 | 0.963663 | 2.269755 | 13.117925 |
| 9 | 0.383442 | -1.454366 | -0.883964 |
| 10 | 0.791725 | 0.045759 | 4.843105 |
| 11 | 0.528895 | -0.187184 | 4.860413 |
| 12 | 0.568045 | 1.532779 | 7.824663 |
| 13 | 0.925597 | 1.469359 | 10.823319 |
| 14 | 0.071036 | 0.154947 | 1.924556 |
| 15 | 0.087129 | 0.378163 | 3.957037 |
| 16 | 0.020218 | -0.887786 | -1.073070 |
| 17 | 0.832620 | -1.980796 | -0.959922 |
| 18 | 0.778157 | -0.347912 | 4.818865 |
| 19 | 0.870012 | 0.156349 | 7.247440 |
# Pairwise relationships among the variables
import seaborn as sns
sns.pairplot(df_full, height=1.5)
(Figure: pairplot of x1, x2, and y)
# Deliberately knock out 20% of the x1 values
df = df_full.copy()
missing_idx = df.sample(frac=0.2, random_state=0).index
df.loc[missing_idx, "x1"] = None
df
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.548814 | 1.494079 | 10.456595 |
| 1 | NaN | -0.205158 | 6.162852 |
| 2 | 0.602763 | 0.313068 | 5.565693 |
| 3 | 0.544883 | -0.854096 | 1.859826 |
| 4 | 0.423655 | -2.552990 | -4.589248 |
| 5 | 0.645894 | 0.653619 | 5.770308 |
| 6 | 0.437587 | 0.864436 | 5.074974 |
| 7 | 0.891773 | -0.742165 | 6.183145 |
| 8 | NaN | 2.269755 | 13.117925 |
| 9 | 0.383442 | -1.454366 | -0.883964 |
| 10 | 0.791725 | 0.045759 | 4.843105 |
| 11 | 0.528895 | -0.187184 | 4.860413 |
| 12 | 0.568045 | 1.532779 | 7.824663 |
| 13 | 0.925597 | 1.469359 | 10.823319 |
| 14 | 0.071036 | 0.154947 | 1.924556 |
| 15 | 0.087129 | 0.378163 | 3.957037 |
| 16 | 0.020218 | -0.887786 | -1.073070 |
| 17 | 0.832620 | -1.980796 | -0.959922 |
| 18 | NaN | -0.347912 | 4.818865 |
| 19 | NaN | 0.156349 | 7.247440 |
Generating imputed datasets with MICE#
# Generate imputed datasets with MICE
from statsmodels.imputation import mice

samples = []
imp = mice.MICEData(df, perturbation_method='gaussian', k_pmm=20)
imp.set_imputer(endog_name='x1', formula='y + x2')  # impute x1 from y and x2
for j in range(100):
    imp.update_all()  # one full chained-equations update of the missing values
    samples.append(imp.data.loc[missing_idx, "x1"].to_dict())
samples = pd.DataFrame(samples)
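Since df was created by knocking values out of df_full, the draws can be compared with the true values they replaced; a quick check, assuming the samples DataFrame from above:

# Compare each record's imputed draws with the value that was removed
check = pd.DataFrame({
    "true": df_full.loc[missing_idx, "x1"],
    "imputed_mean": samples.mean(),
    "imputed_std": samples.std(),
})
print(check)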
# Distribution of the imputed values; vertical lines mark the true values
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
colors = ["steelblue", "darkorange", "mediumseagreen", "tomato"]
for color, i in zip(colors, samples.columns):
    ax.axvline(df_full.loc[i, "x1"], color=color)
    ax.hist(samples[i], label=f"sample-{i}", alpha=0.5, color=color, bins=20)
ax.legend()
fig.show()
Estimation with MICE#
# Fit OLS on each imputed dataset and pool the results
import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(df)
fml = "y ~ x1 + x2"
m = mice.MICE(fml, sm.OLS, imp)
results = m.fit(n_burnin=10, n_imputations=10)
print(results.summary())
                        Results: MICE
=================================================================
Method:                 MICE         Sample size:             20
Model:                  OLS          Scale                  1.90
Dependent variable:     y            Num. imputations         10
-----------------------------------------------------------------
             Coef.  Std.Err.    t     P>|t|  [0.025  0.975]   FMI
-----------------------------------------------------------------
Intercept   2.0910   0.8916  2.3453  0.0190  0.3435 3.8385 0.4320
x1          4.6978   1.4068  3.3395  0.0008  1.9406 7.4550 0.3596
x2          3.1663   0.2912 10.8730  0.0000  2.5955 3.7370 0.1961
=================================================================
The pooled estimates deviate noticeably from the true coefficients (intercept 2, x1 coefficient 5, x2 coefficient 3). For comparison, fit OLS directly on df; the formula API drops the rows with missing x1 (listwise deletion, n = 16):

import statsmodels.formula.api as smf
print(smf.ols(fml, data=df).fit().summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.940
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     102.2
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.12e-08
Time:                        04:55:54   Log-Likelihood:                -22.843
No. Observations:                  16   AIC:                             51.69
Df Residuals:                      13   BIC:                             54.00
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3872      0.610      2.276      0.040       0.070       2.704
x1             5.3861      1.038      5.191      0.000       3.144       7.628
x2             3.0146      0.236     12.760      0.000       2.504       3.525
==============================================================================
Omnibus:                        1.792   Durbin-Watson:                   2.266
Prob(Omnibus):                  0.408   Jarque-Bera (JB):                1.322
Skew:                           0.665   Prob(JB):                        0.516
Kurtosis:                       2.539   Cond. No.                         5.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
# For reference: OLS on the complete data
import statsmodels.formula.api as smf
print(smf.ols(fml, data=df_full).fit().summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     154.6
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.24e-11
Time:                        04:55:55   Log-Likelihood:                -27.831
No. Observations:                  20   AIC:                             61.66
Df Residuals:                      17   BIC:                             64.65
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3384      0.561      2.384      0.029       0.154       2.523
x1             5.6626      0.876      6.463      0.000       3.814       7.511
x2             2.9649      0.204     14.558      0.000       2.535       3.395
==============================================================================
Omnibus:                        1.327   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.515   Jarque-Bera (JB):                1.136
Skew:                           0.418   Prob(JB):                        0.567
Kurtosis:                       2.186   Cond. No.                         5.19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.