Missing Data#

Missing data are records in which the values of some variables were not observed. This page summarizes imputation approaches for handling them.

Imputation#

Single Imputation#

  • Mean imputation: replace a missing value with the mean computed from the records in which the variable is observed

  • Regression imputation: replace a missing value with a prediction based on the record's other, non-missing variables (a minimal sketch of both methods follows this list)
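
A minimal sketch of both methods with pandas/NumPy (the toy DataFrame df_toy and all names in it are made up for illustration):

# Single imputation (sketch): mean imputation and regression imputation
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({"x1": [0.5, np.nan, 0.6, np.nan, 0.4],
                       "y": [10.5, 6.2, 5.6, 1.9, -4.6]})

# Mean imputation: fill missing x1 with the mean of the observed x1
df_mean = df_toy.copy()
df_mean["x1"] = df_mean["x1"].fillna(df_mean["x1"].mean())

# Regression imputation: predict missing x1 from y, fit on complete records
obs = df_toy.dropna(subset=["x1"])
slope, intercept = np.polyfit(obs["y"], obs["x1"], deg=1)  # simple OLS line
miss = df_toy["x1"].isna()
df_reg = df_toy.copy()
df_reg.loc[miss, "x1"] = intercept + slope * df_reg.loc[miss, "y"]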

Multiple Imputation#

  1. Create multiple (\(k\in \mathbb{N}\)) datasets in which the imputed values are sampled using pseudo-random numbers

  2. Run the analysis \(k\) times, once on each of the \(k\) completed datasets

  3. Pool the \(k\) sets of estimates into a single inference (see the pooling formulas below)
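
For step 3, the standard pooling formulas (Rubin's rules) combine the \(k\) point estimates \(\hat{\theta}_j\) and their estimated variances \(\hat{U}_j\) as

\[
\bar{\theta} = \frac{1}{k}\sum_{j=1}^{k}\hat{\theta}_j,
\qquad
T = \bar{U} + \left(1 + \frac{1}{k}\right)B,
\]

where \(\bar{U} = \frac{1}{k}\sum_{j=1}^{k}\hat{U}_j\) is the within-imputation variance and \(B = \frac{1}{k-1}\sum_{j=1}^{k}(\hat{\theta}_j - \bar{\theta})^2\) is the between-imputation variance.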

MICE#

Classical multiple imputation relied on MCMC, but MICE (multivariate imputation by chained equations), which uses the FCS (fully conditional specification) algorithm, has become popular. A conceptual sketch of an FCS sweep follows the reference below.

高橋将宜 & 伊藤孝之 (2014). 様々な多重代入法アルゴリズムの比較～大規模経済系データを用いた分析～. 統計研究彙報, 71(3), 39-82.
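
A conceptual sketch of one FCS sweep (fcs_sweep is a hypothetical helper, not the statsmodels implementation; it assumes the missing entries were given initial fills, e.g. column means, and that miss_mask records which entries were originally missing):

# One FCS (chained equations) sweep: re-impute each incomplete variable
# by regressing it on all the other variables using the current fills
import numpy as np
import statsmodels.api as sm

def fcs_sweep(df_filled, miss_mask, rng):
    for col in df_filled.columns:
        m = miss_mask[col]
        if not m.any():
            continue
        # Fit the conditional model for this column on the observed rows
        X = sm.add_constant(df_filled.drop(columns=col))
        fit = sm.OLS(df_filled.loc[~m, col], X.loc[~m]).fit()
        # Redraw the missing entries from the fitted predictive distribution
        noise = rng.normal(scale=np.sqrt(fit.scale), size=int(m.sum()))
        df_filled.loc[m, col] = fit.predict(X.loc[m]) + noise
    return df_filled

Repeating such sweeps and saving the completed dataset every few iterations yields the \(k\) datasets of step 1 above.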

MICE in the statsmodels Package#

Multiple Imputation with Chained Equations - statsmodels 0.14.1

Example of Running MICE#

# Generate data: the complete dataset
import numpy as np
np.random.seed(0)

n = 20
x1 = np.random.uniform(size=n)
x2 = np.random.normal(size=n)
e = np.random.normal(size=n)
y = 2 + 5*x1 + 3*x2 + e

import pandas as pd
df_full = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
df_full
x1 x2 y
0 0.548814 1.494079 10.456595
1 0.715189 -0.205158 6.162852
2 0.602763 0.313068 5.565693
3 0.544883 -0.854096 1.859826
4 0.423655 -2.552990 -4.589248
5 0.645894 0.653619 5.770308
6 0.437587 0.864436 5.074974
7 0.891773 -0.742165 6.183145
8 0.963663 2.269755 13.117925
9 0.383442 -1.454366 -0.883964
10 0.791725 0.045759 4.843105
11 0.528895 -0.187184 4.860413
12 0.568045 1.532779 7.824663
13 0.925597 1.469359 10.823319
14 0.071036 0.154947 1.924556
15 0.087129 0.378163 3.957037
16 0.020218 -0.887786 -1.073070
17 0.832620 -1.980796 -0.959922
18 0.778157 -0.347912 4.818865
19 0.870012 0.156349 7.247440
# Relationships among the variables
import seaborn as sns
sns.pairplot(df_full, height=1.5)
[Figure: seaborn pairplot of x1, x2, and y]
# Deliberately set some values to missing
df = df_full.copy()

missing_idx = df.sample(frac=0.2, random_state=0).index
df.loc[missing_idx, "x1"] = None
df
x1 x2 y
0 0.548814 1.494079 10.456595
1 NaN -0.205158 6.162852
2 0.602763 0.313068 5.565693
3 0.544883 -0.854096 1.859826
4 0.423655 -2.552990 -4.589248
5 0.645894 0.653619 5.770308
6 0.437587 0.864436 5.074974
7 0.891773 -0.742165 6.183145
8 NaN 2.269755 13.117925
9 0.383442 -1.454366 -0.883964
10 0.791725 0.045759 4.843105
11 0.528895 -0.187184 4.860413
12 0.568045 1.532779 7.824663
13 0.925597 1.469359 10.823319
14 0.071036 0.154947 1.924556
15 0.087129 0.378163 3.957037
16 0.020218 -0.887786 -1.073070
17 0.832620 -1.980796 -0.959922
18 NaN -0.347912 4.818865
19 NaN 0.156349 7.247440

Generating Multiple Imputed Datasets with MICE#

# Generate imputed datasets with MICE
from statsmodels.imputation import mice

samples = []
imp = mice.MICEData(df, perturbation_method='gaussian', k_pmm=20)
imp.set_imputer(endog_name='x1', formula='y + x2')  # impute x1 from y and x2
for j in range(100):
    imp.update_all()  # one full cycle of chained-equation updates
    samples.append(imp.data.loc[missing_idx, "x1"].to_dict())
samples = pd.DataFrame(samples)

# Distribution of the imputed values for each missing record
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
colors = ["steelblue", "darkorange", "mediumseagreen", "tomato"]
for color, i in zip(colors, samples.columns):
    ax.axvline(df_full.loc[i, "x1"], color=color)  # true value of x1
    ax.hist(samples[i], label=f"sample-{i}", alpha=0.5, color=color, bins=20)
ax.legend()
fig.show()
[Figure: histograms of the imputed x1 values for each missing record; vertical lines mark the true values]

Estimation with MICE#

# Fit y ~ x1 + x2 by OLS on each imputed dataset and pool the results
import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(df)
fml = "y ~ x1 + x2"
m = mice.MICE(fml, sm.OLS, imp)
results = m.fit(n_burnin=10, n_imputations=10)
print(results.summary())
                        Results: MICE
=============================================================
Method:                  MICE      Sample size:          20  
Model:                   OLS       Scale                 1.90
Dependent variable:      y         Num. imputations      10  
-------------------------------------------------------------
          Coef.  Std.Err.    t    P>|t|  [0.025 0.975]  FMI  
-------------------------------------------------------------
Intercept 2.0910   0.8916  2.3453 0.0190 0.3435 3.8385 0.4320
x1        4.6978   1.4068  3.3395 0.0008 1.9406 7.4550 0.3596
x2        3.1663   0.2912 10.8730 0.0000 2.5955 3.7370 0.1961
=============================================================
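
The pooled quantities can also be accessed programmatically; a small sketch, assuming the usual statsmodels results attributes (params, cov_params()):

# Pooled point estimates and covariance from the MICE fit above
print(results.params)        # pooled coefficients
print(results.cov_params())  # pooled covariance matrix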

The pooled estimates deviate noticeably from the complete-data fit shown at the end. For comparison, first fit OLS directly to the incomplete data (rows containing NaN are dropped, i.e., listwise deletion):

# OLS on the incomplete data (listwise deletion leaves n=16)
import statsmodels.formula.api as smf

print(smf.ols(fml, data=df).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.940
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     102.2
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.12e-08
Time:                        04:55:54   Log-Likelihood:                -22.843
No. Observations:                  16   AIC:                             51.69
Df Residuals:                      13   BIC:                             54.00
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3872      0.610      2.276      0.040       0.070       2.704
x1             5.3861      1.038      5.191      0.000       3.144       7.628
x2             3.0146      0.236     12.760      0.000       2.504       3.525
==============================================================================
Omnibus:                        1.792   Durbin-Watson:                   2.266
Prob(Omnibus):                  0.408   Jarque-Bera (JB):                1.322
Skew:                           0.665   Prob(JB):                        0.516
Kurtosis:                       2.539   Cond. No.                         5.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
For reference, the fit on the complete data:

# OLS on the complete data (no missing values, n=20)
import statsmodels.formula.api as smf

print(smf.ols(fml, data=df_full).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     154.6
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.24e-11
Time:                        04:55:55   Log-Likelihood:                -27.831
No. Observations:                  20   AIC:                             61.66
Df Residuals:                      17   BIC:                             64.65
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3384      0.561      2.384      0.029       0.154       2.523
x1             5.6626      0.876      6.463      0.000       3.814       7.511
x2             2.9649      0.204     14.558      0.000       2.535       3.395
==============================================================================
Omnibus:                        1.327   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.515   Jarque-Bera (JB):                1.136
Skew:                           0.418   Prob(JB):                        0.567
Kurtosis:                       2.186   Cond. No.                         5.19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.