Missing data#
Missing data are values that are absent for some records in a dataset. Simply dropping incomplete records (listwise deletion) discards information and can bias estimates, which motivates the imputation methods below.
Imputation methods#
Single imputation#
- Mean imputation: fill a missing value with the mean computed from the non-missing records of that variable.
- Regression imputation: fill a missing value with a prediction made from the record's other, non-missing variables (both are sketched below).
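A minimal sketch of both methods, assuming a small pandas DataFrame with a partially missing column x1 (the toy data and names here are only illustrative):

# Toy data: x1 has missing values, y is fully observed
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({"x1": [0.5, np.nan, 0.6, np.nan, 0.2],
                    "y": [1.0, 2.0, 3.0, 4.0, 0.5]})

# Mean imputation: fill NaN with the mean of the observed x1 values
toy_mean = toy.copy()
toy_mean["x1"] = toy_mean["x1"].fillna(toy_mean["x1"].mean())

# Regression imputation: predict x1 from y using the complete rows
toy_reg = toy.copy()
fit = smf.ols("x1 ~ y", data=toy_reg.dropna()).fit()
na = toy_reg["x1"].isna()
toy_reg.loc[na, "x1"] = fit.predict(toy_reg.loc[na])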
Multiple imputation#
- Create multiple (\(k\in \mathbb{N}\)) completed datasets by sampling the imputed values with pseudo-random draws.
- Run the analysis \(k\) times, once per completed dataset, and combine the \(k\) results (a sketch of the combining step follows this list).
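A minimal sketch of the combining step, assuming the standard Rubin's rules for pooling the \(k\) point estimates and their sampling variances:

# Rubin's rules: pool k point estimates and their sampling variances
import numpy as np

def pool_rubin(estimates, variances):
    q = np.asarray(estimates)   # point estimate from each imputed dataset
    u = np.asarray(variances)   # sampling variance from each imputed dataset
    k = len(q)
    q_bar = q.mean()            # pooled point estimate
    w = u.mean()                # within-imputation variance
    b = q.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / k) * b     # total variance of the pooled estimate
    return q_bar, t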
MICE#
Classical multiple imputation was based on MCMC, but MICE (multivariate imputation by chained equations), which uses the FCS (fully conditional specification) algorithm sketched below, has become the popular choice.
高橋将宜 & 伊藤孝之 (2014). 様々な多重代入法アルゴリズムの比較〜大規模経済系データを用いた分析〜. 統計研究彙報, 71(3), 39-82.
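A minimal sketch of one FCS sweep, assuming each incomplete column is re-imputed in turn by a linear regression on all the other columns (the initial mean fill, the linear model, and the Gaussian perturbation are simplifying assumptions; MICE repeats such sweeps many times):

# One chained-equations (FCS) sweep over the incomplete columns
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fcs_sweep(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    df = df.copy()
    na = df.isna()
    df = df.fillna(df.mean())   # crude start so every regressor is complete
    for col in df.columns[na.any()]:
        obs = ~na[col]
        X = sm.add_constant(df.drop(columns=col))
        model = sm.OLS(df.loc[obs, col], X[obs]).fit()  # fit on observed rows only
        pred = pd.Series(np.asarray(model.predict(X)), index=df.index)
        # Gaussian perturbation so imputations carry residual uncertainty
        noise = rng.normal(scale=np.sqrt(model.scale), size=na[col].sum())
        df.loc[na[col], col] = pred[na[col]] + noise
    return df

Calling fcs_sweep repeatedly and keeping each completed copy yields the \(k\) datasets described in the previous section.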
MICE in the statsmodels package#
Multiple Imputation with Chained Equations - statsmodels 0.14.1
An example run of MICE#
# Generate the complete (fully observed) dataset
import numpy as np
import pandas as pd

np.random.seed(0)
n = 20
x1 = np.random.uniform(size=n)
x2 = np.random.normal(size=n)
e = np.random.normal(size=n)
y = 2 + 5*x1 + 3*x2 + e

df_full = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
df_full
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.548814 | 1.494079 | 10.456595 |
| 1 | 0.715189 | -0.205158 | 6.162852 |
| 2 | 0.602763 | 0.313068 | 5.565693 |
| 3 | 0.544883 | -0.854096 | 1.859826 |
| 4 | 0.423655 | -2.552990 | -4.589248 |
| 5 | 0.645894 | 0.653619 | 5.770308 |
| 6 | 0.437587 | 0.864436 | 5.074974 |
| 7 | 0.891773 | -0.742165 | 6.183145 |
| 8 | 0.963663 | 2.269755 | 13.117925 |
| 9 | 0.383442 | -1.454366 | -0.883964 |
| 10 | 0.791725 | 0.045759 | 4.843105 |
| 11 | 0.528895 | -0.187184 | 4.860413 |
| 12 | 0.568045 | 1.532779 | 7.824663 |
| 13 | 0.925597 | 1.469359 | 10.823319 |
| 14 | 0.071036 | 0.154947 | 1.924556 |
| 15 | 0.087129 | 0.378163 | 3.957037 |
| 16 | 0.020218 | -0.887786 | -1.073070 |
| 17 | 0.832620 | -1.980796 | -0.959922 |
| 18 | 0.778157 | -0.347912 | 4.818865 |
| 19 | 0.870012 | 0.156349 | 7.247440 |
# Pairwise relationships among the variables
import seaborn as sns
sns.pairplot(df_full, height=1.5)
(Figure: pairplot of x1, x2, and y)
# Deliberately knock out 20% of the x1 values
df = df_full.copy()
missing_idx = df.sample(frac=0.2, random_state=0).index
df.loc[missing_idx, "x1"] = None
df
| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.548814 | 1.494079 | 10.456595 |
| 1 | NaN | -0.205158 | 6.162852 |
| 2 | 0.602763 | 0.313068 | 5.565693 |
| 3 | 0.544883 | -0.854096 | 1.859826 |
| 4 | 0.423655 | -2.552990 | -4.589248 |
| 5 | 0.645894 | 0.653619 | 5.770308 |
| 6 | 0.437587 | 0.864436 | 5.074974 |
| 7 | 0.891773 | -0.742165 | 6.183145 |
| 8 | NaN | 2.269755 | 13.117925 |
| 9 | 0.383442 | -1.454366 | -0.883964 |
| 10 | 0.791725 | 0.045759 | 4.843105 |
| 11 | 0.528895 | -0.187184 | 4.860413 |
| 12 | 0.568045 | 1.532779 | 7.824663 |
| 13 | 0.925597 | 1.469359 | 10.823319 |
| 14 | 0.071036 | 0.154947 | 1.924556 |
| 15 | 0.087129 | 0.378163 | 3.957037 |
| 16 | 0.020218 | -0.887786 | -1.073070 |
| 17 | 0.832620 | -1.980796 | -0.959922 |
| 18 | NaN | -0.347912 | 4.818865 |
| 19 | NaN | 0.156349 | 7.247440 |
Generating imputed datasets with MICE#
# Generate imputed datasets with MICE
from statsmodels.imputation import mice

samples = []
imp = mice.MICEData(df, perturbation_method='gaussian', k_pmm=20)
imp.set_imputer(endog_name='x1', formula='y + x2')  # impute x1 from y and x2
for j in range(100):
    imp.update_all()  # one full chained-equations update of the missing values
    samples.append(imp.data.loc[missing_idx, "x1"].to_dict())
samples = pd.DataFrame(samples)
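Since df was created by knocking values out of df_full, the draws can be compared with the true values they replaced; a quick check, assuming the samples DataFrame from above:

# Compare each record's imputed draws with the value that was removed
check = pd.DataFrame({
    "true": df_full.loc[missing_idx, "x1"],
    "imputed_mean": samples.mean(),
    "imputed_std": samples.std(),
})
print(check)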
# Distribution of the imputed values; vertical lines mark the true values
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
colors = ["steelblue", "darkorange", "mediumseagreen", "tomato"]
for color, i in zip(colors, samples.columns):
    ax.axvline(df_full.loc[i, "x1"], color=color)
    ax.hist(samples[i], label=f"sample-{i}", alpha=0.5, color=color, bins=20)
ax.legend()
fig.show()
Estimation with MICE#
# Fit OLS on each imputed dataset and pool the results
import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(df)
fml = "y ~ x1 + x2"
m = mice.MICE(fml, sm.OLS, imp)
results = m.fit(n_burnin=10, n_imputations=10)
print(results.summary())
                        Results: MICE
=================================================================
Method:                 MICE         Sample size:             20
Model:                  OLS          Scale                  1.90
Dependent variable:     y            Num. imputations         10
-----------------------------------------------------------------
             Coef.  Std.Err.    t     P>|t|  [0.025  0.975]   FMI
-----------------------------------------------------------------
Intercept   2.0910   0.8916  2.3453  0.0190  0.3435 3.8385 0.4320
x1          4.6978   1.4068  3.3395  0.0008  1.9406 7.4550 0.3596
x2          3.1663   0.2912 10.8730  0.0000  2.5955 3.7370 0.1961
=================================================================
The pooled estimates deviate noticeably from the true coefficients (intercept 2, x1 coefficient 5, x2 coefficient 3). For comparison, fit OLS directly on df; the formula API drops the rows with missing x1 (listwise deletion, n = 16):

import statsmodels.formula.api as smf
print(smf.ols(fml, data=df).fit().summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.940
Model:                            OLS   Adj. R-squared:                  0.931
Method:                 Least Squares   F-statistic:                     102.2
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.12e-08
Time:                        04:55:54   Log-Likelihood:                -22.843
No. Observations:                  16   AIC:                             51.69
Df Residuals:                      13   BIC:                             54.00
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3872      0.610      2.276      0.040       0.070       2.704
x1             5.3861      1.038      5.191      0.000       3.144       7.628
x2             3.0146      0.236     12.760      0.000       2.504       3.525
==============================================================================
Omnibus:                        1.792   Durbin-Watson:                   2.266
Prob(Omnibus):                  0.408   Jarque-Bera (JB):                1.322
Skew:                           0.665   Prob(JB):                        0.516
Kurtosis:                       2.539   Cond. No.                         5.13
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python3.10/site-packages/scipy/stats/_stats_py.py:1806: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
# For reference: OLS on the complete data
import statsmodels.formula.api as smf
print(smf.ols(fml, data=df_full).fit().summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     154.6
Date:                Fri, 29 Nov 2024   Prob (F-statistic):           1.24e-11
Time:                        04:55:55   Log-Likelihood:                -27.831
No. Observations:                  20   AIC:                             61.66
Df Residuals:                      17   BIC:                             64.65
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.3384      0.561      2.384      0.029       0.154       2.523
x1             5.6626      0.876      6.463      0.000       3.814       7.511
x2             2.9649      0.204     14.558      0.000       2.535       3.395
==============================================================================
Omnibus:                        1.327   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.515   Jarque-Bera (JB):                1.136
Skew:                           0.418   Prob(JB):                        0.567
Kurtosis:                       2.186   Cond. No.                         5.19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.