部分決定係数（partial R2）

部分決定係数（partial R2）#

Partial \(R^2\) （偏決定係数、部分決定係数、偏重相関などと呼ばれる）は Rosenbaum (2002) の改良として Imbens (2003)が提案した、線形回帰モデルにおける未観測の交絡因子の強さ（欠落変数バイアスの強さ）を把握する手法。

考え方#

結果\(Y\)、処置\(T\)、未観測の交絡因子\(U\)それぞれに分布型を仮定してパラメトリックモデルを構築する。

1. 未観測の交絡因子\(U\)のモデル#

未観測の交絡因子\(U\)は確率0.5のベルヌーイ分布に従うとする。つまり未観測の交絡因子が影響する/しないの値\({1,0}\)が\(1/2\)の確率で決まるとする。

\[ U_i \stackrel{\text { ind }}{\sim} \operatorname{Bern} \left(\frac{1}{2} \right) \]

2. 処置変数\(T\)のモデル#

処置\(T\)は二値変数で、ロジスティック回帰モデルで表現できると仮定する。観測済みの共変量\(X\)だけでなく、未観測の交絡\(U\)も影響するとして、処置\(T\)へ\(\alpha\)倍の強さで影響すると仮定する。

\[ T_i \mid X_i, U_i \stackrel{\text { ind }}{\sim} \operatorname{Bern}\left(\operatorname{sigmoid}\left(\gamma X_i+\alpha U_i\right)\right) \]

別の書き方をすれば

\[ P(T_i = 1 \mid X_i, U_i) = \frac{ 1 }{ 1 + \exp( -(\gamma X_i + \alpha U_i) )} \]

3. 結果変数\(Y\)のモデル#

結果\(Y\)と他の変数との関係には正規線形回帰モデルを仮定する。未観測の交絡\(U\)は結果\(Y\)へ\(\delta\)倍の強さで影響すると仮定する。

\[ Y_i \mid X_i, T_i, U_i \stackrel{\text { ind }}{\sim} \mathcal{N}\left(\tau T_i+\beta X_i+\delta U_i, \sigma^2\right) \]

別の書き方をすれば

\[ Y_i = \tau T_i + \beta X_i + \delta U_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]

となる。

推定#

\(U\)は未観測だが、\(\alpha, \delta\)をなんらかの値に固定すれば、他のパラメータは最尤推定できる。

したがって、\(\alpha, \delta\)をさまざまな値に変えたもとでの推定を行うことで、

未観測の交絡因子が割当や結果変数にどれくらい影響力があるか
因果効果の推定値\(\tau\)がどの程度変化するか

を推定することができる

部分決定係数#

未観測の交絡因子の影響を表す部分決定係数（partial \(R^2\)）は推定したパラメータをもとに構成される。

部分決定係数

\(Y\)の部分決定係数\(R^2_{Y, par}(\alpha, \delta)\)は次のように定義される

\[ R^2_{Y, par}(\alpha, \delta) = 1 - \frac{ RSS_{\text{full}} }{ RSS_{\text{reduced}} } \]

\(RSS_{\text{full}}\)：\(U\)を含むモデルの残差平方和
\(RSS_{\text{reduced}}\)：\(U\)を含まない、観測した変数のみのモデルの残差平方和

直感的な解釈としては「\(U\)を追加したことで説明できた部分がどれだけあるか」。\(U\)を追加して\(\text{full}\)モデルの説明力が\(\text{reduced}\)モデルより高くなれば、 \(RSS_{\text{full}} < RSS_{\text{reduced}}\) になる。

なので、

未観測の交絡因子\(U\)の影響が多い場合、\(RSS_{\text{full}} < RSS_{\text{reduced}}\)になり\(R^2_{Y, par}(\alpha, \delta)\)は 1に近づく
未観測の交絡因子\(U\)の影響が 少ない 場合、\(RSS_{\text{full}} \approx RSS_{\text{reduced}}\)になり\(R^2_{Y, par}(\alpha, \delta)\)は 0に近づく

ということになる。

→ ⚠️通常の決定係数は1に近いほどいいが、 部分決定係数は0に近いほどよい ことになる点に注意。

Imbens (2003) に記載の方法

未観測の交絡因子\(U\)を入れた回帰モデル

\[ Y_i \mid X_i, T_i, U_i \sim \mathcal{N}\left(\tau T_i+\beta X_i+\delta U_i, \sigma^2\right) \]

を最尤推定したときの残差の分散の推定量 \(\hat{\sigma}^2(\alpha, \delta)\) （残差平方和？）をもとに

\[ R_Y^2(\alpha, \delta)= \frac{ 1-\hat{\sigma}^2(\alpha, \delta) }{ \frac{1}{N}\sum_i(Y_i - \bar{Y})^2 } \]

という関数を構築する。\(\alpha=0, \delta = 0\)を仮定すれば\(U\)の影響がゼロになるので、「未観測の交絡因子がない」という仮定になる。

\(Y\)に対する部分決定係数\(R^2_{Y, par}(\alpha, \delta)\)は次のように定義される

\[\begin{split} \begin{aligned} R_{Y, \mathrm{par}}^2(\alpha, \delta) & =\frac{R_Y^2(\alpha, \delta)-R_Y^2(0,0)}{1-R_Y^2(0,0)} \\ & =\frac{\hat{\sigma}^2(0,0)-\hat{\sigma}^2(\alpha, \delta)}{\hat{\sigma}^2(0,0)} \end{aligned} \end{split}\]

処置変数への影響は

\[ R_T^2(\alpha, \delta) =\frac{\hat{\gamma}(\alpha, \delta)^{\prime} \boldsymbol{\Sigma}_X \hat{\gamma}(\alpha, \delta)+\alpha^2 / 4} {\hat{\gamma}(\alpha, \delta)^{\prime} \boldsymbol{\Sigma}_{\mathrm{x}} \hat{\gamma}(\alpha, \delta)+\alpha^2 / 4+\pi^2 / 3} \]

部分決定係数は

\[ R_{W, \mathrm{par}}^2(\alpha, \delta)=\frac{R_T^2(\alpha, \delta)-R_T^2(0,0)}{1-R_T^2(0,0)} \]

実装#

PySensemakrパッケージが便利。

PySensemakr: Sensitivity Analysis Tools for Regression Models in Python

import sensemakr as smkr
import statsmodels.formula.api as smf

# loads data
darfur = smkr.load_darfur()
# runs regression model
reg_model = smf.ols(formula='peacefactor ~ directlyharmed + age + farmer_dar + herder_dar + '\
                    'pastvoted + hhsize_darfur + female + village', data=darfur)
darfur_model = reg_model.fit()

# Create a sensemakr object and print summary of results
darfur_sense = smkr.Sensemakr(model = darfur_model,
                              treatment = "directlyharmed",
                              benchmark_covariates = ["female"],
                              kd = [1,2,3])

/home/runner/work/notes/notes/.venv/lib/python3.10/site-packages/sensemakr/sensitivity_statistics.py:177: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  return (t_statistic ** 2 / (t_statistic ** 2 + dof))[0]  # extracts float
/home/runner/work/notes/notes/.venv/lib/python3.10/site-packages/sensemakr/sensitivity_statistics.py:177: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  return (t_statistic ** 2 / (t_statistic ** 2 + dof))[0]  # extracts float

# 最小限のsummary
_ = darfur_sense.ovb_minimal_reporting()

	Outcome: peacefactor
Treatment	Est.	S.E.	t-value	R²_Y~D\|X	RV_{q = 1}	RV_{q = 1, α = 0.05}
directlyharmed	0.097	0.023	4.2	2.2%	13.9%	7.6%
Note: df = 783; Bound ( 1x female ): R²_Y~Z\|X,D = 12.5%, R²_D~Z\|X =0.9%

# summary全文
darfur_sense.summary()

Sensitivity Analysis to Unobserved Confounding

Model Formula: peacefactor ~ directlyharmed + age + farmer_dar + herder_dar + pastvoted + hhsize_darfur + female + village

Null hypothesis: q = 1 and reduce = True 

-- This means we are considering biases that reduce the absolute value of the current estimate.
-- The null hypothesis deemed problematic is H0:tau = 0.0 

Unadjusted Estimates of ' directlyharmed ':
  Coef. estimate: 0.097
  Standard Error: 0.023
  t-value: 4.184 

Sensitivity Statistics:
  Partial R2 of treatment with outcome: 0.022
  Robustness Value, q = 1 : 0.139
  Robustness Value, q = 1 alpha = 0.05 : 0.076 

Verbal interpretation of sensitivity statistics:

-- Partial R2 of the treatment with the outcome: an extreme confounder (orthogonal to the covariates)  that explains 100% of the residual variance of the outcome, would need to explain at least 2.187 % of the residual variance of the treatment to fully account for the observed estimated effect.

-- Robustness Value, q = 1 : unobserved confounders (orthogonal to the covariates) that  explain more than 13.878 % of the residual variance of both the treatment and the outcome are strong enough to bring the point estimate to 0.0 (a bias of 100.0 % of the original estimate). Conversely, unobserved confounders that do not explain more than 13.878 % of the residual variance of both the treatment and the outcome are not strong enough to bring the point estimate to 0.0 .

-- Robustness Value, q = 1 , alpha = 0.05 : unobserved confounders (orthogonal to the covariates) that explain more than 7.626 % of the residual variance of both the treatment and the outcome are strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0.0 (a bias of 100.0 % of the original estimate), at the significance level of alpha = 0.05 . Conversely, unobserved confounders that do not explain more than 7.626 % of the residual variance of both the treatment and the outcome are not strong enough to bring the estimate to a range where it is no longer 'statistically different' from 0.0 , at the significance level of alpha = 0.05 .

Bounds on omitted variable bias:
--The table below shows the maximum strength of unobserved confounders with association with the treatment and the outcome bounded by a multiple of the observed explanatory power of the chosen benchmark covariate(s).

  bound_label    r2dz_x   r2yz_dx       treatment  adjusted_estimate  \
0   1x female  0.009164  0.124641  directlyharmed           0.075220   
1   2x female  0.018329  0.249324  directlyharmed           0.052915   
2   3x female  0.027493  0.374050  directlyharmed           0.030396   

   adjusted_se  adjusted_t  adjusted_lower_CI  adjusted_upper_CI  
0     0.021873    3.438904           0.032283           0.118158  
1     0.020350    2.600246           0.012968           0.092862  
2     0.018670    1.628062          -0.006253           0.067045  

結果の読み方#

Sensitivity Statistics

の箇所に部分決定係数が表示されている

Sensitivity Statistics:
  Partial R2 of treatment with outcome: 0.022
  Robustness Value, q = 1 : 0.139
  Robustness Value, q = 1 alpha = 0.05 : 0.076 

Partial R2 of treatment with outcome：部分決定係数
Robustness Value, q = 1： Robustness Value は未観測の交絡因子が処置変数と結果変数に与える影響を定量的に測る指標。値が大きいほど影響を受けないことを示す。
- 例えばRobustness Valueが0.1なら、処置変数と結果変数の両方の残差の分散のうち10%を説明するほどの未観測の交絡因子が存在しない限り、処置の結果への影響はロバストであるという意味。
Robustness Value, q = 1 alpha = 0.05：統計的有意性も加味したもの。

– Partial R2 of the treatment with the outcome: an extreme confounder (orthogonal to the covariates) that explains 100% of the residual variance of the outcome, would need to explain at least 2.187 % of the residual variance of the treatment to fully account for the observed estimated effect.

(治療と結果の部分 R2: 結果の残差分散の 100% を説明する極端な交絡因子 (共変量に対して直交) は、観察された推定効果を完全に説明するには、治療の残差分散の少なくとも 2.187% を説明する必要があります。)

この説明を見ると、 \(R^2_{Y\sim D|X}\) は高いほうがいい指標の様子（E-valueと同様）

Contour plot#

# contour plot for the estimate
darfur_sense.plot()

../../_images/1596bb7c2e559a87e923c6dba1f9b4943ba20e0b354c43b141b58e3200d06d74.png

contour plotの見方

x軸が処置変数の部分決定係数
y軸が結果変数の部分決定係数
等高線：感度パラメーター（\alpha, \delta）の値のもとでのFullモデルで得られる、因果効果の調整済み推定値（adjusted estimate）
点：指定したcovariateの1倍、2倍、3倍の影響を持つ未観測の交絡因子のもとでの、因果効果の調整済み推定値（adjusted estimate）

# extreme scenarios plot
darfur_sense.plot(plot_type = 'extreme')

../../_images/4fc9ab025c4d770860976d6e6181b7976197ac9f335b4e6732b602e0312e2042.png

発展手法#

Cinelli & Hazlett (2020) は複数の交絡因子に対応するよう一般化した手法を提案し、 sensmakerパッケージとして提供

参考文献#

Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2), 126-132.
Cinelli, C., & Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1), 39-67.
星野崇宏. (2009). 調査観察データの統計科学: 因果推論・選択バイアス・データ融合.
金本拓. (2024). 因果推論: 基礎から機械学習・時系列解析・因果探索を用いた意思決定のアプローチ. 株式会社オーム社.