Bias in GBDT/Trees

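Issue	What is biased	Cause	Typical remedies
split selection bias	Which features the tree selects	Features with more candidate splits are more likely to yield a good split by chance	conditional inference trees, unbiased split selection, OOB evaluation
gain importance bias	Feature importance	Impurity/gain is accumulated on the training data	permutation importance, OOB gain, unbiased gain
prediction shift	Boosting predictions	Each sample's own target information leaks into the training process	CatBoost ordered boosting
high-cardinality bias	Overvaluation of continuous and many-category features	More split candidates and more degrees of freedom	categorical handling, regularization, OOB/validation evaluation
correlated feature bias	Importance split across correlated features; under- or overestimation	One feature can substitute for the other	grouped permutation importance, feature clustering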

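Bias in Gain/Feature Importance

Bias in gain estimates due to overfitting

When the model overfits, impurity-based importance can assign high importance to features that are not actually important; permutation importance computed on held-out data does not have this problem (scikit-learn documentation).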

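Hothorn et al. (2006) pointed out that the exhaustive, gain-maximizing split search of CART-style decision trees is biased in which variables it selects, and proposed a remedy.

Strobl et al. (2007) pointed out bias in Random Forest feature importance.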

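With the standard estimate of gain (information gain), the gain of a feature that is independent of the target is rarely exactly zero and is positive in most cases, so unimportant features still tend to receive positive gain. This problem can be addressed by estimating gain on a validation set in addition to the training set (Zhang et al., 2023).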
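The effect described above can be reproduced with a small simulation. The following is a minimal sketch, assuming a squared-error (variance-reduction) gain and a single pure-noise feature; the helper names `gain`, `best_threshold`, and `heldout_gain` are hypothetical and are not taken from any of the cited papers or libraries.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def gain(x, y, threshold):
    """Variance-reduction gain of the split x <= threshold, evaluated on (x, y) itself."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return sse(y) - sse(left) - sse(right)

def best_threshold(x, y):
    """Exhaustive search for the threshold with the largest training gain."""
    candidates = sorted(set(x))[:-1]  # the largest value would leave the right child empty
    return max(candidates, key=lambda t: gain(x, y, t))

def heldout_gain(x_tr, y_tr, x_va, y_va, threshold):
    """Reduction in held-out squared error when the leaf means come from the training data."""
    root = mean(y_tr)
    left = mean([yi for xi, yi in zip(x_tr, y_tr) if xi <= threshold])
    right = mean([yi for xi, yi in zip(x_tr, y_tr) if xi > threshold])
    before = sum((yi - root) ** 2 for yi in y_va)
    after = sum((yi - (left if xi <= threshold else right)) ** 2
                for xi, yi in zip(x_va, y_va))
    return before - after

rng = random.Random(0)
n, reps = 50, 200
train_gains, heldout_gains = [], []
for _ in range(reps):
    # The feature x is independent of the target y in both samples.
    x_tr = [rng.random() for _ in range(n)]
    y_tr = [rng.gauss(0, 1) for _ in range(n)]
    x_va = [rng.random() for _ in range(n)]
    y_va = [rng.gauss(0, 1) for _ in range(n)]
    t = best_threshold(x_tr, y_tr)
    train_gains.append(gain(x_tr, y_tr, t))
    heldout_gains.append(heldout_gain(x_tr, y_tr, x_va, y_va, t))

avg_train = mean(train_gains)
avg_heldout = mean(heldout_gains)
print(f"average training gain: {avg_train:.3f}")
print(f"average held-out gain: {avg_heldout:.3f}")
```

The training gain is positive in every repetition because the threshold is chosen to maximize it on the same data, while the held-out gain of the same split has no such guarantee and can be negative.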

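Bias toward high-cardinality features

Compared with low-cardinality features such as binary variables or categorical variables with few categories, high-cardinality features (typically continuous variables) receive higher gain estimates, and the split search also tends to choose high-cardinality features (scikit-learn documentation; Zhang et al., 2023; Hothorn et al., 2006).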
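A minimal sketch of this selection bias, assuming a squared-error gain and three features that are all independent of the target. The feature names and the helpers `split_gain` and `best_gain` are hypothetical; the four-level feature is treated as ordered, so it has three candidate thresholds.

```python
import random

def split_gain(x, y, t):
    """SSE reduction of the split x <= t: S_L^2/n_L + S_R^2/n_R - S^2/n."""
    left = [yi for xi, yi in zip(x, y) if xi <= t]
    right = [yi for xi, yi in zip(x, y) if xi > t]
    return (sum(left) ** 2 / len(left)
            + sum(right) ** 2 / len(right)
            - sum(y) ** 2 / len(y))

def best_gain(x, y):
    """Largest training gain over all thresholds between distinct values of x."""
    candidates = sorted(set(x))[:-1]  # the largest value would leave the right child empty
    return max((split_gain(x, y, t) for t in candidates), default=0.0)

rng = random.Random(0)
n, reps = 50, 300
wins = {"binary": 0, "four_level": 0, "continuous": 0}
for _ in range(reps):
    y = [rng.gauss(0, 1) for _ in range(n)]  # target independent of every feature
    features = {
        "binary": [rng.randrange(2) for _ in range(n)],
        "four_level": [rng.randrange(4) for _ in range(n)],
        "continuous": [rng.random() for _ in range(n)],
    }
    gains = {name: best_gain(x, y) for name, x in features.items()}
    wins[max(gains, key=gains.get)] += 1

print(wins)  # how often each pure-noise feature would be chosen for the split
```

Although none of the features carries information about the target, the feature with the most candidate thresholds has the most chances to produce a large gain by chance, so a gain-maximizing search selects it most often.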

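prediction shift

The "unbiased boosting" in CatBoost targets prediction shift: a form of target leakage in which, when the prediction for a sample is formed, that sample's own target information has already entered the earlier trees or the target statistics.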
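The leakage through target statistics can be illustrated in a few lines. This is a minimal sketch, not CatBoost's implementation: it assumes the smoothed form (sum of targets + a * prior) / (count + a), and the given row order stands in for the random permutation that ordered target statistics rely on. The function names are hypothetical.

```python
def greedy_target_stat(categories, targets, prior, a=1.0):
    """Target encoding that uses every row of the category, including the row itself."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [(sums[c] + a * prior) / (counts[c] + a) for c in categories]

def ordered_target_stat(categories, targets, prior, a=1.0):
    """Target encoding that uses only rows earlier in the order, never the row's own target."""
    sums, counts, encoded = {}, {}, []
    for c, y in zip(categories, targets):
        # Encode first, using the history only, then add this row to the history.
        encoded.append((sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a))
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return encoded

# An ID-like categorical feature: every value is unique, so it cannot generalize.
categories = ["a", "b", "c", "d"]
targets = [1, 0, 1, 0]

greedy = greedy_target_stat(categories, targets, prior=0.5)
ordered = ordered_target_stat(categories, targets, prior=0.5)
print(greedy)   # [0.75, 0.25, 0.75, 0.25]
print(ordered)  # [0.5, 0.5, 0.5, 0.5]
```

The greedy encoding separates the training targets perfectly even though the feature is useless, because each value encodes its own label. The ordered encoding falls back to the prior for every row here, since no row has an earlier row in the same category.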

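Remedies

Hothorn et al. (2006): treat split variable selection as a testing problem instead of maximizing gain

The Conditional Inference Tree (ctree) of Hothorn et al. avoids CART-style maximization of the impurity decrease (gain), \(\max_{j, s} \Delta I(j, s)\), where \(j\) is the feature index and \(s\) is the split point.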

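At each node:

Test, for each feature, the null hypothesis \(H_0^j: Y \perp X_j\) that the target \(Y\) and the feature \(X_j\) are independent
Select the feature most strongly associated with \(Y\), based on a test statistic or its p-value
Determine the split point within that feature

Because variable selection is separated from the split-point search, the advantage that continuous and many-category variables gain from having many candidate split points is mitigated by tests based on the conditional distribution.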
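The variable-selection step above can be sketched as follows. This is a simplified illustration, not the ctree implementation: it assumes numeric features, a covariance-type statistic, Monte Carlo permutation p-values, and a Bonferroni adjustment with a hypothetical significance level `alpha`. All function names are hypothetical.

```python
import random

def association(x, y):
    """Absolute covariance-type statistic |sum_i (x_i - mean(x)) * (y_i - mean(y))|."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return abs(sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)))

def permutation_p_value(x, y, n_perm, rng):
    """Monte Carlo p-value for H0: Y is independent of X, obtained by permuting y."""
    observed = association(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if association(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def select_split_variable(features, y, alpha=0.05, n_perm=500, rng=None):
    """Return the feature with the smallest adjusted p-value, or None if none is significant."""
    if rng is None:
        rng = random.Random(0)
    adjusted = {
        name: min(1.0, permutation_p_value(x, y, n_perm, rng) * len(features))  # Bonferroni
        for name, x in features.items()
    }
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted

rng = random.Random(0)
n = 100
signal = [rng.random() for _ in range(n)]
features = {
    "signal": signal,
    "noise_continuous": [rng.random() for _ in range(n)],
    "noise_binary": [rng.randrange(2) for _ in range(n)],
}
y = [2.0 * s + rng.gauss(0, 0.5) for s in signal]

selected, adjusted = select_split_variable(features, y, rng=rng)
print(selected, adjusted)
```

Each feature is judged by a p-value on a common scale rather than by the best gain over all of its thresholds, so the number of candidate split points does not enter the selection. The split point is searched only afterwards, within the selected feature.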

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. https://doi.org/10.1198/106186006X133933

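Strobl et al. (2007): mitigate RF importance bias with conditional inference trees and permutation importance

Strobl et al. (2007) showed that Random Forest variable importance is unreliable when features differ in scale of measurement or number of categories (for example, many-category and continuous features are favored).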

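The paper's remedy has two main parts:

Use conditional inference trees in the style of Hothorn et al., rather than CART, as the base trees of the Random Forest (the paper combines this with subsampling without replacement instead of bootstrap sampling)
Use a permutation-importance-type measure, rather than an impurity-decrease-type measure, as the variable importance

In follow-up work, Strobl et al. proposed conditional permutation importance, which permutes a feature while preserving its conditional distribution given the other features, because ordinary permutation importance becomes unreliable when features are correlated.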
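For reference, ordinary (unconditional) permutation importance can be sketched in a model-agnostic way. This is a minimal illustration under stated assumptions: `predict` stands for any already fitted model (here a hypothetical one that ignores the second column), the data is held out, and the error metric is mean squared error. It is not the cforest implementation and does not include the conditional variant.

```python
import random

def mse(predict, X, y):
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, rng=None):
    """Mean increase in held-out MSE after shuffling each column in turn."""
    if rng is None:
        rng = random.Random(0)
    baseline = mse(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # breaks the link between column j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def predict(row):
    """Hypothetical fitted model: uses column 0 and ignores column 1."""
    return 3.0 * row[0]

rng = random.Random(0)
X_val = [[rng.random(), rng.random()] for _ in range(200)]
y_val = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X_val]

importances = permutation_importance(predict, X_val, y_val, rng=rng)
print(importances)
```

A column the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction. Shuffling a column independently of the others is also what makes the measure questionable for correlated features, since it creates feature combinations that do not occur in the data.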

Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://link.springer.com/article/10.1186/1471-2105-8-25

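Zhang et al. (2023): debias the GBDT gain itself using OOB samples

Zhang et al. point out, more directly, the following two sources of bias in GBDT split gain.

The gain computation for each split is itself a biased estimate
The same data is used both to evaluate the split improvement and to choose the best split, which introduces selection bias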

The gain in modern GBDT implementations such as XGBoost / LightGBM has the form

\[ \text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda} \right] \]

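so the gradient and Hessian sums computed from the same training data are used to estimate how much the loss decreases when the split is added. Here \(G\) and \(H\) are the sums of the first and second derivatives of the loss over the samples in the node, the subscripts \(L\) and \(R\) denote the left and right children, and \(\lambda\) is the L2 regularization coefficient.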
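The gain formula above translates directly into code. The following is a minimal sketch, assuming squared loss \(\frac{1}{2}(y - \hat{y})^2\), for which the gradient is \(\hat{y} - y\) and the Hessian is 1. The function names are hypothetical, and the per-split complexity penalty that XGBoost subtracts (\(\gamma\)) is omitted, as in the formula.

```python
def grad_hess_squared_loss(y_true, y_pred):
    """Per-sample gradients and Hessians of 0.5 * (y_true - y_pred)^2 w.r.t. y_pred."""
    grads = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0] * len(y_true)
    return grads, hess

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)] with G = G_L + G_R."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))

# Four samples, current prediction 0 everywhere; the split sends the first two left.
y_true = [1.0, 1.0, -1.0, -1.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
g, h = grad_hess_squared_loss(y_true, y_pred)
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])

print(split_gain(G_L, H_L, G_R, H_R, lam=0.0))  # 2.0
```

Every quantity entering the gain is computed from the training samples that were also used to choose the split, which is where the two bias sources listed earlier come in.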

The Unbiased Gain of Zhang et al. evaluates this estimate on out-of-bag samples, so that the importance of a feature with no predictive power is 0 in expectation.
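A standard identity gives some intuition for why terms of the form \(G^2 / (H + \lambda)\) are biased upward. This is a minimal illustration, not the derivation in Zhang et al., and the symbols \(g_i\), \(\mu\), \(\sigma^2\), \(\bar{g}\), and \(\bar{g}'\) are introduced here only for this purpose. Suppose the gradients \(g_1, \dots, g_n\) in a node are i.i.d. with mean \(\mu\) and variance \(\sigma^2\), and let \(\bar{g}\) be their sample mean. For squared loss (all Hessians equal to 1) and \(\lambda = 0\), \(G^2 / H = n \bar{g}^2\), and

\[ \mathbb{E}\left[ n \bar{g}^2 \right] = n \mu^2 + \sigma^2 \]

so the term is positive in expectation even when \(\mu = 0\). If \(\bar{g}'\) is the mean of an independent sample from the same distribution, then

\[ \mathbb{E}\left[ \bar{g} \, \bar{g}' \right] = \mu^2 \]

which has no \(\sigma^2\) term. This is one way to see why bringing in samples that were not used for fitting can remove the upward bias; the exact estimator in the paper may be constructed differently.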

References

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674. https://doi.org/10.1198/106186006X133933
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25. https://link.springer.com/article/10.1186/1471-2105-8-25