Quantile Regression#

The conditional quantile function at quantile $\tau$ is written as

$$Q_\tau(y_i \mid X_i) = F_y^{-1}(\tau \mid X_i)$$

where $F_y^{-1}(\tau \mid X_i)$ is the (generalized) inverse of the distribution function of $y_i$ conditional on $X_i$, defined as $F_y^{-1}(\tau \mid X_i) = \inf\{y : F_y(y \mid X_i) \geq \tau\}$.

For example, when $\tau = 0.1$, $Q_\tau(y_i \mid X_i)$ is the 0.1 quantile (lower 10th percentile) of $y_i$.
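To make the generalized-inverse definition concrete, here is a small sketch (the helper name `quantile_inf` and the sample data are illustrative, not from the text) that computes $\inf\{y : F_y(y) \geq \tau\}$ from an empirical CDF:

```python
import numpy as np

def quantile_inf(y, tau):
    """Generalized inverse: smallest value whose empirical CDF reaches tau."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = len(y_sorted)
    ecdf = np.arange(1, n + 1) / n  # F at each sorted value
    # First index where the ECDF is >= tau
    return y_sorted[np.searchsorted(ecdf, tau)]

y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
print(quantile_inf(y, 0.1))  # smallest y with F(y) >= 0.1
print(quantile_inf(y, 0.5))
```

For a continuous, strictly increasing $F_y$, this coincides with the ordinary inverse $F_y^{-1}(\tau)$.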

A standard regression model predicts the conditional expectation $E(y_i \mid X_i)$ by training a model $m(X_i)$ to minimize the sum (or expectation) of squared errors $(y_i - m(X_i))^2$:

$$E(y_i \mid X_i) = \operatorname*{arg\,min}_{m(X_i)} E[(y_i - m(X_i))^2]$$

A quantile regression model instead predicts the conditional quantile function $Q_\tau(y_i \mid X_i) = F_y^{-1}(\tau \mid X_i)$ by training a model $q(X_i)$ to minimize the sum (or expectation) of the pinball loss $\rho_\tau(y_i - q(X_i))$:

$$Q_\tau(y_i \mid X_i) = \operatorname*{arg\,min}_{q(X_i)} E[\rho_\tau(y_i - q(X_i))]$$

The pinball loss is also called the τ-tilted absolute value function or the check function (its graph resembles a check mark). It can be written as

$$\rho_\tau(x) = (\tau - 1(x \leq 0))\, x$$

or equivalently

$$\rho_\tau(x) = \begin{cases} (\tau - 1)x & \text{if } x \leq 0 \\ \tau x & \text{if } x > 0 \end{cases}$$

or

$$\rho_\tau(x) = \tau \max(x, 0) + (\tau - 1) \min(x, 0)$$
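The three expressions are algebraically identical; as a quick sanity check (the function names here are mine, not from the text):

```python
import numpy as np

def rho_indicator(x, tau):
    # (tau - 1(x <= 0)) * x
    return (tau - (x <= 0)) * x

def rho_piecewise(x, tau):
    # (tau - 1) x for x <= 0, tau x for x > 0
    return np.where(x > 0, tau * x, (tau - 1) * x)

def rho_maxmin(x, tau):
    # tau max(x, 0) + (tau - 1) min(x, 0)
    return tau * np.maximum(x, 0) + (tau - 1) * np.minimum(x, 0)

x = np.linspace(-2, 2, 401)
for tau in [0.1, 0.5, 0.9]:
    assert np.allclose(rho_indicator(x, tau), rho_piecewise(x, tau))
    assert np.allclose(rho_indicator(x, tau), rho_maxmin(x, tau))
```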

import numpy as np
import matplotlib.pyplot as plt

def pinball_loss(x, tau):
    # rho_tau(x) = (tau - 1(x <= 0)) * x
    return (tau - 1 * (x <= 0)) * x

x = np.linspace(-1, 1, 100)
fig, axes = plt.subplots(figsize=[10, 2], ncols=3)
for i, tau in enumerate([0.1, 0.5, 0.9]):
    y = pinball_loss(x, tau=tau)
    axes[i].plot(x, y)
    if i == 0:
        axes[i].set(title=f"τ={tau}", xlabel=r"$x$", ylabel=r"$y = (\tau - 1(x <= 0)) x$")
    else:
        axes[i].set(title=f"τ={tau}", xlabel=r"$x$")
fig.show()

Note that when $\tau = 0.5$, the pinball loss becomes

$$\rho_{0.5}(x) = \begin{cases} -0.5x & \text{if } x \leq 0 \\ 0.5x & \text{if } x > 0 \end{cases} = \frac{1}{2}|x|$$

which is proportional to the absolute error.

A linear model that takes the sum of absolute errors as its objective function is known in statistics as least absolute deviations (LAD), and its solution is the conditional median:

$$\operatorname{median}(y_i \mid X_i) = Q_{0.5}(y_i \mid X_i) = \operatorname*{arg\,min}_{q(X_i)} E[\rho_{0.5}(y_i - q(X_i))]$$
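As a numerical illustration of this fact, the sketch below (the grid search and the exponential sample are illustrative choices) minimizes the $\tau = 0.5$ pinball loss over a constant prediction $q$ and recovers the sample median:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=1001)  # skewed, so mean and median differ

def pinball(x, tau):
    return (tau - (x <= 0)) * x

# Scan candidate constants q and pick the pinball-loss minimizer
qs = np.linspace(0, 3, 3001)
losses = [pinball(y - q, 0.5).sum() for q in qs]
q_star = qs[np.argmin(losses)]

print(q_star, np.median(y))  # the minimizer sits at the sample median
```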
import numpy as np
import matplotlib.pyplot as plt

def pinball_loss(x, tau):
    return (tau - 1 * (x <= 0)) * x

x = np.linspace(-3, 3, 100)
fig, ax = plt.subplots(figsize=[4, 3])
ax.plot(x, pinball_loss(x, tau=0.5), label=r"$\rho_{0.5}(x)$")
ax.plot(x, abs(x), label="|x|")
ax.legend()
fig.show()

Quantile Regression in Practice#

Quantile regression with LightGBM#

Since all that is needed is to use the pinball loss as the objective function, quantile regression can be performed with other algorithms as well. With LightGBM, this is done by setting `objective="quantile"` and passing the target quantile as `alpha`.

import numpy as np
import matplotlib.pyplot as plt

# create data
np.random.seed(0)
n = 1000
x = np.random.uniform(-3, 3, size=n)
X = x.reshape(-1, 1)
y = np.sin(x) + np.random.normal(scale=0.5, size=n)

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.7)

# regression
x_plot = np.linspace(-3, 3, 500)
X_test = x_plot.reshape(-1, 1)
import lightgbm as lgb
taus = [0.1, 0.9]
colors = ["orange", "tomato"]
for i in range(2):
    model = lgb.LGBMRegressor(objective="quantile", alpha=taus[i], verbose=-1)
    model.fit(X, y)
    y_hat = model.predict(X_test)
    ax.plot(X_test, y_hat, color=colors[i], label=fr"$\tau = {taus[i]}$")
ax.set(xlabel="x", ylabel="y", title="Quantile Regression with LightGBM")
ax.legend()
fig.show()
from sklearn.metrics import d2_pinball_score, make_scorer

# D^2 score based on the pinball loss (1.0 is a perfect fit),
# evaluated for the last fitted model above (tau = 0.9)
d2_pinball_score_09 = make_scorer(d2_pinball_score, alpha=0.9)
d2_pinball_score_09(model, X, y)
0.48475107952573926