Word Embeddings

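Before deep learning came into use in natural language processing, a common way to turn words into vectors that can be computed with was Bag of Words: for each word in the vocabulary of a document collection, the representation is 1 if the word appears in a given document and 0 otherwise. Because the dimensionality grows with the vocabulary, it becomes very high for large datasets, so dimensionality reduction such as singular value decomposition (SVD) was sometimes applied.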
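As a minimal sketch of the binary Bag of Words encoding described above, the following builds a document-by-vocabulary 0/1 matrix with NumPy. The toy documents and the names `docs`, `vocab`, and `bow` are assumptions made for illustration only.

```python
import numpy as np

# Toy corpus (hypothetical): each document is already split into tokens
docs = [
    ["i", "like", "cats"],
    ["i", "like", "dogs"],
    ["cats", "chase", "dogs"],
]

# A sorted vocabulary gives every word a fixed column index
vocab = sorted({w for doc in docs for w in doc})
word_to_ix = {w: i for i, w in enumerate(vocab)}

# bow[d, v] = 1 if word v appears in document d, else 0
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for d, doc in enumerate(docs):
    for w in doc:
        bow[d, word_to_ix[w]] = 1

print(vocab)
print(bow)
```

Each row has as many columns as there are vocabulary words, which is why the representation becomes very high-dimensional on real corpora.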

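With the adoption of deep learning came word embeddings, also called distributed representations. Their components are not restricted to $\{0, 1\}$, and the number of dimensions can be made smaller than the vocabulary size (dimensionality reduction). In fact, the embedding is taken from a hidden layer of a neural-network language model, which is why any number of dimensions can be chosen.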

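Well-known examples are Word2Vec and ELMo. The broad difference is that Word2Vec assigns each word a single fixed distributed representation, whereas ELMo's representation changes with context. For example, 「彼はスポーツが下手だ」 ("He is bad at sports") and 「彼はいつも下手に出がちだ」 ("He always tends to take a humble attitude") both contain the word 「下手」, but with different meanings. Word2Vec does not take this kind of context into account; ELMo does.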

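Co-occurrence Matrix

Given a sentence $S = (w_{1}, \dots, w_{d})$, the set of words surrounding a word $w_{i}$ (for example, $c$ words on each side, $C = \{w_{i - c}, \dots, w_{i - 1}, w_{i + 1}, \dots, w_{i + c}\}$) is called the context $C$. Whether each word $w_{j}$ ($j \neq i$) belongs to $C$ is encoded as a value in $\{0, 1\}$.

The matrix that records this relationship is called the co-occurrence matrix.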

| | $w_{1}$ | $w_{2}$ | $w_{3}$ | $w_{4}$ | $w_{5}$ |
|---|---|---|---|---|---|
| $w_{3}$ | 0 | 1 | 0 | 1 | 0 |

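Before neural networks, a main approach in natural language processing was to apply singular value decomposition to the co-occurrence matrix to obtain distributed representations of words.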
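The following is a rough sketch of that pipeline under simple assumptions: a single toy sentence, a window of $c = 1$, binary co-occurrence entries, and a truncated SVD keeping `k = 2` dimensions. The sentence and all variable names are hypothetical.

```python
import numpy as np

# Toy sentence (hypothetical) and window size c
sentence = ["i", "like", "cats", "and", "dogs"]
c = 1

vocab = sorted(set(sentence))
word_to_ix = {w: i for i, w in enumerate(vocab)}

# cooc[i, j] = 1 if word j appears within c positions of word i
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for i, w in enumerate(sentence):
    lo, hi = max(0, i - c), min(len(sentence), i + c + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[word_to_ix[w], word_to_ix[sentence[j]]] = 1

# Truncated SVD: keep the top-k singular directions as dense word vectors
k = 2
U, S, Vt = np.linalg.svd(cooc.astype(float))
word_vectors = U[:, :k] * S[:k]

print(cooc[word_to_ix["cats"]])  # row for "cats": its neighbours are "like" and "and"
print(word_vectors.shape)
```

Each row of `word_vectors` is a `k`-dimensional representation of one vocabulary word; real systems typically use co-occurrence counts over a large corpus rather than binary entries from one sentence.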

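Word2Vec

A method that builds a neural network model for predicting a word from its context and uses the weight vectors of the hidden layer as embeddings.

The resulting representations support similarity comparisons and arithmetic between words, such as:

King - Man + Woman = Queen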

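Word2Vec Has Several Approaches

Word2Vec comes in two variants:

CBOW (continuous bag of words)
Skip-Gram

CBOW takes the t-th word as the prediction target and uses its surrounding words as input. Skip-Gram uses the t-th word to predict its surrounding words.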

Mikolov et al. (2013)
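To make the difference in input and target concrete, here is a small sketch that generates training pairs for both variants from the same token list. The function name `make_pairs` and the toy tokens are assumptions; unlike some implementations, this sketch truncates the window at the sentence edges rather than skipping edge words.

```python
def make_pairs(tokens, c):
    """Return (cbow_pairs, skipgram_pairs) for window size c."""
    cbow, skipgram = [], []
    for t, target in enumerate(tokens):
        # Words within c positions of t, excluding position t itself
        context = [tokens[j]
                   for j in range(max(0, t - c), min(len(tokens), t + c + 1))
                   if j != t]
        # CBOW: the whole context is the input, the center word is the label
        cbow.append((context, target))
        # Skip-Gram: the center word is the input, each context word is a label
        skipgram.extend((target, ctx) for ctx in context)
    return cbow, skipgram

tokens = ["we", "study", "word", "embeddings"]
cbow_pairs, skipgram_pairs = make_pairs(tokens, c=1)
print(cbow_pairs[1])       # (['we', 'word'], 'study')
print(skipgram_pairs[:3])  # [('we', 'study'), ('study', 'we'), ('study', 'word')]
```

CBOW yields one example per position, while Skip-Gram yields one example per (center, context) pair.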

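CBOW

Given a sentence $S = (w_{1}, w_{2}, \dots, w_{n})$, the CBOW (continuous bag of words) model predicts the $i$-th word $w_{i}$ from its context $C_{i} = (w_{i - c}, \dots, w_{i - 1}, w_{i + 1}, \dots, w_{i + c})$, the words surrounding it. Here $c$ is a hyperparameter called the window size. That is, it models the conditional probability

$$
P(w_{i} \mid C_{i})
$$

The model consists of two fully connected layers.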

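What is PyTorch's Embedding layer?

Embedding — PyTorch 2.0 documentation

Multiplying a one-hot context vector (for example, $c = (0, 1, 0, 0)$) by a weight matrix in a fully connected layer amounts to nothing more than a key-value lookup: it picks out one row of the weight matrix.

For speed, it is better to have a dedicated layer that only performs the table look-up, and that is the Embedding layer (ゼロから作るDeep Learning (2), p. 135).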

import numpy as np

# Illustration of the transformation performed by a fully connected layer
np.random.seed(0)
c = np.array([0, 1, 0, 0])  # context (one-hot, so the product picks out the matching row of W)
n_hidden = 2
W = np.random.randn(len(c), n_hidden)
h = c @ W
print("      h =", h)
print("W[i, :] =", W[np.argmax(c), :])

      h = [0.97873798 2.2408932 ]
W[i, :] = [0.97873798 2.2408932 ]
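The same equivalence can be checked against `nn.Embedding` itself. The sketch below assumes a toy vocabulary of 4 words and 2 hidden units; the variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_hidden = 4, 2
embedding = nn.Embedding(vocab_size, n_hidden)

idx = torch.tensor([1])  # index of the word, instead of a one-hot vector

# Look-up: picks row 1 of the weight matrix directly
h_lookup = embedding(idx)

# Equivalent dense computation: one-hot vector times the same weight matrix
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
h_matmul = one_hot @ embedding.weight

print(torch.allclose(h_lookup, h_matmul))  # True
```

The look-up takes integer indices as input, so the one-hot vectors never need to be materialized.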

PyTorch Implementation

(Reference: FraLotito/pytorch-continuous-bag-of-words)

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
          [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))

print(data[:3])

[(['are', 'We', 'to', 'study'], 'about'), (['about', 'are', 'study', 'the'], 'to'), (['to', 'about', 'the', 'idea'], 'study')]

import torch
import torch.nn as nn

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # out: 1 x embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        # out: 1 x vocab_size
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim = -1)

    def forward(self, inputs):
        embeds = sum(self.embeddings(inputs)).view(1,-1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_emdedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        return self.embeddings(word).view(1,-1)
    

# set model
model = CBOW(vocab_size, embedding_dim=100)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# training
for epoch in range(50):
    total_loss = 0

    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)  

        log_probs = model(context_vector)

        total_loss += loss_function(log_probs, torch.tensor([word_to_ix[target]]))

    #optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

# testing
context = ['People','create', 'to', 'direct']
context_vector = make_context_vector(context, word_to_ix)
a = model(context_vector)

# result
print(f'Raw text: {" ".join(raw_text)}\n')
print(f'Context: {context}\n')
print(f'Prediction: {ix_to_word[torch.argmax(a[0]).item()]}')

Raw text: We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

Context: ['People', 'create', 'to', 'direct']

Prediction: programs

model.get_word_emdedding("programs")

tensor([[ 0.1668, -0.5796,  1.5907, -0.6294, -1.6611, -0.3255, -0.2113, -1.8095,
         -2.0887, -0.1145,  0.1132,  0.9143,  1.0431,  0.8204,  0.6881, -1.9051,
         -1.2344,  0.2445, -1.0943,  0.3168,  0.3860, -0.9065,  1.0441,  0.2928,
          2.1992,  0.5055, -0.8070,  0.6610, -1.7019,  0.2656,  1.2820,  1.2471,
         -0.3776,  2.5445, -0.5076, -0.1624,  0.4491, -0.7438,  0.0946,  0.4787,
          0.7029, -0.6734, -2.1051, -0.0332,  1.1394, -0.5427, -0.3893,  0.8412,
         -1.7474,  0.3699, -0.9868, -2.1558, -0.1784,  1.7104,  0.7475, -0.8386,
         -1.5202, -0.9841,  1.2340,  0.5906,  1.0264, -0.5635,  2.3951,  0.8399,
         -0.1783, -0.5652, -1.6071, -0.0553, -1.2841, -1.0072, -0.1787, -1.0199,
          0.0556,  0.5510,  0.2147,  1.9663,  2.6188,  0.4018,  0.3036,  0.7832,
         -0.3007,  0.0112,  1.5098,  0.8185, -0.7557,  0.6621,  0.4945,  0.7458,
          1.0004,  1.7364, -0.9689, -0.7362, -1.2900, -0.3330, -0.0457, -0.2386,
          1.0443, -0.9503,  0.1780,  0.2635]], grad_fn=<ViewBackward0>)

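Skip-Gram

A method that obtains distributed representations of words using a model $P(C_{i} \mid w_{i})$ that predicts the context from a word.

Given the training words $w_{1}, w_{2}, \dots, w_{T}$, the objective is to maximize the average log probability

$$
\frac{1}{T} \sum_{t = 1}^{T} \sum_{- c \leq j \leq c, j \neq 0} \log p (w_{t + j} \mid w_{t})
$$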

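The probability is computed with a softmax:

$$
p (w_{O} \mid w_{I}) = \frac{\exp ({v'_{w_{O}}}^{\top} v_{w_{I}})}{\sum_{w = 1}^{W} \exp ({v'_{w}}^{\top} v_{w_{I}})}
$$

Here $W$ is the vocabulary size, and $v_{w}$ and $v'_{w}$ are the input and output vector representations of word $w$ (Mikolov et al., 2013). The cost of computing $\nabla \log p (w_{O} \mid w_{I})$ is proportional to $W$, which can reach an impractical order (around $10^{7}$, for example), so some way of reducing the computation is needed.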
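A minimal NumPy sketch of this softmax, assuming randomly initialized vectors and a toy vocabulary of 6 words. Following Mikolov et al. (2013), it keeps separate input and output vector tables; the names `V_in`, `V_out`, and `skipgram_prob` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 6, 3                      # toy vocabulary size and vector dimension
V_in = rng.normal(size=(W, dim))   # input vectors, one row per word
V_out = rng.normal(size=(W, dim))  # output vectors, one row per word

def skipgram_prob(w_in):
    """p(. | w_in) over the whole vocabulary via a full softmax."""
    scores = V_out @ V_in[w_in]    # one dot product per vocabulary word: O(W)
    scores -= scores.max()         # stabilise exp() numerically
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = skipgram_prob(w_in=2)
print(p, p.sum())
```

The denominator touches every row of `V_out`, so each prediction (and its gradient) costs time proportional to the vocabulary size.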

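The Computational Cost Problem

Used as is, Skip-Gram is too computationally expensive, so countermeasures are taken (Mikolov et al., 2013):

Hierarchical Softmax: arranges the vocabulary as the leaves of a binary tree so that, much like a binary tree search, only the nodes on the path to the target word (about $\log_{2} W$ of them) need to be evaluated instead of all $W$ outputs.
Noise Contrastive Estimation
Negative Sampling: approximates the multi-class classification ("which word is $w_{i}$?") with a binary classification ("is $w_{i}$ the word "woman"?"), and draws the negative examples by random sampling.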
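As a rough illustration of Negative Sampling, the sketch below computes the loss for one (input word, output word) pair using `k` sampled negatives. It is a simplification under stated assumptions: vectors are random, negatives are drawn uniformly (word2vec draws them from a smoothed unigram distribution), and a sampled negative may coincide with the true output word. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim, k = 1000, 16, 5            # vocabulary size, vector dimension, negatives per pair
V_in = rng.normal(scale=0.1, size=(W, dim))
V_out = rng.normal(scale=0.1, size=(W, dim))

def log_sigmoid(x):
    # log(1 / (1 + exp(-x))), computed stably
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(w_in, w_out):
    # Uniform negatives for simplicity (an assumption of this sketch)
    negatives = rng.integers(0, W, size=k)
    # Binary classification: the observed pair should score high ...
    pos = log_sigmoid(V_out[w_out] @ V_in[w_in])
    # ... and the randomly sampled pairs should score low
    neg = log_sigmoid(-(V_out[negatives] @ V_in[w_in])).sum()
    return -(pos + neg)            # only k + 1 dot products instead of W

loss = negative_sampling_loss(w_in=3, w_out=7)
print(loss)
```

Compared with the full softmax, the cost per training pair no longer depends on the vocabulary size.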

Which Is Better, CBOW or Skip-Gram?

Skip-Gram is reportedly the more accurate of the two.

Running Word2Vec with gensim

https://radimrehurek.com/gensim/models/word2vec.html

# Example: a collection of sentences split into tokens
sentences = [
    ["king", "male", "ruler"],
    ["queen", "female", "ruler"],
    ["man", "male"],
    ["woman", "female"],
]

from gensim.models import Word2Vec
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)

# Top 3 words closest in meaning to king (similarity is measured by cosine similarity)
sims = model.wv.most_similar('king', topn=3)
sims

[('male', 0.1459505707025528),
 ('woman', 0.041577354073524475),
 ('man', 0.03476494178175926)]

# The well-known king + woman - man ≈ queen analogy
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.0066775488667190075)]

# Do the vector arithmetic by hand and compute the similarities
import numpy as np
def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v = model.wv["king"] - model.wv["man"] + model.wv["woman"]
# Exclude the words used in the arithmetic, since they are inevitably highly similar to v
words = set(sum(sentences, [])) - {"king", "man", "woman"}
for word in words:
    print(f"{word}: {cosine_sim(v, model.wv[word]):.3g}")

# queen came out closest

ruler: -0.0263
queen: 0.00656
male: -0.0251
female: -0.0395

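GloVe

GloVe obtains embeddings with a model different from CBOW: it fits word vectors to corpus-wide word co-occurrence statistics.

website: GloVe: Global Vectors for Word Representation
Article on the R text2vec package: GloVe Word Embeddings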