Part-of-Speech Tagging with a Hidden Markov Model
1 Hidden Markov Model (HMM)
y_t: the observed variable (here, the word token) at time t;
z_t: the hidden variable (here, the POS tag) at time t.
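As a minimal illustration of the two variable types, here is a toy sketch (the two tags, three words, and all probabilities below are made-up values, not taken from the corpus used later):

import numpy as np

# hypothetical hidden states (tags) and observations (words)
tags = ["DT", "NN"]
words = ["the", "dog", "cat"]

pi = np.array([0.9, 0.1])             # p(z_1): which tag starts a sentence
A = np.array([[0.1, 0.9],             # p(z_k | z_{k-1}): rows are the previous tag
              [0.6, 0.4]])
B = np.array([[0.9, 0.05, 0.05],      # p(y_k | z_k): rows are the tag
              [0.0, 0.5, 0.5]])

# joint probability of observing "the dog" with hidden tags DT NN:
# p(z_1 = DT) * p(the | DT) * p(NN | DT) * p(dog | NN)
print(pi[0] * B[0, 0] * A[0, 1] * B[1, 1])  # about 0.3645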
2 Part-of-Speech Tagging (POS tagging)
HMM-based POS tagging is an inference problem: the HMM parameters are already known (estimated by counting over a tagged corpus), and given an observation sequence we infer the hidden states (the POS tags) behind the observed variables (the word tokens).
The HMM parameters lambda = (pi, A, B) are obtained by counting over the corpus:
- initial probabilities: pi_i = p(z_1 = i)
- state transition matrix: A_ij = p(z_k = j | z_{k-1} = i)
- emission matrix: B_iv = p(y_k = v | z_k = i)
To prevent numerical underflow, the parameters are converted to log probabilities.
Inference is then carried out with the Viterbi algorithm.
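A quick sketch of why the logarithm matters (made-up numbers; the 1e-5 smoothing constant matches the EPSILON used in the implementation below): the product of many small probabilities underflows to 0.0 in double precision, while the sum of their logs stays finite and can still be compared between paths.

import numpy as np

probs = np.full(500, 1e-3)           # 500 small transition/emission probabilities
print(np.prod(probs))                # 0.0 -- the product underflows
print(np.sum(np.log(probs + 1e-5)))  # about -3449 -- finite and usable for comparison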
3 The Viterbi Algorithm
Assume the hidden state (the POS tag) can take m possible values and the given sequence has length n. The algorithm fills in two tables:
1. Viterbi trellis: an m x n matrix; each cell delta_k(j) records the score of the best path ending at state j at time k, i.e.
   delta_k(j) = max_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ] + log p(y_k | z_k = j)
2. Back trace: an m x (n - 1) matrix; each cell records the state at the previous time step on that best path, i.e.
   psi_k(j) = argmax_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ]
A toy walk-through of both tables follows this list.
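To make the two tables concrete, here is a minimal NumPy sketch of the recursion on made-up parameters (two tags, a three-word vocabulary, a three-word observation; none of these numbers come from the corpus used in section 4):

import numpy as np

# made-up log-space parameters: m = 2 tags, vocabulary of 3 words
log_pi = np.log([0.9, 0.1])                            # log p(z_1)
log_A = np.log([[0.1, 0.9], [0.6, 0.4]])               # log p(z_k = j | z_{k-1} = i), rows are i
log_B = np.log([[0.9, 0.05, 0.05], [1e-5, 0.5, 0.5]])  # log p(y_k = v | z_k = i), rows are i

obs = [0, 1, 2]                                        # observed word ids, n = 3
m, n = log_A.shape[0], len(obs)

trellis = np.zeros((m, n))                             # delta_k(j)
back_trace = np.zeros((m, n - 1), dtype=int)           # psi_k(j)

trellis[:, 0] = log_pi + log_B[:, obs[0]]              # initial column
for k in range(1, n):
    for j in range(m):
        # best way to reach state j at time k, over all previous states i
        scores = trellis[:, k - 1] + log_A[:, j]
        back_trace[j, k - 1] = np.argmax(scores)
        trellis[j, k] = np.max(scores) + log_B[j, obs[k]]

# read the best path off the tables, right to left
path = [int(np.argmax(trellis[:, -1]))]
for k in range(n - 2, -1, -1):
    path.insert(0, int(back_trace[path[0], k]))
print(path)  # [0, 1, 1]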
4 Implementation
4.1 Corpus
# build the word and tag vocabularies from the tagged corpus,
# which contains one word/tag pair per line
tag2id = {}
id2tag = {}
word2id = {}
id2word = {}
tag_filepath = "./data/traindata.txt"

with open(tag_filepath, "r") as f:
    id_word = 1
    id_tag = 1
    for line in f.readlines():
        word, tag = line.strip().split("/")
        word = word.lower()
        # assign the next free id to an unseen word
        if word not in word2id:
            word2id[word] = id_word
            id2word[id_word] = word
            id_word += 1
        # assign the next free id to an unseen tag
        if tag not in tag2id:
            tag2id[tag] = id_tag
            id2tag[id_tag] = tag
            id_tag += 1

print("tag size: {}".format(len(tag2id)))
print("tag2id: {}".format(tag2id))
tag size: 54
tag2id: {'NNP': 1, ',': 2, 'VBG': 3, 'TO': 4, 'VB': 5, 'NN': 6, 'IN': 7, 'JJ': 8, 'VBD': 9, 'NNS': 10, 'CD': 11, 'CC': 12, 'PRP': 13, 'MD': 14, 'DT': 15, '.': 16, 'VBZ': 17, 'VBN': 18, 'WDT': 19, 'VBP': 20, 'POS': 21, 'RB': 22, '$': 23, 'PRP$': 24, ':': 25, 'JJR': 26, '``': 27, "''": 28, 'WP': 29, 'JJS': 30, 'WRB': 31, 'RBR': 32, 'NNPS': 33, 'RP': 34, 'WP$': 35, 'EX': 36, '(': 37, ')': 38, 'PDT': 39, 'RBS': 40, 'FW': 41, 'UH': 42, 'SYM': 43, 'LS': 44, '#': 45, 'VBG|NN': 46, 'JJ|NN': 47, 'RB|IN': 48, 'NNS|NN': 49, 'VBN|JJ': 50, 'VB|NN': 51, 'RBR|JJR': 52, 'NN|NNS': 53, 'JJ|RB': 54}
4.2 Statistics
import pandas as pd
import numpy as np

num_tags = len(tag2id)
num_words = len(word2id)

# transition probability matrix A
transit_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_tags)),
    columns=tag2id.keys(),
    index=tag2id.keys()
)
# emission probability matrix B
emission_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_words)),
    columns=word2id.keys(),
    index=tag2id.keys()
)
# initial probability vector, pi
init_probs = pd.Series(
    data=np.zeros(shape=(num_tags, )),
    index=tag2id.keys()
)

with open(tag_filepath, "r") as f:
    tag_prev = ""
    for line in f.readlines():
        word, tag = line.strip().split("/")
        word = word.lower()
        emission_probs.loc[tag, word] += 1
        # at the beginning of a sentence, count towards the initial probability
        if not tag_prev:
            init_probs[tag] += 1
        # otherwise count the transition from the previous tag
        else:
            transit_probs.loc[tag_prev, tag] += 1
        # at the end of a sentence, reset tag_prev
        if word == ".":
            tag_prev = ""
        else:
            tag_prev = tag

# normalization: pi sums to 1, each row of A and B sums to 1
init_probs /= init_probs.sum()
partition = transit_probs.sum(axis="columns")
transit_probs = transit_probs.apply(
    func=lambda x: x / partition, axis="index"
)
transit_probs[transit_probs.isna()] = 0
partition = emission_probs.sum(axis="columns")
emission_probs = emission_probs.apply(
    func=lambda x: x / partition, axis="index"
)
print("initial probability vector:")
print(init_probs)
print("transition probability matrix, A:")
print(transit_probs.head())
print("emission probability matrix, B:")
print(emission_probs.head())
initial probability vector:
NNP 0.181324
, 0.000000
VBG 0.010005
TO 0.003335
VB 0.003953
NN 0.036808
IN 0.111660
JJ 0.036685
VBD 0.000618
NNS 0.038167
CD 0.008770
CC 0.051877
PRP 0.060277
MD 0.000247
DT 0.217268
. 0.000000
VBZ 0.001482
VBN 0.006052
WDT 0.000865
VBP 0.000247
POS 0.000000
RB 0.047307
$ 0.000000
PRP$ 0.007164
: 0.001729
JJR 0.002100
`` 0.075346
'' 0.063612
WP 0.002594
JJS 0.001853
WRB 0.005929
RBR 0.001976
NNPS 0.002841
RP 0.000000
WP$ 0.000000
EX 0.002717
( 0.005929
) 0.005929
PDT 0.000988
RBS 0.000371
FW 0.000124
UH 0.000000
SYM 0.000000
LS 0.001853
# 0.000000
VBG|NN 0.000000
JJ|NN 0.000000
RB|IN 0.000000
NNS|NN 0.000000
VBN|JJ 0.000000
VB|NN 0.000000
RBR|JJR 0.000000
NN|NNS 0.000000
JJ|RB 0.000000
dtype: float64
transition probability matrix, A:
NNP , VBG TO VB NN IN \
NNP 0.379116 0.141891 0.001290 0.008258 0.000981 0.052803 0.042738
, 0.132745 0.000000 0.045207 0.008427 0.003767 0.046495 0.078021
VBG 0.037096 0.014268 0.004756 0.089093 0.001585 0.136018 0.135701
TO 0.042640 0.000643 0.006214 0.000000 0.579387 0.027855 0.005357
VB 0.034148 0.017973 0.016175 0.039720 0.006470 0.063803 0.117362
JJ VBD NNS ... # VBG|NN JJ|NN RB|IN \
NNP 0.008723 0.067255 0.023330 ... 0.000 0.0 0.000000 0.0
, 0.044116 0.052245 0.027759 ... 0.000 0.0 0.000000 0.0
VBG 0.071972 0.002536 0.089410 ... 0.000 0.0 0.000317 0.0
TO 0.034069 0.000000 0.023355 ... 0.003 0.0 0.000000 0.0
VB 0.088426 0.001438 0.045830 ... 0.000 0.0 0.000000 0.0
NNS|NN VBN|JJ VB|NN RBR|JJR NN|NNS JJ|RB
NNP 0.0 0.0 0.000000 0.0 0.00000 0.0
, 0.0 0.0 0.000000 0.0 0.00000 0.0
VBG 0.0 0.0 0.000000 0.0 0.00000 0.0
TO 0.0 0.0 0.000214 0.0 0.00000 0.0
VB 0.0 0.0 0.000000 0.0 0.00018 0.0
[5 rows x 54 columns]
emission probability matrix, B:
newsweek , trying to keep pace with rival \
NNP 0.000516 0.000000 0.000000 0.000000 0.000000 0.000103 0.0 0.0
, 0.000000 0.999802 0.000000 0.000000 0.000000 0.000000 0.0 0.0
VBG 0.000000 0.000000 0.014902 0.000000 0.000000 0.000000 0.0 0.0
TO 0.000000 0.000000 0.000000 0.999786 0.000000 0.000000 0.0 0.0
VB 0.000000 0.000000 0.000000 0.000000 0.005212 0.000000 0.0 0.0
time magazine ... platform trillion-dollar amaury souza \
NNP 0.000877 0.000103 ... 0.0 0.0 0.000052 0.000052
, 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
VBG 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
TO 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
VB 0.000180 0.000000 ... 0.0 0.0 0.000000 0.000000
valiant mailson ferreira nobrega endured massively
NNP 0.0 0.000052 0.000052 0.000052 0.0 0.0
, 0.0 0.000000 0.000000 0.000000 0.0 0.0
VBG 0.0 0.000000 0.000000 0.000000 0.0 0.0
TO 0.0 0.000000 0.000000 0.000000 0.0 0.0
VB 0.0 0.000000 0.000000 0.000000 0.0 0.0
[5 rows x 17224 columns]
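As a quick sanity check on these estimates (a small sketch that assumes the init_probs, transit_probs and emission_probs DataFrames from the code above are still in scope): pi should sum to 1, and every row of A and B that received any counts should sum to 1.

print(np.isclose(init_probs.sum(), 1.0))
row_sums = transit_probs.sum(axis="columns")
print(np.allclose(row_sums[row_sums > 0], 1.0))
row_sums = emission_probs.sum(axis="columns")
print(np.allclose(row_sums[row_sums > 0], 1.0))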
4.3 Viterbi
from nltk.tokenize import word_tokenize

EPSILON = 1e-5

def log(x):
    """
    smoothed log: the additive constant avoids log(0)
    """
    return np.log(x + EPSILON)

def viterbi(sent, init_probs, transit_probs, emission_probs):
    """
    viterbi decoding
    """
    tags = init_probs.index
    vocab = emission_probs.columns
    # log probabilities, to prevent numerical underflow
    init_log_probs = log(init_probs)
    transit_log_probs = log(transit_probs)
    emission_log_probs = log(emission_probs)
    # tokenize and drop words that are not in the vocabulary
    words = [word.lower() for word in word_tokenize(sent)]
    words = [word for word in words if word in vocab]
    len_sent = len(words)
    num_tags = len(tags)
    # viterbi trellis: the best path score ending at each state at each time step
    trellis = pd.DataFrame(
        data=np.zeros(shape=(num_tags, len_sent)),
        index=tags
    )
    # back trace: the previous state on that best path
    back_trace = pd.DataFrame(
        data=np.empty(shape=(num_tags, len_sent - 1), dtype=object),
        index=tags
    )
    # dynamic programming
    # initial states: delta_0(j) = log p(y_0 | z_0 = j) + log p(z_0 = j)
    word = words[0]
    trellis.loc[:, 0] = emission_log_probs[word] + init_log_probs
    for idx_word in range(1, len_sent):
        # delta_k(j), the score of the best path ending at state j at time k
        # delta_k(j) = log p(y_k | z_k = j)
        #              + max_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ]
        word = words[idx_word]
        for tag in tags:
            # transit_log_probs[tag] is the column of transitions *into* `tag`,
            # since rows of A were counted as the previous tag
            scores = transit_log_probs[tag] + trellis.loc[:, idx_word - 1]
            # trellis
            trellis.loc[tag, idx_word] = scores.max() + emission_log_probs.loc[tag, word]
            # back trace
            back_trace.loc[tag, idx_word - 1] = scores.idxmax()
    # follow the back pointers from the best final state
    tag = trellis.iloc[:, -1].idxmax()
    best_path = [tag]
    for idx in range(len_sent - 2, -1, -1):
        tag = back_trace.loc[tag, idx]
        best_path.insert(0, tag)
    return best_path, words
if __name__ == "__main__":
    sent = "Newsweek, trying to keep pace with rival Time magzine, announced new advertising rates for 1990 and said it will introduce a new incentive plan for advertisers."
    best_path, words = viterbi(sent, init_probs, transit_probs, emission_probs)
    print(" ".join(words))
    print(" ".join(best_path))
newsweek , trying to keep pace with rival time , announced new advertising rates for 1990 and said it will introduce a new incentive plan for advertisers .
NNP , VBG TO VB NN IN NN NN , VBD NNP VBG NNS IN CD CC VBD PRP MD NN DT JJ IN NN IN NNS .
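For readability, the two parallel lines of output can be zipped into word/tag pairs (a small follow-up sketch using the words and best_path values returned by viterbi above):

for word, tag in zip(words, best_path):
    print("{}/{}".format(word, tag), end=" ")
print()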