Part-of-Speech Tagging with a Hidden Markov Model
1 Hidden Markov Model (HMM)
y_t: the observed variable (here, the word token) at time t;
z_t: the hidden variable (here, the POS tag) at time t.
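As a minimal illustration of the two variable types, here is a toy sketch (the two tags, three words, and all probabilities below are made-up values, not taken from the corpus used later):

import numpy as np

# hypothetical hidden states (tags) and observations (words)
tags = ["DT", "NN"]
words = ["the", "dog", "cat"]

pi = np.array([0.9, 0.1])             # p(z_1): which tag starts a sentence
A = np.array([[0.1, 0.9],             # p(z_k | z_{k-1}): rows are the previous tag
              [0.6, 0.4]])
B = np.array([[0.9, 0.05, 0.05],      # p(y_k | z_k): rows are the tag
              [0.0, 0.5, 0.5]])

# joint probability of observing "the dog" with hidden tags DT NN:
# p(z_1 = DT) * p(the | DT) * p(NN | DT) * p(dog | NN)
print(pi[0] * B[0, 0] * A[0, 1] * B[1, 1])  # about 0.3645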
2 Part-of-Speech Tagging (POS tagging)
HMM-based POS tagging is an inference problem: the HMM parameters are already known (estimated by counting over a tagged corpus), and given an observation sequence we infer the hidden states (the POS tags) behind the observed variables (the word tokens).
The HMM parameters lambda = (pi, A, B) are obtained by counting over the corpus:
- initial probabilities: pi_i = p(z_1 = i)
- state transition matrix: A_ij = p(z_k = j | z_{k-1} = i)
- emission matrix: B_iv = p(y_k = v | z_k = i)
To prevent numerical underflow, the parameters are converted to log probabilities.
Inference is then carried out with the Viterbi algorithm.
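A quick sketch of why the logarithm matters (made-up numbers; the 1e-5 smoothing constant matches the EPSILON used in the implementation below): the product of many small probabilities underflows to 0.0 in double precision, while the sum of their logs stays finite and can still be compared between paths.

import numpy as np

probs = np.full(500, 1e-3)           # 500 small transition/emission probabilities
print(np.prod(probs))                # 0.0 -- the product underflows
print(np.sum(np.log(probs + 1e-5)))  # about -3449 -- finite and usable for comparison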
3 The Viterbi Algorithm
Assume the hidden state (the POS tag) can take m possible values and the given sequence has length n. The algorithm fills in two tables:
1. Viterbi trellis: an m x n matrix; each cell delta_k(j) records the score of the best path ending at state j at time k, i.e.
   delta_k(j) = max_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ] + log p(y_k | z_k = j)
2. Back trace: an m x (n - 1) matrix; each cell records the state at the previous time step on that best path, i.e.
   psi_k(j) = argmax_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ]
A toy walk-through of both tables follows this list.
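To make the two tables concrete, here is a minimal NumPy sketch of the recursion on made-up parameters (two tags, a three-word vocabulary, a three-word observation; none of these numbers come from the corpus used in section 4):

import numpy as np

# made-up log-space parameters: m = 2 tags, vocabulary of 3 words
log_pi = np.log([0.9, 0.1])                            # log p(z_1)
log_A = np.log([[0.1, 0.9], [0.6, 0.4]])               # log p(z_k = j | z_{k-1} = i), rows are i
log_B = np.log([[0.9, 0.05, 0.05], [1e-5, 0.5, 0.5]])  # log p(y_k = v | z_k = i), rows are i

obs = [0, 1, 2]                                        # observed word ids, n = 3
m, n = log_A.shape[0], len(obs)

trellis = np.zeros((m, n))                             # delta_k(j)
back_trace = np.zeros((m, n - 1), dtype=int)           # psi_k(j)

trellis[:, 0] = log_pi + log_B[:, obs[0]]              # initial column
for k in range(1, n):
    for j in range(m):
        # best way to reach state j at time k, over all previous states i
        scores = trellis[:, k - 1] + log_A[:, j]
        back_trace[j, k - 1] = np.argmax(scores)
        trellis[j, k] = np.max(scores) + log_B[j, obs[k]]

# read the best path off the tables, right to left
path = [int(np.argmax(trellis[:, -1]))]
for k in range(n - 2, -1, -1):
    path.insert(0, int(back_trace[path[0], k]))
print(path)  # [0, 1, 1]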
4 Implementation
4.1 Corpus
# build the word and tag vocabularies from the tagged corpus,
# which contains one word/tag pair per line
tag2id = {}
id2tag = {}
word2id = {}
id2word = {}
tag_filepath = "./data/traindata.txt"

with open(tag_filepath, "r") as f:
    id_word = 1
    id_tag = 1
    for line in f.readlines():
        word, tag = line.strip().split("/")
        word = word.lower()
        # assign the next free id to an unseen word
        if word not in word2id:
            word2id[word] = id_word
            id2word[id_word] = word
            id_word += 1
        # assign the next free id to an unseen tag
        if tag not in tag2id:
            tag2id[tag] = id_tag
            id2tag[id_tag] = tag
            id_tag += 1

print("tag size: {}".format(len(tag2id)))
print("tag2id: {}".format(tag2id))
tag size: 54
tag2id: {'NNP': 1, ',': 2, 'VBG': 3, 'TO': 4, 'VB': 5, 'NN': 6, 'IN': 7, 'JJ': 8, 'VBD': 9, 'NNS': 10, 'CD': 11, 'CC': 12, 'PRP': 13, 'MD': 14, 'DT': 15, '.': 16, 'VBZ': 17, 'VBN': 18, 'WDT': 19, 'VBP': 20, 'POS': 21, 'RB': 22, '$': 23, 'PRP$': 24, ':': 25, 'JJR': 26, '``': 27, "''": 28, 'WP': 29, 'JJS': 30, 'WRB': 31, 'RBR': 32, 'NNPS': 33, 'RP': 34, 'WP$': 35, 'EX': 36, '(': 37, ')': 38, 'PDT': 39, 'RBS': 40, 'FW': 41, 'UH': 42, 'SYM': 43, 'LS': 44, '#': 45, 'VBG|NN': 46, 'JJ|NN': 47, 'RB|IN': 48, 'NNS|NN': 49, 'VBN|JJ': 50, 'VB|NN': 51, 'RBR|JJR': 52, 'NN|NNS': 53, 'JJ|RB': 54}
4.2 Statistics
import pandas as pd
import numpy as np

num_tags = len(tag2id)
num_words = len(word2id)

# transition probability matrix A
transit_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_tags)),
    columns=tag2id.keys(),
    index=tag2id.keys()
)
# emission probability matrix B
emission_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_words)),
    columns=word2id.keys(),
    index=tag2id.keys()
)
# initial probability vector, pi
init_probs = pd.Series(
    data=np.zeros(shape=(num_tags, )),
    index=tag2id.keys()
)

with open(tag_filepath, "r") as f:
    tag_prev = ""
    for line in f.readlines():
        word, tag = line.strip().split("/")
        word = word.lower()
        emission_probs.loc[tag, word] += 1
        # at the beginning of a sentence, count towards the initial probability
        if not tag_prev:
            init_probs[tag] += 1
        # otherwise count the transition from the previous tag
        else:
            transit_probs.loc[tag_prev, tag] += 1
        # at the end of a sentence, reset tag_prev
        if word == ".":
            tag_prev = ""
        else:
            tag_prev = tag

# normalization: pi sums to 1, each row of A and B sums to 1
init_probs /= init_probs.sum()
partition = transit_probs.sum(axis="columns")
transit_probs = transit_probs.apply(
    func=lambda x: x / partition, axis="index"
)
transit_probs[transit_probs.isna()] = 0
partition = emission_probs.sum(axis="columns")
emission_probs = emission_probs.apply(
    func=lambda x: x / partition, axis="index"
)
print("initial probability vector:")
print(init_probs)
print("transition probability matrix, A:")
print(transit_probs.head())
print("emission probability matrix, B:")
print(emission_probs.head())
initial probability vector:
NNP 0.181324
, 0.000000
VBG 0.010005
TO 0.003335
VB 0.003953
NN 0.036808
IN 0.111660
JJ 0.036685
VBD 0.000618
NNS 0.038167
CD 0.008770
CC 0.051877
PRP 0.060277
MD 0.000247
DT 0.217268
. 0.000000
VBZ 0.001482
VBN 0.006052
WDT 0.000865
VBP 0.000247
POS 0.000000
RB 0.047307
$ 0.000000
PRP$ 0.007164
: 0.001729
JJR 0.002100
`` 0.075346
'' 0.063612
WP 0.002594
JJS 0.001853
WRB 0.005929
RBR 0.001976
NNPS 0.002841
RP 0.000000
WP$ 0.000000
EX 0.002717
( 0.005929
) 0.005929
PDT 0.000988
RBS 0.000371
FW 0.000124
UH 0.000000
SYM 0.000000
LS 0.001853
# 0.000000
VBG|NN 0.000000
JJ|NN 0.000000
RB|IN 0.000000
NNS|NN 0.000000
VBN|JJ 0.000000
VB|NN 0.000000
RBR|JJR 0.000000
NN|NNS 0.000000
JJ|RB 0.000000
dtype: float64
transition probability matrix, A:
NNP , VBG TO VB NN IN \
NNP 0.379116 0.141891 0.001290 0.008258 0.000981 0.052803 0.042738
, 0.132745 0.000000 0.045207 0.008427 0.003767 0.046495 0.078021
VBG 0.037096 0.014268 0.004756 0.089093 0.001585 0.136018 0.135701
TO 0.042640 0.000643 0.006214 0.000000 0.579387 0.027855 0.005357
VB 0.034148 0.017973 0.016175 0.039720 0.006470 0.063803 0.117362
JJ VBD NNS ... # VBG|NN JJ|NN RB|IN \
NNP 0.008723 0.067255 0.023330 ... 0.000 0.0 0.000000 0.0
, 0.044116 0.052245 0.027759 ... 0.000 0.0 0.000000 0.0
VBG 0.071972 0.002536 0.089410 ... 0.000 0.0 0.000317 0.0
TO 0.034069 0.000000 0.023355 ... 0.003 0.0 0.000000 0.0
VB 0.088426 0.001438 0.045830 ... 0.000 0.0 0.000000 0.0
NNS|NN VBN|JJ VB|NN RBR|JJR NN|NNS JJ|RB
NNP 0.0 0.0 0.000000 0.0 0.00000 0.0
, 0.0 0.0 0.000000 0.0 0.00000 0.0
VBG 0.0 0.0 0.000000 0.0 0.00000 0.0
TO 0.0 0.0 0.000214 0.0 0.00000 0.0
VB 0.0 0.0 0.000000 0.0 0.00018 0.0
[5 rows x 54 columns]
emission probability matrix, B:
newsweek , trying to keep pace with rival \
NNP 0.000516 0.000000 0.000000 0.000000 0.000000 0.000103 0.0 0.0
, 0.000000 0.999802 0.000000 0.000000 0.000000 0.000000 0.0 0.0
VBG 0.000000 0.000000 0.014902 0.000000 0.000000 0.000000 0.0 0.0
TO 0.000000 0.000000 0.000000 0.999786 0.000000 0.000000 0.0 0.0
VB 0.000000 0.000000 0.000000 0.000000 0.005212 0.000000 0.0 0.0
time magazine ... platform trillion-dollar amaury souza \
NNP 0.000877 0.000103 ... 0.0 0.0 0.000052 0.000052
, 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
VBG 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
TO 0.000000 0.000000 ... 0.0 0.0 0.000000 0.000000
VB 0.000180 0.000000 ... 0.0 0.0 0.000000 0.000000
valiant mailson ferreira nobrega endured massively
NNP 0.0 0.000052 0.000052 0.000052 0.0 0.0
, 0.0 0.000000 0.000000 0.000000 0.0 0.0
VBG 0.0 0.000000 0.000000 0.000000 0.0 0.0
TO 0.0 0.000000 0.000000 0.000000 0.0 0.0
VB 0.0 0.000000 0.000000 0.000000 0.0 0.0
[5 rows x 17224 columns]
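As a quick sanity check on these estimates (a small sketch that assumes the init_probs, transit_probs and emission_probs DataFrames from the code above are still in scope): pi should sum to 1, and every row of A and B that received any counts should sum to 1.

print(np.isclose(init_probs.sum(), 1.0))
row_sums = transit_probs.sum(axis="columns")
print(np.allclose(row_sums[row_sums > 0], 1.0))
row_sums = emission_probs.sum(axis="columns")
print(np.allclose(row_sums[row_sums > 0], 1.0))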
4.3 Viterbi
from nltk.tokenize import word_tokenize

EPSILON = 1e-5

def log(x):
    """
    smoothed log: the additive constant avoids log(0)
    """
    return np.log(x + EPSILON)

def viterbi(sent, init_probs, transit_probs, emission_probs):
    """
    viterbi decoding
    """
    tags = init_probs.index
    vocab = emission_probs.columns
    # log probabilities, to prevent numerical underflow
    init_log_probs = log(init_probs)
    transit_log_probs = log(transit_probs)
    emission_log_probs = log(emission_probs)
    # tokenize and drop words that are not in the vocabulary
    words = [word.lower() for word in word_tokenize(sent)]
    words = [word for word in words if word in vocab]
    len_sent = len(words)
    num_tags = len(tags)
    # viterbi trellis: the best path score ending at each state at each time step
    trellis = pd.DataFrame(
        data=np.zeros(shape=(num_tags, len_sent)),
        index=tags
    )
    # back trace: the previous state on that best path
    back_trace = pd.DataFrame(
        data=np.empty(shape=(num_tags, len_sent - 1), dtype=object),
        index=tags
    )
    # dynamic programming
    # initial states: delta_0(j) = log p(y_0 | z_0 = j) + log p(z_0 = j)
    word = words[0]
    trellis.loc[:, 0] = emission_log_probs[word] + init_log_probs
    for idx_word in range(1, len_sent):
        # delta_k(j), the score of the best path ending at state j at time k
        # delta_k(j) = log p(y_k | z_k = j)
        #              + max_i [ delta_{k-1}(i) + log p(z_k = j | z_{k-1} = i) ]
        word = words[idx_word]
        for tag in tags:
            # transit_log_probs[tag] is the column of transitions *into* `tag`,
            # since rows of A were counted as the previous tag
            scores = transit_log_probs[tag] + trellis.loc[:, idx_word - 1]
            # trellis
            trellis.loc[tag, idx_word] = scores.max() + emission_log_probs.loc[tag, word]
            # back trace
            back_trace.loc[tag, idx_word - 1] = scores.idxmax()
    # follow the back pointers from the best final state
    tag = trellis.iloc[:, -1].idxmax()
    best_path = [tag]
    for idx in range(len_sent - 2, -1, -1):
        tag = back_trace.loc[tag, idx]
        best_path.insert(0, tag)
    return best_path, words
if __name__ == "__main__":
    sent = "Newsweek, trying to keep pace with rival Time magzine, announced new advertising rates for 1990 and said it will introduce a new incentive plan for advertisers."
    best_path, words = viterbi(sent, init_probs, transit_probs, emission_probs)
    print(" ".join(words))
    print(" ".join(best_path))
newsweek , trying to keep pace with rival time , announced new advertising rates for 1990 and said it will introduce a new incentive plan for advertisers .
NNP , VBG TO VB NN IN NN NN , VBD NNP VBG NNS IN CD CC VBD PRP MD NN DT JJ IN NN IN NNS .
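For readability, the two parallel lines of output can be zipped into word/tag pairs (a small follow-up sketch using the words and best_path values returned by viterbi above):

for word, tag in zip(words, best_path):
    print("{}/{}".format(word, tag), end=" ")
print()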