Studying run_squad.py: the BERT fine-tuning code

程序员文章站 2022-05-14 15:57:40

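About run_squad.py

The BERT fine-tuning script for machine reading comprehension. Its main jobs:

Read SQuAD-format data as training examples
Run the pretrained BERT model and take the output of its last layer, which serves as contextual word vectors much like ELMo embeddings
Feed that output into a single fully connected layer
Output a tuple (start_logits, end_logits) that scores each token as the start or the end of the answer span

Module-by-module walkthrough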

SquadExample

A single training/test example for simple sequence classification.
For examples without an answer, the start and end position are -1.

__str__ calls __repr__; it produces the output of print(object)
__repr__ builds the string by concatenating the fields

InputFeatures

A single set of features of data.

read_squad_examples

Read a SQuAD json file into a list of SquadExample.

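Files are handled through tf.gfile. The JSON file is read level by level: data -> entry -> paragraphs -> context/qas (-> id, question, answers: text, answer_start); SQuAD 2.0 additionally has an is_impossible field. Each sample is then stored as a SquadExample object. One question is one sample, so the same paragraph can appear in several samples.
Details of the processing:

char_to_word_offset maps every character offset to a word index; together with the answer text and answer_start it gives the start and end word positions of the answer (the positions the model is trained to output)
tokenization.whitespace_tokenize splits the original answer on whitespace (\s); the script then checks whether the answer can be recovered from the document and skips the example if it cannot (this guards against weird Unicode stuff)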

paragraph_text = 'This is a test, good luck!\r'
doc_tokens = ['This', 'is', 'a', 'test,', 'good', 'luck!']
char_to_word_offset = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2,
                       3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5]
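
The example above can be reproduced with a few lines of plain Python. This is a minimal sketch of the whitespace-splitting loop, not a copy of the script: the function name build_char_to_word_offset is made up for illustration, and the whitespace set is assumed to match the one run_squad.py uses.

```python
def is_whitespace(c):
    # Assumed to be the same character set run_squad.py treats as whitespace.
    return c in (" ", "\t", "\r", "\n") or ord(c) == 0x202F


def build_char_to_word_offset(paragraph_text):
    """Split on whitespace and record, per character, the index of its word."""
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in paragraph_text:
        if is_whitespace(c):
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)   # first character of a new word
            else:
                doc_tokens[-1] += c    # extend the current word
            prev_is_whitespace = False
        # A whitespace character gets the index of the word before it.
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


paragraph_text = "This is a test, good luck!\r"
doc_tokens, char_to_word_offset = build_char_to_word_offset(paragraph_text)

# Turn a character-based answer annotation into word positions.
answer_text = "good luck"
answer_start = paragraph_text.index(answer_text)
start_position = char_to_word_offset[answer_start]
end_position = char_to_word_offset[answer_start + len(answer_text) - 1]
```

With the answer "good luck", start_position and end_position come out as word indices 4 and 5, which is the kind of target the model is trained on.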

convert_examples_to_features

Loads a data file into a list of InputBatchs.

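Converts each sample from a SquadExample into InputFeatures.

Tokenize each sample's question_text with tokenizer.tokenize.
Check every question against max_query_length and truncate it if it is longer
Run tokenizer.tokenize on every word of the document text
doc_span: a document that exceeds the maximum length is cut into several spans with a sliding window
Join question and document as [CLS] + query + [SEP] + context + [SEP]
input_ids: tokenizer.convert_tokens_to_ids(tokens) replaces each token with its vocabulary id
input_mask: 1 for real tokens, 0 for padding
segment_ids: 0 for question tokens, 1 for document tokens
output_fn(feature) runs the callback, whose main job is to write the feature out

input_ids, input_mask and segment_ids are all padded with 0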
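
The feature-building steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the script's code: the names make_doc_spans and build_inputs and the tiny vocab dict are hypothetical, a dict lookup stands in for tokenizer.convert_tokens_to_ids, and the input is assumed to be laid out question first ([CLS] question [SEP] context [SEP]) with segment id 0 for the question and 1 for the context.

```python
def make_doc_spans(num_doc_tokens, max_tokens_for_doc, doc_stride):
    """Slide a window over the document; returns (start, length) pairs."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break                      # the window reached the end of the document
        start += min(length, doc_stride)
    return spans


def build_inputs(query_tokens, span_tokens, vocab, max_seq_length):
    """Assemble [CLS] query [SEP] context [SEP] and pad everything with 0."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # question side
    tokens += span_tokens + ["[SEP]"]
    segment_ids += [1] * (len(span_tokens) + 1)    # context side

    input_ids = [vocab[t] for t in tokens]         # stand-in for convert_tokens_to_ids
    input_mask = [1] * len(input_ids)              # 1 marks a real token

    padding = [0] * (max_seq_length - len(input_ids))
    return tokens, input_ids + padding, input_mask + padding, segment_ids + padding


# Hypothetical toy vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "who": 3, "?": 4, "steve": 5, "smith": 6}
tokens, input_ids, input_mask, segment_ids = build_inputs(
    ["who", "?"], ["steve", "smith"], vocab, max_seq_length=8)
```

Each doc span produces its own InputFeatures, so one long paragraph turns into several features that share the same question.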

_improve_answer_span

Returns tokenized answer spans that better match the annotated answer.

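Mainly handles cases where tokenization turns (1895-1943) into ( 1895 - 1943 ), so that an annotated answer such as 1895 can be matched as an exact token span.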
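
A minimal sketch of the idea: search inside the coarse span for the sub-span whose tokens equal the tokenized answer. The regex-based toy_tokenize is a hypothetical stand-in for tokenizer.tokenize (it only splits off punctuation), and the function signature is simplified for illustration.

```python
import re


def toy_tokenize(text):
    # Hypothetical stand-in for tokenizer.tokenize: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def improve_answer_span(doc_tokens, input_start, input_end, orig_answer_text,
                        tokenize=toy_tokenize):
    """Shrink [input_start, input_end] to the sub-span equal to the tokenized answer."""
    tok_answer_text = " ".join(tokenize(orig_answer_text))
    for new_start in range(input_start, input_end + 1):
        for new_end in range(input_end, new_start - 1, -1):
            if " ".join(doc_tokens[new_start:new_end + 1]) == tok_answer_text:
                return new_start, new_end
    return input_start, input_end      # no tighter match found, keep the original span


doc_tokens = toy_tokenize("born (1895-1943).")
# The whitespace-delimited word "(1895-1943)." covers sub-tokens 1..6.
span = improve_answer_span(doc_tokens, 1, 6, "1895")
```

Here span narrows from (1, 6) to (2, 2), the single token 1895.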

_check_is_max_context

Check if this is the ‘max context’ doc span for the token.

When the sliding-window approach is used, the same token can appear in several spans:

Doc: the man went to the store and bought a gallon of milk
Span A: the man went to the
Span B: to the store and bought
Span C: and bought a gallon of

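The goal is to give each word its maximum context. For example, bought has 4 tokens of left context and 0 of right context in span B, but 1 of left context and 3 of right context in span C, so span C is chosen (the span with the larger min(left, right) wins).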
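
The selection rule can be sketched like this. It is a simplified illustration: the DocSpan tuple and the function name are assumptions, and the small 0.01 * length term is included as a tie-breaker in favour of longer spans.

```python
import collections

DocSpan = collections.namedtuple("DocSpan", ["start", "length"])


def check_is_max_context(doc_spans, cur_span_index, position):
    """True if cur_span_index is the span giving `position` the most context."""
    best_score, best_span_index = None, None
    for span_index, span in enumerate(doc_spans):
        end = span.start + span.length - 1
        if position < span.start or position > end:
            continue                              # token is not inside this span
        num_left = position - span.start
        num_right = end - position
        score = min(num_left, num_right) + 0.01 * span.length
        if best_score is None or score > best_score:
            best_score, best_span_index = score, span_index
    return cur_span_index == best_span_index


# Spans A, B, C from the example; "bought" is token 7 of the document.
spans = [DocSpan(0, 5), DocSpan(3, 5), DocSpan(6, 5)]
in_b = check_is_max_context(spans, 1, 7)
in_c = check_is_max_context(spans, 2, 7)
```

For bought, in_b is False and in_c is True, matching the choice of span C.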

create_model

Create a classification model

Bert fine tuning:

model = modeling.BertModel(
    config=bert_config,
    is_training=is_training,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=use_one_hot_embeddings)
# Per-token output vectors (run_classifier.py uses model.get_pooled_output()
# instead, which is a single sentence-level vector per example)
final_hidden = model.get_sequence_output()

# Output shape is (batch_size, seq_length, hidden_size)
final_hidden_shape = modeling.get_shape_list(final_hidden, expected_rank=3)
batch_size = final_hidden_shape[0]
seq_length = final_hidden_shape[1]
hidden_size = final_hidden_shape[2]

# Create the weights and bias variables
output_weights = tf.get_variable(
    "cls/squad/output_weights", [2, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
    "cls/squad/output_bias", [2], initializer=tf.zeros_initializer())

final_hidden_matrix = tf.reshape(final_hidden,
                                 [batch_size * seq_length, hidden_size])

# Fully connected layer: matmul + bias
logits = tf.matmul(final_hidden_matrix, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)

# Reshape back to per-sequence form, then split into the two outputs
logits = tf.reshape(logits, [batch_size, seq_length, 2])
logits = tf.transpose(logits, [2, 0, 1])

unstacked_logits = tf.unstack(logits, axis=0)

# Model outputs
(start_logits, end_logits) = (unstacked_logits[0], unstacked_logits[1])
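
To see what this layer computes without TensorFlow, here is a plain-Python sketch for a single example. It is only an illustration of the arithmetic: the function name span_logits and the toy numbers are made up, and nested lists stand in for tensors.

```python
def span_logits(final_hidden, output_weights, output_bias):
    """final_hidden: [seq_length][hidden_size] for one example.
    output_weights: [2][hidden_size]; output_bias: [2].
    Returns one start score and one end score per token."""
    start_logits, end_logits = [], []
    for token_vector in final_hidden:
        # Same as matmul(..., transpose_b=True) + bias for this one token.
        scores = [
            sum(w * h for w, h in zip(weight_row, token_vector)) + b
            for weight_row, b in zip(output_weights, output_bias)
        ]
        start_logits.append(scores[0])
        end_logits.append(scores[1])
    return start_logits, end_logits


# Toy values: 3 tokens, hidden_size 2.
final_hidden = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
output_weights = [[1.0, 0.0], [0.0, 1.0]]
output_bias = [0.5, -0.5]
start_logits, end_logits = span_logits(final_hidden, output_weights, output_bias)
```

The same two weight rows are applied to every token, so the head adds only 2 * hidden_size + 2 parameters on top of BERT.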

model_fn_builder

Returns model_fn closure for TPUEstimator.

model_fn

The model_fn for TPUEstimator.

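Building the model function means implementing model_fn(features, labels, mode, params)
In general this function has to handle the three mode values (TRAIN, EVAL, PREDICT) and return an instance of tf.estimator.EstimatorSpec
tf.estimator.ModeKeys.TRAIN: mainly returns loss and train_op (the optimizer step)
tf.estimator.ModeKeys.PREDICT: mainly returns the predictions
tf.estimator.ModeKeys.EVAL: mainly returns loss and eval_metrics=[metric functions]
Note that run_squad.py itself only implements the TRAIN and PREDICT branches and raises a ValueError for any other mode.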
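
For the TRAIN branch, the loss can be sketched as the average of two cross-entropy terms, one over start positions and one over end positions. The version below handles a single example in plain Python rather than a batch in TensorFlow; the function names are illustrative.

```python
import math


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_sum for x in logits]


def span_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy for one example."""
    start_loss = -log_softmax(start_logits)[start_position]
    end_loss = -log_softmax(end_logits)[end_position]
    return (start_loss + end_loss) / 2.0


uniform = span_loss([0.0] * 4, [0.0] * 4, start_position=1, end_position=2)
confident = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], 1, 2)
```

With uniform logits over 4 tokens the loss equals log(4); it drops as the logits at the correct positions grow.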

input_fn_builder

Creates an input_fn closure to be passed to TPUEstimator.

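Builds the model's input function: it reads the TFRecord file, deserializes each record back into a sample, shuffles the dataset when training, and returns the samples in batches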

_decode_record

Decodes a record to a TensorFlow example.

input_fn

The actual input function.

write_predictions

Write final predictions to the json file and log-odds of null if needed.

get_final_text

Project the tokenized prediction back to the original text.

pred_text = steve smith
orig_text = Steve Smith's

_strip_spaces
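
get_final_text uses this helper to compare two strings with their spaces removed while remembering where each remaining character came from. A minimal sketch, with a plain dict as the index map (the function name without the leading underscore is just for illustration):

```python
def strip_spaces(text):
    """Remove spaces and map each kept character back to its original index."""
    ns_chars = []
    ns_to_s_map = {}
    for i, c in enumerate(text):
        if c == " ":
            continue
        ns_to_s_map[len(ns_chars)] = i   # position without spaces -> original position
        ns_chars.append(c)
    return "".join(ns_chars), ns_to_s_map


ns_text, ns_to_s_map = strip_spaces("Steve Smith's")
```

The map makes it possible to find a match in the space-free text and then translate its boundaries back to character offsets in the original text.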

_get_best_indexes

Get the n-best logits from a list.
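
A minimal sketch of the same idea: sort the positions by logit, highest first, and keep the first n_best_size indexes.

```python
def get_best_indexes(logits, n_best_size):
    """Indexes of the n_best_size largest logits, best first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return [index for index, _ in ranked[:n_best_size]]


best = get_best_indexes([0.1, 2.0, -1.0, 1.5], n_best_size=2)
```

write_predictions applies this to start_logits and end_logits separately and then combines the candidate start and end indexes into answer spans.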

_compute_softmax

Compute softmax probability over raw logits.
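
A sketch in plain Python, subtracting the maximum score first so that large logits do not overflow:

```python
import math


def compute_softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    if not scores:
        return []
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


probs = compute_softmax([2.0, 1.0, 0.1])
```

In write_predictions the scores are the summed start and end logits of the n-best candidate spans, so the probabilities are relative to that candidate list only.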

FeatureWriter

Writes InputFeature to TF example file

Stores the features in a temporary file.

process_feature

Write a InputFeature to the TFRecordWriter as a tf.train.Example.

create_int_feature
Defines a feature.

features = collections.OrderedDict()
features['key'] = create_int_feature(value)
#...

# Build an Example made of several features; each feature is a key-value pair
tf_example = tf.train.Example(features=tf.train.Features(feature=features))

# Serialize the sample (to a binary string, not compressed) and write it to the TFRecord file
self._writer.write(tf_example.SerializeToString())

validate_flags_or_throw

Validate the input FLAGS or throw an exception

Checks the command-line arguments and raises an exception when they are invalid.
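
The kind of checks involved can be sketched as below. This is a simplified illustration, not the script's exact code: a SimpleNamespace stands in for FLAGS, max_position_embeddings is passed directly instead of being read from the BERT config, and the error messages are made up.

```python
from types import SimpleNamespace


def validate_flags_or_throw(flags, max_position_embeddings):
    """Raise ValueError if the argument combination cannot work."""
    if not flags.do_train and not flags.do_predict:
        raise ValueError("At least one of do_train or do_predict must be True.")
    if flags.do_train and not flags.train_file:
        raise ValueError("do_train requires train_file.")
    if flags.do_predict and not flags.predict_file:
        raise ValueError("do_predict requires predict_file.")
    if flags.max_seq_length > max_position_embeddings:
        raise ValueError("max_seq_length exceeds what the pretrained model supports.")
    # Room is needed for [CLS], two [SEP] tokens and at least one context token.
    if flags.max_seq_length <= flags.max_query_length + 3:
        raise ValueError("max_seq_length must be greater than max_query_length + 3.")


# Hypothetical argument values, for illustration only.
flags = SimpleNamespace(do_train=True, do_predict=False, train_file="train.json",
                        predict_file=None, max_seq_length=384, max_query_length=64)
validate_flags_or_throw(flags, max_position_embeddings=512)
```

Failing early here is cheaper than discovering a bad sequence length after the features have already been written.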

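Execution flow and call relationships

How the main functions call each other. The source code does a lot of detailed work in the prediction path, which is omitted here.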

Estimator

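model_fn_builder calls create_model to build the network (fine tuning)
- Loads the checkpoint, then trains or predicts depending on tf.estimator.ModeKeys; training defines the loss function and the optimizer, prediction returns the predicted results directly
estimator = tf.contrib.tpu.TPUEstimator(model_fn): the network model is passed in at construction
Training: estimator.train(train_input_fn)
Prediction: estimator.predict(predict_input_fn)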

FLAGS.do_train

train_examples = read_squad_examples(): reads the sample data and returns SquadExample objects
train_writer = FeatureWriter(): the feature writer
- tf.train.Example()
convert_examples_to_features(train_examples, train_writer.process_feature)
- Converts SquadExample objects into InputFeatures objects
- Uses the callback train_writer.process_feature to write each converted InputFeatures to the file
train_input_fn = input_fn_builder(train_writer.filename): the feature reader used for training
- Parses the features stored in the file, including unique_ids, input_ids, input_mask, segment_ids
- Also creates and reads the tf.data.TFRecordDataset()
estimator.train(train_input_fn): trains using the feature reader

FLAGS.do_predict

eval_examples = read_squad_examples(): reads the test data and returns SquadExample objects
eval_writer = FeatureWriter(): the feature writer
convert_examples_to_features(eval_examples, tokenizer, append_feature)
- eval_examples
- tokenizer
- append_feature: the callback; it keeps each feature in eval_features (needed later to decode the predictions) and writes it to the file with eval_writer.process_feature
predict_input_fn = input_fn_builder(eval_writer.filename): the feature reader used for prediction
estimator.predict(predict_input_fn): predicts using the feature reader
write_predictions(eval_examples, eval_features, all_result): decodes the predictions and saves them to a file

About

create_model(bert_config, is_training, input_ids, input_mask, segment_ids)
- is_training: True during do_train, otherwise False
- input_ids, input_mask, segment_ids: the feature vectors fed into the model

TensorFlow notes

tf.flags.DEFINE_xxx

Used to add command-line arguments

# Define a flag
tf.flags.DEFINE_string("strParam", value, "This is a string param")
# also: tf.flags.DEFINE_bool / DEFINE_integer / DEFINE_float

# Read the flag
tf.flags.FLAGS.strParam

# Override the value on the command line
# python file.py --strParam strname

tf.gfile

TensorFlow's file operations, including but not limited to:

tf.gfile.MakeDirs(FLAGS.output_dir)
with tf.gfile.Open(file, 'r') as reader:

Miscellaneous

Raising an exception: raise ValueError("This causes a value error.")
