
GraphSage Code Reading Notes (TensorFlow version)


1. Paper and code links

Paper: Inductive Representation Learning on Large Graphs

Code: the GraphSage reference implementation (TensorFlow)

2. File structure

1. Directory layout
├── eval_scripts   // evaluation scripts
├── example_data   // the PPI example dataset
└── graphsage      // model definitions, GCN layer definition, ...

2. eval_scripts // evaluation scripts
├──citation_eval.py
├──ppi_eval.py
└──reddit_eval.py

3. example_data // the PPI example dataset
├── toy-ppi-class_map.json // maps graph node ids to classes
├── toy-ppi-feats.npy      // pretrained node features
├── toy-ppi-G.json         // the graph itself
├── toy-ppi-walks.txt      // random-walk co-occurrences from each node to its neighbours, 198 per node
└── toy-ppi-id_map.json    // one-to-one mapping between node ids and matrix indices

4. graphsage // model definitions
├── __init__.py           // package init / module imports
├── aggregators.py        // aggregation function definitions
├── inits.py              // shared initialization helpers
├── layers.py             // GCN layer definitions
├── metrics.py            // evaluation metric computation
├── minibatch.py          // minibatch iterator definitions
├── models.py             // model architectures
├── neigh_samplers.py     // samplers that draw from a node's neighbours
├── prediction.py         // edge (link) prediction layer
├── supervised_models.py  // supervised model definition
├── supervised_train.py   // supervised training script
├── unsupervised_train.py // unsupervised training script
└── utils.py              // utility functions

3. Data parsing

3.1 Dataset statistics tables

Table 1: dataset statistics (image in the original post)
Table 2: dataset statistics (image in the original post)
Note: Table 1 is taken from https://blog.csdn.net/yyl424525/article/details/102966617

3.2 Data files

1. toy-ppi-G.json // the graph itself
The dataset contains a single graph, which is used for a node classification task.
The graph is undirected and consists of a "nodes" list and a "links" list; every node and every link is stored as a dictionary. (A loading sketch is given at the end of this subsection.)
Note: the data-format figures in the original post are taken from https://blog.csdn.net/yyl424525/article/details/102966617

2. toy-ppi-class_map.json // maps graph node ids to classes. Format: {"0": [1, 0, 0, ...], ..., "14754": [1, 1, 0, 0, ...]}
3. toy-ppi-id_map.json // one-to-one mapping between node ids and matrix indices.
Format: {"0": 0, "1": 1, ..., "14754": 14754}
4. toy-ppi-feats.npy // pretrained node features

  • A numpy array of node features, in the order given by id_map.json. It can be omitted, in which case only identity features are used.

5. toy-ppi-walks.txt // random-walk co-occurrence pairs (see the directory listing above)
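The following is a minimal sketch (my own, not from the repo) for inspecting these files; it assumes they sit under ./example_data/ and that networkx <= 1.11 is installed:

import json
import numpy as np
from networkx.readwrite import json_graph

prefix = "./example_data/toy-ppi"

G = json_graph.node_link_graph(json.load(open(prefix + "-G.json")))  # the graph
id_map = json.load(open(prefix + "-id_map.json"))                    # {"0": 0, ..., "14754": 14754}
class_map = json.load(open(prefix + "-class_map.json"))              # {"0": [1, 0, ...], ...}
feats = np.load(prefix + "-feats.npy")                               # (num_nodes, num_features)

print(G.number_of_nodes(), G.number_of_edges())  # node and edge counts (ids run from 0 to 14754)
print(feats.shape)
print(class_map["0"][:10])                       # first few labels of node 0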

4. Running the code: environment setup and configuration

4.1 Environment setup

1. The following package versions can be installed with pip install -r requirements.txt:

absl-py==0.2.2
astor==0.6.2
backports.weakref==1.0.post1
bleach==1.5.0
decorator==4.3.0
enum34==1.1.6
funcsigs==1.0.2
futures==3.1.0
gast==0.2.0
grpcio==1.12.1
html5lib==0.9999999
Markdown==2.6.11
mock==2.0.0
networkx==1.11
numpy==1.14.5
pbr==4.0.4
protobuf==3.6.0
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
Werkzeug==0.14.1

Note:

  • networkx must be version 1.11 or lower, otherwise loading the data fails (see the version assertion in utils.py).

  • Docker setup:
    Install Docker first if it is not already available.
    GraphSage can also be run inside a [docker](https://docs.docker.com/) image. After cloning the project, build and run the image as follows:
    $ docker build -t graphsage .
    $ docker run -it graphsage bash
    A GPU image can be run with [nvidia-docker](https://github.com/NVIDIA/nvidia-docker):
    $ docker build -t graphsage:gpu -f Dockerfile.gpu .
    $ nvidia-docker run -it graphsage:gpu bash

4.2 Running the code

1. Run unsupervised_train.py from the command line.

  • unsupervised_train.py trains with a loss built from nodes and their neighbourhood (edge) information; after training it can output node embeddings. It uses EdgeMinibatchIterator.
python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
# Reference: https://blog.csdn.net/yyl424525/article/details/102966617

Possible values for model:

* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)

2. Run supervised_train.py; note that the train_prefix argument again needs to point to the dataset: …/example_data/toy-ppi

python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid


5. Source file walkthrough

5.1 __init__.py

from __future__ import print_function
'''Even in Python 2.x, print must then be used with parentheses, as in Python 3.x.'''

from __future__ import division
'''Imports the "true division" feature from future Python versions:
without this import, "/" performs truncating division on integers;
with it, "/" is true division and "//" is truncating division.'''
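A quick illustration of the division behaviour (plain Python 2 semantics, not specific to GraphSage):

# without the import:  7 / 2  == 3    (truncating division)
# with the import:     7 / 2  == 3.5  (true division)
#                      7 // 2 == 3    (truncating division)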

5.2 unsupervised_train.py

# _*_ coding:UTF-8
# supervised_train.py trains with the node-classification labels as the loss; it cannot output node embeddings and uses NodeMinibatchIterator.

# unsupervised_train.py trains with a loss built from nodes and their neighbourhood (edge) information; after training it can output node embeddings. It uses EdgeMinibatchIterator.

from __future__ import division
'''Imports the "true division" feature from future Python versions:
without this import, "/" performs truncating division on integers;
with it, "/" is true division and "//" is truncating division.'''

from __future__ import print_function
'''Even in Python 2.x, print must then be used with parentheses, as in Python 3.x.'''

import os          # operating-system utilities
import time        # timing
import tensorflow as tf
import numpy as np

from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
'''If the machine has several GPUs, TensorFlow uses all of them by default.
To use only some of them, restrict visibility with the CUDA_VISIBLE_DEVICES environment variable.'''

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  # enumerate GPUs in PCI bus order, starting from 0

'''Set random seed: with the same seed the same random numbers are generated on every run; without it they differ from run to run.'''
seed = 123
np.random.seed(seed)       # seed numpy's random number generator
tf.set_random_seed(seed)   # seed TensorFlow's graph-level random number generator

# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS  # FLAGS acts as the command-line parser, so parameters can be passed in from outside, e.g. python train.py --model gcn

tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")  # boolean flag
# core params: defined here, passed in on the command line
flags.DEFINE_string('model', 'graphsage', 'model names. See README for possible values.')  # which model to run
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'name of the object file that stores the training data. must be specified.')

# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')  # number of training epochs
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')  # dropout reduces overfitting by randomly dropping a fraction of units
# the loss adds weight decay (L2 regularization): self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')  # weight decay keeps the weights small and reduces overfitting
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')  # adjacency lists are truncated or padded to this maximum node degree

flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')  # neighbours sampled at depth k=1 (S1 = 25)
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')  # neighbours sampled at depth k=2 (S2 = 10)

# with concat, the output dimension is doubled
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')  # number of negative samples
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')  # minibatch size
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')  # extra SGD epochs for the n2v baseline
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')  # set to a positive value to learn per-node identity embeddings of that dimension; 0 disables them

# logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')  # whether to save embeddings after training
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')  # base directory for logs and saved embeddings
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")  # run a validation minibatch every this many iterations
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")  # validation batch size
flags.DEFINE_integer('gpu', 1, "which gpu to use.")  # which GPU to use; with a single GPU, change 1 to 0
flags.DEFINE_integer('print_every', 50, "How often to print training info.")  # how often to print training info
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")  # maximum total number of iterations

os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)  # make only the selected GPU visible

GPU_MEM_FRACTION = 0.8  # fraction of GPU memory the process is allowed to use

def log_dir():  # build the directory used for logs and saved embeddings
    log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]
    log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(
            model=FLAGS.model,
            model_size=FLAGS.model_size,
            lr=FLAGS.learning_rate)
    if not os.path.exists(log_dir):  # create the directory if it does not exist
        os.makedirs(log_dir)
    return log_dir
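# For example, with the command from section 4.2 (train_prefix ./example_data/toy-ppi,
# model graphsage_mean, default learning_rate 0.00001 and base_log_dir '.'),
# train_prefix.split("/")[-2] == "example_data", so logs and embeddings go to:
#   ./unsup-example_data/graphsage_mean_small_0.000010/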

# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
    t_test = time.time()
    feed_dict_val = minibatch_iter.val_feed_dict(size)  # build the feed_dict for one validation minibatch
    outs_val = sess.run([model.loss, model.ranks, model.mrr], 
                        feed_dict=feed_dict_val)         # run loss, ranks and MRR on that minibatch
    return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)  # loss, ranks, MRR, elapsed time
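# (As I read the model code, ranks and mrr come from the unsupervised link-prediction
#  objective: mrr is the mean reciprocal rank of the true co-occurring node when it is
#  ranked against the negative samples.)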

def incremental_evaluate(sess, model, minibatch_iter, size):  # evaluate over the whole validation set, one minibatch at a time
    t_test = time.time()
    finished = False
    val_losses = []
    val_mrrs = []
    iter_num = 0
    while not finished:
        feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.ranks, model.mrr], 
                            feed_dict=feed_dict_val)
        val_losses.append(outs_val[0])
        val_mrrs.append(outs_val[2])
    return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)

def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):  # save the validation-set embeddings
    val_embeddings = []
    finished = False
    seen = set([])
    nodes = []
    iter_num = 0
    name = "val"
    while not finished:
        feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.mrr, model.outputs1], 
                            feed_dict=feed_dict_val)
        #ONLY SAVE FOR embeds1 because of planetoid
        for i, edge in enumerate(edges):
            if not edge[0] in seen:
                val_embeddings.append(outs_val[-1][i,:])
                nodes.append(edge[0])
                seen.add(edge[0])
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    val_embeddings = np.vstack(val_embeddings)  # stack the per-node embeddings into one matrix
    np.save(out_dir + name + mod + ".npy",  val_embeddings)
    with open(out_dir + name + mod + ".txt", "w") as fp:
        fp.write("\n".join(map(str,nodes)))  # write the corresponding node ids, one per line
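# A small sketch (not part of the repo) for reading the saved embeddings back:
#   embeds = np.load(log_dir() + "val.npy")                            # one row per validation node
#   node_ids = [line.strip() for line in open(log_dir() + "val.txt")]  # node id for each row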

def construct_placeholders():  # define the TensorFlow placeholders
    # Define placeholders
    placeholders = {
        'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
        'batch2' : tf.placeholder(tf.int32, shape=(None), name='batch2'),
        # negative samples for all nodes in the batch (shared across the batch)
        'neg_samples': tf.placeholder(tf.int32, shape=(None,),
            name='neg_sample_size'),
        'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
        'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
    }
    return placeholders
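# In the unsupervised setting, EdgeMinibatchIterator fills batch1 and batch2 with the two
# endpoints of each sampled edge / random-walk co-occurrence pair; neg_samples holds the
# node ids drawn as negatives for the whole batch.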

def train(train_data, test_data=None):  # training entry point
    G = train_data[0]          # the graph
    features = train_data[1]   # node features for the training data
    id_map = train_data[2]     # node id -> index mapping; nodes missing the 'val'/'test' attributes were already removed in load_data

    if not features is None:
        # append one extra all-zero row to features: it is the feature vector for the
        # padding index (len(id_map)) used in the adjacency lists when a node has
        # fewer than max_degree neighbours
        features = np.vstack([features, np.zeros((features.shape[1],))])

    context_pairs = train_data[3] if FLAGS.random_context else None  # random-walk co-occurrence pairs
    placeholders = construct_placeholders()
    # construct_placeholders() defines:
    # batch1, batch2, neg_samples, dropout, batch_size
    minibatch = EdgeMinibatchIterator(G, 
            id_map,
            placeholders, batch_size=FLAGS.batch_size,
            max_degree=FLAGS.max_degree, 
            num_neg_samples=FLAGS.neg_sample_size,
            context_pairs = context_pairs)
    adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
    adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")
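    # adj_info is a non-trainable Variable initialized from a placeholder, so the (large)
    # adjacency table is fed in once at initialization instead of being baked into the graph
    # definition, and it can later be swapped between minibatch.adj and minibatch.test_adj
    # with tf.assign (see train_adj_info / val_adj_info below).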

    if FLAGS.model == 'graphsage_mean':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)
    elif FLAGS.model == 'gcn':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="gcn",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     concat=False,
                                     logging=True)

    elif FLAGS.model == 'graphsage_seq':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     identity_dim = FLAGS.identity_dim,
                                     aggregator_type="seq",
                                     model_size=FLAGS.model_size,
                                     logging=True)

    elif FLAGS.model == 'graphsage_maxpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="maxpool",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)
    elif FLAGS.model == 'graphsage_meanpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="meanpool",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    elif FLAGS.model == 'n2v':
        model = Node2VecModel(placeholders, features.shape[0],
                                       minibatch.deg,
                                       #2x because graphsage uses concat
                                       nodevec_dim=2*FLAGS.dim_1,
                                       lr=FLAGS.learning_rate)
    else:
        raise Exception('Error: model name unrecognized.')

    config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
    config.gpu_options.allow_growth = True
    # allow_growth makes TensorFlow start with a small amount of GPU memory and grow it as needed
    #config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
    # per_process_gpu_memory_fraction caps the fraction of each GPU's memory the process may take
    # (0.4 would mean 40%)

    config.allow_soft_placement = True  # let TF fall back to an available device if the requested one is missing
    # With "with tf.device('/cpu:0'):" an op can be pinned to a device manually; if that
    # device does not exist or is unavailable, the program would hang or raise an error.
    # allow_soft_placement=True lets TensorFlow pick an existing, usable device instead.

    # Initialize session
    sess = tf.Session(config=config)
    merged = tf.summary.merge_all()  # merge all summaries so training progress and parameter
                                     # distributions can be written to disk and viewed in TensorBoard
    # tf.summary.FileWriter(path, sess.graph) writes the graph to `path`; training summaries
    # are added later with add_summary()
    summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
     
    # Init variables
    sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})
    
    # Train model
    
    train_shadow_mrr = None
    shadow_mrr = None

    total_steps = 0
    avg_time = 0.0
    epoch_val_costs = []

    train_adj_info = tf.assign(adj_info, minibatch.adj)
    val_adj_info = tf.assign(adj_info, minibatch.test_adj)
    for epoch in range(FLAGS.epochs): 
        minibatch.shuffle() 

        iter = 0
        print('Epoch: %04d' % (epoch + 1))
        epoch_val_costs.append(0)
        while not minibatch.end():
            # Construct feed dictionary
            feed_dict = minibatch.next_minibatch_feed_dict()
            feed_dict.update({placeholders['dropout']: FLAGS.dropout})

            t = time.time()
            # Training step
            outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all, 
                    model.mrr, model.outputs1], feed_dict=feed_dict)
            train_cost = outs[2]
            train_mrr = outs[5]
            if train_shadow_mrr is None:
                train_shadow_mrr = train_mrr#
            else:
                train_shadow_mrr -= (1-0.99) * (train_shadow_mrr - train_mrr)
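            # The update above is an exponential moving average with decay 0.99:
            # train_shadow_mrr = 0.99 * train_shadow_mrr + 0.01 * train_mrr
            # (the same smoothing is applied to shadow_mrr for the validation MRR below).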

            if iter % FLAGS.validate_iter == 0:
                # Validation
                sess.run(val_adj_info.op)
                val_cost, ranks, val_mrr, duration  = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)
                sess.run(train_adj_info.op)
                epoch_val_costs[-1] += val_cost
            if shadow_mrr is None:
                shadow_mrr = val_mrr
            else:
                shadow_mrr -= (1-0.99) * (shadow_mrr - val_mrr)

            if total_steps % FLAGS.print_every == 0:
                summary_writer.add_summary(outs[0], total_steps)
    
            # Print results
            avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)

            if total_steps % FLAGS.print_every == 0:
                print("Iter:", '%04d' % iter, 
                      "train_loss=", "{:.5f}".format(train_cost),
                      "train_mrr=", "{:.5f}".format(train_mrr), 
                      "train_mrr_ema=", "{:.5f}".format(train_shadow_mrr), # exponential moving average
                      "val_loss=", "{:.5f}".format(val_cost),
                      "val_mrr=", "{:.5f}".format(val_mrr), 
                      "val_mrr_ema=", "{:.5f}".format(shadow_mrr), # exponential moving average
                      "time=", "{:.5f}".format(avg_time))

            iter += 1
            total_steps += 1

            if total_steps > FLAGS.max_total_steps:
                break

        if total_steps > FLAGS.max_total_steps:
                break
    
    print("Optimization Finished!")
    if FLAGS.save_embeddings:  # whether to save the node embeddings after training
        sess.run(val_adj_info.op)

        save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())

        if FLAGS.model == "n2v":
            # stopping the gradient for the already trained nodes
            train_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],
                    dtype=tf.int32)
            test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']], 
                    dtype=tf.int32)
            update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))
            no_update_nodes = tf.nn.embedding_lookup(model.context_embeds,tf.squeeze(train_ids))
            update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))
            no_update_nodes = tf.stop_gradient(tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))
            model.context_embeds = update_nodes + no_update_nodes
            sess.run(model.context_embeds)

            # run random walks
            from graphsage.utils import run_random_walks
            nodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]
            start_time = time.time()
            pairs = run_random_walks(G, nodes, num_walks=50)
            walk_time = time.time() - start_time

            test_minibatch = EdgeMinibatchIterator(G, 
                id_map,
                placeholders, batch_size=FLAGS.batch_size,
                max_degree=FLAGS.max_degree, 
                num_neg_samples=FLAGS.neg_sample_size,
                context_pairs = pairs,
                n2v_retrain=True,
                fixed_n2v=True)
            
            start_time = time.time()
            print("Doing test training for n2v.")
            test_steps = 0
            for epoch in range(FLAGS.n2v_test_epochs):
                test_minibatch.shuffle()
                while not test_minibatch.end():
                    feed_dict = test_minibatch.next_minibatch_feed_dict()
                    feed_dict.update({placeholders['dropout']: FLAGS.dropout})
                    outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all, 
                        model.mrr, model.outputs1], feed_dict=feed_dict)
                    if test_steps % FLAGS.print_every == 0:
                        print("Iter:", '%04d' % test_steps, 
                              "train_loss=", "{:.5f}".format(outs[1]),
                              "train_mrr=", "{:.5f}".format(outs[-2]))
                    test_steps += 1
            train_time = time.time() - start_time
            save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")
            print("Total time: ", train_time+walk_time)
            print("Walk time: ", walk_time)
            print("Train time: ", train_time)

    
# main: load the data and train
def main(argv=None):
    print("Loading training data..")
    train_data = load_data(FLAGS.train_prefix, load_walks=True)  # load_data is defined in graphsage.utils; it loads the graph, features, id map, walks and labels
    print("Done loading training data..")
    train(train_data)  # train() is defined above: def train(train_data, test_data=None)

if __name__ == '__main__':
    tf.app.run()  # parse the command-line flags, then call main(sys.argv)
''' 
tf.app.run() handles flag parsing and then executes the main function.
If the entry point is not called main() but, say, test(), pass it explicitly as tf.app.run(main=test);
if it is called main(), a bare tf.app.run() is enough.

FLAGS = tf.app.flags.FLAGS above already registered the flag definitions, so the input can be parsed.
Inside tf.app.run(), argv=None, and args = argv[1:] if argv else None gives args=None
(i.e. nothing is passed explicitly, so the command line is parsed automatically).

flags.FLAGS is the parser used for args; FLAGS._parse_flags(args) parses either the args list
or, when it is empty, the command-line input, and returns in flags_passthrough the list of
arguments it could not parse (excluding the program name).
'''

5.3 utils.py

from __future__ import print_function

import numpy as np   # numpy
import random        # random
import json          # json
import sys           # system utilities
import os            # operating-system utilities

import networkx as nx                       # networkx: basic graph operations, used to create and manipulate graphs
from networkx.readwrite import json_graph   # read/write networkx graphs in JSON (node-link) format
version_info = list(map(int, nx.__version__.split('.')))  # networkx version as a list of ints
major = version_info[0]  # number before the first dot
minor = version_info[1]  # number after the first dot
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # networkx must be <= 1.11, otherwise the assertion fails

WALK_LEN=5
N_WALKS=50
'''
type() vs isinstance():
type() does not treat an instance of a subclass as an instance of the parent class (inheritance is ignored).
isinstance() does treat an instance of a subclass as an instance of the parent class (inheritance is respected).
To check whether two types match, isinstance() is recommended.
Parameters:
object – the instance to check.
classinfo – a class name (direct or indirect), a basic type, or a tuple of these.

Return value:
True if the object's type matches classinfo, otherwise False.
'''
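# Example: type(True) == int is False (type() ignores inheritance; type(True) is bool),
#          while isinstance(True, int) is True (bool is a subclass of int).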
'''G.nodes() returns the graph's nodes n (and, with data=True, the node attributes nodedata).'''

def load_data(prefix, normalize=True, load_walks=False):
    G_data = json.load(open(prefix + "-G.json"))   # the graph is stored as JSON, so it is loaded with the json module
    G = json_graph.node_link_graph(G_data)         # return graph from node-link data format
    # define the key-conversion function:
    # check whether G.nodes()[0] is an int (i.e. plain node ids without node data)
    if isinstance(G.nodes()[0], int):
        conversion = lambda n : int(n)   # lambda parameters: expression
    else:
        conversion = lambda n : n        # leave n unchanged

    if os.path.exists(prefix + "-feats.npy"):   # if pretrained node features exist at this path
        feats = np.load(prefix + "-feats.npy")
    else:
        print("No features present.. Only identity features will be used.")
        feats = None
    # a JSON dictionary that maps graph node ids to consecutive integers
    id_map = json.load(open(prefix + "-id_map.json"))   # load the node-id-to-index mapping
    id_map = {conversion(k):int(v) for k,v in id_map.items()}
    walks = []
    class_map = json.load(open(prefix + "-class_map.json"))   # load the label data (a dictionary)
    # print("class_map:", class_map)
    # {"0": [1, 0, 0,...], ..., "14754": [1, 1, 0, 0,...]}
    if isinstance(list(class_map.values())[0], list):   # check whether the labels are lists (multi-label) or single values
        lab_conversion = lambda n : n
    else:
        lab_conversion = lambda n : int(n)   # convert scalar labels to int

    class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()}
    # the JSON keys are strings, so convert them with conversion(); labels are converted with lab_conversion()
    # print("class_map:", class_map)
    # {0: [1, 0, 0,...], ..., 14754: [1, 1, 0, 0,...]}
    
    # The loop below iterates over G.edges(); each edge is a tuple whose elements
    # edge[0] and edge[1] are its two endpoints. If at least one endpoint is a
    # val/test node, the edge's 'train_removed' attribute is set to True, otherwise
    # False, so 'train_removed' is always defined.

    ## Remove all nodes that do not have val/test annotations
    ## (necessary because of networkx weirdness with the Reddit data)
    broken_count = 0
    for node in G.nodes():
        if not 'val' in G.node[node] or not 'test' in G.node[node]:
            G.remove_node(node)
            broken_count += 1
    print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))  # .format() fills the {} placeholders
    # G.edges() returns the edge list [(u, v), (u, v), ...]; each element holds the two
    # endpoint nodes of one edge (with data=True it would also include edge attributes such as weights)

    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
    print("Loaded data.. now preprocessing..")
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
            G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False

    # Standardize the training features: nodes whose val and test flags are both False
    # are the training nodes; id_map gives their row index in the feature matrix, and
    # only those rows are used to fit the scaler.
    if normalize and not feats is None:   # only if features are present
        from sklearn.preprocessing import StandardScaler
        train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])  # training nodes only
        train_feats = feats[train_ids]   # features of the training nodes
        scaler = StandardScaler()
        scaler.fit(train_feats)          # compute mean and variance on the training features
        feats = scaler.transform(feats)  # standardize ALL features with the statistics fitted on the training nodes
    ## standardization gives every feature dimension zero mean and unit variance,
    ## so dimensions with large raw values do not dominate
    if load_walks:  # False by default
        with open(prefix + "-walks.txt") as fp:
            for line in fp:
                walks.append(map(conversion, line.split()))  # each line holds one co-occurring node pair

    return G, feats, id_map, walks, class_map
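A minimal usage sketch (my own, assuming the toy-ppi files are under ./example_data/):

G, feats, id_map, walks, class_map = load_data("./example_data/toy-ppi", load_walks=True)
print(G.number_of_nodes(), feats.shape, len(walks))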

5.4 Scratch test code (my own)

from __future__ import print_function

import numpy as np   # numpy
import random        # random
import json          # json
import sys           # system utilities
import os            # operating-system utilities
import networkx as nx                       # networkx: basic graph operations
from networkx.readwrite import json_graph   # read/write networkx graphs in JSON (node-link) format
version_info = list(map(int, nx.__version__.split('.')))  # networkx version as a list of ints
WALK_LEN=5
N_WALKS=50
major = version_info[0]  # number before the first dot
minor = version_info[1]  # number after the first dot
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # networkx must be <= 1.11
prefix ='toy-ppi'
G_data = json.load(open('./toy-ppi-G.json'))   # the graph is stored as JSON, so it is loaded with the json module
G = json_graph.node_link_graph(G_data)         # return graph from node-link data format
if isinstance(G.nodes()[0], int):
    conversion = lambda n : int(n)   # lambda parameters: expression
else:
    conversion = lambda n : n        # leave n unchanged

print(G.nodes()[0])
0
print(G.nodes())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,  14649, 14650, 14651, 14652, 14653, 14654, 14655, 14656, 14657, 14658, 14659, 14660, 14661, 14662, 14663, 14664, 14665, 14666, 14667, 14668, 14669, 14670, 14671, 14672, 14673, 14674, 14675, 14676, 14677, 14678, 14679, 14680, 14681, 14682, 14683, 14684, 14685, 14686, 14687, 14688, 14689, 14690, 14691, 14692, 14693, 14694, 14695, 14696, 14697, 14698, 14699, 14700, 14701, 14702, 14703, 14704, 14705, 14706, 14707, 14708, 14709, 14710, 14711, 14712, 14713, 14714, 14715, 14716, 14717, 14718, 14719, 14720, 14721, 14722, 14723, 14724, 14725, 14726, 14727, 14728, 14729, 14730, 14731, 14732, 14733, 14734, 14735, 14736, 14737, 14738, 14739, 14740, 14741, 14742, 14743, 14744, 14745, 14746, 14747, 14748, 14749, 14750, 14751, 14752, 14753, 14754]
if isinstance(G.nodes()[0], int):
    conversion = lambda n : int(n)   # lambda parameters: expression
else:
    conversion = lambda n : n        # leave n unchanged
print(conversion(1.0))
1
if os.path.exists(prefix + "-feats.npy"):   # if pretrained node features exist at this path
    feats = np.load(prefix + "-feats.npy")
else:
    print("No features present.. Only identity features will be used.")
    feats = None
id_map = json.load(open(prefix + "-id_map.json"))   # load the node-id-to-index mapping
print(id_map)
print('*******************************************************')
id_map = {conversion(k):int(v) for k,v in id_map.items()}
print(id_map)
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, '10': 10, '11': 11, '12': 12, '13': 13, '14': 14, '15': 15, '16': 16, '17': 17, '18': 18, '19': 19, '20': 20, '21': 21, '22': 22, '23': 23, '24': 24, '25': 25, '26':