Recognizing Word file formatting with Python (Series: Writing a simple Office grading program in Python, Part ①)

程序员文章站 (Programmer Article Station), 2022-03-06 14:04:09


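● I am a second-year graduate student, not in engineering or computer science, so the code is crude and beginner-level; please be kind. This post is only a record and a happy bit of sharing.
○ Thanks for the encouragement, likes, bookmarks, and shares. To repost, just cite this page as the source. ____Ⓙ即刻@王昭没有君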

This post only summarizes what I worked out by trial and error. Corrections, additions, and discussion are welcome.


————————

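I. Overall approach:

1. Use the third-party Python library docx (python-docx) to recognize as much Word formatting as possible (this is the simpler, more convenient route).

Use dir() to list the attributes and child objects available at the current level (ignoring the names with double underscores __).
Use (.attribute) to try reading an attribute, or (.object) to step into a child object.

2. Extract the .docx into its .xml files and read the tags, to cover formatting that the docx library cannot recognize.

Unzip word.docx into XML files (there is more than one; there are several folders).
Find the tag name and nesting level under which the target property is stored in the XML.
Use (level.tag), (level.attrib), and (level.text) to try to read the property out.

3. Office follows a lazy but tidy rule: for many default properties and formats, if the document's author never changed the default, the corresponding tag simply does not exist in the XML. When Python tries to extract such a property, the result is None or missing, and sometimes an error is raised. For example:

The default font is SimSun (宋体); some versions show it as SimSun (Headings) or SimSun (Body).
The default font size is 小三 (Small Three); this may differ between versions, or between .doc and .docx.
By default there is no first-line indent, line spacing is 1.0, and so on.

Once these formats are changed, the property tags are stored in the .docx and its .xml files. Even so, the files are not quite a complete log of every edit.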
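
The two habits described in this section, listing names with dir() and coping with properties that come back as None or are missing, can be wrapped in two small helpers. This is only a sketch: `public_names` and `read_attr` are hypothetical helper names of mine, not part of the docx library, and the demo object is a stand-in built with `SimpleNamespace` rather than a real document.

```python
from types import SimpleNamespace


def public_names(obj):
    """Return the names from dir(obj) that do not start with a double underscore."""
    return [name for name in dir(obj) if not name.startswith("__")]


def read_attr(obj, dotted_path, default=None):
    """Follow a dotted attribute path such as 'font.size'.

    Returns `default` when any step is missing or the value is None,
    which is how an unchanged default usually shows up.
    """
    current = obj
    for name in dotted_path.split("."):
        current = getattr(current, name, None)
        if current is None:
            return default
    return current


# Stand-in for a run whose font size was never changed (so it reads as None).
demo_run = SimpleNamespace(text="abc", font=SimpleNamespace(size=None, bold=True))

print(public_names(demo_run))
print(read_attr(demo_run, "font.size", "default"))
print(read_attr(demo_run, "font.bold"))
print(read_attr(demo_run, "font.colour.rgb", "missing"))  # a path that does not exist
```

With a real run the call would look like `read_attr(run, "font.size", default_size)`, where `default_size` is whatever default the grading rubric assumes.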

————————

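II. Python libraries used

These are everything the grading program uses; recognizing Word formatting alone does not need all of them:

import docx                                     # read Word files
import xlrd                                     # read Excel files, mainly to get the roster and build paths
import openpyxl                                 # read/write Excel files, mainly to record scores
import os                                       # file paths and the like
import xml.etree.ElementTree as ET              # read XML files

In addition, unzipping into XML files uses the following libraries:

import os                                       # the unzip program was written separately, so it also uses os
import xlrd                                     # ...and xlrd, mainly to get the roster and build paths
import shutil                                   # delete configuration files
import zipfile
# unzip Word (.docx), Excel (.xlsx), and PowerPoint (.pptx) files into their .xml files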

————————

III. Document structure as seen by the docx library:

document:
- sections:
- paragraphs:
  - runs:
sections and paragraphs are at the same level.
Tables and pictures sit outside both sections and paragraphs.


1. Read the file: docx.Document('file path')

file = docx.Document(r"F:\文件地址\word.docx")

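2. Read the sections: file.sections

sections = file.sections            # all sections
for section in sections:            # iterate over the sections
    print(section.page_height)      # page height

(1) A section break is what separates one section from the next (I have not tried page breaks; additions welcome);
(2) Page-related properties are almost all found under sections;
(3) Section attributes include, but are not limited to: # use print(dir(section)) and print(dir(sections)) to see more attributes and child objects

 - Page height: section.page_height
 - Page width: section.page_width
 - Page orientation: section.orientation
 - Gutter: section.gutter
 - Left margin: section.left_margin
 - Right margin: section.right_margin
 - Top margin: section.top_margin
 - Bottom margin: section.bottom_margin
 - Header: section.header
 - Footer: section.footer

✨ The header in more detail: # use print(dir(section.header)) to see more attributes and child objects

 - Header distance from the top edge: section.header_distance
 - Footer distance from the bottom edge: section.footer_distance
 - Header text: section.header.paragraphs[0].text
 - Header alignment: section.header.paragraphs[0].alignment
 - Header font size: section.header.paragraphs[0].runs[0].font.size
 - Header font: section.header.paragraphs[0].runs[0].font.name

✨ The footer works similarly, but page numbers can only be recognized from the XML files.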
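
When these section values are printed they appear as large integers. As far as I know, python-docx reports lengths in English Metric Units (EMU), with 914400 EMU per inch and therefore 360000 EMU per centimetre, and its length objects also offer convenience properties such as `.cm` and `.pt`. The sketch below does the conversion by hand so that it runs without a document; `emu_to_cm` and `close_enough` are hypothetical helper names, and the 0.05 cm tolerance is an arbitrary choice for illustration.

```python
EMU_PER_CM = 360000  # 914400 EMU per inch divided by 2.54 cm per inch


def emu_to_cm(emu):
    """Convert a length in EMU to centimetres."""
    return emu / EMU_PER_CM


def close_enough(actual_emu, expected_cm, tolerance_cm=0.05):
    """True when a measured length is within the tolerance of the expected value."""
    return abs(emu_to_cm(actual_emu) - expected_cm) <= tolerance_cm


# One inch (2.54 cm) expressed in EMU
print(emu_to_cm(914400))
print(close_enough(914400, 2.5))   # 2.54 cm is within 0.05 cm of 2.5 cm
print(close_enough(914400, 3.0))   # but not of 3.0 cm
```

A tolerance is useful in grading because a margin typed as 2.5 cm may be stored with a small rounding difference.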

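3. Read the paragraphs: file.paragraphs

paragraphs = file.paragraphs        # all paragraphs
for i in range(len(paragraphs)):    # iterate over paragraphs; the section-style loop also works, but the index i is needed later
    paragraph = paragraphs[i]
    if paragraph.text != "":        # keep only non-empty paragraphs
        print(paragraph.text)       # paragraph text

(1) Paragraph-related properties are almost all found under paragraphs;
(2) Paragraph attributes include, but are not limited to: # use print(dir(paragraph)) and print(dir(paragraphs)) to see more attributes and child objects

- Full paragraph text: paragraph.text
- Alignment: paragraph.alignment
- Space before: paragraph.paragraph_format.space_before
- Space after: paragraph.paragraph_format.space_after
- Left indent: paragraph.paragraph_format.left_indent
- Right indent: paragraph.paragraph_format.right_indent
- First-line indent: paragraph.paragraph_format.first_line_indent
- Line spacing: paragraph.paragraph_format.line_spacing

(3) Multi-column layout and bullets are not among the paragraph attributes; they can only be recognized from the XML files.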
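
The paragraph loop above can also be written with `enumerate`, which keeps the paragraph index without `range(len(...))`. A minimal sketch: `non_empty_paragraphs` is a hypothetical helper that only assumes each item has a `.text` attribute, so stand-in objects are used here in place of `file.paragraphs`. Unlike the loop above, it also skips paragraphs that contain only whitespace.

```python
from types import SimpleNamespace


def non_empty_paragraphs(paragraphs):
    """Return (index, paragraph) pairs for paragraphs whose text is not blank."""
    return [(i, p) for i, p in enumerate(paragraphs) if p.text.strip() != ""]


# Stand-ins for file.paragraphs
demo = [
    SimpleNamespace(text="Title"),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Body"),
]

for index, paragraph in non_empty_paragraphs(demo):
    print(index, paragraph.text)
```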

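4. Read the runs: paragraph.runs

paragraphs = file.paragraphs            # all paragraphs
for i in range(len(paragraphs)):        # iterate over paragraphs
    paragraph = paragraphs[i]
    if paragraph.text != "":            # keep only non-empty paragraphs
        for run in paragraph.runs:      # iterate over runs
            print(run.text)             # run text
            break                       # only the first run is used

(1) Character-level properties are almost all found under runs;
(2) Text is usually split into runs wherever the formatting changes: consecutive characters with identical properties share a run, and a new run starts when a property differs. For example:

全球超级计算机500强榜单20日公布，“神威太湖之光”登上榜首。
- If Chinese and Latin text are set in different fonts, the paragraph is split into these runs: 全球超级计算机///500///强榜单///20///日公布，“神威太湖之光”登上榜首。
- In principle, I guess, a paragraph whose properties are completely uniform ends up as a single run. ~~In actual grading, though, students always manage some bizarre operation: identical paragraphs get split into different runs, and running the same file several times gives a different result each time, which is infuriating. So~~ in practice I only take the first character of the first run and read its properties.

(3) Run attributes include, but are not limited to: # use print(dir(run)) and print(dir(runs)) to see more attributes and child objects

- Text: run.text
- Font: run.font.name                       # font
- Font size: run.font.size
- Italic: run.font.italic
- Bold: run.font.bold
- Underline: run.font.underline
- Color: run.font.color.rgb                 # RGB color value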
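
Taking only the first run, as described in point (2), can quietly fail when the first run happens to be empty. The sketch below picks the first run that actually contains text and collects a few of the attributes listed above into a dictionary. `first_text_run`, `run_summary`, and `make_run` are hypothetical helpers of mine; the stand-in objects mimic only the attribute names used in this post.

```python
from types import SimpleNamespace


def first_text_run(paragraph):
    """Return the first run with non-empty text, or None if there is none."""
    for run in paragraph.runs:
        if run.text:
            return run
    return None


def run_summary(run):
    """Collect the character formatting of one run; None means 'not set on this run'."""
    return {
        "text": run.text,
        "font": run.font.name,
        "size": run.font.size,
        "bold": run.font.bold,
        "italic": run.font.italic,
    }


def make_run(text, name=None, size=None, bold=None, italic=None):
    """Build a stand-in run for the demo (not a python-docx object)."""
    font = SimpleNamespace(name=name, size=size, bold=bold, italic=italic)
    return SimpleNamespace(text=text, font=font)


demo_paragraph = SimpleNamespace(
    runs=[make_run(""), make_run("全球超级计算机", name="SimSun", bold=True)]
)
run = first_text_run(demo_paragraph)
print(run_summary(run))
```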

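5. Read the tables: file.tables

tables = file.tables                    # all tables
r1 = r2 = 0                             # counters must start at 0
for table in tables:                    # iterate over tables
    for row in table.rows:              # iterate over table rows
        r1 = r1 + 1
        for cell in row.cells:
            print(cell.text)            # print the table contents row by row
    for column in table.columns:        # iterate over table columns
        r2 = r2 + 1

print(r1, r2)                           # r1: number of rows, r2: number of columns

(1) Table attributes are under tables:

- Table alignment: table.alignment                          # not the same as cell alignment

(2) Row and column attributes are under rows and columns:

- Row height: row.height
- Column width: column.width

(3) Cell attributes are under cells:

- Cell text: cell.text
- Cell vertical alignment: cell.vertical_alignment
- Horizontal alignment of the text in a cell: cell.paragraphs[0].alignment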
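
If only the size of each table is needed, the counters in the loop above can be replaced by `len()`, since `table.rows` and `table.columns` both support it in python-docx. A sketch with a hypothetical helper `table_shape`, demonstrated on a stand-in object rather than a real table:

```python
from types import SimpleNamespace


def table_shape(table):
    """Return (number of rows, number of columns) for one table."""
    return len(table.rows), len(table.columns)


# Stand-in for a 3-row, 2-column table; a real one would come from file.tables
demo_table = SimpleNamespace(rows=[object()] * 3, columns=[object()] * 2)
print(table_shape(demo_table))
```

Computing the shape per table also avoids adding up the rows and columns of every table in the document into one pair of totals.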

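6. Read the pictures: file.inline_shapes

pics = file.inline_shapes               # all inline pictures
for pic in pics:                        # iterate over pictures
    print(pic.width, pic.height, pic.type)

- Picture width: pic.width
- Picture height: pic.height
- Shape type (a WD_INLINE_SHAPE value such as PICTURE): pic.type

The formatting that the docx library can recognize is incomplete. This post covers most of what it can recognize; most of the rest can only be read from the XML files. Although the XML exposes every format, reading through the docx library is still much simpler, so avoid the XML whenever the library is enough.
Where the library falls short, read the .xml files extracted from the .docx and recognize the formatting there. The most commonly needed case is the page number, so this post uses page numbers as the example.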

————————

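IV. Reading the XML files to recognize document structure:

1. Converting the file

(1) The most direct way: manually rename word.docx to word.zip, then unzip it.
① Start with the original file.
② Choose "Yes" (是) when prompted.

③ Extract to the current folder, or wherever you like.

④ All the XML files we need are in the extracted word folder.

⑤ The number and content of the XML files vary from document to document. For example, my document has a header and a footer, so its folder contains footer1.xml to footer3.xml and header1.xml to header3.xml. If a document has no header or footer and no history of editing them, the folder contains no footer or header XML files. Notepad is enough to open the files and view their content.
(2) Batch conversion with code. Many code snippets are shared on CSDN; search for keywords such as "word转xml文件" (Word to XML file). Below is the code I use. Embarrassingly, I can no longer find where it came from; if the original author sees this post, please contact me so I can add the source or remove this part. My apologies.

Each exam folder is named after a student ID and contains the three files to be read. So the program first reads the student IDs from the roster, uses each ID as part of the exam folder path, and unzips the three files in each student's folder in one batch.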

import os
import zipfile
import shutil
import xlrd

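class Name_list():
    """Reads the student roster from an Excel sheet."""

    def __init__(self, file_address):
        self.file_address = file_address

    def read(self, sheet_name):
        # Note: xlrd 2.0 and later read only .xls files; opening .xlsx needs an older xlrd
        workbook = xlrd.open_workbook(self.file_address)
        sheet = workbook.sheet_by_name(sheet_name)
        data = []
        for i in range(0, sheet.nrows):          # one list of cell values per row
            data.append(sheet.row_values(i))
        return data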

def unzip_file(path, filenames):

    print(path)
    #print(os.listdir(path))
    for filename in filenames:
        filepath = os.path.join(path,filename)
        if os.path.exists(filepath):
            zip_file = zipfile.ZipFile(filepath)        # open the file as a zip archive
            #print(filename)
            newfilepath = filename.split(".",1)[0]      # file name without the extension
            newfilepath = os.path.join(path,newfilepath)
            #print(newfilepath)
            if os.path.isdir(newfilepath):              # create a folder named after the file if it does not exist yet
                pass
            else:
                os.mkdir(newfilepath)
            for name in zip_file.namelist():            # extract every member
                zip_file.extract(name,newfilepath)
            zip_file.close()
            Conf = os.path.join(newfilepath,'conf')
            if os.path.exists(Conf):                    # delete the conf folder if it exists (optional; keep it if you need it)
                shutil.rmtree(Conf)

            print("Unzipped {0} successfully".format(filename))

def main():
    for j in range(int(len(student_list) - 2)):
        stu_id = student_idlist[j]
        address_stu_id = str(address_beforeid + str(stu_id))  # exam folder path
        if os.path.exists(address_stu_id):
            filenames = ['excel操作题.xlsx', 'PPT操作题.pptx', 'word操作题.docx']  # files to unzip in that folder
            unzip_file(address_stu_id, filenames)


if __name__ == '__main__':

    address_idlist =  r"F:\名单.xlsx"                        # roster
    address_beforeid = 'F:\\试卷\\'                          # part of the exam path before the student-ID folder

    student_list = Name_list(address_idlist).read('Sheet1')
    # the first two rows of the sheet are skipped; column index 1 is read as the student ID
    student_idlist = [[] for r in range(int(len(student_list) - 2))]
    for k in range(int(len(student_list) - 2)):
        student_idlist[k] = int(student_list[k + 2][1])

    main()

(3) The result after conversion: [screenshot]
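
Since a .docx is an ordinary zip archive, a single part can also be read in memory with `zipfile`, without renaming or extracting anything to disk. The sketch below is one possible alternative, not the method used in this post; `list_parts` and `read_part` are hypothetical helper names, and the tiny archive built in the demo only imitates the layout of a real .docx.

```python
import os
import tempfile
import zipfile


def list_parts(docx_path):
    """Return the names of all members stored in the archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def read_part(docx_path, part_name):
    """Return one member as text (assumed UTF-8), or None if it is absent."""
    with zipfile.ZipFile(docx_path) as archive:
        if part_name not in archive.namelist():
            return None
        return archive.read(part_name).decode("utf-8")


# Demo: build a tiny stand-in archive with two members.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.docx")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/footer2.xml", "<ftr/>")

print(list_parts(demo_path))
print(read_part(demo_path, "word/footer2.xml"))
print(read_part(demo_path, "word/footer9.xml"))  # absent member
```

Returning None for an absent member fits the situation described in ⑤ above, where footer and header files exist only in some documents.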

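2. Reading the XML to recognize Word page numbers

How to work out which XML file and which tag level holds a given format is covered in the next subsection; the conclusions come first.

(1) The page-number tags are usually in footer2.xml in the word folder. If the document has page numbers but they are not found there, check footer1.xml and footer3.xml.

(2) Inside the root <ftr></ftr> tag, the level-2 <sdt></sdt> tag stores the page-number properties. In grading, the presence of an <sdt></sdt> tag at least shows that the student did something with page numbers and knows how to use the feature.

(3) Inside <sdt></sdt>, the level-5 <instrText></instrText> tag stores the page-number format, and its content varies with the page-number style. For example, this question asks students to add page numbers of the form " - 1 - ", and the .text of instrText is then: PAGE \* MERGEFORMAT. Another level-5 tag, <jc></jc>, stores the page-number alignment in its .attrib. The default alignment here is centered, and when the page number keeps the default centering there is no jc tag in the XML.

Here is the code for reading the XML file, adapted from:

https://blog.csdn.net/weixin_36279318/article/details/79176475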

import xml.etree.ElementTree as ET

class Xml2DataFrame:
    def __init__(self, xmlFileName):
        self.xmlFileName = xmlFileName

    def read_xml(self):
        tree = ET.parse(self.xmlFileName)
        root = tree.getroot()  # level 1: the root element
        #print('root.tag:', root.tag, ',root-attrib:', root.attrib, ',root-text:', root.text)
        for sub1 in root:  # level 2
            child = sub1
            print('sub1.tag:', child.tag, ',sub1.attrib:', child.attrib, ',sub1.text:', child.text)
            for sub2 in sub1:  # level 3
                child = sub2
                print('sub2.tag:', child.tag, ',sub2.attrib:', child.attrib, ',sub2.text:', child.text)
                # (deeper levels continue with further nested for loops, omitted here)


if __name__ == '__main__':

    file_path = r'F:\考试文件夹\学号\word操作题\word\footer2.xml'
    xml_df = Xml2DataFrame(file_path)
    xml_df.read_xml()

This produces the following output:

sub1.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt ,sub1.attrib: {} ,sub1.text: None
sub2.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdtPr ,sub2.attrib: {} ,sub2.text: None
sub3.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}id ,sub3.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '-482698706'} ,sub3.text: None
sub3.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}docPartObj ,sub3.attrib: {} ,sub3.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}docPartGallery ,sub4.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'Page Numbers (Bottom of Page)'} ,sub4.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}docPartUnique ,sub4.attrib: {} ,sub4.text: None
sub2.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdtContent ,sub2.attrib: {} ,sub2.text: None
sub3.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}p ,sub3.attrib: {'{http://schemas.microsoft.com/office/word/2010/wordml}paraId': '57F7B8B0', '{http://schemas.microsoft.com/office/word/2010/wordml}textId': '0880F220', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '001547AD', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '001547AD'} ,sub3.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle ,sub5.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'ac'} ,sub5.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}jc ,sub5.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'right'} ,sub5.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}r ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldChar ,sub5.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldCharType': 'begin'} ,sub5.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}r ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}instrText ,sub5.attrib: {} ,sub5.text: PAGE   \* MERGEFORMAT
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}r ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldChar ,sub5.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldCharType': 'separate'} ,sub5.text: None
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}r ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}rPr ,sub5.attrib: {} ,sub5.text: None
sub6.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}lang ,sub6.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'zh-CN'} ,sub6.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}t ,sub5.attrib: {} ,sub5.text: 2
sub4.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}r ,sub4.attrib: {} ,sub4.text: None
sub5.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldChar ,sub5.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}fldCharType': 'end'} ,sub5.text: None
sub1.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}p ,sub1.attrib: {'{http://schemas.microsoft.com/office/word/2010/wordml}paraId': '337E501C', '{http://schemas.microsoft.com/office/word/2010/wordml}textId': '77777777', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '001547AD', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '001547AD'} ,sub1.text: None
sub2.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pPr ,sub2.attrib: {} ,sub2.text: None
sub3.tag: {http://schemas.openxmlformats.org/wordprocessingml/2006/main}pStyle ,sub3.attrib: {'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': 'ac'} ,sub3.text: None

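(4) Because of my limited skill, I read each level with a for loop nested in a for loop nested in a for loop. For more complex files I ended up nesting 20 for loops. I was told that this can be written recursively. My understanding was that recursion uses more memory and demands more of the computer (although for a tree only about 20 levels deep the overhead is negligible). Given the hardware in the department office, and given that writing recursion is genuinely painful at my level, that optimization is on hold.

(5) To pull out the key information when grading, define a variable a to hold the content: if 'sdt' in child.tag: a = str(child.tag) + str(child.attrib) + str(child.text). Then test the contents of that string and score it against the rubric.

(6) Two very similar .docx files can, because of one small difference, store their formatting at very different depths in the XML; the largest gap I have seen is 5 levels. So when a tag is not found at the expected level, besides considering whether the property is simply at its default, also check the few levels just above and below.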
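
Points (4) to (6) can be addressed together with a walk that does not depend on the nesting depth. The sketch below is an alternative under my own assumptions, not the code used for grading in this post: `walk` is a small recursive generator that reports each element with its depth (the root counts as 0, so the numbers match the sub1, sub2, ... labels in the output above), and `page_number_info` searches inside every sdt element with ElementTree's `iter()`, wherever it sits. Both function names and the trimmed sample footer are mine.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def walk(element, depth=0):
    """Yield (depth, element) for the element and all of its descendants."""
    yield depth, element
    for child in element:
        yield from walk(child, depth + 1)


def page_number_info(footer_xml):
    """Collect the first field code and alignment found inside sdt elements."""
    root = ET.fromstring(footer_xml)
    info = {"has_sdt": False, "field_code": None, "alignment": None}
    for sdt in root.iter(W + "sdt"):
        info["has_sdt"] = True
        for elem in sdt.iter():
            if elem.tag == W + "instrText" and elem.text and info["field_code"] is None:
                info["field_code"] = " ".join(elem.text.split())  # collapse repeated spaces
            elif elem.tag == W + "jc" and info["alignment"] is None:
                info["alignment"] = elem.get(W + "val")
    return info


# Trimmed-down sample with the same nesting as the output above
SAMPLE = (
    '<w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:sdt><w:sdtContent><w:p>"
    '<w:pPr><w:jc w:val="right"/></w:pPr>'
    "<w:r><w:instrText>PAGE   \\* MERGEFORMAT</w:instrText></w:r>"
    "</w:p></w:sdtContent></w:sdt>"
    "</w:ftr>"
)

for depth, elem in walk(ET.fromstring(SAMPLE)):
    print(depth, elem.tag.split("}")[-1])

print(page_number_info(SAMPLE))
```

When `alignment` comes back as None the jc tag is absent, which by the rule in point (3) means the alignment was left at its default.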

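3. Looking through the XML to find where a format lives

(1) To recognize a format, first find the XML tag that stores it. I created three blank Word documents: the first saved empty, the second with a plain footer added, and the third with page numbers in the required format.

(2) As mentioned above, Notepad is enough to open an XML file and view its content; here the three files are compared. Open each XML file in Notepad, copy the whole text, paste it into one cell of a new Excel sheet, and use Excel's Text to Columns feature with '<' as the delimiter. The differences between the tags then become easy to spot, and after some filtering you can pin down which tag stores the format.

~~It is a clumsy method anyway: guessing at English words plus Google Translate, for which CET-4 level English is more than enough~~

(3) Next we need the approximate depth of the tag. The code below prints the XML with indentation that advances level by level; it is adapted from:

https://blog.csdn.net/qq_41958123/article/details/105357692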

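import xml.dom.minidom

uglyxml = '<a><b>text</b></a>'                  # placeholder; put the XML content to print here
dom = xml.dom.minidom.parseString(uglyxml)      # named dom so the xml module is not overwritten
xml_pretty_str = dom.toprettyxml()
print(xml_pretty_str)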

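The output is indented one step per nesting level.
Select all of it, copy it, and paste it into Excel.

Then search for the tag name to find its position; counting the columns gives its level.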
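
The comparison in steps (1) to (3) can also be done in code instead of Excel. As a rough sketch under my own assumptions: `tag_depths` (a hypothetical helper) records at which depths each tag name occurs, and subtracting the tag names of one document from those of another shows what a given edit added. The two XML strings are made-up miniatures, not real Word output.

```python
import xml.etree.ElementTree as ET


def tag_depths(xml_text):
    """Map each tag name (namespace removed) to the sorted depths where it occurs."""
    depths = {}

    def visit(element, depth):
        name = element.tag.split("}")[-1]
        depths.setdefault(name, set()).add(depth)
        for child in element:
            visit(child, depth + 1)

    visit(ET.fromstring(xml_text), 0)  # the root element is depth 0
    return {name: sorted(found) for name, found in depths.items()}


PLAIN = "<ftr><p><pPr/></p></ftr>"
WITH_PAGE_NUMBER = (
    "<ftr><sdt><sdtContent><p><pPr><jc/></pPr></p></sdtContent></sdt>"
    "<p><pPr/></p></ftr>"
)

before = tag_depths(PLAIN)
after = tag_depths(WITH_PAGE_NUMBER)

added = sorted(set(after) - set(before))
print(added)          # tag names that appear only after the edit
print(after["p"])     # every depth at which p occurs
```

Listing all depths for a tag also makes it visible when the same tag sits at different levels in two otherwise similar files.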

————————

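V. Summary

This is Part ① of the Office trio, the Word part. Next I have to go prepare for an online-course exam and a group meeting. When I have time I will write up the Excel part, the PowerPoint part, and the various precautions needed to keep the program from crashing in the face of students' bizarre operations.
It really is just a clumsy method that tests nothing but patience, and my lack of professional training shows from the very start. I cannot find a better way; I simply had time on my hands, did not want to grade papers by hand, and wanted a once-and-for-all solution. I am a beginner and I own it, so please do not scold me; if you do not like it, the close button is at the top right. Thank you for your kindness.

I have finally finished writing this up. How did it get so long?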

VI. References

https://blog.csdn.net/weixin_36279318/article/details/79176475
https://blog.csdn.net/qq_41958123/article/details/105357692

Original article: https://blog.csdn.net/zhizhangtaoer__/article/details/110299439
