Crawling the GRID audio-visual corpus
2022-03-02 13:25:18
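This post shows how to crawl the GRID corpus.
It is hosted at http://spandh.dcs.shef.ac.uk/gridcorpus/
The page is simple, so a single XPath query is enough to find the links we need.
Finding all the links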
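# -*- coding: utf-8 -*-
import urllib.request
from lxml import etree

# root_url = "http://spandh.dcs.shef.ac.uk/gridcorpus/"
root_url = "http://spandh.dcs.shef.ac.uk/"
# root_url already ends with "/", so do not add another one
url = f"{root_url}gridcorpus/"


def main():
    html = urllib.request.urlopen(url).read()
    tree = etree.HTML(html)
    # every link on the page sits inside a table cell
    links = tree.xpath(".//td/a/@href")
    # print(links)
    with open("dw.raw.list", "w", encoding="utf-8") as fp:
        for e in links:
            # skip links that contain "example"
            if "example" in e:
                continue
            # skip links that contain "part": this ignores the high-quality video
            if "part" in e:
                continue
            print(e)
            # also save the link, one per line, for the next step
            fp.write(e + "\n")


if __name__ == '__main__':
    main()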
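The hrefs found by the XPath query are kept exactly as they appear in the page, while the next step reads a file of complete URLs (full_path.list). A minimal sketch of that completion step, assuming dw.raw.list holds one href per line and that relative hrefs are relative to the page URL. `complete_links` and `write_full_paths` are helper names introduced here, not part of the original scripts.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the address given in this post).
BASE_URL = "http://spandh.dcs.shef.ac.uk/gridcorpus/"


def complete_links(hrefs, base=BASE_URL):
    """Resolve each href against the page URL; absolute URLs pass through unchanged."""
    return [urljoin(base, h.strip()) for h in hrefs if h.strip()]


def write_full_paths(src="dw.raw.list", dst="full_path.list"):
    """Read raw hrefs from src and write one complete URL per line to dst."""
    with open(src, encoding="utf-8") as fp:
        urls = complete_links(fp)
    with open(dst, "w", encoding="utf-8") as fp:
        fp.write("\n".join(urls) + "\n")
    return urls
```

Calling `write_full_paths()` after the crawl produces the full_path.list file that the next script expects.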
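Building the wget commands
This step turns every link into a wget command. Set the output file name explicitly, so that the commands can later be handed to parallel
and downloaded concurrently.
The following code is for reference.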
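f1 = "full_path.list"  # links from the previous step, completed to full URLs (this could also be done in this script)
f2 = "last.list"
with open(f1) as fp:
    lines = fp.readlines()
with open(f2, "w") as fp:
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # name the file after the last two path components, joined with "-"
        ss = "-".join(line.split("/")[-2:])
        # wrong? line = f'wget -c -P datasets/ {line} -O {ss}\n'
        # the datasets/ directory must exist before the commands are run
        line = f'wget -c -O datasets/{ss} {line}\n'
        fp.write(line)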
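The renaming matters because archive names repeat across directories: judging by the file names used in the extraction scripts later in this post, the align and the audio archive of a speaker share the same base name, so the parent directory has to become part of the saved name. A small sketch of the same rule as a function. The `shlex.quote` calls are an extra precaution added here, `wget_command` is a name introduced for this example, and the URLs are hypothetical.

```python
import shlex


def wget_command(url, out_dir="datasets"):
    """Build one resumable wget command whose output name keeps the parent directory."""
    # ".../align/s1.tar" -> "align-s1.tar"
    name = "-".join(url.rstrip("/").split("/")[-2:])
    target = f"{out_dir}/{name}"
    # -c resumes partial downloads; -O sets the exact output path.
    return f"wget -c -O {shlex.quote(target)} {shlex.quote(url)}"


# Two hypothetical URLs with the same base name end up in different files.
print(wget_command("http://example.org/gridcorpus/s1/align/s1.tar"))
print(wget_command("http://example.org/gridcorpus/s1/audio/s1.tar"))
```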
Parallel download
cat last.list | parallel
parallel is easier to use than xargs:
https://www.jianshu.com/p/cc54a72616a1
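Where GNU parallel is not installed, roughly the same effect can be had from Python. This is a minimal sketch, not a replacement for parallel's features. `run_commands`, `run_file` and the default of four concurrent jobs are choices made for this example.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, jobs=4):
    """Run command lines concurrently; return their exit codes in input order."""
    def run(cmd):
        # shlex.split avoids invoking a shell; each command runs as its own process.
        return subprocess.run(shlex.split(cmd)).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, commands))


def run_file(path="last.list", jobs=4):
    """Run every non-empty line of path as one command."""
    with open(path) as fp:
        commands = [line.strip() for line in fp if line.strip()]
    return run_commands(commands, jobs)
```

A non-zero entry in the returned list marks a download that should be retried; since the commands use `wget -c`, rerunning them resumes partial files.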
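Extraction
After extraction the files also have to be moved to distinct names; otherwise the archives overwrite each other's contents.
Text and audio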
for x in {1..34}
do
  {
    nm=s$x
    echo "$nm"
    mkdir -p "../db_uncompress/$nm/tmp"
    mkdir -p "../db_uncompress/$nm/align"
    mkdir -p "../db_uncompress/$nm/audio"
    f1="../datasets/align-$nm.tar"
    tar -xf "$f1" -C "../db_uncompress/$nm/tmp"
    mv "../db_uncompress/$nm/tmp/align" "../db_uncompress/$nm/"
    f2="../datasets/audio-$nm.tar"
    tar -xf "$f2" -C "../db_uncompress/$nm/tmp"
    mv "../db_uncompress/$nm/tmp/$nm"/* "../db_uncompress/$nm/audio"
    #exit
  } &
done
# each speaker runs as a background job; wait until all of them finish
wait
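The loop above shares one tmp directory per speaker and relies on mv to do the renaming. A Python sketch of the same idea, assuming each archive holds plain files under some top-level directory. It uses a private temporary directory per archive and, unlike the shell version, flattens any nested directories. `extract_flat` is a name introduced here.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_flat(archive, dest):
    """Extract a tar archive and move every regular file directly into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # A private temporary directory keeps archives from overwriting each other.
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(archive) as tar:
            tar.extractall(tmp)
        # sorted() materializes the listing before any file is moved.
        for path in sorted(Path(tmp).rglob("*")):
            if path.is_file():
                target = dest / path.name
                shutil.move(str(path), str(target))
                moved.append(target.name)
    return moved
```

With the directory layout used in this post, a call would look like `extract_flat("../datasets/audio-s1.tar", "../db_uncompress/s1/audio")`, repeated for each of the 34 speakers.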
Video
for x in {1..34}
do
  {
    nm=s$x
    echo "$nm"
    mkdir -p "../db_uncompress/$nm/video"
    f1="../datasets/video-$nm.mpg_vcd.zip"
    unzip -q "$f1" -d "../db_uncompress/$nm/tmp"
    mv "../db_uncompress/$nm/tmp/$nm"/* "../db_uncompress/$nm/video"
    #exit
  } &
done
# each speaker runs as a background job; wait until all of them finish
wait
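Joining the align files into the target text
The align files give a range of sample points at 25 kHz for every word. To simplify data processing, that part is removed first.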
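cnt = 0
Dir["db_uncompress/s*/align/*.align"].each do |fnm|
  # keep only the last field (the word) of every line and join the words with spaces
  ss = File.readlines(fnm).map { |e| e.strip.split[-1] }.join(" ")
  # sub replaces only the first "align", i.e. the directory name, not the extension
  out = fnm.sub("align", "test")
  dir = File.dirname(out)
  Dir.mkdir(dir) unless File.exist?(dir)
  # the block form closes the file after writing
  File.open(out, "w") { |fp| fp.puts ss }
  cnt += 1
  puts cnt if cnt % 1000 == 0
end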
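For readers who prefer Python, here is the same transformation applied to the contents of a single file. The sketch assumes each line has the form `start end word`; the numbers in the sample are made up for illustration, and `align_to_text` is a name introduced here. Unlike the Ruby version, it skips blank lines.

```python
def align_to_text(lines):
    """Keep the last whitespace-separated field of each non-empty line, joined by spaces."""
    return " ".join(line.split()[-1] for line in lines if line.strip())


# Illustrative sample: sample-point range followed by the word.
sample = [
    "0 23750 sil",
    "23750 29500 bin",
    "29500 34000 blue",
]
print(align_to_text(sample))  # sil bin blue
```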