Crawling the GRID audio-visual corpus

程序员文章站 2022-03-02 13:25:18

This post shows how to crawl the GRID corpus.
It is hosted at http://spandh.dcs.shef.ac.uk/gridcorpus/

The page is simple, so the links we need can be selected directly with XPath.

Find all the links

# -*- coding: utf-8 -*-
import urllib.request
from lxml import etree

root_url = "http://spandh.dcs.shef.ac.uk"
url = f"{root_url}/gridcorpus/"


def main():
    html = urllib.request.urlopen(url).read()
    tree = etree.HTML(html)

    # Every link inside a table cell of the page.
    links = tree.xpath(".//td/a/@href")
    with open("dw.raw.list", "w", encoding="utf-8") as fp:
        for e in links:
            # Skip the example files.
            if "example" in e:
                continue
            # Skip the high-definition video (links containing "part").
            if "part" in e:
                continue
            print(e)
            # Save the link as well; the file was opened but never written before.
            fp.write(e + "\n")


if __name__ == '__main__':
    main()
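
The wget step below reads complete URLs from full_path.list, but the post does not show how the scraped links are completed. A minimal sketch of that step, assuming the links were saved one per line in dw.raw.list and that relative links should be resolved against the corpus page; the names BASE_URL, to_full_urls and complete_list are made up for this illustration:

```python
from urllib.parse import urljoin

# Assumption: the page the hrefs were scraped from.
BASE_URL = "http://spandh.dcs.shef.ac.uk/gridcorpus/"


def to_full_urls(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, h.strip()) for h in hrefs if h.strip()]


def complete_list(src="dw.raw.list", dst="full_path.list"):
    """Read raw hrefs from src and write one full URL per line to dst."""
    with open(src, encoding="utf-8") as fp:
        urls = to_full_urls(fp)
    with open(dst, "w", encoding="utf-8") as fp:
        fp.writelines(u + "\n" for u in urls)
```

Calling complete_list() after the crawl would produce full_path.list. urljoin is used so that the result is the same whether the page uses relative or absolute hrefs.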

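Build the wget commands

This step turns each link into a wget command with an explicit output file name, so that the downloads can afterwards be run in parallel with parallel.

The following code is for reference.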

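f1 = "full_path.list"  # links from the previous step completed to full URLs (this could also be done in this script)
f2 = "last.list"

with open(f1) as fp:
    lines = fp.readlines()

with open(f2, "w") as fp:
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Name the output file after the last two path components, joined with "-".
        ss = line.split("/")[-2:]
        ss = "-".join(ss)

        # Earlier attempt, kept for reference: as far as I can tell wget does not apply -P
        # to a name given with -O, so the directory goes into the -O path instead.
        # line=f'wget -c -P datasets/ {line} -O {ss}\n'
        line = f'wget -c -O datasets/{ss} {line}\n'
        fp.write(line)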
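
To make the naming rule easier to check, here is the same logic as a small function. This is only a sketch: the function name wget_command and the URL in the usage note are illustrative, and shlex.quote is an addition of mine to guard against unusual characters in a link.

```python
import shlex


def wget_command(url, out_dir="datasets"):
    """Build one wget command line; the file name is the last two URL path parts joined by '-'."""
    name = "-".join(url.strip().split("/")[-2:])
    target = f"{out_dir}/{name}"
    return f"wget -c -O {shlex.quote(target)} {shlex.quote(url.strip())}"
```

For an illustrative URL ending in /s1/align/s1.tar the output file becomes datasets/align-s1.tar, which matches the archive names the unpacking scripts expect.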

Parallel download

cat last.list | parallel

parallel is easier to use than xargs; see
https://www.jianshu.com/p/cc54a72616a1
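
If GNU parallel is not installed, the command list can also be run from Python. The sketch below is a substitute for parallel, not what the post itself uses; run_commands and run_list are hypothetical names, and the default of 4 concurrent jobs is an arbitrary choice.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, jobs=4):
    """Run shell command lines concurrently; return their exit codes in input order."""
    def run(cmd):
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run, commands))


def run_list(path="last.list", jobs=4):
    """Run every non-empty line of the command file."""
    with open(path, encoding="utf-8") as fp:
        commands = [line.strip() for line in fp if line.strip()]
    return run_commands(commands, jobs)
```

Threads are sufficient here because each worker only waits for an external wget process. A non-zero entry in the returned list marks a download that should be retried.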

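Unpacking

The extracted files also have to be moved and renamed as they are unpacked; otherwise files from different archives overwrite each other.

Transcripts and audio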

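# Unpack the transcripts (align) and audio of speakers s1..s34, one background job per speaker.
for x in {1..34}
do
    {
        nm=s$x
        echo $nm
        mkdir -p ../db_uncompress/$nm/tmp
        mkdir -p ../db_uncompress/$nm/align
        mkdir -p ../db_uncompress/$nm/audio
        # Extract into a scratch directory, then move the align directory into place.
        f1="../datasets/align-$nm.tar"
        tar -xf $f1 -C ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/align ../db_uncompress/$nm/

        # Extract the audio into the scratch directory, then move the files into audio/.
        f2="../datasets/audio-$nm.tar"
        tar -xf $f2 -C ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/$nm/* ../db_uncompress/$nm/audio

        #exit
    } &
done
wait    # block until every background job has finished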

Video

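# Unpack the video of speakers s1..s34, one background job per speaker.
for x in {1..34}
do
    {
        nm=s$x
        echo $nm
        mkdir -p ../db_uncompress/$nm/video
        f1="../datasets/video-$nm.mpg_vcd.zip"
        # Unzip into the scratch directory, then move the files into video/.
        unzip -q $f1 -d ../db_uncompress/$nm/tmp
        mv ../db_uncompress/$nm/tmp/$nm/* ../db_uncompress/$nm/video

        #exit
    } &
done
wait    # block until every background job has finished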
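
The extract-then-move pattern used by both shell scripts can also be written with the Python standard library. This is a sketch under assumptions, not the post's method: unpack_tar is a hypothetical helper, it handles tar archives only, and it flattens the archive, so it assumes no two files inside one archive share a name.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def unpack_tar(archive, dest):
    """Extract archive into a scratch directory, then move its files into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(str(archive)) as tar:
            tar.extractall(tmp)
        # Collect the paths first, then move each regular file to dest (flattened).
        for p in sorted(Path(tmp).rglob("*")):
            if p.is_file():
                shutil.move(str(p), str(dest / p.name))
                moved.append(p.name)
    return moved
```

Giving every archive its own destination, for example unpack_tar("../datasets/audio-s1.tar", "../db_uncompress/s1/audio"), keeps the audio and video of one speaker from landing in the same directory.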

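Join the align files into the target text

Each word in an align file is originally given with its range in sample points at 25 kHz. To simplify data processing, that part is dropped first.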

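cnt = 0
Dir["db_uncompress/s*/align/*.align"].each do |fnm|
    # Keep only the last field of each line (the word) and join the words with spaces.
    ss = File.readlines(fnm).map { |e| e.strip.split[-1] }.join " "

    # sub! replaces only the first "align", i.e. the directory name:
    # .../align/x.align becomes .../test/x.align
    fnm.sub!("align", "test")

    dir = fnm.split("/")[0...-1].join "/"
    Dir.mkdir dir unless File.exist? dir

    # The block form closes the file, so the text is flushed to disk.
    File.open(fnm, "w") { |f| f.puts ss }

    cnt += 1
    puts cnt if cnt % 1000 == 0
end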
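
The core of this step, dropping the sample ranges and keeping the words, looks like this in Python. It is a sketch: align_to_text is a hypothetical name, and the sample lines in the usage note are illustrative rather than copied from the corpus.

```python
def align_to_text(lines):
    """Keep the last whitespace-separated field of each non-empty line and join with spaces."""
    words = [line.split()[-1] for line in lines if line.strip()]
    return " ".join(words)
```

Given the three lines "0 23750 sil", "23750 29500 bin" and "29500 34000 blue", the function returns "sil bin blue". Unlike the Ruby script, blank lines are skipped instead of contributing an empty word.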