Fixing a Crawler's XPath Path Error


I was adapting another blogger's code and found that its XPath was wrong: the target node could not be found, so the list of links came back empty. After copying the full XPath from the source page, the code ran fine and the list was no longer empty. Here are the code and the steps:

import re,requests,codecs,time,random
from lxml import html
 
 
#proxies={"http" : "123.53.86.133:61234"}
proxies=None
headers = {
    'Host': 'guba.eastmoney.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
def get_url(page):
    stocknum=600519
    url='http://guba.eastmoney.com/list,'+str(stocknum)+'_'+str(page)+'.html'
    print("url:")
    print(url)
    #http://guba.eastmoney.com/list,600519.html
    try:
        requests.adapters.DEFAULT_RETRIES = 5   # set the global retry count before the request
        s = requests.session()
        s.keep_alive = False                    # do not reuse keep-alive sockets
        text=requests.get(url,headers=headers,proxies=proxies,timeout=20)
        text=html.fromstring(text.text)
        urls=text.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')
        print("urls:")
        print(urls)
    except Exception as e:
        print("Exception",e)
        time.sleep(random.random() + random.randint(0, 3))
        urls=[]   # empty list, so the caller's loop simply skips this page
    return urls
def get_comments(urls):
    for newurl in urls:
        newurl1='http://guba.eastmoney.com'+newurl
        try:
            requests.adapters.DEFAULT_RETRIES = 5
            s = requests.session()
            s.keep_alive = False
            text1=requests.get(newurl1,headers=headers,proxies=proxies,timeout=20)
            text1=html.fromstring(text1.text)
            times1=text1.xpath('//head/title/text()')
            # strip the "发表于" prefix and spaces, keep the first 10 characters (the date)
            times='!'.join(re.sub(re.compile('发表于| '),'',x)[:10] for x in times1).split('!')
            #times=list(map(lambda x:re.sub(re.compile('发表于| '),'',x)[:10],times))
            comments1=text1.xpath('//head/title/text()')   # fixed: '////head' is invalid XPath syntax
            comments='!'.join(w.strip() for w in comments1).split('!')
            dic=dict(zip(times,comments))
            print("dic",dic)
            print(type(dic))
            save_to_file(dic)
        except:
            print('error!!!!')
            time.sleep(random.random()+random.randint(0,3))
        #print(dic)
        #if times and comments:
            #dic.append({'time':times,'comment':comments})
    #return dic
def save_to_file(dic):
    if dic:
        #dic=dic
        print(dic)
        #df=pd.DataFrame([dic]).T
        #df.to_excel('E:\\1\\eastnoney.xlsx')
        for i,j in dic.items():
            output='{}\t{}\n'.format(i,j)
            #output.to_csv('E:\\1\\\df_maotai.csv',index=False,sep=',')
            # note: this appends tab-separated text; despite the .xlsx extension it is not a real Excel file
            f=codecs.open('E:\\1\\eastmoney.xlsx','a+','utf-8')
            f.write(output)
            f.close()
 
for page in range(2,257):
    print('Crawling page {}'.format(page))
    print(page)
    urls=get_url(page)
    get_comments(urls)   # get_comments saves results itself and returns nothing
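
If the XPath is wrong, urls comes back as an empty list and the loop above silently does nothing, which was exactly the original symptom. A minimal sanity check, reusing the imports and headers from the script above (the helper name and the sample page URL are my own), makes the problem visible immediately:

# quick check: report how many items an XPath expression matches on a page
def check_xpath(page_text, path):
    tree = html.fromstring(page_text)
    result = tree.xpath(path)
    if not result:
        print('XPath matched nothing:', path)
    else:
        print('XPath matched {} item(s)'.format(len(result)))
    return result

# example: verify the working full path against one list page
resp = requests.get('http://guba.eastmoney.com/list,600519_2.html', headers=headers, timeout=20)
check_xpath(resp.text, '/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')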

First, press Ctrl+U to view the page source:

[screenshot: page source view]

Locate the target node and all of its parent nodes, then go back to the rendered page and press F12 to open the developer tools:

[screenshot: developer tools, Elements panel]

As marked by the red boxes in the screenshot, expand the parent nodes one by one by their id or class until you reach the target content, then right-click the node -> Copy full XPath:

[screenshot: right-click menu, Copy full XPath]

Finally, append the attribute you want to the full path, for example @href:

/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href
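
The path copied from the developer tools ends at the <a> element, so xpath() returns element objects; appending /@href makes it return the attribute values as plain strings. A small sketch to confirm the difference, again reusing the headers from the script above (the sample page URL is only illustrative):

url = 'http://guba.eastmoney.com/list,600519_2.html'
tree = html.fromstring(requests.get(url, headers=headers, timeout=20).text)
links = tree.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a')         # <a> element nodes
hrefs = tree.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')   # href attribute strings
print(len(links), hrefs[:3])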
