Fixing a wrong XPath crawl path in a scraper
I was following another blogger's code and found that its XPath was wrong: the target node could not be found, so the list of links came back empty. After copying the full XPath from the source page, the code ran without problems and the list was no longer empty. The code and the steps follow.
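Before the full script, here is a minimal standalone sketch (using an inline HTML snippet, not the real page) of why a wrong path is easy to miss: lxml raises no error for a path that matches nothing, it simply returns an empty list:

from lxml import html

# hypothetical miniature page, only to demonstrate the failure mode
tree = html.fromstring('<div class="pager"><span><a href="/news,1.html">next</a></span></div>')
print(tree.xpath('//span[3]/a/@href'))  # [] -- no third <span>, silently empty
print(tree.xpath('//span/a/@href'))     # ['/news,1.html'] -- a matching path

Here is the full script with the corrected full XPath: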
import re, requests, codecs, time, random
from lxml import html

# proxies = {"http": "123.53.86.133:61234"}
proxies = None
headers = {
    'Host': 'guba.eastmoney.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}

def get_url(page):
    stocknum = 600519
    # list pages look like http://guba.eastmoney.com/list,600519_2.html
    url = 'http://guba.eastmoney.com/list,' + str(stocknum) + '_' + str(page) + '.html'
    print("url:")
    print(url)
    try:
        text = requests.get(url, headers=headers, proxies=proxies, timeout=20)
        requests.adapters.DEFAULT_RETRIES = 5
        # session created as in the original code; note it is never actually used
        s = requests.session()
        s.keep_alive = False
        text = html.fromstring(text.text)
        # full XPath copied from the page source -- this is the corrected path
        urls = text.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')
        print("urls:")
        print(urls)
    except Exception as e:
        print("Exception", e)
        time.sleep(random.random() + random.randint(0, 3))
        urls = []
    return urls

def get_comments(urls):
    for newurl in urls:
        newurl1 = 'http://guba.eastmoney.com' + newurl
        try:
            text1 = requests.get(newurl1, headers=headers, proxies=proxies, timeout=20)
            requests.adapters.DEFAULT_RETRIES = 5
            s = requests.session()
            s.keep_alive = False
            text1 = html.fromstring(text1.text)
            # both the post time and the comment text are pulled from the <title>
            times1 = text1.xpath('//head/title/text()')
            # strip the "发表于" (posted-on) prefix and spaces, keep the first 10 chars (the date)
            times = '!'.join(re.sub(re.compile('发表于| '), '', x)[:10] for x in times1).split('!')
            comments1 = text1.xpath('//head/title/text()')  # was '////head/...', an XPath syntax error
            comments = '!'.join(w.strip() for w in comments1).split('!')
            dic = dict(zip(times, comments))
            print("dic", dic)
            print(type(dic))
            save_to_file(dic)
        except:
            print('error!!!!')
            time.sleep(random.random() + random.randint(0, 3))

def save_to_file(dic):
    if dic:
        print(dic)
        # df = pd.DataFrame([dic]).T
        # df.to_excel('E:\\1\\eastnoney.xlsx')
        for i, j in dic.items():
            output = '{}\t{}\n'.format(i, j)
            # note: despite the .xlsx extension this writes plain tab-separated text
            f = codecs.open('E:\\1\\eastmoney.xlsx', 'a+', 'utf-8')
            f.write(output)
            f.close()

for page in range(2, 257):
    print('Crawling page {}'.format(page))
    urls = get_url(page)
    get_comments(urls)  # writes as it goes; no return value
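The commented-out pandas lines in save_to_file hint at a cleaner way to get a real Excel file. A minimal sketch of that route, assuming pandas plus openpyxl are installed (the output path is the author's; the column names are my own):

import pandas as pd

def save_to_excel(dic):
    # two columns built from the {time: comment} dict
    df = pd.DataFrame(list(dic.items()), columns=['time', 'comment'])
    # unlike codecs.open + write, this produces a genuine .xlsx file
    df.to_excel('E:\\1\\eastmoney.xlsx', index=False)

Note that to_excel overwrites the file on every call, so to keep data from all pages you would accumulate the rows in one list across the loop and write once at the end.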
First, press Ctrl+U to view the page source:
Find the target node and all of its parent nodes, then go back to the page and press F12 to open developer tools:
As marked by the red boxes in the screenshots, expand each parent node by its id or class until you reach the target content, then right-click it and copy the full XPath:
Finally, append the attribute you want to the full path, for example @href:
/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href
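To confirm a copied full path before wiring it into the crawler, a quick standalone check with the same URL and headers is enough; a non-empty list means the path matches:

import requests
from lxml import html

headers = {'Host': 'guba.eastmoney.com',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
resp = requests.get('http://guba.eastmoney.com/list,600519_2.html', headers=headers, timeout=20)
tree = html.fromstring(resp.text)
links = tree.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')
print(len(links), links[:5])  # non-empty output means the copied path still matches

Keep in mind that full paths copied from the browser are brittle: if the site changes its layout, the div indices shift and the list goes empty again, so a shorter relative XPath anchored on a stable id or class is usually more robust.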