Fixing a wrong XPath crawl path in a scraper
I was following another blogger's code and found that its XPath was wrong: the target node could not be found, so the list of links came back empty. After copying the full XPath from the source page, the code ran without problems and the list was no longer empty. The code and the steps follow.
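Before the full script, here is a minimal standalone sketch (using an inline HTML snippet, not the real page) of why a wrong path is easy to miss: lxml raises no error for a path that matches nothing, it simply returns an empty list:

from lxml import html

# hypothetical miniature page, only to demonstrate the failure mode
tree = html.fromstring('<div class="pager"><span><a href="/news,1.html">next</a></span></div>')
print(tree.xpath('//span[3]/a/@href'))  # [] -- no third <span>, silently empty
print(tree.xpath('//span/a/@href'))     # ['/news,1.html'] -- a matching path

Here is the full script with the corrected full XPath: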
import re, requests, codecs, time, random
from lxml import html

# proxies = {"http": "123.53.86.133:61234"}
proxies = None
headers = {
    'Host': 'guba.eastmoney.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}

def get_url(page):
    stocknum = 600519
    # list pages look like http://guba.eastmoney.com/list,600519_2.html
    url = 'http://guba.eastmoney.com/list,' + str(stocknum) + '_' + str(page) + '.html'
    print("url:")
    print(url)
    try:
        text = requests.get(url, headers=headers, proxies=proxies, timeout=20)
        requests.adapters.DEFAULT_RETRIES = 5
        # session created as in the original code; note it is never actually used
        s = requests.session()
        s.keep_alive = False
        text = html.fromstring(text.text)
        # full XPath copied from the page source -- this is the corrected path
        urls = text.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')
        print("urls:")
        print(urls)
    except Exception as e:
        print("Exception", e)
        time.sleep(random.random() + random.randint(0, 3))
        urls = []
    return urls

def get_comments(urls):
    for newurl in urls:
        newurl1 = 'http://guba.eastmoney.com' + newurl
        try:
            text1 = requests.get(newurl1, headers=headers, proxies=proxies, timeout=20)
            requests.adapters.DEFAULT_RETRIES = 5
            s = requests.session()
            s.keep_alive = False
            text1 = html.fromstring(text1.text)
            # both the post time and the comment text are pulled from the <title>
            times1 = text1.xpath('//head/title/text()')
            # strip the "发表于" (posted-on) prefix and spaces, keep the first 10 chars (the date)
            times = '!'.join(re.sub(re.compile('发表于| '), '', x)[:10] for x in times1).split('!')
            comments1 = text1.xpath('//head/title/text()')  # was '////head/...', an XPath syntax error
            comments = '!'.join(w.strip() for w in comments1).split('!')
            dic = dict(zip(times, comments))
            print("dic", dic)
            print(type(dic))
            save_to_file(dic)
        except:
            print('error!!!!')
            time.sleep(random.random() + random.randint(0, 3))

def save_to_file(dic):
    if dic:
        print(dic)
        # df = pd.DataFrame([dic]).T
        # df.to_excel('E:\\1\\eastnoney.xlsx')
        for i, j in dic.items():
            output = '{}\t{}\n'.format(i, j)
            # note: despite the .xlsx extension this writes plain tab-separated text
            f = codecs.open('E:\\1\\eastmoney.xlsx', 'a+', 'utf-8')
            f.write(output)
            f.close()

for page in range(2, 257):
    print('Crawling page {}'.format(page))
    urls = get_url(page)
    get_comments(urls)  # writes as it goes; no return value
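The commented-out pandas lines in save_to_file hint at a cleaner way to get a real Excel file. A minimal sketch of that route, assuming pandas plus openpyxl are installed (the output path is the author's; the column names are my own):

import pandas as pd

def save_to_excel(dic):
    # two columns built from the {time: comment} dict
    df = pd.DataFrame(list(dic.items()), columns=['time', 'comment'])
    # unlike codecs.open + write, this produces a genuine .xlsx file
    df.to_excel('E:\\1\\eastmoney.xlsx', index=False)

Note that to_excel overwrites the file on every call, so to keep data from all pages you would accumulate the rows in one list across the loop and write once at the end.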
First, press Ctrl+U to view the page source:
Find the target node and all of its parent nodes, then go back to the page and press F12 to open developer tools:
As marked by the red boxes in the screenshots, expand each parent node by its id or class until you reach the target content, then right-click it and copy the full XPath:
Finally, append the attribute you want to the full path, for example @href:
/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href
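To confirm a copied full path before wiring it into the crawler, a quick standalone check with the same URL and headers is enough; a non-empty list means the path matches:

import requests
from lxml import html

headers = {'Host': 'guba.eastmoney.com',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
resp = requests.get('http://guba.eastmoney.com/list,600519_2.html', headers=headers, timeout=20)
tree = html.fromstring(resp.text)
links = tree.xpath('/html/body/div[6]/div[2]/div[4]/div[5]/span[3]/a/@href')
print(len(links), links[:5])  # non-empty output means the copied path still matches

Keep in mind that full paths copied from the browser are brittle: if the site changes its layout, the div indices shift and the list goes empty again, so a shorter relative XPath anchored on a stable id or class is usually more robust.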