Using LinkExtractor in Scrapy Spiders
程序员文章站
2022-10-07 21:17:41
LinkExtractor
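Every parameter of the LinkExtractor constructor has a default value. If you construct the object without passing any arguments, it extracts all links in the page by default.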
In [1]: from scrapy.linkextractors import LinkExtractor
In [2]: le = LinkExtractor()
In [3]: links = le.extract_links(response)
In [4]: [link.url for link in links]
Out[4]:
['http://books.toscrape.com/index.html',
'http://books.toscrape.com/catalogue/category/books_1/index.html',
'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
... (omitted) ...
'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html',
'http://books.toscrape.com/catalogue/page-2.html']
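The allow parameter of LinkExtractor
Accepts a regular expression or a list of regular expressions, and extracts the links whose absolute URL matches. If the parameter is empty, all links are extracted.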
In [21]: from scrapy.linkextractors import LinkExtractor
In [22]: le = LinkExtractor(allow=r"/catalogue/page.*\.html$")
In [23]: links = le.extract_links(response)
In [24]: [link.url for link in links]
Out[24]: ['http://books.toscrape.com/catalogue/page-2.html']
The deny parameter of LinkExtractor
Accepts a regular expression or a list of regular expressions, and excludes the links whose absolute URL matches.
In [25]: from scrapy.linkextractors import LinkExtractor
In [26]: le = LinkExtractor(deny="/catalogue/.*/books/.*")
In [27]: links = le.extract_links(response)
In [28]: [link.url for link in links]
Out[28]:
['http://books.toscrape.com/index.html',
'http://books.toscrape.com/catalogue/category/books_1/index.html',
'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
'http://books.toscrape.com/catalogue/soumission_998/index.html',
'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
'http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html',
'http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',
'http://books.toscrape.com/catalogue/olio_984/index.html',
'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html',
'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html',
'http://books.toscrape.com/catalogue/page-2.html']
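The allow_domains and deny_domains parameters of LinkExtractor
allow_domains: accepts a domain name or a list of domain names, and extracts only the links that belong to those domains
deny_domains: accepts a domain name or a list of domain names, and excludes the links that belong to those domains
# Only deny_domains is demonstrated here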
In [29]: from scrapy.linkextractors import LinkExtractor
In [30]: le = LinkExtractor(deny_domains="books.toscrape.com")
In [31]: links = le.extract_links(response)
In [32]: [link.url for link in links]
Out[32]: []
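The restrict_xpaths and restrict_css parameters of LinkExtractor
restrict_xpaths: accepts an XPath expression and extracts links only from the regions it selects
restrict_css: accepts a CSS expression and extracts links only from the regions it selects
# restrict_xpaths example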
In [29]: from scrapy.linkextractors import LinkExtractor
In [30]: le = LinkExtractor(restrict_xpaths="//li[@class='next']")
In [31]: links = le.extract_links(response)
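The tags and attrs parameters of LinkExtractor
tags: accepts a tag or a list of tags and extracts links from those tags; the default is ['a', 'area']
attrs: accepts an attribute or a list of attributes and extracts links from those attributes; the default is ['href']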
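The process_value parameter of LinkExtractor
Accepts a callback function that is called with each value extracted from the scanned tags and attributes. The function can return a modified value, or None to discard the link. A typical use is pulling the real URL out of JavaScript code in an href.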
Original article: https://blog.csdn.net/fengzhilaoling/article/details/107317870