python之scrapy(二)
程序员文章站
2024-03-17 15:05:28
...
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selector的用法"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"前面介绍过利用BeautifulSoup和PyQuery以及正则表达式来提取网页数据,非常方便。而Scrapy也有自己的提取数据的方法,即Selector选择器。Select是基于lxml来构建的,支持XPath选择器、CSS选择器以及正则表达式,功能齐全,解析速度和准确度非常高。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本来可以采用scrapy shell来调试选择器的使用方法。也可以直接使用Selector模块直接模拟。官网也提供了相应的方法:http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html 。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"关于XPath的语法和运算符可以参考:http://www.runoob.com/xpath/xpath-tutorial.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. 两种选择器"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"由于在response中使用XPath、CSS查询十分普遍,因此,Scrapy提供了两个实用的快捷方式: response.xpath() 及 response.css():"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<Selector xpath='//title/text()' data='Example website'>]\n",
"[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]\n"
]
}
],
"source": [
"from scrapy import Selector\n",
"html='''\n",
"<html>\n",
" <head>\n",
" <base href='http://example.com/' />\n",
" <title>Example website</title>\n",
" </head>\n",
" <body>\n",
" <div id='images'>\n",
" <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
" <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
" <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
" <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
" <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"'''\n",
"selector = Selector(text=html)\n",
"print(selector.xpath('//title/text()'))\n",
"print(selector.css('title::text'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如你所见, .xpath() 及 .css() 方法返回一个类 SelectorList 的实例, 它是一个新选择器的列表。这个API可以用来快速的提取嵌套数据。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.1 提取数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"为了提取真实的原文数据,你需要调用 .extract()和extract_first() 方法如下:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Example website']\n",
"Example website\n"
]
}
],
"source": [
"print(selector.xpath('//title/text()').extract())\n",
"print(selector.css('title::text').extract_first())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"extract()返回的是一个列表,extract_first()返回单个值。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1.2 根据属性来提取"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from scrapy import Selector\n",
"html='''\n",
"<html>\n",
" <head>\n",
" <base href='http://example.com/' />\n",
" <title>Example website</title>\n",
" </head>\n",
" <body>\n",
" <div id='images'>\n",
" <a href='123image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
" <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
" <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
" <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
" <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"'''\n",
"selector = Selector(text=html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在我们想获取base标签上的链接:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://example.com/\n",
"http://example.com/\n"
]
}
],
"source": [
"print(selector.xpath('//base/@href').extract_first())\n",
"print(selector.css('base::attr(\"href\")').extract_first())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"获取图片链接:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n",
"['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
]
}
],
"source": [
"print(selector.xpath('//div[@id=\"images\"]/a/@href').extract())\n",
"print(selector.css('#images a::attr(\"href\")').extract())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['123image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
]
}
],
"source": [
"#print(selector.xpath('//a[contains(@href, \"image\")]/@href').extract())\n",
"print(selector.css('a[href*=image]::attr(\"href\")').extract())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"获取img标签内的src内容:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n",
"['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n"
]
}
],
"source": [
"print(selector.xpath('//a[contains(@href, \"image\")]/img/@src').extract())\n",
"print(selector.css('a[href*=image] img::attr(src)').extract())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.嵌套选择器"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"选择器方法( .xpath() 或 .css() )返回相同类型的选择器列表,因此你也可以对这些选择器调用选择器方法。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scrapy import Selector\n",
"html='''\n",
"<html>\n",
" <head>\n",
" <base href='http://example.com/' />\n",
" <title>Example website</title>\n",
" </head>\n",
" <body>\n",
" <div id='images'>\n",
" <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
" <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
" <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
" <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
" <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
" </div>\n",
" </body>\n",
"</html>\n",
"'''\n",
"selector = Selector(text=html)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"123image1.html\">Name: My image '>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image2.html\">Name: My image 2 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image3.html\">Name: My image 3 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image4.html\">Name: My image 4 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image5.html\">Name: My image 5 <'>]\n",
"链接0指向url是:123image1.html 和图是:image1_thumb.jpg\n",
"链接1指向url是:image2.html 和图是:image2_thumb.jpg\n",
"链接2指向url是:image3.html 和图是:image3_thumb.jpg\n",
"链接3指向url是:image4.html 和图是:image4_thumb.jpg\n",
"链接4指向url是:image5.html 和图是:image5_thumb.jpg\n"
]
}
],
"source": [
"links = selector.xpath('//a[contains(@href, \"image\")]')\n",
"print(links)\n",
"for index, link in enumerate(links):\n",
" args = (index, link.css('::attr(href)').extract_first(), link.xpath('img/@src').extract_first())\n",
" print ('链接%d指向url是:%s 和图是:%s' % args)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. 结合正则表达式"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selector 也有一个 .re() 方法,用来通过正则表达式来提取数据。然而,不同于使用 .xpath() 或者 .css() 方法, .re() 方法返回unicode字符串的列表。所以你无法构造嵌套式的 .re() 调用。"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'My image 1 '"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"selector.xpath('//a[contains(@href, \"image\")]/text()').extract()\n",
"selector.xpath('//a[contains(@href, \"image\")]/text()').re_first(r'Name:\\s(.*)')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
上一篇: 实现pdf转word
下一篇: word 转PDF 文件