欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

python之scrapy(二)

程序员文章站 2024-03-17 15:05:28
...

 {
"cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selector的用法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "前面介绍过利用BeautifulSoup和PyQuery以及正则表达式来提取网页数据,非常方便。而Scrapy也有自己的提取数据的方法,即Selector选择器。Select是基于lxml来构建的,支持XPath选择器、CSS选择器以及正则表达式,功能齐全,解析速度和准确度非常高。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本来可以采用scrapy shell来调试选择器的使用方法。也可以直接使用Selector模块直接模拟。官网也提供了相应的方法:http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html 。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "关于XPath的语法和运算符可以参考:http://www.runoob.com/xpath/xpath-tutorial.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 两种选择器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于在response中使用XPath、CSS查询十分普遍,因此,Scrapy提供了两个实用的快捷方式: response.xpath() 及 response.css():"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Selector xpath='//title/text()' data='Example website'>]\n",
      "[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]\n"
     ]
    }
   ],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)\n",
    "print(selector.xpath('//title/text()'))\n",
    "print(selector.css('title::text'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如你所见, .xpath() 及 .css() 方法返回一个类 SelectorList 的实例, 它是一个新选择器的列表。这个API可以用来快速的提取嵌套数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 提取数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了提取真实的原文数据,你需要调用 .extract()和extract_first() 方法如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Example website']\n",
      "Example website\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//title/text()').extract())\n",
    "print(selector.css('title::text').extract_first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "extract()返回的是一个列表,extract_first()返回单个值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 根据属性来提取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='123image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在我们想获取base标签上的链接:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http://example.com/\n",
      "http://example.com/\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//base/@href').extract_first())\n",
    "print(selector.css('base::attr(\"href\")').extract_first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取图片链接:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n",
      "['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//div[@id=\"images\"]/a/@href').extract())\n",
    "print(selector.css('#images a::attr(\"href\")').extract())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['123image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']\n"
     ]
    }
   ],
   "source": [
    "#print(selector.xpath('//a[contains(@href, \"image\")]/@href').extract())\n",
    "print(selector.css('a[href*=image]::attr(\"href\")').extract())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "获取img标签内的src内容:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n",
      "['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']\n"
     ]
    }
   ],
   "source": [
    "print(selector.xpath('//a[contains(@href, \"image\")]/img/@src').extract())\n",
    "print(selector.css('a[href*=image] img::attr(src)').extract())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.嵌套选择器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "选择器方法( .xpath() 或 .css() )返回相同类型的选择器列表,因此你也可以对这些选择器调用选择器方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy import Selector\n",
    "html='''\n",
    "<html>\n",
    " <head>\n",
    "  <base href='http://example.com/' />\n",
    "  <title>Example website</title>\n",
    " </head>\n",
    " <body>\n",
    "  <div id='images'>\n",
    "   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
    "   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
    "   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
    "   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
    "   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>\n",
    "  </div>\n",
    " </body>\n",
    "</html>\n",
    "'''\n",
    "selector = Selector(text=html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"123image1.html\">Name: My image '>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image2.html\">Name: My image 2 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image3.html\">Name: My image 3 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image4.html\">Name: My image 4 <'>, <Selector xpath='//a[contains(@href, \"image\")]' data='<a href=\"image5.html\">Name: My image 5 <'>]\n",
      "链接0指向url是:123image1.html  和图是:image1_thumb.jpg\n",
      "链接1指向url是:image2.html  和图是:image2_thumb.jpg\n",
      "链接2指向url是:image3.html  和图是:image3_thumb.jpg\n",
      "链接3指向url是:image4.html  和图是:image4_thumb.jpg\n",
      "链接4指向url是:image5.html  和图是:image5_thumb.jpg\n"
     ]
    }
   ],
   "source": [
    "links = selector.xpath('//a[contains(@href, \"image\")]')\n",
    "print(links)\n",
    "for index, link in enumerate(links):\n",
    "    args = (index, link.css('::attr(href)').extract_first(), link.xpath('img/@src').extract_first())\n",
    "    print ('链接%d指向url是:%s  和图是:%s' % args)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 结合正则表达式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selector 也有一个 .re() 方法,用来通过正则表达式来提取数据。然而,不同于使用 .xpath() 或者 .css() 方法, .re() 方法返回unicode字符串的列表。所以你无法构造嵌套式的 .re() 调用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'My image 1 '"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selector.xpath('//a[contains(@href, \"image\")]/text()').extract()\n",
    "selector.xpath('//a[contains(@href, \"image\")]/text()').re_first(r'Name:\\s(.*)')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}