Scraping Taobao Data with Selenium

程序员文章站 2022-04-26 15:40:51


Using Selenium to Scrape Taobao Data

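Taobao's anti-scraping measures are very strict. After half a day of failed attempts, I ended up using Selenium, which can scrape whatever is visible in the browser.
Open Taobao in a normal browser and enter window.navigator.webdriver in the console panel: it returns undefined. Do the same in a browser driven by Selenium and it returns true. This is probably one of the signals Taobao uses to detect scrapers.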
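
The console check can also be run from Python. The following is a minimal sketch that assumes only Selenium's execute_script() method on the driver. The helper name is_flagged_as_automated and the StubDriver class are made up for illustration; the stub exists only so the snippet runs without a browser.

```python
def is_flagged_as_automated(driver):
    """Return True if the page sees navigator.webdriver as truthy.

    `driver` is any object exposing Selenium's execute_script(script).
    """
    return bool(driver.execute_script("return window.navigator.webdriver"))


class StubDriver:
    """Hypothetical stand-in for a real WebDriver, used only for this demo."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # A real driver would evaluate the script in the page.
        return self.flag


# None mimics the `undefined` seen in a normal browser, True a driven one.
normal_result = is_flagged_as_automated(StubDriver(None))
driven_result = is_flagged_as_automated(StubDriver(True))
print(normal_result, driven_result)
```

With a real session, the call would presumably be is_flagged_as_automated(browser) right after browser.get(...).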

I tried the Taobao login page, but could not find an interface for "password login".
So I simply log in by scanning the QR code while the program is running.

Define a login function, login():

def login():
    logon_url = "https://login.taobao.com/"
    browser.get(logon_url)

    try:
        print("Please scan the QR code to log in")
        # Give the user time to scan the QR code
        time.sleep(10)
        # until() takes a wait condition; presence_of_element_located() means the
        # node is present in the DOM, and its argument is a locator tuple.
        search_box = wait.until(EC.presence_of_element_located((By.XPATH, '//input[@id="q"]')))
        submit = wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@class="btn-search tb-bg"]')))
        search_box.clear()
        search_box.send_keys(KEYWORD)
        submit.click()
        time.sleep(2)
        print("Login succeeded")
    except Exception as e:
        print("Login failed", e)

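In login(), the two located elements are the search box and the search button. Once the search box is found, the keyword to search for (KEYWORD) is typed into it and the search button is clicked.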
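
A fixed sleep can be too short for a slow scan and too long for a fast one. One alternative is to poll until the browser has left the login page. The sketch below assumes that a successful QR login redirects away from login.taobao.com, which the article does not confirm. wait_for_login and StubDriver are hypothetical names, and the stub only mimics the current_url attribute of a real driver.

```python
import time


def wait_for_login(driver, timeout=60.0, poll=1.0, login_host="login.taobao.com"):
    """Poll driver.current_url until it no longer contains login_host.

    Returns True once the browser has left the login host,
    or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if login_host not in driver.current_url:
            return True
        time.sleep(poll)
    return False


class StubDriver:
    """Hypothetical driver that replays a list of URLs, for the demo only."""

    def __init__(self, urls):
        self._urls = list(urls)

    @property
    def current_url(self):
        # Hand out successive URLs, then stay on the last one.
        if len(self._urls) > 1:
            return self._urls.pop(0)
        return self._urls[0]


scanned = StubDriver(["https://login.taobao.com/", "https://www.taobao.com/"])
print(wait_for_login(scanned, timeout=2.0, poll=0.01))
```

In the scraper, a call such as wait_for_login(browser) could stand in for the fixed time.sleep(10).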

Define a function that extracts product information

def get_products(html):
    """
    Extract product information from the page source.
    :param html: HTML source of the current search results page
    """
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }

This function defines the product fields. Its parameter, html, is the source of the page the browser is currently showing.
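
The fields above are stored as raw text. If numeric values are wanted later, they can be cleaned before storage. The following is a rough sketch with two hypothetical helpers, parse_price and parse_deal_count. The input formats ('¥3299.00', '1200人付款', '1.5万+人付款') are assumptions about what the page shows, not something the article states.

```python
import re


def parse_price(text):
    """Return the first decimal number in a price string as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def parse_deal_count(text):
    """Convert a sales string to an int, treating '万' as a factor of 10,000."""
    match = re.search(r"(\d+(?:\.\d+)?)(万)?", text)
    if not match:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= 10_000
    return int(round(value))


print(parse_price("¥3299.00"))
print(parse_deal_count("1200人付款"))
print(parse_deal_count("1.5万+人付款"))
```

Both helpers return None when no number is found, so empty fields do not raise.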

Define a function for the pagination loop

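def index_page():
    # Loop over the page numbers
    for i in range(2, 11):
        html = browser.page_source
        get_products(html)

        print("Turning the page-------------")
        # Find the page-number box, type the number into it, then click confirm
        page_input = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')))
        page_input.clear()
        page_input.send_keys(str(i))
        # Confirm button
        confirm = wait.until(EC.element_to_be_clickable((By.XPATH, '//span[@class="btn J_Submit"]')))
        confirm.click()
        print("Page turned")
        time.sleep(3)

    # Scrape the last page reached; the loop body only scrapes before each jump
    get_products(browser.page_source)
    print("Requests finished!")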

In this function, the two located elements are the page-number box and the confirm button.
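The page number is typed into this box and the confirm button is clicked to turn the page. After each page turn, the page's HTML is passed to get_products() as its argument.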
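
The order of "scrape" and "turn the page" matters. Every page that is reached should be scraped, and no jump is needed after the last one. The sketch below shows one way to arrange the loop, with the browser actions replaced by two callbacks so that it runs anywhere. scrape_pages and its parameters are illustrative names, not part of the original code.

```python
def scrape_pages(first_page, last_page, scrape, go_to):
    """Scrape pages first_page..last_page inclusive.

    scrape(page) handles the page currently shown;
    go_to(page) navigates to the given page number.
    The browser is assumed to start on first_page.
    """
    for page in range(first_page, last_page + 1):
        scrape(page)
        if page < last_page:
            go_to(page + 1)


# Record what would happen instead of driving a browser.
visited, jumps = [], []
scrape_pages(1, 10, visited.append, jumps.append)
print(visited)
print(jumps)
```

In the scraper, scrape would wrap get_products(browser.page_source), and go_to would type the number into the page box and click the confirm button.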
Finally, define a function that writes to the database, which completes the whole scraping process.
The complete code is as follows:


# coding: utf-8
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
import time
import pymysql

browser = webdriver.Chrome()
# Explicit wait with a maximum wait time (in seconds)
wait = WebDriverWait(browser, 10)
# Search keyword
KEYWORD = 'IPAD'
conn = pymysql.connect(
    host="localhost",
    database="test",
    user="root",
    password="root",
    port=3306,
    charset='utf8'
)
cursor = conn.cursor()

def login():
    logon_url = "https://login.taobao.com/"
    browser.get(logon_url)

    try:
        print("Please scan the QR code to log in")
        # Give the user time to scan the QR code
        time.sleep(10)
        # until() takes a wait condition; presence_of_element_located() means the
        # node is present in the DOM, and its argument is a locator tuple.
        search_box = wait.until(EC.presence_of_element_located((By.XPATH, '//input[@id="q"]')))
        submit = wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@class="btn-search tb-bg"]')))
        search_box.clear()
        search_box.send_keys(KEYWORD)
        submit.click()
        time.sleep(2)
        print("Login succeeded")
    except Exception as e:
        print("Login failed", e)

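def index_page():
    # Loop over the page numbers
    for i in range(2, 11):
        html = browser.page_source
        get_products(html)

        print("Turning the page-------------")
        # Find the page-number box, type the number into it, then click confirm
        page_input = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="mainsrp-pager"]/div/div/div/div[2]/input')))
        page_input.clear()
        page_input.send_keys(str(i))
        # Confirm button
        confirm = wait.until(EC.element_to_be_clickable((By.XPATH, '//span[@class="btn J_Submit"]')))
        confirm.click()
        print("Page turned")
        time.sleep(3)

    # Scrape the last page reached; the loop body only scrapes before each jump
    get_products(browser.page_source)
    print("Requests finished!")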

def get_products(html):
    """
    Extract product information from the page source and store each item.
    :param html: HTML source of the current search results page
    """
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        write(product)
        print("Wrote one record")

def write(info):
    sql = "insert into taobao_ipad (image,price,deal,title,shop,location)values (%s,%s,%s,%s,%s,%s);"
    cursor.execute(sql,(info["image"],info["price"],info["deal"],info["title"],info["shop"],info["location"]))

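def main():
    login()
    index_page()

if __name__ == "__main__":
    try:
        main()
        conn.commit()
    finally:
        # Release the database connection and close the browser
        cursor.close()
        conn.close()
        browser.quit()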

After the run, the scraped records appear in the taobao_ipad table.
The scrape succeeded. Changing the keyword scrapes a different product.
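
The write() function assumes that a taobao_ipad table with six columns already exists. The article does not show its definition. The following is a minimal sketch of a matching schema and a batched insert. It uses Python's built-in sqlite3 in place of MySQL so that it runs without a server. The column types, the id column, the write_many helper, and the sample row are all assumptions made for illustration. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

COLUMNS = ("image", "price", "deal", "title", "shop", "location")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taobao_ipad ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    + ", ".join(f"{name} TEXT" for name in COLUMNS)
    + ")"
)


def write_many(conn, products):
    """Insert a batch of product dicts with a single executemany call."""
    sql = (
        f"INSERT INTO taobao_ipad ({', '.join(COLUMNS)}) "
        f"VALUES ({', '.join('?' for _ in COLUMNS)})"
    )
    conn.executemany(sql, [tuple(p[c] for c in COLUMNS) for p in products])
    conn.commit()


# Made-up sample row, only to exercise the insert.
sample = [{
    "image": "//img.example/1.jpg", "price": "3299.00", "deal": "1200",
    "title": "iPad", "shop": "demo shop", "location": "Shanghai",
}]
write_many(conn, sample)
row_count = conn.execute("SELECT COUNT(*) FROM taobao_ipad").fetchone()[0]
print(row_count)
```

Collecting one page of products and inserting them in a single batch keeps the number of round trips to the database low.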
