phpspider 的简单使用

程序员文章站 2022-03-18 14:35:17

phpspider 的简单使用 phpspider是一款PHP开发蜘蛛爬虫框架。官方github下载地址：https://github.com/owner888/phpspider 官方文档下载地址：https://doc.phpspider.org/ 由于官方文档可能会出现打不开的情况（我一开始 ......

phpspider 的简单使用

phpspider是一款php开发蜘蛛爬虫框架。

官方github下载地址：
官方文档下载地址：
由于官方文档可能会出现打不开的情况（我一开始试了很多次都打不开），这里提供一个网盘下载地址：链接：https://pan.baidu.com/s/1lfjocw1rthn_luotf7iudw 密码：cylb
使用
代码下载下来后里面有几个例子，我这里就以代码中糗事百科为例，主要介绍几点注意事项。
1、代码必须放到命令行运行，可以使用 php -f 语句。
2、在代码中的例子糗事百科抓取网址写的是 http，运行不成功，要改成 https。
3、save_running_state 参数表示是否保存爬虫运行状态，如果选择了true，则会使用到redis。
4、抓取时默认使用的 selector 是 xpath，如果想使用其它的可以用 selector_type 参数修改，文档中介绍目前可用 xpath, jsonpath, regex，但是我看使用 css 也是可以的。
5、写 selector 时可以打开要抓取的页面，审查元素，选择要抓取的数据，右击->copy，选择 copy selector 或 copy xpath ，可以直接得到该元素的 selector。（这点在开发文档上也有相应的介绍）。
6、抓取的数据存储可以有三种选择，.csv 文件，.sql 文件，也可以直接插入数据库表中，选择相应的表名即可（要字段对应）。

下面贴上我修改后的代码：

<?php
// composer下载方式
// 先使用composer命令下载：
// composer require owner888/phpspider
// 引入加载器
//require './vendor/autoload.php';

// github下载方式
require_once __dir__ . '/../autoloader.php';
use phpspider\core\phpspider;

/* do not delete this comment */
/* 不要删除这段注释 */

$configs = array(
    'name' => '糗事百科',
    'log_show' => true,
    'tasknum' => 1,
    'save_running_state' => false,
    'domains' => array(
        'qiushibaike.com',
        'www.qiushibaike.com'
    ),
    'scan_urls' => array(
        'https://www.qiushibaike.com/'
    ),
    'list_url_regexes' => array(
        "https://www.qiushibaike.com/8hr/page/\d+\?s=\d+"
    ),
    'content_url_regexes' => array(
        "https://www.qiushibaike.com/article/\d+",
    ),
    'max_try' => 5,
    //'proxies' => array(
        //'http://h784u84r444yabqd:57a8b0b743f9b4d2@proxy.abuyun.com:9010'
    //),
    //'export' => array(
        //'type' => 'csv',
        //'file' => '../data/qiushibaike.csv',
    //),
    //'export' => array(
        //'type'  => 'sql',
        //'file'  => '../data/qiushibaike.sql',
        //'table' => 'content',
    //),
    'export' => array(
        'type' => 'db',
        'table' => 'content',
    ),
    'db_config' => array(
        'host'  => '127.0.0.1',
        'port'  => 3306,
        'user'  => 'root',
        'pass'  => 'root',
        'name'  => 'test',
    ),
//    'queue_config' => array(
//        'host'      => '127.0.0.1',
//        'port'      => 6379,
//        'pass'      => 'foobared',
//        'db'        => 5,
//        'prefix'    => 'phpspider',
//        'timeout'   => 30,
//    ),
    'fields' => array(
        array(
            'name' => "article_title",
            'selector' => "//*[@id='single-next-link']//div[contains(@class,'content')]/text()[1]",
            'required' => true,
        ),
        array(
            'name' => "article_author",
            'selector' => "//div[contains(@class,'author')]//h2",
            'required' => true,
        ),
        array(
            'name' => "article_headimg",
            'selector' => "//div[contains(@class,'author')]//a[1]",
            'required' => true,
        ),
        array(
            'name' => "article_content",
            'selector' => "//*[@id='single-next-link']//div[contains(@class,'content')]",
            'required' => true,
        ),
        array(
            'name' => "article_publish_time",
            'selector' => "//div[contains(@class,'author')]//h2",
            'required' => true,
        ),
        array(
            'name' => "url",
            'selector' => "//div[contains(@class,'author')]//h2",   // 这里随便设置，on_extract_field回调里面会替换
            'required' => true,
        ),
    ),
);

$spider = new phpspider($configs);

$spider->on_handle_img = function($fieldname, $img) 
{
    $regex = '/src="(https?:\/\/.*?)"/i';
    preg_match($regex, $img, $rs);
    if (!$rs) 
    {
        return $img;
    }

    $url = $rs[1];
    $img = $url;

    //$pathinfo = pathinfo($url);
    //$fileext = $pathinfo['extension'];
    //if (strtolower($fileext) == 'jpeg') 
    //{
        //$fileext = 'jpg';
    //}
    //// 以纳秒为单位生成随机数
    //$filename = uniqid().".".$fileext;
    //// 在data目录下生成图片
    //$filepath = path_root."/images/{$filename}";
    //// 用系统自带的下载器wget下载
    //exec("wget -q {$url} -o {$filepath}");

    //// 替换成真是图片url
    //$img = str_replace($url, $filename, $img);
    return $img;
};

$spider->on_extract_field = function($fieldname, $data, $page) 
{
    if ($fieldname == 'article_title') 
    {
//        if (strlen($data) > 10)
//        {
//            // 下面方法截取中文会有异常
//            //$data = substr($data, 0, 10)."...";
//            $data = mb_substr($data, 0, 10, 'utf-8')."...";
//            $data = trim($data);
//        }
    }
    elseif ($fieldname == 'article_publish_time') 
    {
        // 用当前采集时间戳作为发布时间
        $data = time();
    }
    // 把当前内容页url替换上面的field
    elseif ($fieldname == 'url') 
    {
        $data = $page['url'];
    }
    return $data;
};

$spider->start();

注意一点，on_handle_img，on_extract_field两个方法抓取其它项目时不一定适用，要改成自己的逻辑处理。

运行

phpspider 的简单使用

至此一个简单的抓取数据程序就完成了。

上一篇：世界500强排名远超阿里巴巴京东市值被严重低估？

下一篇：国美和京东百亿联采如何布局产业生态，助推“双链”升级？

phpspider 的简单使用

phpspider 的简单使用

Android App端与PHP Web端的简单数据交互实现示例

简单理解Python中的装饰器

Android使用Activity实现简单的可输入对话框

python使用MySQLdb访问mysql数据库的方法

浅谈Android中使用异步线程更新UI视图的几种方法

Win10搜索无法使用怎么办？Win10搜索功能无法使用的解决方法

Python使用smtplib模块发送电子邮件的流程详解

使用python加密自己的密码

python中使用序列的方法

Python使用自带的ConfigParser模块读写ini配置文件