Scraping an Image with Native PHP and Saving It Locally

程序员文章站 2022-03-26 10:58:40

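Web scraping usually brings Python to mind, but PHP handles the job well too. This article uses no ready-made library: it relies on curl and a regular expression to fetch one image and save it, as a refresher on PHP's built-in functions. The same approach can be extended into a more complex crawler.

A simple example that reviews how a few PHP functions are used.

Functions and concepts used

curl: sends the network request
preg_match: regular-expression matching

Code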

$url     = 'http://desk.zol.com.cn/bizhi/7386_91671_2.html';
$headers = [
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
];
$ch      = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the result of curl_exec() as a string instead of printing it
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);     // request headers go in CURLOPT_HTTPHEADER; CURLOPT_HEADER is a bool that adds response headers to the output
$output = curl_exec($ch);
curl_close($ch);
$str = mb_convert_encoding($output, 'UTF-8', 'GB2312');
// or: $str = iconv('GB2312', 'UTF-8//IGNORE', $output);

// save the image only if the tag was found
if (preg_match('!<img id="bigimg" src="(?<src>http.*\.(?<ext>jpg|png))".*>!', $str, $m)) {
    file_put_contents('./meinv.' . $m['ext'], file_get_contents($m['src']));
}

Result

[Image: scraping an image with native PHP and saving it locally]

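Explanation

Sending the request with curl

In PHP, a curl transfer usually follows four steps: initialize the handle, set options, execute the request, and release the handle.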

$ch = curl_init();
curl_setopt($ch, $option, $value);   // $option is one of the CURLOPT_* constants
$out = curl_exec($ch);
curl_close($ch);
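
The four steps above can be wrapped in a small helper that also checks for failure. This is only a minimal sketch, not the article's code: the function name fetchUrl, the 10-second timeout, and the exception on failure are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original article): wraps the four curl steps.
function fetchUrl(string $url, array $headers = []): string
{
    $ch = curl_init();                        // 1. initialize
    curl_setopt_array($ch, [                  // 2. set options
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_HTTPHEADER     => $headers,   // request headers, e.g. ['User-Agent: ...']
        CURLOPT_TIMEOUT        => 10,         // assumed value, adjust as needed
    ]);
    $body  = curl_exec($ch);                  // 3. execute
    $error = curl_error($ch);                 // read the error text before releasing
    curl_close($ch);                          // 4. release

    if ($body === false) {
        throw new RuntimeException('curl request failed: ' . $error);
    }
    return $body;
}
```

A call would look like $html = fetchUrl($url, $headers); with the same $url and $headers as in the main example.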

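Commonly used CURLOPT options; see the PHP documentation for more.

CURLOPT_URL, string // the URL to request; required
CURLOPT_HTTPHEADER, array // request headers
CURLOPT_RETURNTRANSFER, bool // when true, the response is returned as a string, without headers
CURLOPT_SSL_VERIFYPEER, bool // when false, the HTTPS certificate is not verified; used when requesting https URLs
CURLOPT_POST, bool // when true, sends a POST request together with CURLOPT_POSTFIELDS; GET is the default
CURLOPT_POSTFIELDS, array|string // the POST data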
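
As a rough illustration of how the two POST options combine, the sketch below prepares a POST request without sending it. The helper name buildPostHandle, the example.com URL, and the field names are placeholders, not part of the original article.

```php
<?php
// Hypothetical helper: prepares (but does not send) a POST request.
function buildPostHandle(string $url, array $fields)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        // A string is sent as application/x-www-form-urlencoded;
        // passing the array itself would send multipart/form-data.
        CURLOPT_POSTFIELDS     => http_build_query($fields),
    ]);
    return $ch;
}

// Usage (placeholder URL and fields):
// $ch = buildPostHandle('https://example.com/login', ['user' => 'alice']);
// $response = curl_exec($ch);
// curl_close($ch);
```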

Printing $output directly shows garbled text. The page source reveals that the page is encoded in GB2312, so convert it to UTF-8 with mb_convert_encoding or iconv before output.
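
The conversion can be tried without any network request. In this small sketch the input bytes are an assumed sample (the GB2312 encoding of the two characters 中文) and the variable names are illustrative.

```php
<?php
// Assumed sample input: the bytes of "中文" encoded as GB2312.
$gb2312 = "\xD6\xD0\xCE\xC4";

// Convert with mbstring: (string, target encoding, source encoding).
$viaMb = mb_convert_encoding($gb2312, 'UTF-8', 'GB2312');

// Convert with iconv: (source encoding, target encoding, string).
// The //IGNORE suffix belongs on the target encoding.
$viaIconv = iconv('GB2312', 'UTF-8//IGNORE', $gb2312);

echo $viaMb, PHP_EOL;   // 中文
```

Note that the two functions take their arguments in a different order, which is an easy source of mistakes when switching between them.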

Regex matching with preg_match

The page source shows that the image tag we need is <img id="bigimg" src="https://desk-fd.zol-img.com.cn/t_s960x600c5/g5/m00/0a/03/chmkj1wy5y-ifhr_aalcdzhe3wwaat3agoma_iaasin642.jpg" width="960" height="600">

The regular expression

<img id="bigimg" src="(?<src>http.*\.(?<ext>jpg|png))".*>

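.* matches any run of characters other than newlines, and a named group (?<name>...) makes it easy to pick out the part you want as $m['name'].

Finally, $m['src'] holds the real URL of the image; saving its contents with file_put_contents completes the job.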
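
To see the named groups at work without fetching the page, the pattern can be run against the sample tag quoted above. This is an illustrative sketch: the variable names $html, $pattern, and $target are assumptions, and the download step is left out.

```php
<?php
// Sample tag copied from the page source quoted in the article.
$html = '<img id="bigimg" src="https://desk-fd.zol-img.com.cn/t_s960x600c5/g5/m00/0a/03/chmkj1wy5y-ifhr_aalcdzhe3wwaat3agoma_iaasin642.jpg" width="960" height="600">';

// Same pattern as in the article; "!" is used as the delimiter.
$pattern = '!<img id="bigimg" src="(?<src>http.*\.(?<ext>jpg|png))".*>!';

if (preg_match($pattern, $html, $m) === 1) {
    $target = './meinv.' . $m['ext'];   // "meinv" is the file name used in the article
    echo $m['src'], ' -> ', $target, PHP_EOL;
} else {
    echo 'no match', PHP_EOL;
}
```

Checking the return value of preg_match before reading $m avoids undefined-index notices when the page layout changes and the tag is no longer found.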
