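A roundup of garbled-text problems in a Node.js crawler
In the previous post I used a Node.js program to parse pages encoded as gbk, gb2312, and utf-8. Three special cases of garbled output came up along the way and deserve a separate explanation.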
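1. The page declares utf-8 but decodes to garbage. Example site: www.guoguo-app.com.
This one is a rather silly problem. The page source declares its encoding as utf-8, as shown here: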
<meta charset="utf-8"> <title>查快递</title>
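Since the decoded output was always garbled, I captured the traffic and found that the response header declares gbk. Sure enough, once I decoded the body as gbk, the text was no longer garbled. Taobao really goes out of its way to deter crawlers, though I am curious how this mismatch is actually produced. If you know, please tell me.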
get / http/1.1
host: www.guoguo-app.com
connection: close

http/1.1 200 ok
date: thu, 06 apr 2017 01:56:23 gmt
content-type: text/html;charset=gbk
transfer-encoding: chunked
connection: close
vary: accept-encoding
vary: accept-encoding
content-language: zh-cn
server: tengine/aserver
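The fix described above amounts to trusting the charset in the Content-Type response header over the one in the meta tag. A minimal sketch of that rule follows. The helper names charsetFromContentType and pickCharset are hypothetical, and falling back to utf-8 when neither source names a charset is an assumption, not something the capture proves.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns a lowercase label such as 'gbk', or null when no charset is present.
function charsetFromContentType(contentType) {
  if (typeof contentType !== 'string') return null;
  const match = /charset\s*=\s*"?([^;"\s]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: prefer the header charset, then the meta charset,
// then utf-8 (the final fallback is an assumption of this sketch).
function pickCharset(contentType, metaCharset) {
  return charsetFromContentType(contentType) || (metaCharset || 'utf-8').toLowerCase();
}

// Header and meta values as seen for www.guoguo-app.com.
console.log(pickCharset('text/html;charset=gbk', 'utf-8')); // gbk
```

The resulting label can then be handed to whatever decoder the crawler already uses for gbk pages.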
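2. The page declares utf-8 but decodes to garbage, second case. Example site: http://andersonjiang.blog.sohu.com/
Looking at the page source alone shows nothing wrong, so I captured the traffic again and got the following: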
get / http/1.1
host: andersonjiang.blog.sohu.com
connection: close

http/1.1 200 ok
content-type: text/html; charset=gbk
transfer-encoding: chunked
connection: close
server: nginx
date: thu, 06 apr 2017 02:10:33 gmt
vary: accept-encoding
expires: thu, 01 jan 1970 00:00:00 gmt
rhost: 192.168.110.68@11177
pragma: no-cache
cache-control: no-cache
content-language: en-us
content-encoding: gzip
fss-cache: miss from 13539701.18454911.21477824
fss-proxy: powered by 9935166.11245896.17873234
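andersonjiang.blog.sohu.com uses both transfer-encoding: chunked and content-encoding: gzip. Because my Node.js crawler has no gzip decompression step, it cannot extract any field from this site, neither the title nor the charset. To handle sites like this, the crawler needs gzip decompression added.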
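A minimal sketch of such a decompression step, using Node's built-in zlib module. The helper name decompressBody is hypothetical, and the sketch assumes the whole response body has already been collected into a single Buffer. Node's http module removes the chunked transfer encoding by itself, so only the content encoding has to be handled.

```javascript
const zlib = require('zlib');

// Hypothetical helper: undo Content-Encoding before any charset detection.
// `body` is the complete response body as a Buffer (all chunks concatenated).
function decompressBody(body, contentEncoding) {
  const encoding = (contentEncoding || 'identity').trim().toLowerCase();
  switch (encoding) {
    case 'gzip':
    case 'x-gzip':
      return zlib.gunzipSync(body);
    case 'deflate':
      // Assumes a zlib-wrapped deflate stream; raw deflate would need inflateRawSync.
      return zlib.inflateSync(body);
    default:
      // No known content encoding: return the bytes unchanged.
      return body;
  }
}

// Round-trip demo with locally compressed data instead of a live request.
const compressed = zlib.gzipSync(Buffer.from('<title>demo</title>'));
console.log(decompressBody(compressed, 'gzip').toString('latin1'));
```

The output is still a Buffer of raw bytes, so charset detection and decoding run on it afterwards exactly as for an uncompressed page.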
For comparison, the captures for www.cr173.com and www.csdn.net below show the normal case.
get / http/1.1
host: www.cr173.com
connection: close

http/1.1 200 ok
expires: thu, 06 apr 2017 02:42:20 gmt
date: thu, 06 apr 2017 02:12:20 gmt
content-type: text/html
last-modified: thu, 06 apr 2017 00:52:42 gmt
etag: "96a4141970aed21:0"
cache-control: max-age=1800
accept-ranges: bytes
content-length: 158902
accept-ranges: bytes
x-varnish: 1075189606
via: 1.1 varnish
x-via: 1.1 dxxz46:4 (cdn cache server v2.0), 1.1 oudxin15:1 (cdn cache server v2.0)
connection: close

get / http/1.1
host: www.csdn.net
connection: close

http/1.1 200 ok
server: openresty
date: thu, 06 apr 2017 02:18:59 gmt
content-type: text/html; charset=utf-8
content-length: 99363
connection: close
vary: accept-encoding
last-modified: thu, 06 apr 2017 02:10:02 gmt
vary: accept-encoding
etag: "58e5a37a-18423"
accept-ranges: bytes
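In the www.cr173.com capture the content-type header carries no charset at all, so a crawler has to fall back to the declaration inside the page. The following is a rough sketch of that fallback. The helper names sniffMetaCharset and resolveCharset, the 1024-byte scan limit, and the final utf-8 default are all assumptions of this example.

```javascript
// Hypothetical helper: look for a charset declaration near the start of the page.
// latin1 maps every byte to exactly one character, so the ASCII markup stays
// readable no matter what the real encoding of the page is.
function sniffMetaCharset(body, limit = 1024) {
  const head = body.subarray(0, limit).toString('latin1');
  const match = /<meta[^>]*?charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: header charset first, then the meta tag, then utf-8.
function resolveCharset(headerCharset, body) {
  return (headerCharset && headerCharset.toLowerCase()) || sniffMetaCharset(body) || 'utf-8';
}

// Made-up page fragment, used only to exercise the fallback path.
const page = Buffer.from('<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head>');
console.log(resolveCharset(null, page)); // gb2312
```

Both the short form (meta charset="...") and the older http-equiv form are matched by the same pattern.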
3. The page uses some other encoding and decodes to garbage. For example:
(1) big5, example sites: www.ruten.com.tw, www.ctgoodjobs.hk
(2) shift_jis, example sites: www.vector.co.jp, www.smbc.co.jp
(3) windows-12, example sites: www.tff.org, www.pravda.com.ua
(4) euc-jp, example site: www.showtime.jp
(5) euc-kr, example sites: www.incruit.com, www.samsunghospital.com
The iconv-lite documentation lists the following supported encodings:
currently only a small part of encodings supported:
all node.js native encodings: 'utf8', 'ucs2', 'ascii', 'binary', 'base64'.
base encodings: 'latin1'
cyrillic encodings: 'windows-1251', 'koi8-r', 'iso 8859-5'.
simplified chinese: 'gbk', 'gb2312'.
other encodings are easy to add, see the source. please, participate
So for the page encodings listed above, the only option is to add the decoders myself.
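Before writing decoders by hand, it may be worth checking what the runtime already offers. The sketch below swaps in a different technique from the one used in this post: the TextDecoder class built into Node.js instead of iconv-lite. Whether labels such as big5, shift_jis, euc-jp, or euc-kr are accepted depends on the ICU data the Node.js build ships with, so that support is an assumption here. The helper name decodeBody is hypothetical.

```javascript
// TextDecoder is available as a global in recent Node.js versions.
// Hypothetical helper: decode a Buffer with the given charset label,
// returning null when this Node.js build does not know the label.
function decodeBody(body, charset) {
  try {
    return new TextDecoder(charset).decode(body);
  } catch (err) {
    // An unknown or unsupported label is reported as a RangeError.
    if (err instanceof RangeError) return null;
    throw err;
  }
}

// 0xA4 0xA4 is the Big5 byte pair for the character '中'.
const text = decodeBody(Buffer.from([0xa4, 0xa4]), 'big5');
console.log(text === null ? 'big5 is not supported by this build' : text);
```

When decodeBody returns null, the crawler still needs another decoder for that page, for example one added to iconv-lite as described above.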
In short, there is still a long way to go before this becomes a general-purpose crawler.