vbs或asp采集文章时网页编码问题

程序员文章站 2022-04-14 19:11:00

'/*========================================================================= &nbs...

'/*=========================================================================
' * intro       研究网页编码很长时间了，因为最近要设计一个友情链接检测的vbs脚本，而与你链接的人的页面很可能是各种编码，以前采取的方法是：如果用gb2312查不到再用utf-8查，再找不到证明对方没有给你做链接虽然不是100%正确，但也差不多了，这两种编码用的人比较多，偶然间在收藏夹里的一个地址看到的一个思路，终于可以在采集文章时自动判断网页的编码了。因为研究过程中这个问题困扰很久，虽然现在觉得简单了，想必很多人还在找，所以把这三个函数贴出来。
' * filename    getwebcodepage.vbs
' * author      yongfa365
' * version     v2.0
' * web         http://www.yongfa365.com
' * email       yongfa365[at]qq.com
' * firstwrite  http://www.yongfa365.com/item/getwebcodepage.vbs.html
' * madetime    2008-01-29 20:55:46
' * lastmodify  2008-01-30 20:55:46
' *==========================================================================*/


call gethttppage("http://www.baidu.com/")
call gethttppage("http://www.google.com/")
call gethttppage("http://www.yongfa365.com/")
call gethttppage("http://www.cbdcn.com/")
call gethttppage("http://www.csdn.net/")


'得到匹配的内容，返回数组
'getcontents(表达式，字符串，是否返回引用值)
'msgbox getcontents("a(.+?)b", "a23234b ab a67896896b sadfasdfb" ,true)(0)

function getcontents(patrn, strng , yinyong)
'by www.yongfa365.com 转载请保留链接，以便最终用户及时得到最新更新信息
    on error resume next
    set re = new regexp
    re.pattern = patrn
    re.ignorecase = true
    re.global = true
    set matches = re.execute(strng)
    if yinyong then
        for i = 0 to matches.count -1
            if matches(i).value<>"" then retstr = retstr & matches(i).submatches(0) & "柳永法"
        next
    else
        for each omatch in matches
            if omatch.value<>"" then retstr = retstr & omatch.value & "柳永法"
        next
    end if
    getcontents = split(retstr, "柳永法")
end function

function gethttppage(url)
    on error resume next
    set xmlhttp = createobject("msxml2.xmlhttp")
    xmlhttp.open "get", url, false
    xmlhttp.send
    if xmlhttp.status<>200 then exit function
    getbody = xmlhttp.responsebody
    '柳永法(www.yongfa365.com)在此的思路是,先根据返回的字符串找，找文件头，如果还没有的话就用gb2312,一般都能直接匹配出编码。
    '在返回的字符串里看，虽然中文是乱码，但不影响我们取其编码，
    getcodepage = getcontents("charset=[""']*([^"",']+)", xmlhttp.responsetext , true)(0)
    '在头文件里看编码
     if len(getcodepage)<3 then getcodepage = getcontents("charset=[""']*([^"",']+)", xmlhttp.getresponseheader("content-type") , true)(0)
    if len(getcodepage)<3 then getcodepage = "gb2312"
    set xmlhttp = nothing
    '下边这句在正式使用时要屏蔽掉
    wscript.echo url & "-->" & getcodepage
    gethttppage = bytestobstr(getbody, getcodepage)
end function


function bytestobstr(body, cset)
    on error resume next
    dim objstream
    set objstream = createobject("adodb.stream")
    objstream.type = 1
    objstream.mode = 3
    objstream.open
    objstream.write body
    objstream.position = 0
    objstream.type = 2
    objstream.charset = cset
    bytestobstr = objstream.readtext
    objstream.close
    set objstream = nothing
end function

上一篇：简单的asp采集代码教程

下一篇： asp文本框换行显示代码