Writing Your Own Scraper

程序员文章站 2023-01-29 11:34:31

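There are plenty of ready-made scrapers online, but sometimes you come across a good site and want your own tool to collect some of its information, and then you have to write the program yourself. Such a scraper is actually not hard to write; most of the work is analysing the page structure of the source site.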
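To make "analysing the page structure" a bit more concrete: most of the work boils down to finding two fixed pieces of markup around the value you want and cutting out what sits between them. Below is a minimal sketch of that idea in VBScript. The function name `GetBetween` and the sample markup are made up for illustration; they are not part of the original code.

```vbscript
' Hypothetical helper (not from the original listing): returns the text that
' sits between startMark and endMark, or "" when either marker is missing.
Function GetBetween(text, startMark, endMark)
    Dim p1, p2
    GetBetween = ""
    ' find the start marker, ignoring case
    p1 = InStr(1, text, startMark, vbTextCompare)
    If p1 = 0 Then Exit Function
    ' move past the marker itself
    p1 = p1 + Len(startMark)
    ' find the end marker after that point
    p2 = InStr(p1, text, endMark, vbTextCompare)
    If p2 = 0 Then Exit Function
    GetBetween = Mid(text, p1, p2 - p1)
End Function

' Usage with made-up markup:
' GetBetween("<p>Title: <strong>Some Movie</strong></p>", "Title: <strong>", "</strong>")
' returns "Some Movie"
```

In a real page you would pick markers that appear only once, such as a label followed by its opening tag.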
First, get hold of an xmlhttp wrapper class:
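<%
' xhttp: a small wrapper around serverxmlhttp and adodb.stream that fetches
' a URL as text (decoded with the configured charset) or saves it to disk.
class xhttp
    private cset, surl, serror

    private sub class_initialize()
        ' charset used to decode fetched pages; switch to "utf-8" if the source site uses it
        'cset = "utf-8"
        cset = "gb2312"
        serror = ""
    end sub

    private sub class_terminate()
    end sub

    ' URL to fetch
    public property let url(theurl)
        surl = theurl
    end property

    ' everything before the last "/" of the URL
    public property get basepath()
        basepath = mid(surl, 1, instrrev(surl, "/") - 1)
    end property

    ' everything after the last "/" of the URL
    public property get filename()
        filename = mid(surl, instrrev(surl, "/") + 1)
    end property

    ' page body decoded to a string
    public property get html()
        html = bytestobstr(getbody(surl))
    end property

    ' error number of the last request, or "" when it succeeded
    public property get xhttperror()
        xhttperror = serror
    end property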

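    ' Decode a byte array into a string using the charset in cset ("gb2312" or "utf-8").
    private function bytestobstr(body)
        on error resume next
        dim objstream
        set objstream = server.createobject("adodb.stream")
        with objstream
            .type = 1          ' adTypeBinary: accept raw bytes
            .mode = 3          ' adModeReadWrite
            .open
            .write body        ' load the response bytes
            .position = 0      ' rewind before switching the stream type
            .type = 2          ' adTypeText
            .charset = cset    ' decode with the configured charset
            bytestobstr = .readtext
            .close
        end with
        set objstream = nothing
    end function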

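    ' Fetch a URL and return the raw response bytes, or "" when the request does not complete.
    private function getbody(surl)
        on error resume next
        dim xmlhttp
        getbody = ""
        'set xmlhttp = server.createobject("msxml2.xmlhttp.4.0")
        'set xmlhttp = server.createobject("microsoft.xmlhttp")
        set xmlhttp = server.createobject("msxml2.serverxmlhttp")
        ' timeouts in milliseconds: resolve, connect, send, receive
        xmlhttp.settimeouts 10000, 10000, 10000, 30000
        xmlhttp.open "GET", surl, false
        xmlhttp.send
        if xmlhttp.readystate = 4 then
            ' the status check is left disabled, as in the original; enable it to skip non-200 responses
            'if xmlhttp.status = 200 then
            getbody = xmlhttp.responsebody
            'end if
        end if

        ' remember the error number of a failed request
        if err.number <> 0 then
            serror = err.number
            err.clear
        else
            serror = ""
        end if
        set xmlhttp = nothing
    end function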

    ' Download the current URL and save it to tofile (a site-relative path).
    ' When isoverwrite is false, an existing file is left untouched.
    public function saveimage(tofile, isoverwrite)
        on error resume next
        dim objstream, objfso, imgs, found

        if not isoverwrite then
            set objfso = server.createobject("scripting.filesystemobject")
            found = objfso.fileexists(server.mappath(tofile))
            set objfso = nothing
            if found then exit function
        end if

        imgs = getbody(surl)
        set objstream = server.createobject("adodb.stream")
        with objstream
            .type = 1                                  ' adTypeBinary
            .open
            .write imgs
            .savetofile server.mappath(tofile), 2      ' adSaveCreateOverWrite
            .close
        end with
        set objstream = nothing
    end function

end class
%>
With this class in place, the rest gets much easier.
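The `basepath` and `filename` properties of the class are just string slicing around the last `/` of the URL. A quick way to see what they return is to try the same expressions as standalone functions. The names `UrlBasePath` and `UrlFileName` below are only for illustration, and the URL is an example in the shape the scraper uses.

```vbscript
' Standalone copies of the expressions behind xhttp.basepath and xhttp.filename.
' Both assume the URL contains at least one "/"; without one, Mid raises an error.
Function UrlBasePath(u)
    ' everything before the last "/"
    UrlBasePath = Mid(u, 1, InStrRev(u, "/") - 1)
End Function

Function UrlFileName(u)
    ' everything after the last "/"
    UrlFileName = Mid(u, InStrRev(u, "/") + 1)
End Function

' UrlBasePath("http://www.81dd.com/class/3_12.htm") returns "http://www.81dd.com/class"
' UrlFileName("http://www.81dd.com/class/3_12.htm") returns "3_12.htm"
```

This is handy when a page uses relative links: prepend the base path to get an absolute URL.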
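Next, analyse the page structure of the site you want to scrape and write the scraper around it.
Here is an example. It relies on an open ADO connection `conn` and on the helpers `getcontent`, `striphtml` and `createf`, none of which appear in the listing: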



<%
server.scripttimeout = 1000
%>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=gb2312" />
<title>BT Scraper</title>
</head>
<body>
<form name="form1" method="post" action="get81bt.asp">
Category ID:
  <input type="text" name="cid" value="<%=server.htmlencode(request("cid"))%>"><br>
Start ID:
  <input type="text" name="startid" value="<%=server.htmlencode(request("startid"))%>">
  <br>
  End ID:
  <input type="text" name="overid" value="<%=server.htmlencode(request("overid"))%>">
  <br>
  Category name: <input type="text" name="classname" value="<%=server.htmlencode(request("classname"))%>"> (leave empty to detect it automatically)
  <br>
  <input name="action" type="hidden" id="action" value="getdata">
  <input type="submit" name="submit" value="Scrape">
</form>
Current ID: <%=server.htmlencode(request("id"))%> <br>
<%
dim action

action = request("action")
if action = "getdata" then
    cid = request("cid")
    startid = request("startid")
    overid = request("overid")
    id = request("id")
    if id = "" then id = startid

    set objxhttp = new xhttp

    ' fetch one list page of the category
    objxhttp.url = "http://www.81dd.com/class/" & cid & "_" & id & ".htm"
    content = objxhttp.html

    ' "网站维护中" is the source site's "under maintenance" notice: skip to the next page
    if instr(content, "网站维护中") > 0 then
        call nextid
        response.end()
    end if

    list = getcontent(content, "", "", 0)

    ' collect every link of the form <a href="../bthtml/...">
    dim regex, match, matches, patrn
    set regex = new regexp
    patrn = "<a href=""../bthtml/(.+?)"">"
    regex.pattern = patrn
    regex.ignorecase = true
    regex.global = true
    set matches = regex.execute(list)
    on error resume next
    for each match in matches

        'response.write match.value & "<br>"
        ' submatches(0) is the captured part after ../bthtml/
        weburl = "http://www.81dd.com/bthtml/" & match.submatches(0)
        response.write weburl & "<br>"
        response.flush()

        ' fetch the detail page
        objxhttp.url = weburl
        cpage = objxhttp.html
        cpage = getcontent(cpage, "", "", 0)

        ' the title follows the site's "bt资源名称：" (resource name) label
        title = getcontent(cpage, "bt资源名称：<strong>", "</strong>", 0)
        title = striphtml(title)

        ' use the category from the form, or guess it from keywords in the title
        if request("classname") <> "" then
            classname = request("classname")
        else
            ' genre keywords, checked in order: comedy, action, thriller, crime, horror, romance,
            ' adventure, sci-fi, suspense, fantasy, war, TV series, variety show, disaster, family drama
            classname = ""
            for each kw in array("喜剧", "动作", "惊悚", "犯罪", "恐怖", "爱情", "冒险", "科幻", _
                                 "悬念", "奇幻", "战争", "连续剧", "综艺", "灾难", "伦理")
                if instr(title, kw) > 0 then
                    classname = kw
                    exit for
                end if
            next
            if classname = "" then
                if instr(title, "动漫") > 0 or instr(title, "动画") > 0 then
                    classname = "动漫"        ' animation
                elseif instr(title, "国语") > 0 or instr(title, "集") > 0 then
                    classname = "其他影视"    ' other film and TV
                else
                    classname = "其他"        ' other
                end if
            end if
        end if

        ' description block of the detail page
        intro = getcontent(cpage,"<tr><td width=770 bgcolor=#ffffff><div style=""margin:10px;line-height:150%"">","</div>",0)
        ' protect the tags we want to keep by turning them into placeholders (any letter case)
        intro = replace(intro, "<br />", "[br]", 1, -1, vbtextcompare)
        intro = replace(intro, "<br>", "[br]", 1, -1, vbtextcompare)
        intro = replace(intro, "<p>", "[p]", 1, -1, vbtextcompare)
        intro = replace(intro, "</p>", "[/p]", 1, -1, vbtextcompare)
        intro = replace(intro, "<img", "[img", 1, -1, vbtextcompare)
        ' strip every remaining tag
        intro = striphtml(intro)
        ' turn the placeholders back into tags
        intro = replace(intro, "[br]", "<br>")
        intro = replace(intro, "[p]", "<p>")
        intro = replace(intro, "[/p]", "</p>")
        ' handle [img]url[/img] first, otherwise the generic "[img" rule below would break it
        intro = replace(intro, "[img]", "<img src=", 1, -1, vbtextcompare)
        intro = replace(intro, "[/img]", ">", 1, -1, vbtextcompare)
        intro = replace(intro, "[img", "<img")
        'response.write t
        'response.end()

        ' publish time follows the site's "发布时间：" label; fall back to the current time
        addtime = trim(getcontent(cpage, "发布时间：", " ", 0))
        if not isdate(addtime) then addtime = now()

        username = "bt"

        ' file size follows the "bt文件大小：" label; note that this reads the list page (content), not cpage
        filesize = getcontent(content, "bt文件大小：", " ", 0)

        title2 = title

        ' link to the .torrent file
        downurl = getcontent(cpage, "<a style=""color:red"" href=""", """", 0)

        ' build a local file name from the publish date, the current time and a random 3-digit number
        p = cdate(addtime)
        dim srnd
        randomize
        srnd = int(900 * rnd) + 100
        sfilename = year(p) & month(p) & day(p) & hour(now) & minute(now) & second(now) & srnd & ".torrent"

        url = "torrent/" & year(p) & "-" & month(p) & "-" & day(p) & "/" & sfilename
        call createf(url)

        ' debug output
        response.write classname & "<br>"
        response.write title & "<br>"
        'response.write intro & "<br>"
        'response.write addtime & "<br>"
        'response.write username & "<br>"
        'response.write filesize & "<br>"
        response.write downurl & "<br>"
        response.write url & "<br>"
        response.flush()

        'response.end()
        ' database: store the record unless something failed above
        if err.number = 0 then
            if (not isnull(title)) and title <> "" and downurl <> "" then
                ' look up the category, creating it when it does not exist yet
                set rs = server.createobject("adodb.recordset")
                ' single quotes are doubled so a quote in the scraped text cannot break the query
                sql = "select * from bt_class where classname = '" & replace(classname, "'", "''") & "'"
                rs.open sql, conn, 1, 3      ' adOpenKeyset, adLockOptimistic
                if rs.eof then
                    rs.addnew
                    rs("classname") = classname
                    rs.update
                end if
                classid = rs("classid")
                rs.close
                set rs = nothing

                ' insert the movie only when no record with the same title exists
                set rs = server.createobject("adodb.recordset")
                sql = "select * from bt_movie where title in ('" & replace(title, "'", "''") & "')"
                rs.open sql, conn, 1, 3
                if rs.eof then
                    response.write "<div><font color=blue>Writing to database...</font></div>"
                    response.flush()
                    rs.addnew
                    rs("classid") = classid
                    rs("title") = title
                    rs("title2") = title2
                    rs("intro") = intro
                    rs("username") = username
                    rs("filesize") = filesize
                    rs("url") = url
                    rs("serverid") = 1
                    rs("addtime") = addtime
                    rs("ismake") = 0
                    rs.update

                    ' download the .torrent file only for new records
                    objxhttp.url = downurl
                    objxhttp.saveimage url, false
                else
                    response.write "<div><font color=red>Already exists!</font></div>"
                end if
                rs.close
                set rs = nothing

                'objxhttp.url = downurl
                'objxhttp.saveimage url,false
            end if

        else
            err.clear
        end if
        response.write "-------------------------------------------<br>"
    next
    set regex = nothing

    response.write "Next page<br>"
    response.flush()

    call nextid()

end if

' Close the connection, then redirect to the next list page or report completion.
' The ids are converted with clng so values above 32767 do not overflow.
sub nextid
    conn.close
    set conn = nothing

    dim nexturl
    nexturl = "get81bt.asp?action=getdata&classname=" & server.urlencode(request("classname")) & _
              "&cid=" & cid & "&startid=" & startid & "&overid=" & overid & "&id="

    if clng(startid) < clng(overid) and clng(id) < clng(overid) then
        ' counting up
        response.write "<script>location.href='" & nexturl & (clng(id) + 1) & "'</script>"
    elseif clng(startid) > clng(overid) and clng(id) > clng(overid) then
        ' counting down
        response.write "<script>location.href='" & nexturl & (clng(id) - 1) & "'</script>"
    else
        response.write "Scraping finished!<br>"
        response.end()
    end if
end sub

%>

</body>
</html>
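
The placeholder trick used for `intro` in the listing (swap the tags you want to keep for bracket markers, strip everything else, swap them back) can be folded into one helper. Here is a rough sketch, assuming that "strip" simply means removing anything that looks like a tag. The article does not show its own `striphtml`, so the name `StripHtmlKeep` and the regular expressions are my assumptions, not the original implementation.

```vbscript
' Hypothetical helper (not from the original listing): removes all HTML tags
' except <br>, <p> and </p>, which are protected with bracket placeholders.
Function StripHtmlKeep(html)
    Dim re, s
    s = html
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True

    ' 1. protect the tags to keep: <br>, <br/>, <br /> in any letter case
    re.Pattern = "<br\s*/?>"
    s = re.Replace(s, "[br]")
    s = Replace(s, "<p>", "[p]", 1, -1, vbTextCompare)
    s = Replace(s, "</p>", "[/p]", 1, -1, vbTextCompare)

    ' 2. strip every remaining tag
    re.Pattern = "<[^>]*>"
    s = re.Replace(s, "")

    ' 3. turn the placeholders back into tags
    s = Replace(s, "[br]", "<br>")
    s = Replace(s, "[p]", "<p>")
    s = Replace(s, "[/p]", "</p>")

    Set re = Nothing
    StripHtmlKeep = s
End Function

' StripHtmlKeep("<div><p>Line one<br />Line <b>two</b></p></div>")
' returns "<p>Line one<br>Line two</p>"
```

One caveat of this approach: text that already contains a literal `[br]` or `[p]` is turned into a tag as well, so pick placeholders that cannot occur in the source pages if that matters.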
