
Parsing URLs and Extracting Search Keywords in Python: A Detailed Example

程序员文章站 2022-10-02 18:18:56

Libraries used for URL parsing:

Python 2:

from urlparse import urlparse
import urllib

Python 3:

from urllib.parse import urlparse
import urllib.request

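After studying the URL patterns of different sites, I found that when the search keyword is attached with "=" (for example wd=python), the keyword ends up in the query component of the parsed URL. When it is not attached with "=", the keyword ends up in the path component.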
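The query/path distinction can be sketched as follows. This is a minimal illustration in Python 3, assuming two URLs from url.txt (shortened here); the helper name locate_keyword_part is hypothetical and not part of the original code.

```python
from urllib.parse import urlparse

# Two sample URLs taken from url.txt (trailing parameters shortened).
query_style = "http://news.baidu.com/ns?word=%e8%b6%b3%e7%90%83&tn=news"
path_style = "https://www.lagou.com/jobs/list_python%e5%a4%a7%e6%95%b0%e6%8d%aemr?labelwords="


def locate_keyword_part(url, flag):
    """Return the URL component (query or path) expected to hold the keyword."""
    parts = urlparse(url)
    # A flag containing "=" marks a query parameter; any other flag sits in the path.
    return parts.query if "=" in flag else parts.path


query_part = locate_keyword_part(query_style, "word=")
path_part = locate_keyword_part(path_style, "list_")
print(query_part)
print(path_part)
```

Note that urlparse only splits the URL; the percent-encoded bytes are still encoded at this point and have to be decoded separately.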

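The matching rule is the same in both cases: a single regular expression that combines six alternatives (plain word characters and percent-encoded bytes in different orders). It appears as the variable regular in the code below.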
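To see the pattern at work on one query string, here is a small Python 3 sketch. The pattern is copied from the article's code; the helper name extract_keyword and the use of re.escape on the flag are assumptions added for illustration.

```python
import re
from urllib.parse import unquote

# The six-alternative pattern from the article: plain text mixed with
# percent-encoded bytes, in the orderings the author lists.
REGULAR = r'(\w+(%\w\w)+\w+|(%\w\w)+\w+(%\w\w)+|\w+(%\w\w)+|(%\w\w)+\w+|(%\w\w)+|\w+)'


def extract_keyword(text, flag):
    """Find `flag` followed by a keyword in `text`; return the decoded keyword or None."""
    match = re.search(re.escape(flag) + REGULAR, text, re.I)
    if match is None:
        return None
    # Group 1 is the outer group of REGULAR, i.e. the keyword without the flag.
    return unquote(match.group(1))


# Shortened query string of the second www.baidu.com line in url.txt.
query = "ie=utf-8&wd=python%e9%87%8c%e7%9a%84%e5%ad%97%e5%85%b8dict&oq=python"
keyword = extract_keyword(query, "wd=")
print(keyword)
```

Here the first alternative (word characters, encoded bytes, word characters) is the one that matches, because the keyword starts and ends with ASCII text.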

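In addition, URLs whose host is 's.weibo.com' are encoded differently from the others: the keyword is percent-encoded twice, so each "%" appears as "%25". These URLs need separate handling.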
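The double encoding can be illustrated with a short Python 3 sketch. The article's code handles it by replacing "%25" with "%" before decoding; the version below decodes twice with unquote instead, which is an alternative that gives the same result for this input.

```python
from urllib.parse import unquote

# Keyword segment copied from the first s.weibo.com line of url.txt.
double_encoded = "%25e5%2594%2590%25e4%25ba%25ba%25e8%25a1%2597%25e6%258e%25a2%25e6%25a1%25882"

# First pass: every "%25" becomes "%", leaving ordinary percent-encoding.
once = unquote(double_encoded)
# Second pass: the percent-encoded UTF-8 bytes become text.
twice = unquote(once)
print(once)
print(twice)
```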

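The code is as follows. The rules of some sites are still not fully clear, and working them out takes a lot of time; the clearer the rule, the cleaner the extracted keyword. The rules below already fit the great majority of sites; adapt them as needed.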

# -*- coding: utf-8 -*-

import re

try:  # Python 3
    from urllib.parse import urlparse, unquote
except ImportError:  # Python 2
    from urlparse import urlparse
    from urllib import unquote

# File containing the URLs, one per line
source_txt = "e:\\python_anaconda_code\\url.txt"
# Pattern: word characters and percent-encoded bytes in six combinations
regular = r'(\w+(%\w\w)+\w+|(%\w\w)+\w+(%\w\w)+|\w+(%\w\w)+|(%\w\w)+\w+|(%\w\w)+|\w+)'

# Extracted keywords
kw_list = list()

# Key: host of the site under study; value: the flag that precedes the keyword
host_flags = {
    "www.baidu.com": "wd=",
    "news.baidu.com": "word=",
    "www.sogou.com": "query=",
    "tieba.baidu.com": "kw=",
    "wenku.baidu.com": "word=",
    "music.sina.com.cn": "k=",
    "www.haosou.com": "q=",
    "www.lagou.com": "list_",
    "www.chunyuyisheng.com": "query=",
    "s.weibo.com": "weibo/"
}


def main():
    with open(source_txt, 'r') as f_source_txt:
        for url in f_source_txt:
            parts = urlparse(url.strip())
            flag = host_flags.get(parts.netloc)
            if flag is None:
                continue  # blank line or a host we do not study
            if "=" in flag:
                # Keyword attached with "=": look in the query string
                text = parts.query
            else:
                # Otherwise look in the path; s.weibo.com encodes twice,
                # so turn "%25" back into "%" first
                text = parts.path.replace("%25", "%")
            # As in the original version, "+" (an encoded space) is dropped
            kw = re.search(flag + regular, text.replace('+', ''), re.I)
            if kw:
                # Group 1 is the keyword itself, without the flag
                kw = unquote(kw.group(1))
                kw_list.append(kw)
                print(kw)


if __name__ == '__main__':
    main()

The contents of url.txt are as follows:

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&ch=&tn=baidu&bar=&wd=python&rn=&oq=&rsv_pq=ece0867c0002c793&rsv_t=edeaqq7ddvznxq%2fzvra5k%2beuanltiuxhgihvutaqdfoecluxr25xkdp%2bi0i&rqlang=cn&rsv_enter=1&inputt=218
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%e9%87%8c%e7%9a%84%e5%ad%97%e5%85%b8dict&oq=python&rsv_pq=96c160e70003f332&rsv_t=0880nkovmir3tvoddp1t8eblod8qwr4yep6cfpjqihqnnhdexfuwyofmrx0&rqlang=cn&rsv_enter=0&inputt=10411 
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%e9%87%8c%e7%9a%84urlprese&oq=python%25e9%2587%258c%25e7%259a%2584re%25e9%2587%258c%25e7%259a%2584%257c%25e6%2580%258e%25e4%25b9%2588%25e7%2594%25a8&rsv_pq=d1d4e7b90003d391&rsv_t=5ff4vok4eelk1pgj4osk8l0vvkan51%2bl8ns%2fjsubexg7lb7znkctvnvtn8m&rqlang=cn&rsv_enter=1&inputt=2797 
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python++wo+%e7%88%b1urlprese&oq=python%25e9%2587%258c%25e7%259a%2584urlprese&rsv_pq=eecf45e900033e87&rsv_t=1c70xayhrvw5joza7lpvgt4pw%2bw1to8hqtejth67jgeqfqagydydd25hamu&rqlang=cn&rsv_enter=0&inputt=10884 
http://news.baidu.com/ns?word=%e8%b6%b3%e7%90%83&tn=news&from=news&cl=2&rn=20&ct=1 
http://news.baidu.com/ns?ct=1&rn=20&ie=utf-8&bs=%e8%b6%b3%e7%90%83&rsv_bp=1&sr=0&cl=2&f=8&prevct=no&tn=news&word=++++++%e8%b6%b3++%e7%90%83+++++%e4%bd%a0%e5%a5%bd+%e5%98%9b%ef%bc%9f&rsv_sug3=14&rsv_sug4=912&rsv_sug1=4&inputt=8526 
http://tieba.baidu.com/f?ie=utf-8&kw=%e7%ba%a2%e6%b5%b7%e8%a1%8c%e5%8a%a8&fr=search&red_tag=q0224393377 
https://www.sogou.com/web?query=ni+zai+%e6%88%91+%e5%bf%83li&_asf=www.sogou.com&_ast=1520388441&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=9493&sst0=1520388440692&lkt=8%2c1520388431200%2c1520388436842&sugsuv=1498714959961744&sugtime=1520388440692 
https://www.lagou.com/jobs/list_python%e5%a4%a7%e6%95%b0%e6%8d%aemr?labelwords=&fromsearch=true&suginput= 
https://www.chunyuyisheng.com/pc/search/?query=%e6%85%a2%e6%80%a7%e4%b9%99%e8%82%9d% 
http://s.weibo.com/weibo/%25e5%2594%2590%25e4%25ba%25ba%25e8%25a1%2597%25e6%258e%25a2%25e6%25a1%25882&refer=index
http://s.weibo.com/weibo/%25e4%25bd%25a0%25e5%25a5%25bd123mm%2520%25e5%2597%25af%2520mm11&refer=stopic_box

The result is as follows:

[Screenshot of the program output; image not preserved]

To study another host, add it to the host dictionary in the code, together with the flag that precedes its keyword.
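For hosts whose flag is a query parameter, a table-driven variant could also lean on the standard library's parse_qs, so that adding a host is one more table entry. This is a hedged alternative in Python 3, not the article's method; the names HOST_PARAMS and keyword_from_query are hypothetical, and it does not cover path-style hosts such as www.lagou.com or s.weibo.com.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical table: host -> name of the query parameter that holds the keyword.
HOST_PARAMS = {
    "www.baidu.com": "wd",
    "news.baidu.com": "word",
    "www.sogou.com": "query",
}


def keyword_from_query(url):
    """Return the decoded search keyword, or None if the host is not in the table."""
    parts = urlparse(url.strip())
    param = HOST_PARAMS.get(parts.netloc)
    if param is None:
        return None
    # parse_qs percent-decodes the values and turns "+" into a space.
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None


# Shortened version of the www.sogou.com line in url.txt.
url = "https://www.sogou.com/web?query=ni+zai+%e6%88%91+%e5%bf%83li&ie=utf8"
keyword = keyword_from_query(url)
print(keyword)
```

Unlike the regular-expression version, which removes "+" before matching, this sketch keeps the spaces between the words of the keyword.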

Note: the code and approach above are for reference only. If you know a better method, please leave a comment.
