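Python Crawler Beginner Tutorial 62-100: Thirty Years Old, Looking for Papers to Improve Myself, and Blocked by Anti-Crawling. Let's Do It in Python (Anti-Crawling, Part 2)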

Programmer Article Station (程序员文章站) 2022-04-28 15:59:40


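Academic search

Studying theory inevitably means searching the literature, and a good paper gives solid support to your hands-on work. My university's intranet comes with a CNKI (知网) account by default, which is very nice.

The site we are tackling today is http://ac.scmor.com/

Google Scholar is a literature search service that currently mainly provides search over several academic literature databases such as VIP Information (维普资讯) and Wanfang Data (万方数据). Through Google Scholar you can only find the "reports, abstracts, and cited content... (source: Baidu Baike)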


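Our goal

Get the address behind the "现在访问" (Visit now) link. When you inspect that link with Chrome's developer tools, what you find is a call to a JS encryption function rather than a plain URL.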


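Note the position marked 2 in the screenshot above. Next, using the approach from the previous post, we try to get the body of the visit function.

We need to search all of the loaded request resources for a visit method; follow the steps carefully.

Double-click the method name to jump to its definition.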

This leads to the core method

function visit(url) {
    var newtab = window.open('about:blank');   
    if(gword!='') url = strdecode(url);
   // var newtab = window.open(url);   
    newtab.location.href = url;
    //newtab.location.reload(true);
}

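Before the jump, url is passed through a strdecode function, so the implementation of that function is all you need to focus on.

Looking at where visit is called, the argument is generated like this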

 onclick="visit(\'' + autourl[b] + '\')"

The values of autourl[b] can be read directly from the HTML page by a crawler
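As a minimal sketch of that extraction step, the snippet below pulls the encoded values out of a page with a regular expression. The markup in SAMPLE_HTML, the placeholder values, and the helper name extract_autourls are all assumptions made for illustration; the real page may declare autourl differently, in which case the pattern needs adjusting.

```python
import re

# Hypothetical page fragment: the real markup on the site may differ.
SAMPLE_HTML = """
<script>
var autourl = new Array();
autourl[1] = 'ZmFrZS1vbmU=';
autourl[2] = 'ZmFrZS10d28=';
</script>
"""

# Matches assignments of the form  autourl[N] = '...';
AUTOURL_PATTERN = re.compile(r"autourl\[(\d+)\]\s*=\s*'([^']+)'")


def extract_autourls(html):
    """Return {index: encoded_value} for every autourl assignment found."""
    return {int(index): value for index, value in AUTOURL_PATTERN.findall(html)}


if __name__ == "__main__":
    for index, value in extract_autourls(SAMPLE_HTML).items():
        print(index, value)
```

Each value found this way is still encoded; it only becomes a usable link after going through the decoding routine analysed in the rest of the post.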


function auto(b) {
    t = (tim - ts[b]) / 100;
    tt = t.toString().split('.');
    if(tt.length==1) t = t.toString() + '.00';
    else if(tt[1].length < 2)  t = t.toString() + '0';
    if (t > 4) document.getElementById("txt" + b).innerHTML = '<font color=red>连接超时！<\/font>';
    else document.getElementById("txt" + b).innerHTML = 'takes ' + t + 's.   <a href="javascript:;" class="ok" onclick="visit(\'' + autourl[b] + '\')"> 现在访问 <\/a>'
}

function visit(url) {
    var newtab = window.open('about:blank');   
    if(gword!='') url = strdecode(url);
   // var newtab = window.open(url);   
    newtab.location.href = url;
    //newtab.location.reload(true);
}

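Parameter analysis

if(gword!='') url = strdecode(url); When gword is not empty, the URL is passed through strdecode. Reading further, the relevant code turns out to be just below.

We already obtained the value of gword in one of the screenshots above; scroll up to see it.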

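Analysis of the strdecode function

Base64-decode the input string
Build a key from gword
Compute the length of key
Loop over the string, XOR each character with the matching key character to build code, then Base64-decode code once more. Note the JS functions String.fromCharCode (chr in Python) and charCodeAt (ord in Python)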
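The loop in the last step can be sketched in Python on its own, with ord standing in for charCodeAt and chr for String.fromCharCode. The helper name xor_with_key and the key value are hypothetical and only illustrate the technique; the real key is the page's gword plus 'ok '.

```python
def xor_with_key(text, key):
    """XOR every character of text with the key, repeating the key as needed."""
    out = []
    for i, ch in enumerate(text):
        k = i % len(key)                           # position inside the repeating key
        out.append(chr(ord(ch) ^ ord(key[k])))     # charCodeAt -> ord, fromCharCode -> chr
    return "".join(out)


key = "example-gword" + "ok "   # hypothetical gword value plus the 'ok ' suffix
masked = xor_with_key("aGVsbG8=", key)
restored = xor_with_key(masked, key)   # applying the same XOR again undoes it
print(restored == "aGVsbG8=")
```

Because XOR with the same key is its own inverse, the routine that masks a string is also the one that unmasks it.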

//code
function strdecode(string) {
    string = base64decode(string);
    key = gword+'ok ';
    len = key.length;
    code = '';
    for (i = 0; i < string.length; i++) {
        var k = i % len;
        code += String.fromCharCode(string.charCodeAt(i) ^ key.charCodeAt(k))
    }
    return base64decode(code)
}
var base64decodechars = new Array(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 62, -1, -1, -1, 63, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, -1, -1, -1, -1, -1, -1, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, -1, -1, -1, -1, -1);

function base64decode(str) {
    var c1, c2, c3, c4;
    var i, len, out;
    len = str.length;
    i = 0;
    out = "";
    while (i < len) {
        do {
            c1 = base64decodechars[str.charCodeAt(i++) & 0xff]
        } while (i < len && c1 == -1);
        if (c1 == -1) break;
        do {
            c2 = base64decodechars[str.charCodeAt(i++) & 0xff]
        } while (i < len && c2 == -1);
        if (c2 == -1) break;
        out += String.fromCharCode((c1 << 2) | ((c2 & 0x30) >> 4));
        do {
            c3 = str.charCodeAt(i++) & 0xff;
            if (c3 == 61) return out;
            c3 = base64decodechars[c3]
        } while (i < len && c3 == -1);
        if (c3 == -1) break;
        out += String.fromCharCode(((c2 & 0xf) << 4) | ((c3 & 0x3c) >> 2));
        do {
            c4 = str.charCodeAt(i++) & 0xff;
            if (c4 == 61) return out;
            c4 = base64decodechars[c4]
        } while (i < len && c4 == -1);
        if (c4 == -1) break;
        out += String.fromCharCode(((c3 & 0x03) << 6) | c4)
    }
    return out
}

There are two ways to proceed from here

1. Rewrite the logic in Python
2. Call the JS directly from Python

We go with option 2: copy base64decode into a file and call it through execjs
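For comparison, option 1 could look roughly like the sketch below, which uses the standard library's base64 module instead of the site's JS decoder. This is not the approach the post follows, and it rests on assumptions: the decoded link is treated as Latin-1 text, the gword value and the example URL are made up, and strencode is a hypothetical inverse that exists only to produce test input.

```python
import base64


def b64decode_lenient(data):
    """Base64-decode bytes, restoring any '=' padding that was dropped."""
    data = data.strip()
    return base64.b64decode(data + b"=" * (-len(data) % 4))


def strdecode(encoded, gword):
    """Sketch of the JS strdecode: Base64-decode, XOR with the key, Base64-decode again."""
    key = (gword + "ok ").encode("latin-1")
    raw = b64decode_lenient(encoded.encode("ascii"))
    masked = bytes(b ^ key[i % len(key)] for i, b in enumerate(raw))
    return b64decode_lenient(masked).decode("latin-1")


def strencode(url, gword):
    """Hypothetical inverse of strdecode, used only to build sample input."""
    key = (gword + "ok ").encode("latin-1")
    inner = base64.b64encode(url.encode("latin-1"))
    masked = bytes(b ^ key[i % len(key)] for i, b in enumerate(inner))
    return base64.b64encode(masked).decode("ascii")


# Made-up gword and URL, purely for demonstration.
sample = strencode("https://example.org/scholar", "example-gword")
print(strdecode(sample, "example-gword"))
```

The trade-off is the usual one: a pure-Python port has no JS runtime dependency, while calling the original JS guarantees identical behaviour even if the site's decoder has quirks the port does not reproduce.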

execjs: a Python library for running JS

execjs lets you run JavaScript code from Python

Project page: https://pypi.org/project/pyexecjs/

Install: pip install pyexecjs

You can switch pip to the Tsinghua mirror

After installing, import it in PyCharm; if no error is raised, the installation succeeded

Next we compile the JS

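import execjs

# scmor.js holds the base64decode function copied from the site
with open('scmor.js', 'r', encoding='utf-8') as f:
    js = f.read()

ctx = execjs.compile(js)  # compile the JS so its functions can be called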

The core method

def decode(string):
    # Base64-decode the argument; string is one of the autourl values obtained above
    string = ctx.call('base64decode', string)
    key = " link@scmor.comok "  # value of gword + 'ok '; gword can be read from the HTML page
    key_len = len(key)  # length of the key (a separate name, so the built-in len is not shadowed)
    code = ''
    for i in range(len(string)):
        k = i % key_len
        n = ord(string[i]) ^ ord(key[k])  # XOR each character with the repeating key
        code += chr(n)
    return ctx.call('base64decode', code)

Result of the run


Full code download

Follow the WeChat public account 非本科程序员 and reply 0402 to get the download address
