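C#: Analyzing and Handling csrf-token When Scraping Website Data
Requirements
The cargo shipment tracking query of a certain airline is a POST request. Simulating that POST HTTP request from back-end code returned no page data. After inspecting the airline's website, it turned out that the site uses a CSRF protection mechanism, so the simulated request was rejected outright with a 40x error.
About CSRF
CSRF (cross-site request forgery) itself is not covered here; readers can look it up on their own.
Analysis of the site's HTTP request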
headers
form data
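The request headers contain a cookie and an x-csrf-token, and the form data contains _csrf (with the same value as the one in the headers).
The site's JavaScript source shows that _csrf is read from the head tag of the page.
My guess is that the cookie and the x-csrf-token are valid for a limited time and that they work together to defend against CSRF attacks.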
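As a rough illustration of reading _csrf out of the page head, the sketch below pulls the name and content of every meta tag whose name starts with _csrf. The class name CsrfMetaParser, the attribute order (name before content) and the sample markup in the usage note are assumptions made for this example, not details confirmed for the airline's site.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the original post.
public static class CsrfMetaParser
{
    // Matches <meta name="_csrf..." content="..."/> and captures both attributes.
    // Assumption: the name attribute comes before the content attribute.
    private static readonly Regex MetaRegex = new Regex(
        "<meta\\s+name=\"(?<name>_csrf[^\"]*)\"\\s+content=\"(?<value>[^\"]*)\"\\s*/?>",
        RegexOptions.IgnoreCase);

    // Returns a map from meta name (for example "_csrf") to its content value.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>();
        if (string.IsNullOrWhiteSpace(html))
        {
            return result;
        }
        foreach (Match m in MetaRegex.Matches(html))
        {
            result[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

Calling CsrfMetaParser.Extract on a made-up page such as `<head><meta name="_csrf" content="abc-123"/></head>` yields one entry whose key is _csrf and whose value is abc-123. Using [^"]* instead of .* keeps each match inside a single attribute even when several meta tags sit on one line.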
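Solution
1. First request the airline's website once to obtain the cookie and _csrf.
2. Then simulate the HTTP request in C#, adding those values to the headers and the form data respectively, and send it.
Code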
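The second step can be sketched as two small functions: one builds the header dictionary and one builds the URL-encoded form body with _csrf appended. The class name CsrfRequestBuilder and the form field awbNo in the usage note are hypothetical; the header names follow what the request analysis above observed, and the real site may expect other fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of the original post.
public static class CsrfRequestBuilder
{
    // Headers observed in the browser request: the cookie plus the CSRF token.
    public static Dictionary<string, string> BuildHeaders(string cookie, string token)
    {
        return new Dictionary<string, string>
        {
            { "Cookie", cookie },
            { "X-CSRF-TOKEN", token }
        };
    }

    // Copies the query fields, adds _csrf, and URL-encodes every key and value.
    public static string BuildFormBody(IDictionary<string, string> fields, string token)
    {
        var all = new Dictionary<string, string>(fields) { ["_csrf"] = token };
        return string.Join("&", all.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value)));
    }
}
```

For instance, BuildFormBody with a single made-up field awbNo set to "988 1234" and the token abc-123 produces a body containing awbNo=988%201234 and _csrf=abc-123. The header dictionary has the shape that a method taking IDictionary<string, string> headers can consume directly.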
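using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class CsrfToken
{
    string cookie;       // cookie of the target site, sent with later requests
    List<string> csrfs;  // key and value of the target site's CSRF token

    public CsrfToken(string url)
    {
        // Start with an empty list so callers never see null
        csrfs = new List<string>();
        // Validate the input
        if (!string.IsNullOrWhiteSpace(url))
        {
            try
            {
                // Request the page; the helper also returns the Set-Cookie header
                var _http = new HttpHelper(url);
                string cookie;
                string html = _http.CreateGetHttpResponseForPC(out cookie);
                this.cookie = cookie;
                // Find the <meta name="_csrf..." content="..."/> tags in the head
                string headRegex = @"<meta name=""_csrf.*"" content="".*""/>";
                MatchCollection matches = Regex.Matches(html, headRegex);
                // Pull the content attribute value out of each matched tag
                Regex re = new Regex("(?<=content=\").*?(?=\")", RegexOptions.None);
                foreach (Match match in matches)
                {
                    MatchCollection mc = re.Matches(match.Value);
                    foreach (Match ma in mc)
                    {
                        csrfs.Add(ma.Value);
                    }
                }
            }
            catch (Exception)
            {
                // Any failure (network error, null HTML) leaves the token list empty
            }
        }
    }

    public string GetCookie()
    {
        return cookie;
    }

    public void SetCookie(string cookie)
    {
        this.cookie = cookie;
    }

    public List<string> GetCsrf_token()
    {
        return csrfs;
    }
}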
HttpHelper
// Members of the HttpHelper class.
// Required namespaces: System, System.Collections.Generic, System.IO, System.Net, System.Net.Security, System.Text

public string CreatePostHttpResponse(IDictionary<string, string> headers, IDictionary<string, string> parameters)
{
    // HTTPS request setup; the security protocol is set before the request is created
    ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11;
    HttpWebRequest request = WebRequest.Create(_baseIpAddress) as HttpWebRequest;
    request.ProtocolVersion = HttpVersion.Version10;
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    // request.ContentType = "application/json";
    request.UserAgent = DefaultUserAgent;
    //request.Headers.Add("x-csrf-token", "bc0cc533-60cc-484a-952d-0b4c1a95672c");
    //request.Referer = "https://www.asianacargo.com/tracking/viewtraceairwaybill.do";
    //request.Headers.Add("origin", "https://www.asianacargo.com");
    //request.Headers.Add("cookie", "jsessionid=hp21d2dq5foslg4fyw4slwwhb0-sl1cg6jgtj7he41e5f4an_r1p!-435435446!117330181");
    //request.Host = "www.asianacargo.com";

    // Copy the caller's headers (cookie, x-csrf-token) onto the request
    if (!(headers == null || headers.Count == 0))
    {
        foreach (string key in headers.Keys)
        {
            request.Headers.Add(key, headers[key]);
        }
    }

    // Write the form data if there is any to post
    if (!(parameters == null || parameters.Count == 0))
    {
        StringBuilder buffer = new StringBuilder();
        int i = 0;
        foreach (string key in parameters.Keys)
        {
            // URL-encode keys and values, as the form content type requires
            buffer.AppendFormat(i > 0 ? "&{0}={1}" : "{0}={1}",
                Uri.EscapeDataString(key), Uri.EscapeDataString(parameters[key]));
            i++;
        }
        byte[] data = Encoding.UTF8.GetBytes(buffer.ToString());
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(data, 0, data.Length);
        }
    }

    try
    {
        // Read the response body
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader readStream = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return readStream.ReadToEnd();
        }
    }
    catch (WebException ex)
    {
        // Error statuses such as 40x end up here; the response is available for inspection
        HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
        return null;
    }
}

public string CreateGetHttpResponse(out string cookie)
{
    // HTTPS request setup; the security protocol is set before the request is created
    ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(CheckValidationResult);
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11;
    HttpWebRequest request = WebRequest.Create(_baseIpAddress) as HttpWebRequest;
    request.ProtocolVersion = HttpVersion.Version10;
    request.Method = "GET";
    request.ContentType = "application/x-www-form-urlencoded";
    request.UserAgent = DefaultUserAgent;

    try
    {
        // Read the Set-Cookie header and the response body
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader readStream = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            cookie = response.Headers["Set-Cookie"];
            return readStream.ReadToEnd();
        }
    }
    catch (WebException ex)
    {
        // Error statuses end up here; return no cookie and no body
        HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
        cookie = "";
        return null;
    }
}
Crawler program
Crawl result
Browser result
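Notes and conclusions
1. Different websites deliver the CSRF token in different ways. However they do it, as long as the value reaches the front end there is some way to extract it.
2. The HTTP-level checks may differ from site to site. For the airline cargo site tested here, the request had to use TLS 1.2 as its security protocol:
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11;
Other request parameters, such as the User-Agent, may also be validated by the server.
3. For the airline above, the cookie and the CSRF token do not change for some period of time. A real crawler can therefore cache the cookie and the CSRF token and fetch new ones only when a request fails.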
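The caching idea in point 3 could look roughly like the sketch below. It is a minimal illustration under stated assumptions: CsrfSession and CachedCsrfClient are hypothetical names, the two delegates stand in for the real GET and POST calls, and a null result is treated as "request failed", matching how the POST helper above returns null on a WebException.

```csharp
using System;

// Hypothetical container for the cached values: not part of the original post.
public sealed class CsrfSession
{
    public string Cookie { get; set; }
    public string Token { get; set; }
}

// Hypothetical wrapper that reuses one session until a request fails.
public sealed class CachedCsrfClient
{
    private readonly Func<CsrfSession> _fetchSession;          // GETs the page, parses cookie and token
    private readonly Func<CsrfSession, string, string> _post;  // sends the query, returns null on failure
    private CsrfSession _cached;

    public CachedCsrfClient(Func<CsrfSession> fetchSession, Func<CsrfSession, string, string> post)
    {
        _fetchSession = fetchSession;
        _post = post;
    }

    public string Query(string trackingNumber)
    {
        // Fetch the cookie and token only when nothing is cached yet
        if (_cached == null)
        {
            _cached = _fetchSession();
        }
        string result = _post(_cached, trackingNumber);
        if (result == null)
        {
            // The cached values may have expired: refresh once and retry once
            _cached = _fetchSession();
            result = _post(_cached, trackingNumber);
        }
        return result;
    }
}
```

Retrying only once keeps a genuinely broken request (for example a wrong tracking number that the server rejects) from looping forever. The sketch is not thread-safe; a crawler that queries in parallel would need a lock around the cached session.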