Python Web Scraping Cheat Sheet

程序员文章站 2022-03-16 14:16:15

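I don't even know how long it has been since my last blog post. With graduation approaching, the thesis and an internship arrived at the same time. From now on I will probably publish few general programming articles and focus more on network security. The Windows API articles will gradually be removed from my blog, and I will post occasional Python articles on tools such as webdirscan and weblogon_brust, as well as on binary and mobile app topics. Looking back, I have learned a great deal along the way and met many experts who taught me a lot, and I still think that moving forward steadily, one step at a time, is the most reliable approach. I will keep updating my open-source webdirscan tool and release some small information-gathering tools as open source.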

Scraper environment setup

selenium

Description: drives a real browser, with a visible window
Install: pip3 install selenium
Basic usage:
from selenium import webdriver
driver=webdriver.Chrome()

chromedriver

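Description: the driver that lets Selenium control Google Chrome
Install: pip install chromedriver (the driver version must match the installed Chrome; Selenium 4.6 and later can also fetch a matching driver on its own)
Basic usage:
from selenium import webdriver
driver=webdriver.Chrome()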

phantomjs

Description: drives a browser without a window (headless)
Install: download from https://phantomjs.org/download
Configuration:
export PATH=${PATH}:/root/phantomjs/bin
export OPENSSL_CONF=/etc/ssl/

beautifulsoup4

Description: HTML parsing
Install: pip3 install beautifulsoup4 (the 'lxml' parser used below also needs pip3 install lxml)
Usage:
from bs4 import BeautifulSoup
soup=BeautifulSoup('<html></html>','lxml')

pyquery

Description: jQuery-like selectors for HTML
Install: pip3 install pyquery
Usage:
from pyquery import PyQuery as pq
doc=pq('<html>HELLO</html>')
result=doc('html').text()
print(result)

pymysql

Description: MySQL client
Install: pip3 install pymysql
Usage:
import pymysql
conn=pymysql.connect(host='127.0.0.1',user='root',password='root',port=3306,db='mysql')
cursor=conn.cursor()
cursor.execute('select * from db')
cursor.fetchone()
Fixing an authentication error (MySQL 5.7 and earlier; the PASSWORD() function was removed in MySQL 8.0):
update mysql.user set authentication_string=PASSWORD('root'), plugin='mysql_native_password' where user='root';

pymongo

Description: MongoDB client
Install: pip3 install pymongo
Usage:
import pymongo
client=pymongo.MongoClient('localhost')
db=client['newtestdb']
db['table'].insert_one({'name':'Bob'})# insert() was removed in PyMongo 4
db['table'].find_one({'name':'Bob'})

redis

Description: Redis client (the redis-py package)
Install: pip3 install redis
Usage:
import redis
r=redis.Redis('localhost',6379)
r.set('name','Bob')
r.get('name')

flask (proxy)

Description: proxy
Install: pip3 install flask
Usage:
import flask

django

Description: Django web framework
Install: pip3 install django
Usage:
import django

jupyter

Description: write Markdown and run code in the browser
Install: pip3 install jupyter
Usage:
jupyter notebook

ALL

pip3 install requests selenium beautifulsoup4 pyquery pymysql pymongo redis flask django jupyter

What is a web scraper?

A simple request

import requests
response=requests.get('https://www.baidu.com')
#response.encoding="utf8"
print(response.text)
print(response.headers)
print(response.status_code)

headers={'User-Agent':"**********"}
response=requests.get('https://www.baidu.com',headers=headers)
print(response.status_code)

# Download an image
response=requests.get('https://www.baidu.com/gif.ico')
print(response.content)
with open('/var/tmp/1.git','wb') as f:
    f.write(response.content)

JavaScript rendering

Options: Selenium/WebDriver, Splash, PyV8, or Ghost.py

from selenium import webdriver
driver=webdriver.Chrome()
driver.get('http://m.weibo.com')
print(driver.page_source)

The urllib library

What is urllib?

Built-in request modules:
urllib.request: sending requests
urllib.error: exceptions
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing
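
As a rough offline illustration of the modules listed above, the sketch below splits and builds URLs with urllib.parse, evaluates a rule set with urllib.robotparser, and checks how the two main urllib.error exceptions relate. The example.com URLs and the robots.txt rules are made up for illustration, and nothing is sent over the network.

```python
from urllib import error, parse, robotparser

# urllib.parse: split a URL into components (the URL is hypothetical)
parts = parse.urlparse('https://www.example.com/search?q=python&page=2')
print(parts.netloc)                 # host part
print(parse.parse_qs(parts.query))  # query string as a dict of lists

# urllib.parse: build a query string from a dict
query = parse.urlencode({'q': 'python', 'page': 2})
print(query)

# urllib.robotparser: evaluate made-up rules without fetching anything
rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
allowed_public = rp.can_fetch('*', 'https://www.example.com/index.html')
allowed_private = rp.can_fetch('*', 'https://www.example.com/private/data')
print(allowed_public, allowed_private)

# urllib.error: HTTPError is a subclass of URLError, so catch HTTPError first
print(issubclass(error.HTTPError, error.URLError))
```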

Changes compared with Python 2:
Python 2:
import urllib2
response=urllib2.urlopen('https://www.baidu.com')

Python 3:
import urllib.request
response=urllib.request.urlopen('https://www.baidu.com')
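
Building on the Python 3 form, a request that needs custom headers or form data can be prepared as a Request object before it is passed to urlopen. The sketch below only constructs and inspects the object; the form fields and the User-Agent value are placeholders chosen for illustration, and the actual send is left commented out because it needs network access.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields and User-Agent string, for illustration only
form = urllib.parse.urlencode({'name': 'asd', 'age': 22}).encode('utf-8')
req = urllib.request.Request(
    'https://www.baidu.com',
    data=form,                               # supplying data makes this a POST
    headers={'User-Agent': 'Mozilla/5.0'},
)

print(req.get_method())              # request method chosen by urllib
print(req.get_header('User-agent'))  # Request stores header names capitalized
print(req.data)

# To actually send it (needs network access):
# response = urllib.request.urlopen(req, timeout=5)
# print(response.status, response.read()[:100])
```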

The Requests library

import requests

response=requests.get('https://www.baidu.com')
print(type(response))
print(response.status_code)
print(type(response.text))
print(response.text)
print(response.cookies)
import requests
response=requests.get('http://www.baidu.com?id=1')
print(response.text)

Query parameters:
import requests
data={
'name':'germay',
'age':22
}
response=requests.get('url',params=data)
print(response.text)
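
To see roughly what the params argument does to the URL without sending anything, the standard library can build the same kind of query string. This is only an approximation of what Requests does internally, and the base URL below is a placeholder.

```python
from urllib.parse import urlencode

data = {'name': 'germay', 'age': 22}

# Hypothetical base URL; the encoded parameters are appended after '?'
base_url = 'http://www.example.com/get'
full_url = base_url + '?' + urlencode(data)
print(full_url)
```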
JSON:
import json
import requests
response=requests.get('url')
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))
Binary content:
import requests
response=requests.get('url/img.ico')
print(response.text)
print(response.content)

with open('a.ico','wb') as f:
    f.write(response.content)
Headers:
import requests
headers={
'User-Agent':'Mozilla/5.0'
}
response=requests.get('URL/explore',headers=headers)
print(response.text)
POST requests:
import requests

data={'name':'asd','age':22}
response=requests.post('http://www',data=data)
print(response.text)

import requests

data={'name':'asd','age':22}
headers={
'User-Agent':'asdasdasd'
}
response=requests.post('URL',data=data,headers=headers)
print(response.json())
Response attributes:
import requests

response=requests.get('URL')
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.url)
print(response.history)
Checking the status code:
import requests
response=requests.get('URL')
exit() if not response.status_code==requests.codes.ok else print('ok')
#exit() if not response.status_code==200 else print('ok')

Advanced usage

File upload:
import requests

files={'file':open('img.jpg','rb')}
response=requests.post('URL',files=files)
print(response.text)
Reading cookies:
import requests

response=requests.get('URL')
print(response.cookies)
for key,value in response.cookies.items():
    print(key+'='+value)
Keeping a session (cookies persist across requests):
import requests
s=requests.Session()
s.get('set cookie URL')
response=s.get('get cookie url')
print(response.text)
Certificate verification:
import requests
response=requests.get('URL')
print(response.status_code)

import requests
from requests.packages import urllib3
urllib3.disable_warnings()
response=requests.get('URL',verify=False)# skip certificate verification
print(response.status_code)

import requests
response=requests.get('URL',cert=('/path/server.crt','path/key'))
print(response.status_code)
Proxy settings:
import requests

proxies={
'http':'http://127.0.0.1:1080',
'https':'https://127.0.0.1:1080'
}

response=requests.get('URL',proxies=proxies)
print(response.status_code)

SOCKS proxies need an extra install: pip install 'requests[socks]'
import requests

proxies={
'http':'socks5://127.0.0.1:1080',
'https':'socks5://127.0.0.1:1080'
}
response=requests.get('URL',proxies=proxies)
print(response.status_code)
Timeout (in seconds):
import requests
response=requests.get('https://www.baidu.com',timeout=1)
Authentication (HTTP basic auth for site access):
import requests
from requests.auth import HTTPBasicAuth
r=requests.get('http://127.0.0.1:9090',auth=HTTPBasicAuth('user','123'))
print(r.status_code)

import requests
r=requests.get('http://127.0.0.1:9090',auth=('user','123'))
print(r.status_code)
Exception handling (most specific exception first):
import requests
from requests.exceptions import ReadTimeout,HTTPError,RequestException

try:
    response=requests.get('URL',timeout=1)
    print(response.status_code)
except ReadTimeout:
    print('...')
except HTTPError:
    print('...')
except RequestException:
    print('...')

Regular expressions with re

match

re.match
Tries to match a pattern at the start of the string; returns None if there is no match there
re.match(pattern,string,flags=0)
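
A minimal sketch of the "start of the string" rule: the same text matches when the pattern begins at position 0 and yields None when the pattern only occurs later. The two short patterns are chosen just for this illustration.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# The pattern occurs at position 0, so match() succeeds
hit = re.match(r'Hello', content)
# 'World' only occurs later in the string, so match() returns None
miss = re.match(r'World', content)

for result in (hit, miss):
    if result is None:
        print('no match at the start of the string')
    else:
        print(result.group(), result.span())
```

Checking for None before calling group() avoids the AttributeError that a failed match would otherwise raise.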

Basic match:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$',content)
print(result.group())
print(result.span())

Generic match:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match("^Hello.*Demo$",content)
print(result.group())
print(result.span())

Capturing a target:
import re
content="Hello 1234567 World_This is a Regex Demo"
result=re.match(r'^Hello\s(\d+)\sWorld.*Demo$',content)# (\d+) captures the number as group 1
print(result.group(1))
print(result.span())

Greedy match:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r"^He.*(\d+).*Demo$",content)# .* takes as much as it can
print(result)
print(result.group(1))
print(result.span())

Non-greedy match:
import re
content="Hello 123 4567 World_This is a Regex Demo"
result=re.match(r"^He.*?(\d+).*Demo$",content)# .*? takes as little as it can
print(result)
print(result.group(1))
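
The difference is easiest to see by running both patterns against the same text. The sketch below uses a sample string with one unbroken run of digits, an assumption made so that the two captures differ clearly.

```python
import re

# Sample text with a single run of digits (chosen for illustration)
content = 'Hello 1234567 World_This is a Regex Demo'

greedy = re.match(r'^He.*(\d+).*Demo$', content)
lazy = re.match(r'^He.*?(\d+).*Demo$', content)

# Greedy .* consumes as much as possible and leaves (\d+) a single digit
print(greedy.group(1))
# Non-greedy .*? stops as early as possible, so (\d+) gets the whole number
print(lazy.group(1))
```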

Match modes (flags):
import re
content="""Hello 123 4567 World_This
is a Regex Demo
"""
result=re.match(r"^He.*?(\d+).*?Demo$",content,re.S)# with re.S, . also matches newlines
print(result)
print(result.group(1))
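
To make the effect of re.S concrete, the sketch below applies one pattern to a two-line string with and without the flag. The string is a shortened stand-in for the text used above.

```python
import re

content = 'Hello 123 4567 World_This\nis a Regex Demo'
pattern = r'^He.*?(\d+).*?Demo$'

without_flag = re.match(pattern, content)     # '.' stops at the newline
with_flag = re.match(pattern, content, re.S)  # '.' also matches '\n'

print(without_flag)
print(with_flag.group(1))
```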

Escaping special characters:
import re
content='price is $5.00'
result=re.match(r'price is \$5\.00',content)
print(result)
print(result.group())# the pattern has no capture groups, so group(1) would raise IndexError
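
When the literal text comes from a variable, writing the backslashes by hand is error-prone. One option, sketched here as a suggestion rather than something the original notes use, is re.escape from the standard library.

```python
import re

content = 'price is $5.00'

# re.escape() backslash-escapes regex metacharacters such as '$' and '.'
pattern = re.escape('$5.00')
print(pattern)

escaped = re.search(pattern, content)
# Unescaped, '$' means "end of string", so the literal text is not found
unescaped = re.search('$5.00', content)

print(escaped.group())
print(unescaped)
```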

search (first match anywhere in the string)

re.search:
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
result=re.search(r'Hello.*?(\d+).*?Demo',content)
print(result)
print(result.group(1))

findall (all matches)

# html is the page source as a string
result=re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
print(result)
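
Since html is not defined in the snippet, here is a self-contained sketch of the same pattern. The HTML fragment, file names, and singer names are invented for illustration; a real page will need its own pattern.

```python
import re

# Made-up HTML fragment standing in for a downloaded page
html = '''<ul>
<li data-view="1"><a href="/1.mp3" singer="Singer A">Song One</a></li>
<li data-view="2"><a href="/2.mp3" singer="Singer B">Song Two</a></li>
</ul>'''

# With several groups, findall() returns a list of tuples, one per match
result = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
for href, singer, title in result:
    print(href, singer, title)
```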

sub

Replaces every match in the string and returns the resulting string
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'\d+','',content)
print(content)

import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'\d+','replace',content)
print(content)

Reusing the matched text in the replacement (\1 refers to group 1):
import re
content='Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
content=re.sub(r'(\d+)',r'\1 8910',content)
print(content)

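compile

Compiles a regular expression into a reusable pattern object
import re
content='''Extra strings Hello 1234567 World_This
is a Regex Demo Extra strings'''
pattern=re.compile("Hello.*Demo",re.S)
result=re.search(pattern,content)# search, because the text does not start with Hello
print(result)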
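
A compiled pattern is mainly useful when the same expression is applied many times. The sketch below reuses one pattern object across several strings and calls its own search, findall, and sub methods; the sample strings are made up.

```python
import re

# Compile once, then call methods on the pattern object
number = re.compile(r'\d+')

lines = ['Hello 1234567 World', 'no digits here', 'Demo 42']  # made-up sample data
for line in lines:
    found = number.search(line)
    print(found.group() if found else None)

# The same object also offers findall() and sub()
all_numbers = number.findall('a1 b22 c333')
cleaned = number.sub('', 'Hello 1234567 World')
print(all_numbers, cleaned)
```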