Scraping the Title, Date and Time, Source, Author, and Body Text of a Sina Article with a Web Crawler (Python)

程序员文章站 2022-03-02 19:39:01

1. Prerequisites
BeautifulSoup4 is installed for Python
requests is installed for Python (optional; I show both approaches below)
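Before running the scripts, you can ask Python whether it can locate both packages. This is a minimal sketch using only the standard library; the helper name `check_packages` and the `PIP_NAMES` mapping are made up for illustration. Note that BeautifulSoup4 is installed as `beautifulsoup4` but imported as `bs4`.

```python
from importlib.util import find_spec

# Import name -> name to pass to pip (illustrative mapping).
PIP_NAMES = {"bs4": "beautifulsoup4", "requests": "requests"}


def check_packages(names):
    """Return a dict mapping each top-level import name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PIP_NAMES).items():
        if found:
            print(name, "is installed")
        else:
            print(name, "is missing; try: pip install", PIP_NAMES[name])
```

If `requests` is reported missing, the first approach below still works, since it relies only on `urllib.request` and `bs4`.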
2. On to the main topic
First, pick the URL of a Sina article. The one I use is: http://news.sina.com.cn/c/2018-04-22/doc-ifznefkh5284628.shtml

Here is the code:
(1) First approach: fetch the page with urllib.request from Python's standard library

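# Scrape the article title, publication time, source, author, and body text

from urllib.request import urlopen
from bs4 import BeautifulSoup

response = urlopen("http://news.sina.com.cn/c/2018-04-22/doc-ifznefkh5284628.shtml")   # open the URL; returns a response object
soup = BeautifulSoup(response, "html.parser")  # parse the fetched page with the given parser

head = soup.select(".main-title")[0].text  # article title
date = soup.select(".date")[0].text        # publication date and time
source = soup.select(".source")[0].text    # source

article = []                                # list of paragraph strings
for p in soup.select("#article p")[:-1]:    # every paragraph except the last <p> (presumably the author line read below)
    article.append(p.text.strip())          # append the trimmed paragraph text
article = '\n\n'.join(article)              # blank line between paragraphs for readability
# Equivalent one-liner:
# article = '\n\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])

author = soup.select(".show_author")[0].text.strip()  # author; strip() with no argument trims whitespace (strip("") removes nothing)
# Print the result; rjust roughly imitates the layout of the article
print(head.rjust(60),"\n",date.rjust(60)+' '+source,"\t\n",author.rjust(70),"\n",article)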
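Indexing `soup.select(...)[0]` raises `IndexError` as soon as a selector matches nothing, which can happen if the page layout changes. A more forgiving lookup could look like the sketch below. `first_text` is a hypothetical helper, and the HTML snippet is a made-up stand-in that reuses the class names from the code above; it is not the real Sina page.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default=""):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = soup.select_one(selector)  # select_one returns None when nothing matches
    return node.get_text(strip=True) if node is not None else default


# Made-up sample markup that mimics the selectors used above.
SAMPLE_HTML = """
<h1 class="main-title">Sample title</h1>
<span class="date">2018年04月22日 07:55</span>
<div id="article"><p>First paragraph.</p><p class="show_author">Editor: someone</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = first_text(soup, ".main-title")
source = first_text(soup, ".source", default="(unknown source)")  # no .source in the sample
print(title)
print(source)
```

With this helper a missing element yields a placeholder instead of stopping the script, which makes it easier to see which selector no longer matches.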

(2) Second approach: fetch the page with requests

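# Scrape the article title, publication time, source, author, and body text

import requests
from bs4 import BeautifulSoup

res = requests.get("http://news.sina.com.cn/c/2018-04-22/doc-ifznefkh5284628.shtml")   # res holds the HTTP response
res.encoding = "utf-8"  # assumption: the page is UTF-8; otherwise requests may guess the wrong encoding and garble the Chinese text
soup = BeautifulSoup(res.text, "html.parser")  # parse the response text with the given parser

head = soup.select(".main-title")[0].text  # article title
date = soup.select(".date")[0].text        # publication date and time
source = soup.select(".source")[0].text    # source
article = '\n\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])     # body text in one line: all paragraphs except the last <p>
author = soup.select(".show_author")[0].text.strip()  # author; strip() with no argument trims whitespace

# Print the result; rjust roughly imitates the layout of the article
print(head.rjust(60),"\n",date.rjust(60)+' '+source,"\t\n",author.rjust(70),"\n",article)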
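Both versions leave `date` as a plain string. If you want to sort or compare articles, it can be converted into a `datetime`. The sketch below assumes the text looks like `2018年04月22日 07:55`; that format is an assumption about the page rather than something confirmed here, and `parse_sina_date` is a hypothetical name.

```python
from datetime import datetime


def parse_sina_date(text):
    """Parse a string such as '2018年04月22日 07:55' into a datetime object."""
    # The Chinese characters in the format string are matched literally.
    return datetime.strptime(text.strip(), "%Y年%m月%d日 %H:%M")


if __name__ == "__main__":
    parsed = parse_sina_date("2018年04月22日 07:55")
    print(parsed.isoformat())
```

If the page uses a different format, `strptime` raises `ValueError`, so the format string would need to be adjusted to match.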

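That's all. I'm a beginner learning web scraping and put this together from a video course, so experts, please go easy on me.

If you're interested in learning web scraping, here is the link:
Video link: http://study.163.com/course/courseMain.htm?courseId=1003285002

The instructor explains things well.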