Java Crawler Primer: Scraping Douban Movie Data, Segmenting the Text, and Writing It to Excel
程序员文章站
2022-05-07 19:02:42
This is a beginner crawler I wrote a while ago; sharing it here. The full implementation code is below.
I. Goal
Use a crawler to fetch data from a web page and write it to a text file, segment the text, and finally write the tokens to an Excel sheet. Overall it is fairly simple.
II. Method
I had no prior hands-on crawler experience. After some study, and since I had not learned Python, I chose Java's basic Jsoup library for the crawling, jieba for word segmentation, and jxl for writing Excel.
The program is split across three classes: content, list, and jieba.
In the end it fully scraped the 25 films on the first page of the Douban Movie Top 250. For each film, the list class scrapes its name, detail-page URL, score, and quote, and by calling the spider method it also scrapes each film's poster, synopsis, 5 hot short comments, and 10 hot reviews.
III. Implementation Process and Results
1. Add the dependency packages
2. Structure:
content: scrapes each film's hot comments and reviews;
list: scrapes the film list, calls the spider method in content to scrape the reviews, and writes the results to a txt file;
jieba: reads the txt file, segments the text, and finally writes the tokens to an Excel file;
3. Write the film-list crawler test class:
Inspect and analyze the target page:
Write the test code:
Test:
4. Write the film-detail crawler test class:
Inspect and analyze the target page:
Write the test code:
Test:
5. Modify the code so that list calls the content method:
content (with the spider method):
package reallll.QYL01;

import java.io.IOException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * @student:QYL
 */
public class content {

    public static void main(String[] args) throws IOException {
    }

    // Scrapes one film's detail page: poster URL, synopsis,
    // hot short comments, and hot reviews.
    public static void spider(String M_url) throws IOException {
        Document document = Jsoup.parse(new URL(M_url), 30000);

        // Poster image URL (first <img> on the page)
        Elements e2 = document.getElementsByTag("img");
        System.out.println("Picture_url: " + e2.attr("src"));

        // Synopsis
        Elements e3 = document.getElementsByClass("short");
        System.out.println("Intro: " + e3.text());
        System.out.println("==========================================================");

        // Hot short comments
        Elements element1 = document.getElementsByClass("comment");
        for (Element el : element1) {
            String shortcontent = el.getElementsByClass("short").eq(0).text();
            System.out.println("----------------------");
            System.out.println("Hot Short Comments: " + shortcontent);
        }
        System.out.println("===========================================================");

        // Hot reviews
        Elements element2 = document.getElementsByClass("main review-item");
        for (Element el : element2) {
            String shortcontent = el.getElementsByClass("short-content").eq(0).text();
            System.out.println("----------------------");
            System.out.println("Hot Comments: " + shortcontent);
        }
    }
}
list: calling content and writing to a txt file:
package reallll.QYL01;

import java.io.IOException;
import java.io.PrintStream;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/*
 * @student:QYL
 */
public class list {

    public static void main(String[] args) throws IOException {
        list();
    }

    // Scrapes the first page of the Top 250 list and, for each film,
    // calls content.spider on its detail page.
    public static void list() throws IOException {
        String url = "https://movie.douban.com/top250?start=0";
        Document document = Jsoup.parse(new URL(url), 30000);
        Element elementById = document.getElementById("content");
        Elements elementsByClass = elementById.getElementsByClass("info");

        // Redirect standard output so everything printed below
        // (including content.spider's output) lands in the txt file.
        PrintStream ps = new PrintStream("D:\\spider.txt");
        System.setOut(ps);

        for (Element el : elementsByClass) {
            String name  = el.getElementsByClass("title").eq(0).text();
            String Murl  = el.getElementsByTag("a").eq(0).attr("href");
            String score = el.getElementsByClass("rating_num").eq(0).text();
            String quote = el.getElementsByClass("inq").eq(0).text();
            System.out.println("============================================================");
            System.out.println("Name: " + name);
            System.out.println("Movie-url: " + Murl);
            System.out.println("Score: " + score);
            System.out.println("Quote: " + quote);
            content.spider(Murl);
        }
    }
}
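The method above only fetches the first page, i.e. 25 films. To cover all 250, one could iterate the `start` query parameter in steps of 25 and run the same scraping logic per page. A minimal stdlib-only sketch of the URL generation (the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;

// Builds the ten Top 250 page URLs by stepping the `start`
// query parameter through 0, 25, ..., 225.
public class PageUrls {
    static final String BASE = "https://movie.douban.com/top250?start=";

    public static List<String> buildPageUrls() {
        List<String> urls = new ArrayList<>();
        for (int start = 0; start < 250; start += 25) {
            urls.add(BASE + start);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Each URL would be fed to the same list-scraping code in turn.
        for (String u : buildPageUrls()) {
            System.out.println(u);
        }
    }
}
```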
Run result:
The txt file:
6. Segment the text and write to Excel:
Add the jieba-analysis dependency package:
Read and segment:
package reallll.QYL01;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import com.huaban.analysis.jieba.JiebaSegmenter;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;

public class jieba {

    private static List<String> s;

    // Reads the local txt file into a single string.
    public static String txt2String(File file) {
        StringBuilder result = new StringBuilder();
        try {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line = null;
            while ((line = br.readLine()) != null) {
                result.append(System.lineSeparator()).append(line);
            }
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException, RowsExceededException, WriteException {
        File file = new File("D:\\spider.txt");
        JiebaSegmenter segmenter = new JiebaSegmenter();
        // sentenceProcess returns the token list for the whole text
        s = segmenter.sentenceProcess(txt2String(file));
        System.out.println(s);
Test result:
Write to Excel:
        // Write the tokens into a 50x50 grid with jxl
        try {
            File xlsFile = new File("D:\\spider.xls");
            xlsFile.createNewFile();
            WritableWorkbook workbook = Workbook.createWorkbook(xlsFile);
            WritableSheet sheet1 = workbook.createSheet("sheet1", 0);
            // Label's first argument is the column, the second the row
            for (int row = 0; row < 50; row++) {
                for (int col = 0; col < 50; col++) {
                    sheet1.addCell(new Label(col, row, s.get(col + 50 * row)));
                }
            }
            workbook.write();
            workbook.close();
            System.out.println("File written successfully!");
        } catch (Exception e) {
            System.out.println("File write failed");
        }
    }
}
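The nested jxl loop maps the flat token list onto a fixed 50x50 grid and relies on the catch block to stop when the list runs out of tokens. A bounds-checked sketch of the same flat-to-grid mapping (the class and method names are my own) avoids that silent failure by padding missing cells with empty strings:

```java
import java.util.List;

// Flat-to-grid mapping as used when writing tokens to the sheet:
// the cell in column c, row r receives token index c + cols * r.
public class TokenGrid {
    public static String[][] toGrid(List<String> tokens, int rows, int cols) {
        String[][] grid = new String[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int i = c + cols * r;  // row-major index into the token list
                grid[r][c] = i < tokens.size() ? tokens.get(i) : "";
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        // Three tokens into a 2x2 grid: the last cell stays empty.
        String[][] g = toGrid(List.of("a", "b", "c"), 2, 2);
        System.out.println(g[0][0] + g[0][1] + g[1][0] + g[1][1]);
    }
}
```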
Result:
The console shows the write succeeded.
The resulting Excel sheet:
Original post: https://blog.csdn.net/qq_43728087/article/details/107282391