Crawling 51job with Java, Saving to MySQL, and Analyzing the Results
This is my final project for a sophomore practical training course, and I figured I'd crawl job listings. I originally planned to use Python, but then decided to try Java instead; I had only been teaching myself Java for about a month and wanted to exercise my object-oriented thinking.
After looking around online: sites like Lagou generate their pages dynamically, while 51job serves static pages, which is much easier to handle, so I settled on crawling 51job.
Prerequisites:
Create a Maven project, which makes dependency management easier.
Use httpclient 3.1 and jsoup 1.8.3 as the packages for fetching pages and extracting information; these versions are the most widely used. (In the final code, Jsoup ends up handling the HTTP fetching itself.)
Use mysql-connector-java 8.0.13 to load the data into the database; it supports MySQL 8.0+.
For the analysis, tablesaw (optional; anything you know how to use is fine).
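For reference, the dependency section of the pom.xml would look roughly like this. This is a sketch: the first three versions come from the list above, while the tablesaw coordinates and version are my assumption, so check them against Maven Central:

<dependencies>
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.13</version>
    </dependency>
    <dependency>
        <groupId>tech.tablesaw</groupId>
        <artifactId>tablesaw-core</artifactId>
        <version>0.32.7</version>
    </dependency>
</dependencies>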
We'll use the search "big data + Shanghai" as the example; any URL of the same shape works:
https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
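A side note on that long %25e5... run: it is the keyword "大数据" URL-encoded twice (the % signs from the first pass become %25 in the second). A hedged sketch of building such a URL for an arbitrary keyword; the buildSearchUrl helper and the trimmed-down query string are my own illustration, not part of the project code:

import java.net.URLEncoder;

public class UrlDemo {
    // Build a 51job search URL for a keyword (city code 020000 = Shanghai, page 1).
    // The site expects the keyword segment to be URL-encoded twice.
    static String buildSearchUrl(String keyword) throws Exception {
        String once = URLEncoder.encode(keyword, "UTF-8");   // 大数据 -> %E5%A4%A7...
        String twice = URLEncoder.encode(once, "UTF-8");     // %E5... -> %25E5...
        return "https://search.51job.com/list/020000,000000,0000,00,9,99,"
                + twice + ",2,1.html?lang=c&postchannel=0000&workyear=99";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildSearchUrl("大数据"));
    }
}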
I sketched the overall design first and went through several revisions; this version has the clearest structure, with the JobBean list acting as the medium that every component passes data through.
First, crawl the pages and save the results locally.
The JobBean class:
public class JobBean {
    private String jobName;
    private String company;
    private String address;
    private String salary;
    private String date;
    private String jobUrl;

    public JobBean(String jobName, String company, String address,
                   String salary, String date, String jobUrl) {
        this.jobName = jobName;
        this.company = company;
        this.address = address;
        this.salary = salary;
        this.date = date;
        this.jobUrl = jobUrl;
    }

    @Override
    public String toString() {
        return "jobName=" + jobName + ", company=" + company + ", address=" + address
                + ", salary=" + salary + ", date=" + date + ", jobUrl=" + jobUrl;
    }

    public String getJobName() { return jobName; }
    public void setJobName(String jobName) { this.jobName = jobName; }
    public String getCompany() { return company; }
    public void setCompany(String company) { this.company = company; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
    public String getSalary() { return salary; }
    public void setSalary(String salary) { this.salary = salary; }
    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
    public String getJobUrl() { return jobUrl; }
    public void setJobUrl(String jobUrl) { this.jobUrl = jobUrl; }
}
Next, a utility class for saving and loading the list, so the container can be persisted at any stage:
import java.io.*;
import java.util.*;

/**
 * Implements:
 * 1. saving a JobBean list to a local file
 * 2. loading the local file back into a JobBean list (with filtering)
 * @author powerzzj
 */
public class JobBeanUtils {

    /** Save a single JobBean to the local file
     * @param job the JobBean to save
     */
    public static void saveJobBean(JobBean job) {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter("jobinfo.txt", true))) {
            String jobInfo = job.toString();
            bw.write(jobInfo);
            bw.newLine();
            bw.flush();
        } catch (Exception e) {
            System.out.println("Failed to save JobBean");
            e.printStackTrace();
        }
    }

    /** Save a whole JobBean list to the local file
     * @param jobBeanList the JobBean list
     */
    public static void saveJobBeanList(List<JobBean> jobBeanList) {
        System.out.println("Backing up the list to the local file");
        for (JobBean jobBean : jobBeanList) {
            saveJobBean(jobBean);
        }
        System.out.println("Backup finished, " + jobBeanList.size() + " records in total");
    }

    /** Load the local file back into a JobBean list (with filtering)
     * @return the JobBean list
     */
    public static List<JobBean> loadJobBeanList() {
        List<JobBean> jobBeanList = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("jobinfo.txt"))) {
            String str = null;
            while ((str = br.readLine()) != null) {
                // Filter: some company names contain "," which breaks the format, so skip those lines
                try {
                    String[] datas = str.split(",");
                    String jobName = datas[0].substring(8);
                    String company = datas[1].substring(9);
                    String address = datas[2].substring(9);
                    String salary = datas[3].substring(8);
                    String date = datas[4].substring(6);
                    String jobUrl = datas[5].substring(8);
                    // Filter: only build a JobBean if no field is empty, the salary is a range,
                    // and the URL starts with "http"
                    if (jobName.equals("") || company.equals("") || address.equals("")
                            || salary.equals("") || !(salary.contains("-"))
                            || date.equals("") || !(jobUrl.startsWith("http")))
                        continue;
                    JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobUrl);
                    // Add it to the list
                    jobBeanList.add(jobBean);
                } catch (Exception e) {
                    System.out.println("Load-time filter: skipping malformed line: " + str);
                    continue;
                }
            }
            System.out.println("Loading finished, " + jobBeanList.size() + " records read");
            return jobBeanList;
        } catch (Exception e) {
            System.out.println("Failed to load JobBeans");
            e.printStackTrace();
        }
        return jobBeanList;
    }
}
Now for the crawling itself, which is the key part.
Each listing sits in an element with class "el" and contains the information we need; note that the first ".el" element is the header row of the result list, which has to be dropped. Inside each listing, the t1 through t5 spans hold the job name, company, address, salary, and posting date, so we can pull them out in order.
Then look at the "next page" element. It sits under the ".bk" class, and there is a catch: there are two ".bk" elements, the first being "previous page" and only the second being "next page".
I crawled myself into an infinite loop before noticing that...
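For orientation, the relevant markup looks roughly like this. It is my reconstruction from the selectors used below, not copied from the site, so treat the details as approximate:

<div id="resultList">
    <div class="el title"> <!-- header row: dropped by elements.remove(0) --> </div>
    <div class="el">
        <p class="t1"><span><a title="job name" href="https://jobs.51job.com/...">...</a></span></p>
        <span class="t2"><a title="company">...</a></span>
        <span class="t3">上海-浦东新区</span>
        <span class="t4">1-1.5万/月</span>
        <span class="t5">06-20</span>
    </div>
    <!-- ... more .el rows ... -->
    <div class="bk">previous page</div>
    <div class="bk"><a href="...">next page</a></div>
</div>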
Finally, the Spider class bundles the per-page scraping and the next-page iteration together:
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls the job listing pages
 * @author powerzzj
 */
public class Spider {
    // Tracks which page we are on
    private static int pageCount = 1;
    private String strUrl;
    private String nextPageUrl;
    private Document document; // full DOM of the current page
    private List<JobBean> jobBeanList;

    public Spider(String strUrl) {
        this.strUrl = strUrl;
        nextPageUrl = strUrl; // initialize "next page" to the start URL so the loop can begin
        jobBeanList = new ArrayList<JobBean>();
    }

    /** Fetch the full DOM of a page
     * @param strUrl the page URL
     * @return the parsed Document
     */
    public Document getDom(String strUrl) {
        try {
            URL url = new URL(strUrl);
            // Parse with a 4-second timeout
            document = Jsoup.parse(url, 4000);
            return document;
        } catch (Exception e) {
            System.out.println("getDom failed");
            e.printStackTrace();
        }
        return null;
    }

    /** Scrape the current page, turn each listing into a JobBean, and add it to the list
     * @param document full DOM of the page
     */
    public void getPageInfo(Document document) {
        // CSS selector "#resultList .el" grabs all listing rows
        Elements elements = document.select("#resultList .el");
        // Drop the header row
        elements.remove(0);
        // Extract the fields
        for (Element element : elements) {
            Elements elementsSpan = element.select("span");
            String jobUrl = elementsSpan.select("a").attr("href");
            String jobName = elementsSpan.get(0).select("a").attr("title");
            String company = elementsSpan.get(1).select("a").attr("title");
            String address = elementsSpan.get(2).text();
            String salary = elementsSpan.get(3).text();
            String date = elementsSpan.get(4).text();
            // Build the JobBean
            JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobUrl);
            // Add it to the list
            jobBeanList.add(jobBean);
        }
    }

    /** Find the URL of the next page
     * @param document full DOM of the page
     * @return the next page's URL, if there is one
     */
    public String getNextPageUrl(Document document) {
        try {
            Elements elements = document.select(".bk");
            // The second ".bk" is the "next page" link
            Element element = elements.get(1);
            nextPageUrl = element.select("a").attr("href");
            if (nextPageUrl != null) {
                System.out.println("---------" + (pageCount++) + "--------");
                return nextPageUrl;
            }
        } catch (Exception e) {
            System.out.println("Failed to get the next page's URL");
            e.printStackTrace();
        }
        return null;
    }

    /** Run the crawl */
    public void spider() {
        // null guard added in case getNextPageUrl fails partway through
        while (nextPageUrl != null && !nextPageUrl.equals("")) {
            // Fetch the full DOM
            document = getDom(nextPageUrl);
            // Scrape the listings into the JobBean list
            getPageInfo(document);
            // Find the next page's URL
            nextPageUrl = getNextPageUrl(document);
        }
    }

    // Get the JobBean list
    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}
Now test the crawl-and-save flow:
import java.util.ArrayList;
import java.util.List;

public class Test1 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // "Big data" + Shanghai
        String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // Test the Spider and the save utilities
        Spider spider = new Spider(strUrl);
        spider.spider();
        // Get the crawled JobBean list
        jobBeanList = spider.getJobBeanList();
        // Use the utility class to save the list to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // Use the utility class to load it back in (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
    }
}
After running it, jobinfo.txt shows up locally.
Next, move the JobBean list into MySQL. My database is named 51job and the table is named jobinfo; every column is a string. Emmm, strings everywhere, good enough here.
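For reference, a minimal sketch of the table. The post doesn't show the original schema, so this is my reconstruction from the insert statement used below: six string columns in JobBean field order:

CREATE TABLE jobinfo (
    jobname VARCHAR(255),
    company VARCHAR(255),
    address VARCHAR(255),
    salary  VARCHAR(255),
    date    VARCHAR(255),
    joburl  VARCHAR(500)
);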
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectMysql {
    // Database connection settings
    private static final String DB_ADDRESS = "jdbc:mysql://localhost/51job?serverTimezone=UTC";
    private static final String USER_NAME = "root";
    private static final String PASSWORD = "woshishabi2813";
    private Connection conn;

    // Load the driver and connect to the database
    public ConnectMysql() {
        loadDriver();
        // Connect to the database
        try {
            conn = DriverManager.getConnection(DB_ADDRESS, USER_NAME, PASSWORD);
        } catch (SQLException e) {
            System.out.println("Database connection failed");
        }
    }

    // Load the driver
    private void loadDriver() {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            System.out.println("Driver loaded");
        } catch (Exception e) {
            System.out.println("Failed to load the driver");
        }
    }

    // Get the connection
    public Connection getConn() {
        return conn;
    }
}
Next comes the utility class for the database operations.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class DBUtils {

    /** Insert a JobBean list into the database (rows that fail get skipped)
     * @param conn the database connection
     * @param jobBeanList the JobBean list
     */
    public static void insert(Connection conn, List<JobBean> jobBeanList) {
        System.out.println("Inserting data");
        PreparedStatement ps;
        for (JobBean j : jobBeanList) {
            // Build the statement
            String command = String.format("insert into jobinfo values('%s','%s','%s','%s','%s','%s')",
                    j.getJobName(), j.getCompany(), j.getAddress(),
                    j.getSalary(), j.getDate(), j.getJobUrl());
            try {
                ps = conn.prepareStatement(command);
                ps.executeUpdate();
            } catch (Exception e) {
                System.out.println("Skipping row that failed to insert: " + j.getJobName());
            }
        }
        System.out.println("Insert finished");
    }

    /** Read the whole table back out as a JobBean list
     * @param conn the database connection
     * @return the JobBean list
     */
    public static List<JobBean> select(Connection conn) {
        PreparedStatement ps;
        ResultSet rs;
        List<JobBean> jobBeanList = new ArrayList<JobBean>();
        String command = "select * from jobinfo";
        try {
            ps = conn.prepareStatement(command);
            rs = ps.executeQuery();
            while (rs.next()) {
                JobBean jobBean = new JobBean(rs.getString(1), rs.getString(2), rs.getString(3),
                        rs.getString(4), rs.getString(5), rs.getString(6));
                jobBeanList.add(jobBean);
            }
            return jobBeanList;
        } catch (Exception e) {
            System.out.println("Database query failed");
        }
        return null;
    }
}
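One caveat: building the SQL with String.format breaks whenever a field contains a single quote (that is largely what the catch block ends up swallowing), and it is open to SQL injection. A safer sketch using PreparedStatement parameters; insertSafely is my own naming for a method you could drop into DBUtils:

    /** A hedged alternative: parameterized insert instead of string formatting */
    public static void insertSafely(Connection conn, List<JobBean> jobBeanList) {
        String command = "insert into jobinfo values(?,?,?,?,?,?)";
        try (PreparedStatement ps = conn.prepareStatement(command)) {
            for (JobBean j : jobBeanList) {
                ps.setString(1, j.getJobName());
                ps.setString(2, j.getCompany());
                ps.setString(3, j.getAddress());
                ps.setString(4, j.getSalary());
                ps.setString(5, j.getDate());
                ps.setString(6, j.getJobUrl());
                ps.executeUpdate(); // quotes in the data can no longer break the statement
            }
        } catch (Exception e) {
            System.out.println("Insert failed");
            e.printStackTrace();
        }
    }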
A quick test:
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test2 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        jobBeanList = JobBeanUtils.loadJobBeanList();
        // Database test
        ConnectMysql cm = new ConnectMysql();
        Connection conn = cm.getConn();
        // Insert test
        DBUtils.insert(conn, jobBeanList);
        // Select test
        jobBeanList = DBUtils.select(conn);
        for (JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
As the output above shows, even though the search was "big data + Shanghai", unrelated listings such as operations engineer still show up; those get filtered out later. For now, everything goes into the database.
Now a full end-to-end test: delete jobinfo.txt and recreate the database.
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class TestMain {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // "Big data" + Shanghai
        String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // // Java + Shanghai
        // String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";

        // Full pipeline test
        // The crawler
        Spider jobSpider = new Spider(strUrl);
        jobSpider.spider();
        // The crawled JobBean list
        jobBeanList = jobSpider.getJobBeanList();
        // Use the utility class to save the list to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
        // Use the utility class to load it back in (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
        // Connect to the database and get the connection
        ConnectMysql cm = new ConnectMysql();
        Connection conn = cm.getConn();
        // Store the JobBean list in the database
        DBUtils.insert(conn, jobBeanList);

        // // Query the database back into a JobBean list
        // jobBeanList = DBUtils.select(conn);
        //
        // for (JobBean j : jobBeanList) {
        //     System.out.println(j);
        // }
    }
}
Each of these components can be used on its own; you don't have to run the whole pipeline top to bottom like this.
Next, read the data back from the database, apply some simple filtering, and analyze it.
First, the mind map (shown as an image in the original post).
Step one is filtering by keyword and by date:
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class BaseFilter {
    private List<JobBean> jobBeanList;
    // You can't remove from a list inside a for-each loop (it throws
    // ConcurrentModificationException), so collect the JobBeans to delete
    // in a separate list and call removeAll afterwards
    private List<JobBean> removeList;

    public BaseFilter(List<JobBean> jobBeanList) {
        removeList = new ArrayList<JobBean>();
        // Keep a reference to the very list the caller passed in,
        // so calling getJobBeanList() afterwards is optional
        this.jobBeanList = jobBeanList;
        printNum();
    }

    // Print how many records are currently in the list
    public void printNum() {
        System.out.println("Currently " + jobBeanList.size() + " records");
    }

    /** Filter by job name
     * @param containJobName keep only jobs whose name contains this keyword
     */
    public void filterJobName(String containJobName) {
        for (JobBean j : jobBeanList) {
            if (!j.getJobName().contains(containJobName)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    /** Filter by date: keep only jobs posted today */
    public void filterDate() {
        Calendar now = Calendar.getInstance();
        int nowMonth = now.get(Calendar.MONTH) + 1;
        int nowDay = now.get(Calendar.DATE);
        for (JobBean j : jobBeanList) {
            String[] date = j.getDate().split("-");
            int jobMonth = Integer.valueOf(date[0]);
            int jobDay = Integer.valueOf(date[1]);
            if (!(jobMonth == nowMonth && jobDay == nowDay)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }

    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}
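On Java 8+, the separate removeList can be replaced with removeIf, which handles safe removal during traversal for you. A sketch of what filterJobName would shrink to (my illustration, not the project code):

    public void filterJobName(String containJobName) {
        // removeIf deletes safely while traversing; no helper list needed
        jobBeanList.removeIf(j -> !j.getJobName().contains(containJobName));
        printNum();
    }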
Let's test the filtering:
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test3 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        // Load the JobBean list from the database
        ConnectMysql cm = new ConnectMysql();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        BaseFilter bf = new BaseFilter(jobBeanList);
        // Filter by date
        bf.filterDate();
        // Filter by keywords
        bf.filterJobName("数据");
        bf.filterJobName("分析");
        for (JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}
Up to this point the functionality is generic; the analysis from here on depends on the job category and on what you want to know, though the overall approach stays much the same.
What I analyze here is the "big data + Shanghai" data. To keep the sample a bit larger, any job name containing "数据" counts, which leaves 247 records.
I used the tablesaw package for this. I saw someone recommend it, but almost none of the problems I hit along the way could be found on Baidu; there's only the official documentation, which I read over and over. On top of that it can't draw charts on its own, it needs extra dependencies for that, so I settled for plain tables... I gave up on visualization (why didn't I just use Python...).
There isn't much object-oriented design left to write for the analysis; it's essentially one long main method. See the official docs for the specifics, and just skim the results here.
Salaries are normalized to 万/月 (10k RMB per month), taking the lower bound of each advertised range.
import static tech.tablesaw.aggregate.AggregateFunctions.*;
import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;
import tech.tablesaw.api.*;

public class Analyze {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        ConnectMysql cm = new ConnectMysql();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        BaseFilter bf = new BaseFilter(jobBeanList);
        bf.filterDate();
        bf.filterJobName("数据");
        int nums = jobBeanList.size();

        // Analysis: build parallel arrays for the table columns
        String[] jobNames = new String[nums];
        String[] companys = new String[nums];
        String[] addresss = new String[nums];
        double[] salarys = new double[nums];
        String[] jobUrls = new String[nums];
        for (int i = 0; i < nums; i++) {
            JobBean j = jobBeanList.get(i);
            String jobName = j.getJobName();
            String company = j.getCompany();
            // Extract the district name from the address
            String address;
            if (j.getAddress().contains("-")) {
                address = j.getAddress().split("-")[1];
            } else {
                address = j.getAddress();
            }
            // Normalize the salary unit (lower bound of the range, in 万/月)
            String sSalary = j.getSalary();
            double dSalary;
            if (sSalary.contains("万/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]);
            } else if (sSalary.contains("千/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 10;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            } else if (sSalary.contains("万/年")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 12;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            } else {
                dSalary = 0;
                System.out.println("Salary conversion failed");
                continue;
            }
            String jobUrl = j.getJobUrl();
            jobNames[i] = jobName;
            companys[i] = company;
            addresss[i] = address;
            salarys[i] = dSalary;
            jobUrls[i] = jobUrl;
        }

        Table jobInfo = Table.create("job info")
                .addColumns(
                        StringColumn.create("jobName", jobNames),
                        StringColumn.create("company", companys),
                        StringColumn.create("address", addresss),
                        DoubleColumn.create("salary", salarys),
                        StringColumn.create("jobUrl", jobUrls));

        // System.out.println("All of Shanghai");
        // System.out.println(salaryInfo(jobInfo));

        List<Table> addressJobInfo = new ArrayList<>();
        // Split by district
        Table shanghaiJobInfo = chooseByAddress(jobInfo, "上海");
        Table jinganJobInfo = chooseByAddress(jobInfo, "静安区");
        Table pudongJobInfo = chooseByAddress(jobInfo, "浦东新区");
        Table changningJobInfo = chooseByAddress(jobInfo, "长宁区");
        Table minhangJobInfo = chooseByAddress(jobInfo, "闵行区");
        Table xuhuiJobInfo = chooseByAddress(jobInfo, "徐汇区");
        // Too few listings in these districts
        // Table songjiangJobInfo = chooseByAddress(jobInfo, "松江区");
        // Table yangpuJobInfo = chooseByAddress(jobInfo, "杨浦区");
        // Table hongkouJobInfo = chooseByAddress(jobInfo, "虹口区");
        // Table otherInfo = chooseByAddress(jobInfo, "异地招聘");
        // Table putuoJobInfo = chooseByAddress(jobInfo, "普陀区");

        addressJobInfo.add(jobInfo); // all Shanghai listings
        addressJobInfo.add(shanghaiJobInfo);
        addressJobInfo.add(jinganJobInfo);
        addressJobInfo.add(pudongJobInfo);
        addressJobInfo.add(changningJobInfo);
        addressJobInfo.add(minhangJobInfo);
        addressJobInfo.add(xuhuiJobInfo);
        // addressJobInfo.add(songjiangJobInfo);
        // addressJobInfo.add(yangpuJobInfo);
        // addressJobInfo.add(hongkouJobInfo);
        // addressJobInfo.add(putuoJobInfo);
        // addressJobInfo.add(otherInfo);

        for (Table t : addressJobInfo) {
            System.out.println(salaryInfo(t));
        }
        for (Table t : addressJobInfo) {
            System.out.println(sortBySalary(t).first(10));
        }
    }

    // Salary mean, standard deviation, median, max, min
    public static Table salaryInfo(Table t) {
        return t.summarize("salary", mean, stdDev, median, max, min).apply();
    }

    // Sort by salary, descending
    public static Table sortBySalary(Table t) {
        return t.sortDescendingOn("salary");
    }

    // Select the listings of one district
    public static Table chooseByAddress(Table t, String address) {
        Table t2 = Table.create(address)
                .addColumns(
                        StringColumn.create("jobName"),
                        StringColumn.create("company"),
                        StringColumn.create("address"),
                        DoubleColumn.create("salary"),
                        StringColumn.create("jobUrl"));
        for (Row r : t) {
            if (r.getString(2).equals(address)) {
                t2.addRow(r);
            }
        }
        return t2;
    }
}
The first half of the output is the salary summary per district.
The second half is each district's top 10 listings by salary. As you can see, tablesaw's plain-text tables are about as ugly as tables get...
The jobUrl values can be opened directly in a browser.
Let's test with a different URL.
Say I want a Java development job.
Replace strUrl in TestMain with the "java + Shanghai" URL:
https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Delete jobinfo.txt and recreate the database.
Run it: the crawl covered 270-odd pages, landing first in the local jobinfo.txt and then in the database.
Then, in Analyze, change bf.filterJobName("数据") to "java", add another filterJobName("开发"), and run it.
All the information comes out. As for the analysis, you can read a few observations straight off the tables...
The extension I have in mind is to follow each jobUrl and aggregate the stated job requirements. I haven't done that yet; if I'm still interested over the summer I'll probably give it a go.
For now, this is where I stop and hand in the assignment...
Finally, the source code: https://pan.baidu.com/s/1xwtblctxerzqueimrfuliw
Extraction code: 2fea