
Crawling 51job with Java, Saving to MySQL, and Analyzing the Results

程序员文章站 2022-04-08 23:36:52

This was my final project for a sophomore practical training course. I decided to crawl job listings. I originally planned to use Python, but figured I would try Java instead, since I had only been teaching myself Java for about a month and wanted to practice object-oriented design.

After looking around online, sites like Lagou generate their pages dynamically, while 51job serves static pages, which is much easier to scrape, so I settled on 51job.

Prerequisites:

Create a Maven project for easy dependency management.

HttpClient 3.1 and jsoup 1.8.3 for fetching pages and extracting information (both widely used versions).

mysql-connector-java 8.0.13 to write the data into the database (supports MySQL 8.0+).

Tablesaw for the analysis (optional; use whatever you know).
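The corresponding pom.xml dependencies would look roughly like this. The coordinates are the standard Maven Central ones; the Tablesaw version is my guess, so double-check it:

```xml
<dependencies>
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.13</version>
    </dependency>
    <!-- version is a guess; check Maven Central for a current release -->
    <dependency>
        <groupId>tech.tablesaw</groupId>
        <artifactId>tablesaw-core</artifactId>
        <version>0.38.1</version>
    </dependency>
</dependencies>
```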

 

Taking "大数据 + Shanghai" as the example; any URL of this form will work:

https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99°reefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
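The long `%25e5%25a4%25a7...` segment is just the keyword 大数据 URL-encoded twice (the `%` of the first pass becomes `%25` in the second). A minimal sketch of building such a search URL; the path format is my reading of the URL above, not a documented API, and the query-string parameters are dropped for brevity (hex case may differ from the example):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
    // Build a 51job search URL: 020000 is the Shanghai area code,
    // the keyword is URL-encoded twice, and the last number is the page.
    static String build(String keyword, int page) {
        String once = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        String twice = URLEncoder.encode(once, StandardCharsets.UTF_8);
        return "https://search.51job.com/list/020000,000000,0000,00,9,99,"
                + twice + ",2," + page + ".html";
    }

    public static void main(String[] args) {
        System.out.println(build("大数据", 1));
    }
}
```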

 

I sketched the features first and revised the design several times; this final structure keeps the flow clearest, with the JobBean list acting as the medium between all of the components.

(Figure: overall design, with the JobBean list as the hub)

 

First, fetch the pages and save the results locally.

Create the JobBean class:

public class JobBean {
    private String jobName;
    private String company;
    private String address;
    private String salary;
    private String date;
    private String jobUrl;
    
    public JobBean(String jobName, String company, String address, String salary, String date, String jobUrl) {
        this.jobName = jobName;
        this.company = company;
        this.address = address;
        this.salary = salary;
        this.date = date;
        this.jobUrl = jobUrl;
    }
    
    @Override
    public String toString() {
        return "jobName=" + jobName + ", company=" + company + ", address=" + address + ", salary=" + salary
                + ", date=" + date + ", jobUrl=" + jobUrl;
    }

    public String getJobName() {
        return jobName;
    }
    public void setJobName(String jobName) {
        this.jobName = jobName;
    }
    public String getCompany() {
        return company;
    }
    public void setCompany(String company) {
        this.company = company;
    }
    public String getAddress() {
        return address;
    }
    public void setAddress(String address) {
        this.address = address;
    }
    public String getSalary() {
        return salary;
    }
    public void setSalary(String salary) {
        this.salary = salary;
    }
    public String getDate() {
        return date;
    }
    public void setDate(String date) {
        this.date = date;
    }
    public String getJobUrl() {
        return jobUrl;
    }
    public void setJobUrl(String jobUrl) {
        this.jobUrl = jobUrl;
    }
}

Next, a utility class for saving and loading the list, so the data can be persisted at any stage:

import java.io.*;
import java.util.*;

/**Implements
 * 1. saving a JobBean list to a local file
 * 2. reading the local file back into a JobBean list (with filtering)
 * @author powerzzj
 *
 */
public class JobBeanUtils {
    
    /**Save a single JobBean to the local file
     * @param job
     */
    public static void saveJobBean(JobBean job) {
        try(BufferedWriter bw =
                new BufferedWriter(
                        new FileWriter("jobinfo.txt", true))){
            String jobInfo = job.toString();
            bw.write(jobInfo);
            bw.newLine();
            bw.flush();
        }catch(Exception e) {
            System.out.println("Failed to save JobBean");
            e.printStackTrace();
        }
    }
    
    /**Save the whole JobBean list to the local file
     * @param jobBeanList the JobBean list
     */
    public static void saveJobBeanList(List<JobBean> jobBeanList) {
        System.out.println("Backing up the list to a local file");
        for(JobBean jobBean : jobBeanList) {
            saveJobBean(jobBean);
        }
        System.out.println("Backup finished, " + jobBeanList.size() + " records in total");
    }
    
    /**Read the local file back into a JobBean list (with filtering)
     * @return the JobBean list
     */
    public static List<JobBean> loadJobBeanList(){
        List<JobBean> jobBeanList = new ArrayList<>();
        try(BufferedReader br = 
                new BufferedReader(
                        new FileReader("jobinfo.txt"))){
            String str = null;
            while((str = br.readLine()) != null) {
                //some company names contain "," and break the format; skip those lines
                try {
                    String[] datas = str.split(","); 
                    String jobName = datas[0].substring(8);
                    String company = datas[1].substring(9);
                    String address = datas[2].substring(9);
                    String salary = datas[3].substring(8);
                    String date = datas[4].substring(6);
                    String jobUrl = datas[5].substring(8);
                    //only build a JobBean when no field is empty, the salary is a range, and the URL starts with http
                    if (jobName.equals("") || company.equals("") || address.equals("") || salary.equals("")
                            || !(salary.contains("-")) || date.equals("") || !(jobUrl.startsWith("http")))
                        continue;
                    JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobUrl);
                    //add to the list
                    jobBeanList.add(jobBean);
                }catch(Exception e) {
                    System.out.println("Skipping malformed line: " + str);
                    continue;
                }
            }
            System.out.println("Done reading, " + jobBeanList.size() + " records in total");
            return jobBeanList;
        }catch(Exception e) {
            System.out.println("Failed to load JobBeans");
            e.printStackTrace();
        }
        return jobBeanList;
    }
}

Now the crawling itself.

(Screenshot: 51job search results page, one .el element per listing)

Each .el element contains the fields we need; note that the first .el is the header row and has to be removed. Inside each listing, the t1 through t5 spans hold the values, so they can be taken out in order.

Then look at the "next page" element: it sits under a .bk element, and there are two of them; the first .bk is "previous page" and only the second one is "next page". I got stuck in an infinite crawl loop before noticing this.

(Screenshot: the two .bk pager elements)

Finally, the Spider class puts the page scraping and the next-page iteration together:

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**Crawls the listing pages
 * @author powerzzj
 *
 */
public class Spider {
    //which page we are on
    private static int pageCount = 1;
    
    private String strUrl;
    private String nextPageUrl;
    private Document document;//full DOM of the current page
    private List<JobBean> jobBeanList;
    
    public Spider(String strUrl) {
        this.strUrl = strUrl;
        nextPageUrl = strUrl;//start the iteration from the current page
        jobBeanList = new ArrayList<JobBean>();
    }
    
    /**Fetch and parse a page
     * @param strUrl the page URL
     * @return the parsed Document
     */
    public Document getDom(String strUrl) {
        try {
            URL url = new URL(strUrl);
            //parse with a 4-second timeout
            document = Jsoup.parse(url, 4000);
            return document;
        }catch(Exception e) {
            System.out.println("getDom failed");
            e.printStackTrace();
        }
        return null;
    }
    

    /**Extract the listings on the current page into JobBeans and add them to the list
     * @param document the parsed page
     */
    public void getPageInfo(Document document) {
        //CSS selector: the .el elements under #resultList
        Elements elements = document.select("#resultList .el");
        //drop the header row
        elements.remove(0);
        //extract the fields
        for(Element element : elements) {
            Elements elementsSpan = element.select("span");
            String jobUrl = elementsSpan.select("a").attr("href");
            String jobName = elementsSpan.get(0).select("a").attr("title");
            String company = elementsSpan.get(1).select("a").attr("title");
            String address = elementsSpan.get(2).text();
            String salary = elementsSpan.get(3).text();
            String date = elementsSpan.get(4).text();
            //build the JobBean
            JobBean jobBean = new JobBean(jobName, company, address, salary, date, jobUrl);
            //add to the list
            jobBeanList.add(jobBean);
        }
    }
    
    /**Find the URL of the next page
     * @param document the parsed page
     * @return the next page's URL, or "" when there is none
     */
    public String getNextPageUrl(Document document) {
        try {
            Elements elements = document.select(".bk");
            //the second .bk is "next page"
            Element element = elements.get(1);
            nextPageUrl = element.select("a").attr("href");
            //attr() returns "" when the attribute is missing, so test for that
            if(!nextPageUrl.equals("")) {
                System.out.println("---------" + (pageCount++) + "--------");
                return nextPageUrl;
            }
        }catch(Exception e) {
            System.out.println("Failed to get the next page's URL");
            e.printStackTrace();
        }
        return "";
    }
    
    
    /**Start crawling
     * 
     */
    public void spider() {
        while(!nextPageUrl.equals("")) {
            //fetch and parse the page
            document = getDom(nextPageUrl);
            //add its listings to the list
            getPageInfo(document);
            //look for the next page's URL
            nextPageUrl = getNextPageUrl(document);
        }
    }
    
    //get the JobBean list
    public List<JobBean> getJobBeanList() {
        return jobBeanList;
    }
}

Then test the crawling and saving:

import java.util.ArrayList;
import java.util.List;

public class Test1 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        //大数据 + Shanghai
        String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";        

        //test the Spider and local saving
        Spider spider = new Spider(strUrl);
        spider.spider();
        //get the crawled JobBean list
        jobBeanList = spider.getJobBeanList();
        
        //save the list to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
    
        //read it back from the local file (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
        
    }
}

After running, jobinfo.txt shows up locally:

(Screenshot: jobinfo.txt contents)

Next, load the JobBean list into MySQL. My database is named 51job and the table is jobinfo; every column is a string (good enough here).

(Screenshot: the jobinfo table schema)
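In SQL the schema amounts to something like this; the column widths are my guess, and only the column order matters to the code:

```sql
CREATE DATABASE IF NOT EXISTS `51job` DEFAULT CHARACTER SET utf8mb4;
USE `51job`;

CREATE TABLE jobinfo (
    jobname VARCHAR(100),
    company VARCHAR(100),
    address VARCHAR(50),
    salary  VARCHAR(50),
    date    VARCHAR(20),
    joburl  VARCHAR(255)
);
```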

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectMySQL {
    //database settings
    private static final String DB_ADDRESS = "jdbc:mysql://localhost/51job?serverTimezone=UTC";
    private static final String USER_NAME = "root";
    private static final String PASSWORD = "woshishabi2813";
    
    private Connection conn;
    
    //load the driver and connect
    public ConnectMySQL() {
        loadDriver();
        //connect to the database
        try {
            conn = DriverManager.getConnection(DB_ADDRESS, USER_NAME, PASSWORD);
        } catch (SQLException e) {
            System.out.println("Failed to connect to the database");
        }
    }
    
    //load the driver
    private void loadDriver() {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            System.out.println("Driver loaded");
        } catch (Exception e) {
            System.out.println("Failed to load the driver");
        }
    }
    
    //get the connection
    public Connection getConn() {
        return conn;
    }
}

Next comes the utility class for the database operations.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;


public class DBUtils {
    
    /**Insert a JobBean list into the database (skipping bad rows)
     * @param conn the database connection
     * @param jobBeanList the JobBean list
     */
    public static void insert(Connection conn, List<JobBean> jobBeanList) {
        System.out.println("Inserting data");
        //use placeholders so quotes or commas in a field cannot break the statement
        String command = "insert into jobinfo values(?,?,?,?,?,?)";
        for(JobBean j : jobBeanList) {
            try {
                PreparedStatement ps = conn.prepareStatement(command);
                ps.setString(1, j.getJobName());
                ps.setString(2, j.getCompany());
                ps.setString(3, j.getAddress());
                ps.setString(4, j.getSalary());
                ps.setString(5, j.getDate());
                ps.setString(6, j.getJobUrl());
                ps.executeUpdate();
            } catch (Exception e) {
                System.out.println("Skipping bad record: " + j.getJobName());
            }
        }
        System.out.println("Insert finished");

    }
    
    /**Read all rows back into a JobBean list
     * @param conn the database connection
     * @return the JobBean list
     */
    public static List<JobBean> select(Connection conn){
        List<JobBean> jobBeanList = new ArrayList<JobBean>();

        String command = "select * from jobinfo";
        try {
            PreparedStatement ps = conn.prepareStatement(command);
            ResultSet rs = ps.executeQuery();
            while(rs.next()) {
                JobBean jobBean = new JobBean(rs.getString(1), 
                            rs.getString(2), 
                            rs.getString(3), 
                            rs.getString(4),
                            rs.getString(5),
                            rs.getString(6));

                jobBeanList.add(jobBean);
            }
        } catch (Exception e) {
            System.out.println("Database query failed");
        }
        return jobBeanList;
    }
}

 

Test it:

import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

public class Test2 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        jobBeanList = JobBeanUtils.loadJobBeanList();

        //database test
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        
        //insert test
        DBUtils.insert(conn, jobBeanList);
        //select test
        jobBeanList = DBUtils.select(conn);
        for(JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}

 

(Screenshots: console output of the select test and the populated jobinfo table)

As the screenshots show, even though the search was "大数据 + Shanghai", unrelated listings such as operations engineers still appear; those get filtered out later. For now everything goes into the database.

Now an end-to-end test of all the features: delete jobinfo.txt and recreate the database.

import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;


public class TestMain {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        //大数据 + Shanghai
        String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,%25e5%25a4%25a7%25e6%2595%25b0%25e6%258d%25ae,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";        
//        //java + Shanghai
//        String strUrl = "https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        
        //test of every feature
        //the crawler
        Spider jobSpider = new Spider(strUrl);
        jobSpider.spider();
        //the crawled JobBean list
        jobBeanList = jobSpider.getJobBeanList();
        
        //save the list to the local file
        JobBeanUtils.saveJobBeanList(jobBeanList);
    
        //read it back from the local file (with filtering)
        jobBeanList = JobBeanUtils.loadJobBeanList();
    
        //connect to the database and get the connection
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        
        //insert the list into the database
        DBUtils.insert(conn, jobBeanList);
        
//        //query the database back into a JobBean list
//        jobBeanList = DBUtils.select(conn);
//        
//        for(JobBean j : jobBeanList) {
//            System.out.println(j);
//        }
    }
}

Each of these pieces works on its own; you do not have to run them in this exact sequence.

Next: read the data back from the database, apply some simple filtering, and analyze it.

The mind map first:

(Figure: analysis mind map)

First, filtering by keyword and by date.

 

import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class BaseFilter {
    private List<JobBean> jobBeanList;
    //removing inside a for-each is not allowed (ConcurrentModificationException),
    //so collect what should go into a second list and removeAll afterwards
    private List<JobBean> removeList;
    
    public BaseFilter(List<JobBean> jobBeanList) {
        //keep a reference to the caller's list, so getJobBeanList is optional
        this.jobBeanList = jobBeanList;
        removeList = new ArrayList<JobBean>();
        printNum();
    }
    
    //print how many JobBeans are left
    public void printNum() {
        System.out.println("Currently " + jobBeanList.size() + " records");
    }
    

    /**Keep only jobs whose title contains the keyword
     * @param containJobName the keyword to keep
     */
    public void filterJobName(String containJobName) {
        for(JobBean j : jobBeanList) {
            if(!j.getJobName().contains(containJobName)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }
    
    /**Keep only jobs posted today
     * @param
     */
    public void filterDate() {
        Calendar now = Calendar.getInstance();
        int nowMonth = now.get(Calendar.MONTH) + 1;
        int nowDay = now.get(Calendar.DATE);
        
        for(JobBean j : jobBeanList) {
            String[] date = j.getDate().split("-");
            int jobMonth = Integer.valueOf(date[0]);
            int jobDay = Integer.valueOf(date[1]);
            if(!(jobMonth == nowMonth && jobDay == nowDay)) {
                removeList.add(j);
            }
        }
        jobBeanList.removeAll(removeList);
        removeList.clear();
        printNum();
    }
    
    public List<JobBean> getJobBeanList(){
        return jobBeanList;
    }
    
}
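The collect-then-removeAll pattern in BaseFilter is there because removing from the list inside a for-each throws ConcurrentModificationException. A minimal stdlib-only illustration (the class and sample data are made up for the demo):

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveDemo {
    // Keep only the entries containing the keyword, using the same
    // collect-then-removeAll pattern as BaseFilter. Calling
    // jobs.remove(j) inside the for-each would throw
    // ConcurrentModificationException.
    static List<String> keepContaining(List<String> jobs, String keyword) {
        List<String> removeList = new ArrayList<>();
        for (String j : jobs) {
            if (!j.contains(keyword)) {
                removeList.add(j);
            }
        }
        jobs.removeAll(removeList);
        return jobs;
    }

    public static void main(String[] args) {
        List<String> jobs = new ArrayList<>(List.of("大数据开发", "运维工程师", "数据分析"));
        System.out.println(keepContaining(jobs, "数据"));  // [大数据开发, 数据分析]
    }
}
```

Since Java 8 the one-liner `jobs.removeIf(j -> !j.contains("数据"))` does the same thing in place.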

Test the filtering:

import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;


public class Test3 {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();
        //read the JobBean list from the database
        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        
        BaseFilter bf = new BaseFilter(jobBeanList);
        //filter by date
        bf.filterDate();
        //filter by keywords
        bf.filterJobName("数据");
        bf.filterJobName("分析");
        
        for(JobBean j : jobBeanList) {
            System.out.println(j);
        }
    }
}

(Screenshot: the filtered results)

Up to here the pipeline is generic; the analysis that follows depends on the job and the question you're asking, though the shape is much the same. Here I analyze "大数据 + Shanghai"; to keep the sample larger, any title containing "数据" passes, which leaves 247 listings.

I used the Tablesaw library because I saw it recommended. When I hit problems, searching turned up almost nothing; only the official docs, which I read over and over. It also can't plot on its own and needs extra dependencies for that, so I settled for plain tables and gave up on visualization (why didn't I just use Python...).

There is little object-oriented structure left to write for the analysis, so it is basically one long main. See the official docs for usage details; treat the output below as a quick look at the results.

Salaries are normalized to 万/月 (10k CNY per month).
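The normalization rule can be sketched in isolation. This is a standalone helper I wrote for illustration, not one of the article's classes; like the analysis code, it takes the lower bound of the salary range:

```java
public class SalaryNormalizer {
    // Convert the lower bound of a 51job salary string to 万/月 (10k CNY/month).
    // Returns 0 for formats the analysis code does not handle either.
    static double toWanPerMonth(String salary) {
        double low;
        try {
            low = Double.parseDouble(salary.split("-")[0]);
        } catch (NumberFormatException e) {
            return 0;  // e.g. "面议" (negotiable), no numeric range
        }
        if (salary.contains("万/月")) {
            return low;                                      // already 万/月
        } else if (salary.contains("千/月")) {
            return Math.round(low / 10 * 100) / 100.0;       // 千 -> 万
        } else if (salary.contains("万/年")) {
            return Math.round(low / 12 * 100) / 100.0;       // per year -> per month
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(toWanPerMonth("1.5-2万/月"));   // 1.5
        System.out.println(toWanPerMonth("8-10千/月"));    // 0.8
        System.out.println(toWanPerMonth("15-20万/年"));   // 1.25
    }
}
```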

import static tech.tablesaw.aggregate.AggregateFunctions.*;

import java.sql.Connection;
import java.util.ArrayList;
import java.util.List;

import tech.tablesaw.api.*;

public class Analyze {
    public static void main(String[] args) {
        List<JobBean> jobBeanList = new ArrayList<>();

        ConnectMySQL cm = new ConnectMySQL();
        Connection conn = cm.getConn();
        jobBeanList = DBUtils.select(conn);
        
        BaseFilter bf = new BaseFilter(jobBeanList);
        bf.filterDate();
        bf.filterJobName("数据");
        int nums = jobBeanList.size();
        
        //analysis: build the column arrays
        String[] jobNames = new String[nums];
        String[] companys = new String[nums];
        String[] addresss = new String[nums];
        double[] salarys = new double[nums];
        String[] jobUrls = new String[nums];
        for(int i = 0; i < nums; i++) {
            JobBean j = jobBeanList.get(i);
            String jobName = j.getJobName();
            String company = j.getCompany();
            //extract the district from the address
            String address;
            if(j.getAddress().contains("-")) {
                address = j.getAddress().split("-")[1];
            }else{
                address = j.getAddress();
            }
            
            //normalize the salary unit
            String sSalary = j.getSalary();
            double dSalary;
            if(sSalary.contains("万/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]);
            }else if(sSalary.contains("千/月")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 10;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            }else if(sSalary.contains("万/年")) {
                dSalary = Double.valueOf(sSalary.split("-")[0]) / 12;
                dSalary = (double) Math.round(dSalary * 100) / 100;
            }else {
                dSalary = 0;
                System.out.println("Failed to convert the salary");
                continue;
            }
            String jobUrl = j.getJobUrl();
            
            jobNames[i] = jobName;
            companys[i] = company;
            addresss[i] = address;
            salarys[i] = dSalary;
            jobUrls[i] = jobUrl;
        }
        
        Table jobInfo = Table.create("job info")
                .addColumns(
                    StringColumn.create("jobName", jobNames),
                    StringColumn.create("company", companys),
                    StringColumn.create("address", addresss),
                    DoubleColumn.create("salary", salarys),
                    StringColumn.create("jobUrl", jobUrls)
                        );
        
//        System.out.println("All of Shanghai");
//        System.out.println(salaryInfo(jobInfo));
        
        
        List<Table> addressJobInfo = new ArrayList<>();
        //split by district
        Table shanghaiJobInfo = chooseByAddress(jobInfo, "上海");
        Table jinganJobInfo = chooseByAddress(jobInfo, "静安区");
        Table pudongJobInfo = chooseByAddress(jobInfo, "浦东新区");
        Table changningJobInfo = chooseByAddress(jobInfo, "长宁区");
        Table minhangJobInfo = chooseByAddress(jobInfo, "闵行区");
        Table xuhuiJobInfo = chooseByAddress(jobInfo, "徐汇区");
        //too few listings
//        Table songjiangJobInfo = chooseByAddress(jobInfo, "松江区");
//        Table yangpuJobInfo = chooseByAddress(jobInfo, "杨浦区");
//        Table hongkouJobInfo = chooseByAddress(jobInfo, "虹口区");
//        Table otherInfo = chooseByAddress(jobInfo, "异地招聘");
//        Table putuoJobInfo = chooseByAddress(jobInfo, "普陀区");
        
        addressJobInfo.add(jobInfo);
        //district tables
        addressJobInfo.add(shanghaiJobInfo);
        addressJobInfo.add(jinganJobInfo);
        addressJobInfo.add(pudongJobInfo);
        addressJobInfo.add(changningJobInfo);
        addressJobInfo.add(minhangJobInfo);
        addressJobInfo.add(xuhuiJobInfo);
//        addressJobInfo.add(songjiangJobInfo);
//        addressJobInfo.add(yangpuJobInfo);
//        addressJobInfo.add(hongkouJobInfo);
//        addressJobInfo.add(putuoJobInfo);
//        addressJobInfo.add(otherInfo);

        for(Table t : addressJobInfo) {
            System.out.println(salaryInfo(t));
        }
        
        for(Table t : addressJobInfo) {
            System.out.println(sortBySalary(t).first(10));
        }
        
    }
    
    //salary mean, stddev, median, max, min
    public static Table salaryInfo(Table t) {        
        return t.summarize("salary", mean, stdDev, median, max, min).apply();
    }
    
    //sort by salary, descending
    public static Table sortBySalary(Table t) {
        return t.sortDescendingOn("salary");
    }
    
    //select one district
    public static Table chooseByAddress(Table t, String address) {
        Table t2 = Table.create(address)
                .addColumns(
                    StringColumn.create("jobName"),
                    StringColumn.create("company"),
                    StringColumn.create("address"),
                    DoubleColumn.create("salary"),
                    StringColumn.create("jobUrl"));
        for(Row r : t) {
            if(r.getString(2).equals(address)) {
                t2.addRow(r);
            }
        }
        return t2;
    }
}

The first half of the output is the per-district summary:

(Screenshot: salary summary per district)

The second half is the top 10 salaries in each district; you can see Tablesaw's tables are about as ugly as tables get. The jobUrl values can be opened directly in a browser.

(Screenshot: top 10 salaries per district)

 

Testing with a different URL: suppose I want a Java development job. Replace strUrl in TestMain with the "java + Shanghai" URL:

https://search.51job.com/list/020000,000000,0000,00,9,99,java,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2c0&radius=-1&ord_field=0&confirmdate=9&fromtype=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=

Delete jobinfo.txt and recreate the database.

Running it crawled 270-odd pages. The local jobinfo.txt:

(Screenshot: jobinfo.txt)

The database:

(Screenshot: the populated jobinfo table)

 

Then in Analyze change bf.filterJobName("数据") to "java", add another filter for "开发", and run:

(Screenshots: analysis results for the Java search)

All the information comes out; as for actual analysis, the tables will have to speak for themselves for now.

The extension I have in mind is to follow each jobUrl and aggregate the listed job requirements. I haven't done that yet; I may try it over the summer if I'm still interested. For now, time to hand in the assignment.

Source code: https://pan.baidu.com/s/1xwtblctxerzqueimrfuliw
Extraction code: 2fea