A Simple C# Web Crawler Tutorial

程序员文章站 2022-07-03 08:01:10

1. Using the third-party library HtmlAgilityPack

Official website:

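// Requires: using HtmlAgilityPack;
// The three snippets below are independent alternatives.

// From a file: load HTML from a file on disk
var doc = new HtmlDocument();
doc.Load(filePath);

// From a string: parse HTML that is already held in a string
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From the web: download and parse the HTML at a URL
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);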

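1.1 A closer look at the last approach

var web = new HtmlWeb();
var doc = web.Load(url);

On the web object we can also set cookies, headers, and similar request details to handle the requirements of particular sites, for example sites that require a login.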
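As a rough sketch of what that configuration might look like, the snippet below sets a User-Agent, adds one extra header, and attaches a cookie through HtmlWeb's PreRequest hook. The class name ConfiguredWeb, the header value, and the cookie name and value are placeholders I made up for illustration. It also assumes the synchronous Load path, where HtmlWeb issues an HttpWebRequest; whether the hook is honored on the async path may depend on the library version and target framework, so check that before relying on it.

```csharp
using System.Net;
using HtmlAgilityPack;

public static class ConfiguredWeb
{
    // Builds an HtmlWeb that sends a custom User-Agent, one extra header, and a cookie.
    // All concrete values here are placeholders, not values from a real site.
    public static HtmlWeb Create(string cookieDomain)
    {
        var cookies = new CookieContainer();
        cookies.Add(new Cookie("session", "placeholder-value", "/", cookieDomain));

        var web = new HtmlWeb
        {
            UserAgent = "Mozilla/5.0 (compatible; DemoCrawler/1.0)"
        };

        // PreRequest is called just before a request is sent;
        // returning true lets the request go ahead.
        web.PreRequest = request =>
        {
            request.Headers.Add("Accept-Language", "en-US,en;q=0.9");
            request.CookieContainer = cookies;
            return true;
        };

        return web;
    }
}
```

Calling ConfiguredWeb.Create("example.com").Load(url) would then send those values with the request. For a real login you would normally copy the session cookie from a logged-in browser session or perform the login request yourself first.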

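1.2 How it works

When you view a page's source, the page is just one long string. A crawler's job is to search that string for the information we want and pick it out.
The traditional way to filter it is with regular expressions, which are tedious and a bit of a headache to write.
HtmlAgilityPack lets us locate the information we need with XPath instead.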
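To make the contrast with regular expressions concrete, here is a minimal sketch that collects every link on a page with a single XPath query. LinkExtractor and ExtractLinks are names invented for this example; the HtmlAgilityPack calls (LoadHtml, SelectNodes, GetAttributeValue) are the same ones used elsewhere in this article.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The expression //a[@href] reads as "every a element, anywhere in the document, that has an href attribute", which is much easier to get right than an equivalent regular expression.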

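1.2.1 Where do you find the XPath?

Right-click the element on the page and choose Inspect. In the browser's developer tools you can then right-click the highlighted node and copy its XPath.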

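With the XPath you can pinpoint exactly the element you want and read all of its information.

1.2.2 How do you get information from the selected HTML element?

Select the element:

var web = new HtmlWeb();
var doc = web.Load(url);
var htmlNode = doc?.DocumentNode?.SelectSingleNode("/html/body/header");

Read the element's information:

var text = htmlNode?.InnerText;      // text content with the tags stripped
var innerHtml = htmlNode?.InnerHtml; // markup inside the element
// Read an attribute value, falling back to a default when the attribute is missing
var src = htmlNode?.GetAttributeValue("src", "not found");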
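The same steps can be tried offline against an HTML string, which is handy when experimenting with an XPath before pointing it at a live site. This is only a sketch: NodeInfoDemo and Describe are hypothetical names, and the sample markup in the usage note is made up.

```csharp
using HtmlAgilityPack;

public static class NodeInfoDemo
{
    // Selects one node by XPath and returns its text, inner markup, and src attribute.
    // All three values are null when the XPath matches nothing.
    public static (string Text, string Html, string Src) Describe(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return (null, null, null);
        }

        // GetAttributeValue returns the second argument when the attribute is absent.
        return (node.InnerText, node.InnerHtml, node.GetAttributeValue("src", "not found"));
    }
}
```

For example, with the markup `<div id='logo'><img src='/logo.png'>Hello</div>`, the XPath `//img` yields `/logo.png` as the src value, while `//div[@id='logo']` yields the fallback `not found` because the div itself has no src attribute.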

2. A helper class of my own

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

/// <summary>
/// Helper class for downloading and querying HTML.
/// </summary>
public static class LoadHtmlHelper
{
  // One shared HttpClient, so repeated downloads do not exhaust sockets.
  private static readonly HttpClient Client = new HttpClient();

  /// <summary>
  /// Downloads the page at the given URL.
  /// </summary>
  /// <param name="url">Page address</param>
  /// <returns>The parsed document</returns>
  public static async ValueTask<HtmlDocument> LoadHtmlFromUrlAsync(string url)
  {
    var web = new HtmlWeb();
    return await web.LoadFromWebAsync(url);
  }

  /// <summary>
  /// Extension method: gets a single node from a document.
  /// </summary>
  /// <param name="htmlDocument">Document object</param>
  /// <param name="xpath">XPath expression</param>
  /// <returns>The first matching node, or null if there is none</returns>
  public static HtmlNode GetSingleNode(this HtmlDocument htmlDocument, string xpath)
  {
    return htmlDocument?.DocumentNode?.SelectSingleNode(xpath);
  }

  /// <summary>
  /// Extension method: gets multiple nodes from a document.
  /// </summary>
  /// <param name="htmlDocument">Document object</param>
  /// <param name="xpath">XPath expression</param>
  /// <returns>The matching nodes, or null if there are none</returns>
  public static HtmlNodeCollection GetNodes(this HtmlDocument htmlDocument, string xpath)
  {
    return htmlDocument?.DocumentNode?.SelectNodes(xpath);
  }

  /// <summary>
  /// Extension method: gets multiple nodes starting from a node.
  /// </summary>
  /// <param name="htmlNode">Node to search from</param>
  /// <param name="xpath">XPath expression</param>
  /// <returns>The matching nodes, or null if there are none</returns>
  public static HtmlNodeCollection GetNodes(this HtmlNode htmlNode, string xpath)
  {
    return htmlNode?.SelectNodes(xpath);
  }

  /// <summary>
  /// Extension method: gets a single node starting from a node.
  /// </summary>
  /// <param name="htmlNode">Node to search from</param>
  /// <param name="xpath">XPath expression</param>
  /// <returns>The first matching node, or null if there is none</returns>
  public static HtmlNode GetSingleNode(this HtmlNode htmlNode, string xpath)
  {
    return htmlNode?.SelectSingleNode(xpath);
  }

  /// <summary>
  /// Downloads an image and saves it to a file.
  /// </summary>
  /// <param name="url">Image address</param>
  /// <param name="filePath">Path of the file to write</param>
  /// <returns>True if the file exists after the download</returns>
  public static async ValueTask<bool> DownloadImg(string url, string filePath)
  {
    try
    {
      var bytes = await Client.GetByteArrayAsync(url);
      using (FileStream fs = File.Create(filePath))
      {
        await fs.WriteAsync(bytes, 0, bytes.Length);
      }
      return File.Exists(filePath);
    }
    catch (Exception ex)
    {
      // Wrap the original exception so the caller still sees the root cause.
      throw new Exception("Failed to download image", ex);
    }
  }
}
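The node-based overloads above are meant for nested selection: first find a container, then query inside it. One thing to watch for is that an XPath beginning with // searches the whole document even when it is run on a child node, so a relative path such as ./li or .//li is usually what you want. The sketch below shows the idea using HtmlAgilityPack's own SelectSingleNode and SelectNodes, which the helper methods simply wrap; NestedSelectDemo and ItemsOf are names made up for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class NestedSelectDemo
{
    // Returns the trimmed text of each <li> directly inside the <ul> with the given id.
    public static List<string> ItemsOf(string html, string listId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // Step 1: find the container node.
        var list = doc.DocumentNode.SelectSingleNode($"//ul[@id='{listId}']");

        // Step 2: query relative to that node.
        // "./li" stays inside the container; "//li" would match every <li> in the document.
        var items = list?.SelectNodes("./li");
        if (items == null)
        {
            return result;
        }

        foreach (var item in items)
        {
            result.Add(item.InnerText.Trim());
        }
        return result;
    }
}
```

With two lists on the same page, ItemsOf(html, "a") returns only the items of the list whose id is a, which is the behavior you normally expect when scraping repeated blocks such as article cards or table rows.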

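3. A crawler example of my own, and the site it crawls

The data storage layer is not implemented. I did not get around to writing it, so that part is left to you. For now, the data is simply saved to files.
GitHub repository: https://github.com/zhangqueque/quewaner.crawler.git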
