
C#, HtmlAgilityPack: garbled text when scraping a GB2312 page, and decoding the raw bytes as GB2312

程序员文章站 2022-07-14 19:05:24

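While working through the HtmlAgilityPack documentation, I tried the sample code from the official site as-is, changing only the URL to the Baidu Top Search (百度搜索风云榜) page. The output was full of question marks and garbled characters.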

var html = @"http://html-agility-pack.net/";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var node = htmlDoc.DocumentNode.SelectSingleNode("//head/title");
Console.WriteLine("Node Name: " + node.Name + "\n" + node.OuterHtml);
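
Question marks like these are typical of bytes being decoded with the wrong character set. The following is a minimal offline sketch of that effect, not the library's actual code path: it assumes the page bytes are GB2312 and are mistakenly decoded as UTF-8. The class name MojibakeDemo and the helper Decode are hypothetical names chosen for illustration, and on older targets the System.Text.Encoding.CodePages NuGet package may be needed.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: decode raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        // GB2312 is not available by default on .NET Core / .NET 5+,
        // so register the code-pages provider first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");

        // "\u767E\u5EA6" is the word 百度, written with escapes so the
        // sample does not depend on the source file's own encoding.
        string original = "\u767E\u5EA6";
        byte[] pageBytes = gb2312.GetBytes(original);

        // Assumption for illustration: the wrong decoder is UTF-8.
        string wrong = Decode(pageBytes, Encoding.UTF8);
        string right = Decode(pageBytes, gb2312);

        // The wrong decoding contains U+FFFD replacement characters,
        // which many consoles display as '?'.
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0); // True
        Console.WriteLine(right == original);            // True
    }
}
```

The same bytes give readable text only when the decoder matches the encoding the page was written in, which is why the fix works on the raw bytes rather than on an already-decoded string.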

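After some research, I found that the problem is an encoding mismatch: the Baidu page is encoded as GB2312, but it was not being decoded as GB2312. (The explanation I came across was that the package was written by a French developer and defaults to a Western encoding; I have not verified that part.) The approach to fixing it is as follows: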

Reference documentation

1. Fetch the HTML as a byte array.
2. Decode the bytes using the page's actual encoding (GB2312 here).
3. Parse the decoded HTML.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net.Http;


namespace htmlAgilitypack
{
    class Program
    {
        static readonly HttpClient client = new HttpClient();
        static async Task Main()
        {
            // Call asynchronous network methods in a try/catch block to handle exceptions.
            try
            {
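                // GB2312 is not available by default on .NET Core / .NET 5+,
                // so register the code-pages provider before asking for it.
                Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

                // Fetch the HTML as a raw byte[] so that nothing is decoded yet.
                byte[] response1 = await client.GetByteArrayAsync("http://top.baidu.com/buzz?b=1&fr=topindex");
                // Debugging aid: dump the raw bytes.
                //foreach (byte i in response1)
                //{
                //    Console.Write(i);
                //}
                //Console.WriteLine("\n");

                // Decode the byte[] as GB2312.
                string temp = Encoding.GetEncoding("GB2312").GetString(response1);

                // Parse the HTML and print the text of each list entry.
                HtmlDocument html = new HtmlDocument();
                html.LoadHtml(temp);
                var node = html.DocumentNode.SelectNodes("//a[@class=\"list-title\"]");
                // SelectNodes returns null when nothing matches, so guard before iterating.
                if (node != null)
                {
                    foreach (var t in node)
                    {
                        Console.WriteLine(t.InnerText);
                    }
                }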
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine("\nException Caught!");
                Console.WriteLine("Message :{0} ", e.Message);
            }
            finally
            {
                Console.Read();
            }
        }
    }
}

The result: the entries of the Baidu Top Search list are retrieved successfully.
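
As a possible variation, the decoding step can be left to HtmlAgilityPack itself by passing the encoding to HtmlDocument.Load(Stream, Encoding) instead of building a string first. The sketch below is not the approach used in this post: it works on an in-memory sample rather than the Baidu page, and the class name EncodingAwareLoad, the helper Parse, and the sample HTML are hypothetical. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

static class EncodingAwareLoad
{
    // Hypothetical helper: parse raw HTML bytes with an explicit encoding.
    public static HtmlDocument Parse(byte[] rawHtml, Encoding encoding)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(rawHtml))
        {
            // The library reads the stream using the encoding we pass in.
            doc.Load(stream, encoding);
        }
        return doc;
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+ before GB2312 can be requested.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");

        // Stand-in for the bytes returned by GetByteArrayAsync.
        // "\u98CE\u4E91\u699C" is the word 风云榜.
        byte[] raw = gb2312.GetBytes(
            "<html><head><title>\u98CE\u4E91\u699C</title></head><body></body></html>");

        HtmlDocument doc = Parse(raw, gb2312);
        var title = doc.DocumentNode.SelectSingleNode("//head/title");
        if (title != null)
        {
            Console.WriteLine(title.InnerText);
        }
    }
}
```

When loading through HtmlWeb rather than HttpClient, its OverrideEncoding property may be a shorter route to the same effect; that path was not tried against the Baidu page here.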

Related tags: C# notes