C#, HtmlAgilityPack, garbled text problem, GB2312, garbled crawler output, decoding byte[] as GB2312
程序员文章站
2022-07-14 19:05:24
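While working through the HtmlAgilityPack documentation, I tried the sample code from the official site directly, changing only the URL to the Baidu search trends page (百度搜索风云榜). The output was full of question marks and garbled characters.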
var html = @"http://html-agility-pack.net/";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var node = htmlDoc.DocumentNode.SelectSingleNode("//head/title");
Console.WriteLine("Node Name: " + node.Name + "\n" + node.OuterHtml);
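The question marks can be reproduced without any network access. The following is a minimal sketch, not taken from the HtmlAgilityPack documentation: it encodes a short Chinese string as GB2312 and then decodes the same bytes with two different encodings. The class name MojibakeDemo, the helper RoundTrip and the sample string are made up for illustration. The RegisterProvider call is an assumption about the runtime: it is needed on .NET Core and .NET 5+, where GB2312 is not available by default.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: encode `text` as GB2312, then decode the bytes with `decoder`.
    public static string RoundTrip(string text, Encoding decoder)
    {
        Encoding gb2312 = Encoding.GetEncoding("GB2312");
        byte[] raw = gb2312.GetBytes(text);
        return decoder.GetString(raw);
    }

    static void Main()
    {
        // Makes code-page encodings such as GB2312 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string original = "百度";

        // Decoding GB2312 bytes as UTF-8 does not give the original text back.
        Console.WriteLine(RoundTrip(original, Encoding.UTF8) == original);

        // Decoding with the encoding the bytes were written in does.
        Console.WriteLine(RoundTrip(original, Encoding.GetEncoding("GB2312")) == original);
    }
}
```

The first comparison fails and the second succeeds, which is the same mismatch that shows up as garbled output when a GB2312 page is decoded with the wrong encoding.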
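After some research, I found an explanation that the package was written by a French developer and that its default decoding does not suit Chinese pages. Since this page is encoded in GB2312, the fix is to download the raw bytes and decode them as GB2312 explicitly before parsing: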
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net.Http;

namespace htmlAgilitypack
{
    class Program
    {
        static readonly HttpClient client = new HttpClient();

        static async Task Main()
        {
            // Call asynchronous network methods in a try/catch block to handle exceptions.
            try
            {
                // Fetch the HTML as a raw byte[] so that no decoding happens yet.
                byte[] response1 = await client.GetByteArrayAsync("http://top.baidu.com/buzz?b=1&fr=topindex");

                // Debugging aid: dump the raw bytes.
                //foreach (byte i in response1)
                //{
                //    Console.Write(i);
                //}
                //Console.WriteLine("\n");

                // Decode the byte[] as GB2312.
                // Note: on .NET Core / .NET 5+, GB2312 is only available after calling
                // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
                string temp = Encoding.GetEncoding("GB2312").GetString(response1);

                // Parse the HTML and print the matching nodes.
                HtmlDocument html = new HtmlDocument();
                html.LoadHtml(temp);
                var nodes = html.DocumentNode.SelectNodes("//a[@class=\"list-title\"]");

                // SelectNodes returns null (not an empty collection) when nothing matches.
                if (nodes == null)
                {
                    Console.WriteLine("No matching nodes found.");
                    return;
                }

                foreach (var t in nodes)
                {
                    Console.WriteLine(t.InnerText);
                }
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine("\nException Caught!");
                Console.WriteLine("Message :{0} ", e.Message);
            }
            finally
            {
                // Keep the console window open until a key is pressed.
                Console.Read();
            }
        }
    }
}
The result: the Baidu search trends entries are retrieved and printed correctly.
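The program above hard-codes GB2312, which fits this page but not every site. One possible refinement, sketched here under assumptions, is to use the charset the server declares in its Content-Type header and fall back to GB2312 only when none is given or the name is unknown. CharsetHelper, ResolveEncoding and Decode are hypothetical names; they are not part of HtmlAgilityPack or of the original program, and the post does not say whether top.baidu.com declares a charset in its header.

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Returns the encoding named by `charset`, or `fallback` if the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return fallback;
        try
        {
            // Some servers quote the value, e.g. charset="gb2312".
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws for names it does not recognise.
            return fallback;
        }
    }

    // Decodes the response body using the declared charset, or the fallback.
    public static string Decode(byte[] body, string charset, Encoding fallback)
    {
        return ResolveEncoding(charset, fallback).GetString(body);
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+ before GB2312 can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");

        // Stand-in for a downloaded page body (assumption: no real request is made here).
        byte[] body = gb2312.GetBytes("<title>百度</title>");

        Console.WriteLine(Decode(body, "gb2312", Encoding.UTF8)); // charset declared
        Console.WriteLine(Decode(body, null, gb2312));            // no charset, fallback used
    }
}
```

In the original program, the charset string could come from response.Content.Headers.ContentType?.CharSet after a client.GetAsync call, with the bytes read via response.Content.ReadAsByteArrayAsync(). The decoded string can then be passed to html.LoadHtml as before.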