C#.NET Example: Scraping the Baidu Baijia Article List with Regular Expressions

程序员文章站 2023-12-15 21:21:46

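This article shows, by example, how to scrape the Baidu Baijia article list in C#.NET using regular expressions. It is shared here for reference; the details are as follows: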

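In my spare time I have been learning regular expressions. Since practice is the sole criterion for testing truth, I wrote an example that uses a regular expression to scrape articles from Baidu Baijia. The source code below shows the process: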

1. Fetching the Baidu Baijia page content

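// Requires: using System.Collections.Generic; using System.IO; using System.Net;
public List<string[]> GetUrl()
{
  string url = "http://baijia.baidu.com/";
  WebRequest webRequest = WebRequest.Create(url);
  // The using blocks close the response and the reader even if reading fails.
  // The original catch block only rethrew the exception (and reset its stack
  // trace with "throw ex;"), so exceptions are simply left to propagate.
  using (WebResponse webResponse = webRequest.GetResponse())
  using (StreamReader reader = new StreamReader(webResponse.GetResponseStream()))
  {
    string result = reader.ReadToEnd();
    return AnalysisHtml(result);
  }
}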

2. Filtering with a regular expression

// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public List<string[]> AnalysisHtml(string htmlContent)
{
  List<string[]> list = new List<string[]>();
  string strPattern = "<h3><a\\s*.*>(?<title>[^<]+)</a></h3>.*\\s*<p\\s*class=\"feeds-item-text\">(?<abstract>[^<]+)<a\\s*href=\"(?<url>.*)\"\\s*target=\"_blank\"\\s*class=\"feeds-item-more\"\\s*mon=\".*\\s*\">.*\\s*</a></p>";
  Regex regex = new Regex(strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
  // Matches returns an empty collection when nothing matches,
  // so a separate IsMatch check is not needed.
  foreach (Match match in regex.Matches(htmlContent))
  {
    string[] str = new string[3];
    str[0] = match.Groups["title"].Value;    // title of the list item
    str[1] = match.Groups["abstract"].Value; // abstract text
    str[2] = match.Groups["url"].Value;      // URL the item links to
    list.Add(str);
  }
  return list;
}
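
To see what the pattern captures without touching the network, here is a minimal sketch that runs it against a small hand-written fragment. The class name FeedRegexDemo, the SampleHtml helper, and the example.com URLs are assumptions made for illustration; the markup is modelled on what the pattern expects, not copied from the live page, which may have changed since the article was written.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FeedRegexDemo
{
    // Same pattern as in the article, written as a verbatim string
    // so that only the double quotes need escaping.
    private const string Pattern =
        @"<h3><a\s*.*>(?<title>[^<]+)</a></h3>.*\s*<p\s*class=""feeds-item-text"">(?<abstract>[^<]+)<a\s*href=""(?<url>.*)""\s*target=""_blank""\s*class=""feeds-item-more""\s*mon="".*\s*"">.*\s*</a></p>";

    // Returns one {title, abstract, url} array per matched list item.
    public static List<string[]> Extract(string html)
    {
        var items = new List<string[]>();
        var regex = new Regex(Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant);
        foreach (Match match in regex.Matches(html))
        {
            items.Add(new[]
            {
                match.Groups["title"].Value.Trim(),
                match.Groups["abstract"].Value.Trim(),
                match.Groups["url"].Value
            });
        }
        return items;
    }

    // Hypothetical markup: two feed items, each with a heading line and a summary line.
    public static string SampleHtml()
    {
        return
            "<h3><a href=\"http://example.com/a1\" target=\"_blank\">First title</a></h3>\n" +
            "<p class=\"feeds-item-text\">First abstract <a href=\"http://example.com/a1\" target=\"_blank\" class=\"feeds-item-more\" mon=\"col=1\">more</a></p>\n" +
            "<h3><a href=\"http://example.com/a2\" target=\"_blank\">Second title</a></h3>\n" +
            "<p class=\"feeds-item-text\">Second abstract <a href=\"http://example.com/a2\" target=\"_blank\" class=\"feeds-item-more\" mon=\"col=2\">more</a></p>\n";
    }

    public static void Main()
    {
        foreach (string[] item in Extract(SampleHtml()))
        {
            Console.WriteLine(string.Join(" | ", item));
        }
    }
}
```

Because the pattern does not set RegexOptions.Singleline, each `.*` stops at a line break, which is what keeps the greedy `url` group from running into the next item in this sample. On markup where several items share one line, the greedy quantifiers could capture too much, so the pattern should be rechecked against the real page before relying on it.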

I hope this article helps with your C# programming.