[Recommended] An Objective-C library for parsing HTML data (scraping web page data)
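Configuration
1. Link libxml2.tbd (Build Phases > Link Binary With Libraries).
2. Set the build path: add $(SDKROOT)/usr/include/libxml2 to Header Search Paths so the libxml2 headers can be found.
Usage
An example makes this clearer. We use the following page:
http://so.gushiwen.org/guwen/book_2.aspx
1. Create a TFHpple object; data is the NSData returned by the website.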
TFHpple *htmlParser = [[TFHpple alloc] initWithHTMLData:data];
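The line above assumes data is already in hand. A minimal sketch of one way to obtain it with NSURLSession follows. The helper name FetchBookPage and its callback shape are made up for illustration, and on iOS a plain http URL like this one may additionally need an App Transport Security exception.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper (not part of hpple): downloads the example page and
// passes a parser to the callback, or nil if the download failed.
static void FetchBookPage(void (^completion)(TFHpple *parser)) {
    NSURL *url = [NSURL URLWithString:@"http://so.gushiwen.org/guwen/book_2.aspx"];
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data,
                                                        NSURLResponse *response,
                                                        NSError *error) {
            if (error != nil || data.length == 0) {
                NSLog(@"Download failed: %@", error);
                completion(nil);
                return;
            }
            // data holds the raw HTML bytes returned by the site.
            TFHpple *parser = [[TFHpple alloc] initWithHTMLData:data];
            completion(parser);
        }];
    [task resume];
}
```

The completion handler runs on a background queue, so any UI update made from the callback should be dispatched back to the main queue.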
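2. Use the searchWithXPathQuery: method to pull out the useful data. XPath syntax itself is not covered here; look it up separately.
NSArray *elements = [htmlParser searchWithXPathQuery:@"//div[@class='shileft']/div[@class='bookcont']"];
This gives us the data for the Analects (论语).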
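Since XPath is only mentioned in passing, here is a small sketch of a second query against the same page. It selects the chapter links directly instead of their container. The function name LogChapterLinks is hypothetical, and the query assumes the links are a elements nested somewhere under the bookcont div. objectForKey: is, as far as I know, hpple's accessor for an element's attributes.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: logs the title and href of every chapter link.
static void LogChapterLinks(TFHpple *htmlParser) {
    // "//a" after the div step matches <a> elements at any depth below it.
    // [@class='bookcont'] is an exact string match on the class attribute.
    NSArray *anchors =
        [htmlParser searchWithXPathQuery:@"//div[@class='bookcont']//a"];
    for (TFHppleElement *anchor in anchors) {
        NSLog(@"%@ -> %@", anchor.content, [anchor objectForKey:@"href"]);
    }
}
```

Because @class='bookcont' compares the whole attribute value, an element carrying several classes would not match; contains(@class, 'bookcont') is the usual looser alternative.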
3. Get and inspect the elements
TFHppleElement *element = [elements objectAtIndex:i];
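The index i above implies a loop over the result array. A sketch of steps 1 to 3 put together, assuming data already holds the downloaded page; the function name LogBookContElements is made up for illustration.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: parses the page and logs each matched element.
static void LogBookContElements(NSData *data) {
    TFHpple *htmlParser = [[TFHpple alloc] initWithHTMLData:data];
    NSArray *elements = [htmlParser
        searchWithXPathQuery:@"//div[@class='shileft']/div[@class='bookcont']"];

    // Fast enumeration avoids manual index handling with objectAtIndex:.
    for (TFHppleElement *element in elements) {
        NSLog(@"tagName: %@", element.tagName);
        NSLog(@"content: %@", element.content);
    }
}
```

If the query matches nothing, the loop body simply does not run, so no bounds check is needed.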
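A TFHppleElement object exposes a number of properties. Each one is described briefly below.
1. raw
@property (nonatomic, copy, readonly) NSString *raw;
raw is the element's source, including the HTML tags.
2. content is the element's actual text, without the HTML tags. Output:
学而篇 为政篇 八佾篇 里仁篇 公冶长篇 雍也篇 述而篇 泰伯篇 子罕篇 乡党篇 先进篇 颜渊篇 子路篇 宪问篇 卫灵公篇 季氏篇 阳货篇 微子篇 子张篇 尧曰篇
3. tagName is the HTML tag name.
The output here is just div.
4. attributes holds the element's attributes. Output:
class = bookcont;
5. children holds the child nodes. Output: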
(abridged, with one level of string escaping removed and the raw fields omitted)
(
    { nodeContent = "\n "; nodeName = text; },
    {
        nodeChildArray = (
            { nodeContent = "\n \n "; nodeName = text; },
            {
                nodeChildArray = (
                    {
                        nodeAttributeArray = (
                            { attributeName = href; nodeContent = "/guwen/bookv_19.aspx"; }
                        );
                        nodeChildArray = (
                            { nodeContent = "\U5b66\U800c\U7bc7"; nodeName = text; }
                        );
                        nodeContent = "\U5b66\U800c\U7bc7";
                        nodeName = a;
                    }
                );
                nodeContent = "\U5b66\U800c\U7bc7";
                nodeName = span;
            },
            { nodeContent = "\n \n "; nodeName = text; },
            ...
        );
        nodeContent = "\n \n \U5b66\U800c\U7bc7\n \n \U4e3a\U653f\U7bc7\n \n ...";
        nodeName = ul;
    }
)
The ul node contains 20 span entries of this shape, one per chapter from 学而篇 (\U5b66\U800c\U7bc7, shown above) to 尧曰篇, each separated by a whitespace text node. Their href values run from /guwen/bookv_19.aspx to /guwen/bookv_38.aspx. The ul node's own nodeContent is the concatenation of all 20 titles.
6. firstChild is the first child node. Output:
{ nodeContent = "\n "; nodeName = text;}
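The tagName, attributes, children, and content properties described above are enough to walk the tree by hand. The sketch below collects every link under an element recursively. The function name CollectLinks and the title/href dictionary keys are made up for illustration, and it assumes attributes maps attribute names to their string values, as the output for item 4 suggests.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: depth-first walk that appends one dictionary
// (@"title", @"href") to links for every <a> element found under element.
static void CollectLinks(TFHppleElement *element,
                         NSMutableArray<NSDictionary *> *links) {
    if ([element.tagName isEqualToString:@"a"]) {
        NSString *href = element.attributes[@"href"];
        if (href != nil) {
            // content may be nil for an empty link, so fall back to @"".
            [links addObject:@{ @"title": element.content ?: @"",
                                @"href": href }];
        }
        return; // do not descend into the link's own text node
    }
    for (TFHppleElement *child in element.children) {
        CollectLinks(child, links);
    }
}
```

Called on the bookcont element from step 3 with an empty NSMutableArray, this would be expected to skip the whitespace text nodes and gather one entry per chapter link.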
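All of the properties above relate to the HTML markup. In practice we mostly use the content property and then process the resulting NSString.
That is how we obtain the data and turn it into the form we want. TFHppleElement is an important class, but its detailed usage is beyond the scope of this post.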
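As an illustration of the post-processing mentioned above, here is a sketch that splits a content string like the one shown for item 2 into individual chapter titles. It uses only Foundation; the function name ChapterTitlesFromContent is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits whitespace-separated text into titles,
// dropping the empty pieces produced by runs of spaces and newlines.
static NSArray<NSString *> *ChapterTitlesFromContent(NSString *content) {
    NSCharacterSet *separators =
        [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSArray<NSString *> *parts =
        [content componentsSeparatedByCharactersInSet:separators];
    // Consecutive separators yield empty strings; keep non-empty ones only.
    NSPredicate *nonEmpty = [NSPredicate predicateWithFormat:@"length > 0"];
    return [parts filteredArrayUsingPredicate:nonEmpty];
}
```

For example, ChapterTitlesFromContent(@"\n 学而篇\n \n 为政篇 ") gives an array containing 学而篇 and 为政篇.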