
[Recommended] An Objective-C library for parsing HTML data (scraping web pages)

程序员文章站 2022-05-23 15:52:42
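  TFHpple is a third-party library for parsing HTML data. In my experience it works reasonably well, but the project has to be configured before the library can be used.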

  

  Configuration

1. Link the project against libxml2.tbd

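2. Set the build path (in Xcode this is typically the Header Search Paths setting, pointed at the libxml2 headers, e.g. $(SDKROOT)/usr/include/libxml2)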

  Usage

The usage is illustrated with an example, based on the following page:

http://so.gushiwen.org/guwen/book_2.aspx

1. Create a TFHpple object, where data is the data returned by the website

TFHpple *htmlParser = [[TFHpple alloc] initWithHTMLData:data];
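The snippet above assumes that data already exists. As a minimal sketch of one way to obtain it, the hypothetical helper MakeParserForURL below downloads the page and wraps the result in a parser. The synchronous dataWithContentsOfURL: call is used only for brevity; a real app would more likely use NSURLSession, and on iOS a plain http URL may also need an App Transport Security exception.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper (not part of TFHpple): downloads the page at
// urlString and wraps the bytes in a TFHpple parser.
// Returns nil if the URL is invalid or the download fails.
static TFHpple *MakeParserForURL(NSString *urlString) {
    NSURL *url = [NSURL URLWithString:urlString];
    if (url == nil) {
        return nil;
    }
    // Synchronous fetch for illustration only; avoid calling this on the main thread.
    NSData *data = [NSData dataWithContentsOfURL:url];
    if (data == nil) {
        return nil;
    }
    return [[TFHpple alloc] initWithHTMLData:data];
}
```

It could be called as MakeParserForURL(@"http://so.gushiwen.org/guwen/book_2.aspx"), checking the result for nil before querying it.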

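2. Use the searchWithXPathQuery: method to extract the useful data. XPath itself is not covered here; look it up separately if you are not familiar with it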

NSArray *elements = [htmlParser searchWithXPathQuery:@"//div[@class='shileft']/div[@class='bookcont']"];

This gives us the data for the Analects (论语).

3. Retrieve and inspect an element

TFHppleElement *element = [elements objectAtIndex:i];
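The line above reads a single element by index. A minimal sketch of visiting every match instead is shown below; the helper name LogMatches is hypothetical, and it only uses the tagName and content properties described in this article.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: runs an XPath query over the HTML in data,
// logs the tag name and text of each matching element,
// and returns how many elements matched.
static NSUInteger LogMatches(NSData *data, NSString *xpath) {
    TFHpple *htmlParser = [[TFHpple alloc] initWithHTMLData:data];
    NSArray *elements = [htmlParser searchWithXPathQuery:xpath];
    // Fast enumeration avoids manual index handling.
    for (TFHppleElement *element in elements) {
        NSLog(@"<%@> %@", element.tagName, element.content);
    }
    return elements.count;
}
```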

A TFHppleElement object has many properties. Each one is briefly described below.

1. raw

@property (nonatomic, copy, readonly) NSString *raw;

raw is the element's data including the HTML markup

2. content is the element's actual text, without the HTML markup

学而篇                             为政篇                             八佾篇                             里仁篇                             公冶长篇                             雍也篇                             述而篇                             泰伯篇                             子罕篇                             乡党篇                             先进篇                             颜渊篇                             子路篇                             宪问篇                             卫灵公篇                             季氏篇                             阳货篇                             微子篇                             子张篇                             尧曰篇

3. tagName is the HTML tag name

The output here is just div

4. attributes holds the element's attributes

class = bookcont;

5. children holds the child nodes

(    "{\n    nodeContent = \"\\n        \";\n    nodeName = text;\n}",    "{\n    nodeChildArray =     (\n                {\n            nodeContent = \"\\n         \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_19.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5b66\\U800c\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5b66\\U800c\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5b66\\U800c\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5b66\\U800c\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5b66\\U800c\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_20.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U4e3a\\U653f\\U7bc7\";\n                            nodeName = text;\n                        }\n              
      );\n                    nodeContent = \"\\U4e3a\\U653f\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U4e3a\\U653f\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U4e3a\\U653f\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U4e3a\\U653f\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_21.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U516b\\U4f7e\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U516b\\U4f7e\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U516b\\U4f7e\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U516b\\U4f7e\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U516b\\U4f7e\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_22.aspx\";\n                        }\n                    );\n                    
nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U91cc\\U4ec1\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U91cc\\U4ec1\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U91cc\\U4ec1\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U91cc\\U4ec1\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U91cc\\U4ec1\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_23.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U516c\\U51b6\\U957f\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U516c\\U51b6\\U957f\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U516c\\U51b6\\U957f\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U516c\\U51b6\\U957f\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U516c\\U51b6\\U957f\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    
nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_24.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U96cd\\U4e5f\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U96cd\\U4e5f\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U96cd\\U4e5f\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U96cd\\U4e5f\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U96cd\\U4e5f\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_25.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U8ff0\\U800c\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U8ff0\\U800c\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U8ff0\\U800c\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U8ff0\\U800c\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U8ff0\\U800c\\U7bc7\";\n        },\n         
       {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_26.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U6cf0\\U4f2f\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U6cf0\\U4f2f\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U6cf0\\U4f2f\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U6cf0\\U4f2f\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U6cf0\\U4f2f\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_27.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5b50\\U7f55\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5b50\\U7f55\\U7bc7\";\n                    nodeName = a;\n  
                  raw = \"\\U5b50\\U7f55\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5b50\\U7f55\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5b50\\U7f55\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_28.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U4e61\\U515a\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U4e61\\U515a\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U4e61\\U515a\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U4e61\\U515a\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U4e61\\U515a\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_29.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            
nodeContent = \"\\U5148\\U8fdb\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5148\\U8fdb\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5148\\U8fdb\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5148\\U8fdb\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5148\\U8fdb\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_30.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U989c\\U6e0a\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U989c\\U6e0a\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U989c\\U6e0a\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U989c\\U6e0a\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U989c\\U6e0a\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                     
       nodeContent = \"/guwen/bookv_31.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5b50\\U8def\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5b50\\U8def\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5b50\\U8def\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5b50\\U8def\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5b50\\U8def\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_32.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5baa\\U95ee\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5baa\\U95ee\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5baa\\U95ee\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5baa\\U95ee\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5baa\\U95ee\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray 
=             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_33.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U536b\\U7075\\U516c\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U536b\\U7075\\U516c\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U536b\\U7075\\U516c\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U536b\\U7075\\U516c\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U536b\\U7075\\U516c\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_34.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5b63\\U6c0f\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5b63\\U6c0f\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5b63\\U6c0f\\U7bc7\";\n                }\n            );\n            nodeContent = 
\"\\U5b63\\U6c0f\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5b63\\U6c0f\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_35.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U9633\\U8d27\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U9633\\U8d27\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U9633\\U8d27\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U9633\\U8d27\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U9633\\U8d27\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_36.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5fae\\U5b50\\U7bc7\";\n                            nodeName = text;\n                        
}\n                    );\n                    nodeContent = \"\\U5fae\\U5b50\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5fae\\U5b50\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5fae\\U5b50\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5fae\\U5b50\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_37.aspx\";\n                        }\n                    );\n                    nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5b50\\U5f20\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5b50\\U5f20\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5b50\\U5f20\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5b50\\U5f20\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5b50\\U5f20\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n               \\n              \";\n            nodeName = text;\n        },\n                {\n            nodeChildArray =             (\n                                {\n                    nodeAttributeArray =                     (\n                                                {\n                            attributeName = href;\n                            nodeContent = \"/guwen/bookv_38.aspx\";\n                        }\n                    );\n             
       nodeChildArray =                     (\n                                                {\n                            nodeContent = \"\\U5c27\\U66f0\\U7bc7\";\n                            nodeName = text;\n                        }\n                    );\n                    nodeContent = \"\\U5c27\\U66f0\\U7bc7\";\n                    nodeName = a;\n                    raw = \"\\U5c27\\U66f0\\U7bc7\";\n                }\n            );\n            nodeContent = \"\\U5c27\\U66f0\\U7bc7\";\n            nodeName = span;\n            raw = \"\\U5c27\\U66f0\\U7bc7\";\n        },\n                {\n            nodeContent = \"\\n              \\n        \";\n            nodeName = text;\n        }\n    );\n    nodeContent = \"\\n         \\n              \\U5b66\\U800c\\U7bc7\\n               \\n              \\U4e3a\\U653f\\U7bc7\\n               \\n              \\U516b\\U4f7e\\U7bc7\\n               \\n              \\U91cc\\U4ec1\\U7bc7\\n               \\n              \\U516c\\U51b6\\U957f\\U7bc7\\n               \\n              \\U96cd\\U4e5f\\U7bc7\\n               \\n              \\U8ff0\\U800c\\U7bc7\\n               \\n              \\U6cf0\\U4f2f\\U7bc7\\n               \\n              \\U5b50\\U7f55\\U7bc7\\n               \\n              \\U4e61\\U515a\\U7bc7\\n               \\n              \\U5148\\U8fdb\\U7bc7\\n               \\n              \\U989c\\U6e0a\\U7bc7\\n               \\n              \\U5b50\\U8def\\U7bc7\\n               \\n              \\U5baa\\U95ee\\U7bc7\\n               \\n              \\U536b\\U7075\\U516c\\U7bc7\\n               \\n              \\U5b63\\U6c0f\\U7bc7\\n               \\n              \\U9633\\U8d27\\U7bc7\\n               \\n              \\U5fae\\U5b50\\U7bc7\\n               \\n              \\U5b50\\U5f20\\U7bc7\\n               \\n              \\U5c27\\U66f0\\U7bc7\\n              \\n        \";\n    nodeName = ul;\n    raw = \"
\";\n}", "{\n nodeContent = \"\\n \";\n nodeName = text;\n}")

6. firstChild is the first child node

{    nodeContent = "\n        ";    nodeName = text;}
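The children dump above suggests that each chapter sits in an a element whose href attribute holds the chapter link. As a rough sketch under that assumption, the hypothetical helper ChapterLinks below queries the anchors directly instead of walking children by hand. The XPath expression and the dictionary keys are choices made for this illustration.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: collects the title and href of every link
// found inside the div with class "bookcont".
static NSArray<NSDictionary *> *ChapterLinks(NSData *data) {
    TFHpple *htmlParser = [[TFHpple alloc] initWithHTMLData:data];
    // "//a" matches anchors at any depth below the bookcont div.
    NSArray *anchors = [htmlParser searchWithXPathQuery:@"//div[@class='bookcont']//a"];
    NSMutableArray<NSDictionary *> *links = [NSMutableArray array];
    for (TFHppleElement *anchor in anchors) {
        NSString *href = anchor.attributes[@"href"];
        NSString *title = anchor.content;
        // Skip anchors that lack either piece of information.
        if (href != nil && title != nil) {
            [links addObject:@{ @"title": title, @"href": href }];
        }
    }
    return links;
}
```

The href values in the dump are relative (for example /guwen/bookv_19.aspx), so they would still need to be resolved against the site's base URL before being requested.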

The properties above all relate to the HTML markup. In practice we usually use the content property and then process the resulting NSString object.
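As one possible way of processing such a content string, the sketch below splits it on whitespace and newlines and drops the empty pieces, which turns the chapter text shown under content into a list of titles. The helper name TitlesFromContent is hypothetical, and the approach assumes that no title contains whitespace itself.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits whitespace-separated text into
// non-empty titles, preserving their original order.
static NSArray<NSString *> *TitlesFromContent(NSString *content) {
    NSCharacterSet *separators = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSArray<NSString *> *parts = [content componentsSeparatedByCharactersInSet:separators];
    NSMutableArray<NSString *> *titles = [NSMutableArray array];
    for (NSString *part in parts) {
        // Consecutive separators produce empty strings; skip them.
        if (part.length > 0) {
            [titles addObject:part];
        }
    }
    return titles;
}
```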

This way we obtain the data and process it into the form we want. TFHppleElement is an important class, but its detailed usage is not covered here.