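jsoup 1.8.3 has been released. The main changes in this version are performance improvements when parsing large HTML files, automatic switching to the XML parser when fetching XML documents, and important bug fixes.
Improvements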
Performance improvement on parsing larger HTML pages: around 1.7x faster on Android KitKat, and around 1.3x faster on Android Lollipop. The improvements come largely from re-ordering the HtmlTreeBuilder methods based on analysis of various websites, and also from further memory reduction for nodes with no children, plus other tweaks.
When fetching XML URLs, automatically switch to the XML parser instead of the HTML parser.
Improved support for boolean attributes in HTML5.
When serialising XML, ensure that '<' characters in attributes are escaped, per spec. Not required in HTML.
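The difference between the two parsers can be seen without any network access. The following is a minimal sketch, assuming jsoup 1.8.3 or later is on the classpath; the class name `XmlParserSketch`, the helper `parseAsXml`, and the sample feed are illustrative and not part of the release. It selects the XML parser explicitly through `Parser.xmlParser()`, which is the parser the release notes say is now chosen automatically for XML URLs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlParserSketch {

    // Hypothetical helper: parse a string with the XML parser and
    // serialise it with XML syntax rules.
    static Document parseAsXml(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // Set explicitly so the sketch does not rely on a version-specific default.
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        return doc;
    }

    public static void main(String[] args) {
        String xml = "<feed><item title=\"a &lt; b\">one</item><item>two</item></feed>";

        Document asXml = parseAsXml(xml);
        Document asHtml = Jsoup.parse(xml); // default HTML parser

        // The XML parser keeps the tree as written; the HTML parser wraps
        // the content in html/head/body scaffolding.
        System.out.println("items (xml):  " + asXml.select("item").size());
        System.out.println("body (xml):   " + asXml.select("body").size());
        System.out.println("body (html):  " + asHtml.select("body").size());

        // The attribute value is decoded on read and re-escaped on output.
        System.out.println("title value:  " + asXml.select("item").first().attr("title"));
        System.out.println(asXml.outerHtml());
    }
}
```

When fetching over the network, the same parser can be requested explicitly with `Jsoup.connect(url).parser(Parser.xmlParser()).get()`, which is useful if a server reports an unexpected content type.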
Bug fixes
Fixed an issue in Element.elementSiblingIndex() (and related methods) where sibling elements with the same content would incorrectly have the same sibling index.
Fixed an issue where unexpected elements in a badly nested table could be moved to the wrong location in the document.
Fixed an issue where a table nested within a TH cell would parse to an incorrect tree.
When serializing a document using the XHTML encoding entities, if the character set did not support &nbsp; chars (such as Shift_JIS), the character would be skipped. For visibility, jsoup will now always output &#xa0; (the hex code for a non-breaking space) when using XHTML encoding entities (as &nbsp; is not defined there), regardless of the output character set.
Fixed an issue when resolving URLs, where if the absolute URL had no path, the relative URL was not normalized correctly.
Fixed an issue where connections that were redirected to a relative URL did not have the same normalization rules as a URL read from Node.absUrl(String).
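Two of the fixes above are easy to observe from calling code. The sketch below assumes jsoup 1.8.3 or later on the classpath; the class `BugFixSketch`, its helper methods, and the `example.com` URLs are hypothetical names chosen for illustration. It reads the sibling index of elements that share identical content, and resolves a relative link against a base URI with `absUrl`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class BugFixSketch {

    // Collect elementSiblingIndex() for every element matching the query.
    static List<Integer> siblingIndexes(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<Integer> indexes = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            indexes.add(el.elementSiblingIndex());
        }
        return indexes;
    }

    // Resolve the first link's href against the document's base URI.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Siblings with identical content should still get distinct indexes
        // (0, 1, 2 with the fix described above).
        String list = "<ul><li>same</li><li>same</li><li>same</li></ul>";
        System.out.println(siblingIndexes(list, "li"));

        // A base URI with a path, then one without a path (the case the fix targets).
        String link = "<a href=\"guide.html\">Guide</a>";
        System.out.println(firstLinkAbsUrl(link, "http://example.com/docs/"));
        System.out.println(firstLinkAbsUrl(link, "http://example.com"));
    }
}
```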
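jsoup is a Java HTML parser that can directly parse a URL or a string of HTML. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
The main features of jsoup are:
Parse HTML from a URL, file, or string;
Find and extract data using DOM traversal or CSS selectors;
Manipulate HTML elements, attributes, and text.
jsoup is released under the MIT license and can safely be used in commercial projects.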
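The three capabilities listed above can be sketched in a few lines. This is a minimal illustration, assuming jsoup is on the classpath; the class `FeatureSketch`, its helpers, and the sample markup are hypothetical. It parses from a string so that it runs offline; fetching a live page would use `Jsoup.connect(url).get()` instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FeatureSketch {

    // Parse HTML from a string and return the text matched by a CSS selector.
    static String textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).text();
    }

    // Set an attribute on every link and return the first link's value for it.
    static String tagLinks(String html, String attrName, String attrValue) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr(attrName, attrValue);
        Element first = doc.select("a[href]").first();
        return first == null ? "" : first.attr(attrName);
    }

    public static void main(String[] args) {
        String html = "<div><p class=\"intro\">Hello, jsoup</p>"
                + "<a href=\"/about\">About</a></div>";

        // Find and extract data with a CSS selector.
        System.out.println(textOf(html, "p.intro"));

        // Manipulate attributes on the matched elements.
        System.out.println(tagLinks(html, "rel", "nofollow"));
    }
}
```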