
crawler4j: a lightweight multi-threaded web crawler

crawler4j is an open source web crawler written in Java. It provides a simple, easy-to-use interface with which you can set up a multi-threaded web crawler in a few minutes.


Installation

Using Maven

To use the latest release of crawler4j, add the following snippet to your pom.xml:

XML

<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>

Without Maven

The crawler4j JAR can be downloaded from the releases page or from Maven Central. Note that crawler4j depends on several other packages; the crawler4j-X.Y-with-dependencies.jar on the releases page bundles crawler4j together with all of its dependencies, so you can download it and add it to your classpath.

Quick start

To use crawler4j, create a crawler class that extends WebCrawler. Here is a simple example:

Java

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

The example above overrides two main methods:

  • shouldVisit: decides which URLs should be crawled. In this example, only pages within the "www.ics.uci.edu" domain are allowed, while .css, .js, and media files are excluded.
  • visit: called once a URL has been downloaded. You can easily access the downloaded page's URL, text, outgoing links, HTML, and unique document id.

You also need to implement a controller class that specifies the crawl seeds, the folder for storing intermediate crawl data, and the number of concurrent threads:

Java

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages.
         */
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Examples

  • Basic crawler: the full source code and details of the example above.
  • Image crawler: a simple image crawler that downloads images from a given domain and stores them in a specified folder. It demonstrates how to crawl binary content with crawler4j.
  • Collecting data from threads: demonstrates how the controller can collect data/statistics from the crawler threads.
  • Multiple crawlers: demonstrates how to run two distinct crawlers at the same time.
  • Shutdown crawling: demonstrates how a crawl can be stopped gracefully by sending a "shutdown" command to the controller (a minimal sketch follows this list).
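
The graceful-shutdown pattern from the last example boils down to starting the controller in non-blocking mode and later asking it to stop. Below is a minimal sketch using the CrawlController methods startNonBlocking, shutdown, and waitUntilFinish, reusing MyCrawler, controller, and numberOfCrawlers from the quick-start example; the 30-second sleep is only a placeholder for whatever actually triggers the shutdown in your application.

Java

// Start the crawler threads without blocking the current thread.
controller.startNonBlocking(MyCrawler.class, numberOfCrawlers);

// Decide elsewhere when to stop, e.g. after a fixed time or an external signal
// (the fixed sleep here is just a placeholder trigger).
Thread.sleep(30_000);

// Ask all crawler threads to stop gracefully, then wait until they have terminated.
controller.shutdown();
controller.waitUntilFinish();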

Configuration details

The controller class must be given a parameter of type CrawlConfig, which configures crawler4j. The sections below describe the main configuration options.
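
For orientation, here is a combined sketch that sets, on a single CrawlConfig, the options covered in the subsections below; the concrete values are placeholders, not recommendations.

Java

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");  // folder for intermediate crawl data
config.setMaxDepthOfCrawling(2);                   // maximum crawl depth
config.setMaxPagesToFetch(1000);                   // maximum number of pages to fetch
config.setPolitenessDelay(200);                    // delay between requests, in milliseconds
config.setResumableCrawling(false);                // resume a stopped/crashed crawl
config.setUserAgentString("crawler4j (https://github.com/yasserg/crawler4j/)");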

Crawl depth

By default there is no limit on crawl depth, but you can configure one. For example, suppose seed page A links to B, B links to C, and C links to D. The structure looks like this:


A --> B --> C --> D

Seed page A has depth 0, B has depth 1, and C and D follow accordingly. For example, if the maximum crawl depth is set to 2, page D will not be crawled. Set the maximum crawl depth with:

Java

crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);

Maximum number of pages to fetch

By default there is no limit on the number of pages to fetch; you can set one as follows:

Java

crawlConfig.setMaxPagesToFetch(maxPagesToFetch);

Politeness delay

crawler4j is efficient and can crawl very fast (e.g., 200 Wikipedia pages per second). However, this puts a heavy load on servers (and they might block your requests!). Therefore, since version 1.3, crawler4j waits at least 200 milliseconds between requests by default. This delay can be adjusted:

Java

crawlConfig.setPolitenessDelay(politenessDelay);

Proxy

Use the following code to configure your crawler to run behind a proxy:

Java

crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);

If your proxy requires authentication:

Java

crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);

Resumable crawling

Sometimes a crawl needs to run for a long time and may be terminated unexpectedly. In that case, a stopped or crashed crawl can be resumed with the following setting:

Java

crawlConfig.setResumableCrawling(true);

Note, however, that this may make crawling slightly slower.

User-agent string

The User-agent string identifies your crawler to web servers (see the User-agent reference for details). By default, crawler4j uses the following string: "crawler4j (https://github.com/yasserg/crawler4j/)". You can change it via the configuration:

Java

crawlConfig.setUserAgentString(userAgentString);

License

Copyright (c) 2010-2015 Yasser Ganjisaffar

Released under the Apache License 2.0.

Source code: https://github.com/yasserg/crawler4j
