
crawler4j: a lightweight multi-threaded web crawler

crawler4j is an open source web crawler written in Java. It provides a simple, easy-to-use interface with which you can set up a multi-threaded web crawler in a few minutes.


Installation

Using Maven

To use the latest release of crawler4j, add the following snippet to your pom.xml:

XML

<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>

Without Maven

The crawler4j JAR can be downloaded from the releases page or from Maven Central. Note that crawler4j depends on several other packages; the crawler4j-X.Y-with-dependencies.jar on the releases page bundles crawler4j together with all of its dependencies, so you can download it and add it to your classpath.

Quick start

To use crawler4j, create a crawler class that extends WebCrawler. Here is a simple example:

Java

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

The example above overrides two main methods:

  • shouldVisit: decides which URLs should be crawled. In this example, only pages within the "www.ics.uci.edu" domain are allowed, while .css, .js, and media files are excluded.
  • visit: called once a URL has been downloaded. You can easily access the downloaded page's URL, text, outgoing links, HTML, and unique document id.

You also need to implement a controller class that specifies the crawl seeds, the folder for storing intermediate crawl data, and the number of concurrent threads:

Java

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages.
         */
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

Examples

  • Basic crawler: the full source code and details of the example above.
  • Image crawler: a simple image crawler that downloads images from a given domain and stores them in a specified folder. It demonstrates how to crawl binary content with crawler4j.
  • Collecting data from threads: demonstrates how the controller can collect data/statistics from the crawler threads.
  • Multiple crawlers: demonstrates how to run two distinct crawlers at the same time.
  • Shutdown crawling: demonstrates how a crawl can be stopped gracefully by sending a "shutdown" command to the controller (a minimal sketch follows this list).
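
The graceful-shutdown pattern from the last example boils down to starting the controller in non-blocking mode and later asking it to stop. Below is a minimal sketch using the CrawlController methods startNonBlocking, shutdown, and waitUntilFinish, reusing MyCrawler, controller, and numberOfCrawlers from the quick-start example; the 30-second sleep is only a placeholder for whatever actually triggers the shutdown in your application.

Java

// Start the crawler threads without blocking the current thread.
controller.startNonBlocking(MyCrawler.class, numberOfCrawlers);

// Decide elsewhere when to stop, e.g. after a fixed time or an external signal
// (the fixed sleep here is just a placeholder trigger).
Thread.sleep(30_000);

// Ask all crawler threads to stop gracefully, then wait until they have terminated.
controller.shutdown();
controller.waitUntilFinish();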

Configuration details

The controller class must be given a parameter of type CrawlConfig, which configures crawler4j. The sections below describe the main configuration options.
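
For orientation, here is a combined sketch that sets, on a single CrawlConfig, the options covered in the subsections below; the concrete values are placeholders, not recommendations.

Java

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");  // folder for intermediate crawl data
config.setMaxDepthOfCrawling(2);                   // maximum crawl depth
config.setMaxPagesToFetch(1000);                   // maximum number of pages to fetch
config.setPolitenessDelay(200);                    // delay between requests, in milliseconds
config.setResumableCrawling(false);                // resume a stopped/crashed crawl
config.setUserAgentString("crawler4j (https://github.com/yasserg/crawler4j/)");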

Crawl depth

By default there is no limit on crawl depth, but you can configure one. For example, suppose seed page A links to B, B links to C, and C links to D. The structure looks like this:


A --> B --> C --> D

Seed page A has depth 0, B has depth 1, and C and D follow accordingly. For example, if the maximum crawl depth is set to 2, page D will not be crawled. Set the maximum crawl depth with:

Java

crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);

Maximum number of pages to fetch

By default there is no limit on the number of pages to fetch; you can set one as follows:

Java

crawlConfig.setMaxPagesToFetch(maxPagesToFetch);

Politeness delay

crawler4j is efficient and can crawl very fast (e.g., 200 Wikipedia pages per second). However, this puts a heavy load on servers (and they might block your requests!). Therefore, since version 1.3, crawler4j waits at least 200 milliseconds between requests by default. This delay can be adjusted:

Java

crawlConfig.setPolitenessDelay(politenessDelay);

Proxy

Use the following code to configure your crawler to run behind a proxy:

Java

crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);

If your proxy requires authentication:

Java

crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);

Resumable crawling

Sometimes a crawl needs to run for a long time and may be terminated unexpectedly. In that case, a stopped or crashed crawl can be resumed with the following setting:

Java

crawlConfig.setResumableCrawling(true);

Note, however, that this may make crawling slightly slower.

User-agent string

The User-agent string identifies your crawler to web servers (see the User-agent reference for details). By default, crawler4j uses the following string: "crawler4j (https://github.com/yasserg/crawler4j/)". You can change it via the configuration:

Java

crawlConfig.setUserAgentString(userAgentString);

License

Copyright (c) 2010-2015 Yasser Ganjisaffar

Released under the Apache License 2.0.

Source code: https://github.com/yasserg/crawler4j
