crawler4j is an open-source web crawler implemented in Java. It provides a simple, easy-to-use interface with which you can set up a multi-threaded web crawler in a few minutes.
To use the latest release of crawler4j, add the following snippet to your pom.xml:
XML
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>
The crawler4j JAR can also be downloaded from the releases page or from Maven Central. Note that crawler4j itself has several dependencies. The crawler4j-X.Y-with-dependencies.jar on the releases page bundles crawler4j together with all of its dependencies, so you can download that file and add it to your classpath.
To use crawler4j you need to create a crawler class that extends WebCrawler. Here is a simple example:
Java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                            + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
The example above overrides two main methods: shouldVisit, which decides whether a given URL should be crawled (here, URLs matching the static-file extension filter are ignored and only URLs starting with "http://www.ics.uci.edu/" are accepted), and visit, which is called after a page has been fetched and parsed (here it prints the URL, the length of the extracted text and HTML, and the number of outgoing links).
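For illustration only, the decision logic in shouldVisit can also take the referringPage parameter into account, which the example above deliberately ignores. The following hypothetical sketch follows a link only if it was discovered on a page of the seed site; it is an illustrative variant, not part of the official example:

Java

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    // Illustrative variant (not from the official example): apply the same
    // extension filter, but additionally require that the link was discovered
    // on a page of the seed site itself.
    String href = url.getURL().toLowerCase();
    boolean foundOnSeedSite = referringPage != null
            && referringPage.getWebURL().getURL().startsWith("http://www.ics.uci.edu/");
    return !FILTERS.matcher(href).matches() && foundOnSeedSite;
}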
You also need to implement a controller class that specifies the seed URLs for the crawl, the folder for intermediate crawl data, and the number of concurrent threads:
Java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
The controller class takes a parameter of type CrawlConfig, which configures crawler4j. The following sections describe some of the configuration details.
By default there is no limit on crawl depth, but you can set one in the configuration. For example, suppose a seed page A links to B, B links to C, and C links to D, giving the following structure:
A --> B --> C --> D
The seed page A is at depth 0, B at depth 1, C at depth 2, and D at depth 3. If you set the maximum crawl depth to 2, page D will not be crawled. The maximum depth is configured as follows:
Java
crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);
By default there is no limit on the number of pages to fetch; you can set one as follows:
Java
crawlConfig.setMaxPagesToFetch(maxPagesToFetch);
crawler4j is designed to be efficient and can crawl very fast (for example, around 200 Wikipedia pages per second). Crawling that fast, however, puts a heavy load on the target servers (and they may block your requests!). Since version 1.3, crawler4j therefore waits at least 200 milliseconds between requests by default. This parameter can be changed:
Java
crawlConfig.setPolitenessDelay(politenessDelay);
Use the following to make the crawler connect through a proxy:
Java
crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);
If your proxy requires authentication:
Java
crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);
Sometimes a crawl needs to run for a long time and may terminate unexpectedly partway through. In that case, the following setting allows a stopped or crashed crawl to be resumed:
Java
crawlConfig.setResumableCrawling(true);
However, enabling this may make crawling slightly slower.
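As a rough sketch of how resuming fits together (the folder path is just an example value): keep the same crawl storage folder across runs so the intermediate data from the interrupted run can be reused, and enable the resumable flag before starting the controller.

Java

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root"); // same folder as the interrupted run (example path)
config.setResumableCrawling(true);                // reuse the intermediate data stored in that folder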
The User-agent string identifies your crawler to web servers (see the general documentation on User-agent strings for details). By default crawler4j uses the string "crawler4j (https://github.com/yasserg/crawler4j/)". You can override it in the configuration:
Java
crawlConfig.setUserAgentString(userAgentString);
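Putting the options above together, a combined configuration might look like the following sketch (all values are illustrative assumptions, not recommendations):

Java

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/data/crawl/root");   // folder for intermediate crawl data
config.setMaxDepthOfCrawling(2);                    // crawl at most 2 links away from the seeds
config.setMaxPagesToFetch(1000);                    // stop after fetching 1000 pages
config.setPolitenessDelay(1000);                    // wait 1 second between requests
config.setResumableCrawling(true);                  // allow resuming a stopped/crashed crawl
config.setUserAgentString("my-crawler (https://www.example.com/)"); // example user-agent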
Copyright (c) 2010-2015 Yasser Ganjisaffar
Released under the Apache License 2.0.
Source repository: https://github.com/yasserg/crawler4j