Reposted

Creeper: a next-generation open-source crawler framework based on simple scripts

About

Creeper is a next-generation crawler that fetches web pages according to a Creeper script. It is a cross-platform, embeddable crawler, so you can use it in a news app, a subscription program, and similar projects.

Warning: This project is still in stage-1 development. Please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

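func main() {
    // Load the script that describes what to crawl.
    c := creeper.Open("./hacker_news.crs")
    // Visit every item of the news[] array node.
    c.Array("news").Each(func(c *creeper.Creeper) {
        // Read the fields declared under news[] in the script.
        println("title: ", c.String("title"))
        println("site: ", c.String("site"))
        println("link: ", c.String("link"))
        println("===")
    })
}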

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression that stores an (im)mutable string. Most of the time it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

To use a town, invoke it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

Hey, you might have noticed that the @page parameter is not passed in the call above. Yeah, it is a special parameter.

In a town definition line, an expression like name="something" means that the parameter name has the default value "something".
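To make the placeholder and default-value rules concrete, here is a minimal Go sketch of how a town template could be expanded. It is not Creeper's actual implementation: the helper name expandTown and the map-based parameter passing are assumptions made for illustration, and URL encoding of values is left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {name} or {@name} inside a town template.
var placeholder = regexp.MustCompile(`\{(@?\w+)\}`)

// expandTown is a hypothetical helper (not part of Creeper's API).
// It fills each placeholder from args, falls back to defaults,
// and substitutes an empty string when neither provides a value.
func expandTown(template string, defaults, args map[string]string) string {
	return placeholder.ReplaceAllStringFunc(template, func(m string) string {
		name := m[1 : len(m)-1] // strip the surrounding braces
		if v, ok := args[name]; ok {
			return v
		}
		return defaults[name] // "" when no default exists
	})
}

func main() {
	tmpl := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	defaults := map[string]string{"@page": "1"}
	args := map[string]string{"ext": "hello"}
	fmt.Println(expandTown(tmpl, defaults, args))
	// prints "https://news.ycombinator.com/news?p=1&ext=hello"
}
```

Explicit arguments win over defaults in this sketch, which mirrors how the call page(ext="...") leaves @page at its default of 1.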

Incidentally, @page is a parameter that is incremented automatically when the current page has no more content.
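The auto-incrementing behaviour can be pictured as a simple paging loop. The sketch below is an illustration only: fetchItems is a stand-in that fakes three pages instead of doing HTTP requests, and stopping at the first empty page is an assumption, since the text above does not say when Creeper stops.

```go
package main

import "fmt"

// fetchItems is a hypothetical stand-in for fetching and parsing one page.
// It pretends that only pages 1 to 3 contain items.
func fetchItems(page int) []string {
	if page > 3 {
		return nil
	}
	return []string{
		fmt.Sprintf("item %d-a", page),
		fmt.Sprintf("item %d-b", page),
	}
}

// crawlAll consumes the items of the current page, then moves on to the
// next page number, and stops at the first page that yields nothing.
func crawlAll(start int, fetch func(int) []string) []string {
	var all []string
	for page := start; ; page++ {
		items := fetch(page)
		if len(items) == 0 {
			break
		}
		all = append(all, items...)
	}
	return all
}

func main() {
	for _, item := range crawlAll(1, fetchItems) {
		fmt.Println(item)
	}
}
```

The caller never handles the page number itself, which is the point of a special parameter such as @page.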

Node

Nodes form a tree that represents the data structure you are going to crawl.

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

As in YAML, the hierarchy of nodes is expressed by indentation.
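As a rough illustration of indentation-based nesting, the following Go sketch turns indented lines into a tree. The node type and parseIndented function are hypothetical and unrelated to Creeper's real parser, and the sketch only counts leading spaces (no tabs).

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical tree element: one script line plus its children.
type node struct {
	text     string
	children []*node
}

// parseIndented builds a tree in which a line indented deeper than an
// earlier line becomes a descendant of that line.
func parseIndented(src string) *node {
	root := &node{}
	// stack holds the current chain of open nodes; indents holds their indentation.
	stack := []*node{root}
	indents := []int{-1}
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimLeft(line, " ")
		if trimmed == "" {
			continue // skip blank lines
		}
		indent := len(line) - len(trimmed)
		// Close every open node that is not shallower than this line.
		for indents[len(indents)-1] >= indent {
			stack = stack[:len(stack)-1]
			indents = indents[:len(indents)-1]
		}
		n := &node{text: trimmed}
		parent := stack[len(stack)-1]
		parent.children = append(parent.children, n)
		stack = append(stack, n)
		indents = append(indents, indent)
	}
	return root
}

func main() {
	src := "news[]: page -> $(\"tr.athing\")\n" +
		"    title: $(\".title a.storylink\").text\n" +
		"    site: $(\".title span.sitestr\").text\n"
	for _, top := range parseIndented(src).children {
		fmt.Println(top.text)
		for _, child := range top.children {
			fmt.Println("  child:", child.text)
		}
	}
}
```

Here title and site end up as children of news[], matching the hierarchy in the script above.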

Node Name

Every node has a name. title is a field name and represents a plain string value. news[] is an array name and represents a parent structure that contains multiple child items.

Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference.

Field reference is an advanced usage of Node; you can find the details in ./eh.crs .

If a node has both a page and a fun, the page goes on the left of -> and the fun goes on the right, that is: page -> fun
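The page -> fun layout can be sketched as a simple split on the arrow. The helper below is hypothetical and is not taken from Creeper's source. It assumes that a value without -> consists of a fun only, and it does not handle a -> that appears inside a quoted string.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNodeValue is a hypothetical helper that separates the page part
// from the fun part of a node value. Without "->", the whole value is
// treated as the fun and page is empty (an assumption of this sketch).
func splitNodeValue(value string) (page, fun string) {
	if left, right, found := strings.Cut(value, "->"); found {
		return strings.TrimSpace(left), strings.TrimSpace(right)
	}
	return "", strings.TrimSpace(value)
}

func main() {
	page, fun := splitNodeValue(`page -> $("tr.athing")`)
	fmt.Printf("page=%q fun=%q\n", page, fun)

	page, fun = splitNodeValue(`$(".title a.storylink").text`)
	fmt.Printf("page=%q fun=%q\n", page, fun)
}
```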

Fun

A fun describes how the fetched data is processed.

All supported funs are listed below:

Name	Parameters	Description
$	(selector: string)	CSS selector
html		inner HTML
text		inner text
outerHTML		outer HTML
attr	(attr: string)	attribute value
style		style attribute value
href		href attribute value
src		src attribute value
calc	(prec: int)	calculate arithmetic expression
match	(regexp: string)	match first sub-string via regular expression
expand	(regexp: string, target: string)	expand matched strings to target string
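To give a feel for what match and expand do, here is a sketch built on Go's standard regexp package. The exact semantics of Creeper's funs may differ: the helper names matchFirst and expandFirst, the $1-style target syntax, and returning an empty string on no match are all assumptions of this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s matched by pattern,
// or "" when nothing matches.
func matchFirst(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expandFirst rewrites the first match of pattern in s using target,
// where $1, $2, ... refer to capture groups. It returns "" when
// nothing matches.
func expandFirst(pattern, target, s string) string {
	re := regexp.MustCompile(pattern)
	idx := re.FindStringSubmatchIndex(s)
	if idx == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, idx))
}

func main() {
	fmt.Println(matchFirst(`\d+`, "123 points by alice"))
	// prints "123"
	fmt.Println(expandFirst(`(\d+) points by (\w+)`, "$2:$1", "123 points by alice"))
	// prints "alice:123"
}
```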