A Node.js scraper for humans.
```sh
$ npm i --save scrape-it
```
```js
const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {

            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }

            // Get the title
          , title: "a.article-title"

            // Nested list
          , tags: {
                listItem: ".tags > span"
            }

            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }

    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }

    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer, Linux geek and Musician',
//   avatar: '/images/logo.png' }
```
`scrapeIt(url, opts, cb)`

A scraping module for humans.

- `url`: The page url or request options (see the sketch after this list).
- `opts`: The options passed to the `scrapeHTML` method.
- `cb`: The callback function.
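Since `url` may also be "request options" rather than a plain string, here is a minimal sketch of that form. The exact option names (`url`, `headers`) depend on the underlying request library and are assumptions here, not part of the documented API:

```js
const scrapeIt = require("scrape-it");

// Assumed request-options shape: a `url` field plus extra options
// forwarded to the underlying HTTP request (field names are illustrative).
scrapeIt({
    url: "http://ionicabizau.net"
  , headers: { "User-Agent": "scrape-it example" }
}, {
    title: ".header h1"
}).then(page => {
    console.log(page.title);
});
```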
`scrapeIt.scrapeHTML($, opts)`
Scrapes the data in the provided element.
- `$`: The input element.
- `opts` (Object): An object containing the scraping information. If you want to scrape a list, you have to use the `listItem` selector:
   - `listItem` (String): The list item selector.
   - `data` (Object): The fields to include in the list objects:
      - `<fieldName>` (Object|String): The selector or an object containing:
         - `selector` (String): The selector.
         - `convert` (Function): An optional function to change the value.
         - `how` (Function|String): A function or function name to access the value.
         - `attr` (String): If provided, the value will be taken based on the attribute name.
         - `trim` (Boolean): If `false`, the value will not be trimmed (default: `true`).
         - `eq` (Number): If provided, it will select the *nth* element.
         - `listItem` (Object): An object keeping the recursive schema of the `listItem` object. This can be used to create nested lists.

```js
{
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                listItem: ".tags > span"
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
}
```
If you want to collect specific data from the page, just use the same schema used for the `data` field.

```js
{
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}
```
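`scrapeIt.scrapeHTML` can also be called directly when you already have the markup. Below is a minimal sketch, assuming `$` is a cheerio instance (e.g. from `cheerio.load`) and that the method returns the scraped object synchronously; cheerio and the sample HTML are assumptions for illustration, not part of the documented API above:

```js
const cheerio = require("cheerio");
const scrapeIt = require("scrape-it");

const html = `
  <div class="header">
    <h1>Ionică Bizău</h1>
    <h2>Web Developer, Linux geek and Musician</h2>
    <img src="/images/logo.png">
  </div>
`;

// Parse the markup ourselves and pass the cheerio instance to scrapeHTML,
// reusing the same schema shown above.
const $ = cheerio.load(html);
const page = scrapeIt.scrapeHTML($, {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
});

console.log(page);
// => { title: 'Ionică Bizău',
//      desc: 'Web Developer, Linux geek and Musician',
//      avatar: '/images/logo.png' }
```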
Have an idea? Found a bug? See how to contribute.
If you are using this library in one of your projects, add it to this list.
- `ui-studentsearch` (by Rakha Kanz Kautsar): API for majapahit.cs.ui.ac.id/studentsearch

MIT © Ionică Bizău