A Node.js scraper for humans.
## Installation

```sh
$ npm i --save scrape-it
```
## Example

```js
const scrapeIt = require("scrape-it");

// Promise interface
scrapeIt("http://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(page => {
    console.log(page);
});

// Callback interface
scrapeIt("http://ionicabizau.net", {
    // Fetch the articles
    articles: {
        listItem: ".article"
      , data: {
            // Get the article date and convert it into a Date object
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
            // Get the title
          , title: "a.article-title"
            // Nested list
          , tags: {
                listItem: ".tags > span"
            }
            // Get the content
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
    // Fetch the blog pages
  , pages: {
        listItem: "li.page"
      , name: "pages"
      , data: {
            title: "a"
          , url: {
                selector: "a"
              , attr: "href"
            }
        }
    }
    // Fetch some other data from the page
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ],
//   pages:
//    [ { title: 'Blog', url: '/' },
//      { title: 'About', url: '/about' },
//      { title: 'FAQ', url: '/faq' },
//      { title: 'Training', url: '/training' },
//      { title: 'Contact', url: '/contact' } ],
//   title: 'Ionică Bizău',
//   desc: 'Web Developer, Linux geek and Musician',
//   avatar: '/images/logo.png' }
```
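Because `scrapeIt` returns a promise, the same call also works with `async`/`await`. A minimal sketch of that usage follows; the `getHomepage` wrapper name is illustrative only and not part of the library:

```js
const scrapeIt = require("scrape-it");

// Minimal async/await wrapper around the promise interface shown above.
// `getHomepage` is an illustrative name, not part of the library.
async function getHomepage() {
    const page = await scrapeIt("http://ionicabizau.net", {
        title: ".header h1"
      , desc: ".header h2"
    });
    return page;
}

getHomepage()
    .then(page => console.log(page))
    .catch(err => console.error(err));
```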
## Documentation

### `scrapeIt(url, opts, cb)`
A scraping module for humans.
- `url`: The page url or request options (see the sketch below).
- `opts`: The options passed to the `scrapeHTML` method.
- `cb`: The callback function.
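This README does not spell out the shape of the request options object. As a hedged sketch, assuming it follows the common request-options convention of a `url` field plus extra HTTP settings such as `headers` (these field names are assumptions, not confirmed API), a call might look like this:

```js
const scrapeIt = require("scrape-it");

// Hedged sketch: the exact shape of the request options object is not
// documented here; `url` and `headers` follow common request-options
// conventions and are assumptions, not confirmed API.
scrapeIt({
    url: "http://ionicabizau.net"
  , headers: { "user-agent": "scrape-it example" }
}, {
    title: ".header h1"
}).then(page => {
    console.log(page);
});
```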
### `scrapeIt.scrapeHTML($, opts)`
Scrapes the data in the provided element.

- `$`: The input element.
- `opts` (Object): An object containing the scraping information. If you want to scrape a list, you have to use the `listItem` selector:
  - `listItem` (String): The list item selector.
  - `data` (Object): The fields to include in the list objects:
    - `<fieldName>` (Object|String): The selector or an object containing:
      - `selector` (String): The selector.
      - `convert` (Function): An optional function to change the value.
      - `how` (Function|String): A function or function name to access the value.
      - `attr` (String): If provided, the value will be taken based on the attribute name.
      - `trim` (Boolean): If `false`, the value will not be trimmed (default: `true`).
      - `eq` (Number): If provided, it will select the nth element.
      - `listItem` (Object): An object keeping the recursive schema of the `listItem` object. This can be used to create nested lists.

```js
{
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                listItem: ".tags > span"
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
}
```

If you want to collect specific data from the page, just use the same schema used for the `data` field.
```js
{
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}
```
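To see `scrapeIt.scrapeHTML` in isolation, here is a hedged sketch that parses a static HTML string with cheerio and exercises the `convert`, `attr`, `eq` and `trim` options described above. Passing a cheerio instance as `$`, a synchronous data-object return value, and 0-based `eq` indexing are all assumptions, not confirmed by this README:

```js
const cheerio = require("cheerio");
const scrapeIt = require("scrape-it");

// A small, static HTML snippet to scrape, instead of fetching a live page.
const html = `
  <div class="header">
    <h1> Ionică Bizău </h1>
    <img src="/images/logo.png">
    <a href="/about">About</a>
    <a href="/faq">FAQ</a>
  </div>
`;

// Assumption: $ can be any cheerio instance, and scrapeHTML returns the
// scraped data object directly.
const $ = cheerio.load(html);
const data = scrapeIt.scrapeHTML($, {
    // trim defaults to true, so the whitespace around the heading is removed
    // before convert runs
    title: { selector: ".header h1", convert: x => x.toUpperCase() }
    // attr takes the value from an attribute instead of the element text
  , avatar: { selector: ".header img", attr: "src" }
    // eq picks the nth matched element (0-based indexing is an assumption)
  , secondLink: { selector: ".header a", eq: 1 }
});

console.log(data);
// Expected shape, assuming the behaviour above:
// { title: 'IONICĂ BIZĂU', avatar: '/images/logo.png', secondLink: 'FAQ' }
```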
## How to contribute
Have an idea? Found a bug? See how to contribute.
## Where is this library used?
If you are using this library in one of your projects, add it to this list.
- `ui-studentsearch` (by Rakha Kanz Kautsar) - API for majapahit.cs.ui.ac.id/studentsearch
## License
MIT © Ionică Bizău