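Reposted

Scrapy crawler growth diary: creating a project, extracting data, and saving it as JSON

Environment: CentOS 6.0 virtual machine

Scrapy (if it is not installed yet, see "Pitfalls I hit installing the Python crawler Scrapy, and some thoughts beyond programming")

1. Create the cnblogs project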

[root@bogon share]# scrapy startproject cnblogs
2015-06-10 15:45:03 [scrapy] INFO: Scrapy 1.0.0rc2 started (bot: scrapybot)
2015-06-10 15:45:03 [scrapy] INFO: Optional features available: ssl, http11
2015-06-10 15:45:03 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'cnblogs' created in:
    /mnt/hgfs/share/cnblogs

You can start your first spider with:
    cd cnblogs
    scrapy genspider example example.com

2. Take a look at the project structure

[root@bogon share]# tree cnblogs/
cnblogs/
├── cnblogs
│   ├── __init__.py
│   ├── items.py        # defines the structure of the data to extract
│   ├── pipelines.py    # processes the extracted data
│   ├── settings.py     # crawler settings
│   └── spiders
│       └── __init__.py
└── scrapy.cfg          # project configuration file

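3. Define the structure of the data to extract from cnblogs by editing items.py

Here we extract four pieces of content:

  • Article title
  • Article link
  • URL of the list page the article appears on
  • Summary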
[root@bogon cnblogs]# vi cnblogs/items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class CnblogsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()    # article title
        link = scrapy.Field()     # article link
        desc = scrapy.Field()     # summary
        listUrl = scrapy.Field()  # URL of the list page the article appears on

4. Create the spider

[root@bogon cnblogs]# vi cnblogs/spiders/cnblogs_spider.py

    #coding=utf-8
    import re
    import json
    from scrapy.selector import Selector
    try:
        from scrapy.spider import Spider
    except ImportError:
        from scrapy.spider import BaseSpider as Spider
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor as sle
    from cnblogs.items import *


    class CnblogsSpider(CrawlSpider):
        # Name of the spider
        name = "CnblogsSpider"
        # Domains the spider may crawl; links to any other domain are skipped
        allowed_domains = ["cnblogs.com"]
        # Entry URL for the crawl
        start_urls = [
            "http://www.cnblogs.com/rwxwsblog/default.html?page=1"
        ]
        # Rules for which URLs to follow, with parse_item as the callback
        rules = [
            # Note the escaped "?": if you copy this pattern, keep the backslashes.
            Rule(sle(allow=(r"/rwxwsblog/default.html\?page=\d{1,}")),
                 follow=True,
                 callback='parse_item')
        ]
        #print "**********CnblogsSpider**********"

        # Callback function.
        # Extracts data into items, mainly using XPath and CSS selectors.
        def parse_item(self, response):
            #print "-----------------"
            items = []
            sel = Selector(response)
            base_url = get_base_url(response)
            postTitle = sel.css('div.day div.postTitle')
            #print "=============length======="
            postCon = sel.css('div.postCon div.c_b_p_desc')
            # Titles, URLs, and descriptions are only loosely paired by index;
            # this could be improved later.
            for index in range(len(postTitle)):
                item = CnblogsItem()
                item['title'] = postTitle[index].css("a").xpath('text()').extract()[0]
                #print item['title'] + "***************\r\n"
                item['link'] = postTitle[index].css('a').xpath('@href').extract()[0]
                item['listUrl'] = base_url
                item['desc'] = postCon[index].xpath('text()').extract()[0]
                #print base_url + "********\n"
                items.append(item)
                #print repr(item).decode("unicode-escape") + '\n'
            return items

Note:

The first line of the file must be #coding=utf-8 or # -*- coding: utf-8 -*-, otherwise Python reports an error:

SyntaxError: Non-ASCII character '\xe5' in file /mnt/hgfs/share/cnblogs/cnblogs/spiders/cnblogs_spider.py on line 15, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The spider's name is CnblogsSpider; it is used again later.
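The escaping of "?" in the rule is easy to get wrong, so here is a small stand-alone check. It is only a sketch: it uses Python's re module directly and assumes the link extractor applies the allow pattern as a regular-expression search against each URL. The helper name is_list_page and the sample URLs are made up for illustration, and the snippet targets Python 3.

```python
import re

# The same pattern as in the spider's Rule: "?" and "d" are escaped with backslashes.
PAGE_PATTERN = re.compile(r"/rwxwsblog/default.html\?page=\d{1,}")

# The pattern with "?" left unescaped, where it means "the previous character is optional".
UNESCAPED_PATTERN = re.compile(r"/rwxwsblog/default.html?page=\d{1,}")


def is_list_page(url):
    """Return True if the URL looks like one of the blog's paginated list pages.

    Hypothetical helper, not part of Scrapy or of the original project.
    """
    return PAGE_PATTERN.search(url) is not None


# Illustrative URLs only.
list_url = "http://www.cnblogs.com/rwxwsblog/default.html?page=2"
post_url = "http://www.cnblogs.com/rwxwsblog/p/4557123.html"

print(is_list_page(list_url))
print(is_list_page(post_url))
# Without the backslash, the literal "?" in the URL is never matched.
print(UNESCAPED_PATTERN.search(list_url) is not None)
```

With the backslash the list page is recognized; without it the pattern never matches the literal question mark in the URL, so the rule would follow no pagination links at all.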

5. Edit pipelines.py

[root@bogon cnblogs]# vi cnblogs/pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    from scrapy import signals
    import json
    import codecs


    class JsonWithEncodingCnblogsPipeline(object):
        def __init__(self):
            # Output file; opened as UTF-8 so Chinese text is written unescaped
            self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # Write one JSON object per line
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item

        # Scrapy calls close_spider on a pipeline when the spider finishes.
        # A method named spider_closed is only called if it is connected
        # to the spider_closed signal.
        def close_spider(self, spider):
            self.file.close()

Note that the class name is JsonWithEncodingCnblogsPipeline; it is referenced in settings.py.
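To see what the pipeline's output looks like without running a crawl, the writing step can be tried on its own. This is a minimal sketch for Python 3 using only the standard library; the function name write_json_lines and the sample item are assumptions for illustration, and plain dicts stand in for CnblogsItem objects.

```python
import codecs
import json


def write_json_lines(items, path):
    """Write each item as one JSON object per line, UTF-8 encoded.

    Hypothetical helper that mirrors process_item in the pipeline;
    it is not part of Scrapy.
    """
    with codecs.open(path, "w", encoding="utf-8") as f:
        for item in items:
            # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
            f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")


# Made-up sample item with the same four fields as CnblogsItem.
sample = {
    "title": "爬虫",
    "link": "http://www.cnblogs.com/rwxwsblog/p/example.html",
    "listUrl": "http://www.cnblogs.com/rwxwsblog/default.html?page=1",
    "desc": "摘要",
}

# Compare the default escaping with ensure_ascii=False.
print(json.dumps({"title": sample["title"]}))
print(json.dumps({"title": sample["title"]}, ensure_ascii=False))
```

Calling write_json_lines([sample], "sample.json") would produce a file with one line per item, which is the same layout the pipeline gives cnblogs.json.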

6. Edit settings.py and add the following two settings

    ITEM_PIPELINES = {
        'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    }
    LOG_LEVEL = 'INFO'
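The number 300 is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest value to the highest, and values are conventionally chosen between 0 and 1000. The sketch below only illustrates that ordering rule and is not Scrapy's internal code; DropEmptyTitlePipeline is a hypothetical second pipeline that does not exist in this project.

```python
# Sketch only: shows the ordering rule, not how Scrapy loads pipelines.
ITEM_PIPELINES = {
    "cnblogs.pipelines.JsonWithEncodingCnblogsPipeline": 300,
    # Hypothetical pipeline, added here purely for illustration.
    "cnblogs.pipelines.DropEmptyTitlePipeline": 100,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted by ascending order value."""
    return [path for path, order in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

With only one pipeline, as in this article, the exact number does not matter.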

7. Run the spider with scrapy crawl followed by the spider name (the name defined in cnblogs_spider.py)

[root@bogon cnblogs]# scrapy crawl CnblogsSpider

8. View the results with more cnblogs.json (the file name defined in pipelines.py)

more cnblogs.json
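Because the pipeline writes one JSON object per line, the file as a whole is not a single JSON document, so it has to be parsed line by line. A minimal sketch for Python 3, assuming that layout; load_items is a made-up helper name.

```python
import io
import json


def load_items(path):
    """Read a file with one JSON object per line and return a list of dicts.

    Hypothetical helper; blank lines are skipped.
    """
    items = []
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Example usage, assuming cnblogs.json exists in the current directory:
# for item in load_items("cnblogs.json"):
#     print(item["title"], item["link"])
```

This is handy for a quick sanity check, for example counting how many articles were scraped with len(load_items("cnblogs.json")).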

9. If needed, the results can be converted to plain text; see another article, "Converting JSON data to text or SQL files with Python"
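As a rough idea of what such a conversion could look like, here is a sketch that turns the JSON lines into tab-separated text. It is not the method from the referenced article; the function name, the choice of tab as separator, and the field order are all assumptions, and the code targets Python 3.

```python
import json

# Field names from CnblogsItem; the output order is an arbitrary choice.
FIELDS = ("title", "link", "listUrl", "desc")


def json_lines_to_text(json_lines, fields=FIELDS, sep="\t"):
    """Convert an iterable of JSON strings into separator-delimited text rows.

    Hypothetical helper. Missing fields become empty strings, and runs of
    whitespace inside a value are collapsed so each item stays on one line.
    """
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        values = [" ".join(str(item.get(name, "")).split()) for name in fields]
        rows.append(sep.join(values))
    return rows


# Example usage, assuming cnblogs.json exists in the current directory:
# with open("cnblogs.json", encoding="utf-8") as src, \
#         open("cnblogs.txt", "w", encoding="utf-8") as dst:
#     dst.write("\n".join(json_lines_to_text(src)) + "\n")
```

Collapsing whitespace matters here because the scraped summaries may contain line breaks, which would otherwise split one article across several text rows.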

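The source code can be downloaded here: https://github.com/jackgitgz/cnblogs

10. You probably still have a question: can the data be saved directly to a database? Yes, it can. Upcoming articles will cover this step by step, so stay tuned.

References: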

http://doc.scrapy.org/en/master/

http://blog.csdn.net/HanTangSongMing/article/details/24454453
