Python Scrapy in Practice: JD Books "Computing" 24-Hour Bestseller List

0. Target
http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1
1. Create the project
scrapy startproject jdbook24
2. Create the spider
cd jdbook24
scrapy genspider jd24 book.jd.com
3. Define the item model (jdbook24/jdbook24/items.py)
# -*- coding: utf-8 -*-
import scrapy

class Jdbook24Item(scrapy.Item):
    # book title
    name = scrapy.Field()
4. Item pipeline (jdbook24/jdbook24/pipelines.py)
# -*- coding: utf-8 -*-
import codecs
import json

class Jdbook24Pipeline(object):
    def __init__(self):
        self.file = codecs.open('test.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # note: the hook Scrapy calls on shutdown is close_spider,
        # not spider_closed; with the wrong name the file is never closed
        self.file.close()
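The pipeline above writes one JSON object per line (the "JSON Lines" format), with ensure_ascii=False so Chinese titles stay readable instead of becoming \uXXXX escapes. A minimal stand-alone sketch of that serialization, using plain dicts and an in-memory buffer in place of the Scrapy item and output file (the sample titles are made up for illustration):

```python
import io
import json

# dict(item) turns a scrapy.Item into a plain dict; plain dicts stand in here
items = [{'name': '深入理解计算机系统'}, {'name': 'Python编程'}]

buf = io.StringIO()  # stands in for the file the pipeline opens
for item in items:
    # ensure_ascii=False keeps Chinese characters as-is in the output
    buf.write(json.dumps(item, ensure_ascii=False) + "\n")

# reading it back: parse each line independently
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
```

One object per line means the file can be streamed and appended to safely, unlike a single top-level JSON array.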
5. Spider code (jdbook24/jdbook24/spiders/jd24.py)
# -*- coding: utf-8 -*-
import scrapy
from jdbook24.items import Jdbook24Item

class Jd24Spider(scrapy.Spider):
    name = 'jd24'
    allowed_domains = ['book.jd.com']
    start_urls = ['http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1']

    def parse(self, response):
        books = response.xpath('//div[@class="mc"]/ul[@class="clearfix"]/li')
        for book in books:
            item = Jdbook24Item()
            # extract the book title
            item['name'] = book.xpath('./div[@class="p-detail"]/a/text()').extract_first()
            yield item
        # follow the next page; extract_first() returns None on the last page,
        # where extract()[0] would raise an IndexError
        next_url = response.xpath('//a[@class="pn-next"][1]/@href').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
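response.urljoin() is needed because the @href pulled from the page may be relative. It resolves the link against the current page URL, with the same semantics as the standard library's urllib.parse.urljoin. A quick sketch (the relative href '0-0-1.html' is a hypothetical example, not the actual value on JD's page):

```python
from urllib.parse import urljoin

page = 'http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1'

# a relative href replaces the last path segment of the current page
relative = urljoin(page, '0-0-1.html')
# -> 'http://book.jd.com/booktop/0-0-1.html'

# an absolute href passes through unchanged
absolute = urljoin(page, 'http://book.jd.com/booktop/0-0-2.html')
```

Because both cases come out as absolute URLs, the spider can feed the result straight into scrapy.Request without checking which kind of link the page used.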
6. Configure the pipeline in settings (jdbook24/jdbook24/settings.py)
# Skip robots.txt checks
ROBOTSTXT_OBEY = False
# Throttle downloads (mimics a human visitor and reduces load on the server)
DOWNLOAD_DELAY = 3
# Disable cookies when they are not needed
COOKIES_ENABLED = False
# Enable the item pipeline (essential: without this, process_item is never called)
ITEM_PIPELINES = {
    'jdbook24.pipelines.Jdbook24Pipeline': 300,
}
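As an alternative to a fixed DOWNLOAD_DELAY, Scrapy ships an AutoThrottle extension that adapts the delay to the server's observed latency. A possible settings.py fragment (the values are illustrative, not tuned for JD):

```python
# settings.py fragment: adaptive throttling instead of a fixed DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3            # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # upper bound when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average concurrent requests per remote server
```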
7. Run the spider
cd jdbook24
scrapy crawl jd24
To suppress the log output, run:
scrapy crawl jd24 --nolog
(note the flag is --nolog, with no space after the dashes)