Python Scrapy in Practice: JD Books Computer-Category 24-Hour Bestseller List

0. Target

http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1

1. Create the project

scrapy startproject jdbook24

2. Generate the spider

cd jdbook24
scrapy genspider jd24 book.jd.com

3. Define the item model (jdbook24/jdbook24/items.py)

# -*- coding: utf-8 -*-

import scrapy

class Jdbook24Item(scrapy.Item):
    # book title
    name = scrapy.Field()

4. Item pipeline (jdbook24/jdbook24/pipelines.py)

# -*- coding: utf-8 -*-

import codecs
import json

class Jdbook24Pipeline(object):
    def __init__(self):
        # One JSON object per line (JSON Lines), UTF-8 encoded
        self.file = codecs.open('test.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # close_spider (not spider_closed) is the hook Scrapy calls when the spider finishes
        self.file.close()
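The pipeline above serializes each item as one JSON object per line (JSON Lines). A minimal stand-alone sketch of that serialization step, using only the standard library (the two book titles are made-up examples, not crawled data):

```python
import json

# Simulate the dict(item) values the pipeline receives; titles are illustrative
items = [{"name": "深入理解计算机系统"}, {"name": "代码整洁之道"}]

lines = []
for item in items:
    # ensure_ascii=False keeps Chinese titles readable instead of \uXXXX escapes
    lines.append(json.dumps(item, ensure_ascii=False) + "\n")

output = "".join(lines)
print(output)
```

With `ensure_ascii=True` (the default), every non-ASCII character would be escaped, which makes the output file hard to read.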

5. Spider code (jdbook24/jdbook24/spiders/jd24.py)

# -*- coding: utf-8 -*-
import scrapy
from jdbook24.items import Jdbook24Item

class Jd24Spider(scrapy.Spider):
    name = 'jd24'
    allowed_domains = ['book.jd.com']
    start_urls = ['http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1']

    def parse(self, response):
        books = response.xpath('//div[@class="mc"]/ul[@class="clearfix"]/li')
        for book in books:
            item = Jdbook24Item()
            # Extract the book title
            item['name'] = book.xpath('./div[@class="p-detail"]/a/text()').extract_first()
            yield item

        # Follow the next page, if there is one; extract_first() returns None
        # on the last page instead of raising IndexError like extract()[0]
        next_url = response.xpath('//a[@class="pn-next"][1]/@href').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
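`response.urljoin` in the spider resolves a possibly relative next-page href against the current page's URL; it behaves like the standard library's `urllib.parse.urljoin`. A quick illustration (the relative path below is a made-up example, not JD's real pagination href):

```python
from urllib.parse import urljoin

base = 'http://book.jd.com/booktop/0-0-0.html?category=3287-0-0-0-10001-1'

# A relative href is resolved against the current page's directory...
relative = urljoin(base, 'page2.html')

# ...while an absolute URL passes through unchanged
absolute = urljoin(base, 'http://book.jd.com/booktop/page2.html')

print(relative)
print(absolute)
```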


6. Configure settings (jdbook24/jdbook24/settings.py)

Disable robots.txt checking:

ROBOTSTXT_OBEY = False

Set a download delay (mimics human browsing and reduces load on the server):

DOWNLOAD_DELAY = 3

Disable cookies unless you actually need them:

COOKIES_ENABLED = False

Enable the item pipeline (essential — without this, the pipeline never runs):

ITEM_PIPELINES = {
   'jdbook24.pipelines.Jdbook24Pipeline': 300,
}
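The integer 300 is the pipeline's priority: values range from 0 to 1000, and pipelines with lower numbers run first. If the project later grew a second pipeline (the class name below is hypothetical, just to show ordering), the dict would look like:

```python
ITEM_PIPELINES = {
    # hypothetical pipeline; priority 200 means it would run first
    # 'jdbook24.pipelines.DedupPipeline': 200,
    'jdbook24.pipelines.Jdbook24Pipeline': 300,
}
```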

7. Run the spider

cd jdbook24
scrapy crawl jd24

If you instead run:

scrapy crawl jd24 --nolog

the log output is suppressed and you will not see the crawl progress messages.
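After the crawl finishes, the pipeline's test.json holds one JSON object per line. It can be read back with the standard library alone; the sketch below writes two sample lines to a temporary file first so it is self-contained (the titles are illustrative, not crawled results):

```python
import json
import os
import tempfile

# Write sample lines in the same JSON Lines format the pipeline produces
sample = '{"name": "深入理解计算机系统"}\n{"name": "代码整洁之道"}\n'
path = os.path.join(tempfile.mkdtemp(), 'test.json')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Read the file back: one json.loads per non-empty line
with open(path, encoding='utf-8') as f:
    books = [json.loads(line) for line in f if line.strip()]

print(books)
```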