Scrapy + Elasticsearch

  • October 6, 2019
  • Notes

Prerequisites

1. Scrapy is installed

2. Elasticsearch is installed

Create a project named scrapyes

scrapy startproject scrapyes

Directory structure

.
|____scrapy.cfg
|____scrapyes
| |______init__.py
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py

Install ScrapyElasticSearch

pip install ScrapyElasticSearch

Configure settings.py

...

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 300,
}

ELASTICSEARCH_SERVERS = ['192.168.4.215']
ELASTICSEARCH_PORT = 9200  # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'scrapy.course'
ELASTICSEARCH_TYPE = 'course'
ELASTICSEARCH_UNIQ_KEY = 'url'

...

See https://github.com/knockrentals/scrapy-elasticsearch for details on these options.
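Setting ELASTICSEARCH_UNIQ_KEY = 'url' makes the pipeline derive a stable document id from each item's url, so re-running the crawl updates existing documents instead of creating duplicates. A minimal sketch of the idea (the exact hashing scheme is the library's internal detail; SHA-1 here is an assumption that matches the 40-character hex ids seen in the query results later):

```python
import hashlib

def uniq_id(value):
    # Derive a deterministic document id from the unique-key value.
    # SHA-1 is an assumption, chosen to match the 40-hex-char _id values
    # observed in the Elasticsearch results; the library's scheme may differ.
    return hashlib.sha1(value.encode('utf-8')).hexdigest()

# The same url always maps to the same id, so re-crawls overwrite rather
# than duplicate documents.
id_a = uniq_id('http://demo.edusoho.com/course/1')
id_b = uniq_id('http://demo.edusoho.com/course/1')
```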

Write a spider for online courses

import scrapy


class ESCourseSpider(scrapy.Spider):
    name = 'es_course'

    def start_requests(self):
        urls = []
        for i in range(1, 30):
            urls.append('http://demo.edusoho.com/course/' + str(i))

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield {
            'title': response.css('span.course-detail-heading::text').extract_first(),
            'price': response.css('b.pirce-num::text').extract_first(),
            'url': response.url,
        }
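The extracted titles come wrapped in newlines and padding (visible in the JSON output below). If you want cleaner values in the index, you can normalize strings before yielding them. A small sketch; the clean_text helper is hypothetical, not part of Scrapy:

```python
def clean_text(value):
    # Strip surrounding whitespace/newlines from an extracted string;
    # pass None through untouched so missing fields stay missing.
    if value is None:
        return None
    return value.strip()

# In parse() you would then yield, for example:
# 'title': clean_text(response.css('span.course-detail-heading::text').extract_first()),
title = clean_text('\n               课程功能体验\n                        ')
```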

Run the spider

scrapy crawl es_course -o es_course.json

The scraped items are saved to a newly created file, es_course.json:

[
  {"url": "http://demo.edusoho.com/course/1", "price": "免费", "title": "\n               课程功能体验\n                        "},
  {"url": "http://demo.edusoho.com/course/20", "price": "0.01", "title": "\n               官方主题\n                        "},
  {"url": "http://demo.edusoho.com/course/24", "price": "999.00", "title": "\n               会员专区\n                        "},
  {"url": "http://demo.edusoho.com/course/22", "price": "免费", "title": "\n               第三方主题\n                        "},
  {"url": "http://demo.edusoho.com/course/27", "price": "0.01", "title": "\n               优惠码\n                        "}
]

Check the data in Elasticsearch with the following query:

GET scrapy.course*/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 50
}

Result

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6306093149d91c35eabc1c59f28d68355cc4de9d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/1",
          "price": "免费",
          "title": "\n               课程功能体验\n                        "
        }
      },
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6a090cfe8f9dbf3d21248d64d9907eab4b31bc4d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/24",
          "price": "999.00",
          "title": "\n               会员专区\n                        "
        }
      },
...

This confirms the data has been stored in Elasticsearch.
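The same check can also be done from Python. Below is a sketch of the equivalent query using the elasticsearch-py client; the client call is commented out so the snippet runs without a live cluster, and it assumes the server address from settings.py is reachable:

```python
# Query body equivalent to the GET scrapy.course*/_search request above.
query = {
    "query": {"match_all": {}},
    "from": 0,
    "size": 50,
}

# With a live cluster you would run something like:
# from elasticsearch import Elasticsearch
# es = Elasticsearch(['http://192.168.4.215:9200'])
# resp = es.search(index='scrapy.course*', body=query)
# print(resp['hits']['total'])
```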