Scrapy Framework + Elasticsearch
- October 6, 2019
- Notes
Prerequisites
1. The Scrapy framework is installed
2. Elasticsearch is installed
Create a project named scrapyes
scrapy startproject scrapyes
Directory structure
.
|____scrapy.cfg
|____scrapyes
| |______init__.py
| |____items.py
| |____middlewares.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
Install ScrapyElasticSearch
pip install ScrapyElasticSearch
Configure settings.py
...
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 300,
}
ELASTICSEARCH_SERVERS = ['192.168.4.215']
ELASTICSEARCH_PORT = 9200  # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'scrapy.course'
ELASTICSEARCH_TYPE = 'course'
ELASTICSEARCH_UNIQ_KEY = 'url'
...
For an explanation of these settings, see https://github.com/knockrentals/scrapy-elasticsearch
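ELASTICSEARCH_UNIQ_KEY names the item field that identifies a document, so crawling the same URL twice should overwrite the document rather than duplicate it. The snippet below is a minimal sketch of that idea, not the pipeline's actual code: it assumes the document ID is the SHA-1 hex digest of the field value, which is at least consistent with the 40-character hexadecimal _id values in the query result further down. The function name document_id is hypothetical.

```python
import hashlib


def document_id(item, uniq_key="url"):
    """Return a stable hex ID derived from the item's unique-key field."""
    value = item[uniq_key]
    # Assumption: SHA-1 over the UTF-8 bytes of the field value.
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


# The same URL always maps to the same ID, so a re-crawl would
# target the same document instead of creating a new one.
first = document_id({"url": "http://demo.edusoho.com/course/1"})
second = document_id({"url": "http://demo.edusoho.com/course/1"})
print(len(first), first == second)  # prints: 40 True
```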
Write a spider for online courses
import scrapy


class ESCourseSpider(scrapy.Spider):
    name = 'es_course'

    def start_requests(self):
        # Request course pages 1 through 29 of the demo site.
        # range() replaces the Python 2-only xrange().
        for i in range(1, 30):
            url = 'http://demo.edusoho.com/course/' + str(i)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Yield one item per course page; 'url' is the field that
        # ELASTICSEARCH_UNIQ_KEY points at in settings.py.
        yield {
            'title': response.css('span.course-detail-heading::text').extract_first(),
            'price': response.css('b.pirce-num::text').extract_first(),
            'url': response.url,
        }
Run the spider
scrapy crawl es_course -o es_course.json
The scraped items are written to a newly created file, es_course.json:
[
{"url": "http://demo.edusoho.com/course/1", "price": "免费", "title": "\n 课程功能体验\n "},
{"url": "http://demo.edusoho.com/course/20", "price": "0.01", "title": "\n 官方主题\n "},
{"url": "http://demo.edusoho.com/course/24", "price": "999.00", "title": "\n 会员专区\n "},
{"url": "http://demo.edusoho.com/course/22", "price": "免费", "title": "\n 第三方主题\n "},
{"url": "http://demo.edusoho.com/course/27", "price": "0.01", "title": "\n 优惠码\n "}
]
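The titles in this output still carry the newlines and spaces that surround the text in the page markup. As a minimal sketch, the text fields could be stripped before the item is yielded. The helper names clean_text and clean_item are hypothetical and not part of Scrapy; None is passed through because extract_first() returns None when the selector matches nothing.

```python
def clean_text(value):
    """Strip surrounding whitespace; pass None through unchanged."""
    if value is None:
        return None
    return value.strip()


def clean_item(item, fields=("title", "price")):
    """Return a copy of the item with the given text fields cleaned."""
    cleaned = dict(item)
    for field in fields:
        if field in cleaned:
            cleaned[field] = clean_text(cleaned[field])
    return cleaned


raw = {
    "url": "http://demo.edusoho.com/course/1",
    "price": "免费",
    "title": "\n 课程功能体验\n ",
}
cleaned = clean_item(raw)  # cleaned["title"] no longer has the padding
```

In the spider, parse could yield clean_item({...}) instead of the raw dictionary.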
To check the data in Elasticsearch, run the following query:
GET scrapy.course*/_search
{
  "query": { "match_all": {} },
  "from": 0,
  "size": 50
}
Result
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6306093149d91c35eabc1c59f28d68355cc4de9d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/1",
          "price": "免费",
          "title": "\n 课程功能体验\n "
        }
      },
      {
        "_index": "scrapy.course",
        "_type": "course",
        "_id": "6a090cfe8f9dbf3d21248d64d9907eab4b31bc4d",
        "_score": 1,
        "_source": {
          "url": "http://demo.edusoho.com/course/24",
          "price": "999.00",
          "title": "\n 会员专区\n "
        }
      },
      ...
This shows that the data has been stored in Elasticsearch.
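The same check could also be scripted. The sketch below uses only the Python standard library to build the match_all search request shown earlier. It assumes the node from settings.py (192.168.4.215:9200) is reachable over plain HTTP without authentication, and the names build_search_request and count_hits are hypothetical. The request is only constructed here; sending it is left commented out.

```python
import json
import urllib.request


def build_search_request(host, port=9200, index_pattern="scrapy.course*", size=50):
    """Build (but do not send) a match_all search request."""
    body = {"query": {"match_all": {}}, "from": 0, "size": size}
    url = "http://{}:{}/{}/_search".format(host, port, index_pattern)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_hits(response_body):
    """Return the number of hits included in a decoded _search response."""
    return len(response_body["hits"]["hits"])


request = build_search_request("192.168.4.215")
# To actually run the query (requires a reachable Elasticsearch node):
# with urllib.request.urlopen(request) as resp:
#     print(count_hits(json.load(resp)))
```

count_hits counts the returned documents rather than reading hits.total, since the shape of that field is assumed here to vary between Elasticsearch versions.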