Scrapy Tutorial

  • October 6, 2019
  • Notes

Creating a Scrapy Project

scrapy startproject tutorial

Output:

(scrapy) localhost:scrapy stanley$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/Users/stanley/virtualenv/scrapy/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/stanley/virtualenv/scrapy/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com

Directory structure

|____scrapy.cfg            # deploy configuration file
|____tutorial              # the project's Python module
| |______init__.py
| |____items.py            # items file
| |____middlewares.py      # middlewares file
| |____pipelines.py        # pipelines file
| |____settings.py         # project settings
| |____spiders             # directory where the spiders live
| | |______init__.py

The First Spider

Spiders are the classes Scrapy uses to scrape data from websites; they all inherit from the scrapy.Spider class.

Write the first spider class in a file named quotes_spider.py under the tutorial/spiders directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
  • name: the name of this spider.
  • start_requests(): returns an iterable of Requests from which the spider starts crawling. The iterable can be a list or a generator function (see the sketch after this list). Generator functions are explained at https://my.oschina.net/stanleysun/blog/1501702
  • parse(): parses the data the spider downloads for each request.
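As a minimal sketch of the list form (a hypothetical variant of the spider above, not part of the original tutorial), start_requests() can just as well return a plain list of Requests instead of yielding them from a generator:

import scrapy


class QuotesListSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = "quotes_list"

    def start_requests(self):
        # Scrapy only requires an iterable of Request objects,
        # so returning a list works just as well as a generator.
        return [
            scrapy.Request('http://quotes.toscrape.com/page/1/', callback=self.parse),
            scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse),
        ]

    def parse(self, response):
        self.log('Visited %s' % response.url)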

Running the Spider

Go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider named quotes, i.e. the one we just wrote.

Note: the subcommand here is crawl, not runspider (a comparison follows at the end of this section).

Output:

...
2017-08-08 07:17:01 [scrapy.core.engine] INFO: Spider opened
2017-08-08 07:17:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-08 07:17:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-08 07:17:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-08-08 07:17:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2017-08-08 07:17:03 [quotes] DEBUG: Saved file quotes-1.html
2017-08-08 07:17:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2017-08-08 07:17:03 [quotes] DEBUG: Saved file quotes-2.html
2017-08-08 07:17:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-08 07:17:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 675,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 5976,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 7, 23, 17, 3, 793776),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'memusage/max': 43855872,
 'memusage/startup': 43855872,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 8, 7, 23, 17, 1, 659265)}
2017-08-08 07:17:03 [scrapy.core.engine] INFO: Spider closed (finished)

Two files, quotes-1.html and quotes-2.html, are also generated in the directory; they are written by the spider's parse() method.
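For comparison with the crawl command above: runspider takes a path to a spider file rather than a spider name, and also works outside a project. A minimal example, assuming the file layout shown earlier:

scrapy runspider tutorial/spiders/quotes_spider.py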

Simplifying start_requests()

start_requests() yields a series of scrapy.Request objects; alternatively, we can replace it with a start_urls class attribute:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Extracting Data

You can use the interactive shell to experiment with extracting data. For example, run the command below; Scrapy fetches the page and stores the results in a set of default variables.

scrapy shell 'http://quotes.toscrape.com/page/1/'

Output:

2017-08-08 11:45:41 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tutorial)
2017-08-08 11:45:41 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial', 'LOGSTATS_INTERVAL': 0}

...

2017-08-08 11:45:41 [scrapy.core.engine] INFO: Spider opened
2017-08-08 11:45:42 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-08-08 11:45:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x10fc21190>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x10fc21a90>
[s]   spider     <DefaultSpider 'default' at 0x10feb3650>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

request, response, settings and the rest are variables holding the fetched data; you can inspect and manipulate them from the shell. For example, use a CSS selector to find the page's title element:

>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').extract()
[u'Quotes to Scrape']

Without ::text, the whole element is selected, tags included:

>>> response.css('title').extract()
[u'<title>Quotes to Scrape</title>']

extract() returns a list; extract_first() returns just the first item.

>>> response.css('title::text').extract_first()
u'Quotes to Scrape'
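One convenience of extract_first() worth noting: when nothing matches, it returns None instead of raising an IndexError, and it accepts a default value. A quick sketch, assuming the page has no h6 elements:

>>> response.css('h6::text').extract_first()                # no match: returns None
>>> response.css('h6::text').extract_first(default='n/a')   # or a supplied default
'n/a'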

Regular Expressions

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

Viewing the Response in a Browser

view(response)

A Quick Look at XPath

Besides CSS selectors, you can also use XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Extracting Quotes and Authors

On http://quotes.toscrape.com, each quote is an HTML element like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Finding the quotes

>>> response.css("div.quote")  [<Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>, <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>]  >>> 

The result is a list, which means multiple quotes were found. Select the first one:

>>> response.css("div.quote")[0]  <Selector xpath=u"descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data=u'<div class="quote" itemscope itemtype="h'>

Extracting title, author and tags

>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> title
u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'
>>> author = quote.css("small.author::text").extract_first()
>>> author
u'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
[u'change', u'deep-thoughts', u'thinking', u'world']

Iterating

>>> for quote in response.css("div.quote"):
...   text = quote.css("span.text::text").extract_first()
...   author = quote.css("small.author::text").extract_first()
...   tags = quote.css("div.tags a.tag::text").extract()
...   print(dict(text=text, author=author, tags=tags))
...
{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
{'text': u'\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d', 'tags': [u'inspirational', u'life', u'live', u'miracle', u'miracles'], 'author': u'Albert Einstein'}

...

Extracting Data in the Spider

The extraction techniques we tried in the shell can be used in the spider itself. Let's improve our spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Run:

$ scrapy crawl quotes

...

{'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'], 'author': u'Albert Einstein'}
2017-08-08 13:00:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d', 'tags': [u'abilities', u'choices'], 'author': u'J.K. Rowling'}
...

Storing the Data

scrapy crawl quotes -o quotes.json

quotes.json

[  {"text": "u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},  {"text": "u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},  ...

Following links

Starting from the first page, we want to crawl every page of the site. That means finding the links on each page and following them. The example site has a "next page" link, like this:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Locate this element and extract its href:

>>> response.css('li.next a::attr(href)').extract_first()
u'/page/2/'

Modify the code to crawl pages recursively:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

response.urljoin combines the base URL with the relative path in next_page to form an absolute URL. Because next_page starts with '/', it is joined at the site root rather than simply appended to response.url, as the test below shows.

>>> next_page = response.css('li.next a::attr(href)').extract_first()
>>> next_page
u'/page/2/'
>>> response.url
'http://quotes.toscrape.com/page/1/'
>>> response.urljoin(next_page)
u'http://quotes.toscrape.com/page/2/'
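A reasonable mental model is the standard library's urljoin with response.url as the base. A minimal illustration (on Python 2 the function lives in the urlparse module instead):

from urllib.parse import urljoin

# A leading '/' makes the path root-relative, replacing the whole
# path of the base URL:
print(urljoin('http://quotes.toscrape.com/page/1/', '/page/2/'))
# -> http://quotes.toscrape.com/page/2/

# Without the leading '/', the path is resolved relative to the
# base URL's directory:
print(urljoin('http://quotes.toscrape.com/page/1/', 'page/2/'))
# -> http://quotes.toscrape.com/page/1/page/2/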

Simplifying scrapy.Request

...
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
...

With response.follow, you don't need to join URLs yourself.

response.follow() accepts not only a string but also a Selector. The loop below recursively follows every matching link on the page:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements, response.follow offers an even shorter form:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Another Example

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider starts at the home page and follows both the author pages and the pagination links. The inner helper extract_with_css() returns the first CSS match with surrounding whitespace stripped.

Run:

scrapy crawl author -o author.json

The resulting file author.json stores every author's name, birthdate and bio:

[  {"bio": "Marilyn Monroe (born Norma Jeane Mortenson; June 1, 1926 u2013 August 5, 1962) was an American actress, model, and singer, who became a major sex symbol, starring in a number of commercially successful motion pictures during the 1950s and early 1960s.After spending much of her childhood in foster homes, Monroe began a career as a model, which led to a film contract in 1946 with Twentieth Century-Fox. Her early film appearances were minor, but her performances in The Asphalt Jungle and All About Eve (both 1950), drew attention. By 1952 she had her first leading role in Don't Bother to Knock and 1953 brought a lead in Niagara, a melodramatic film noir that dwelt on her seductiveness. Her "dumb blonde" persona was used to comic effect in subsequent films such as Gentlemen Prefer Blondes (1953), How to Marry a Millionaire (1953) and The Seven Year Itch (1955). Limited by typecasting, Monroe studied at the Actors Studio to broaden her range. Her dramatic performance in Bus Stop (1956) was hailed by critics and garnered a Golden Globe nomination. Her production company, Marilyn Monroe Productions, released The Prince and the Showgirl (1957), for which she received a BAFTA Award nomination and won a David di Donatello award. She received a Golden Globe Award for her performance in Some Like It Hot (1959). Monroe's last completed film was The Misfits, co-starring Clark Gable with screenplay by her then-husband, Arthur Miller.Marilyn was a passionate reader, owning four hundred books at the time of her death, and was often photographed with a book.The final years of Monroe's life were marked by illness, personal problems, and a reputation for unreliability and being difficult to work with. The circumstances of her death, from an overdose of barbiturates, have been the subject of conjecture. Though officially classified as a "probable suicide", the possibility of an accidental overdose, as well as of homicide, have not been ruled out. In 1999, Monroe was ranked as the sixth greatest female star of all time by the American Film Institute. In the decades following her death, she has often been cited as both a pop and a cultural icon as well as the quintessential American sex symbol.", "name": "Marilyn Monroe", "birthdate": "June 01, 1926"},  {"bio": "Anna Eleanor Roosevelt was an American political leader who used her influence as an active First Lady from 1933 to 1945 to promote the New Deal policies of her husband, President Franklin D. Roosevelt, as well as taking a prominent role as an advocate for civil rights. After her husband's death in 1945, she continued to be an internationally prominent author and speaker for the New Deal coalition. She was a suffragist who worked to enhance the status of working women, although she opposed the Equal Rights Amendment because she believed it would adversely affect women. In the 1940s, she was one of the co-founders of Freedom House and supported the formation of the United Nations. Eleanor Roosevelt founded the UN Association of the United States in 1943 to advance support for the formation of the UN. She was a delegate to the UN General Assembly from 1945 and 1952, a job for which she was appointed by President Harry S. Truman and confirmed by the United States Congress. During her time at the United Nations chaired the committee that drafted and approved the Universal Declaration of Human Rights. 
President Truman called her the "First Lady of the World" in tribute to her human rights achievements.She was one of the most admired persons of the 20th century, according to Gallup's List of Widely Admired People.", "name": "Eleanor Roosevelt", "birthdate": "October 11, 1884"},  ...  ]

Although the same author page may be reached multiple times, Scrapy filters out duplicate requests by default, reducing load on the server. This behavior can be configured via the DUPEFILTER_CLASS setting, as sketched below.
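As a sketch of what that configuration looks like: the default filter is scrapy.dupefilters.RFPDupeFilter (fingerprint-based), and swapping in BaseDupeFilter disables duplicate filtering entirely.

# tutorial/settings.py
# Default is 'scrapy.dupefilters.RFPDupeFilter', which drops requests
# whose fingerprint has already been seen. BaseDupeFilter performs no
# filtering, so every request is downloaded.
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'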

Passing Arguments to the Spider

You can pass arguments to a spider with the -a option, for example:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

The argument is passed to the spider's __init__ method and becomes an attribute of the spider.

The argument above is assigned to the spider's self.tag attribute; we can use it to crawl only specific pages.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

For example, if you pass tag=humor, the spider will only crawl http://quotes.toscrape.com/tag/humor