Python 3 web crawler, example 1

  • December 13, 2019
  • Notes

Install Scrapy and its dependencies

pip install scrapy
pip install pyOpenSSL
pip install cryptography
pip install CFFI
pip install lxml
pip install cssselect
pip install Twisted

Create the crawler project

scrapy startproject ZhipinSpider

Generate the spider

scrapy genspider job_position "zhipin.com"


Directory structure (see the sketch below):
  • items.py: defines the item fields to scrape
  • pipelines.py: processes the scraped items
  • settings.py: project configuration
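For reference, the layout generated by scrapy startproject and scrapy genspider typically looks roughly like this (annotations are mine):

ZhipinSpider/
    scrapy.cfg                 # deployment configuration
    ZhipinSpider/
        __init__.py
        items.py               # item (field) definitions
        middlewares.py         # spider / downloader middlewares
        pipelines.py           # item pipelines
        settings.py            # project settings
        spiders/
            __init__.py
            job_position.py    # created by "scrapy genspider"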

First, debug the data extraction in the Scrapy shell

scrapy shell -s USER_AGENT="xx" https://www.zhipin.com/c101280100/h101280100/

Setting USER_AGENT here makes Scrapy masquerade as a regular browser.

XPath syntax

  • /   matches from the root node
  • //  matches nodes anywhere in the document
  • .   the current node
  • ..  the parent node
  • @   selects an attribute
  • example: //div[@title="xxx"]/div

extract() pulls the content out of the matched nodes; extract_first() returns only the first match.

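A quick sketch of what this looks like inside the Scrapy shell started above (the selectors assume the div.job-primary / div.job-title markup used by the spider later in these notes):

# run inside the Scrapy shell started above
titles = response.xpath('//div[@class="job-primary"]//div[@class="job-title"]/text()')
titles.extract()         # all matched text nodes, as a list of strings
titles.extract_first()   # only the first match, or None if nothing matched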

Matching with CSS selectors

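Roughly the same queries written with response.css(), using the ::text and ::attr() pseudo-elements (a sketch; the class names are the same assumptions as above):

# CSS-selector equivalents of the XPath queries above
response.css('div.job-primary div.job-title::text').extract()
response.css('div.info-primary h3 a::attr(href)').extract_first()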

items.py

import scrapy

class ZhipinspiderItem(scrapy.Item):
    # job title
    title = scrapy.Field()
    # salary
    salary = scrapy.Field()
    # hiring company
    company = scrapy.Field()
    # link to the job detail page
    url = scrapy.Field()
    # work location
    work_addr = scrapy.Field()
    # industry
    industry = scrapy.Field()
    # company size
    company_size = scrapy.Field()
    # recruiter
    recruiter = scrapy.Field()
    # publish date
    publish_date = scrapy.Field()

job_spider.py

import scrapy
from ZhipinSpider.items import ZhipinspiderItem

class JobPositionSpider(scrapy.Spider):
    # the spider's name
    name = 'job_position'
    # domains this spider is allowed to crawl
    allowed_domains = ['zhipin.com']
    # the list of start URLs for this spider
    start_urls = ['https://www.zhipin.com/c101280100/h_101280100/']

    # This method extracts the information contained in the response.
    # response is the downloader's response for each URL in start_urls.
    def parse(self, response):
        # iterate over every //div[@class="job-primary"] node on the page
        for job_primary in response.xpath('//div[@class="job-primary"]'):
            item = ZhipinspiderItem()
            # match the ./div[@class="info-primary"] child node,
            # i.e. the <div.../> element that holds the job information
            info_primary = job_primary.xpath('./div[@class="info-primary"]')
            item['title'] = info_primary.xpath('./h3/a/div[@class="job-title"]/text()').extract_first()
            item['salary'] = info_primary.xpath('./h3/a/span[@class="red"]/text()').extract_first()
            item['work_addr'] = info_primary.xpath('./p/text()').extract_first()
            item['url'] = info_primary.xpath('./h3/a/@href').extract_first()
            # match the ./div[@class="info-company"]/div[@class="company-text"] node,
            # i.e. the <div.../> element that holds the company information
            company_text = job_primary.xpath('./div[@class="info-company"]'
                + '/div[@class="company-text"]')
            item['company'] = company_text.xpath('./h3/a/text()').extract_first()
            company_info = company_text.xpath('./p/text()').extract()
            if company_info and len(company_info) > 0:
                item['industry'] = company_info[0]
            if company_info and len(company_info) > 2:
                item['company_size'] = company_info[2]
            # match the ./div[@class="info-publis"] node,
            # i.e. the <div.../> element that holds the publisher information
            info_publis = job_primary.xpath('./div[@class="info-publis"]')
            item['recruiter'] = info_publis.xpath('./h3/text()').extract_first()
            item['publish_date'] = info_publis.xpath('./p/text()').extract_first()
            yield item

        # extract the link to the next page
        new_links = response.xpath('//div[@class="page"]/a[@class="next"]/@href').extract()
        if new_links and len(new_links) > 0:
            # get the next-page link
            new_link = new_links[0]
            # issue another request to fetch the next page
            yield scrapy.Request("https://www.zhipin.com" + new_link, callback=self.parse)

pipelines.py

class ZhipinspiderPipeline(object):
    def process_item(self, item, spider):
        print("Job:", item['title'])
        print("Salary:", item['salary'])
        print("Location:", item['work_addr'])
        print("Detail link:", item['url'])
        print("Company:", item['company'])
        # use .get() for fields the spider may not have filled in
        print("Industry:", item.get('industry'))
        print("Company size:", item.get('company_size'))
        print("Recruiter:", item['recruiter'])
        print("Publish date:", item['publish_date'])
        return item

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for ZhipinSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ZhipinSpider'

SPIDER_MODULES = ['ZhipinSpider.spiders']
NEWSPIDER_MODULE = 'ZhipinSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ZhipinSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# configure the default request headers (make Scrapy look like a real browser)
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ZhipinSpider.middlewares.ZhipinspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ZhipinSpider.middlewares.ZhipinspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# enable our item pipeline
ITEM_PIPELINES = {
    'ZhipinSpider.pipelines.ZhipinspiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Run the crawler

scrapy crawl job_position
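To also dump the scraped items straight to a file, Scrapy's built-in feed export can be used (the output file name here is arbitrary):

scrapy crawl job_position -o job_positions.json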


Storing items in a database: pipelines.py

# import the MySQL access module
import mysql.connector

class ZhipinspiderPipeline(object):
    # constructor: open the database connection and cursor
    def __init__(self):
        self.conn = mysql.connector.connect(user='root', password='32147',
            host='localhost', port='3306',
            database='python', use_unicode=True)
        self.cur = self.conn.cursor()

    # override the close_spider callback to release database resources
    def close_spider(self, spider):
        print('---------- closing database resources ----------')
        # close the cursor
        self.cur.close()
        # close the connection
        self.conn.close()

    def process_item(self, item, spider):
        self.cur.execute("INSERT INTO job_inf VALUES(null, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
            (item['title'], item['salary'], item['company'], item['url'],
             item['work_addr'], item.get('industry'), item.get('company_size'),
             item['recruiter'], item['publish_date']))
        self.conn.commit()
        return item
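The INSERT above assumes a job_inf table already exists. Its exact schema is not shown in these notes, so the following one-off script is only a guess at a matching layout (an auto-increment id plus nine text columns mirroring the item fields):

# one-off helper: create a job_inf table matching the INSERT above
# (the schema is an assumption; adjust column types/lengths to your needs)
import mysql.connector

conn = mysql.connector.connect(user='root', password='32147',
    host='localhost', port='3306', database='python', use_unicode=True)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS job_inf (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        salary VARCHAR(255),
        company VARCHAR(255),
        url VARCHAR(500),
        work_addr VARCHAR(255),
        industry VARCHAR(255),
        company_size VARCHAR(255),
        recruiter VARCHAR(255),
        publish_date VARCHAR(255)
    ) DEFAULT CHARSET=utf8mb4
""")
conn.commit()
cur.close()
conn.close()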


Dealing with anti-crawler measures

Rotating the IP address: middlewares.py (a sketch follows)

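A minimal sketch of the idea, assuming you have a usable HTTP proxy (the middleware class name and proxy URL below are placeholders, not from the original notes):

# ZhipinSpider/middlewares.py: sketch of a proxy-switching downloader middleware
class ProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # route every request through the proxy so the target site
        # sees the proxy's IP instead of ours (placeholder address)
        request.meta['proxy'] = 'http://127.0.0.1:8888'

It then has to be enabled in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'ZhipinSpider.middlewares.ProxyDownloaderMiddleware': 543,
}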

Disabling cookies: settings.py

COOKIES_ENABLED = False

Ignoring robots.txt rules

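The setting involved is ROBOTSTXT_OBEY; turning it off looks like this in settings.py:

# stop honoring the site's robots.txt
ROBOTSTXT_OBEY = False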

Throttling the request rate

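The relevant settings.py knobs are the fixed download delay and/or the AutoThrottle extension (the values below are illustrative, not from the original notes):

# settings.py: slow the crawl down
DOWNLOAD_DELAY = 3                    # wait 3 seconds between requests to the same site

AUTOTHROTTLE_ENABLED = True           # or let Scrapy adapt the delay automatically
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0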

Logging in (Selenium)