Scraping the Maoyan Movie Rankings

October 10, 2019
Notes

Original link: https://blog.csdn.net/weixin_40313634/article/details/89502198

Environment

Technique: requests to fetch the pages + regular expressions to parse them

Editor: Sublime + Python 3

Target site: https://maoyan.com/board/4?offset=0
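To make the regular-expression half of this approach concrete, here is a minimal sketch that runs a non-greedy pattern over a hand-written HTML fragment. The fragment `SAMPLE_HTML` and the function name `extract_titles` are assumptions made up for illustration; the markup of the real page may differ.

```python
import re

# Hypothetical fragment, loosely modelled on a ranking list.
# It is NOT copied from the real site.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" title="Movie A"></a>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <a href="/films/2" title="Movie B"></a>
</dd>
'''


def extract_titles(html):
    """Return a list of (rank, title) tuples found in the HTML."""
    # re.S lets '.' match newlines, so one pattern can span several lines.
    # '.*?' is non-greedy: each piece stops at the nearest following token.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(.*?)</i>.*?title="(.*?)"', re.S)
    return pattern.findall(html)


print(extract_titles(SAMPLE_HTML))
```

Each capture group `(.*?)` becomes one element of the resulting tuple, which is why the full scraper can index its matches by position.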

Code

import json
import os      # file-system operations
import random
import re      # regular expressions

import requests

# Pretend the request comes from Chrome rather than from a crawler.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}


def get_one_page(url):
    '''Fetch one page and return its text, or None if the request fails.'''
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        return response.text
    return None


def get_one_image(url):
    '''Fetch binary content (image, video, etc.), or None if the request fails.

    response.content holds the raw bytes; response.text holds the decoded text.
    Only the return value differs from get_one_page, so the two functions
    could be merged.
    '''
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        return response.content
    return None


def parse_one_page(html):
    '''Yield the rank, title, image URL, actors, release time and score of each movie.'''
    # Regular expression derived from the page's HTML structure.
    pattern = re.compile(r'''<dd>.*?board-index.*?>(.*?)</i>.*?title="(.*?)".*?<img data-src="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)<.*?class="integer">(.*?)<.*?"fraction">(.*?)<''', re.S)
    # findall returns a list of tuples; turn each one into a dict.
    for item in re.findall(pattern, html):
        yield {
            'index': item[0],
            'title': item[1],
            'image': item[2],
            'actor': item[3].strip(),
            'time': item[4],
            'score': item[5] + item[6]
        }


def write_to_file(content):
    '''Append the movie record to result.json and download its image.'''
    with open('result.json', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + ',\n')
    image = get_one_image(content['image'])
    if image is None:
        # The download failed, so there is nothing to save.
        return
    fdir = './image'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    # The image file name is a random number.
    num = random.random()
    with open(os.path.join(fdir, str(num) + '.jpg'), 'wb') as f:
        f.write(image)


def main(offset):
    # Build the URL of each page from its offset.
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # Skip pages that could not be fetched.
        return
    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    # Loop over the page offsets 0, 10, ..., 90.
    for i in range(10):
        main(offset=i * 10)
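The comment on get_one_image points out that the two fetch functions differ only in what they return and could be merged. One possible way to do that is sketched below. The name `fetch`, the `binary` and `get` parameters, and the 10-second timeout are assumptions, not part of the original code; `get` exists only so the helper can be tried with a stub instead of a real network call.

```python
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}


def fetch(url, binary=False, get=None):
    """Return the body as bytes (binary=True) or text; None on a non-200 status."""
    if get is None:
        # Imported lazily so a stub can be passed in where requests is unavailable.
        import requests
        get = requests.get
    # timeout is an added assumption; it keeps a stalled request from hanging.
    response = get(url, headers=HEADERS, timeout=10)
    if response.status_code != 200:
        return None
    return response.content if binary else response.text
```

With this helper, a page would be fetched with `fetch(url)` and an image with `fetch(url, binary=True)`.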

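Caveats

1. Do not give your script the same name as a module it imports (for example, requests.py): otherwise the import resolves to your own file and the module's functions cannot be found.

2. Keep the indentation aligned: set the editor to insert spaces instead of tabs and to display whitespace characters. Otherwise indentation errors creep in easily.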
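The second caveat can also be checked mechanically. The sketch below reports the lines of a source string whose indentation contains a tab; the function name `find_tab_indented_lines` is hypothetical.

```python
def find_tab_indented_lines(source):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indent.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            bad.append(number)
    return bad


sample = "def f():\n    x = 1\n\treturn x\n"
print(find_tab_indented_lines(sample))
```

Python's standard library also ships the tabnanny module, which can be run as `python -m tabnanny yourfile.py` to flag ambiguous indentation.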
