
Scraping JD.com Laptops with Python to See Which Brand Comes Out on Top

  • October 7, 2019
  • Notes

I. Introduction
II. Prerequisites
III. Process Analysis
  1. Examining the listing-page URL and each laptop's detail-page URL
  2. Finding each laptop's id
  3. Locating the data holding each laptop's price and review count
  4. Scraping strategy
IV. Hands-on: scraping JD laptop data with the urllib module and visualizing it
V. Visualization Results
  1. Run output
  2. Visualizations

//

About the author

Wang Hao: "Hard is the road, so many forks; where am I now? A day will come when, head down, I fix every bug, and only once the bugs are all fixed do I eat."

//

Reading this article should take about 5 minutes.

I. Introduction

As a programmer, a laptop is indispensable. Here I scrape the review counts, prices, shops and other details of the laptops on the first 2 pages of JD's listings and visualize them; the charts make it easy to see what is selling well and to pick a good-value machine. Of course, 2 pages of data is nowhere near enough: if you want a more reliable prediction, change the page count in the code to collect more pages, and the results will be more accurate.

II. Prerequisites

III. Process Analysis

1. Examining the listing-page URL and each laptop's detail-page URL

(1) Look at a detail page's URL: we can guess that every detail page has an id, and that constructing https://item.jd.com/[id].html yields that laptop's page.

(2) Look at the listing page's URL: the value of the page= parameter is the page number, so by constructing page values we can page through the listings automatically. After stripping the unnecessary parts of the listing URL, the pagination pattern comes out as https://list.jd.com/list.html?cat=670,671,672&page=[page number]

From the analysis above, the key to getting the data is each laptop's specific id, so our next task is to find every laptop's id. A short sketch of the two URL patterns follows.
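To make the two patterns concrete, here is a minimal sketch of the URL construction. The helper names item_url and list_url are illustrative, not part of the original script; the example sku id is one that appears later in the captured endpoint URLs.

def item_url(sku_id):
    # Detail page: https://item.jd.com/[id].html
    return 'https://item.jd.com/' + str(sku_id) + '.html'

def list_url(page):
    # Listing page: the page= parameter carries the page number
    return 'https://list.jd.com/list.html?cat=670,671,672&page=' + str(page)

print(item_url('100002368328'))  # a sku id seen in the captured endpoints below
print(list_url(1))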

2. Finding each laptop's id

(1) First, check whether the page source contains each laptop's id.


(2) Opening the detail page of the laptop we just searched for confirms that this number is indeed its id.

(3) Use attribute values near the id to uniquely identify all the laptop ids. Locating by the class="gl-i-wrap j-sku-item" attribute matches exactly 60 ids, and counting the laptops on the page confirms there are indeed 60 per page, so we now have every laptop's id.

(4) Likewise, the <div class="p-name"> element gives us both each laptop's detail URL and its model name, so we don't even need to construct the detail URLs ourselves; they can be read directly (see the extraction sketch below).
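As a sketch, the whole extraction can be isolated into one helper. The XPath expressions are the ones used by the full script in section IV (note that the script reads the data-sku attribute from an anchor element rather than from the gl-i-wrap j-sku-item container mentioned above); the function name extract_listing is hypothetical.

from lxml import etree

def extract_listing(page_source):
    # Pull every laptop's id, name and detail URL from one listing page,
    # using the same XPath expressions as the full script in section IV.
    html = etree.HTML(page_source)
    ids = html.xpath('//a[@class="p-o-btn contrast J_contrast contrast-hide"]/@data-sku')
    names = [n.strip() for n in html.xpath('//div[@class="p-name"]/a/em/text()')]
    # Deduplicate, then complete each relative URL with the https scheme
    hrefs = ['https:' + h for h in set(html.xpath('//div[@class="p-name"]/a/@href'))]
    return ids, names, hrefs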

3. Locating the data holding each laptop's price and review count

(1) Searching the page source turns up nothing at all, so I suspected this information is hidden in a js package.

(2) Open the Fiddler capture tool and analyze the traffic.

The capture shows the information really does live in a js package; copy that package's URL and analyze it. (3) The analysis yields the following conclusions:

I also captured the js package holding the shop name, but part of that package's URL is randomly generated on each request, so the shop name cannot be fetched from it. However, I already have every laptop's detail URL, and the detail page shows its shop, so I can visit each laptop's page to read the shop name instead. A sketch of the two stable endpoints follows.
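For reference, here is a minimal sketch of how the two stable endpoint URLs are assembled from an id list. Both templates, including the callback values, are copied from the full script below; the helper names are illustrative.

def comment_js_url(ids):
    # Review counts: ids are joined with plain commas
    return ('https://club.jd.com/comment/productCommentSummaries.action'
            '?my=pinglun&referenceIds=' + ','.join(ids) + '&callback=jQuery5043746')

def price_js_url(ids):
    # Prices: ids are prefixed with J_ and joined with %2C (a URL-encoded comma)
    return ('https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds='
            + '%2C'.join('J_' + i for i in ids))

print(comment_js_url(['100000323510', '100002368328']))
print(price_js_url(['100000323510', '100002368328']))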

4. Scraping strategy

(1) First scrape each listing page.
(2) From it, scrape every laptop's price, model name and review count, plus each laptop's detail URL.
(3) Scrape each laptop's own page to get its shop.
(4) Once all pages are collected, visualize the results.

IV. Hands-on: scraping JD laptop data with the urllib module and visualizing it

The crawler script (typing it out yourself as you read is recommended; you'll learn more that way):

# -*- coding: utf-8 -*-
import random
import urllib.request
import re
import time
from lxml import etree
from pyecharts import Bar
from pyecharts import Pie


# Pool of User-Agent strings; one is picked at random for every request
# to make the crawler look less like a bot.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]


def main():
    # Accumulators for every laptop across all pages
    allNames = []
    allCommentNums = {}
    allPrices = {}
    allShops = {}

    # Scrape the first 2 pages of laptop listings; raise the bound to collect
    # more pages (the original loop said range(0, 1), which only covered
    # page 1 despite the "first 2 pages" comment)
    for i in range(0, 2):
        # Listing URL pattern: https://list.jd.com/list.html?cat=670,671,672&page=<page number>
        print('Scraping page ' + str(i + 1) + '...')
        url = 'https://list.jd.com/list.html?cat=670,671,672&page=' + str(i + 1)
        get_page_data(url, allNames, allCommentNums, allPrices, allShops)

    # Everything above gathers the data; everything below visualizes it
    names = allNames
    commentNums = []
    for name in names:
        if allCommentNums[name] is None:
            commentNums.append(0)
        else:
            commentNums.append(int(allCommentNums[name]))
    prices = []
    for name in names:
        if allPrices[name] is None:
            prices.append(0)
        else:
            prices.append(float(allPrices[name]))
    shops = []
    for name in names:
        # Use a placeholder when the shop is unknown, so all four lists stay
        # aligned (the original skipped None shops, breaking the indexing below)
        if allShops[name] is not None:
            shops.append(allShops[name])
        else:
            shops.append('unknown')
    for i in range(0, len(names)):
        print(names[i])
        print(commentNums[i])
        print(prices[i])
        print(shops[i])

    # Bar chart of the prices
    tiaoxing(names, prices)

    # Pie chart of the shops: first count how many laptops each shop has
    shopNames = list(set(shops))
    nums = [0] * len(shopNames)
    for shop in shops:
        for i in range(0, len(shopNames)):
            if shop == shopNames[i]:
                nums[i] += 1
    bingtu(shopNames, nums)


def get_page_data(url, allNames, allCommentNums, allPrices, allShops):
    # Fetch the listing page and pull out every laptop's id, name and detail URL
    request = urllib.request.Request(url)
    request.add_header('User-Agent', random.choice(USER_AGENTS))
    data = urllib.request.urlopen(request, timeout=1).read().decode('utf-8', 'ignore')
    data = etree.HTML(data)
    ids = data.xpath('//a[@class="p-o-btn contrast J_contrast contrast-hide"]/@data-sku')
    names = data.xpath('//div[@class="p-name"]/a/em/text()')
    hrefs = data.xpath('//div[@class="p-name"]/a/@href')
    # Drop duplicate URLs
    print(len(hrefs))
    hrefs = list(set(hrefs))
    print(len(hrefs))
    # Complete each detail URL by prepending 'https:'
    for i in range(0, len(hrefs)):
        hrefs[i] = 'https:' + hrefs[i]

    # Build the URL of the js package holding each laptop's review count.
    # Format: https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds=100000323510,100002368328&callback=jQuery5043746
    # (the original built this with a loop that shadowed the str builtin)
    commentJS_url = ('https://club.jd.com/comment/productCommentSummaries.action'
                     '?my=pinglun&referenceIds=' + ','.join(ids) + '&callback=jQuery5043746')
    # Fetch that js package and extract each laptop's review count
    request2 = urllib.request.Request(commentJS_url)
    request2.add_header('User-Agent', random.choice(USER_AGENTS))
    data = urllib.request.urlopen(request2, timeout=1).read().decode('utf-8', 'ignore')
    pat = '{(.*?)}'
    commentStr = re.compile(pat).findall(data)  # one entry per product with all its review fields
    comments = {}
    for id in ids:
        for s in commentStr:
            if id in s:
                pat2 = '"CommentCount":(.*?),'
                comments[id] = re.compile(pat2).findall(s)[0]
    print('ids:', len(ids), ids)
    print('names:', len(names), names)
    print('review counts:', len(comments), comments)

    # Build the URL of the js package holding each laptop's price.
    # Format: https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds=J_7512626%2CJ_44354035037%2CJ_100003302532
    priceJS_url = ('https://p.3.cn/prices/mgets?callback=jQuery1702366&type=1&skuIds='
                   + '%2C'.join('J_' + id for id in ids))
    # Fetch that js package and extract each laptop's price
    request3 = urllib.request.Request(priceJS_url)
    request3.add_header('User-Agent', random.choice(USER_AGENTS))
    data = urllib.request.urlopen(request3, timeout=1).read().decode('utf-8', 'ignore')
    priceStr = re.compile(pat).findall(data)  # one entry per product with its price fields
    prices = {}
    for id in ids:
        for s in priceStr:
            if id in s:
                pat3 = '"p":"(.*?)"'
                prices[id] = re.compile(pat3).findall(s)[0]
    print('prices:', prices)

    # The shop name only appears on each laptop's own page, so visit every detail URL
    shops = {}
    for id in ids:
        for href in hrefs:
            if id in href:
                try:
                    request4 = urllib.request.Request(href)
                    request4.add_header('User-Agent', random.choice(USER_AGENTS))
                    data = urllib.request.urlopen(request4, timeout=1).read().decode('gbk', 'ignore')
                    shop = etree.HTML(data).xpath('//*[@id="crumb-wrap"]/div/div[2]/div[2]/div[1]/div/a/@title')
                    print(shop)
                    if shop == []:
                        shops[id] = None
                    else:
                        shops[id] = shop[0]
                    time.sleep(2)  # be polite between detail-page requests
                except Exception as e:
                    print(e)

    # Strip whitespace and newlines around the laptop names (the original
    # list comprehension discarded its result, so nothing was stripped)
    names = [name.strip() for name in names]
    # Collect the names
    for name in names:
        allNames.append(name)
    # Map name -> review count
    for i in range(0, len(ids)):
        count = comments.get(ids[i], '')  # .get: an id may have had no match in the js data
        allCommentNums[names[i]] = count if count != '' else None
    # Map name -> price
    for i in range(0, len(ids)):
        price = prices.get(ids[i], '')
        allPrices[names[i]] = price if price != '' else None
    # Map name -> shop
    for i in range(0, len(ids)):
        allShops[names[i]] = shops.get(ids[i])


def tiaoxing(names, prices):
    # Bar chart: X = laptop model, Y = price (old pyecharts 0.x API)
    bar = Bar("Laptop price chart", "X: model, Y: price")
    bar.add("laptops", names, prices)
    bar.show_config()
    bar.render(r"D:\scrapy\jingdong\prices.html")  # raw string: the original left the backslashes unescaped


def bingtu(shopNames, nums):
    # Pie chart of how many laptops each shop sells (old pyecharts 0.x API)
    attr = shopNames
    v1 = nums
    pie = Pie("Laptop shop pie chart")
    pie.add("", attr, v1, is_label_show=True)
    pie.show_config()
    pie.render(r"D:\scrapy\jingdong\shops.html")


if __name__ == '__main__':
    main()
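Note that the script targets the old pyecharts 0.x API, where Bar and Pie are imported from the top-level package and configured positionally. On pyecharts 1.x and later the imports and options differ; as a rough sketch, the equivalent of tiaoxing() would look like the code below (assuming the same names and prices lists built in main(); the function name tiaoxing_v1 is illustrative).

from pyecharts.charts import Bar
from pyecharts import options as opts

def tiaoxing_v1(names, prices):
    # pyecharts 1.x equivalent of tiaoxing(): X = model name, Y = price
    bar = (
        Bar()
        .add_xaxis(names)
        .add_yaxis("laptops", prices)
        .set_global_opts(title_opts=opts.TitleOpts(title="Laptop price chart",
                                                   subtitle="X: model, Y: price"))
    )
    bar.render("prices.html")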

V. Visualization Results

1. Run output

2. Visualizations

Bar chart of laptop prices (this is what tiaoxing() renders):

Pie chart of the shops:

You can see that Lenovo laptops sell the best.