Ne Zha: Data Extraction and Data Analysis

  • October 5, 2019
  • Notes

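Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_43908900/article/details/100882598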

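Ne Zha has been hugely popular lately, so let's analyze its audience reviews. Analysis needs data, so we start with how to extract it with a crawler. Without further ado, let's go.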

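First, open the film's page, url = "https://maoyan.com/films/1211270", and scroll to the comment section to see what viewers are saying:

Press F12 to open the developer tools and check whether the page contains the comment data. It does.

The problem is that the page only exposes these few comments, which is nowhere near enough for an analysis, so we need another route.

The developer tools (F12) include a mobile device emulation mode. Turn it on, refresh the page, and scroll down to the link for viewing the several hundred thousand comments. After clicking through, the Network tab shows the API url = "http://m.maoyan.com/review/v2/comments.json?movieId=1211270&userId=-1&offset=15&limit=15&ts=1568600356382&type=3". With this API we can get to work.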
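To make the paging parameters in that URL concrete, here is a minimal sketch that builds the URL for successive pages. The helper name `build_page_url` is hypothetical, and treating `ts` as a millisecond timestamp is an assumption based on the value seen in the Network tab; only the parameter names come from the URL above.

```python
import time
from urllib.parse import urlencode

# Endpoint as seen in the Network tab
BASE_URL = "http://m.maoyan.com/review/v2/comments.json"


def build_page_url(movie_id, offset, limit=15, ts=None):
    """Return the comments URL for one page (hypothetical helper)."""
    if ts is None:
        # Assumption: ts is the current time in milliseconds
        ts = int(time.time() * 1000)
    params = {
        "movieId": movie_id,
        "userId": -1,
        "offset": offset,
        "limit": limit,
        "ts": ts,
        "type": 3,
    }
    return BASE_URL + "?" + urlencode(params)


# Three consecutive pages of 15 comments each
urls = [build_page_url(1211270, offset, ts=1568600356382) for offset in range(0, 45, 15)]
for u in urls:
    print(u)
```

Increasing `offset` by `limit` each time is the usual way to walk such an API, though how far this endpoint lets you page is not something this sketch can tell you.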

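However, crawling this way still does not return the complete data. After some searching on Baidu, Google, and Bing, we switch to fetching comments by time window, which also keeps Maoyan from blocking us. So we use this url="http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime="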
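The time-window idea can be sketched with two small functions: one that appends a timestamp to the URL, and one that moves the cursor back by one second past the oldest comment already seen. Both helper names are hypothetical, and the sketch assumes the API returns comments older than `startTime`, newest first, which is how the crawler in this post treats it.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
URL_TEMPLATE = (
    "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"
    "?_v_=yes&offset=0&startTime="
)


def build_url(movie_id, start_time):
    """Append the time cursor to the URL, encoding the space as %20."""
    return URL_TEMPLATE.format(movie_id=movie_id) + quote(start_time, safe=":")


def next_start_time(oldest_seen):
    """Return the cursor for the next request: one second before oldest_seen."""
    moved = datetime.strptime(oldest_seen, TIME_FORMAT) - timedelta(seconds=1)
    return moved.strftime(TIME_FORMAT)


print(build_url(1211270, "2019-09-03 12:00:00"))
print(next_start_time("2019-09-03 00:00:00"))
```

Stepping back by a whole second means comments sharing the oldest timestamp on a page can be skipped; that is a trade-off of this approach rather than something the API guarantees either way.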

The result looks like this:

Now let's build the crawler:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author:albert time:2019/9/3
import csv
import json
import time
from datetime import datetime, timedelta

import requests
from fake_useragent import UserAgent  # supplies a random User-Agent string


def get_content(url):
    '''Fetch the raw response text from the API; return None if the request fails.'''
    ua = UserAgent().random
    try:
        return requests.get(url, headers={'User-Agent': ua}, timeout=3).text
    except requests.RequestException:
        return None


def Process_data(html):
    '''Extract the fields we need from one page of comments.'''
    data_set_list = []
    try:
        # Parse the JSON; a missing 'cmts' key is treated as an empty page
        data_list = json.loads(html).get('cmts', [])
    except ValueError:
        return data_set_list
    for data in data_list:
        data_set = [data['id'], data['nickName'], data['userLevel'], data['cityName'],
                    data['content'], data['score'], data['startTime']]
        data_set_list.append(data_set)
    return data_set_list


if __name__ == '__main__':
    # Start from the current time and walk backwards
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2019-07-26 08:00:00'
    while start_time > end_time:
        # Build the URL for the current time cursor
        url = ('http://m.maoyan.com/mmdb/comments/movie/1211270.json'
               '?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20'))
        print('........')
        html = get_content(url)
        if html is None:
            # The request failed: wait briefly and retry once
            time.sleep(0.5)
            html = get_content(url)
        else:
            time.sleep(1)
        if html is None:
            break
        comments = Process_data(html)
        if not comments:
            # Nothing came back; stop instead of requesting the same cursor forever
            break
        # Move the cursor to one second before the oldest comment on this page
        oldest = datetime.strptime(comments[-1][-1], '%Y-%m-%d %H:%M:%S')
        start_time = (oldest - timedelta(seconds=1)).strftime('%Y-%m-%d %H:%M:%S')
        print(comments)
        # Append this page to the CSV file
        with open("comments_1.csv", "a", encoding='utf-8', newline='') as csvfile:
            writer = csv.writer(csvfile)
            writer.writerows(comments)

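----------- Data analysis -----------

With close to 20,000 records in hand, we can move on to the analysis:

Tool: Jupyter. Library: pyecharts v1.0. pyecharts v1.0 is not backward compatible with earlier versions, so we need to use the new chained-call style:

Let's first look at the distribution of Ne Zha's star ratings, using pandas to group and count the scores into 1 to 5 stars: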

from pyecharts import options as opts
from pyecharts.charts import Pie
from pyecharts.globals import ThemeType
import pandas as pd

# The CSV has seven columns, in the order written by the crawler
df = pd.read_csv('comments_1.csv', names=["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"])
attr = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
score = df.groupby("score").size()  # number of comments per score value
# iloc[0]..iloc[10] are the 11 distinct score values in ascending order,
# merged into five star buckets
value = [
    score.iloc[0] + score.iloc[1] + score.iloc[2],
    score.iloc[3] + score.iloc[4],
    score.iloc[5] + score.iloc[6],
    score.iloc[7] + score.iloc[8],
    score.iloc[9] + score.iloc[10],
]

# Pie chart
# Temporary workaround: the data in value could not be used directly, so it is hard-coded here
attr = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
value = [286, 43, 175, 764, 10101]

pie = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add('', [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title='Ne Zha rating breakdown'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c}"))
)
pie.render_notebook()
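The positional bucketing above only works when every score value occurs in the data. A small sketch of a value-based alternative follows; the function name `star_counts` is hypothetical, and it assumes scores run from 0 to 5 in steps of 0.5, so that rounding up gives the star bucket. It returns plain Python integers, which may be one way around the need to hard-code the values, though that has not been checked against pyecharts here.

```python
import math
from collections import Counter


def star_counts(scores):
    """Count comments per star bucket and return five plain ints (1 to 5 stars)."""
    counts = Counter()
    for s in scores:
        # Assumption: scores are 0, 0.5, ..., 5; round up and clamp into 1..5
        star = min(5, max(1, math.ceil(s)))
        counts[star] += 1
    return [counts[star] for star in range(1, 6)]


sample = [0, 0.5, 1, 1.5, 2, 4.5, 5, 5]
print(star_counts(sample))  # [3, 2, 0, 0, 3]
```

With the DataFrame from this post, the input would be `df["score"].tolist()`.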

Result:

Next, the word cloud:

import jieba
import matplotlib.pyplot as plt  # plotting
import pandas as pd
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

df = pd.read_csv("comments_1.csv", names=["id", "nickName", "userLevel", "cityName", "content", "score", "startTime"])

comments = df["content"].tolist()

# Segment the text with jieba in precise mode (cut_all=False)
comment_after_split = jieba.cut(" ".join(str(c) for c in comments), cut_all=False)
words = " ".join(comment_after_split)  # join the tokens with spaces

stopwords = STOPWORDS.copy()
stopwords.update({"電影","最後","就是","不過","這個","一個","感覺","這部","雖然","不是","真的","覺得","還是","但是"})

bg_image = plt.imread('bg.jpg')
# Generate the word cloud
wc = WordCloud(
    width=1024,
    height=768,
    background_color="white",
    max_words=200,
    mask=bg_image,            # use the image as the mask
    stopwords=stopwords,
    max_font_size=200,
    random_state=50,
    font_path='C:/Windows/Fonts/simkai.ttf'   # a system font that can render Chinese
).generate(words)

# Color generator based on the colors of the background image
image_colors = ImageColorGenerator(bg_image)
# Draw the cloud
plt.imshow(wc.recolor(color_func=image_colors))
# Hide the axes
plt.axis("off")
plt.show()
# Save the word cloud
wc.to_file("哪吒.png")
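Before drawing, it can help to check which words will dominate the cloud. The following is a minimal sketch that counts already segmented tokens after removing stopwords; the helper name `top_words` and the choice to drop single-character tokens are assumptions, not part of the code above.

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens, skipping stopwords and single characters."""
    kept = [t.strip() for t in tokens]
    kept = [t for t in kept if len(t) > 1 and t not in stopwords]
    return Counter(kept).most_common(n)


# Tokens as a segmenter might produce them for a few short comments
tokens = ["哪吒", "電影", "好看", "哪吒", "特效", "好看", "哪吒", "的"]
print(top_words(tokens, {"電影"}, n=2))  # [('哪吒', 3), ('好看', 2)]
```

In the notebook, `tokens` could be the output of `jieba.cut`, and the resulting counts, converted with `dict(...)`, could be passed to `WordCloud.generate_from_frequencies` instead of calling `generate` on the joined string.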

The result looks like this: