哪吒: Data Extraction and Data Analysis
- October 5, 2019
- Notes
Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_43908900/article/details/100882598
哪吒 has been a huge hit recently, so let's analyze its movie reviews. Before we can analyze anything we need data, so we'll start with the scraping side: extracting the data. Without further ado, let's go.
First, we locate the movie's page at url = "https://maoyan.com/films/1211270" and scroll to the comment section to see what viewers are saying, as shown below.

Opening the developer tools with F12, we check whether the comment data is actually in the page, and it is.

The problem, though, is that the page only exposes a handful of comments, nowhere near enough for our analysis, so we have to find another way in.

The F12 dev tools have a mobile-device emulation mode. Switch it on, refresh the page, and scroll down: now hundreds of thousands of comments are visible. Click through, and in the Network tab an API shows up: url = "http://m.maoyan.com/review/v2/comments.json?movieId=1211270&userId=-1&offset=15&limit=15&ts=1568600356382&type=3". With this API in hand, we're in business.
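Before building the real scraper, a quick probe confirms the endpoint answers plain GET requests. This is only a minimal sketch: the query parameters are copied from the captured URL above, and treating ts as a millisecond timestamp is an assumption; the response schema is not inspected here.

```python
import time
import requests

# The v2 comments API captured in the Network tab, paged by offset/limit.
BASE = ('http://m.maoyan.com/review/v2/comments.json'
        '?movieId=1211270&userId=-1&offset={offset}&limit=15&ts={ts}&type=3')

for offset in range(0, 60, 15):  # the first four pages, 15 comments each
    url = BASE.format(offset=offset, ts=int(time.time() * 1000))
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=3)
    print(offset, resp.status_code, len(resp.text))
```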


But as the crawl goes on, this endpoint still won't hand over the complete data set. After some searching on Baidu, Google, and Bing, the trick turns out to be fetching comments by time window, which also keeps Maoyan from blocking us. So we switch to url = "http://m.maoyan.com/mmdb/comments/movie/1211270.json?_v_=yes&offset=0&startTime="
The result looks like this:

Now let's build the scraper:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: albert  time: 2019/9/3
import requests, json, time, csv
from fake_useragent import UserAgent   # random User-Agent strings
from datetime import datetime, timedelta


def get_content(url):
    '''Fetch the raw JSON text of one API page; return None on failure.'''
    ua = UserAgent().random
    try:
        return requests.get(url, headers={'User-Agent': ua}, timeout=3).text
    except requests.RequestException:
        return None


def process_data(html):
    '''Pull the fields we care about out of the JSON payload.'''
    data_set_list = []
    data_list = json.loads(html)['cmts']   # the comment array
    for data in data_list:
        data_set = [data['id'], data['nickName'], data['userLevel'],
                    data['cityName'], data['content'], data['score'],
                    data['startTime']]
        data_set_list.append(data_set)
    return data_set_list


if __name__ == '__main__':
    # Walk backwards in time: start from now, stop at the release window.
    start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    end_time = '2019-07-26 08:00:00'
    while start_time > end_time:
        # Build the URL; the space in the timestamp must be percent-encoded.
        url = ('http://m.maoyan.com/mmdb/comments/movie/1211270.json'
               '?_v_=yes&offset=0&startTime=' + start_time.replace(' ', '%20'))
        print('........')
        html = get_content(url)
        if html is None:          # one retry after a short pause
            time.sleep(0.5)
            html = get_content(url)
        if html is None:
            continue
        time.sleep(1)             # throttle so Maoyan doesn't block us
        comments = process_data(html)
        if comments:
            # Use the last comment's timestamp, minus one second, as the
            # next startTime so consecutive pages don't overlap.
            start_time = comments[-1][-1]
            start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') \
                         + timedelta(seconds=-1)
            start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')
            print(comments)
            # Append this batch to the CSV.
            with open('comments_1.csv', 'a', encoding='utf-8', newline='') as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(comments)
```
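As a quick sanity check on the output, we can reload the file with pandas (a small sketch assuming the CSV was written with the seven columns above):

```python
import pandas as pd

# Load the scraped comments and eyeball the row count and first few rows.
df = pd.read_csv('comments_1.csv',
                 names=['id', 'nickName', 'userLevel', 'cityName',
                        'content', 'score', 'startTime'])
print(len(df))       # close to 20,000 rows after a full run
print(df.head(3))
```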
----------------------------- Data Analysis -----------------------------
With close to twenty thousand records in hand, we move on to the analysis stage.
Tools: Jupyter; library: pyecharts v1.0. Note that pyecharts v1.0 is not backwards compatible with the old 0.x API, so we have to use the new chained-call style, shown in the sketch below.
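For reference, the v1 chained style looks like this (a toy example, not part of the analysis itself):

```python
from pyecharts.charts import Bar

# pyecharts v1 builds a chart by chaining configuration calls
# on the chart object instead of passing everything to one constructor.
bar = (
    Bar()
    .add_xaxis(["A", "B", "C"])
    .add_yaxis("demo", [1, 2, 3])
)
bar.render_notebook()
```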
First, let's look at the star-rating distribution for 哪吒, using a pandas groupby to count ratings and then folding them into the one-to-five-star buckets:
```python
from pyecharts import options as opts
from pyecharts.charts import Bar, Pie, Page, WordCloud
from pyecharts.globals import ThemeType, SymbolType
import numpy
import pandas as pd

df = pd.read_csv('comments_1.csv',
                 names=["id", "nickName", "userLevel", "cityName",
                        "content", "score", "startTime"])
attr = ["一星", "二星", "三星", "四星", "五星"]
score = df.groupby("score").size()   # comment count per raw score value
# Maoyan scores run from 0 to 5 in 0.5 steps (11 buckets);
# fold them into the five star levels.
value = [
    score.iloc[0] + score.iloc[1] + score.iloc[2],   # 0, 0.5, 1
    score.iloc[3] + score.iloc[4],                   # 1.5, 2
    score.iloc[5] + score.iloc[6],                   # 2.5, 3
    score.iloc[7] + score.iloc[8],                   # 3.5, 4
    score.iloc[9] + score.iloc[10],                  # 4.5, 5
]
```
```python
# Pie chart of the rating buckets.
# Workaround for now: pyecharts can't consume the data in `value` directly,
# so the counts are hard-coded here.
attr = ["一星", "二星", "三星", "四星", "五星"]
value = [286, 43, 175, 764, 10101]
pie = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add('', [list(z) for z in zip(attr, value)])
    .set_global_opts(title_opts=opts.TitleOpts(title='哪吒等级分析'))
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{c}"))
)
pie.render_notebook()
```
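The hard-coded list above is a workaround: the counts coming out of groupby are numpy integers, which pyecharts' JSON serializer typically rejects. A likely fix (an assumption, not verified against this exact dataset) is to cast them to plain Python ints before passing them to Pie:

```python
# Cast numpy int64 counts to built-in ints so pyecharts can serialize them;
# with this, the computed `value` from the groupby block can be used directly.
value = [int(v) for v in value]
```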
The rendered chart:

Next, the word-cloud analysis:
```python
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

df = pd.read_csv("comments_1.csv",
                 names=["id", "nickName", "userLevel", "cityName",
                        "content", "score", "startTime"])
comments = df["content"].tolist()

# Tokenize with jieba in precise (non-full) mode.
comment_after_split = jieba.cut(str(comments), cut_all=False)
words = " ".join(comment_after_split)   # join tokens with spaces

# Extend the default stopword list with common Chinese filler words.
stopwords = STOPWORDS.copy()
stopwords.update({"电影", "最后", "就是", "不过", "这个", "一个", "感觉",
                  "这部", "虽然", "不是", "真的", "觉得", "还是", "但是"})

bg_image = plt.imread('bg.jpg')   # mask image that shapes the cloud

wc = WordCloud(
    width=1024,
    height=768,
    background_color="white",
    max_words=200,
    mask=bg_image,                # use the image as the cloud's outline
    stopwords=stopwords,
    max_font_size=200,
    random_state=50,
    font_path='C:/Windows/Fonts/simkai.ttf'   # a system font with Chinese glyphs
).generate(words)

# Recolor the cloud from the mask image's own colors.
image_colors = ImageColorGenerator(bg_image)
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")   # hide the axes
plt.show()
wc.to_file("哪吒.png")
```
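Before rendering, it can be worth checking that the stopword list is actually catching the filler words. A small sketch reusing `words` and `stopwords` from the block above:

```python
from collections import Counter

# Count the remaining tokens after stopword filtering and print the top 20,
# so any obvious filler word can be added to the stopword set.
tokens = [w for w in words.split() if w not in stopwords and len(w) > 1]
print(Counter(tokens).most_common(20))
```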
The resulting word cloud:
