Data Analysis in Practice: Using pandas on Restaurant Rating Data

February 25, 2022
Notes
pandas, Python, Python programming, 程式那些事

Introduction
About the restaurant rating data
Analyzing the rating data

Introduction

To get more practice applying pandas to real-world data analysis, this post walks through an analysis of US restaurant rating data with pandas.

About the restaurant rating data

The data comes from the UCI ML Repository. It contains a little over a thousand records (1,161 rows) with 5 attributes:

userID: user ID

placeID: restaurant ID

rating: overall rating

food_rating: food rating

service_rating: service rating

We read the data with pandas:

import pandas as pd

path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df

	userID	placeID	rating	food_rating	service_rating
0	U1077	135085	2	2	2
1	U1077	135038	2	2	1
2	U1077	132825	2	2	2
3	U1077	135060	1	2	2
4	U1068	135104	1	1	2
…	…	…	…	…	…
1156	U1043	132630	1	1	1
1157	U1011	132715	1	1	0
1158	U1068	132733	1	1	0
1159	U1068	132594	1	1	1
1160	U1068	132660	0	0	0

1161 rows × 5 columns
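Before aggregating, it can help to confirm that the columns hold what the attribute list describes. The following is a minimal sketch, not part of the original analysis: it retypes the first five rows shown above into a small frame (named `sample` here purely for illustration) so that it runs without the CSV, then checks for missing values and the range of each rating column.

```python
import pandas as pd

# First five rows from the output above, typed in by hand so the
# sketch runs without the CSV file. `sample` is an illustrative name.
sample = pd.DataFrame({
    "userID": ["U1077", "U1077", "U1077", "U1077", "U1068"],
    "placeID": [135085, 135038, 132825, 135060, 135104],
    "rating": [2, 2, 2, 1, 1],
    "food_rating": [2, 2, 2, 2, 1],
    "service_rating": [2, 1, 2, 2, 2],
})

rating_cols = ["rating", "food_rating", "service_rating"]

# Number of missing values in each column
missing = sample.isna().sum()

# Smallest and largest value of each rating column
value_range = sample[rating_cols].agg(["min", "max"])

print(missing)
print(value_range)
```

The same two checks can be run on the full `df` once the file has been loaded.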

Analyzing the rating data

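Suppose we care about each restaurant's overall rating and food rating. We can start with the mean of both for every restaurant, computed here with the pivot_table method: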

mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
                                 aggfunc='mean')
mean_ratings[:5]

	food_rating	rating
placeID
132560	1.00	0.50
132561	1.00	0.75
132564	1.25	1.25
132572	1.00	1.00
132583	1.00	1.00
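For a plain per-group mean, `pivot_table` and `groupby(...).mean()` should give the same numbers. The sketch below illustrates this on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data: two restaurants with a handful of ratings each
toy = pd.DataFrame({
    "placeID": [1, 1, 2, 2, 2],
    "rating": [0, 1, 2, 2, 1],
    "food_rating": [1, 1, 2, 1, 2],
})

# Mean of both columns per restaurant, as in the post
via_pivot = toy.pivot_table(values=["rating", "food_rating"],
                            index="placeID", aggfunc="mean")

# The same aggregation written as a groupby
via_groupby = toy.groupby("placeID")[["food_rating", "rating"]].mean()

print(via_pivot)
print(via_groupby)
```

Which form to use is largely a matter of taste. `pivot_table` becomes more convenient once a `columns` argument is needed to spread a second key across the columns.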

Next, count how many ratings each placeID received:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]

placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64
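Counting rows per key can also be written with `value_counts`. The sketch below, on hypothetical toy data, shows that it yields the same counts as `groupby(...).size()` once the result is sorted by index.

```python
import pandas as pd

# Hypothetical toy data: placeID 1 has two ratings, placeID 2 has three
toy = pd.DataFrame({
    "placeID": [1, 1, 2, 2, 2],
    "rating": [0, 1, 2, 2, 1],
})

# Number of rows per restaurant, ordered by placeID
counts_groupby = toy.groupby("placeID").size()

# value_counts orders by count (descending) by default,
# so sort by index to get the same order as the groupby
counts_value_counts = toy["placeID"].value_counts().sort_index()

print(counts_groupby)
print(counts_value_counts)
```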

A mean based on very few ratings is not reliable, so we keep only the restaurants that received at least 4 ratings:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place

Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)

Select the mean ratings for those restaurants:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings

	food_rating	rating
placeID
132560	1.000000	0.500000
132561	1.000000	0.750000
132564	1.250000	1.250000
132572	1.000000	1.000000
132583	1.000000	1.000000
…	…	…
135088	1.166667	1.000000
135104	1.428571	0.857143
135106	1.200000	1.200000
135108	1.181818	1.181818
135109	1.250000	1.000000

124 rows × 2 columns
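The filter above is applied to the aggregated table. If the individual rating rows for the active restaurants are needed as well, one option is `Series.isin`. The following sketch uses hypothetical toy data and a threshold of 2, chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: placeID 3 has a single rating
toy = pd.DataFrame({
    "userID": ["U1", "U2", "U1", "U2", "U3", "U1"],
    "placeID": [1, 1, 2, 2, 2, 3],
    "rating": [0, 1, 2, 2, 1, 2],
})

counts = toy.groupby("placeID").size()

# placeIDs with at least 2 ratings (threshold chosen for the toy data)
active = counts.index[counts >= 2]

# Keep only the raw rows that belong to those restaurants
active_rows = toy[toy["placeID"].isin(active)]

print(active_rows)
```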

Sort by rating and take the 10 highest-rated restaurants:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]

	food_rating	rating
placeID
132955	1.800000	2.000000
135034	2.000000	2.000000
134986	2.000000	2.000000
132922	1.500000	1.833333
132755	2.000000	1.800000
135074	1.750000	1.750000
135013	2.000000	1.750000
134976	1.750000	1.750000
135055	1.714286	1.714286
135075	1.692308	1.692308
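Sorting and slicing is one way to get a top-N list; `DataFrame.nlargest` is another. The sketch below compares the two on a hypothetical table of means whose values are made up and contain no ties. When ratings are tied, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical per-restaurant means, indexed by placeID
means = pd.DataFrame(
    {"food_rating": [1.0, 2.0, 1.5, 1.2],
     "rating": [0.5, 2.0, 1.8, 1.1]},
    index=pd.Index([101, 102, 103, 104], name="placeID"),
)

# Approach used in the post: sort everything, then slice
top_sorted = means.sort_values(by="rating", ascending=False)[:2]

# Alternative: ask directly for the 2 rows with the largest rating
top_nlargest = means.nlargest(2, "rating")

print(top_sorted)
print(top_nlargest)
```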

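We can also compute the difference between the mean overall rating and the mean food rating and store it in a column named diff. Sorting by diff in ascending order lists first the restaurants whose food rating most exceeds their overall rating: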

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

	food_rating	rating	diff
placeID
132667	2.000000	1.250000	-0.750000
132594	1.200000	0.600000	-0.600000
132858	1.400000	0.800000	-0.600000
135104	1.428571	0.857143	-0.571429
132560	1.000000	0.500000	-0.500000
135027	1.375000	0.875000	-0.500000
132740	1.250000	0.750000	-0.500000
134992	1.500000	1.000000	-0.500000
132706	1.250000	0.750000	-0.500000
132870	1.000000	0.600000	-0.400000

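Reversing the order gives the 10 restaurants with the largest positive difference, where the overall rating most exceeds the food rating: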

sorted_by_diff[::-1][:10]

	food_rating	rating	diff
placeID
134987	0.500000	1.000000	0.500000
132937	1.000000	1.500000	0.500000
135066	1.000000	1.500000	0.500000
132851	1.000000	1.428571	0.428571
135049	0.600000	1.000000	0.400000
132922	1.500000	1.833333	0.333333
135030	1.333333	1.583333	0.250000
135063	1.000000	1.250000	0.250000
132626	1.000000	1.250000	0.250000
135000	1.000000	1.250000	0.250000
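The two listings above look at each end of the sorted diff column separately. To rank restaurants by the size of the gap regardless of its sign, one option is to sort on the absolute value. The sketch below uses the `key` argument of `sort_values`, which is available in pandas 1.1 and later; the table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical diff values, indexed by placeID
gaps = pd.DataFrame(
    {"diff": [-0.75, 0.5, 0.1, -0.2]},
    index=pd.Index([201, 202, 203, 204], name="placeID"),
)

# Sort by |diff| so large gaps in either direction come first
by_abs_gap = gaps.sort_values(by="diff", key=lambda s: s.abs(),
                              ascending=False)

print(by_abs_gap)
```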

Compute the standard deviation of rating for each restaurant and select the 10 largest:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]

placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64
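When reading these numbers, keep in mind that pandas' `std` computes the sample standard deviation by default (`ddof=1`), which comes out larger than the population value for small groups. The sketch below shows the difference on four hypothetical ratings; they are not taken from the dataset.

```python
import pandas as pd

# Four hypothetical ratings for one restaurant
ratings = pd.Series([0, 2, 0, 2])

# Default: sample standard deviation (divides by n - 1)
sample_std = ratings.std()

# Population standard deviation (divides by n)
population_std = ratings.std(ddof=0)

print(sample_std, population_std)
```

With only a few ratings per restaurant, the choice of `ddof` changes the values noticeably, although it would not change the ranking among restaurants that have the same number of ratings.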

This article is also published at //www.flydean.com/02-pandas-restaurant/

