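A Beginner-Friendly Hands-On Pandas Tutorial (Part 1)

October 7, 2019
Notes

This post is a hands-on walkthrough of Pandas. Because of its length it is split into two parts; this is Part 1.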

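1 Introduction to the data structures

pandas has two central data structures: the Series and the DataFrame. A Series resembles a one-dimensional NumPy array: it works with the functions and methods available for one-dimensional arrays, its values can also be retrieved by index label, and it aligns automatically on its index during arithmetic. A DataFrame resembles a two-dimensional NumPy array: it likewise works with NumPy array functions and methods, and it adds labelled rows and columns, where each column can hold a different data type.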
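To make the comparison with NumPy concrete, here is a minimal sketch. The values and the labels "a", "b", "c" are made up for illustration; it only shows that a NumPy function applied to a Series keeps the index, and that a value can be read by label or by position.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three labelled values.
s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# A NumPy function accepts a Series and returns a Series with the same index.
roots = np.sqrt(s)

by_label = roots["b"]        # label-based access
by_position = roots.iloc[1]  # position-based access

print(roots)
print(by_label, by_position)
```

Both lookups return the same element here, because label "b" sits at position 1.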

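1.1 Three ways to create a Series

Creating a Series from a one-dimensional array

import pandas as pd
import numpy as np

arr1 = np.arange(10)
print("Array arr1:", arr1)
print("Type of arr1:", type(arr1))
s1 = pd.Series(arr1)
print("Series s1:", s1, sep="\n")
print("Type of s1:", type(s1))

Array arr1: [0 1 2 3 4 5 6 7 8 9]
Type of arr1: <class 'numpy.ndarray'>
Series s1:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32
Type of s1: <class 'pandas.core.series.Series'>

The integer dtype shown here (int32) depends on the platform and NumPy version; many current setups show int64 instead.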

Creating a Series from a dictionary

dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
print("Dictionary dict1:", dict1)
print("Type of dict1:", type(dict1))
s2 = pd.Series(dict1)
print("Series s2:", s2, sep="\n")
print("Type of s2:", type(s2))

Dictionary dict1: {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
Type of dict1: <class 'dict'>
Series s2:
a    1
b    2
c    3
d    4
e    5
dtype: int64
Type of s2: <class 'pandas.core.series.Series'>

Creating a Series from an existing DataFrame

This method relies on the DataFrame, so it is covered at the end of section 1.2, after DataFrames have been introduced.

1.2 Three ways to create a DataFrame

Creating a DataFrame from a two-dimensional array

print("First way to create a DataFrame")
arr2 = np.arange(12).reshape(4, 3)
print("Array 2:", arr2, sep="\n")
print("Type of array 2:", type(arr2))

df1 = pd.DataFrame(arr2)
print("DataFrame 1:", df1, sep="\n")
print("Type of DataFrame 1:", type(df1))

First way to create a DataFrame
Array 2:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Type of array 2: <class 'numpy.ndarray'>
DataFrame 1:
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
Type of DataFrame 1: <class 'pandas.core.frame.DataFrame'>

Creating a DataFrame from a dictionary of lists

print("Second way to create a DataFrame")
dict2 = {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
print("Dictionary 2 (dictionary of lists):", dict2)
print("Type of dictionary 2:", type(dict2))

df2 = pd.DataFrame(dict2)
print("DataFrame 2:", df2, sep="\n")
print("Type of DataFrame 2:", type(df2))

Second way to create a DataFrame
Dictionary 2 (dictionary of lists): {'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]}
Type of dictionary 2: <class 'dict'>
DataFrame 2:
   a  b   c   d
0  1  5   9  13
1  2  6  10  14
2  3  7  11  15
3  4  8  12  16
Type of DataFrame 2: <class 'pandas.core.frame.DataFrame'>

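Creating a DataFrame from a nested dictionary

dict3 = {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4},
         'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8},
         'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
print("Dictionary 3 (nested dictionary):", dict3)
print("Type of dictionary 3:", type(dict3))

df3 = pd.DataFrame(dict3)
print("DataFrame 3:", df3, sep="\n")
print("Type of DataFrame 3:", type(df3))

Dictionary 3 (nested dictionary): {'one': {'a': 1, 'b': 2, 'c': 3, 'd': 4}, 'two': {'a': 5, 'b': 6, 'c': 7, 'd': 8}, 'three': {'a': 9, 'b': 10, 'c': 11, 'd': 12}}
Type of dictionary 3: <class 'dict'>
DataFrame 3:
   one  three  two
a    1      9    5
b    2     10    6
c    3     11    7
d    4     12    8
Type of DataFrame 3: <class 'pandas.core.frame.DataFrame'>

The outer keys become column labels and the inner keys become row labels. The column order shown (one, three, two) comes from an older environment that sorted the keys; with Python 3.7+ and current pandas the columns follow insertion order (one, two, three).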

Now that DataFrames have been introduced, here is how to create a Series from one.

s3 = df3['one']   # select the first column of df3 by its label
print("Series 3:", s3, sep="\n")
print("Type of Series 3:", type(s3))
print("------------------------------------------------")
s4 = df3.iloc[0]  # select the first row of df3 by position; same as df3.loc['a']
print("Series 4:", s4, sep="\n")
print("Type of Series 4:", type(s4))

Series 3:
a    1
b    2
c    3
d    4
Name: one, dtype: int64
Type of Series 3: <class 'pandas.core.series.Series'>
------------------------------------------------
Series 4:
one      1
three    9
two      5
Name: a, dtype: int64
Type of Series 4: <class 'pandas.core.series.Series'>

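2 The data index

Both DataFrames and Series show a leftmost column that is not part of the original data. This is the index, introduced next. The index is used to retrieve target data, which can then be processed further.

2.1 Getting data by index position or index label

s5 = pd.Series(np.array([1, 2, 3, 4, 5, 6]))
print(s5)  # when no index is specified, the Series gets an automatic integer index starting at 0

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int32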

Getting the index of a Series through the index attribute

s5.index

RangeIndex(start=0, stop=6, step=1)

Assigning a new index

s5.index = ['a', 'b', 'c', 'd', 'e', 'f']
s5

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int32

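Getting data through the index

Because s5 now has string labels, integer positions are passed to iloc; recent pandas versions deprecate plain s5[3] for positional access on a labelled Series.

s5.iloc[3]           # by position: the fourth element

s5['e']              # by label

s5.iloc[[1, 3, 5]]   # several positions at once

b    2
d    4
f    6
dtype: int32

s5[:4]               # positional slice: the stop position is excluded

a    1
b    2
c    3
d    4
dtype: int32

s5['c':]             # label slice from 'c' to the end

c    3
d    4
e    5
f    6
dtype: int32

s5['b':'e']          # label slice: the value at the end label is also returned

b    2
c    3
d    4
e    5
dtype: int32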
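The difference between positional and label slicing is easy to miss, so here is a small side-by-side sketch. It rebuilds a Series like s5 under the name s and uses the explicit iloc and loc accessors; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=list("abcdef"))

# Positional slice: positions 1, 2, 3 are returned; stop position 4 is excluded.
by_position = s.iloc[1:4]

# Label slice: labels b, c, d, e are returned; the stop label "e" is included.
by_label = s.loc["b":"e"]

print(list(by_position.index))
print(list(by_label.index))
```

Using iloc and loc explicitly makes it clear which rule applies, which helps once an index contains integers that could be read either way.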

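2.2 Automatic alignment

# When arithmetic is applied to two Series, values are matched by index label rather than by position
s6 = pd.Series(np.array([10, 15, 20, 30, 55, 80]), index=['a', 'b', 'c', 'd', 'e', 'f'])
print("Series 6:", s6, sep="\n")
s7 = pd.Series(np.array([12, 11, 13, 15, 14, 16]), index=['a', 'c', 'g', 'b', 'd', 'f'])
print("Series 7:", s7, sep="\n")

print(s6 + s7)
# s6 has no label 'g' and s7 has no label 'e', so the result contains two missing values (NaN).
# The arithmetic aligned the two Series automatically.
# For DataFrames, alignment applies to both the row index and the column index; a DataFrame generalizes a two-dimensional array.
print(s6 / s7)

Series 6:
a    10
b    15
c    20
d    30
e    55
f    80
dtype: int32
Series 7:
a    12
c    11
g    13
b    15
d    14
f    16
dtype: int32
a    22.0
b    30.0
c    31.0
d    44.0
e     NaN
f    96.0
g     NaN
dtype: float64
a    0.833333
b    1.000000
c    1.818182
d    2.142857
e         NaN
f    5.000000
g         NaN
dtype: float64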
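Two follow-ups to the alignment example may be useful. The sketch below first uses Series.add with fill_value=0 to avoid NaN for labels present on only one side, and then shows alignment on both rows and columns for DataFrames. The frames left and right are hypothetical and were made up to share exactly one row label and one column label.

```python
import pandas as pd

s6 = pd.Series([10, 15, 20, 30, 55, 80], index=list("abcdef"))
s7 = pd.Series([12, 11, 13, 15, 14, 16], index=["a", "c", "g", "b", "d", "f"])

# A label missing from one side is treated as 0, so "e" and "g"
# keep their single available value instead of becoming NaN.
filled = s6.add(s7, fill_value=0)

# Hypothetical frames that share only row "y" and column "B".
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]}, index=["y", "z"])
total = left + right  # aligned on row labels and on column labels

print(filled)
print(total)
```

In total, only the cell at row "y", column "B" has a value on both sides; every other cell is NaN. Whether filling with 0 is appropriate depends on what a missing label means in the data at hand.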

3 Querying data with pandas

Boolean indexing selects a targeted subset of the original data, such as specific rows or specific columns.

test_data = pd.read_csv('test_set.csv')
# test_data.drop(['ID'], inplace=True, axis=1)
test_data.head()

Converting non-numeric features to numbers

# Replace each categorical column with integer codes that start at 1
categorical_cols = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'poutcome']
for col in categorical_cols:
    codes, uniques = pd.factorize(test_data[col])  # codes start at 0, in order of first appearance
    test_data[col] = codes + 1

test_data.head()
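Since test_set.csv is not included here, the following sketch shows what pd.factorize returns on a small made-up column. The job labels are hypothetical stand-ins, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as 'job'.
jobs = pd.Series(["admin", "technician", "admin", "services", "technician"])

codes, uniques = pd.factorize(jobs)

print(codes)          # one integer code per row, numbered by first appearance
print(list(uniques))  # the category that each code stands for
print(codes + 1)      # shifted so that codes start at 1, as in the tutorial
```

By default factorize marks missing values with the code -1, so after adding 1 they would appear as 0. The codes carry no ordering beyond order of appearance, which matters when they are fed to a model that treats them as magnitudes.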

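Showing the first 5 rows

test_data.head()

Showing the last 5 rows

test_data.tail()

Selecting specific rows

test_data.iloc[[0, 2, 4, 5, 7]]

Selecting specific columns

test_data[['age', 'job', 'marital']].head()

Selecting specific rows and columns

test_data.loc[[0, 2, 4, 5, 7], ['age', 'job', 'marital']]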

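Selecting records where age is 51

# Boolean indexing selects a subset of the data
test_data[test_data['age'] == 51].head()

Selecting records where age is 51 and the job code is 5 or higher

test_data[(test_data['age'] == 51) & (test_data['job'] >= 5)].head()

Selecting records where age is 51 and the job code is 5 or higher, keeping only specified columns

# Keep only education, housing, loan, contact and poutcome
test_data.loc[(test_data['age'] == 51) & (test_data['job'] >= 5),
              ['education', 'housing', 'loan', 'contact', 'poutcome']].head()

As shown, when a query combines several conditions, each condition on either side of & or | must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as == and >=.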
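The following self-contained sketch repeats the multi-condition filter on a tiny made-up table, so it can be run without the CSV file. The frame people and its values are hypothetical; DataFrame.query is shown as an alternative way to write the same filter.

```python
import pandas as pd

# Hypothetical miniature of the tutorial's table (values are made up).
people = pd.DataFrame({
    "age": [51, 34, 51, 51],
    "job": [7, 5, 2, 9],
    "housing": [1, 2, 1, 2],
})

# Each comparison is parenthesised before being combined with &.
mask = (people["age"] == 51) & (people["job"] >= 5)

# .loc takes the row mask and the column list in a single step.
subset = people.loc[mask, ["job", "housing"]]

# DataFrame.query expresses the same row filter as a string.
same_rows = people.query("age == 51 and job >= 5")

print(subset)
print(same_rows)
```

Without the parentheses, Python would evaluate 51 & people["job"] first, which is not the intended comparison.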

4 Statistical analysis of DataFrames

Pandas provides many functions for descriptive statistics, including sum, mean, minimum, and maximum.

a = np.random.normal(size=10)
d1 = pd.Series(2 * a + 3)
d2 = np.random.f(2, 4, size=10)
d3 = np.random.randint(1, 100, size=10)
print(d1)
print(d2)
print(d3)

0    5.811077
1    2.963418
2    2.295078
3    0.279647
4    6.564293
5    1.146455
6    1.903623
7    1.157710
8    2.921304
9    2.397009
dtype: float64
[0.18147396 0.48218962 0.42565903 0.10258942 0.55299842 0.10859328
 0.66923199 1.18542009 0.12053079 4.64172891]
[33 17 71 45 33 83 68 41 69 23]
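The draws above are unseeded, so they change on every run and the printed numbers cannot be reproduced exactly. A minimal sketch of a reproducible variant follows, assuming NumPy's Generator API; the seed 42 is an arbitrary choice, and the values it produces will differ from the ones printed in this post.

```python
import numpy as np
import pandas as pd

# The seed value 42 is arbitrary and used only for illustration.
rng = np.random.default_rng(42)
d1 = pd.Series(2 * rng.normal(size=10) + 3)
d2 = rng.f(2, 4, size=10)
d3 = rng.integers(1, 100, size=10)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(42)
d1_again = pd.Series(2 * rng_again.normal(size=10) + 3)

print(d1.equals(d1_again))  # True
```

With a fixed seed, every statistic computed afterwards refers to the same ten values from run to run.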

Count of non-null elements

d1.count()

Minimum

d1.min()

0.279647

Maximum

d1.max()

6.564293

Index label of the minimum

d1.idxmin()

Index label of the maximum

d1.idxmax()

10% quantile

d1.quantile(0.1)

1.059774

Sum

d1.sum()

27.43961378467516

Mean

d1.mean()

2.743961378467515

Median

d1.median()

2.3460435427041384

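Mode

d1.mode()

0    0.279647
1    1.146455
2    1.157710
3    1.903623
4    2.295078
5    2.397009
6    2.921304
7    2.963418
8    5.811077
9    6.564293
dtype: float64

Every value in d1 occurs exactly once, so mode() returns all ten values, sorted in ascending order.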
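The mode is more informative on data with repeated values. The sketch below uses two small made-up Series, votes and tied, to show a single mode and a tie.

```python
import pandas as pd

# Hypothetical discrete data in which 3 occurs most often.
votes = pd.Series([3, 1, 3, 2, 1, 3])
single_mode = votes.mode().tolist()

# With a tie, every tied value is returned, in ascending order.
tied = pd.Series([2, 1, 2, 1, 5])
tied_modes = tied.mode().tolist()

print(single_mode)  # [3]
print(tied_modes)   # [1, 2]
```

mode() always returns a Series because of this possibility of ties, even when there is only one most frequent value.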

Variance

d1.var()

4.027871738323722

Standard deviation

d1.std()

2.0069558386580715

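Mean absolute deviation

(d1 - d1.mean()).abs().mean()  # Series.mad() was removed in pandas 2.0; this is the equivalent computation

1.456849211331346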

Skewness

d1.skew()

1.0457755613918738

Kurtosis

d1.kurt()

0.39322767370407874

Printing several descriptive statistics at once

d1.describe()

count    10.000000
mean      2.743961
std       2.006956
min       0.279647
25%       1.344189
50%       2.346044
75%       2.952890
max       6.564293
dtype: float64

# Custom function that gathers all of these descriptive statistics in one place
def stats(x):
    mad = (x - x.mean()).abs().mean()  # mean absolute deviation; Series.mad() was removed in pandas 2.0
    return pd.Series(
        [x.count(), x.min(), x.idxmin(), x.quantile(.25), x.median(),
         x.quantile(.75), x.mean(), x.max(), x.idxmax(), mad, x.var(), x.std(), x.skew(), x.kurt()],
        index=['Count', 'Min', 'Which_Min', 'Q1', 'Median', 'Q3', 'Mean', 'Max', 'Which_Max',
               'Mad', 'Var', 'Std', 'Skew', 'Kurt'])

stats(d1)

Count        10.000000
Min           0.279647
Which_Min     3.000000
Q1            1.344189
Median        2.346044
Q3            2.952890
Mean          2.743961
Max           6.564293
Which_Max     4.000000
Mad           1.456849
Var           4.027872
Std           2.006956
Skew          1.045776
Kurt          0.393228
dtype: float64

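For numeric data, these functions compute descriptive statistics directly, showing the range, magnitude, and variability of the values, which helps in deciding which kind of model suits the data later on.

# In practice the data is often a DataFrame of numeric columns; apply runs the stats function on every column
df = pd.DataFrame(np.array([d1, d2, d3]).T, columns=['x1', 'x2', 'x3'])  # build a DataFrame from d1, d2, d3 created earlier
print(df.head())
df.apply(stats)

         x1        x2    x3
0  5.811077  0.181474  33.0
1  2.963418  0.482190  17.0
2  2.295078  0.425659  71.0
3  0.279647  0.102589  45.0
4  6.564293  0.552998  33.0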
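To show the shape of what apply returns, here is a reduced sketch on fixed data. The function summary is a hypothetical, shortened cousin of stats, and the frame values are made up so the result can be checked by hand.

```python
import pandas as pd

def summary(x):
    """Collect a few descriptive statistics for one column."""
    return pd.Series(
        [x.count(), x.min(), x.median(), x.mean(), x.max()],
        index=["Count", "Min", "Median", "Mean", "Max"],
    )

# Hypothetical numeric frame; the values are made up.
frame = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 60.0],
})

# apply calls summary once per column and stacks the results side by side.
result = frame.apply(summary)
print(result)
```

The result has one row per statistic and one column per input column, so result.loc["Mean", "x2"] reads a single value.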

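The steps above produce descriptive statistics for numeric data, but they do not suit discrete (categorical) data. To get the number of observations, the number of unique values, the most frequent value, and its frequency for a discrete variable, the describe method is enough.

train_data = pd.read_csv('train_set.csv')
# train_data.drop(['ID'], inplace=True, axis=1)
train_data.head()

train_data['job'].describe()  # description of discrete data

count           25317
unique             12
top       blue-collar
freq             5456
Name: job, dtype: object

test_data['job'].describe()  # description of numeric data (job was converted to integer codes above)

count    10852.000000
mean         5.593255
std          2.727318
min          1.000000
25%          3.000000
50%          6.000000
75%          8.000000
max         12.000000
Name: job, dtype: float64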
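The contrast between the two describe outputs can be reproduced without the CSV files. In this sketch the job labels are hypothetical, and the integer codes are produced with pd.factorize plus 1, mirroring the encoding step used earlier.

```python
import pandas as pd

# Hypothetical job labels standing in for the 'job' column.
jobs = pd.Series(["blue-collar", "admin", "blue-collar", "services", "blue-collar"])

# Text data: describe reports count, unique, top and freq.
text_summary = jobs.describe()

# The same column as integer codes: describe reports mean, std and quartiles.
codes = pd.Series(pd.factorize(jobs)[0] + 1)
numeric_summary = codes.describe()

print(text_summary)
print(numeric_summary)
```

The numeric summary of an encoded category is computed without complaint, but a value such as the mean of job codes has no real interpretation, so the text form is usually the one to describe.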

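Beyond these simple descriptive statistics, pandas also computes the correlation coefficient (corr) and covariance (cov) of continuous variables.

df

df.corr()  # the method can be 'pearson', 'kendall', or 'spearman'; the default is 'pearson'

df.corr('spearman')

df.corr('pearson')

df.corr('kendall')

# To look at the correlation of one variable with all the others, use corrwith; here, x1 against every column
df.corrwith(df['x1'])

x1    1.000000
x2   -0.075466
x3   -0.393609
dtype: float64

# Covariance matrix of the numeric variables
df.cov()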
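The link between the two matrices can be checked on fixed data: the Pearson correlation of two columns equals their covariance divided by the product of their standard deviations. The frame below is hypothetical; y is an exact multiple of x, and z was chosen so the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = 2 * x exactly; z is a made-up second variable.
frame = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = frame.corr()  # Pearson by default
cov = frame.cov()

# Pearson correlation = covariance / (std of x * std of z)
manual = cov.loc["x", "z"] / (frame["x"].std() * frame["z"].std())

# corrwith returns one correlation per column against the given Series.
against_x = frame.corrwith(frame["x"])

print(corr)
print(cov)
print(manual)
print(against_x)
```

For these values the covariance of x and z is -2 and both variances are 2.5, so the correlation works out to -0.8, while x and y correlate perfectly.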

That wraps up today's hands-on pandas walkthrough; the rest follows in the next part.

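Question from the last post:

You are a data analyst at Taobao and need to estimate Double 11 sales. You have no access to any data from Double 11 itself or from before it; only data from November 12 onward is available. How would you make the estimate?

Answer and discussion:

This is an open-ended question with no fixed answer. The responses fell into two groups:

The first uses the sales of later Double 11 events to infer the 2016 figure. The drawback is the one-year wait; the advantage is that it is extremely simple.
The second extrapolates backward from sales data after November 12, applying some weighting. The drawback is that Double 11 is a peak and therefore hard to estimate; the advantage is that it is practical to carry out.

The question mainly tests analytical thinking, and the goal is to find possible approaches, so are there other methods? Thinking more broadly: sales reflect the goods sold, so which other dimensions also reflect them?

Two candidates are the return/exchange rate and the review rate. Goods bought on Double 11 can only be returned or exchanged from the 12th onward, and reviewed after they are received, so total units sold can be estimated from the usual average rate of each metric together with the later totals of returns and reviews for Double 11 goods. The return rate will run higher than usual (Double 11 sees many returns), so the review rate is the more accurate of the two.

Are there other methods? Certainly. For example, many people pay for Double 11 purchases with Ant Huabei, so can the later repayment ratio be used for an estimate?

Going further: even without Taobao's data for the day, external data can help. Take JD as an example: find JD's Double 11 sales and how many times larger they are than on a normal day, then apply that multiple to Taobao.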
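The rate-based and multiple-based estimates described above come down to simple arithmetic. The sketch below spells it out; the function names and every number in it are hypothetical and chosen only to illustrate the calculation, not taken from any real platform.

```python
def estimate_from_rate(observed_count, typical_rate):
    """Estimate total orders from a follow-up count and its usual rate.

    For example, reviews observed after the event divided by the usual
    share of orders that receive a review.
    """
    return observed_count / typical_rate


def estimate_from_multiple(normal_day_sales, peak_multiple):
    """Scale normal-day sales by a peak multiple seen on another platform."""
    return normal_day_sales * peak_multiple


# All numbers below are hypothetical.
orders_via_reviews = estimate_from_rate(observed_count=5_000_000, typical_rate=0.25)
sales_via_multiple = estimate_from_multiple(normal_day_sales=2_000_000, peak_multiple=8)

print(orders_via_reviews)   # 20000000.0
print(sales_via_multiple)   # 16000000
```

Comparing estimates from several independent signals gives a rough sense of how much they disagree, which is often as informative as any single figure.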

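The overall analysis is structured as follows:

External data:

Double 11 sales on JD and other platforms

Internal data:

Product data: review rate, return/exchange rate, product sales
Payment data: share of payments made with Ant Huabei, and similar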