Advanced Data Analysis Tutorial (Part 2)

  • October 6, 2019
  • Notes

6 Module Development: ETL

This project's data analysis runs on a Hadoop cluster and relies mainly on the Hive data warehouse tool. The collected and preprocessed data therefore has to be loaded into the Hive warehouse for the subsequent mining and analysis.

6.1 Creating the Raw Data Tables

-- Create the source-aligned (ODS) table in the Hive warehouse

drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

Clickstream model pageviews table

drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
session string,
remote_addr string,
time_local string,
request string,
visit_step string,
page_staylong string,
http_referer string,
http_user_agent string,
body_bytes_sent string,
status string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

Creating the time dimension table

drop table if exists dim_time;
create table dim_time(year string,month string,day string,hour string)
row format delimited
fields terminated by ',';

6.2 Loading the Data

Load the cleaned results into the source table ods_weblog_origin:

load data inpath '/weblog/preprocessed/16-02-24-16/' overwrite into table ods_weblog_origin partition(datestr='2013-09-18');

0: jdbc:hive2://localhost:10000> show partitions ods_weblog_origin;
+-------------------+--+
|     partition     |
+-------------------+--+
| timestr=20151203  |
+-------------------+--+

0: jdbc:hive2://localhost:10000> select count(*) from ods_weblog_origin;
+--------+--+
|  _c0   |
+--------+--+
| 11347  |
+--------+--+

Load the clickstream pageviews data into ods_click_pageviews:

0: jdbc:hive2://hdp-node-01:10000> load data inpath '/weblog/clickstream/pageviews' overwrite into table ods_click_pageviews partition(datestr='2013-09-18');

0: jdbc:hive2://hdp-node-01:10000> select count(1) from ods_click_pageviews;
+------+--+
| _c0  |
+------+--+
| 66   |
+------+--+

Similarly, load the clickstream visit data into the ods_click_visit table.

6.3 Generating the ODS-Layer Detail Wide Table

6.3.1 Requirements Overview

The whole analysis proceeds layer by layer through the data warehouse. In short, intermediate tables are derived from the ODS raw data (for example, semi-structured fields such as the time and the url are parsed into structured columns and the various fields are broken out, producing a detail table that makes later analysis easier), and the various metrics are then computed on top of those intermediate tables.

6.3.2 ETL Implementation

Create the detail table (source: ods_weblog_origin; target: ods_weblog_detail)

drop table ods_weblog_detail;
create table ods_weblog_detail(
valid string,            -- valid flag
remote_addr string,      -- client IP
remote_user string,      -- user identifier
time_local string,       -- full access time
daystr string,           -- access date
timestr string,          -- access time
month string,            -- access month
day string,              -- access day
hour string,             -- access hour
request string,          -- requested url
status string,           -- response code
body_bytes_sent string,  -- bytes sent
http_referer string,     -- referer url
ref_host string,         -- referer host
ref_path string,         -- referer path
ref_query string,        -- referer query string
ref_query_id string,     -- referer query value
http_user_agent string   -- client user agent
)
partitioned by(datestr string);

-- Extract refer_url into the intermediate table "t_ods_tmp_referurl"

-- Split the referer url into host, path, query, and query id

drop table if exists t_ods_tmp_referurl;
create table t_ods_tmp_referurl as
SELECT a.*, b.*
FROM ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;

-- Extract and transform the time_local field into the intermediate detail table "t_ods_tmp_detail"

drop table if exists t_ods_tmp_detail;
create table t_ods_tmp_detail as
select b.*,
substring(time_local,0,10) as daystr,
substring(time_local,11) as tmstr,
substring(time_local,5,2) as month,
substring(time_local,8,2) as day,
substring(time_local,11,2) as hour
from t_ods_tmp_referurl b;

The statements above can be rewritten as a single insert:

insert into table zs.ods_weblog_detail partition(datestr='$day_01')
select c.valid, c.remote_addr, c.remote_user, c.time_local,
substring(c.time_local,0,10) as daystr,
substring(c.time_local,12) as tmstr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,11,3) as hour,
c.request, c.status, c.body_bytes_sent, c.http_referer,
c.ref_host, c.ref_path, c.ref_query, c.ref_query_id, c.http_user_agent
from
(SELECT a.valid, a.remote_addr, a.remote_user, a.time_local,
a.request, a.status, a.body_bytes_sent, a.http_referer, a.http_user_agent,
b.ref_host, b.ref_path, b.ref_query, b.ref_query_id
FROM zs.ods_weblog_origin a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c;

0: jdbc:hive2://localhost:10000> show partitions ods_weblog_detail;
+---------------------+--+
|      partition      |
+---------------------+--+
| dd=18%2FSep%2F2013  |
+---------------------+--+
1 row selected (0.134 seconds)


For example, given the referer url http://www.baidu.com/aapath?sousuoci='angel', the call parse_url_tuple(url,'HOST','PATH','QUERY','QUERY:id') splits out the host, the path, the full query string, and the value of the query parameter id.
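A quick hedged illustration of the function's output (a one-off query; 'QUERY:sousuoci' swaps in a parameter that actually occurs in the example url, and the dummy subquery just supplies a single row):

select parse_url_tuple('http://www.baidu.com/aapath?sousuoci=angel',
                       'HOST', 'PATH', 'QUERY', 'QUERY:sousuoci')
       as (host, path, query, query_value)
from (select 1 as one) t;
-- expected row: www.baidu.com    /aapath    sousuoci=angel    angel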

7 Module Development: Statistical Analysis

Note: every metric can be crossed with each dimension table to produce results along every dimension.

For brevity, the cross-dimension code and its comments are in the project source files.

To speed up front-end display, each metric is precomputed per dimension and stored in MySQL.

Prepare the dimension data in advance and create the corresponding dimension tables in the Hive warehouse, e.g.:

Time dimension table:

create table v_time(year string,month string,day string,hour string)
row format delimited
fields terminated by ',';

load data local inpath '/home/hadoop/v_time.txt' into table v_time;

In practice, which metrics are needed is usually decided by the data-consuming departments, and new requirements keep arriving. The following are typical metrics in website traffic analysis.

1. PV Statistics

1.1 Total PV by Multiple Dimensions

1. Time dimension

-- pvs for one specified hour
select count(*),month,day,hour from dw_click.ods_weblog_detail group by month,day,hour;

-- pvs of each hour in this processing batch (one day)
drop table dw_pvs_hour;
create table dw_pvs_hour(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_hour partition(datestr='2016-03-18')
select a.month as month,a.day as day,a.hour as hour,count(1) as pvs
from ods_weblog_detail a
where a.datestr='2016-03-18'
group by a.month,a.day,a.hour;

Alternatively, join against the time dimension table.

Dimension: day

drop table dw_pvs_day;
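Only the drop statement appears in the notes; a minimal sketch of the matching create/insert, mirroring the hourly pattern above (the table layout is an assumption):

create table dw_pvs_day(month string,day string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_day partition(datestr='2016-03-18')
select a.month as month,a.day as day,count(1) as pvs
from ods_weblog_detail a
where a.datestr='2016-03-18'
group by a.month,a.day;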

Dimension: month

drop table t_display_pv_month;
create table t_display_pv_month (pvs bigint,month string);

insert into table t_display_pv_month
select count(*) as pvs,a.month from t_dim_time a
join t_ods_detail_prt b on a.month=b.month
group by a.month;

2. Total PV by terminal dimension

Note: first explore the terminal types present in the data

select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Mozilla%' limit 200;

Terminal dimension: UC

drop table t_display_pv_terminal_uc;
create table t_display_pv_terminal_uc (pvs bigint,mm string,dd string,hh string);

Terminal dimension: Chrome

drop table t_display_pv_terminal_chrome;
create table t_display_pv_terminal_chrome (pvs bigint,mm string,dd string,hh string);

Terminal dimension: Safari

drop table t_display_pv_terminal_safari;
create table t_display_pv_terminal_safari (pvs bigint,mm string,dd string,hh string);
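The notes only create the terminal tables; a hedged sketch of the matching insert for the Chrome table (the like-pattern filter is an assumption based on the exploration query above):

insert into table t_display_pv_terminal_chrome
select count(1) as pvs,month as mm,day as dd,hour as hh
from ods_weblog_detail
where http_user_agent like '%Chrome%'
group by month,day,hour;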

3. Total PV by site section dimension (no code is given in the notes; see the sketch after the list)

Section dimension: job

Section dimension: news

Section dimension: bargin

Section dimension: lane
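A hedged sketch for one section (job), assuming section pages are identified by a request-path prefix; the other sections follow the same pattern:

drop table t_display_pv_column_job;
create table t_display_pv_column_job(pvs bigint,mm string,dd string,hh string);

insert into table t_display_pv_column_job
select count(1) as pvs,month as mm,day as dd,hour as hh
from ods_weblog_detail
where request like '/job%'    -- assumed path prefix for the job section
group by month,day,hour;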

1.2 Pages Viewed per Visitor

Requirement: e.g., the average number of pages requested by all of today's visitors.

-- total page requests / distinct visitor count

drop table dw_avgpv_user_d;
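Only the drop appears here; a minimal sketch of the computation (the table layout is an assumption):

create table dw_avgpv_user_d(day string,avgpv string);

insert into table dw_avgpv_user_d
select '2013-09-18',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs
 from ods_weblog_detail
 where datestr='2013-09-18'
 group by remote_addr) b;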

1.3 Total PV by Referer Dimension

Requirement: count PVs by referer and time dimension, sorted by PV count in descending order.

-- Hourly granularity; query results stored in "dw_pvs_referer_h"

drop table dw_pvs_referer_h;
create table dw_pvs_referer_h(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint)
partitioned by(datestr string);

insert into table dw_pvs_referer_h partition(datestr='2016-03-18')
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;

Count and rank the visits from each referer host per day:

drop table dw_ref_host_visit_cnts_h;
create table dw_ref_host_visit_cnts_h(ref_host string,month string,day string,hour string,ref_host_cnts bigint)
partitioned by(datestr string);

insert into table dw_ref_host_visit_cnts_h partition(datestr='2016-03-18')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

Note: the same can be computed along the referer-region dimension, visitor-terminal dimension, and so on.

1.4 Top-N Referer Hosts by Total PV

Requirement: by time dimension, e.g., the top-N referer hosts that produced the most PVs within one day.

This uses the row_number function.

The following statement ranks the referer hosts within each hour by visit count in descending order:

select ref_host,ref_host_cnts,concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_ref_host_visit_cnts_h;


Using row_number as above, HQL can take the top-N ref_host visit counts for each hour:

drop table dw_pvs_refhost_topn_h;
create table dw_pvs_refhost_topn_h(
hour string,
toporder string,
ref_host string,
ref_host_cnts string
) partitioned by(datestr string);

insert into table zs.dw_pvs_refhost_topn_h partition(datestr='2016-03-18')
select t.hour,t.od,t.ref_host,t.ref_host_cnts from
(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
 row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
 from zs.dw_ref_host_visit_cnts_h) t
where od<=3;


Note: the same can be computed along the referer-region dimension, visitor-terminal dimension, and so on.

2. Visited-Page Analysis

Compute the top-10 most popular pages per day

drop table dw_pvs_d;
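Only the drop appears in the notes; a hedged sketch of the top-10 computation (the table layout is an assumption):

create table dw_pvs_d(day string,request string,pvs bigint);

insert into table dw_pvs_d
select '2013-09-18',a.request,a.request_counts from
(select request,count(request) as request_counts
 from ods_weblog_detail
 where datestr='2013-09-18'
 group by request having request is not null) a
order by a.request_counts desc limit 10;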

Note: further cross-dimension results can be derived.

3. Visitor Analysis

3.1 Unique Visitors

Requirement: count unique visitors, and the PVs they generate, by a time dimension such as the hour.

Identifying unique visitors is easy when the raw logs carry a user identifier.

Here the logs carry none, so the visitor IP is used as a stand-in; the technique is identical, only less precise.

Time dimension: hour

drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(remote_addr string,pvs bigint,hour string);

insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
where datestr='2013-09-18'
group by concat(month,day,hour),remote_addr;

On top of this result table we can further compute, e.g., the total number of unique visitors per hour, or the top-N visitors by request count per hour.

E.g., total unique visitors per hour:

select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;

Exercise: find the top-N unique visitors by request count in each hour.
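A minimal solution sketch for the exercise, reusing row_number as in section 1.4 (top 3 shown; adjust N as needed):

select hour,remote_addr,pvs from
(select hour,remote_addr,pvs,
 row_number() over (partition by hour order by pvs desc) as od
 from dw_user_dstc_ip_h) t
where od<=3;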

Time dimension: month

select remote_addr,count(1) as counts,month
from ods_weblog_detail
group by month,remote_addr;

Time dimension: day

select remote_addr,count(1) as counts,concat(month,day) as day
from ods_weblog_detail
where dd='18/Sep/2013'
group by concat(month,day),remote_addr;

Note: the same can be computed along the referer-region dimension, visitor-terminal dimension, and so on.

3.2 New Visitors per Day

Requirement: identify each day's new visitors.

Approach: maintain a cumulative table of distinct historical visitors, and compare each day's visitors against it.

Time dimension: day

-- Cumulative table of distinct historical visitors
drop table dw_user_dsct_history;
create table dw_user_dsct_history(day string,ip string) partitioned by(datestr string);

-- Append each day's new visitors to the cumulative table
insert into table dw_user_dsct_history partition(datestr='2013-09-19')
select tmp.day as day,tmp.today_addr as new_ip from
(select today.day as day,today.remote_addr as today_addr,old.ip as old_addr from
 (select distinct remote_addr as remote_addr,"2013-09-19" as day
  from ods_weblog_detail where datestr="2013-09-19") today
 left outer join dw_user_dsct_history old
 on today.remote_addr=old.ip) tmp
where tmp.old_addr is null;

Verification:

select count(distinct remote_addr) from ods_weblog_detail;                    -- 1005
select count(1) from dw_user_dsct_history where prtflag_day='18/Sep/2013';    -- 845
select count(1) from dw_user_dsct_history where prtflag_day='19/Sep/2013';    -- 160

Time dimension: month

Analogous to the daily algorithm.

Note: the same can be computed along the referer-region dimension, visitor-terminal dimension, and so on.

4. Visit Analysis (Clickstream Model)

4.2 Returning vs. One-Time Visitors

Requirement: list all of today's returning visitors and their visit counts.

Approach: visitors appearing more than once in the table above are returning visitors; the rest are one-time visitors.

drop table dw_user_returning;
create table dw_user_returning(day string,remote_addr string,acc_cnt string)
partitioned by (datestr string);

insert overwrite table dw_user_returning partition(datestr='2013-09-18')
select tmp.day,tmp.remote_addr,tmp.acc_cnt from
(select '2013-09-18' as day,remote_addr,count(session) as acc_cnt
 from click_stream_visit group by remote_addr) tmp
where tmp.acc_cnt>1;

4.3 Visits per Visitor

Requirement: the average number of visits per user per day.

-- total visits / distinct user count

select sum(pagevisits)/count(distinct remote_addr)
from click_stream_visit
where datestr='2013-09-18';

5. An Alternative Implementation of Visit Analysis

5.1 Identifying Each Visit with an MR Program

a.) First, develop the MapReduce program: UserStayTime

Note: the code is fairly long; see the project source.

b.) Submit the MapReduce program to run

[hadoop@hdp-node-01 ~]$ hadoop jar weblog.jar cn.itcast.bigdata.hive.mr.UserStayTime /weblog/input /weblog/stayout4

-- Load into the hive table ("t_display_access_info")
drop table ods_access_info;
create table ods_access_info(remote_addr string,firt_req_time string,last_req_time string,stay_long string)
partitioned by(prtflag_day string)
row format delimited
fields terminated by '\t';

load data inpath '/weblog/stayout4' into table ods_access_info partition(prtflag_day='18/Sep/2013');

stay_long was created as string, but bigint later proved a better fit, so the table is altered:

alter table ods_access_info change column stay_long stay_long bigint;

5.2 Loading the MR Results into the Visit Info Table "t_display_access_info"

Some visits consist of a single record, for which the MR program outputs a stay time of 0, so single-request visits are assigned a default stay time of 30 seconds.

drop table dw_access_info;
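Only the drop appears here; a hedged sketch of the create/insert that applies the 30-second default (the layout and the time unit are assumptions):

create table dw_access_info(remote_addr string,firt_req_time string,last_req_time string,stay_long bigint)
partitioned by(prtflag_day string);

insert into table dw_access_info partition(prtflag_day='18/Sep/2013')
select remote_addr,firt_req_time,last_req_time,
       case when stay_long=0 then 30000 else stay_long end as stay_long  -- 30s default, assuming milliseconds
from ods_access_info
where prtflag_day='18/Sep/2013';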

On top of the visit info table, more metrics can be computed, e.g.:

Average stay time across all users, to observe the trend of time-on-site:

select prtflag_day as dt,avg(stay_long) as avg_staylong

from dw_access_info group by prtflag_day;

5.3 Returning vs. One-Time Visitors

Note: queried from the visit info table "dw_access_info" built in the previous step.

-- Returning-visitor info table "dw_access_info_htip"

drop table dw_access_info_htip;
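Only the drop is shown; a sketch mirroring the one-time-visitor table below, keeping visitors with two or more visits:

create table dw_access_info_htip(remote_addr string,firt_req_time string,last_req_time string,stay_long string,acc_counts string)
partitioned by(prtflag_day string);

insert into table dw_access_info_htip partition(prtflag_day='18/Sep/2013')
select b.remote_addr,b.firt_req_time,b.last_req_time,b.stay_long,a.acc_counts from
(select remote_addr,count(remote_addr) as acc_counts
 from dw_access_info
 where prtflag_day='18/Sep/2013'
 group by remote_addr
 having acc_counts>1) a
join dw_access_info b on a.remote_addr=b.remote_addr;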

-- One-time-visitor info table "dw_access_info_dcip"

drop table dw_access_info_dcip;
create table dw_access_info_dcip(remote_addr string,firt_req_time string,last_req_time string,stay_long string,acc_counts string)
partitioned by(prtflag_day string);

insert into table dw_access_info_dcip partition(prtflag_day='18/Sep/2013')
select b.remote_addr,b.firt_req_time,b.last_req_time,b.stay_long,a.acc_counts from
(select remote_addr,count(remote_addr) as acc_counts
 from dw_access_info
 where prtflag_day='18/Sep/2013'
 group by remote_addr
 having acc_counts<2) a
join dw_access_info b on a.remote_addr=b.remote_addr;

On top of the returning/one-time visitor tables, more metrics can be computed, e.g.:

-- Share of returning visitors for the day

drop table dw_htpercent_d;
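Only the drop appears here; a hedged sketch of the ratio, built on the two info tables above (the layout is an assumption):

create table dw_htpercent_d(day string,ht_percent double);

-- returning visitors / all distinct visitors for the day
insert into table dw_htpercent_d
select '18/Sep/2013',ht.ht_cnt/(ht.ht_cnt+dc.dc_cnt) from
(select count(distinct remote_addr) as ht_cnt
 from dw_access_info_htip where prtflag_day='18/Sep/2013') ht
cross join
(select count(distinct remote_addr) as dc_cnt
 from dw_access_info_dcip where prtflag_day='18/Sep/2013') dc;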

5.4 Visits per Visitor

-- total visits / distinct visitor count, queried from the visit-count summary tables

select avg(user_times.counts) as user_access_freq from
(select remote_addr,counts from t_display_htip
 union all
 select remote_addr,counts from t_display_access_dcip) user_times;

-- Or query directly from the visit info table t_display_access_info
select avg(a.acc_cts) from
(select remote_addr,count(*) as acc_cts from dw_access_info group by remote_addr) a;

6. Key-Path Conversion Analysis: The Funnel Model

Conversion: within a specified business flow, the number of users completing each step, and the percentage relative to the previous step.

6.1 Requirements Analysis

6.2 Model Design

Define the page markers of the business flow; in the example below the steps are:

Step 1: /item%

Step 2: /category

Step 3: /order

Step 4: /index

6.3 Implementation

Developed step by step:

1. Count the total visitors at each step (the create statement below is truncated in the notes; see the sketch after it)

create table route_numbs as
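The statement breaks off here; a hedged sketch of a full version, counting distinct visitors per step with the page markers from 6.2 (the source table and date are assumed from the ETL module):

create table route_numbs as
select t.step,t.numbs from
(select 'step1' as step,count(distinct remote_addr) as numbs
   from ods_click_pageviews
  where datestr='2013-09-18' and request like '/item%'
 union all
 select 'step2' as step,count(distinct remote_addr) as numbs
   from ods_click_pageviews
  where datestr='2013-09-18' and request like '/category%'
 union all
 select 'step3' as step,count(distinct remote_addr) as numbs
   from ods_click_pageviews
  where datestr='2013-09-18' and request like '/order%'
 union all
 select 'step4' as step,count(distinct remote_addr) as numbs
   from ods_click_pageviews
  where datestr='2013-09-18' and request like '/index%') t;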

2. Compute each step's ratio relative to the first step of the path

Approach: use a self-join; the two fragments below sketch it, followed by a fuller version.

select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from route_numbs rn

select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
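A hedged full version of the two fragments (pinning the denominator to step1 is an assumption):

select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio from
(select rn.step as rnstep,rn.numbs as rnnumbs,
        rr.step as rrstep,rr.numbs as rrnumbs
 from route_numbs rn
 cross join route_numbs rr) tmp
where tmp.rrstep='step1';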

3. Compute each step's leak-out rate relative to the previous step (fragment below, then a hedged completion)

select tmp.rrstep as rrstep,tmp.rrnumbs/tmp.rnnumbs as ratio
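A hedged completion, pairing each step with its predecessor through the same self-join (parsing the step number out of the label is an assumption):

select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as ratio from
(select rn.step as rnstep,rn.numbs as rnnumbs,
        rr.step as rrstep,rr.numbs as rrnumbs
 from route_numbs rn
 cross join route_numbs rr) tmp
-- keep only (previous step, current step) pairs, e.g. step1 -> step2
where cast(substr(tmp.rnstep,5,1) as int)+1=cast(substr(tmp.rrstep,5,1) as int);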

4. Combine the two metrics above (fragment below, then a hedged combination)

select abs.step,abs.numbs,abs.ratio as abs_ratio,rel.ratio as rel_ratio
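A hedged combination, assuming the two queries above were materialized as funnel_abs (step, numbs, ratio) and funnel_rel (step, ratio); both table names are illustrative:

select a.step,a.numbs,a.ratio as abs_ratio,r.ratio as rel_ratio
from funnel_abs a
left outer join funnel_rel r on a.step=r.step;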

8 Module Development: Exporting Results

Report results are exported from the Hive tables by Sqoop. An example follows; see the project code for details.

sqoop export --connect jdbc:mysql://hdp-node-01:3306/webdb --username root --password root --table click_stream_visit --export-dir /user/hive/warehouse/dw_click.db/click_stream_visit/datestr=2013-09-18 --input-fields-terminated-by '\001'

9 Module Development: Workflow Scheduling

Note: the whole pipeline, from data collection through analysis to export of the results, is split into several Oozie workflows and coordinated with a coordinator.
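The coordinator definition itself is not included in these notes; a minimal hypothetical sketch of a daily coordinator wrapping one of the workflows below (all names, times, and paths are illustrative):

<coordinator-app name="weblog-daily-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="GMT+0800"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- illustrative path to a deployed workflow application -->
      <app-path>${nameNode}/user/hadoop/apps/weblogpreprocess</app-path>
    </workflow>
  </action>
</coordinator-app>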

Workflow definition examples

Oozie configuration fragments are shown below; see the project for the full files.

1. Workflow definition for the log-preprocessing MR program

<workflow-app name="weblogpreprocess" xmlns="uri:oozie:workflow:0.4"><start to="firstjob"/><action name="firstjob"><map-reduce><job-tracker>${jobTracker}</job-tracker><name-node>${nameNode}</name-node><prepare><delete path="${nameNode}/${outpath}"/></prepare><configuration><property><name>mapreduce.job.map.class</name><value>cn.itcast.bigdata.hive.mr.WeblogPreProcess$WeblogPreProcessMapper</value></property> <property><name>mapreduce.job.output.key.class</name><value>org.apache.hadoop.io.Text</value></property><property><name>mapreduce.job.output.value.class</name><value>org.apache.hadoop.io.NullWritable</value></property> <property><name>mapreduce.input.fileinputformat.inputdir</name><value>${inpath}</value></property><property><name>mapreduce.output.fileoutputformat.outputdir</name><value>${outpath}</value></property><property><name>mapred.mapper.new-api</name><value>true</value></property><property><name>mapred.reducer.new-api</name><value>true</value></property> </configuration></map-reduce><ok to="end"/><error to="kill"/>

2. Workflow definition for the data-loading ETL:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="hive2-wf"><start to="hive2-node"/> <action name="hive2-node"><hive2 xmlns="uri:oozie:hive2-action:0.1"><job-tracker>${jobTracker}</job-tracker><name-node>${nameNode}</name-node><configuration><property><name>mapred.job.queue.name</name><value>${queueName}</value></property></configuration><jdbc-url>jdbc:hive2://hdp-node-01:10000</jdbc-url><script>script.q</script><param>input=/weblog/outpre2</param></hive2><ok to="end"/><error to="fail"/></action> <kill name="fail"><message>Hive2 (Beeline) action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message></kill><end name="end"/></workflow-app>

3. Hive script used by the data-loading workflow:

create database if not exists dw_weblog;
use dw_weblog;

drop table if exists t_orgin_weblog;
create table t_orgin_weblog(valid string,remote_addr string,remote_user string,time_local string,
request string,status string,body_bytes_sent string,http_referer string,http_user_agent string)
row format delimited
fields terminated by '\001';
load data inpath '/weblog/preout' overwrite into table t_orgin_weblog;

drop table if exists t_ods_detail_tmp_referurl;
create table t_ods_detail_tmp_referurl as
SELECT a.*,b.*
FROM t_orgin_weblog a
LATERAL VIEW parse_url_tuple(regexp_replace(http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;

drop table if exists t_ods_detail;
create table t_ods_detail as
select b.*,
substring(time_local,0,11) as daystr,
substring(time_local,13) as tmstr,
substring(time_local,4,3) as month,
substring(time_local,0,2) as day,
substring(time_local,13,2) as hour
from t_ods_detail_tmp_referurl b;

drop table t_ods_detail_prt;
create table t_ods_detail_prt(valid string,remote_addr string,remote_user string,time_local string,
request string,status string,body_bytes_sent string,http_referer string,http_user_agent string,
host string,path string,query string,query_id string,
daystr string,tmstr string,month string,day string,hour string)
partitioned by (mm string,dd string);

insert into table t_ods_detail_prt partition(mm='Sep',dd='18')
select * from t_ods_detail where daystr='18/Sep/2013';
insert into table t_ods_detail_prt partition(mm='Sep',dd='19')
select * from t_ods_detail where daystr='19/Sep/2013';

More workflow and HQL script definitions are in the project source.


The next part covers unit testing and visualization.