CVPR 2018 Keyword Analysis: Generating a Word Cloud with Paper Lookup

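Today's goal: scrape the CVPR 2018 papers, work out which keywords appear most often, display them as a wordCloud chart, and make each word clickable so that the matching papers and their links are shown.

The task breaks down into three steps:

① Scrape the title, abstract, keywords, and paper link for each CVPR 2018 paper

② Render the scraped data as a wordCloud chart

③ Add a click event that shows the papers and links for the clicked keyword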

 

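1. Implementing the scraper

The paper pages do not list any keywords, so each title is split into words and the words are stored as a comma-separated keyword string.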

import requests
from bs4 import BeautifulSoup
import pymysql

# Request headers (browser-like user agent)
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# requests needs an explicit scheme; a bare '//host/...' URL raises MissingSchema
hts = 'https://openaccess.thecvf.com/'
url = hts + 'CVPR2018.py'
r = requests.get(url, headers=headers)
content = r.content.decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
dts = soup.find_all('dt', class_='ptitle')
# Scrape each paper page
alllist = []
for i, dt in enumerate(dts):
    print('Paper ' + str(i))
    title = dt.a.text.strip()
    href = hts + dt.a['href']
    r = requests.get(href, headers=headers)
    content = r.content.decode('utf-8')
    soup = BeautifulSoup(content, 'html.parser')
    divabstract = soup.find(name='div', attrs={"id": "abstract"})
    abstract = divabstract.text.strip()
    # The fifth <a> on the page is taken to be the paper link;
    # [6:] drops the first six characters of its href (assumed to be a relative prefix)
    alllink = soup.select('a')
    link = hts + alllink[4]['href'][6:]
    # No keyword field exists, so use the title words, comma-separated
    keywords = ','.join(title.split(' '))
    alllist.append((title, abstract, link, keywords))
print(alllist)
# Save to MySQL (recent PyMySQL versions require keyword arguments here)
db = pymysql.connect(host="localhost", user="root", password="fengge666", database="yiqing", charset='utf8')
cursor = db.cursor()
sql_cvpr = "INSERT INTO cvpr values (%s,%s,%s,%s)"
try:
    cursor.executemany(sql_cvpr, alllist)
    db.commit()
except Exception as e:
    print('Insert failed, rolling back:', e)
    db.rollback()
db.close()
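
The title-to-keyword step can also be pulled out into a small function, which makes it easy to check on its own. The sketch below is only an illustration: the name `title_to_keywords` is made up for this example, and stripping surrounding punctuation is an extra step that the scraper above does not perform.

```python
import string


def title_to_keywords(title: str) -> str:
    """Split a paper title on whitespace and join the words with commas."""
    words = []
    for raw in title.split():
        # Strip surrounding punctuation such as ':' or '?'
        # (an addition for illustration; the scraper keeps words as-is)
        word = raw.strip(string.punctuation)
        if word:
            words.append(word)
    return ",".join(words)


if __name__ == "__main__":
    print(title_to_keywords("A Sample Title: With Punctuation"))
```

Without the punctuation step, a word followed by a colon would be counted separately from the same word without one, which slightly skews the frequencies later on.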


 

2. Displaying the data as a wordCloud

First, include the scripts needed to render the word cloud:

<script src='//cdn.bootcss.com/echarts/3.7.0/echarts.simple.js'></script>
<script src='js/echarts-wordcloud.js'></script>
<script src='js/echarts-wordcloud.min.js'></script>

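The page then loads the JSON data from the back end asynchronously and displays it.

The data saved in step 1 has not been analyzed yet, so the DAO layer does that work: it counts how often each keyword occurs and sorts the counts in descending order (a Map is a natural fit for storing them).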

// Needs: java.sql.*, java.util.*, java.util.stream.Collectors
public Map<String, Integer> getallmax() {
    String sql = "select keywords from cvpr";
    Map<String, Integer> map = new HashMap<String, Integer>();
    Connection con = null;
    Statement state = null;
    ResultSet rs = null;
    try {
        con = DBUtil.getConn();
        state = con.createStatement();
        rs = state.executeQuery(sql);
        while (rs.next()) {
            String keywords = rs.getString("keywords");
            for (String word : keywords.split(",")) {
                // Count 1 on first sight, then add 1 for every repeat
                map.merge(word, 1, Integer::sum);
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    } finally {
        DBUtil.close(rs, state, con);
    }
    // Sort by count, descending; LinkedHashMap preserves that order
    return map.entrySet().stream()
            .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                    (e1, e2) -> e2, LinkedHashMap::new));
}
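
For comparison, the same count-and-sort logic is only a few lines in Python. This is a rough sketch rather than part of the project: `count_keywords` is a hypothetical name, and it takes the comma-separated keyword strings as a plain list instead of reading them from the `cvpr` table.

```python
from collections import Counter


def count_keywords(keyword_rows):
    """Count every comma-separated keyword; return (word, count) pairs, highest count first."""
    counts = Counter()
    for row in keyword_rows:
        for word in row.split(","):
            if word:  # skip empty fragments left by stray commas
                counts[word] += 1
    # most_common() returns the pairs sorted by count in descending order
    return counts.most_common()


if __name__ == "__main__":
    rows = ["Deep,Learning,for,Vision", "Deep,Networks", "Deep,Learning"]
    for word, count in count_keywords(rows):
        print(word, count)
```

The important detail is that a word seen for the first time gets a count of 1, not 0; otherwise every frequency ends up one too low.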


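In the servlet layer the data still needs some filtering: prepositions, articles such as "a", and similar filler words are dropped, because they would otherwise drown out the real keywords. The top 30 remaining keywords are sent to the front end for display.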

request.setCharacterEncoding("utf-8");
Map<String, Integer> sortMap = dao.getallmax();
// Filler words to skip (matching is case-sensitive)
Set<String> stopWords = new HashSet<String>(Arrays.asList(
        "for", "and", "With", "of", "in", "From", "A", "to", "a", "the", "by"));
JSONArray json = new JSONArray();
int k = 0;
for (Map.Entry<String, Integer> entry : sortMap.entrySet()) {
    if (!stopWords.contains(entry.getKey())) {
        JSONObject ob = new JSONObject();
        ob.put("name", entry.getKey());
        ob.put("value", entry.getValue());
        json.add(ob);
        k++;
    }
    // Keep only the top 30 keywords
    if (k == 30)
        break;
}
System.out.println(json.toString());
response.getWriter().write(json.toString());
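
The filtering step is easier to reason about when the stop words are compared case-insensitively, so that "With" and "with" are handled alike. The following is a minimal sketch of that idea, not the servlet's actual code: `top_keywords` and the `STOP_WORDS` set are hypothetical, and the input is assumed to be (word, count) pairs already sorted by count.

```python
import json

# Hypothetical stop-word list for illustration; extend it as needed
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "in", "of", "on", "the", "to", "with"}


def top_keywords(sorted_counts, limit=30):
    """Drop stop words (case-insensitively) and keep the first `limit` entries."""
    result = []
    for word, count in sorted_counts:
        if len(result) >= limit:
            break
        if word.lower() in STOP_WORDS:
            continue
        # Same shape as the JSON the front end expects: name + value
        result.append({"name": word, "value": count})
    return result


if __name__ == "__main__":
    counts = [("for", 50), ("Deep", 40), ("With", 30), ("Learning", 20)]
    print(json.dumps(top_keywords(counts, limit=2)))
```

Each entry keeps the `name` and `value` fields, since that is what the word cloud series reads.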


 

3. Adding a click event to show the papers and links for a keyword

// Register the click handler
myChart.on('click', eConsole);

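The click handler is a function: it sends the clicked keyword to the back end, which runs a fuzzy query to find the matching paper titles and links and returns them to the page.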

// Click handler
function eConsole(param) {
    if (typeof param.seriesIndex == 'undefined') {
        return;
    }
    if (param.type == 'click') {
        var word = param.name;
        var htmltext = "<table class='table table-striped' style='text-align:center'><caption style='text-align:center'>Paper titles and links</caption>";
        $.post(
            'findkeytitle',
            {'word': word},
            function(result) {
                // The response is a JSON string; each item has Title and Link fields
                var json = JSON.parse(result);
                for (var i = 0; i < json.length; i++) {
                    htmltext += "<tr><td><a target='_blank' href='" + json[i].Link + "'>" + json[i].Title + "</a></td></tr>";
                }
                htmltext += "</table>";
                $("#show").html(htmltext);
            }
        );
    }
}
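
The back-end side of `findkeytitle` is not shown in this post, so here is a rough sketch of what a fuzzy lookup could look like. Everything in it is an assumption made for illustration: it uses an in-memory SQLite table instead of MySQL, the column names `title` and `link` are guesses, `find_papers` and `demo_connection` are hypothetical helpers, and the sample rows are placeholders rather than real papers.

```python
import sqlite3


def find_papers(conn, word):
    """Return papers whose title contains `word`, as dicts with Title and Link keys."""
    # Escape LIKE wildcards so that '%' or '_' in a keyword is matched literally
    escaped = word.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT title, link FROM cvpr WHERE title LIKE ? ESCAPE '\\'",
        ("%" + escaped + "%",),
    ).fetchall()
    return [{"Title": title, "Link": link} for title, link in rows]


def demo_connection():
    """Build a throwaway in-memory table with placeholder rows (not real papers)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cvpr (title TEXT, abstract TEXT, link TEXT, keywords TEXT)")
    conn.executemany(
        "INSERT INTO cvpr VALUES (?, ?, ?, ?)",
        [
            ("Sample Image Segmentation Paper", "", "https://example.com/a.pdf", "Sample,Image,Segmentation,Paper"),
            ("Another Image Paper", "", "https://example.com/b.pdf", "Another,Image,Paper"),
            ("Video Tracking Paper", "", "https://example.com/c.pdf", "Video,Tracking,Paper"),
        ],
    )
    return conn


if __name__ == "__main__":
    for paper in find_papers(demo_connection(), "Image"):
        print(paper["Title"], paper["Link"])
```

Passing the keyword as a bound parameter, rather than concatenating it into the SQL string, keeps the query safe even if the clicked word contains quotes.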


 

Result:

 

 

Front-end page code:

<html>
    <head>
        <meta charset="utf-8">
        <link href="css/bootstrap.min.css" rel="stylesheet">
        <!-- jQuery (all of Bootstrap's JavaScript plugins depend on jQuery, so it must be loaded first) -->
        <script src="js/jquery-1.11.3.min.js"></script>
        <!-- Load all of Bootstrap's JavaScript plugins. You can also load individual plugins as needed. -->
        <script src="js/bootstrap.js"></script>
        <script src='//cdn.bootcss.com/echarts/3.7.0/echarts.simple.js'></script>
        <script src='js/echarts-wordcloud.js'></script>
        <script src='js/echarts-wordcloud.min.js'></script>
    </head>
    <body>
        <style>
            body{
                background-color: black;
            }
            #main {
                width: 70%;
                height: 100%;
                margin: 0;
                float:right;
                background: black;
            }
            #show{
                overflow-x: auto;
                 overflow-y: auto;
                width: 30%;
                height: 100%;
                float:left;
                margin-top:100px;
                padding-top:100px;
                background: pink;
            }
        </style>
        <div id='show'></div>
        <div id='main'></div>
    <script>
        $(function(){
            echartsCloud();
        });
        // Click handler
        function eConsole(param) {
            if (typeof param.seriesIndex == 'undefined') {
                return;
            }
            if (param.type == 'click') {
                var word = param.name;
                var htmltext = "<table class='table table-striped' style='text-align:center'><caption style='text-align:center'>Paper titles and links</caption>";
                $.post(
                    'findkeytitle',
                    {'word': word},
                    function(result) {
                        // The response is a JSON string; each item has Title and Link fields
                        var json = JSON.parse(result);
                        for (var i = 0; i < json.length; i++) {
                            htmltext += "<tr><td><a target='_blank' href='" + json[i].Link + "'>" + json[i].Title + "</a></td></tr>";
                        }
                        htmltext += "</table>";
                        $("#show").html(htmltext);
                    }
                );
            }
        }
        function echartsCloud(){
           
            
            $.ajax({
                 url:"getmax",
                 type:"POST",
                 dataType:"JSON",
                 async:true,
                 success:function(data)
                 {
                     var mydata = new Array(0);
               
                     for(var i=0;i<data.length;i++)
                     {
                         var d = {
                                 
                         };
                         d["name"] = data[i].name;//.substring(0, 2);
                         d["value"] = data[i].value;
                         mydata.push(d);
                     }
                     var myChart = echarts.init(document.getElementById('main'));
                     // Register the click handler
                     myChart.on('click', eConsole);
                     
                     myChart.setOption({
                         title: {
                             text: ''
                         },
                         tooltip: {},
                         series: [{
                             type : 'wordCloud',  // word cloud series
                                 shape:'smooth',  // smooth
                                 gridSize : 8, // grid size
                                 size : ['50%','50%'],
                                 //sizeRange : [ 50, 100 ],
                                 rotationRange : [-45, 0, 45, 90], // rotation range
                                 textStyle : {
                                     normal : {
                                         fontFamily:'Microsoft YaHei',
                                         color: function() {
                                             return 'rgb(' + 
                                                 Math.round(Math.random() * 255) +
                                          ', ' + Math.round(Math.random() * 255) +
                                          ', ' + Math.round(Math.random() * 255) + ')'
                                                }
                                         },
                                     emphasis : {
                                         shadowBlur : 5,  // shadow blur radius
                                         shadowColor : '#333'  // shadow color
                                     }
                                 },
                                 left: 'center',
                                 top: 'center',
                                 right: null,
                                 bottom: null,
                                 width:'100%',
                                 height:'100%',
                                 data:mydata
                         }]
                     });
                 }
             });  
    }
    </script>                    
    </body>
</html>
