Frequent Words in the Mottos of 985 Universities

  • October 30, 2019
  • Notes

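This post applies a simple web scraper and the FP-growth algorithm to find the frequent words in the mottos of China's Project 985 universities, as listed on a web page.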

First, use the requests module to fetch the full HTML of the target page.

    import requests
    from bs4 import BeautifulSoup


    def download(url, user_agent='wswp', num_retries=2, proxies=None):
        """Return the page HTML as text, or None if the download fails."""
        print("Downloading:", url)
        headers = {'User-Agent': user_agent}
        try:
            resp = requests.get(url, headers=headers, proxies=proxies)
        except requests.exceptions.RequestException as e:
            # Network-level failure (DNS, connection, timeout, ...)
            print("Download error:", e)
            return None
        if resp.status_code >= 400:
            print("Download error:", resp.status_code)
            if num_retries > 0 and 500 <= resp.status_code < 600:
                # Recursive call: on a 5xx error, retry at most num_retries times
                return download(url, user_agent, num_retries - 1, proxies)
            return None
        return resp.text


    url = 'https://baijiahao.baidu.com/s?id=1597717152971067895&wfr=spider&for=pc'
    html = download(url)
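The retry rule (retry on 5xx, give up on 4xx) can be checked without touching the network. The following is a minimal sketch, not the code used for this post: `download_with_retries`, `make_flaky_fetch` and the `fetch` callable are hypothetical names, and `fetch` stands in for the real HTTP call by returning a (status_code, text) pair.

```python
def download_with_retries(fetch, url, num_retries=2):
    """Return the page text, retrying only on 5xx responses.

    `fetch` is a hypothetical stand-in for the HTTP call: it takes a URL
    and returns a (status_code, text) tuple.
    """
    status, text = fetch(url)
    if status < 400:
        return text
    if num_retries > 0 and 500 <= status < 600:
        # Server-side error: try again with one fewer retry left
        return download_with_retries(fetch, url, num_retries - 1)
    # Client error (4xx) or retries exhausted
    return None


def make_flaky_fetch(statuses):
    """Build a fake fetch that answers with the given status codes in order."""
    calls = []

    def fetch(url):
        status = statuses[len(calls)]
        calls.append(url)
        body = "<html>ok</html>" if status < 400 else "error"
        return status, body

    return fetch, calls


# Two 503 responses followed by a success: three requests in total
fetch, calls = make_flaky_fetch([503, 503, 200])
page = download_with_retries(fetch, "http://example.com")
print(page, len(calls))

# A 404 is a client error, so it is not retried
fetch_404, calls_404 = make_flaky_fetch([404])
missing = download_with_retries(fetch_404, "http://example.com")
print(missing, len(calls_404))
```

Passing the fetch function in as an argument keeps the retry logic separate from the network call, which makes it easy to exercise with canned responses.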

Next, use BeautifulSoup to extract the part we are interested in, namely the mottos:

    soup = BeautifulSoup(html, 'html.parser')
    html = soup.prettify()  # repair possibly malformed HTML
    print()
    mottos = []
    # Extract every <span class="bjh-p"> element
    for matched in soup.find_all("span", attrs={"class": "bjh-p"}):
        text = matched.text  # text content with the tags stripped
        print(text)
        if ":" in text:  # skip lines that are not mottos
            mottos.append(text.split(":"))
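The filter above relies on each motto line having the form "school:motto". Here is a small offline sketch of that split. It assumes (the page does not confirm this) that a line may use either an ASCII colon or a full-width colon; the helper `split_motto` and the sample lines are illustrative, not scraped output.

```python
import re


def split_motto(text):
    """Split 'school:motto' into (school, motto); return None for other lines."""
    # Accept an ASCII colon or a full-width colon, splitting only once
    parts = re.split(r"[::]", text.strip(), maxsplit=1)
    if len(parts) != 2 or not parts[1].strip():
        return None
    return parts[0].strip(), parts[1].strip()


# Hand-written sample lines (assumed format, not taken from the page)
sample_lines = [
    "清華大學:自強不息,厚德載物",   # ASCII colon
    "浙江大學:求是創新",            # full-width colon
    "以下是部分高校的校訓",          # no colon: not a motto line
]

sample_mottos = [pair for pair in map(split_motto, sample_lines) if pair is not None]
print(sample_mottos)
```

Splitting only once keeps any later colon inside the motto text instead of cutting it into extra pieces.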

Note that this list of 985 universities appears to be incomplete.

Then use the jieba word-segmentation library to split each motto into words:

    import jieba

    words = []
    for motto in mottos:
        # Segment the motto text (the part after the colon) and drop space tokens
        words.append([x for x in jieba.lcut(motto[1]) if x != ' '])
    print("%d mottos in total" % len(words))
    print(words)

Finally, use the FP-growth algorithm to find the frequent itemsets in the mottos:

    import fpGrowth_py36 as fpG


    def findFreq(dataset, minSup):
        initSet = fpG.createInitSet(dataset)
        myFPtree, myHeaderTab = fpG.createTree(initSet, minSup)
        freqList = []
        if myFPtree is not None:
            # myFPtree.disp()
            fpG.mineTree(myFPtree, myHeaderTab, minSup, set([]), freqList)
        return freqList


    dataset = words
    minSup = 4  # minimum number of mottos an itemset must appear in
    freqList = findFreq(dataset, minSup)
    print("With a support of %d, there are %d frequent itemsets:" % (minSup, len(freqList)))
    print("Frequent itemsets:\n", freqList)
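To make minSup concrete: an itemset is frequent when it occurs in at least minSup transactions, which here means mottos. The sketch below counts support by brute force over a toy dataset. It is not FP-growth and not the fpGrowth_py36 module, only a way to cross-check small results by hand. The function `frequent_itemsets` and the toy word lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_sup, max_size=2):
    """Count itemsets of up to max_size items by brute force.

    Returns a dict mapping each frozenset to its support count,
    keeping only itemsets seen in at least min_sup transactions.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # each word counts once per motto
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_sup}


# Made-up tokenized mottos, for illustration only
toy = [
    ["求是", "創新"],
    ["求實", "創新"],
    ["團結", "求實", "創新"],
    ["自強不息"],
]

result = frequent_itemsets(toy, min_sup=2)
for itemset, n in result.items():
    print(set(itemset), n)
```

Brute-force counting grows quickly with the number of distinct words, which is why FP-growth compresses the transactions into a tree before mining; for a few dozen short mottos either approach is fast enough.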

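The words 「求實」 (being realistic), 「求是」 (seeking truth), 「自強不息」 (unceasing self-improvement) and 「創新」 (innovation) each appear four or more times across the 985 mottos. The most frequent word is 「創新」, which is, ironically, not all that innovative: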