Frequent Words in 985 University Mottos
- October 30, 2019
- Notes
This post is a simple application of web scraping and FP-growth: it discovers frequent words in the mottos of China's 985 universities as recorded on a web page.

First, use the requests module to fetch the full HTML content of the target page.
import requests
from bs4 import BeautifulSoup

def download(url, user_agent='wswp', num_retries=2, proxies=None):
    print("Downloading: ", url)
    headers = {'User-Agent': user_agent}
    html = None
    try:
        resp = requests.get(url, headers=headers, proxies=proxies)
        html = resp.text
        if resp.status_code >= 400:
            print("Download error: ", html)
            html = None
            if num_retries > 0 and 500 <= resp.status_code < 600:
                # Recursive call: on 5xx server errors, retry at most 2 more times
                return download(url, user_agent, num_retries - 1, proxies)
    except requests.exceptions.RequestException as e:
        print('Download error: ', e)
        html = None
    return html

url = 'https://baijiahao.baidu.com/s?id=1597717152971067895&wfr=spider&for=pc'
html = download(url)
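As a small convenience (not part of the original post), the downloaded HTML can be cached in a local file, here hypothetically named 985_mottos.html, so that re-running the script does not hit the page again; this sketch reuses the download function and url defined above:

from pathlib import Path

# Cache the raw HTML locally so repeated runs do not re-download the page
cache = Path('985_mottos.html')
if cache.exists():
    html = cache.read_text(encoding='utf-8')
else:
    html = download(url)
    if html is not None:
        cache.write_text(html, encoding='utf-8')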

Next, use BeautifulSoup to extract the part we are interested in, namely the mottos:
soup = BeautifulSoup(html, 'html.parser')
html = soup.prettify()  # fix any malformed HTML
mottos = []
for matched in soup.find_all("span", attrs={"class": "bjh-p"}):  # extract the motto paragraphs
    text = matched.text
    print(matched.text)  # extra whitespace is stripped automatically
    if ":" in text:  # skip paragraphs that are not mottos
        mottos.append(text.split(":"))
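For reference, each element of mottos is simply the list produced by the split, roughly of the form [school_name, motto_text]; a quick sanity check might look like this (a minimal sketch, not in the original post):

print("Extracted %d mottos" % len(mottos))
# Show the first few school/motto pairs
for item in mottos[:3]:
    print(item[0], "->", item[1])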

Note that the list of 985 universities on this page does not appear to be complete.
Then use the jieba word-segmentation library to tokenize each motto:
import jieba

words = []
for motto in mottos:
    # Tokenize the motto text and drop whitespace tokens
    words.append([x for x in jieba.lcut(motto[1]) if x != ' '])
print("There are %d mottos in total" % len(words))
print(words)
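Before mining itemsets, a plain per-word frequency count already gives a preview of the result; here is a minimal sketch using collections.Counter (not part of the original post), counting each word at most once per motto:

from collections import Counter

counter = Counter()
for w in words:
    counter.update(set(w))  # count each word once per motto
print(counter.most_common(10))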

Finally, use the FP-growth algorithm to find the frequent itemsets among the motto words:
import fpGrowth_py36 as fpG

def findFreq(dataset, minSup):
    initSet = fpG.createInitSet(dataset)
    myFPtree, myHeaderTab = fpG.createTree(initSet, minSup)
    freqList = []
    if myFPtree is not None:
        # myFPtree.disp()
        fpG.mineTree(myFPtree, myHeaderTab, minSup, set([]), freqList)
    return freqList

dataset = words
minSup = 4
freqList = findFreq(dataset, minSup)
print("With support %d, the number of frequent itemsets is %d:" % (minSup, len(freqList)))
print("Frequent itemsets:\n", freqList)
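Here fpGrowth_py36 is assumed to be a local FP-growth module (its function names match the fpGrowth.py implementation from Machine Learning in Action). If such a module is not at hand, an equivalent result can be obtained with the mlxtend library; a minimal sketch, assuming mlxtend and pandas are installed, and noting that mlxtend's min_support is a fraction rather than an absolute count:

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd

# One-hot encode the list of word lists into a boolean DataFrame
te = TransactionEncoder()
te_ary = te.fit(words).transform(words)
df = pd.DataFrame(te_ary, columns=te.columns_)

# An absolute support of 4 mottos, expressed as a fraction of all mottos
freq = fpgrowth(df, min_support=4 / len(words), use_colnames=True)
print(freq.sort_values('support', ascending=False))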

The words “求实” (seek truth through facts), “求是” (seek truth), “自强不息” (constant self-improvement), and “创新” (innovation) each appear four or more times across the 985 mottos. The most frequent word is “创新” (innovation), which is itself a bit short on “创新”:

