
  • October 5, 2019
  • Notes

1. arex https://github.com/ahkimkoo/arex
2. Html2Article http://www.cnblogs.com/jasondan/p/3497757.html




  • Installation
pip install jparser
  • Usage



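  • Download and install, i.e. download the url2io.py file. It is available from this GitHub project: https://github.com/Neo-Luo/scrapy_baidu or, for the latest version, from the GitHub home page: https://github.com/url2io/url2io-python-sdk/
  • Register on the official site to get a token: http://url2io.applinzi.com/
  • Usage: https://github.com/url2io/url2io-python-sdk/
  • url2io python3
  • Main code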
# -*- coding:utf-8 -*-
import time

import requests
import url2io
from jparser import PageModel
from newspaper import Article

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
}


def get_url2io(url):
    # Relies on the module-level `api` object created in the __main__ block.
    try:
        ret = api.article(url=url, fields=['text', 'next'])
        # Strip carriage returns and newlines from the extracted text.
        content = ret['text'].replace('\r', '').replace('\n', '')
        return content
    except Exception as e:
        # import traceback
        # ex_msg = '{exception}'.format(exception=traceback.format_exc())
        # print(ex_msg, e)
        return ''


def get_jparser(url):
    try:
        response = requests.get(url, headers=headers)
        en_code = response.encoding            # encoding declared by the HTTP headers
        de_code = response.apparent_encoding   # encoding guessed from the body
        # print(en_code,de_code,'-----------------')
        if de_code is None:
            if en_code in ['utf-8', 'UTF-8']:  # when en_code is utf-8, decoding as utf-8 recovers the content
                de_code = 'utf-8'
        elif de_code in ['ISO-8859-1', 'ISO-8859-2', 'Windows-1254', 'UTF-8-SIG']:
            de_code = 'utf-8'
        # If either encoding is still None, the call below raises and '' is returned.
        html = response.text.encode(en_code, errors='ignore').decode(de_code, errors='ignore')
        pm = PageModel(html)
        result = pm.extract()
        ans = [x['data'] for x in result['content'] if x['type'] == 'text']
        content = ''.join(ans)
        return content
    except Exception as e:
        # import traceback
        # ex_msg = '{exception}'.format(exception=traceback.format_exc())
        # print(ex_msg, e)
        return ''


if __name__ == '__main__':
    token = '111111111'  # register on the url2io website to get a token
    api = url2io.API(token)
    # Three test pages; only the last assignment takes effect.
    url = 'https://36kr.com/p/5245238'
    url = 'http://sc.stock.cnfol.com/ggzixun/20190909/27678429.shtml'
    url = 'https://news.pedaily.cn/201908/445881.shtml'
    # content = get_url2io(url)
    content = get_jparser(url)
    print(content)
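
The encoding handling in get_jparser is the part most likely to need tuning, so here is a minimal standalone sketch of the same idea using only the standard library. The helper names choose_decoding and redecode are hypothetical (they are not part of requests or jparser), and unlike get_jparser this sketch returns the text unchanged when no usable encoding is known, which is an assumption of the sketch rather than the original behaviour.

```python
def choose_decoding(declared, detected):
    """Pick the codec used to decode a page body.

    declared: encoding taken from the HTTP headers (may be None)
    detected: encoding guessed from the bytes (may be None)
    """
    # Guesses that often turn out to be UTF-8 pages
    # (the same list that get_jparser checks).
    suspicious = {'iso-8859-1', 'iso-8859-2', 'windows-1254', 'utf-8-sig'}
    if detected is None:
        # No guess available: trust the header only if it says UTF-8.
        if declared and declared.lower() == 'utf-8':
            return 'utf-8'
        return None
    if detected.lower() in suspicious:
        return 'utf-8'
    return detected


def redecode(text, declared, detected):
    """Undo a wrong decode: re-encode with `declared`, decode with the chosen codec."""
    target = choose_decoding(declared, detected)
    if declared is None or target is None:
        return text  # assumption of this sketch: leave the text as it is
    return text.encode(declared, errors='ignore').decode(target, errors='ignore')


# A UTF-8 page that was wrongly decoded as ISO-8859-1 comes back intact.
garbled = '中文'.encode('utf-8').decode('iso-8859-1')
print(redecode(garbled, 'ISO-8859-1', 'utf-8'))
```

Keeping the decision in a separate function makes it easy to extend the list of suspicious guesses when a new site misbehaves.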

Using Python Goose:

The code is convenient to use, but some URLs fail to parse. Sample code:

from goose import Goose
from goose.text import StopWordsChinese

url = 'http://www.chinanews.com/gj/2014/11-19/6791729.shtml'
# The config key is 'stopwords_class'; it selects the Chinese stop-word list.
g = Goose({'stopwords_class': StopWordsChinese})
article = g.extract(url=url)
print(article.cleaned_text[:150])


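General-purpose web page main-text extraction based on the line-block distribution function http://wenku.baidu.com/link?url=TOBoIHWT_k68h5z8k_Pmqr-wJMPfCy2q64yzS8hxsgTg4lMNH84YVfOCWUfvfORTlccMWe5Bd1BNVf9dqIgh75t4VQ728fY2Rte3x3CQhaS

Algorithm for extracting a web page's main text and content images http://www.jianshu.com/p/d43422081e4b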


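Main-text density: once all HTML tags are removed, the main-text region has a higher character density and rarely contains runs of blank lines. Line-block length: content outside the main text is usually short within each separate tag (line block).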
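
To make these two observations concrete, the following sketch computes per-line text lengths and line-block lengths for a small hand-written page. It is a simplified illustration under stated assumptions, not the referenced implementation: the function name line_block_lengths, the sample HTML, and the exact block definition (block i sums lines i to i+block_size-1) are all choices made for this example.

```python
import re


def line_block_lengths(html, block_size=3):
    """Return per-line text lengths and the summed length of each line block."""
    # Drop comments, scripts and styles, then every remaining tag. Newlines are
    # kept because the line number is the x-axis of the distribution.
    text = re.sub(r'<!--[\s\S]*?-->', '', html)
    text = re.sub(r'<(script|style)\b[\s\S]*?</\1>', '', text, flags=re.I)
    text = re.sub(r'<[^>]*>|[ \t\r\f\v]', '', text)
    line_lens = [len(line) for line in text.split('\n')]
    # Block i covers lines i .. i+block_size-1.
    blocks = [sum(line_lens[i:i + block_size])
              for i in range(max(len(line_lens) - block_size + 1, 0))]
    return line_lens, blocks


sample = """<div><a href="/">Home</a></div>
<div><a href="/news">News</a></div>
<p>The first long paragraph of the article body goes here.</p>
<p>A second paragraph keeps the character density high.</p>
<p>And a third one closes the article.</p>
<div>Copyright</div>"""

line_lens, blocks = line_block_lengths(sample)
peak = blocks.index(max(blocks))
print(line_lens)
print(blocks, "densest block starts at line", peak)
```

On this sample the navigation and footer lines are short while the three paragraph lines are long, so the densest block is the one that starts at the first paragraph.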

Test source code: https://github.com/rainyear/cix-extractor-py/blob/master/extractor.py#L9

#! /usr/bin/env python3
# -*- coding: utf-8 -*-
import requests as req
import re

DBUG   = 0

reBODY = r'<body.*?>([\s\S]*?)</body>'
reCOMM = r'<!--.*?-->'
reTRIM = r'<{0}.*?>([\s\S]*?)</{0}>'
reTAG  = r'<[\s\S]*?>|[ \t\r\f\v]'

reIMG  = re.compile(r'<img[\s\S]*?src=[\'"]([\s\S]*?)[\'"][\s\S]*?>')

class Extractor():
    def __init__(self, url = "", blockSize=3, timeout=5, image=False):
        self.url       = url
        self.blockSize = blockSize
        self.timeout   = timeout
        self.saveImage = image
        self.rawPage   = ""
        self.ctexts    = []
        self.cblocks   = []

    def getRawPage(self):
        try:
            resp = req.get(self.url, timeout=self.timeout)
        except Exception as e:
            raise e
        if DBUG: print(resp.encoding)
        resp.encoding = "UTF-8"
        return resp.status_code, resp.text

    # Remove all tags, including style and JS script content, but keep the original newline characters (\n):
    def processTags(self):
        self.body = re.sub(reCOMM, "", self.body)
        self.body = re.sub(reTRIM.format("script"), "" ,re.sub(reTRIM.format("style"), "", self.body))
        # self.body = re.sub(r"[\n]+","\n", re.sub(reTAG, "", self.body))
        self.body = re.sub(reTAG, "", self.body)

    # Split the page content into lines, define line block block_i as the combined text of lines [i, i+blockSize],
    # and compute the line-block length as a function of the line number:
    def processBlocks(self):
        self.ctexts   = self.body.split("\n")
        self.textLens = [len(text) for text in self.ctexts]
        self.cblocks  = [0]*(len(self.ctexts) - self.blockSize - 1)
        lines = len(self.ctexts)
        for i in range(self.blockSize):
            self.cblocks = list(map(lambda x,y: x+y, self.textLens[i : lines-1-self.blockSize+i], self.cblocks))
        maxTextLen = max(self.cblocks)
        if DBUG: print(maxTextLen)
        self.start = self.end = self.cblocks.index(maxTextLen)
        while self.start > 0 and self.cblocks[self.start] > min(self.textLens):
            self.start -= 1
        # Bounded by len(self.cblocks) so the lookup never runs past the end of the list.
        while self.end < len(self.cblocks) and self.cblocks[self.end] > min(self.textLens):
            self.end += 1
        return "".join(self.ctexts[self.start:self.end])

    # To extract images that appear in the main-text region, keep the content of <img> tags
    # when removing tags in the first step:
    def processImages(self):
        self.body = reIMG.sub(r'{{\1}}', self.body)

    # The main text lies in the longest line block; cut on both sides out to where the line-block length drops to 0:
    def getContext(self):
        code, self.rawPage = self.getRawPage()
        self.body = re.findall(reBODY, self.rawPage)[0]
        if DBUG: print(code, self.rawPage)
        if self.saveImage:
            self.processImages()
        self.processTags()
        return self.processBlocks()
        # print(len(self.body.strip("\n")))

if __name__ == '__main__':
    ext = Extractor(url="http://blog.rainy.im/2015/09/02/web-content-and-main-image-extractor/",blockSize=5, image=False)
    print(ext.getContext())


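Keeping the image links from <img> tags also increases the main-text density. Problems found so far in limited testing: 1) pages whose articles are paginated or loaded dynamically; 2) pages where overly long comments overshadow the article itself.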
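
As a small illustration of the image-link idea, the sketch below replaces each <img> tag with a {{src}} placeholder before the remaining tags are stripped, so the link survives as ordinary text and adds to the length of its line. The pattern IMG_SRC and the helper names are hypothetical and written for this example; a real page may need a more tolerant pattern or a proper HTML parser.

```python
import re

# Hypothetical pattern for this sketch: capture the src value of each <img> tag.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\'][^>]*>', re.I)


def keep_image_links(html):
    """Replace each <img ...> tag with a {{src}} placeholder that survives tag stripping."""
    return IMG_SRC.sub(lambda m: '{{' + m.group(1) + '}}', html)


def strip_tags(html):
    """Remove every remaining tag but keep the text and newlines."""
    return re.sub(r'<[^>]*>', '', html)


html = '<p>Figure one:</p>\n<p><img class="pic" src="http://example.com/a.png" alt="a"></p>'
text = strip_tags(keep_image_links(html))
links = re.findall(r'\{\{(.*?)\}\}', text)
print(text)
print(links)
```

After extraction, the placeholders can be pulled back out of the main text with the same {{...}} pattern, as the last lines show.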

References:
https://blog.csdn.net/weixin_43098787/article/details/88633973
https://www.cnblogs.com/zhaobang/p/7472091.html
https://blog.csdn.net/levy_cui/article/details/51481306
https://www.v2ex.com/t/309948