[Crawler] Using a Python crawler to collect the links of all articles on Xiaomaimiao's itpub blog and write them to Excel (2)

  • October 11, 2019
  • Notes

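Today Xiaomaimiao shares part 2 of the crawler series: using Python to collect the link of every article on the Xiaomaimiao itpub blog and write the results to Excel.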


The first post ( http://blog.itpub.net/26736162/viewspace-2286553/ ) wrote the links to a txt file; this post writes the crawled results to an Excel spreadsheet instead.

Python source code for the crawler:

import os
import re

import requests
import xlwt

url = 'http://blog.itpub.net/26736162/list/%d/'
pattern = re.compile(r'<a target=_blank href="(.*?)" class="w750"><p class="title">(.*?)</p></a>')

# xlwt writes the legacy .xls format only, so the file is saved as .xls rather than .xlsx.
EXCEL_PATH = './download/excel_write_base.xls'

# Earlier variant that captured only the link:
# pattern = re.compile(r'<a target=_blank href="(.*?)" class="w750"><p class="title">')

# Version from the first post, which appended the results to a text file:
# def write2file(items):
#     with open('./download/lhrbest_itpub_link_title.txt', 'a', encoding='utf-8') as fp:
#         for item in items:
#             item = item[::-1]
#             s = ':'.join(item)
#             fp.write(s + '\n')


def set_style(name, height, colour_index, horz=xlwt.Alignment.HORZ_LEFT, bold=False):
    style = xlwt.XFStyle()  # initialise the style

    font = xlwt.Font()  # create a font for the style
    font.name = name
    font.bold = bold
    # 0 = Black, 1 = White, 2 = Red, 3 = Green, 4 = Blue, 5 = Yellow, 6 = Magenta, 7 = Cyan
    font.colour_index = colour_index
    # Height is in 1/20 of a point: 0x190 (400 decimal) / 20 gives a 20 pt font.
    font.height = height
    style.font = font

    # Cell alignment
    alignment = xlwt.Alignment()
    # Horizontal alignment. May be: HORZ_GENERAL, HORZ_LEFT, HORZ_CENTER, HORZ_RIGHT,
    # HORZ_FILLED, HORZ_JUSTIFIED, HORZ_CENTER_ACROSS_SEL, HORZ_DISTRIBUTED
    alignment.horz = horz
    # Vertical centring. May be: VERT_TOP, VERT_CENTER, VERT_BOTTOM, VERT_JUSTIFIED, VERT_DISTRIBUTED
    alignment.vert = xlwt.Alignment.VERT_CENTER
    style.alignment = alignment  # apply the alignment to the style

    # Cell borders
    borders = xlwt.Borders()
    # Dashed left border. May be: NO_LINE, THIN, MEDIUM, DASHED, DOTTED, THICK, DOUBLE, HAIR,
    # MEDIUM_DASHED, THIN_DASH_DOTTED, MEDIUM_DASH_DOTTED, THIN_DASH_DOT_DOTTED,
    # MEDIUM_DASH_DOT_DOTTED, SLANTED_MEDIUM_DASH_DOTTED, or 0x00 through 0x0D.
    borders.left = xlwt.Borders.DASHED
    borders.right = xlwt.Borders.THIN  # thin right border
    borders.top = xlwt.Borders.THIN  # thin top border
    borders.bottom = xlwt.Borders.THIN  # thin bottom border
    borders.left_colour = 0x10  # border line colours (palette indexes)
    borders.right_colour = 0x20
    borders.top_colour = 0x30
    borders.bottom_colour = 0x40
    style.borders = borders  # apply the borders to the style
    return style


def init_excel():
    # Make sure the output directory exists before the first save.
    os.makedirs(os.path.dirname(EXCEL_PATH), exist_ok=True)
    f = xlwt.Workbook(encoding='gbk')  # create the workbook
    # Create the worksheet that holds the links
    sheet1 = f.add_sheet('Xiaomaimiao itpub blog links', cell_overwrite_ok=True)
    sheet1.col(0).width = 256 * 50
    sheet1.col(1).width = 256 * 50
    rowTitle = ['Article title', 'Link']
    for i in range(len(rowTitle)):
        # The last argument sets the header style.
        sheet1.write(0, i, rowTitle[i], set_style('Courier New', 220, 2, xlwt.Alignment.HORZ_CENTER, True))
    f.save(EXCEL_PATH)
    return f, sheet1


# Write one page of results to the worksheet
def write_excel(rowDatas, f, rowIndex):
    f_excel, f_sheet = f
    # Each list page is assumed to contain 20 articles, so page n starts at row n * 20 + 1.
    rowOffset = rowIndex * 20
    # Build the two cell styles once per call instead of once per cell.
    link_style = set_style('Courier New', 180, 4)
    text_style = set_style('Courier New', 180, 0)
    for k in range(len(rowDatas)):  # outer loop: one row per article
        row = rowDatas[k][::-1]  # (href, title) -> (title, href)
        for j in range(len(row)):  # inner loop: one column per field
            # k + rowOffset + 1 skips the header row; j is the column index.
            if j == 1:
                f_sheet.write(k + rowOffset + 1, j,
                              xlwt.Formula('HYPERLINK("%s","%s")' % (row[j], row[j])), link_style)
            else:
                f_sheet.write(k + rowOffset + 1, j, row[j], text_style)
    # Save once per page rather than after every cell.
    f_excel.save(EXCEL_PATH)


headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}


def loadHtml(page):
    if page >= 1:
        f = init_excel()  # initialise the Excel workbook and its sheet
        for p in range(1, page + 1):
            url_itpub = url % p
            print(url_itpub)
            response = requests.get(url=url_itpub, headers=headers)
            response.encoding = 'utf-8'
            content = response.text
            items = pattern.findall(content)
            write_excel(items, f, p - 1)
    else:
        print('Please enter a number greater than or equal to 1.')


if __name__ == '__main__':
    page = int(input('Number of pages to crawl: '))
    loadHtml(page)
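
To see how the regular expression and the row numbering fit together without touching the network or xlwt, here is a minimal dry-run sketch. The HTML fragment, the example.com URLs, and the `plan_cells` helper are made up for illustration and are not part of the original script; the offset of 20 rows per page mirrors the multiply-by-20 step in `write_excel` and is an assumption about how many articles each list page contains.

```python
import re

# Same pattern as in the script above.
pattern = re.compile(
    r'<a target=_blank href="(.*?)" class="w750"><p class="title">(.*?)</p></a>'
)

# Hypothetical list-page fragment, invented for this example only.
sample_html = (
    '<a target=_blank href="http://example.com/post/1/" class="w750">'
    '<p class="title">First post</p></a>\n'
    '<a target=_blank href="http://example.com/post/2/" class="w750">'
    '<p class="title">Second post</p></a>'
)

ROWS_PER_PAGE = 20  # assumption: every list page holds 20 articles


def plan_cells(items, page_index, rows_per_page=ROWS_PER_PAGE):
    """Return (row, col, value) triples for one page of (href, title) matches."""
    cells = []
    for k, (href, title) in enumerate(items):
        row = page_index * rows_per_page + k + 1  # +1 skips the header row
        cells.append((row, 0, title))  # column 0: article title
        cells.append((row, 1, href))   # column 1: link
    return cells


# findall returns one (href, title) tuple per article because the pattern has two groups.
items = pattern.findall(sample_html)
print(items)

# Cells that the second page (page_index=1) would occupy.
for cell in plan_cells(items, page_index=1):
    print(cell)
```

If a list page ever returns fewer or more than 20 matches, the fixed offset leaves gaps or overwrites rows, so keeping a running row counter across pages would probably be a more robust choice than multiplying by a constant.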

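About Me: Xiaomaimiao

● Author: Xiaomaimiao, who focuses solely on database technology, with an emphasis on its practical use

● Author's blog: http://blog.itpub.net/26736162/abstract/1/

● The topics in this series come from the author's study notes, and some were compiled from online sources; apologies for any infringement or inaccuracy

● All rights reserved. Sharing is welcome; please keep the attribution when reposting

● If any of the answers are wrong, corrections are welcome so that we can all improve together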