Scraping CSDN blog posts with a Python crawler
- January 12, 2020
- Notes
Last night, in order to download and save all the blog posts of a certain CSDN expert, I wrote a crawler to fetch the articles automatically and save them to a txt file (they could just as easily be saved as HTML pages). No more Ctrl+C and Ctrl+V, which is very convenient, and scraping other sites works much the same way.
To parse the fetched pages I used a third-party module, BeautifulSoup, which is very handy for parsing HTML documents. You could of course parse them yourself with regular expressions, but that is rather tedious.
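For example, here is a minimal sketch of the BeautifulSoup approach (the sample HTML below is made up purely for illustration):

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = '<html><body><div id="article_content"><p>Hello, CSDN</p></div></body></html>'

# One find() call pulls out the article div by id; the equivalent
# hand-written regular expression would be longer and more fragile.
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', {'id': 'article_content'})
print div.get_text()   # -> Hello, CSDN
```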
Since CSDN's robots.txt disallows all crawlers, the crawler has to disguise itself as a browser. It also must not fetch pages too frequently: sleep for a while between requests, because hammering the site will get your IP banned, although you can work around that with proxy IPs.
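A minimal sketch of those tricks with urllib2 (the proxy address below is a placeholder, not a real proxy; the User-Agent string is one of those used in the full script further down):

```python
# -*- coding: utf-8 -*-
import urllib2
import time

url = 'http://blog.csdn.net/mangoer_ys/article/details/38427979'

# Pretend to be a normal desktop browser by sending a browser User-Agent.
req = urllib2.Request(url)
req.add_header('User-Agent',
               'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) '
               'Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7')

# Optionally route requests through a proxy (placeholder address):
# proxy_support = urllib2.ProxyHandler({'http': 'http://1.2.3.4:8080'})
# urllib2.install_opener(urllib2.build_opener(proxy_support))

page = urllib2.urlopen(req).read()

# Be polite: wait a while before fetching the next page to avoid an IP ban.
time.sleep(10)
```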
```python
# -*- encoding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39

@author: Mangoer
@email: [email protected]
'''

import urllib2
import re
from bs4 import BeautifulSoup
import random
import time


class CSDN_Blog_Spider:
    def __init__(self, url):
        print '\n'
        print 'Web crawler started...'
        print 'Page URL: ' + url

        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
            "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
        ]

        # use proxy ip
        # ips_list = ['60.220.204.2:63000', '123.150.92.91:80', '121.248.150.107:8080', '61.185.21.175:8080',
        #             '222.216.109.114:3128', '118.144.54.190:8118', '1.50.235.82:80', '203.80.144.4:80']
        # ip = random.choice(ips_list)
        # print 'Proxy IP in use: ' + ip
        # proxy_support = urllib2.ProxyHandler({'http': 'http://' + ip})
        # opener = urllib2.build_opener(proxy_support)
        # urllib2.install_opener(opener)

        # Disguise the request as a browser with a randomly chosen User-Agent.
        agent = random.choice(user_agents)

        req = urllib2.Request(url)
        req.add_header('User-Agent', agent)
        req.add_header('Host', 'blog.csdn.net')
        req.add_header('Accept', '*/*')
        req.add_header('Referer', 'http://blog.csdn.net/mangoer_ys?viewmode=list')
        req.add_header('GET', url)

        html = urllib2.urlopen(req)
        # CSDN pages are served as GBK; re-encode to UTF-8 for processing.
        page = html.read().decode('gbk', 'ignore').encode('utf-8')

        self.page = page
        self.title = self.getTitle()
        self.content = self.getContent()
        self.saveFile()

    def printInfo(self):
        print 'Article title: ' + self.title + '\n'
        print 'The content has been saved to out.txt!'

    def getTitle(self):
        rex = re.compile('<title>(.*?)</title>', re.DOTALL)
        match = rex.search(self.page)
        if match:
            return match.group(1)
        return 'NO TITLE'

    def getContent(self):
        bs = BeautifulSoup(self.page)
        html_content_list = bs.findAll('div', {'id': 'article_content', 'class': 'article_content'})
        html_content = str(html_content_list[0])

        # Keep only the text between tags, dropping the markup itself.
        rex_p = re.compile(r'(?:.*?)>(.*?)<(?:.*?)', re.DOTALL)
        p_list = rex_p.findall(html_content)

        content = ''
        for p in p_list:
            if p.isspace() or p == '':
                continue
            content = content + p
        return content

    def saveFile(self):
        outfile = open('out.txt', 'a')
        outfile.write(self.content)
        outfile.close()

    def getNextArticle(self):
        bs2 = BeautifulSoup(self.page)
        html_nextArticle_list = bs2.findAll('li', {'class': 'prev_article'})
        # print str(html_nextArticle_list[0])
        html_nextArticle = str(html_nextArticle_list[0])
        # print html_nextArticle
        rex_link = re.compile(r'<a href="(.*?)"', re.DOTALL)
        link = rex_link.search(html_nextArticle)
        # print link.group(1)
        if link:
            next_url = 'http://blog.csdn.net' + link.group(1)
            return next_url
        return None


class Scheduler:
    def __init__(self, url):
        self.start_url = url

    def start(self):
        spider = CSDN_Blog_Spider(self.start_url)
        spider.printInfo()

        # Keep following the "previous article" link until there is none left.
        while True:
            if spider.getNextArticle():
                spider = CSDN_Blog_Spider(spider.getNextArticle())
                spider.printInfo()
            elif spider.getNextArticle() == None:
                print 'All articles have been downloaded!'
                break
            time.sleep(10)


# url = raw_input('Please enter a CSDN blog post URL: ')
url = "http://blog.csdn.net/mangoer_ys/article/details/38427979"

Scheduler(url).start()
```
There is one problem in the program that I have not been able to solve: I cannot use the article title to name the output file, so all the articles end up in a single out.txt. It seems to be an encoding issue; I hope someone more knowledgeable can fix it.
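For what it's worth, one possible workaround is sketched below (untested assumption: the title returned by getTitle() is a UTF-8 byte string, as produced by the re-encoding step in __init__; the helper name filename_from_title is made up for illustration):

```python
# -*- coding: utf-8 -*-
import re

def filename_from_title(title):
    # Hypothetical helper: decode the UTF-8 byte string to unicode,
    # replace characters that are illegal in file names, and build a
    # per-article path instead of appending everything to out.txt.
    if isinstance(title, str):
        title = title.decode('utf-8', 'ignore')
    title = re.sub(ur'[\\/:*?"<>|]', u'_', title)
    return title.strip() + u'.txt'

# Possible use inside saveFile(), replacing the hard-coded 'out.txt':
# outfile = open(filename_from_title(self.title).encode('utf-8'), 'w')
```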