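An Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples

  • March 18, 2020
  • Notes

BeautifulSoup is a Python library for extracting data from HTML or XML documents. This article introduces it through simple examples that cover parsing HTML, retrieving content, navigating nodes, and selecting elements with CSS selectors and reading their attributes.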

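1. Introduction

BeautifulSoup is a flexible, convenient, and efficient library for parsing web pages, and it supports several parsers. With it you can extract information from a page without writing regular expressions.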
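As a small illustration of extracting data without regular expressions, the sketch below pulls a heading and a list of links out of a fragment. The `page` string is made up for this example, and it uses Python's built-in `html.parser` (rather than lxml) only so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for this illustration.
page = '<h1>News</h1><a href="/a">First</a> <a href="/b">Second</a>'

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(page, "html.parser")

heading = soup.h1.string  # text inside the first <h1>
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```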

2. Quick Start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Parse with the lxml parser (the lxml package must be installed)
soup = BeautifulSoup(html_doc, 'lxml')

Print the whole document, formatted:

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the parsed structure

print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the name of the tag
print(soup.title.string)       # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)                  # the first <p> tag
print(soup.p['class'])         # the class of the first <p> tag
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # all <a> tags
print(soup.find(id="link3"))   # the first tag with id="link3"

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links in all <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text content:

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
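The raw output of get_text() keeps the whitespace of the source document. As a hedged sketch, the `separator` and `strip` arguments of get_text() and the `stripped_strings` generator can tidy this up. The `snippet` string is invented for the example, and `html.parser` is used only to avoid assuming lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace around the text.
snippet = "<div><p> Hello, <b>world</b>! </p><p>Bye</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

raw = soup.get_text()                             # strings joined as-is
joined = soup.get_text(separator="|", strip=True) # each string stripped, then joined
pieces = list(soup.stripped_strings)              # the stripped strings, one by one

print(repr(raw))
print(joined)
print(pieces)
```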

Automatically completing tags and formatting the output

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())              # format the markup; missing closing tags are filled in
print(soup.title.string)            # the content of the <title> tag
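To see the tag completion in isolation, here is a minimal sketch with a deliberately truncated fragment. The `broken` string is made up, and the example uses `html.parser`; lxml would, in addition, wrap the fragment in <html> and <body> tags, so its output would differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <b> nor <p> is ever closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)  # serialize the parsed tree back to markup

print(repaired)
```

The parser closes any tags that are still open when the input ends, which is why the serialized tree is well formed even though the input was not.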

Tag selectors

Selecting elements

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)        # select the <title> tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the tag name

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the name attribute of the first <p> tag
print(soup.p['name'])        # a more direct way to write the same thing

Getting tag content

print(soup.p.string)
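A few edge cases of attribute and content access are worth knowing; the sketch below shows them on an invented `snippet` (parsed with `html.parser` so that lxml is not assumed). Multi-valued attributes such as class come back as a list, `get()` returns None for a missing attribute where `[]` raises KeyError, and `.string` is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <p> with two classes and mixed content.
snippet = '<p class="intro lead" id="first">one <b>two</b></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

classes = p["class"]       # class is multi-valued, so this is a list
missing = p.get("title")   # get() returns None for a missing attribute
whole = p.string           # None: <p> has two children, a string and <b>
inner = p.b.string         # <b> has exactly one string child
text = p.get_text()        # all nested text joined together

try:
    p["title"]             # [] raises KeyError for a missing attribute
except KeyError:
    print("no title attribute")

print(classes, missing, whole, inner, text)
```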

Nested tag selection

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

Child and descendant nodes

from bs4 import BeautifulSoup

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
      and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the direct children of the tag, as a list

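Another way is children:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the direct children of the tag
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

The output is the same as above, with an index added. Note that children returns an iterator rather than a list, so you have to loop over it (or pass it to list()) to see the child nodes.

Getting descendant nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # an iterator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)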
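The difference between children and descendants is easiest to see on a tiny tree. The following sketch uses an invented `snippet` without whitespace between tags (so no whitespace-only text nodes appear) and `html.parser` instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: <div> has two direct children, <p> and <span>.
snippet = "<div><p>a<b>b</b></p><span>c</span></div>"
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

child_names = [child.name for child in div.children]  # direct children only
all_descendants = list(div.descendants)                # tags and strings, depth first
descendant_tags = [d.name for d in all_descendants if isinstance(d, Tag)]

print(child_names)
print(len(all_descendants))
print(descendant_tags)
```

Here descendants yields six nodes: the three tags plus the three strings they contain.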

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # the parent node of the tag

parents

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestor nodes of the tag
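Printing whole ancestor nodes is noisy, so a common variation is to print only their names. A minimal sketch on an invented `snippet`, parsed with `html.parser`, which walks from the <a> tag up to the document root:

```python
from bs4 import BeautifulSoup

# Hypothetical, fully nested fragment.
snippet = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

direct_parent = soup.a.parent.name
ancestor_names = [parent.name for parent in soup.a.parents]

print(direct_parent)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special value "[document]".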

Sibling nodes

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))      # siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag
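Siblings include text nodes, not only tags, which often surprises people. The sketch below, on an invented `snippet` parsed with `html.parser`, shows the text node that sits between two links and two ways to skip over it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: two links separated by ", ".
snippet = '<p><a id="link1">A</a>, <a id="link2">B</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

immediate = first.next_sibling               # the string ", ", not the next <a>
next_link = first.find_next_sibling("a")     # skips text and finds the next <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(immediate))
print(next_link["id"])
print([t["id"] for t in tag_siblings])
```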

Standard selectors

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.)

name

from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # find all <ul> tags
print(type(soup.find_all('ul')[0]))  # check the type of a result

The following example finds the <li> tags under every <ul> tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
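The name argument accepts more than a single string. As a hedged sketch on a shortened, invented version of the panel markup (parsed with `html.parser`), the example below passes a list of names and a compiled regular expression, and also shows the `limit` and `recursive` arguments of find_all().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the panel markup.
snippet = """
<div class="panel">
  <h4>Hello</h4>
  <ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
  <ul id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_list = [t.name for t in soup.find_all(["h4", "ul"])]          # any of several names
by_regex = [t["id"] for t in soup.find_all(re.compile("^u"))]    # names starting with "u"
first_two = [li.string for li in soup.find_all("li", limit=2)]   # stop after two matches
direct_uls = soup.div.find_all("ul", recursive=False)            # direct children only
direct_lis = soup.div.find_all("li", recursive=False)            # <li> are nested deeper

print(by_list, by_regex, first_two, len(direct_uls), len(direct_lis))
```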

attrs (attributes)

Find elements by their attributes:

from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls find the same element, because the two attributes belong to the same tag.

Searching with special keyword arguments

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
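Two details of attribute matching are easy to miss, and the sketch below illustrates them on an invented `snippet` (the `data-kind` attribute is hypothetical; `html.parser` is used instead of lxml). First, class_ matches a tag if any one of its classes equals the given value. Second, attribute names that are not valid Python identifiers, such as data-* attributes, have to go through the attrs dict.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; data-kind is an invented attribute.
snippet = """
<ul class="list" id="list-1" data-kind="main"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

any_list = [ul["id"] for ul in soup.find_all(class_="list")]         # matches both <ul>
small_only = [ul["id"] for ul in soup.find_all(class_="list-small")]
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]
combined = [ul["id"] for ul in soup.find_all("ul", id="list-2", class_="list")]

print(any_list, small_only, by_data, combined)
```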


text

Select by text content:

from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # finds strings equal to 'Foo'; returns the strings, not the tags

So text is handy for matching content, but less convenient for locating elements, because it returns the matching strings rather than the tags that contain them.
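When the enclosing tags are what you actually need, there are two common ways to get them, sketched below on an invented `snippet` parsed with `html.parser`. The example uses the `string` argument, which assumes Beautiful Soup 4.4.0 or later; on older versions the same calls would use `text`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list with a repeated item.
snippet = "<ul><li>Foo</li><li>Bar</li><li>Foo</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

strings = soup.find_all(string="Foo")            # matching strings only
via_parent = [s.parent.name for s in strings]    # step up from each string to its tag
direct = soup.find_all("li", string="Foo")       # combine a tag name with a string filter
starts_with_b = soup.find_all(string=re.compile("^B"))  # a regex also works

print(strings, via_parent, direct, starts_with_b)
```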

Methods

find

find takes the same arguments as find_all, but returns only the first matching element (or None if nothing matches).
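Because find() can return None, code that chains attribute access onto its result should guard against a missing element. A minimal sketch on an invented `snippet`, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no <h1> tag.
snippet = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("li")                   # first match only
same = soup.find_all("li", limit=1)[0]    # equivalent via find_all
missing = soup.find("h1")                 # no match: None
missing_all = soup.find_all("h1")         # no match: empty result

# Guard before using the result, otherwise None.get_text() would fail.
title = missing.get_text() if missing is not None else ""

print(first.string, same.string, missing, len(missing_all), repr(title))
```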

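find_parents(), find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next(), find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous(), find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.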
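The methods listed above can be sketched together on one small tree. The `snippet` and its ids are invented for this illustration, and `html.parser` is used so that lxml is not assumed. Each call starts from the middle <li> and accepts the same filters as find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; "box" and "mid" are made-up ids.
snippet = (
    '<div id="box"><ul><li>Foo</li><li id="mid">Bar</li><li>Jay</li></ul>'
    "<p>end</p></div>"
)
soup = BeautifulSoup(snippet, "html.parser")
mid = soup.find(id="mid")

parent_div = mid.find_parent("div")                           # nearest <div> ancestor
ancestors = [t.name for t in mid.find_parents(["ul", "div"])] # nearest ancestor first
next_li = mid.find_next_sibling("li")                         # first following sibling
prev_li = mid.find_previous_sibling("li")                     # first preceding sibling
later_lis = mid.find_all_next("li")                           # all <li> later in the document
next_p = mid.find_next("p")                                   # first <p> later in the document

print(parent_div["id"], ancestors, next_li.string, prev_li.string)
print([li.string for li in later_lis], next_p.string)
```

Note that find_next() and find_all_next() search everything that comes later in the document, not only siblings, which is why the <p> outside the list is found.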

CSS selectors

Pass a CSS selector directly to select() to make a selection.

from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # "." means class; a space separates ancestor and descendant
print(soup.select('ul li'))                  # <li> tags under <ul> tags
print(soup.select('#list-2 .element'))       # "#" means id: elements with class "element" under the tag with id "list-2"
print(type(soup.select('ul')[0]))            # print the type of a result node

Selections can also be nested:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
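Beyond the descendant selectors shown above, select() also understands child combinators, attribute selectors, and pseudo-classes, and select_one() returns just the first match. The sketch below uses a shortened, invented version of the panel markup with `html.parser`, and assumes a Beautiful Soup version whose select() is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the panel markup.
snippet = """
<div class="panel">
  <ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
  <ul class="list list-small" id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_small = soup.select_one("#list-2 li")       # first match, or None
direct = soup.select("ul.list-small > li")        # ">" means direct child
by_attr = soup.select('ul[id="list-1"] li')       # attribute selector
last_item = soup.select("#list-1 li:last-child")  # pseudo-class
nothing = soup.select_one("table")                # no match: None

print(first_small.string, len(direct), len(by_attr))
print([li.string for li in last_item], nothing)
```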

Getting attributes

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # use [] to get an attribute
    print(ul.attrs['id'])  # another way to write it

Getting content

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

The get_text() method returns the text content of each element.

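Summary

Prefer the lxml parser; fall back to html.parser when necessary.

Tag selectors (such as soup.title) offer only weak filtering but are fast. Use find() and find_all() to match a single result or multiple results.

If you are familiar with CSS selectors, use select().

Remember the common ways to get attribute values and text.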