Python Web Scraping from Scratch (5): The pyQuery Library


October 5, 2019
Notes

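What is pyQuery:

pyQuery is a powerful and flexible library for parsing web pages. If you find regular expressions too much trouble to write (I can't write them myself), if you find BeautifulSoup's syntax hard to remember, and if you already know jQuery's syntax, then PyQuery is your best choice.

Installing pyQuery: run pip3 install pyquery and you are done.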

Basic usage of pyQuery:

Initialization:

Initializing from a string:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    # Parse the HTML string, then select every <a> element
    doc = pq(html)
    print(doc('a'))

Output: running it prints the three <a> elements in the document.

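Initializing from a URL:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initializing from a URL
    from pyquery import PyQuery as pq

    # A string starting with http:// or https:// is fetched and then parsed
    doc = pq('http://www.baidu.com')
    print(doc('input'))

Output: running it prints the <input> elements found on the Baidu home page.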

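Initializing from a file:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Initializing from a file
    from pyquery import PyQuery as pq

    # baidu.html must exist in the current working directory
    doc = pq(filename='baidu.html')
    print(doc('title'))

Output: running it prints the <title> element of baidu.html.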

Selection works the same way as in jQuery: selecting by id, name, or class behaves identically, and many other features match jQuery as well.

Basic CSS selectors:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # CSS selectors
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    # Select every element whose class is "title"
    doc = pq(html)
    print(doc('.title'))

Output: running it prints the two elements whose class is title, the first <p> and the third <a>.

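Finding elements:

Child elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    doc = pq(html)
    items = doc('.title')   # elements whose class is "title"
    print(type(items))
    print(items)
    p = items.find('b')     # <b> elements inside the matched elements
    print(type(p))
    print(p)

This code selects the elements whose class is title. Two elements have that class: a p tag and an a tag. We then call find to locate the b tag inside them. Output: both type calls report a PyQuery object, the first print shows the two matched elements, and the last print shows the <b> element.

Note that find searches inside each matched element, across all of its descendants and not only its direct children.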

children:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    doc = pq(html)
    items = doc('.title')
    # children() returns the direct children of the matched elements
    print(items.children())

Output: running it prints the <b> element, which is the only direct child of the matched elements.

You can also pass a selector to children() to filter the result:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Child elements, filtered by a selector
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    doc = pq(html)
    items = doc('.title')
    # Keep only the direct children that are <b> elements
    print(items.children('b'))

The output is the same as above.

Parent elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Parent element
    from pyquery import PyQuery as pq

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>
    """

    doc = pq(html)
    items = doc('#link1')
    print(items)
    # parent() returns the direct parent of the matched element
    print(items.parent())

Output: running it prints the #link1 element and then its parent, the <p class="story"> element.

parent returns only the direct parent. The parents method returns all ancestor elements, and it also accepts a selector to filter them:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Ancestor elements
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('#link1')
    print(items)
    # Of all ancestors of #link1, keep only the <body> element
    print(items.parents('body'))

Output: running it prints the #link1 element and then the <body> element that contains it.

Sibling elements:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Sibling elements
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('#link1')
    print(items)
    # Of all siblings of #link1, keep only the one with id link2
    print(items.siblings('#link2'))

Output: running it prints the #link1 element and then its sibling #link2.

That covers the methods for finding elements. Next, let's look at how to iterate over elements.

Iteration

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Iterating over elements
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('a')
    # items() yields each matched element as its own PyQuery object
    for k, v in enumerate(items.items()):
        print(k, v)

Output: running it prints each of the four <a> elements preceded by its index, from 0 to 3.

Getting information:

Getting attributes:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Getting attributes
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('a')
    print(items)
    # Both forms return the href of the first matched element
    print(items.attr('href'))
    print(items.attr.href)

Output: running it prints the four <a> elements, followed by http://example.com/elsie twice. When several elements match, attr returns the attribute of the first one only.

Getting text:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Getting text
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('a')
    print(items)
    # text() returns the text of all matched elements as a single string
    print(items.text())
    print(type(items.text()))

Output: running it prints the four <a> elements, then the text of all four links joined into one string, and finally the type of that string, str.

Getting HTML:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Getting HTML
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('a')
    # html() returns the inner HTML of the first matched element
    print(items.html())

Output: running it prints the inner HTML of the first <a> element, that is, the <span>Elsie</span> markup.

DOM operations:

addClass, removeClass

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM operations: addClass, removeClass
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('#link2')
    print(items)
    items.addClass('addStyle')      # can also be written add_class
    print(items)
    items.remove_class('sister')    # can also be written removeClass
    print(items)

Output: running it prints the #link2 element three times: unchanged, with the addStyle class added, and with the sister class removed.

attr, css:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM operations: attr, css
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    items = doc('#link2')
    items.attr('name', 'addname')   # set the name attribute
    print(items)
    items.css('width', '100px')     # set a CSS property in the style attribute
    print(items)

attr can add a new attribute; if the attribute already exists, its old value is overwritten.

Output: running it prints the #link2 element with the new name attribute, and then again with a style attribute that sets the width.

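remove:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # DOM operations: remove
    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        Hello World
        <p>This is a paragraph.</p>
    </div>
    """

    doc = pq(html)
    wrap = doc('.wrap')
    print(wrap.text())
    # Remove the <p> element from the tree
    wrap.find('p').remove()
    print("Data after remove")
    print(wrap)

Output: running it first prints the text of the whole div, including the paragraph, and then the div with the <p> element removed, leaving only Hello World.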

There are many other DOM methods. If you want to learn more, read the official documentation at: https://pyquery.readthedocs.io/en/latest/api.html

Pseudo-class selectors:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Pseudo-class selectors
    from pyquery import PyQuery as pq

    html = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
                <a href="http://example.com/elsie" class="sister" id="link1">
                    <span>Elsie</span>
                </a>
                <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
                and
                <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
                <a href="http://example.com/elsie" class="body" id="link4">Title</a>
            </p>
            <p class="story">...</p>
        </body>
    </html>
    """

    doc = pq(html)
    # print(doc)
    wrap = doc('a:first-child')     # <a> that is the first child of its parent
    print(wrap)
    wrap = doc('a:last-child')      # <a> that is the last child of its parent
    print(wrap)
    wrap = doc('a:nth-child(2)')    # <a> that is the second child of its parent
    print(wrap)
    wrap = doc('a:gt(2)')           # <a> elements with index greater than 2; indexes start at 0, not 1
    print(wrap)
    wrap = doc('a:nth-child(2n)')   # <a> elements at even child positions (2, 4, ...)
    print(wrap)
    wrap = doc('a:contains(Lacie)') # <a> elements whose text contains "Lacie"
    print(wrap)

I won't list every selector here. To learn more about CSS selectors, see the W3School CSS tutorial at: http://www.w3school.com.cn/css/index.asp

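That roughly covers how to use pyQuery. For more detail, read the official documentation at: https://pyquery.readthedocs.io/en/latest/

The code above is available at: https://gitee.com/dwyui/pyQuery.git

Thanks for reading. If anything here is incorrect, corrections are welcome. Thank you.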
