Python Web Scraping from Scratch (5): The pyQuery Library
- October 5, 2019
- Notes
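What is pyQuery:
pyQuery is a powerful and flexible library for parsing web pages. If you find regular expressions too tedious to write (I can't write them myself), if you find BeautifulSoup's syntax too hard to remember, and if you already know jQuery's syntax, then pyQuery is a very good choice.
Installing pyQuery: run pip3 install pyquery.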
Basic usage of pyQuery:
Initialization:
Initializing from a string:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a string
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
# Select every <a> tag in the document
print(doc('a'))
Initializing from a URL:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a URL
from pyquery import PyQuery as pq

# pyQuery fetches the page first, then parses it
doc = pq('http://www.baidu.com')
print(doc('input'))
Initializing from a file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Initialize from a file
from pyquery import PyQuery as pq

# baidu.html must exist in the current working directory
doc = pq(filename='baidu.html')
print(doc('title'))
Selection works the same way as in jQuery: selecting by id, name, or class follows jQuery's syntax, and so do many other selectors.
Basic CSS selectors:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# CSS selectors
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
# '.title' selects every element whose class is "title"
print(doc('.title'))
Finding elements:
Descendant elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Descendant elements
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('.title')
print(type(items))
print(items)
# Search inside the matched elements for <b> tags
b_tags = items.find('b')
print(type(b_tags))
print(b_tags)
This code selects every element whose class is title. Two elements match: one is a p tag and the other is an a tag. We then call find('b') to get the b tags nested inside those elements.
Note that find searches inside each matched element, at any depth.
children:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('.title')
# Direct children of every matched element
print(items.children())
You can also pass a selector to children() to filter the result:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Child elements, filtered by a selector
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('.title')
# Keep only the direct children that are <b> tags
print(items.children('b'))
The output is the same as above.
Parent element:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Parent element
from pyquery import PyQuery as pq

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="title" id="link3">Title</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('#link1')
print(items)
# parent() returns the direct parent only
print(items.parent())
parent() returns only the direct parent. The parents() method returns all ancestors, and passing a selector keeps only the ancestors that match it.
Ancestor elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Ancestor elements
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('#link1')
print(items)
# Of all ancestors, keep only the <body> tag
print(items.parents('body'))
Sibling elements:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sibling elements
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('#link1')
print(items)
# Of all siblings of #link1, keep only the one with id "link2"
print(items.siblings('#link2'))
That covers the methods for finding elements. Next, let's look at how to iterate over elements.
Iteration:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Iteration
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('a')
# items() yields each matched element as its own PyQuery object
for index, item in enumerate(items.items()):
    print(index, item)
Getting information:
Getting attributes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting attributes
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('a')
print(items)
# Both forms read the href attribute of the first matched element
print(items.attr('href'))
print(items.attr.href)
Getting text:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting text
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('a')
print(items)
# text() returns the text of all matched elements as a single string
print(items.text())
print(type(items.text()))
Getting HTML:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Getting HTML
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('a')
# html() returns the inner HTML of the first matched element only
print(items.html())
DOM manipulation:
addClass, removeClass:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: addClass, removeClass
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('#link2')
print(items)
items.addClass('addStyle')    # same as add_class
print(items)
items.remove_class('sister')  # same as removeClass
print(items)
attr, css:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: attr, css
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
items = doc('#link2')
# Set the name attribute
items.attr('name', 'addname')
print(items)
# Set a CSS property; it is stored in the style attribute
items.css('width', '100px')
print(items)
attr() with two arguments sets an attribute. If the element already has that attribute, the old value is overwritten.
remove:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# DOM manipulation: remove
from pyquery import PyQuery as pq

html = """
<div class="wrap">
Hello World
<p>This is a paragraph.</p>
</div>
"""

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
# Remove the <p> tag from inside .wrap
wrap.find('p').remove()
print("Data after remove")
print(wrap)
There are many other DOM methods. If you want to learn more, read the official documentation: https://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Pseudo-class selectors
from pyquery import PyQuery as pq

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story" id="dromouse">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/elsie" class="sister" id="link3">Title</a>
<a href="http://example.com/elsie" class="body" id="link4">Title</a>
</p>
<p class="story">...</p>
"""

doc = pq(html)
# print(doc)
wrap = doc('a:first-child')      # <a> tags that are the first child element of their parent
print(wrap)
wrap = doc('a:last-child')       # <a> tags that are the last child element of their parent
print(wrap)
wrap = doc('a:nth-child(2)')     # <a> tags that are the second child element of their parent
print(wrap)
wrap = doc('a:gt(2)')            # <a> tags whose index is greater than 2; indexes start at 0, not 1
print(wrap)
wrap = doc('a:nth-child(2n)')    # <a> tags at even child positions (2nd, 4th, ...)
print(wrap)
wrap = doc('a:contains(Lacie)')  # <a> tags whose text contains "Lacie"
print(wrap)
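I won't list every selector here. To learn more about CSS selectors, see the CSS tutorial on w3school: http://www.w3school.com.cn/css/index.asp
That covers the basic usage of pyQuery. For more detail, read the official documentation: https://pyquery.readthedocs.io/en/latest/
The code for the examples above is available at: https://gitee.com/dwyui/pyQuery.git
Thank you for reading. If anything here is incorrect, please point it out. Thank you.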