How to use the BeautifulSoup library in Python - 创新互联

This article walks through how to use the BeautifulSoup library in Python. Many people are not very familiar with it, so the main points are summarized below.


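1. Introduction to the BeautifulSoup library

Among Chinese Python users BeautifulSoup is nicknamed "靚湯" ("beautiful soup"). Like lxml, it is a library for parsing HTML/XML, and its main job is to parse HTML/XML data and extract information from it. BeautifulSoup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers such as lxml and html5lib. html.parser needs no extra installation; lxml has to be installed separately and is more powerful and faster. If no parser is named, BeautifulSoup picks the best one that is installed.

Note that Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You therefore rarely need to think about encodings; if the document does not declare its encoding and Beautiful Soup cannot detect it, you only need to state the original encoding.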
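As a rough illustration of the encoding behaviour described above, the sketch below feeds GBK-encoded bytes to Beautiful Soup. The sample markup and variable names are made up for this example; from_encoding and original_encoding are the arguments and attributes bs4 provides for stating and inspecting the source encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny page stored as GBK-encoded bytes
raw = '<html><head><title>百度</title></head></html>'.encode('gbk')

# State the original encoding explicitly; strings in the tree are Unicode
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
title = soup.title.string
print(title)
print(soup.original_encoding)

# Serialized output is UTF-8 bytes unless another encoding is requested
out = soup.encode()
print(type(out))
```

If the bytes carry no usable hint and no from_encoding is given, Beautiful Soup falls back to guessing, which is where stating the encoding helps.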

Install the BeautifulSoup4 library with pip:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ beautifulsoup4 # install from the Tsinghua University mirror

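2. Main parsers for the BeautifulSoup library

In the code below, html.parser is a parser for HTML pages. Beautiful Soup also works with other parsers, suited to different kinds of documents.

from bs4 import BeautifulSoup

demo = '<html><head><title>demo</title></head><body></body></html>'  # HTML source text, not a URL
soup = BeautifulSoup(demo, 'html.parser')

Parser	Usage	Requirement
bs4's HTML parser (Python standard library)	BeautifulSoup(demo, 'html.parser')	install the bs4 library
lxml HTML parser	BeautifulSoup(demo, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(demo, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(demo, 'html5lib')	pip install html5lib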
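Because the third-party parsers in the table are optional installs, a script may want to fall back gracefully. The following is a minimal sketch, assuming a hypothetical helper named make_soup and a preference order chosen only for illustration; bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.string)
```

Different parsers repair broken markup differently, so pinning one parser explicitly keeps results reproducible across machines.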

3. Basic use of BeautifulSoup

Take a simple web page as an example: part of the source code of the Baidu search page.

<!DOCTYPE html>
<html>
<head>
 <meta content="text/html;charset=utf-8" http-equiv="content-type" />
 <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
 <meta content="always" name="referrer" />
 <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
 <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
 <div>
  <div>
   <div>
    <div>
     <a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews">新聞 </a>
     <a href="https://www.hao123.com" rel="external nofollow" name="tj_trhao123">hao123 </a>
     <a href="http://map.baidu.com" rel="external nofollow" name="tj_trmap">地圖 </a>
     <a href="http://v.baidu.com" rel="external nofollow" name="tj_trvideo">視頻 </a>
     <a href="http://tieba.baidu.com" rel="external nofollow" name="tj_trtieba">貼吧 </a>
     <a href="//www.baidu.com/more/" rel="external nofollow" name="tj_briicon">更多產品 </a>
    </div>
   </div>
  </div>
 </div>
</body>
</html>

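Combining the requests library with BeautifulSoup's HTML parser, the page can be parsed as follows:

import requests
from bs4 import BeautifulSoup

# load the page source with the requests library
r = requests.get('https://www.baidu.com')
r.raise_for_status()  # raise an exception if the HTTP status code signals an error
r.encoding = r.apparent_encoding  # use the encoding detected from the content
demo = r.text

# parse the source with BeautifulSoup
soup = BeautifulSoup(demo, 'html.parser')  # use the HTML parser

print(soup.prettify())   # print the page in indented, readable form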
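The request above needs network access, and the live page changes over time. As a reproducible alternative, the sketch below parses an abridged inline copy of the sample page; the variable name sample_html is introduced only for this example.

```python
from bs4 import BeautifulSoup

# Abridged copy of the sample page, kept inline so no network is needed
sample_html = """<!DOCTYPE html>
<html>
<head>
 <meta content="text/html;charset=utf-8" http-equiv="content-type" />
 <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
 <div>
  <a href="http://news.baidu.com" name="tj_trnews">新聞 </a>
  <a href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
  <a href="http://map.baidu.com" name="tj_trmap">地圖 </a>
 </div>
</body>
</html>"""

soup = BeautifulSoup(sample_html, 'html.parser')
title_text = soup.title.string.strip()              # text of the <title> tag
links = [a['href'] for a in soup.find_all('a')]     # href of every <a> tag
print(title_text)
print(links)
```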


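4. Basic elements of the BeautifulSoup classes

BeautifulSoup4 converts a complex HTML document into a tree structure in which every node is a Python object. The library defines a small set of object types for the parts of the HTML tag tree; the main ones are listed below, using a <p> tag as the example.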

<p > ... </p>

Tag
NavigableString
Comment
BeautifulSoup

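Basic element	Description
Tag	A tag, the most basic unit of information, opened with <> and closed with </>. Access: soup.a or soup.p (returns the first a or p tag)
Name	The name of the tag; the name of <p>…</p> is 'p'. Access: <tag>.name
Attributes	The attributes of the tag, organized as a dictionary. Access: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Access: <tag>.string
Comment	The comment part of a string inside a tag; a special kind of NavigableString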
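To make the table concrete, here is a small sketch that prints the type of each kind of object. The markup is a made-up one-line snippet, not part of the sample page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet: one tag with an attribute, a child tag and a comment
markup = '<p class="title"><b>Hello</b><!-- a note --></p>'
soup = BeautifulSoup(markup, 'html.parser')

p = soup.p                 # Tag
text = p.b.string          # NavigableString inside <b>
note = p.contents[1]       # Comment, the second child of <p>

print(type(soup).__name__)                # the whole document object
print(type(p).__name__, p.name, p.attrs)  # tag, its name and attribute dict
print(type(text).__name__, repr(text))
print(type(note).__name__, repr(note))
```

Note that class is a multi-valued attribute in HTML, so bs4 stores its value as a list.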

4.1 Tag

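A tag is the most basic unit of information in HTML. It is used as follows:

import requests
from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs, so fetch the page source first
html = requests.get('https://www.baidu.com').content
bs = BeautifulSoup(html, "html.parser")

print(bs.title)    # the whole title tag
print(bs.head)     # the whole head tag
print(bs.a)        # the first a tag
print(type(bs.a))  # <class 'bs4.element.Tag'>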

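The most important properties of a Tag are name and attrs. They are used as follows:

print(bs.name)            # the BeautifulSoup object itself has the special name '[document]'
print(bs.head.name)       # for any ordinary tag, the value is the name of the tag itself
print(bs.a.attrs)         # all attributes of the a tag, as a dictionary
print(bs.a.get('class'))  # like bs.a['class'], but returns None instead of raising KeyError when the attribute is missing
bs.a['class'] = "newClass"  # attributes can be modified
print(bs.a)
del bs.a['class']         # and deleted again
print(bs.a)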
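The difference between the two lookup styles, and the effect of setting and deleting an attribute, can be checked on a one-tag document. This is a sketch on made-up markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://news.baidu.com" name="tj_trnews">news</a>',
                     'html.parser')
a = soup.a

missing = a.get('class')      # None, because the attribute is absent
try:
    a['class']                # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

a['class'] = 'newClass'       # add (or overwrite) an attribute
after_set = str(a)
del a['class']                # remove it again
after_del = str(a)

print(missing, raised)
print(after_set)
print(after_del)
```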

4.2 NavigableString

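The .string attribute of a tag returns the text inside it as a NavigableString:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.baidu.com').content  # page source, since BeautifulSoup parses markup rather than URLs
bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
print(type(bs.title.string))  # <class 'bs4.element.NavigableString'>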
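One detail worth knowing: .string only works when a tag has a single child. The sketch below, on a made-up snippet, contrasts it with get_text(), which joins all nested text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

one = soup.b.string         # <b> has exactly one child, so this is its text
many = soup.p.string        # <p> has two children, so .string is None
joined = soup.p.get_text()  # concatenates every string under <p>

print(repr(one), repr(many), repr(joined))
```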

4.3 Comment

A Comment object is a special type of NavigableString. Its value is the content of the comment, without the comment delimiters.

from bs4 import BeautifulSoup
# inline markup whose a tag contains only a comment
html = '<a href="http://news.baidu.com" rel="external nofollow" name="tj_trnews"><!--新聞--></a>'
bs = BeautifulSoup(html, "html.parser")
print(bs.a)               # the a tag, comment delimiters included
print(bs.a.string)        # 新聞
print(type(bs.a.string))  # <class 'bs4.element.Comment'>

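5. Traversing HTML content with the bs4 library

An HTML document has a fixed basic form: tags nest inside one another, so the page forms a tree (for example, <html> contains <head> and <body>, which in turn contain further tags).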


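This structure gives three basic ways to traverse:

Downward traversal
Upward traversal
Sibling (parallel) traversal

Each of the three starts from the current node and walks the nodes below it, above it, or at the same level.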

5.1 Downward traversal

Downward traversal uses three attributes:

contents
children
descendants

Attribute	Description
.contents	List of child nodes; all direct children are stored in a list
.children	Iterator over child nodes, for looping over the direct children
.descendants	Iterator over descendant nodes; covers all descendants, for looping

Example

soup = BeautifulSoup(demo, 'html.parser')

# loop over the direct children
for child in soup.body.children:
    print(child)

# loop over all descendants
for child in soup.body.descendants:
    print(child)

# the children as a list
print(soup.head.contents)
print(soup.head.contents[1])  # index into the list to get a single element
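The difference between the three attributes is easiest to see on a tiny tree. The following sketch uses made-up markup; note that text nodes count as nodes too, so an isinstance check is used to keep only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<body><div><a href="#1">one</a><a href="#2">two</a></div></body>'
soup = BeautifulSoup(markup, 'html.parser')

direct = soup.body.contents                     # only the <div>
child_names = [c.name for c in soup.div.children if isinstance(c, Tag)]
all_nodes = list(soup.body.descendants)         # div, a, 'one', a, 'two'

print(len(direct), child_names, len(all_nodes))
```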

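5.2 Upward traversal

Upward traversal uses two attributes:

parent
parents

Attribute	Description
.parent	The parent tag of the node
.parents	Iterator over the ancestors of the node, for looping over them; returns a generator

Example

soup = BeautifulSoup(demo, 'html.parser')

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)  # the last ancestor is the BeautifulSoup object, named '[document]'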
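As a quick check of what .parents yields, the sketch below collects the ancestor names of an a tag in a made-up document.

```python
from bs4 import BeautifulSoup

markup = '<html><body><div><a href="#">x</a></div></body></html>'
soup = BeautifulSoup(markup, 'html.parser')

# Walk from the nearest ancestor up to the document object itself
names = [parent.name for parent in soup.a.parents]
print(names)

direct_parent = soup.a.parent.name
print(direct_parent)
```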

5.3 Sibling traversal

Sibling traversal uses four attributes:

next_sibling
previous_sibling
next_siblings
previous_siblings

Attribute	Description
.next_sibling	The next sibling node in HTML text order
.previous_sibling	The previous sibling node in HTML text order
.next_siblings	Iterator over all following sibling nodes in HTML text order
.previous_siblings	Iterator over all preceding sibling nodes in HTML text order


An example of sibling traversal:

for sibling in soup.a.next_siblings:
    print(sibling)  # loop over the following siblings

for sibling in soup.a.previous_siblings:
    print(sibling)  # loop over the preceding siblings
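Siblings are nodes, not only tags, so whitespace between tags shows up as a string sibling. The sketch below, on made-up markup with hypothetical id values, filters the iterators down to tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<div><a id="1">one</a> <a id="2">two</a> <a id="3">three</a></div>'
soup = BeautifulSoup(markup, 'html.parser')

first = soup.a
last = soup.div.contents[-1]

nearest = first.next_sibling   # the whitespace string between the tags
later_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]
earlier_ids = [s['id'] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(nearest), later_ids, earlier_ids)
```

previous_siblings walks backwards, so the nearest sibling comes first.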

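5.4 Other traversal helpers

Attribute	Description
.strings	If a Tag contains several strings, i.e. its descendants hold text, this yields all of them for iteration
.stripped_strings	Used like .strings, but surplus whitespace is removed
.has_attr()	Method that tests whether the Tag has a given attribute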
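A short sketch of these three helpers on made-up markup follows; the attribute values are placeholders.

```python
from bs4 import BeautifulSoup

markup = '<div>\n <a href="x" name="n1"> news </a>\n <a href="y"> map </a>\n</div>'
soup = BeautifulSoup(markup, 'html.parser')

raw = list(soup.div.strings)             # includes whitespace-only strings
clean = list(soup.div.stripped_strings)  # whitespace trimmed, empty ones dropped

links = soup.div.find_all('a')
with_name = [a.has_attr('name') for a in links]

print(len(raw), clean, with_name)
```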

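6. Searching the document tree

The method soup.find_all(name, attrs, recursive, string, **kwargs) returns a list holding the search results.

name: search string for tag names
attrs: search string for tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: search string for the text between tags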
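The four parameters can be tried out on a small tree. This is a sketch on made-up markup, intended only to show what each parameter changes.

```python
from bs4 import BeautifulSoup

markup = '<div id="top"><p class="x">a</p><section><p>b</p></section></div>'
soup = BeautifulSoup(markup, 'html.parser')

all_p = soup.div.find_all('p')                       # name: every <p> below <div>
direct_p = soup.div.find_all('p', recursive=False)   # recursive: direct children only
by_attr = soup.find_all('p', attrs={'class': 'x'})   # attrs: filter on attribute value
by_text = soup.find_all(string='b')                  # string: match text nodes

print(len(all_p), len(direct_p), len(by_attr), by_text)
```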

6.1 The name parameter

If a string is given, tags whose name matches it exactly are found:

a_list = bs.find_all("a")
print(a_list)  # all a tags in the document

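If a regular expression is given, Beautiful Soup filters tag names with the pattern's search() method:

import re
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.baidu.com').content  # page source, since BeautifulSoup parses markup rather than URLs
bs = BeautifulSoup(html, "html.parser")
t_list = bs.find_all(re.compile("a"))  # every tag whose name contains "a"
for item in t_list:
    print(item)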
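Because search() matches anywhere in the tag name, the pattern "a" finds more than a tags. The sketch below shows this on made-up markup and how anchoring the pattern narrows the result.

```python
import re
from bs4 import BeautifulSoup

markup = ('<html><head><meta charset="utf-8"><title>t</title></head>'
          '<body><a href="#">x</a><table></table></body></html>')
soup = BeautifulSoup(markup, 'html.parser')

# Unanchored: any tag name containing the letter "a"
loose = [t.name for t in soup.find_all(re.compile('a'))]
# Anchored: only tags named exactly "a"
exact = [t.name for t in soup.find_all(re.compile('^a$'))]

print(loose)
print(exact)
```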

If a list is given, Beautiful Soup returns every node that matches any element of the list:

t_list = bs.find_all(["meta", "link"])
for item in t_list:
    print(item)

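If a function is given, each tag is passed to it and kept when the function returns True:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.baidu.com').content  # page source, since BeautifulSoup parses markup rather than URLs
bs = BeautifulSoup(html, "html.parser")

def name_is_exists(tag):
    # keep only tags that carry a name attribute
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)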

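6.2 The attrs parameter

Not every attribute can be searched as a keyword argument. HTML data-* attributes, for instance, are not valid Python argument names, so they are passed through the attrs dictionary instead:

# t_list = bs.find_all(data-foo="value")  # SyntaxError: data-foo is not a valid keyword name
t_list = bs.find_all(attrs={"data-foo": "value"})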
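A minimal sketch of attribute searches on made-up markup follows. The data-foo values are placeholders; class_ is the keyword bs4 provides because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

markup = ('<div data-foo="value">first</div>'
          '<div data-foo="other">second</div>'
          '<div class="box">third</div>')
soup = BeautifulSoup(markup, 'html.parser')

by_data = soup.find_all(attrs={'data-foo': 'value'})  # exact attribute value
any_data = soup.find_all(attrs={'data-foo': True})    # attribute present, any value
by_class = soup.find_all(class_='box')                # keyword form for class

print([d.string for d in by_data])
print(len(any_data), [d.string for d in by_class])
```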

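6.3 The string parameter

The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, or a list.

import re
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.baidu.com').content  # page source, since BeautifulSoup parses markup rather than URLs
bs = BeautifulSoup(html, "html.parser")
# search by attribute dictionary (see 6.2)
t_list = bs.find_all(attrs={"data-foo": "value"})
for item in t_list:
    print(item)
# string= replaces the older text= spelling
t_list = bs.find_all(string="hao123")  # exact match on the whole string
for item in t_list:
    print(item)
t_list = bs.find_all(string=["hao123", "地圖", "貼吧"])  # match any string in the list
for item in t_list:
    print(item)
t_list = bs.find_all(string=re.compile(r"\d"))  # strings that contain a digit
for item in t_list:
    print(item)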
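A plain string has to equal the whole text node, including any trailing space, which is easy to miss with text such as "hao123 " in the sample page. The sketch below, on made-up markup, contrasts an exact match with a regular expression and shows how to get tags back instead of strings.

```python
import re
from bs4 import BeautifulSoup

markup = '<a href="#1">hao123 </a><a href="#2">地圖 </a><a href="#3">page 2</a>'
soup = BeautifulSoup(markup, 'html.parser')

exact = soup.find_all(string='hao123')                # no hit: the text is 'hao123 '
partial = soup.find_all(string=re.compile('hao123'))  # regex matches inside the text
digits = soup.find_all(string=re.compile(r'\d'))      # every string containing a digit

# Combining a tag name with string= returns the tags, not the strings
tags = soup.find_all('a', string=re.compile(r'\d'))
hrefs = [t['href'] for t in tags]

print(exact, partial, digits, hrefs)
```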

Regular expressions (import re) are often used together with find_all(), as shown below:

import re

soup.find_all(string=re.compile('python'))  # search for text containing "python"

# or compile the pattern first and pass it in
pattern = re.compile('python')
soup.find_all(string=pattern)

6.4 Commonly used find() methods
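The original listing for this section was an image that did not survive, so which methods it named is not known. As a hedged sketch, the following shows find() and a few of its relatives that bs4 provides, on made-up markup; each returns a single result or None rather than a list.

```python
from bs4 import BeautifulSoup

markup = '<div id="nav"><a name="a1">one</a><a name="a2">two</a></div>'
soup = BeautifulSoup(markup, 'html.parser')

first = soup.find('a')                          # first match only
second = soup.find('a', attrs={'name': 'a2'})   # same filters as find_all
missing = soup.find('span')                     # None when nothing matches

parent = first.find_parent('div')               # nearest matching ancestor
nxt = first.find_next_sibling('a')              # next matching sibling
prev = second.find_previous_sibling('a')        # previous matching sibling

print(first['name'], second.string, missing)
print(parent['id'], nxt['name'], prev['name'])
```

Since find() can return None, it is worth checking the result before accessing attributes on it.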


That covers the basics of using the BeautifulSoup library in Python.
