Python Web Scraping: Basics of the lxml Data-Parsing Module (with an Introduction to XPath and Parsers)

Introduction:

I have recently been learning web scraping with Python, and these are my study notes on lxml, a data-parsing module.

Introduction to lxml, XPath, and parsers:

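lxml is a Python parsing library that handles both HTML and XML and supports XPath queries. It is very fast because it is built on the C libraries libxml2 and libxslt. XPath, short for XML Path Language, is a language for locating information in an XML document. It was originally designed for searching XML documents, but it works just as well on HTML documents.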
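As a minimal sketch of the two entry points mentioned above, the snippet below parses one HTML string and one XML string and runs an XPath query on each. The sample strings `html_text` and `xml_text` are made up for illustration and do not come from the notes.

```python
from lxml import etree

# Hypothetical sample inputs, invented for this sketch.
html_text = "<html><body><p>Hello <a href='/about'>about</a></p></body></html>"
xml_text = "<notes><note id='1'>first</note><note id='2'>second</note></notes>"

html_root = etree.HTML(html_text)  # lenient HTML parser, returns the <html> element
xml_root = etree.XML(xml_text)     # strict XML parser, returns the root element

# The same xpath() method works on both trees.
hrefs = html_root.xpath("//a/@href")
note_texts = xml_root.xpath("/notes/note/text()")

print(hrefs)
print(note_texts)
```

`etree.HTML` tolerates sloppy markup, while `etree.XML` raises an error on input that is not well-formed, so the choice depends on how clean the input is expected to be.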

Node relationships in XML/HTML documents:
Parent
Children
Siblings
Ancestors
Descendants
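The relationships listed above can be explored directly through lxml's element API. The following is a small sketch on an invented `<library>` document (the tag names are assumptions chosen for illustration), showing one call for each relationship.

```python
from lxml import etree

# Hypothetical document, invented for this sketch.
xml_text = "<library><shelf><book>A</book><book>B</book><book>C</book></shelf></library>"
root = etree.XML(xml_text)
shelf = root[0]
first_book, second_book, third_book = shelf  # iterating an element yields its children

parent_tag = second_book.getparent().tag                      # parent
child_texts = [child.text for child in shelf]                 # children
previous_sibling = second_book.getprevious().text             # sibling before
next_sibling = second_book.getnext().text                     # sibling after
ancestor_tags = [a.tag for a in second_book.iterancestors()]  # nearest ancestor first
descendant_tags = [d.tag for d in root.iterdescendants()]     # document order

print(parent_tag, child_texts, previous_sibling, next_sibling)
print(ancestor_tags, descendant_tags)
```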
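XPath syntax:
nodename    Selects the child nodes of the current node that are named nodename
//          Selects matching nodes anywhere in the document, regardless of their position
/           Selects from the root node
.           Selects the current node
..          Selects the parent of the current node
@           Selects attributes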
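To make the syntax table concrete, here is a short sketch that exercises each expression once. The `<shop>` document and its `sku` attribute are hypothetical and exist only for this example.

```python
from lxml import etree

# Hypothetical document, invented for this sketch.
xml_text = """<shop>
  <item sku="a1"><title>Pen</title></item>
  <item sku="b2"><title>Ink</title></item>
</shop>"""
root = etree.XML(xml_text)

children_named_item = root.xpath("item")           # nodename: children of the context node
from_root = root.xpath("/shop/item/title/text()")  # / : absolute path from the root
anywhere = root.xpath("//title/text()")            # //: any depth in the document

first_item = children_named_item[0]
current = first_item.xpath(".")                    # . : the context node itself
parent = first_item.xpath("..")                    # ..: the parent of the context node
skus = root.xpath("//item/@sku")                   # @ : attribute values

print(len(children_named_item), from_root, anywhere)
print(current[0].tag, parent[0].tag, skus)
```

Note that `xpath()` always returns a list for node and attribute selections, even when only one node matches.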
Parser comparison:
Parser          Speed     Difficulty
re              Fastest   Hard
BeautifulSoup   Slow      Very easy
lxml            Fast      Easy
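To give a feel for the difference in difficulty, the sketch below extracts the same links once with `re` and once with lxml. The HTML string and the regular expression are assumptions written for this example; BeautifulSoup is left out only to keep the dependencies small.

```python
import re
from lxml import etree

# Hypothetical input, invented for this sketch.
html_text = '<div><a href="/one">One</a> <a class="x" href="/two">Two</a></div>'

# re: fast, but the pattern has to anticipate attribute order and quoting.
links_re = re.findall(r'<a\b[^>]*\bhref="([^"]*)"', html_text)

# lxml: the XPath expression describes the structure, not the raw text.
links_lxml = etree.HTML(html_text).xpath("//a/@href")

print(links_re)
print(links_lxml)
```

The regular expression above would already fail on single-quoted attributes, which is the kind of fragility behind the "Hard" rating for `re` in the table.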
Study notes:
# -*- coding: utf-8 -*-
from lxml import etree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""
selector = etree.HTML(html_doc)   # Parse the HTML string into an element tree
links = selector.xpath('//p[@class="story"]/a/@href')   # Extract the href of every link inside <p class="story">
for link in links:
    print(link)
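The `html_doc` string above never closes its `<body>` and `<html>` tags, yet the query still works, because `etree.HTML` repairs incomplete markup while parsing. A minimal sketch of that behaviour follows, using a shortened fragment (`broken_html`, made up for this example) and `etree.tostring` to view the repaired tree.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <p>.
broken_html = "<p class='story'>Once upon a time <a href='http://example.com/elsie'>Elsie</a>"

root = etree.HTML(broken_html)
repaired = etree.tostring(root, encoding="unicode")  # serialize the repaired tree
links = root.xpath('//p[@class="story"]/a/@href')

print(repaired)
print(links)
```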
xml_test = """
<?xml version='1.0'?>
<?xml-stylesheet type="text/css" href="first.css"?>
<notebook>
    <user id="1" category='cb' class="dba python linux">
        <name>lizibin</name>
        <sex>m</sex>
        <address>sjz</address>
        <age>28</age>
        <concat>
            <email>konigerwin@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="2" category='za'>
        <name>wsq</name>
        <sex>f</sex>
        <address>shanghai</address>
        <age>25</age>
        <concat>
            <email>konigerwiner@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="3" category='za'>
        <name>liqian</name>
        <sex>f</sex>
        <address>SH</address>
        <age>28</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="4" category='cb'>
        <name>qiangli</name>
        <sex>f</sex>
        <address>SH</address>
        <age>29</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
    <user id="5" class="dba linux c java python test teacher">
        <name>buzhidao</name>
        <sex>f</sex>
        <address>SH</address>
        <age>999</age>
        <concat>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </concat>
    </user>
</notebook>
"""
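# The XML could also be fetched from a remote server, for example:
#r = requests.get('http://xxx.com/abc.xml')
#etree.HTML(r.text.encode('utf-8'))
xml_code = etree.HTML(xml_test)     # Build an element tree (parsed here with the lenient HTML parser)
# Select every name node (prints Element objects, not their text)
print(xml_code.xpath('//name'))
# Select the text of every name node
print(xml_code.xpath('//name/text()'))
print('')
# Select the notebook node to use as the starting point for the relative queries below
notebook = xml_code.xpath('//notebook')

# Get the text of the first name node
print(notebook[0].xpath('.//name/text()')[0])
addres = notebook[0].xpath('.//name')[0]
# Get the text of the address node that is a sibling of that first name node
print(addres.xpath('../address/text()'))
# Select an attribute value (no address in the sample data has a lang attribute, so this prints an empty list)
print(addres.xpath('../address/@lang'))
# Select the name of the first user under notebook
print(xml_code.xpath('//notebook/user[1]/name/text()'))
# Select the name of the last user under notebook
print(xml_code.xpath('//notebook/user[last()]/name/text()'))
# Select the name of the second-to-last user under notebook
print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))
# Select the address of the first two users under notebook
print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))
# Select the name of every user whose category attribute is "cb"
print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))
# Select the name of every user younger than 30
print(xml_code.xpath('//notebook/user[age<30]/name/text()'))
# Select the class attribute of every user whose class attribute contains "dba"
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))
# Select the name of each of those users
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))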
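One caveat about the last two queries: `contains(@class, "dba")` is a plain substring test, so it would also match a class value such as `mongodba`. The sketch below contrasts it with a common XPath 1.0 idiom for matching a whole space-separated token. The three-user document here is hypothetical and is not the `xml_test` data above.

```python
from lxml import etree

# Hypothetical data, invented to show the difference between the two tests.
xml_text = """<notebook>
  <user class="dba python"><name>a</name></user>
  <user class="mongodba"><name>b</name></user>
  <user class="linux dba"><name>c</name></user>
</notebook>"""
root = etree.XML(xml_text)

# Substring test: also matches "mongodba".
substring_match = root.xpath('//user[contains(@class, "dba")]/name/text()')

# Token test: pad the normalized attribute with spaces and look for " dba ".
token_match = root.xpath(
    '//user[contains(concat(" ", normalize-space(@class), " "), " dba ")]/name/text()'
)

print(substring_match)
print(token_match)
```

For the sample data in these notes both forms give the same result, so the simpler `contains()` is fine there; the token form only matters when class names can overlap.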

Article source: http://muchs.cn/article18/gddedp.html
