A summary of common beautifulsoup4 parsing methods for Python crawlers

Summary

How to use beautifulsoup4 to parse web pages in a variety of situations

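Using beautifulsoup4

The official beautifulsoup4 documentation already covers everything in detail. Here I simply collect the parsing methods I use most often, for quick reference.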

Loading an HTML document

The first step in using beautifulsoup is to load the HTML document into BeautifulSoup so that it becomes a BeautifulSoup object.

    import requests
    from bs4 import BeautifulSoup

    url = "http://new.qq.com/omn/20180705/20180705A0920X.html"
    r = requests.get(url)
    html = r.text
    # print(html)
    soup = BeautifulSoup(html, "html.parser")
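BeautifulSoup also accepts an open file handle in place of a string, which is convenient for trying out parsing logic without touching the network. A minimal sketch, assuming a throwaway temporary file and made-up page content (neither comes from the page fetched above):

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content, used only for illustration.
sample = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Save the markup to a temporary file that stands in for a downloaded page.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(sample)
    path = f.name

# Pass the open file handle straight to BeautifulSoup.
with open(path, encoding="utf-8") as fh:
    soup_from_file = BeautifulSoup(fh, "html.parser")

os.remove(path)

# Parsing the same markup as a string builds an equivalent tree.
soup_from_string = BeautifulSoup(sample, "html.parser")

page_title = soup_from_file.title.string
print(page_title)
```

Either way the second argument is still the parser name, so the choice of parser discussed next applies to both forms.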

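When you initialize the BeautifulSoup class you pass two arguments. The first is the HTML source we crawled; the second is the HTML parser. Three parsers are commonly used: "html.parser", "lxml", and "html5lib". The official documentation recommends lxml because it is fast, but it has to be installed first with pip install lxml.

The three parsers do not always produce the same object. For example, with an incomplete tag (only the closing half of a p tag):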

    soup = BeautifulSoup("<a></p>", "html.parser")
    # A lone start tag is closed automatically; a lone end tag is ignored
    # Result: <a></a>
    soup = BeautifulSoup("<a></p>", "lxml")
    # Result: <html><body><a></a></body></html>
    soup = BeautifulSoup("<a></p>", "html5lib")
    # html5lib completes every tag it sees, even a half-written one
    # Result: <html><head></head><body><a><p></p></a></body></html>
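To see these differences on your own machine, the comparison can be wrapped in a small helper. This is only a sketch: `parse_with_each` is a hypothetical name, and it assumes that a parser which is not installed should be reported as missing, not crash the script.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree}; None marks a parser that is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[name] = None
    return results


results = parse_with_each("<a></p>")
for name, output in results.items():
    print(name, "->", output if output is not None else "not installed")
```

Only "html.parser" ships with Python, so it is the one entry that is always available.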

Usage

I list the methods roughly in order of how often I use them, since the point is quick lookup.

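Get a single tag by tag name, id, class, and so on

    import re

    html = '<p class="title" id="p1"><b>The Dormouse\'s story</b></p>'
    soup = BeautifulSoup(html, 'lxml')
    # Get the tag (with everything inside it) by class name
    soup.find(class_="title")
    # or
    soup.find("p", class_="title", id="p1")
    # Get the text of the p tag whose class is title: "The Dormouse's story"
    soup.find(class_="title").get_text()
    # When extracting text you can set the separator placed between the strings
    # of different tags, and choose whether to strip surrounding whitespace.
    soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouse\'s story</b></p><p class="title" id="p1"><b>The Dormouse\'s story</b></p>', "html5lib")
    # Call get_text() on the whole soup so that both p tags contribute
    soup.get_text("|", strip=True)
    # Result: The Dormouse's story|The Dormouse's story
    # Get the id of the p tag whose class is title
    soup.find(class_="title").get("id")
    # Match the class name with a regular expression:
    soup.find_all(class_=re.compile("tit"))
    # The recursive parameter: with recursive=False, only the direct children
    # of the current tag are searched
    soup = BeautifulSoup('<html><head><title>abc', 'lxml')
    soup.html.find_all("title", recursive=False)
    # Result: [] because title is a child of head, not of html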
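One thing to watch when chaining these calls: find() returns None when nothing matches, so calling .get_text() directly on the result raises AttributeError. A minimal sketch of a guard, where `text_of` is a hypothetical helper and the "subtitle" class is made up to force a miss:

```python
import re

from bs4 import BeautifulSoup

html = '<p class="title" id="p1"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")


def text_of(tag, default=""):
    """Return the stripped text of a tag, or a default when find() matched nothing."""
    return tag.get_text(strip=True) if tag is not None else default


title_text = text_of(soup.find("p", class_="title", id="p1"))
# No tag carries this class, so find() returns None and the default is used.
missing_text = text_of(soup.find("p", class_="subtitle"), default="(none)")
title_id = soup.find(class_="title").get("id")
# A compiled regex matches any class value that contains "tit".
regex_hits = soup.find_all(class_=re.compile("tit"))

print(title_text, missing_text, title_id, len(regex_hits))
```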

Get multiple tags by tag name, id, class, and so on

    soup = BeautifulSoup('<p class="title" id="p1"><b>The Dormouse\'s story</b></p><p class="title" id="p1"><b>The Dormouse\'s story</b></p>', "html5lib")
    # Get every tag whose class is title
    for i in soup.find_all(class_="title"):
        print(i.get_text())
    # Get a limited number of tags whose class is title
    for i in soup.find_all(class_="title", limit=2):
        print(i.get_text())
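Because find_all() returns a list-like result, it combines naturally with list comprehensions, and a miss gives an empty list, not None. A small sketch with made-up markup (the three paragraphs and the "footer" class are illustrative only):

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title" id="p1"><b>First story</b></p>'
    '<p class="title" id="p2"><b>Second story</b></p>'
    '<p class="title" id="p3"><b>Third story</b></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Collect one attribute from every match.
all_ids = [p.get("id") for p in soup.find_all("p", class_="title")]
# limit stops the search after the given number of matches.
first_two = [p.get_text() for p in soup.find_all(class_="title", limit=2)]
# Nothing matches here, so the result is simply empty.
no_match = soup.find_all(class_="footer")

print(all_ids, first_two, len(no_match))
```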

Get a tag by its other attributes

    html = '<a alog-action="qb-ask-uname" href="https://www.yisu.com/usercent" rel="external nofollow" target="_blank">蜗牛宋</a>'
    soup = BeautifulSoup(html, 'lxml')
    # Get "蜗牛宋". This tag has neither a class nor an id, so the lookup rule
    # has to be built from its other attributes
    author = soup.find('a', {"alog-action": "qb-ask-uname"}).get_text()
    # or
    author = soup.find(attrs={"alog-action": "qb-ask-uname"})
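The dictionary form is needed here because an attribute name such as alog-action contains a hyphen and so cannot be written as a Python keyword argument. A short sketch with made-up links (the names and href values are placeholders):

```python
from bs4 import BeautifulSoup

html = '<a alog-action="qb-ask-uname" href="/user/1">Alice</a><a href="/user/2">Bob</a>'
soup = BeautifulSoup(html, "html.parser")

# alog-action is not a valid Python identifier, so it goes through attrs.
author_tag = soup.find("a", attrs={"alog-action": "qb-ask-uname"})
author = author_tag.get_text()
link = author_tag.get("href")

# True matches any tag that has the attribute at all, whatever its value.
tagged = soup.find_all(attrs={"alog-action": True})

print(author, link, len(tagged))
```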

Find the tags before and after

    soup.find_all_previous("p")
    soup.find_previous("p")
    soup.find_all_next("p")
    soup.find_next("p")
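These four methods are most useful when called on a tag inside the document, since they search relative to that tag's position. A minimal sketch, assuming a made-up fragment with a span used as the starting point:

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><span>mark</span><p>three</p><p>four</p></div>"
soup = BeautifulSoup(html, "html.parser")
mark = soup.find("span")

# The nearest p that appears before the span in the document.
before_nearest = mark.find_previous("p").get_text()
# Every earlier p, ordered from nearest to farthest.
before_all = [p.get_text() for p in mark.find_all_previous("p")]
# The nearest p that appears after the span.
after_nearest = mark.find_next("p").get_text()
# Every later p, in document order.
after_all = [p.get_text() for p in mark.find_all_next("p")]

print(before_nearest, before_all, after_nearest, after_all)
```

Note that the "previous" variants walk backwards, so their results come out in reverse document order.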

Find the parent tag