Steps in the Python Web Crawler Workflow

Introduction

This article walks through the workflow of a Python web crawler step by step, from preparing the libraries to writing each component.

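A Python web crawler is built in the following steps: first prepare the required libraries and write the crawler scheduler, then write the URL manager and the page downloader, next write the page parser, and finally write the page outputer.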

Python web crawler steps

(1) Prepare the required libraries

We need the open-source library BeautifulSoup to parse the downloaded pages. We are working in the PyCharm IDE, so the library can be installed directly from it.

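The steps are as follows:

Choose File → Settings

Open Project Interpreter under Project: PythonProject

Click the plus sign to add a new library

Type bs4, select bs4, and click Install Package to download it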
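Once the package is installed, it can be useful to confirm that the interpreter used by the project actually sees it. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of the original project.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("bs4"):
        print("bs4 is available")
    else:
        print("bs4 is missing; install it from PyCharm or with pip")
```

If the check reports the package as missing, the project is probably configured with a different interpreter than the one the package was installed into.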

(2) Write the crawler scheduler

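Here bike_spider is the project name. The four imported modules correspond to the four components of the crawler: the URL manager, the page downloader, the page parser, and the page outputer.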

    # Crawler scheduler
    from bike_spider import url_manager, html_downloader, html_parser, html_outputer


    # Crawler initialization
    class SpiderMain(object):
        def __init__(self):
            self.urls = url_manager.UrlManager()
            self.downloader = html_downloader.HtmlDownloader()
            self.parser = html_parser.HtmlParser()
            self.outputer = html_outputer.HtmlOutputer()

        def craw(self, my_root_url):
            count = 1
            self.urls.add_new_url(my_root_url)
            while self.urls.has_new_url():
                try:
                    new_url = self.urls.get_new_url()
                    print("craw %d : %s" % (count, new_url))
                    # Download the page
                    html_cont = self.downloader.download(new_url)
                    # Parse the page into new URLs and extracted data
                    new_urls, new_data = self.parser.parse(new_url, html_cont)
                    self.urls.add_new_urls(new_urls)
                    # The outputer collects the data
                    self.outputer.collect_data(new_data)
                    # Stop after ten pages have been crawled successfully
                    if count == 10:
                        break
                    count += 1
                except Exception:
                    print("craw failed")

            self.outputer.output_html()


    if __name__ == "__main__":
        root_url = "http://baike.baidu.com/item/Python/407313"
        obj_spider = SpiderMain()
        obj_spider.craw(root_url)

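The scheduling loop can be tried without network access by swapping in stand-in components. The sketch below is an illustration only: FakeDownloader, FakeParser, crawl, and the sample page table are hypothetical names that do not come from the original project, and the URL bookkeeping is inlined as two sets rather than delegated to a URL manager.

```python
class FakeDownloader:
    """Stand-in for the page downloader: serves pages from a dict, no network."""

    def __init__(self, pages):
        self.pages = pages

    def download(self, url):
        return self.pages[url]  # raises KeyError for an unknown URL


class FakeParser:
    """Stand-in for the page parser: each fake page is a (links, title) tuple."""

    def parse(self, page_url, html_cont):
        links, title = html_cont
        return set(links), {"url": page_url, "title": title}


def crawl(root_url, downloader, parser, limit=10):
    """Run the scheduler loop and return the collected data records."""
    new_urls, old_urls, collected = {root_url}, set(), []
    while new_urls and len(collected) < limit:
        url = new_urls.pop()
        old_urls.add(url)
        try:
            html_cont = downloader.download(url)
            found_urls, data = parser.parse(url, html_cont)
        except Exception:
            print("craw failed: %s" % url)
            continue
        # Queue only URLs that have not been handed out before
        new_urls |= found_urls - old_urls
        collected.append(data)
    return collected


if __name__ == "__main__":
    pages = {
        "a": (["b", "c"], "Page A"),  # "c" has no entry, so its download fails
        "b": (["a"], "Page B"),       # links back to "a", which is skipped
    }
    for record in crawl("a", FakeDownloader(pages), FakeParser()):
        print(record)
```

As in the scheduler, a failed page is reported and skipped rather than aborting the crawl, and the loop ends when either the queue is empty or the page limit is reached.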
(3) Write the URL manager

We store the URLs that have already been crawled separately from those that have not, so that no page is crawled twice.

    # URL manager
    class UrlManager(object):
        def __init__(self):
            self.new_urls = set()  # URLs waiting to be crawled
            self.old_urls = set()  # URLs already crawled

        def add_new_url(self, url):
            if url is None:
                return
            if url not in self.new_urls and url not in self.old_urls:
                # Assumed completion: the original listing is cut off at this point
                self.new_urls.add(url)

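The scheduler in step (2) calls four methods on the URL manager: add_new_url, add_new_urls, has_new_url, and get_new_url. A minimal sketch of a manager that provides all four is given below. It assumes the two-set design described above; the bodies of the last three methods are not taken from the original listing and are only one plausible way to write them.

```python
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # discovered but not yet crawled
        self.old_urls = set()  # already handed out for crawling

    def add_new_url(self, url):
        """Queue a single URL unless it is empty or already known."""
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue every URL in an iterable, skipping known ones."""
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        """Report whether any URL is still waiting to be crawled."""
        return len(self.new_urls) != 0

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
```

Using sets keeps the membership tests cheap, and moving a URL into old_urls at the moment it is handed out means a page that links back to itself is not queued again.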