Scraping Xici proxy IPs in Python with requests, XPath, and multithreading: an example

Without further ado, let's go straight to the code.

    import requests, random
    from lxml import etree
    import threading

    # Pool of User-Agent strings; one is picked at random for each request.
    agents = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    def get_all_xici_urls(start_num, stop_num):
        """Build the list of page URLs from start_num to stop_num."""
        xici_urls = []
        # +1 so that the end page itself is included
        for num in range(start_num, stop_num + 1):
            xici_http_url = 'http://www.xicidaili.com/wt/'
            xici_http_url += str(num)
            xici_urls.append(xici_http_url)
        print("Finished building the list of xici URLs to crawl...")
        return xici_urls

    def get_all_http_ip(xici_http_url, headers, proxies_list):
        """Scrape one page and append its proxies to the shared proxies_list."""
        try:
            # In each table row, the 2nd cell holds the IP and the 3rd the port
            all_ip_xpath = '//table//tr/child::*[2]/text()'
            all_port_xpath = '//table//tr/child::*[3]/text()'
            response = requests.get(url=xici_http_url, headers=headers)
            html_tree = etree.HTML(response.text)
            ip_list = html_tree.xpath(all_ip_xpath)
            port_list = html_tree.xpath(all_port_xpath)
            # print(ip_list)
            # print(port_list)
            new_proxies_list = []
            # Start at 1 to skip the table header row
            for index in range(1, len(ip_list)):
                # print("http://{}:{}".format(ip_list[index], port_list[index]))
                proxies_dict = {}
                proxies_dict['http'] = 'http://{}:{}'.format(str(ip_list[index]), str(port_list[index]))
                new_proxies_list.append(proxies_dict)
            proxies_list += new_proxies_list
            return proxies_list
        except Exception as e:
            print("An error occurred: url is", xici_http_url, "error is", e)

    if __name__ == '__main__':
        start_num = int(input("Enter the start page: ").strip())
        stop_num = int(input("Enter the end page: ").strip())
        print("Start crawling...")
        t_list = []
        # Holds the scraped Xici proxy IPs
        proxies_list = []
        # Use one thread per page
        xici_urls = get_all_xici_urls(start_num, stop_num)
        for xici_get_url in xici_urls:
            # Pick a random User-Agent
            headers = {'User-Agent': random.choice(agents)}
            t = threading.Thread(target=get_all_http_ip, args=(xici_get_url, headers, proxies_list))
            t.start()
            t_list.append(t)
        for j in t_list:
            j.join()
        print("All requested proxy IPs have been crawled...")
        print(proxies_list)
        print(len(proxies_list))
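The thread-per-page pattern used above can be tried without touching the network. The following is only a minimal sketch under stated assumptions: `fake_fetch`, `worker`, and `crawl_pages` are hypothetical names, the `10.0.x.x` addresses are made up, and the explicit `threading.Lock` is an addition of this sketch rather than something the code above uses.

```python
import threading

def fake_fetch(page):
    """Stand-in for the real request: returns two made-up proxies per page."""
    return [{"http": "http://10.0.{}.{}:8080".format(page, host)} for host in (1, 2)]

def worker(page, results, lock):
    found = fake_fetch(page)
    # Guard the shared list explicitly rather than relying on interpreter details.
    with lock:
        results.extend(found)

def crawl_pages(pages):
    """Start one thread per page, wait for all of them, return the merged list."""
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(page, results, lock)) for page in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    proxies = crawl_pages(range(1, 4))
    print(len(proxies))  # prints 6: three pages, two made-up proxies each
```

Because every thread is joined before `crawl_pages` returns, the caller always sees the complete list, although the order of entries depends on thread scheduling.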

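There are plenty of posts online about scraping xici, but few of them explain proxy validation clearly, so I will walk through it carefully here.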

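Here I wrote a proxy class, Proxy, with four methods (this is just my personal style, so don't read too much into it): get_user_agent (returns a random User-Agent, the most important field in the request headers), get_proxy (scrapes proxy IPs), test_proxy (checks whether a proxy is usable), and store_txt (saves the usable proxies to a txt file).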
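To make the shape of such a class concrete, here is a rough sketch of the two helpers that need no network access. It is an assumption about how they might look, not the author's actual code: the method bodies, the `USER_AGENTS` attribute, the `path` parameter, and the one-entry-per-line file layout are all hypothetical.

```python
import os
import random
import tempfile

class Proxy:
    # Hypothetical skeleton: get_proxy and test_proxy are left out here.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    def get_user_agent(self):
        """Return request headers carrying one randomly chosen User-Agent."""
        return {"User-Agent": random.choice(self.USER_AGENTS)}

    def store_txt(self, proxies, path):
        """Write one 'ip:port' entry per line to the file at path."""
        with open(path, "w", encoding="utf-8") as f:
            for proxy in proxies:
                f.write(proxy + "\n")

if __name__ == "__main__":
    p = Proxy()
    print(p.get_user_agent())
    out_path = os.path.join(tempfile.gettempdir(), "proxies.txt")
    # The addresses below are made up for illustration.
    p.store_txt(["10.0.0.1:8080", "10.0.0.2:3128"], out_path)
```

Returning a ready-made headers dict from `get_user_agent` means the result can be passed straight to `requests.get(..., headers=...)`.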

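1. Scraping: headers is the request headers; choice selects whether to scrape HTTP or HTTPS proxies; first and end are the start and end page numbers (end is exclusive, so the last page scraped is end - 1).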

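    # Assumes `import re` and `import requests` at the top of the module.
    def get_proxy(self, headers, choice='http', first=1, end=2):
        """
        Fetch proxies.
        :param headers: request headers
        :param choice: 'http' or 'https', the kind of proxy to scrape
        :param first: first page to scrape
        :param end: the page after the last page to scrape
        :return: list of 'ip:port' strings
        """

        ip_list = []
        base_url = None

        # Choose the listing to scrape: one is for http, the other for https
        if choice == 'http':
            base_url = 'http://www.xicidaili.com/wt/'
        elif choice == 'https':
            base_url = 'http://www.xicidaili.com/wn/'

        # Loop over the page numbers, match with a regex,
        # and join each IP and port with ':'
        for n in range(first, end):
            actual_url = base_url + str(n)
            html = requests.get(url=actual_url, headers=headers).text
            pattern = r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)'
            re_list = re.findall(pattern, html)

            for ip_port in re_list:
                ip_port = ip_port[0] + ':' + ip_port[1]
                ip_list.append(ip_port)
        return ip_list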
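The regex step and the half-open page range in get_proxy can be checked offline. The sketch below runs the same kind of pattern against a hand-written HTML fragment; `SAMPLE_HTML`, `extract_ip_ports`, and `page_urls` are hypothetical names, and the fragment only assumes that the IP and port sit in adjacent `<td>` cells, which may not match the live page exactly.

```python
import re

# Made-up fragment: IP and port in adjacent <td> cells, as the pattern expects.
SAMPLE_HTML = """
<tr class="odd">
  <td class="country"></td>
  <td>10.0.0.1</td>
  <td>8080</td>
  <td>HTTP</td>
</tr>
<tr>
  <td class="country"></td>
  <td>10.0.0.2</td>
  <td>3128</td>
  <td>HTTP</td>
</tr>
"""

# Group 1 captures the dotted IP, group 2 the port in the next cell.
PATTERN = re.compile(r"(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)")

def extract_ip_ports(html):
    """Return 'ip:port' strings for every IP cell followed by a port cell."""
    return [ip + ":" + port for ip, port in PATTERN.findall(html)]

def page_urls(base_url, first, end):
    """Page URLs for first..end-1; end itself is excluded, as in range()."""
    return [base_url + str(n) for n in range(first, end)]

if __name__ == "__main__":
    print(extract_ip_ports(SAMPLE_HTML))  # ['10.0.0.1:8080', '10.0.0.2:3128']
    print(page_urls("http://www.xicidaili.com/wt/", 1, 3))
```

With first=1 and end=3, `page_urls` yields pages 1 and 2 only, which is why the default first=1, end=2 scrapes a single page.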