First I used

page = urllib2.urlopen(url).read()

but the HTML it returns differs from the page source I see in the browser. Searching (Baidu) turned up two explanations: urllib fetches only the initial HTML, so content added later by asynchronous (AJAX) requests is never loaded, or the site is blocking scrapers. So I tried simulating a browser instead: routing the request through a proxy and sending browser-like request headers. But what I get back is still exactly the same as urllib2.urlopen(url).read().
Why is that, and how can I fix it?
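The first explanation can be checked without any network access: an HTTP client downloads the page text but never executes its scripts, so anything JavaScript injects into the DOM is simply absent from what read() returns. A minimal, self-contained Python 3 sketch (the page snippet is made up for illustration):

```python
from html.parser import HTMLParser

# A made-up "initial" page: the job list is empty in the HTML the
# server sends; a script fills it in after the browser loads the page.
initial_html = """
<div id="job-list"></div>
<script>
  document.getElementById('job-list').innerHTML = '<li>python dev</li>';
</script>
"""

class TagCollector(HTMLParser):
    """Record every element start tag that actually exists in the markup."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

parser = TagCollector()
parser.feed(initial_html)

# Only 'div' and 'script' are real elements; the <li> exists solely as
# text inside the script, which urllib2/urllib will never execute.
print(parser.tags)          # ['div', 'script']
print('li' in parser.tags)  # False
```

To capture the script-generated rows you would need something that actually runs the JavaScript (a real browser), or you would have to find and call the AJAX endpoint the script uses.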
import urllib2

# Target search page on 51job
url = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=04%2C05%2C06%2C07&keyword=python&keywordtype=2&curr_page=1&lang=c&stype=2&postchannel=0000&workyear=99&cotype=99&degreefrom=04%2C05%2C06&jobterm=99&companysize=03%2C04%2C05%2C06%2C07&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'

# Browser-like request headers
req_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36',
    'Referer': 'http://search.51job.com'
}

# Route the request through an HTTP proxy; debuglevel=1 prints the raw
# HTTP exchange so the actual request/response can be inspected.
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': '14.18.252.61:80'}),
    urllib2.HTTPHandler(debuglevel=1))
urllib2.install_opener(opener)

req = urllib2.Request(url, headers=req_header)
resp = urllib2.urlopen(req)
html = resp.read()
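One pitfall worth checking when a URL is copied out of rendered HTML: `&deg` is a legacy HTML entity (the degree sign), so entity decoding can silently corrupt a raw `&` in a query string, turning `&degreefrom=` into `°reefrom=` and destroying the parameter name. A small Python 3 sketch of the effect (the query string here is illustrative):

```python
import html

# A query string pasted from rendered HTML. "&deg" is a legacy HTML
# entity, so entity decoding eats the start of "degreefrom".
raw = 'workyear=99&degreefrom=04'
print(html.unescape(raw))   # 'workyear=99°reefrom=04' -- parameter name destroyed

# When a URL is embedded in HTML, "&" must be written as "&amp;"
# so that decoding round-trips back to the original query string.
embedded = raw.replace('&', '&amp;')
print(html.unescape(embedded))  # 'workyear=99&degreefrom=04'
```

If a request keeps returning unexpected results, it is worth confirming the query string the server actually receives has not been mangled this way.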