This article is about 3,425 characters; estimated reading time: 11 minutes.
HttpClient 4.x page-fetching code:

InputStream is = null;
HttpGet httpGet = null;
try {
    // URL here is a String field holding the target address
    URL url = new URL(URL);
    // Rebuild the URI so special characters in the path/query get escaped
    URI uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), url.getRef());
    httpGet = new HttpGet(uri);
    // Set the socket (read) and connect timeouts, and allow redirects
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(5000)
            .setConnectTimeout(2000)
            .setRedirectsEnabled(true)
            .build();
    httpGet.setConfig(requestConfig);
    HttpClientContext context = HttpClientContext.create();
    HttpResponse response = httpClient.execute(httpGet, context);
    long len = response.getEntity().getContentLength();
    if (len > 1024 * 1024) { // larger than 1 MB: refuse to fetch
        return false;
    }
    // Collect every redirect hop that was followed
    List<URI> redirectLocations = context.getRedirectLocations();
    int responseCode = response.getStatusLine().getStatusCode();
    if (responseCode == 200) {
        if (redirectLocations != null && !redirectLocations.isEmpty()) {
            URL = redirectLocations.get(redirectLocations.size() - 1).toString();
        }
        // Skip non-HTML responses such as JSON or XML
        if (!response.getEntity().getContentType().toString().contains("text/html")) {
            return false;
        }
        is = response.getEntity().getContent();
        BufferedReader br = new BufferedReader(new InputStreamReader(is, charset));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
        String content = sb.toString();
        // A second guard against non-HTML bodies (JSON, XML, ...)
        if (!content.contains("html")) {
            return false;
        }
        ......
        return true;
    }
    return false;
} catch (Exception localException) {
    localException.printStackTrace();
} finally {
    try {
        if (httpGet != null) {
            httpGet.releaseConnection();
        }
        if (is != null) {
            is.close();
        }
    } catch (Exception localException1) {
    }
}
return null;

----------------------------------------------------------------------------

Response headers are set by the server, so Content-Length alone cannot reliably reject large files.

What we need is a different approach:

1. When the header cannot be read (or trusted), download the page and judge as we go: if the read times out, assume the file is too large and stop the download.

Hence the idea of having Java invoke the wget or curl command:

String cmd = "wget -v --output-document=" + CrawlerConstants.Tmp_Dir + "/wget.txt --no-check-certificate --tries=3 " + url;
//cmd = "Wget_SingleDown_run.sh"
// Run the command
p = Runtime.getRuntime().exec(cmd);
InputStream stderr = p.getErrorStream();
InputStreamReader isr = new InputStreamReader(stderr);
BufferedReader br1 = new BufferedReader(isr);
String line = null;
String link = url;
boolean is_text_html = false;
// The stream must be drained here, otherwise the child process blocks
while ((line = br1.readLine()) != null) {
    // This prints wget's download progress, Location redirects, and other header info
    System.out.println(line);
}
p.waitFor();
p.destroy();

---------------------------------------------------------------------

Continuing with the code: wrap the wget command in a shell script and do the timeout check inside the script. There are two script files.

Wget_SingleDown_run.sh:

#!/bin/bash
## usage: ./singleDown killapp dirfile url
if [ $# -lt 3 ]
then
    echo 'usage: ./singleDown killapp dirfile url'
    exit 1
fi
## number of retries
retryTime=2
## idle (read) timeout in seconds
idleTimeOut=3
## overall download time limit in seconds
downTimeOut=5
## download rate limit
##limitRate=128k
## interval between retries in seconds
##waitRetry=1
url=`echo "$3" | sed "s/ /%20/g"`
# run wget in the background so $! captures its PID
wget --no-check-certificate -t $retryTime -T $idleTimeOut -O $2 $url &
downPid=$!
echo $downPid
# start the watchdog ($1 = Wget_SleepAndKill_run.sh), discarding its output
$1 $downTimeOut $downPid $2 >>/dev/null 2>&1 &
clockPid=$!
wait $downPid
# if the watchdog is still alive, the download finished in time: kill the watchdog
ps $clockPid
if [ $? -eq 0 ]
then
    kill -9 $clockPid
fi
exit 1

Wget_SleepAndKill_run.sh:

#!/bin/bash
## usage: timeOut(s) pidToKill fileDir
if [ $# -lt 3 ]
then
    echo 'usage: timeOut(s) pidToKill fileDir'
    exit 1
fi
sleep $1
# if the download is still running after the timeout, kill it and empty its output file
ps $2
if [ $? -eq 0 ]
then
    kill -9 $2
    cat /dev/null > $3
fi

----------------------------------------------------------------------------

Without Wget_SleepAndKill_run, the program would not finish normally: it would sit waiting and only exit after the full timeout expired.

Note the usage of >>/dev/null 2>&1 &: stdout is appended to /dev/null, stderr is redirected to stdout, and the command runs in the background.

Reposted from: http://flebb.baihongyu.com/
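On Java 8 and later the watchdog script is not strictly necessary: Process.waitFor(long, TimeUnit) waits with a deadline, and destroyForcibly() plays the role of kill -9. The following is a minimal sketch of the same download-with-deadline idea; the class and method names are mine, and the wget flags simply mirror the command used above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class WgetWithTimeout {

    /** Run a command, killing it if it exceeds timeoutSec; true on a clean exit. */
    public static boolean runWithTimeout(List<String> cmd, long timeoutSec) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true); // merge stderr (wget's progress) into stdout
        Process p = pb.start();

        // Drain the output in another thread so the child never blocks on a full pipe.
        Thread drainer = new Thread(() -> {
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                while (br.readLine() != null) { /* discard progress lines */ }
            } catch (Exception ignored) {
            }
        });
        drainer.start();

        // waitFor with a timeout replaces the Wget_SleepAndKill_run.sh watchdog.
        if (!p.waitFor(timeoutSec, TimeUnit.SECONDS)) {
            p.destroyForcibly(); // the kill -9 equivalent
            return false;        // treat as "too large / too slow"
        }
        return p.exitValue() == 0;
    }

    public static void main(String[] args) {
        try {
            boolean ok = runWithTimeout(Arrays.asList(
                    "wget", "--no-check-certificate", "--tries=2",
                    "--output-document=/tmp/wget.txt", "https://example.com/"), 5);
            System.out.println("downloaded: " + ok);
        } catch (Exception e) {
            System.out.println("wget not available: " + e.getMessage());
        }
    }
}
```

The helper is command-agnostic, so the whole Wget_SingleDown_run.sh / Wget_SleepAndKill_run.sh pair collapses into one Java call with no orphaned watchdog process to clean up.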
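Shelling out is not the only way to "judge while downloading": the same size cap can be enforced in pure Java by counting bytes as the body streams in, so a missing or lying Content-Length header no longer matters. A minimal sketch (the class name, the 1 MB cap, and the UTF-8 charset are illustrative choices, not from the original code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BoundedFetch {

    /** Read at most maxBytes from the stream; return null once the cap is exceeded. */
    public static byte[] readAtMost(InputStream in, int maxBytes) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                return null; // body is too large: stop downloading mid-stream
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Fetch a page, rejecting it if the body exceeds one megabyte. */
    public static String fetchHtml(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setConnectTimeout(2000); // same limits as the HttpClient version above
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            byte[] body = readAtMost(in, 1024 * 1024); // cap enforced while reading
            return body == null ? null : new String(body, StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // The cap works on any stream, not just HTTP ones:
        byte[] small = readAtMost(new ByteArrayInputStream(new byte[10]), 100);
        byte[] large = readAtMost(new ByteArrayInputStream(new byte[10]), 5);
        System.out.println((small != null ? small.length : -1) + " " + (large == null));
        // prints "10 true"
    }
}
```

Combined with the read timeout, this covers both failure modes the shell scripts were written for: a server that never stops sending, and a server that stalls.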