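Target URL:
Information to scrape: the top 13 hot comments on NetEase Cloud Music
Fetching the page source with HttpURLConnection, as before, and inspecting it shows that the hot comments are not in the source.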
    package bok;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class GC {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://music.163.com/#/song?id=409649818");
            HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
            // Collect the response in a StringBuilder rather than concatenating Strings.
            StringBuilder get = new StringBuilder();
            if (httpURLConnection.getResponseCode() == 200) {
                // try-with-resources closes the reader once the body has been read.
                try (BufferedReader bufferedReader = new BufferedReader(
                        new InputStreamReader(httpURLConnection.getInputStream(), "UTF-8"))) {
                    String read;
                    while ((read = bufferedReader.readLine()) != null) {
                        get.append(read).append("\r\n");
                    }
                }
                System.out.println(get);
            }
        }
    }
Part of the returned source is shown below:
    {/if}
    {else}
    {/if}
    {var alia=songAlia(x)}
    ${soil(x.name)}{if alia} - (${soil(alia)}){/if}
    {if x.mvid>0}
    MV
    {/if}
    {/if}
    ${dur2time(x.duration/1000)}{if x.ftype==2}{/if}
    ${getArtistName(x.artists, '', '', false, false, true)}
    {/list}
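One detail that may also contribute, noted here as an observation rather than something the original analysis states: everything after `#` in a URL is a fragment, and clients do not send it to the server. The sketch below uses `java.net.URI` (not the `URL` class used in the code above) to show how the song URL splits; the class name `FragmentDemo` is made up for illustration.

```java
import java.net.URI;

public class FragmentDemo {
    // Returns the path component, which is the part the server is asked for.
    public static String path(String url) {
        return URI.create(url).getPath();
    }

    // Returns the fragment (the part after '#'), which stays on the client.
    public static String fragment(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String songPage = "http://music.163.com/#/song?id=409649818";
        System.out.println("path     = " + path(songPage));
        System.out.println("fragment = " + fragment(songPage));
    }
}
```

For this URL the path is just `/`, and the song ID lives entirely in the fragment, so the server never sees it in the request.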
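Since the fetched source contains no hot comments,
the only option is to analyze the network requests via F12 -> Network.
Doing so shows that
the request carrying the hot comments is http://music.163.com/weapi/v1/resource/comments/R_SO_4_409649818?csrf_token=
where 409649818 is the song ID.
The form data does not depend on the song; it appears to be derived from this machine's cookie information, so a single captured set of form data can be reused for requests for different songs.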
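As a minimal sketch of that observation, the helper below builds the comments URL for any song ID and URL-encodes the two form fields using only the standard library. The class and method names (`CommentRequest`, `commentsUrl`, `formBody`) are hypothetical; the field names `params` and `encSecKey` are the ones used in the code that follows.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CommentRequest {
    // Prefix and suffix copied from the request seen in the Network tab.
    private static final String PREFIX =
            "http://music.163.com/weapi/v1/resource/comments/R_SO_4_";
    private static final String SUFFIX = "?csrf_token=";

    // Builds the comments URL for the given song ID.
    public static String commentsUrl(long songId) {
        return PREFIX + songId + SUFFIX;
    }

    // Encodes the two form fields as application/x-www-form-urlencoded.
    public static String formBody(String params, String encSecKey) {
        try {
            return "params=" + URLEncoder.encode(params, "UTF-8")
                    + "&encSecKey=" + URLEncoder.encode(encSecKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(commentsUrl(409649818L));
        // Placeholder values, only to show how '+', '/' and '=' get escaped.
        System.out.println(formBody("a+b/c=", "0f"));
    }
}
```

Encoding matters here because the captured `params` value is Base64-like text containing `+` and `/`, which must be escaped in a form body.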
The basic code is as follows:
    package 网易云热评爬取;

    import org.apache.http.HttpEntity;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MyClawer {
        public static void printHot(String u) throws Exception {
            CloseableHttpClient closeableHttpClient = HttpClients.createDefault();
            HttpPost httpPost = new HttpPost(u);
            httpPost.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36");

            // Form data captured from the browser request; reused for every song.
            List<NameValuePair> list = new ArrayList<NameValuePair>();
            list.add(new BasicNameValuePair("params", "RlBC7U1bfy/boPwg9ag7/a7AjkQOgsIfd+vsUjoMY2tyQCPFgnNoxHeCY+ZuHYqtM1zF8DWIBwJWbsCOQ6ZYxBiPE3bk+CI1U6Htoc4P9REBePlaiuzU4M3rDAxtMfNN3y0eimeq3LVo28UoarXs2VMWkCqoTXSi5zgKEKbxB7CmlBJAP9pn1aC+e3+VOTr0"));
            list.add(new BasicNameValuePair("encSecKey", "76a0d8ff9f6914d4f59be6b3e1f5d1fc3998317195464f00ee704149bc6672c587cd4a37471e3a777cb283a971d6b9205ce4a7187e682bdaefc0f225fb9ed1319f612243096823ddec88b6d6ea18f3fec883d2489d5a1d81cb5dbd0602981e7b49db5543b3d9edb48950e113f3627db3ac61cbc71d811889d68ff95d0eba04e9"));

            httpPost.setEntity(new UrlEncodedFormEntity(list));
            CloseableHttpResponse response = closeableHttpClient.execute(httpPost);

            HttpEntity entity = response.getEntity();
            String ux = EntityUtils.toString(entity, "utf-8");
            //System.out.println(ux);
            ArrayList<String> s = getBook(ux);

            // The loop body and the main method are missing from the original
            // listing; printing each match and calling printHot with the
            // comments URL are assumed reconstructions.
            for (int i = 0; i < s.size(); i++) {
                System.out.println(s.get(i));
            }
            response.close();
            closeableHttpClient.close();
        }

        public static void main(String[] args) throws Exception {
            printHot("http://music.163.com/weapi/v1/resource/comments/R_SO_4_409649818?csrf_token=");
        }

        // Collects every distinct substring that runs from "content" to the next "}.
        public static ArrayList<String> getBook(String read) {
            ArrayList<String> arrayList = new ArrayList<String>();

            String con = "content(.*?)\"}";
            Pattern ah = Pattern.compile(con);
            Matcher mr = ah.matcher(read);
            while (mr.find()) {
                if (!arrayList.contains(mr.group())) {
                    arrayList.add(mr.group());
                }
            }
            return arrayList;
        }
    }
Output:
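Because getBook stores the whole match, every printed entry starts with `content` and ends with `"}`. A possible refinement, sketched here under assumptions, is to capture only the comment text with a regex group. The class name `CommentText`, the `likedCount` field, and the sample string are all made up for illustration and are not real API output.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentText {
    // Same idea as the pattern in getBook, but the key and the quotes sit
    // outside the capture group, so group(1) holds only the comment text.
    private static final Pattern CONTENT =
            Pattern.compile("content\":\"(.*?)\"\\}");

    // Returns the distinct comment texts in the order they first appear.
    public static List<String> extract(String json) {
        Set<String> seen = new LinkedHashSet<String>();
        Matcher m = CONTENT.matcher(json);
        while (m.find()) {
            seen.add(m.group(1));
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample; the real response layout is an assumption here.
        String sample = "[{\"likedCount\":10,\"content\":\"first\"},"
                + "{\"likedCount\":5,\"content\":\"second\"},"
                + "{\"likedCount\":1,\"content\":\"first\"}]";
        System.out.println(extract(sample));
    }
}
```

A regex like this is brittle if a comment itself contains `"}` or escaped quotes; a JSON parser would be the more robust choice if one is available.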
Reposted from: http://usxqa.baihongyu.com/