是五月呀!

云音乐抓取

1. 初衷

作为云音乐的死忠用户,在大量版权丢失的情况下,依然坚守阵地,除了对UI的喜欢,还有就是歌曲下面精彩的评论信息了。
翻过很多热评后,就想把这些数据抓取下来。有了数据,就可以做数据分析。

2.探索抓取方案

我感兴趣的数据包括歌曲信息、用户信息和评论信息。
在web端搜索一首歌曲,例如女友喜欢的《umbrella》:http://music.163.com/song?id=21563094
打开开发者选项,抓一下这个接口的返回:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="baidu-site-verification" content="cNhJHKEzsD" />
<meta property="qc:admins" content="27354635321361636375" />
<link rel="canonical" href="https://music.163.com/song?id=21563094">
<meta name="applicable-device" content="pc">
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://music.163.com/m/song?id=21563094">
<meta name="mobile-agent" content="format=html5;url=https://music.163.com/m/song?id=21563094">
<title>Umbrella - Rihanna/Jay-Z - 单曲 - 网易云音乐</title>
<meta name="keywords" content="Umbrella,Good Girl Gone Bad: Reloaded,Rihanna,Jay-Z" />
<meta name="description" content="歌手:Rihanna,Jay-Z。所属专辑:Good Girl Gone Bad: Reloaded。" />
<meta property="og:title" content="Umbrella - Rihanna/Jay-Z - 单曲 - 网易云音乐" />
<meta property="og:type" content="music.song" />
<meta property="og:image" content="http://p1.music.126.net/cSVFVXAwbx-UITEYuW2OCQ==/18923694625892133.jpg" />
<meta property="og:url" content="https://music.163.com/song?id=21563094" />
<script type="application/ld+json">
{
"@context": "https://ziyuan.baidu.com/contexts/cambrian.jsonld",
"@id": "http://music.163.com/song?id=21563094",
"appid": "1582028769404989",
"title": "Umbrella",
"images": ["http://p1.music.126.net/cSVFVXAwbx-UITEYuW2OCQ==/18923694625892133.jpg"],
"description": "歌手:Rihanna,Jay-Z。所属专辑:Good Girl Gone Bad: Reloaded。",
"pubDate": "2008-06-01T00:00:00"
}
<!-- ... -->

这个静态页面包含了我们需要的歌曲信息:歌曲名称,歌手名称,简单描述,封面等。

过滤一下comment关键字,可以抓到评论接口:
http://music.163.com/weapi/v1/resource/comments/R_SO_4_21563094?csrf_token=
观察下这个请求url,发现是
http://music.163.com/weapi/v1/resource/comments/R_SO_4_{songId}?csrf_token=
的规则。
但是直接发POST请求拿不到数据,说明这个接口做了权限校验。
继续分析这个接口的请求参数,发现Form Data中带了两个参数:

1
2
params: Tt0vzJun80fPMcJdTIQZjECkfD19yB+85n1n/yHPlpMyRMftbmFja0AccgQjzjJwVBCs/2P8Rv3G2twaIXyxcHHTPinf+r+W3My26Ig+7u4+MFHlt2da1t1mEqXDQFDLRMgQGBrHJ3SQwyL82t+FTpjoVRGVlIKUqT5+k/w9oFoPpCyLzICLhbdK+dHANH6g
encSecKey: 465fd8f7c29de467d6396e88b8b77d42971cb85134f8da7622925823019ab6774dc4b1cf6da9d39c063cb492e01349ae5801946700bc981b83cd5cddf66229ce794b63b8be99b7e47a88f415f412c6a5bf561a84b4cefb63b9571e05209e024b1e2dea08cbf9d299d11770682fc1f17117c612e0d113a26175f88b2a661bfc32

我们也加上两个表单参数:

1
2
3
4
curl -X POST \
'http://music.163.com/weapi/v1/resource/comments/R_SO_4_21563094/?csrf_token=' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'params=idbuC%2BeyC4tyaC8O0b17iPmu76ES3ExK4xUyvEcqLUc4ky04AGkMkKd8Agp1Z46Vh3TMCZUc4aqHdMaQ7MyYj2jzr2v3irkRTalDwLf5Hu4%2B3FrujioedL5Crw8a2G9rcV%2BjdUm7JkZbZ6ubRTl4QlRT2OjNj7DjcBj%2B5nKgH6Z2rGQzoL8XO5HAVED4uRBB&encSecKey=a889c30bc037e06c3cb96d2f694bdec265cda483db3b551d72c9c30d2eb5ee7fac6848cb295fcde522ba0b0ed709ca5f28638353016bad5b4224a76517f2776a8d1bea8e2a9318625b41c73e2aece423e0846bcd54093032287185b1f618fe39163f8df49939abac3f5ae8ebcea87cda6e921b9ba41e0ed33f492b1624a48a92'

返回结果结构如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
{
"isMusician": false,
"userId": -1,
"topComments": [],
"moreHot": true,
"hotComments": [
{
"user": {
"locationInfo": null,
"vipRights": {
"musicPackage": {
"vipCode": 200,
"rights": true
}
},
"userId": 51494587,
"nickname": "FireBurning_JASON",
"expertTags": null,
"userType": 0,
"authStatus": 0,
"remarkName": null,
"avatarUrl": "http://p1.music.126.net/8c28fFk1Yqml-CCulWzioA==/7806532557388357.jpg",
"experts": null,
"vipType": 10
},
"beReplied": [],
"pendantData": null,
"expressionUrl": null,
"time": 1424242903947,
"liked": false,
"likedCount": 14377,
"commentId": 11007152,
"content": "应该说这首歌才是rir的巅峰吧,然后就一直在巅峰"
}
],
"total": 6935,
"more": true
}

这个接口就包含我感兴趣的热门评论信息和用户信息。
对于简单抓取,这两个接口已经满足我的需求。如果想抓取全部评论,包含分页数据,那么需要分析云音乐的加密算法,其实在混淆后的JS中都有,有专门的文章分析,这里不做过多解释。

3. 代理IP池

有了接口,就可以写代码发送http请求遍历songId进行数据抓取了。
但是云音乐做了防刷处理,抓取几万条数据后,开始接口报错,被封了IP。
于是想要维护一个代理IP池,每次抓取时从池子中拿一个代理IP进行接口调用,防止IP封禁。

3.1 代理IP

来源:可以从淘宝上面买,可以从免费网站抓取。
分类:根据匿名程度,可以分为高匿,普通和透明。高匿的代理ip可以屏蔽掉真实IP,目标服务器完全不能感觉到代理IP的存在;普通代理ip能隐藏真实IP,但是目标服务器能感受到代理IP的存在;透明代理IP不能隐藏真实IP。

3.2 免费代理IP池

搜索一下代理IP,能找到大量网站提供一些免费代理IP,质量参差不齐,站越大,用的人越多,质量越差。
于是搭建免费代理IP池包含两个步骤:
(1)从免费代理IP网站抓取代理IP列表
(2)验证代理IP可用性
不同的代理IP网站抓取和解析的过程也不尽相同,但是不难。
验证过程就是使用抓取来的代理IP访问常用网站或者目标网站,如果可用则保留,不可用则丢弃。

4. 实现

4.1 配置文件读取

项目中使用三个自定义配置参数:请求代理IP列表超时时间,代理IP验证超时时间和真正请求歌曲信息接口超时时间。
application.yml中配置如下:

1
2
3
4
5
app:
http:
proxyLoadTimeout: 10
proxyValidateTimeout: 8
songLoadTimeout: 15

这里使用ConfigurationProperties注解读取

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
/**
* Description:
* http配置参数
* 修改此类需要maven重新编译,yml里面才能有提示
*
* @author damon4u
*/
@Component
@ConfigurationProperties("app.http")
@Data
public class HttpConfig {
private int proxyLoadTimeout;
private int proxyValidateTimeout;
private int songLoadTimeout;
}

为了能够在yml配置文件中提示自定义参数,需要引入spring-boot-configuration-processor:

1
2
3
4
5
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-configuration-processor</artifactId>
<optional>true</optional>
</dependency>

注意每次修改HttpConfig都需要重新maven编译,yml里面才能有提示。

4.2 使用普通http请求获取动态IP

网络请求层使用OKHttp + Retrofit2框架。
Retrofit2本身是对OKHttp网络请求工具的封装实现,可以方便进行各类HTTP方法请求,对请求参数和返回值进行类型转换。
首先引入maven依赖:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
<!-- retrofit start -->
<dependency>
<groupId>com.squareup.retrofit2</groupId>
<artifactId>retrofit</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>com.squareup.retrofit2</groupId>
<artifactId>converter-jackson</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>com.squareup.retrofit2</groupId>
<artifactId>converter-scalars</artifactId>
<version>2.3.0</version>
</dependency>
<!-- retrofit end -->

retrofit是核心包,converter-jacksonconverter-scalars是两个返回值类型转换器,分别对json和普通类型(如string)的返回值进行转换。

获取动态IP的接口为:https://raw.githubusercontent.com/stamparm/aux/master/fetch-some-list.txt
返回结果为:

1
2
3
4
5
6
7
8
9
10
[
{
"proto": "http",
"ip": "176.62.180.209",
"country": "Russian Federation",
"anonymity": "high",
"type": "elite",
"port": 63909
}
]

对于代理IP,统一封装成一个 Proxy实体类:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
@Data
public class Proxy implements Serializable {
/**
* 主键id
*/
private int id;
/**
* ip地址或者域名
*/
private String ip;
/**
* 端口号
*/
private int port;
/**
* 协议http、https、socks4等
*/
private String proto;
public Proxy() {}
public Proxy(String ip, int port, String proto) {
this.ip = ip;
this.port = port;
this.proto = proto.toLowerCase();
}
@Override
public String toString() {
return proto + "://" + ip + ":" + port;
}
}

首先我们封装一个请求client接口,专门负责发起https://raw.githubusercontent.com/域名下的请求,当然,这里我们只有一个普通的GET请求。

1
2
3
4
5
6
7
8
public interface GitProxyClient {
/**
* 获取代理IP列表
* @return
*/
@GET("/stamparm/aux/master/fetch-some-list.txt")
Call<List<Proxy>> loadProxy();
}

之后,需要想Spring容器注入bean:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
@Configuration
@Data
public class HttpClientFactory {
/**
* 网络配置参数
*/
private final HttpConfig httpConfig;
@Autowired
public HttpClientFactory(HttpConfig httpConfig) {
this.httpConfig = httpConfig;
}
@Bean
public OkHttpClient okHttpClient() {
return new OkHttpClient.Builder()
.addInterceptor(new HttpHeaderInterceptor())
.connectTimeout(httpConfig.getProxyLoadTimeout(), TimeUnit.SECONDS)
.readTimeout(httpConfig.getProxyLoadTimeout(), TimeUnit.SECONDS).build();
}
@Bean
public GitProxyClient gitProxyClient() {
ObjectMapper mapper = new ObjectMapper();
//设置输入时忽略在JSON字符串中存在但Java对象实际没有的属性
mapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
return new Retrofit.Builder()
.addConverterFactory(JacksonConverterFactory.create(mapper))
.baseUrl("https://raw.githubusercontent.com/")
.client(okHttpClient())
.build().create(GitProxyClient.class);
}
}

HttpClientFactory先注册一个OkHttpClientbean,设置了超时时间和拦截器,这个请求并不需要设置代理IP。
其中HttpHeaderInterceptor很简单,只是添加一些公共头信息,包括一个随机UA:

1
2
3
4
5
6
7
8
9
10
11
public class HttpHeaderInterceptor implements Interceptor {
@Override
public Response intercept(Chain chain) throws IOException {
Request request = chain.request();
Request.Builder builder = request.newBuilder();
builder.addHeader("User-Agent", UAPool.getUA()); // 写死一个UA池子,每次随机拿一个,这里就不贴了
builder.addHeader("Connection", "keep-alive");
return chain.proceed(builder.build());
}
}

之后注册一个GitProxyClientbean,定制了Jackson反序列化时使用的ObjectMapper,设置了该client的基础域名。
接下来就可以使用gitProxyClient发起请求获取动态IP列表了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
@Component
public class ProxyLoader {
@Autowired
private GitProxyClient gitProxyClient;
private void loadFromGit() throws Exception {
// 请求
Response<List<Proxy>> response = gitProxyClient.loadProxy().execute();
List<Proxy> proxyList = response.body();
// 过滤
filterProxy(proxyList);
}
}

4.3 过滤抓取来的代理IP

从网上抓取来的代理IP一般质量都不高,所以需要进行过滤。
常用的方法就是尝试使用代理IP去访问一个常用网站,如百度等,如果在短时间内正常返回则认为可用。
开始我也采用了这种方式,访问百度,但是发现可能会出现一种情况,就是百度可以访问,但是访问云音乐接口却出现错误。
既然如此,不如一步到位,过滤时就访问歌曲信息接口,如果能正常获取歌曲信息,那么认为代理可用。

先给出获取歌曲信息的Client:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
public interface MusicClient {
/**
* 获取歌曲信息
*
* @param songId 歌曲id
* @return 页面html
*/
@GET("song")
@Headers("Referer: http://music.163.com/")
Call<String> songInfo(@Query("id") long songId);
/**
* 获取评论信息
*
* @param songId 歌曲id
* @param params 加密params
* @param encSecKey 加密encSecKey
* @return 评论接口返回体
*/
@POST("weapi/v1/resource/comments/R_SO_4_{songId}/?csrf_token=d2c9e86c94efabcc4b5a1a6d757d417e")
@FormUrlEncoded
@Headers("Referer: http://music.163.com/")
Call<CommentResponseBody> comment(@Path("songId") Long songId,
@Field("params") String params,
@Field("encSecKey") String encSecKey);
}

接下来需要一个定制OkHttp,添加代理IP,这里创建一个工厂:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
public class HttpProxyClientFactory {
public static MusicClient musicClient(Proxy proxy, int timeoutInSecond) {
OkHttpClient.Builder builder = new OkHttpClient.Builder()
.connectTimeout(timeoutInSecond, TimeUnit.SECONDS)
.readTimeout(timeoutInSecond, TimeUnit.SECONDS)
.addInterceptor(new HttpHeaderInterceptor());
addProxy(proxy, builder);
OkHttpClient okHttpClient = builder.build();
ObjectMapper mapper = new ObjectMapper();
//设置输入时忽略在JSON字符串中存在但Java对象实际没有的属性
mapper.disable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);
Retrofit retrofit = new Retrofit.Builder()
.addConverterFactory(ScalarsConverterFactory.create())
.addConverterFactory(JacksonConverterFactory.create(mapper))
.baseUrl("http://music.163.com/")
.client(okHttpClient)
.build();
return retrofit.create(MusicClient.class);
}
/**
* 设置代理
*/
private static void addProxy(Proxy proxy, OkHttpClient.Builder builder) {
if (proxy != null) {
String schemeName = proxy.getProto();
java.net.Proxy.Type type = java.net.Proxy.Type.HTTP;
if (schemeName.startsWith("socks")) {
type = java.net.Proxy.Type.SOCKS;
}
builder.proxy(new java.net.Proxy(type, new InetSocketAddress(proxy.getIp(), proxy.getPort())));
}
}
}

由于每次请求的代理IP都不同,不能直接注入到bean,而是使用工厂方法,每次请求都重新构造一个OkHttpClient,然后生成MusicClient
注意在添加converterFactory时,先添加ScalarsConverterFactory,再JacksonConverterFactory,原因在doc中:

Because Jackson is so flexible in the types it supports, this converter assumes that it can handle all types. If you are mixing JSON serialization with something else (such as protocol buffers), you must add this instance last to allow the other converters a chance to see their types.

接下来就可以使用代理IP请求歌曲信息了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
private static Pattern NAME_PATTERN = Pattern.compile("<title>(.*) - 网易云音乐</title>");
private static Pattern DESCRIPTION_PATTERN = Pattern.compile("<meta name=\"description\" content=\"(.*)\" />");
private static Pattern IMAGE_PATTERN = Pattern.compile("<meta property=\"og:image\" content=\"(.*)\" />");
/**
* 获取歌曲信息
*
* @param songId 歌曲id
* @return 如果找到,返回:歌曲名称 - 歌手名称;否则返回null
*/
public Song getSongInfo(long songId, Proxy proxy, int timeoutInSecond) throws Exception {
Response<String> response = HttpProxyClientFactory.musicClient(proxy, timeoutInSecond)
.songInfo(songId).execute();
Preconditions.checkNotNull(response);
String songInfo = response.body();
Preconditions.checkArgument(StringUtils.isNotBlank(songInfo));
Matcher nameMatcher = NAME_PATTERN.matcher(songInfo);
Matcher descriptionMatcher = DESCRIPTION_PATTERN.matcher(songInfo);
Matcher imageMatcher = IMAGE_PATTERN.matcher(songInfo);
String name = "";
String description = "";
String image = "";
if (nameMatcher.find()) {
name = nameMatcher.group(1);
}
if (descriptionMatcher.find()) {
description = descriptionMatcher.group(1);
}
if (imageMatcher.find()) {
image = imageMatcher.group(1);
}
if (StringUtils.isNotBlank(name)) {
Song song = new Song();
song.setSongId(songId);
song.setName(name);
song.setDescription(description);
song.setImage(image);
song.setCreateTime(new Date());
logger.info("song={}", song);
return song;
}
return null;
}

剩下的过滤工作也就简单了,对于爬去来的代理IP列表,使用多个线程去过滤一下就好了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
private void filterProxy(List<Proxy> proxyList) {
if (CollectionUtils.isNotEmpty(proxyList)) {
final CountDownLatch latch = new CountDownLatch(proxyList.size());
ExecutorService executorService = Executors.newFixedThreadPool(5);
proxyList.forEach(proxy -> executorService.execute(new FilterJob(proxy, latch)));
try {
latch.await();
} catch (InterruptedException e) {
LOGGER.error(e.getMessage());
}
executorService.shutdownNow();
}
}
class FilterJob implements Runnable {
private Proxy proxy;
private CountDownLatch latch;
FilterJob(Proxy proxy, CountDownLatch latch) {
this.proxy = proxy;
this.latch = latch;
}
@Override
public void run() {
try {
// 先只要http类型的
if (proxy.getProto().startsWith("http")
&& commentLoader.getSongInfo(209996, proxy, httpConfig.getProxyValidateTimeout()) != null) {
LOGGER.info("================> {}", proxy);
proxyDao.save(new Proxy(proxy.getIp(), proxy.getPort(), proxy.getProto()));
}
} catch (Exception e) {
LOGGER.error(e.getMessage());
} finally {
latch.countDown();
}
}
}

4.4 定时抓取

上面我们实现了一次代理IP抓取,而抓取来的IP也会失效,所以我们需要定时抓取,更新代理IP池。
定时任务可以使用Spring的Schedule注解实现,也可以使用quartz,这里使用quartz。
引入依赖:

1
2
3
4
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-quartz</artifactId>
</dependency>

application.yml添加配置:

1
2
3
4
5
6
7
8
9
10
11
12
spring:
quartz:
properties:
org:
quartz:
scheduler:
instanceName: proxyScheduler
threadPool:
class: org.quartz.simpl.SimpleThreadPool
threadCount: 10
threadPriority: 5
job-store-type: MEMORY

这里使用简单的内存存储job,也可以使用数据库,配置项会麻烦些。
然后创建一个定时任务job,继承QuartzJobBean

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
@Component
public class ProxyLoadJob extends QuartzJobBean {
private static final Logger LOGGER = LoggerFactory.getLogger(ProxyLoadJob.class);
@Autowired
private ProxyLoader proxyLoader;
@Override
protected void executeInternal(JobExecutionContext jobExecutionContext) {
LOGGER.info("ProxyLoadJob start.");
try {
proxyLoader.loadProxy();
} catch (Exception e) {
LOGGER.error(e.getMessage());
}
LOGGER.info("ProxyLoadJob end.");
}
}

有了任务,还要将它调度起来,我想20分钟跑一次:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
@Configuration
public class JobScheduler {
private static final String JOB_NAME_PROXY_LOAD = "proxyLoadJob";
private static final String GROUP_NAME_PROXY = "proxyLoadGroup";
@Bean
public JobDetail proxyJob() {
return JobBuilder.newJob(ProxyLoadJob.class)
.withIdentity(JOB_NAME_PROXY_LOAD, GROUP_NAME_PROXY)
.storeDurably()
.build();
}
@Bean
public Trigger proxyJobTrigger() {
CronScheduleBuilder scheduleBuilder = CronScheduleBuilder.cronSchedule("0 0/20 * * * ?");
return TriggerBuilder.newTrigger()
.withSchedule(scheduleBuilder)
.withIdentity(JOB_NAME_PROXY_LOAD, GROUP_NAME_PROXY)
.forJob(proxyJob())
.build();
}
}

这里我们先注册一个JobDetail,需要一个名字和分组,然后设置为durably,即没有trigger指向时,也保留。
接下来注册一个trigger去触发上面的job,设置cron表达式,绑定job即可。

4.5 定时删除失效IP

上面我们实现了定时抓取新代理IP的功能,接下来还需要定时从库中删除失效的代理IP,不能光等到使用时再删除。
方法是从库中读取所有代理IP(如果代理IP量很大,可以分页处理),然后启动多个任务去校验代理IP有效性,将无效的删除:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
public void validateFromDb() {
List<Proxy> allProxy = proxyDao.getAllProxy();
if (CollectionUtils.isNotEmpty(allProxy)) {
final CountDownLatch latch = new CountDownLatch(allProxy.size());
ExecutorService executorService = Executors.newFixedThreadPool(5);
allProxy.forEach(proxy -> executorService.execute(new ValidateJob(proxy, latch)));
for (Proxy proxy : allProxy) {
executorService.execute(new ValidateJob(proxy, latch));
}
try {
latch.await();
} catch (InterruptedException e) {
LOGGER.error(e.getMessage());
}
executorService.shutdownNow();
}
}
class ValidateJob implements Runnable {
private Proxy proxy;
private CountDownLatch latch;
ValidateJob(Proxy proxy, CountDownLatch latch) {
this.proxy = proxy;
this.latch = latch;
}
@Override
public void run() {
try {
if (commentLoader.getSongInfo(209996, proxy, httpConfig.getProxyValidateTimeout()) == null) {
proxyDao.delete(proxy.getId());
LOGGER.error("disable proxy={}", proxy);
}
} catch (Exception e) {
LOGGER.error(e.getMessage());
} finally {
latch.countDown();
}
}
}

同样需要一个定时任务,15分钟执行一次即可,代码与上面抓取代理任务类似。

4.6 使用代理IP获取歌曲信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
@Component
public class CommentLoader {
private static final Logger logger = LoggerFactory.getLogger(CommentLoader.class);
private static Pattern NAME_PATTERN = Pattern.compile("<title>(.*) - 网易云音乐</title>");
private static Pattern DESCRIPTION_PATTERN = Pattern.compile("<meta name=\"description\" content=\"(.*)\" />");
private static Pattern IMAGE_PATTERN = Pattern.compile("<meta property=\"og:image\" content=\"(.*)\" />");
@Resource
private SongDao songDao;
@Resource
private UserDao userDao;
@Resource
private CommentDao commentDao;
@Resource
private ProxyDao proxyDao;
@Autowired
private HttpConfig httpConfig;
/**
* 获取歌曲和评论信息
*
* @param start 歌曲id起始
* @param end 歌曲id结束
*/
public void loadSongs(int start, int end) throws Exception {
for(int songId = start; songId <= end; songId++) {
Proxy proxy = getRandomProxy();
Song songInfo = getSongInfo(songId, proxy, httpConfig.getSongLoadTimeout());
if (songInfo == null) { // 重试一次
proxyDao.delete(proxy.getId());
proxy = getRandomProxy();
songInfo = getSongInfo(songId, proxy, httpConfig.getSongLoadTimeout());
}
if (songInfo == null) { // 重试一次
proxyDao.delete(proxy.getId());
proxy = getRandomProxy();
songInfo = getSongInfo(songId, proxy, httpConfig.getSongLoadTimeout());
}
if (songInfo == null) {
proxyDao.delete(proxy.getId());
logger.error("songInfo is null. songId={}", songId);
}
if (songInfo != null) {
loadCommentInfo(songInfo, proxy);
}
TimeUnit.MILLISECONDS.sleep(50);
}
}
/**
* 获取歌曲信息
*
* @param songId 歌曲id
* @return 如果找到,返回:歌曲名称 - 歌手名称;否则返回null
*/
public Song getSongInfo(long songId, Proxy proxy, int timeoutInSecond) throws Exception {
Response<String> response = HttpProxyClientFactory.musicClient(proxy, timeoutInSecond)
.songInfo(songId).execute();
Preconditions.checkNotNull(response);
String songInfo = response.body();
Preconditions.checkArgument(StringUtils.isNotBlank(songInfo));
Matcher nameMatcher = NAME_PATTERN.matcher(songInfo);
Matcher descriptionMatcher = DESCRIPTION_PATTERN.matcher(songInfo);
Matcher imageMatcher = IMAGE_PATTERN.matcher(songInfo);
String name = "";
String description = "";
String image = "";
if (nameMatcher.find()) {
name = nameMatcher.group(1);
}
if (descriptionMatcher.find()) {
description = descriptionMatcher.group(1);
}
if (imageMatcher.find()) {
image = imageMatcher.group(1);
}
if (StringUtils.isNotBlank(name)) {
Song song = new Song();
song.setSongId(songId);
song.setName(name);
song.setDescription(description);
song.setImage(image);
song.setCreateTime(new Date());
logger.info("song={}", song);
return song;
}
return null;
}
/**
* 获取歌曲评论
* @param song 歌曲信息
* @param proxy
*/
private void loadCommentInfo(Song song, Proxy proxy) throws Exception {
Long songId = song.getSongId();
String params = "flQdEgSsTmFkRagRN2ceHMwk6lYVIMro5auxLK/JywlqdjeNvEtiWDhReFI+QymePGPLvPnIuVi3dfsDuqEJW204VdwvX+gr3uiRBeSFuOm1VUSJ1HqOc+nJCh0j6WGUbWuJC5GaHTEE4gcpWXX36P4Eu4djoQBzoqdsMbCwoolb2/WrYw/N2hehuwBHO4Oz";
String encSecKey = "0263b1cd3b0a9b621a819b73e588e1cc5709349b21164dc45ab760e79858bb712986ea064dbfc41669e527b767f02da7511ac862cbc54ea7d164fc65e0359962273616e68e694453fb6820fa36dd9915b2b0f60dadb0a6022b2187b9ee011b35d82a1c0ed8ba0dceb877299eca944e80b1e74139f0191adf71ca536af7d7ec25";
Response<CommentResponseBody> response = HttpProxyClientFactory.musicClient(proxy, httpConfig.getSongLoadTimeout()).comment(songId, params, encSecKey).execute();
long commentCount = 0;
if (response != null) {
CommentResponseBody commentResponseBody = response.body();
if (commentResponseBody != null) {
commentCount = commentResponseBody.getTotal();
List<CommentResponse> hotComments = commentResponseBody.getHotComments();
if (CollectionUtils.isNotEmpty(hotComments)) {
for (CommentResponse commentResponse : hotComments) {
Long userId = 0L;
User user = commentResponse.getUser();
if (user != null) {
userId = user.getUserId();
userDao.save(user);
}
Long beRepliedUserId = 0L;
String beRepliedContent = "";
List<CommentReplied> beRepliedList = commentResponse.getBeReplied();
if (CollectionUtils.isNotEmpty(beRepliedList)) {
CommentReplied commentReplied = beRepliedList.get(0);
User repliedUser = commentReplied.getUser();
if (repliedUser != null) {
beRepliedUserId = repliedUser.getUserId();
userDao.save(repliedUser);
}
beRepliedContent = commentReplied.getContent();
}
if (commentCount > 1000) {
// 总评论数少于1000的歌曲就扔掉了
Comment comment = new Comment();
comment.setCommentId(commentResponse.getCommentId());
comment.setSongId(songId);
comment.setLikedCount(commentResponse.getLikedCount());
comment.setContent(commentResponse.getContent());
comment.setUserId(userId);
comment.setBeRepliedUserId(beRepliedUserId);
comment.setBeRepliedContent(beRepliedContent);
comment.setCommentTime(new Date(commentResponse.getTime()));
comment.setCreateTime(new Date());
commentDao.save(comment);
}
}
}
}
}
song.setCommentCount(commentCount);
songDao.save(song);
}
/**
* 查询一个随机代理
* @return
*/
private Proxy getRandomProxy() {
int count = proxyDao.count();
return proxyDao.getByIndex(ThreadLocalRandom.current().nextInt(count));
}
}