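Downloading Tencent Classroom Replay Videos with a Python Crawler

After scraping the replay audio from Changjiang Rain Classroom (长江雨课堂), I wanted to try scraping Tencent Classroom replay videos as well, as a crawler learning exercise.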

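Scraping analysis

  1. Preface

    • Difficulties

      (1) Supplying plskey and pskey (both are given in the cookie)

      (2) Decrypting the video with the key obtained through the edk

    • Note

      To download videos with this code, change tid (term_id) in the code.

  2. Get the course information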

    • url = "https://ke.qq.com/cgi-proxy/agency/exp/get_replay_list_to_c?tid={}&need_recording=0&page_idx=0&page_size=0&need_all=1&role_type=2&bkn=658893395&r=0.4397".format(tid)
    • The main goal is to get the file_id of each video in the course
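
As a minimal sketch of this step, the helper below pulls the (task_name, file_id) pairs out of an already-parsed response. The field names are the ones the full code reads (retcode, result, replay_info_list, task_name, file.file_id); the function name extract_file_ids and the sample values are illustrative assumptions, not part of the original code.

```python
def extract_file_ids(course_info):
    """Return (task_name, file_id) pairs from a parsed replay-list response."""
    if int(course_info.get("retcode", -1)) != 0:
        raise ValueError("replay list request failed")
    return [
        (item["task_name"], item["file"]["file_id"])
        for item in course_info["result"]["replay_info_list"]
    ]


# Hand-made sample shaped like the fields the full code reads (placeholder values).
sample = {
    "retcode": 0,
    "result": {
        "tid": "100839551",
        "replay_info_list": [
            {
                "task_id": "1",
                "task_name": "Lesson 1",
                "bg_time": "1586249333",
                "file": {"file_id": "5285890800895871695"},
            },
        ],
    },
}

print(extract_file_ids(sample))
```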
  3. Use the file_id to get the URL parameters for the file information

    • url = 'https://ke.qq.com/cgi-bin/qcloud/get_token?term_id={}&fileId={}&bkn=658893395&t=0.4467'.format(tid, file_id)
    • The main goal is to get the video's four parameters: exper, sign, t, us
  4. Get the video file address

    • url = 'https://playvideo.qcloud.com/getplayinfo/v2/1258712167/{}?exper={}&sign={}&t={}&us={}'.format(file_id, video_param['result']['exper'],video_param['result']['sign'],video_param['result']['t'],video_param['result']['us'])
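    • The main goal is to get the address of the video's m3u8 file
    • A single video has addresses for several quality levels; the code picks the highest quality by default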
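
A small sketch of steps 3 and 4 under a few assumptions: build_playinfo_url and pick_highest_quality are hypothetical helper names, app_id is just a label for the 1258712167 path component, and the highest quality is assumed to be the last transcodeList entry, which is what the full code relies on.

```python
from urllib.parse import urlencode


def build_playinfo_url(file_id, token_result, app_id="1258712167"):
    """Build the getplayinfo URL from the four get_token parameters."""
    query = urlencode({k: token_result[k] for k in ("exper", "sign", "t", "us")})
    return "https://playvideo.qcloud.com/getplayinfo/v2/{}/{}?{}".format(
        app_id, file_id, query
    )


def pick_highest_quality(video_info):
    """Return the m3u8 URL of the last transcodeList entry.

    Assumption: entries are ordered from lowest to highest quality.
    """
    return video_info["videoInfo"]["transcodeList"][-1]["url"]


# Placeholder values shaped like the responses described above.
token_result = {"exper": 0, "sign": "abc123", "t": "5eb4caaa", "us": "42"}
print(build_playinfo_url("5285890800895871695", token_result))

video_info = {"videoInfo": {"transcodeList": [{"url": "low.m3u8"}, {"url": "high.m3u8"}]}}
print(pick_highest_quality(video_info))
```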
  5. Get the masterPlayList

    • https://1258712167.vod2.myqcloud.com/fb8e6c92vodtranscq1258712167/c76dde7c5285890800895871695/drm/voddrm.token.dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw==.master_playlist.m3u8?t=5eb4caaa&exper=0&us=8708789871727437569&sign=bb387e6ca1dfb28451dbb224d41f1bcf
    • dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw== is a Base64-encoded string whose main contents are plskey and pskey
    • The masterPlayList.m3u8 file contains the m3u8 address for each quality level.
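
The token can be sketched as a plain Base64 round trip. The field layout (uin, vod_type, cid, term_id, plskey, pskey, joined with semicolons) follows the string the full code builds; build_token and decode_token are hypothetical helper names and the values below are placeholders.

```python
import base64


def build_token(uin, cid, term_id, plskey, pskey):
    """Encode the semicolon-separated fields as a Base64 token."""
    raw = "uin={};vod_type=0;cid={};term_id={};plskey={};pskey={}".format(
        uin, cid, term_id, plskey, pskey
    )
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def decode_token(token):
    """Decode a token back into a dict of its fields."""
    raw = base64.b64decode(token).decode("utf-8")
    # Split on the first "=" only, because plskey/pskey values contain "=" themselves.
    return dict(field.split("=", 1) for field in raw.split(";"))


token = build_token("123", "456", "789", "p_lskey=abc", "p_skey=def")
print(token)
print(decode_token(token))
```

Decoding a token copied from a real URL this way is a quick check that the cookie values were picked up correctly.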
  6. Download the m3u8 file of the highest-quality video

    • https://1258712167.vod2.myqcloud.com/fb8e6c92vodtranscq1258712167/c76dde7c5285890800895871695/drm/voddrm.token.dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw==.v.f30741.m3u8?t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19

    • The file content looks like this:

      #EXTINF:9.999,
      v.f30741.ts?start=596637520&end=597994143&type=mpegts&t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19
      #EXT-X-KEY:METHOD=AES-128,URI="https://ke.qq.com/cgi-bin/qcloud/get_dk?edk=CiA3PFgfG%2BIQ7set2C1%2FAWxyVYHDD6T%2FukE95OnjE8BwRhCO08TAChiaoOvUBCokOTMyNDg4YmItOWZjYS00MzFiLWJiYjItNjFmMDhjYjNlYmM3&fileId=5285890800895871695&keySource=VodBuildInKMS&token=dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw%3D%3D",IV=0x00000000000000000000000000000000
      #EXTINF:9.999,
      v.f30741.ts?start=597994144&end=599294735&type=mpegts&t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19
      #EXT-X-KEY:METHOD=AES-128,URI="https://ke.qq.com/cgi-bin/qcloud/get_dk?edk=CiA3PFgfG%2BIQ7set2C1%2FAWxyVYHDD6T%2FukE95OnjE8BwRhCO08TAChiaoOvUBCokOTMyNDg4YmItOWZjYS00MzFiLWJiYjItNjFmMDhjYjNlYmM3&fileId=5285890800895871695&keySource=VodBuildInKMS&token=dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw%3D%3D",IV=0x00000000000000000000000000000000
      #EXTINF:9.999,
      v.f30741.ts?start=599294736&end=600615071&type=mpegts&t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19
      #EXT-X-KEY:METHOD=AES-128,URI="https://ke.qq.com/cgi-bin/qcloud/get_dk?edk=CiA3PFgfG%2BIQ7set2C1%2FAWxyVYHDD6T%2FukE95OnjE8BwRhCO08TAChiaoOvUBCokOTMyNDg4YmItOWZjYS00MzFiLWJiYjItNjFmMDhjYjNlYmM3&fileId=5285890800895871695&keySource=VodBuildInKMS&token=dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw%3D%3D",IV=0x00000000000000000000000000000000
      #EXTINF:2.286,
      v.f30741.ts?start=600615072&end=600980175&type=mpegts&t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19
      #EXT-X-ENDLIST

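      Here the code reads the 2nd-to-last and 4th-to-last lines directly: the 2nd-to-last line is the address of the final video segment, and the 4th-to-last line contains the URL of the edk key.

      Changing start=600615072 in the final segment's address to start=0 then fetches the entire video.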
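
The line-picking described above can be sketched as follows. It assumes the playlist ends exactly as in the sample (key line, EXTINF, segment, ENDLIST); parse_playlist_tail is a hypothetical helper name, and the sample playlist uses shortened placeholder URLs.

```python
import re


def parse_playlist_tail(playlist_text):
    """Return (key_url, whole_file_segment) from the last lines of a playlist."""
    lines = playlist_text.strip().splitlines()
    key_line, last_segment = lines[-4], lines[-2]
    match = re.search(r'URI="([^"]+)"', key_line)
    if match is None:
        raise ValueError("no EXT-X-KEY URI found on the expected line")
    # Rewriting start=... to start=0 turns the last byte range into the whole file.
    whole_file = re.sub(r"start=\d+", "start=0", last_segment, count=1)
    return match.group(1), whole_file


sample_playlist = """#EXTINF:9.999,
v.f30741.ts?start=0&end=99&type=mpegts
#EXT-X-KEY:METHOD=AES-128,URI="https://example.invalid/get_dk?edk=abc",IV=0x00000000000000000000000000000000
#EXTINF:2.286,
v.f30741.ts?start=100&end=199&type=mpegts
#EXT-X-ENDLIST
"""

key_url, segment = parse_playlist_tail(sample_playlist)
print(key_url)
print(segment)
```

Matching the URI with a regular expression rather than fixed slice offsets is a design choice of this sketch; it fails loudly if the expected line is not an EXT-X-KEY line.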

  7. Get the video encryption key (edk)

    • https://ke.qq.com/cgi-bin/qcloud/get_dk?edk=CiA3PFgfG%2BIQ7set2C1%2FAWxyVYHDD6T%2FukE95OnjE8BwRhCO08TAChiaoOvUBCokOTMyNDg4YmItOWZjYS00MzFiLWJiYjItNjFmMDhjYjNlYmM3&fileId=5285890800895871695&keySource=VodBuildInKMS&token=dWluPTMwMzI5NjQ1MTg7dm9kX3R5cGU9MDtjaWQ9MTAxMjM5NDt0ZXJtX2lkPTEwMDgzOTU1MTtwbHNrZXk9cF9sc2tleT0wMDA0MDAwMDFmMWYyMjFkZGI1NzlkN2EzNmM4NjhjOGNmZjZlMGQwYTM2NTliZGZlNWE1ZGYxMTc5MDljZDVmZTgyZGU2MTY4MWY2ODA0Y2UzZWE0MGVmO3Bza2V5PXBfc2tleT1SN0h6Yyp3ZTVpZHBvcjVNdGxVajFyc1dmU3pnYjVWSFk2N2dPR1RIY0hjXw%3D%3D
    • The key file returned for the edk is saved to a local folder here
  8. Download the encrypted video

    • https://1258712167.vod2.myqcloud.com/fb8e6c92vodtranscq1258712167/c76dde7c5285890800895871695/drm/v.f30741.ts?start=0&end=600980175&type=mpegts&t=5eb4cbd0&exper=0&us=3781125914949347017&sign=dd6e77288a570373aa881c3ffa06fc19
    • Because requests.get is used to download the video, the download is fairly slow; you can download this video directly with IDM or FDM instead
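
If you stay with requests, a streamed download at least avoids holding the whole video in memory; it does not by itself make the transfer faster. The sketch below is one possible shape, with save_chunks and download as hypothetical helper names and the chunk size and timeout chosen arbitrarily.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=1 << 20):
    """Stream url to path in chunk_size pieces instead of one big read."""
    import requests  # imported here so save_chunks works without requests installed

    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        return save_chunks(resp.iter_content(chunk_size=chunk_size), path)
```

Usage would be download(video_url, filepath), with video_url and filepath as in the full code.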
  9. Decrypt the encrypted video with the edk key

    # Read the AES key that was saved from the get_dk response
    with open(pathFolder + "get_dk", 'rb') as f:
        key = f.read()
    # The playlist declares IV=0x00...00, i.e. 16 zero bytes (not ASCII '0' characters)
    iv = bytes(16)
    with open(filepath, 'rb') as f:
        data = f.read()
    with open(pathFolder + os.path.basename(video_url.split('?')[0]), 'wb') as ff:
        cipher = AES.new(key, AES.MODE_CBC, iv)
        plain = cipher.decrypt(data)
        ff.write(plain)

    The above is the core decryption code; it is mainly adapted from tutorials found online.
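
The EXT-X-KEY lines in step 6 carry IV=0x00000000000000000000000000000000, while AES.new expects the IV as raw bytes. A minimal sketch of that conversion is shown below; hls_iv_to_bytes is a hypothetical helper name. Note that 16 zero bytes and 16 ASCII '0' characters are different values.

```python
import re


def hls_iv_to_bytes(key_line):
    """Extract the IV attribute of an #EXT-X-KEY line as 16 raw bytes."""
    match = re.search(r"IV=0[xX]([0-9a-fA-F]{32})", key_line)
    if match is None:
        raise ValueError("no 128-bit IV attribute found")
    return bytes.fromhex(match.group(1))


line = '#EXT-X-KEY:METHOD=AES-128,URI="https://example.invalid/get_dk",IV=0x00000000000000000000000000000000'
iv = hls_iv_to_bytes(line)
print(iv == bytes(16))
print(iv == b"0000000000000000")
```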

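Full code

  • You do not need to supply a cookie; the code already contains one

  • Only tid (term_id) needs to be changed to download a different course

  • replay_info_list = replay_info_list[0:1] # controls which lessons are downloaded. This line sits in the middle of the code and is commented out; find it and uncomment it to choose which lessons to download.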

import base64
import json
import os
import re
import sys
import time

import requests
from Crypto.Cipher import AES

tid = ""  # term_id
if tid == "":
    print("Please provide tid (edit the code)")
    sys.exit()

header = {
    'referer': 'https://ke.qq.com/webcourse/index.html',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}

header1 = {}
header1['cookie'] = 'pgv_pvi=4293054464; pgv_pvid=5788047360; RK=G6x822k8bh; ptcz=914ed5ce06c276aa4953ea7500df39a67fb1df2eb190e9f69ad26cad54b020bb; tvfe_boss_uuid=ba32b6d764747241; _ga=amp-gE1bWJc9yhcUHO9BAikFmA; eas_sid=K1y5l6C477x4G97818d0F5i8L0; ied_qq=o3032964558; XWINDEXGREY=0; psrf_qqrefresh_token=FA06D9F8767666CA0604B4B544856CE3; psrf_qqunionid=CCF53FAF477ACC91C2927167C3818644; psrf_qqopenid=DF48DD6C080208437163660565BEFDD1; psrf_qqaccess_token=D7C1D536872833A8DF266B5C84EF5A76; psrf_access_token_expiresAt=1588309043; ts_uid=5957681050; localInterest=[2002]; ts_refer=ADTAGCLIENT.QQ.5689_.0; index_new_key={"index_interest_cate_id":2002}; isHideDealTips=1; iswebp=1; lskey=00010000eb670dafeaa133c037b5dd4233aea70c75f77644b045478d9d660249789c310e4f41c347010f56b1; p_lskey=000400001f1f221ddb579d7a36c868c8cff6e0d0a3659bdfe5a5df117909cd5fe82de61681f6804ce3ea40ef; o_cookie=3032964558; pac_uid=1_3032964558; pgv_si=s5251174400; _qpsvr_localtk=0.7523651816720494; uin=o3032964558; p_uin=o3032964558; tdw_auin_data=-; tdw_data={"ver4":"4","ver6":"InClass","refer":"","from_channel":"","path":"r-0.3174627216615984","auin":"-","uin":3032964558,"real_uin":"3032964558"}; tdw_first_visited=1; skey=@8SOlG6S2o; tdw_data_new_2={"auin":"-","sourcetype":"","sourcefrom":"","ver9":"3032964558","uin":"3032964558","visitor_id":"608417608166375","ver10":"","url_page":"","url_module":"","url_position":""}; pgv_info=ssid=s5064814228; Hm_lvt_0c196c536f609d373a16d246a117fd44=1586068475,1586071931,1586141631,1586245229; tdw_data_testid=; tdw_data_flowid=; pt4_token=T5zeUSCO-rh0DbB23Qv9s-dbbKKk58Lgoeep2x9Qf90_; p_skey=R7Hzc*we5idpor5MtlUj1rsWfSzgb5VHY67gOGTHcHc_; ts_last=ke.qq.com/webcourse/index.html; Hm_lpvt_0c196c536f609d373a16d246a117fd44=1586249333'

# Pull the p_lskey and p_skey entries out of the cookie string
plskey = ''
pskey = ''
for i in header1['cookie'].split(";"):
    if 'p_skey' in i:
        pskey = i.strip()
    if 'p_lskey' in i:
        plskey = i.strip()
if len(plskey) == 0:
    print('plskey not found')
    sys.exit()
if len(pskey) == 0:
    print('pskey not found')
    sys.exit()

# Build the Base64 token that is spliced into the m3u8 file names
base64_code_raw = 'uin=3032964518;vod_type=0;cid=1012394;term_id={};plskey={};pskey={}'.format(tid, plskey, pskey)
base64_code = base64.b64encode(base64_code_raw.encode("utf-8")).decode("ascii")
print(base64_code)

url = "https://ke.qq.com/cgi-proxy/agency/exp/get_replay_list_to_c?tid={}&need_recording=0&page_idx=0&page_size=0&need_all=1&role_type=2&bkn=658893395&r=0.4397".format(tid)
if not os.path.exists('./{}'.format(tid)):
    os.mkdir('./{}'.format(tid))

resp = requests.get(url, headers=header)
course_info_dict = json.loads(resp.text)

print(course_info_dict)

if int(course_info_dict['retcode']) == 0:
    print('*' * 20)
    print('INFO: all lesson information:\n')
    print('-------------')
    for course in course_info_dict['result']["replay_info_list"]:
        print('task_id: {}\ntask_name:{}\nfileid:{}'.format(course['task_id'], course['task_name'], course['file']['file_id']))
        print('Time: {}'.format(time.strftime("%Y-%m-%d %a %H:%M:%S", time.localtime(int(course['bg_time'])))))
        print('-------------')
    print('\nINFO: finished listing lesson information!')
    print('*' * 20 + '\n')

    replay_info_list = course_info_dict['result']["replay_info_list"]
    tid = course_info_dict['result']['tid']
    #replay_info_list.reverse()

    #replay_info_list = replay_info_list[0:1] # controls which lessons are downloaded
    for info in replay_info_list:
        print(info)
        file_id = info['file']['file_id']
        task_name = info['task_name']
        task_id = info['task_id']
        bg_time = info['bg_time']
        timestamp = int(bg_time)
        time_local = time.localtime(timestamp)
        dt = time.strftime("%Y-%m-%d %H.%M.%S", time_local)
        pathFolder = './{}/{}_{}_{}/'.format(tid, task_id, task_name, dt)
        if not os.path.exists(pathFolder):
            os.mkdir(pathFolder)

        # Step 3: get the exper/sign/t/us parameters for this file
        url = 'https://ke.qq.com/cgi-bin/qcloud/get_token?term_id={}&fileId={}&bkn=658893395&t=0.4467'.format(tid, file_id)
        respp = requests.get(url)
        video_param = json.loads(respp.text)
        #print(video_param)
        #break

        if int(video_param['retcode']) != 0:
            print("Failed to get video parameters")
            break

        # Step 4: get the play info, which lists the m3u8 addresses
        url = 'https://playvideo.qcloud.com/getplayinfo/v2/1258712167/{}?exper={}&sign={}&t={}&us={}'.format(file_id, video_param['result']['exper'], video_param['result']['sign'], video_param['result']['t'], video_param['result']['us'])
        resp = requests.get(url)
        video_info = json.loads(resp.text)
        if int(video_info['code']) != 0:
            print("Failed to get video information")
            break
        #print(video_info)

        # Step 5: download the master playlist, with the token spliced into the file name
        m3u8List = video_info["videoInfo"]["transcodeList"]
        mastPlayList = video_info["videoInfo"]["masterPlayList"]
        url = mastPlayList["url"]
        code_prefix = os.path.dirname(mastPlayList["url"]) + "/" + "voddrm.token." + base64_code + "."
        print(code_prefix + os.path.basename(url))
        resp = requests.get(code_prefix + os.path.basename(url))
        with open(pathFolder + os.path.basename(mastPlayList["url"].split('?')[0]), 'wb') as f:
            f.write(resp.content)

        # Step 6: download the m3u8 file of the best-quality video
        #print(len(m3u8List))
        m3u8_i = m3u8List[-1]
        raw_url = m3u8_i['url']
        url = code_prefix + os.path.basename(raw_url)
        print(url)

        resp = requests.get(url)
        with open(pathFolder + os.path.basename(raw_url.split('?')[0]), 'wb') as f:
            f.write(resp.content)

        edk_url = ''
        video_url = ''

        with open(pathFolder + os.path.basename(raw_url.split('?')[0]), 'r', encoding='utf8') as f:
            lis = f.readlines()
        # 4th-to-last line: EXT-X-KEY with the key URL; 2nd-to-last line: final segment
        edk_url = lis[-4][0:-1].split(',')[1][5:-1]
        video_url = os.path.dirname(url) + '/' + lis[-2][0:-1]
        # Rewrite start=... to start=0 so the request covers the whole video
        temp = re.split(r"start=\d*?&", video_url)
        video_url = 'start=0&'.join(temp)
        print(edk_url)
        print(video_url)

        # Step 7: fetch and save the key
        resp = requests.get(edk_url)
        with open(pathFolder + "get_dk", 'wb') as f:
            f.write(resp.content)

        # Step 8: download the encrypted video
        print("Downloading video...")
        filepath = pathFolder + "raw_" + os.path.basename(video_url.split('?')[0])
        resp = requests.get(video_url)
        with open(filepath, 'wb') as f:
            f.write(resp.content)
        print("Download complete!")

        # Step 9: decrypt with AES-128-CBC
        print("Decrypting video...")
        with open(pathFolder + "get_dk", 'rb') as f:
            key = f.read()
        # The playlist declares IV=0x00...00, i.e. 16 zero bytes (not ASCII '0' characters)
        iv = bytes(16)
        with open(filepath, 'rb') as f:
            data = f.read()
        with open(pathFolder + os.path.basename(video_url.split('?')[0]), 'wb') as ff:
            cipher = AES.new(key, AES.MODE_CBC, iv)
            plain = cipher.decrypt(data)
            ff.write(plain)
        print("Video decrypted.")
        os.remove(filepath)
else:
    print('Failed to get course information')