Building an automated YouTube downloader in Python: the code

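In accordance with savefrom's terms:
this example and tutorial are for learning and discussion only; all rights belong to savefrom.net.
The final code plus comments comes to about 100 lines. The code on GitHub is authoritative (bugs may be fixed there); this article only walks through the details.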

Project location

GitHub repository

Approach

Building an automated YouTube downloader in Python: the approach

Walkthrough

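1. POST

Following the first step of the approach article, we first need to fetch the encrypted JS payload with a POST request. I use the third-party requests library for this; for background on web scraping, see my earlier articles.

i. Format the headers for the POST first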

    # set the headers, or the website will not return any information
    # you may need to change the cookies here
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}

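The cookie value may need to be changed; ideally, copy the one your own browser sends. What each header means is outside the scope of this article; a web search will explain them.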
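
Since the cookie usually has to come from your own browser, it can help to keep it as a dict and rebuild the header string from that. A minimal sketch, assuming you copied the raw cookie string from the browser's developer tools; the helper names parse_cookie_string and build_cookie_header are hypothetical and not part of the original script:

```python
def parse_cookie_string(raw):
    """Split a raw 'k1=v1; k2=v2' cookie header into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # partition keeps any '=' inside the value intact
        key, _, value = pair.partition("=")
        cookies[key] = value
    return cookies


def build_cookie_header(cookies):
    """Join a dict back into the 'k1=v1; k2=v2' form used in headers."""
    return "; ".join(f"{k}={v}" for k, v in cookies.items())


raw = "lang=en; country=CN; x-requested-with=; _gat_helperWidget=1"
cookies = parse_cookie_string(raw)
cookies["country"] = "US"  # example edit; the value is purely illustrative
print(build_cookie_header(cookies))
```

The resulting string can then be placed under the "cookie" key of headers.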

ii. Then format the parameters as well

    # set the parameters, which we can copy from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}

The sf_url field is the URL of the YouTube video we want to download; the other parameters stay unchanged.

iii. Finally, send the POST request with requests

    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()

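Note that the parameters are passed as data=kv, so requests sends them form-encoded in the request body, matching the content-type header above.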
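
To see roughly what data=kv puts in the request body, the standard library can produce the same kind of form encoding. This is only an illustration with a shortened parameter dict; it does not contact savefrom.net, and it assumes requests applies an equivalent encoding when given a dict:

```python
from urllib.parse import urlencode, parse_qsl

kv = {"sf_url": "https://www.youtube.com/watch?v=YPvtz1lHRiw",
      "sf_submit": "",
      "new": "1",
      "lang": "en"}

# This is the form-encoded text that would travel in the POST body
body = urlencode(kv)
print(body)

# Decoding it again gives back the same key/value pairs
decoded = dict(parse_qsl(body, keep_blank_values=True))
print(decoded == kv)
```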

iv. Wrap it in a function

import requests

def gethtml(url):
    # set the headers, or the website will not return any information
    # you may need to change the cookies here
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameters, which we can copy from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text

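2. Call the decryption function

i. Analysis

The hard part here is running JavaScript from Python. Solutions available online include PyV8 and others; this article uses execjs. In the approach article we saw that the last few lines of the JS hold the decryption call, so we only need to run the whole script once in execjs and then evaluate the decryption call on its own.

ii. Extract the JS part first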

    # target URL (the YouTube video address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the JavaScript part (the encrypted information is stored in it)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]

A regular expression would also work here, but since I am not yet fluent with regular expressions I simply used split.
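
For comparison, the regular-expression variant could look like the sketch below. It runs on a made-up HTML string rather than the real response and is meant to return the same text as the split version for the first inline script:

```python
import re

# Toy response standing in for the real page; the content is made up
html = ('<html><script type="text/javascript">\n'
        'var a = 1;\n'
        '</script><script type="text/javascript">var b = 2;</script></html>')

# Non-greedy match up to the first closing tag; re.S lets '.' span newlines
pattern = re.compile(r'<script type="text/javascript">(.*?)</script>', re.S)
match = pattern.search(html)
if match is None:
    raise ValueError("no inline script found in the response")
js = match.group(1)
print(repr(js))
```

Unlike the split version, which fails with an IndexError when the tag is missing, this raises an error that names the problem.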

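iii. Take the first decryption call as the one we use

If you fetch the results for several different videos, you will find that the decryption function differs each time, but it always sits at the same line position.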

    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[-3].split(";")[0] + ";"

So name holds our decryption statement (not the best variable name, admittedly).
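
To make the indexing concrete, here is the same extraction run on a toy script. The function name decode and its argument are invented for illustration; the real script uses different, obfuscated names. The sketch also derives the bare call expression, using split("=", 1) so that an = inside the argument survives, which is a deliberate deviation from a plain split("="):

```python
# Toy script; the function name and argument are invented for illustration
reo = ("(function(){\n"
       "var helper = 1;\n"
       "var out=decode('abc=');window.use(out);\n"
       "})();\n")

reA = reo.split("\n")
# The statement sits on the third line from the end (the last item is empty)
name = reA[-3].split(";")[0] + ";"

# Keep everything after the first '=' so an '=' inside the argument survives
expression = name.split("=", 1)[1].rstrip(";")
print(name)
print(expression)
```

This still assumes the argument contains no semicolon, since the statement itself is cut at the first ";".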

iv. Execute with execjs

    # use execjs to compile the JS code
    ct = execjs.compile(reo)
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

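Taking only the part after = and dropping the semicolon means we evaluate just the function call, without the assignment; running the assignment plus decryption first and then reading the variable would also work.
However, this immediately raises an error (if only it were that easy).

1. this, i.e. the window variable, does not exist

If I remember correctly, the error mentions this or $b. I tried removing every this, and also wrapping everything in a class (so that this would refer to that class), but neither worked. I then found that the npm package jsdom can emulate the window variable inside execjs (there is probably a better way), so we need to install npm and the jsdom package, then rewrite the code above: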

    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to run the JS code; cwd is the output of `npm root -g` (npm's global modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')

Here:

  • the cwd argument is the output of npm root -g, i.e. npm's global modules path
  • addition is the preamble that emulates window

But then we hit the next error.
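
Rather than hard-coding the path, the cwd value could be looked up at run time. A minimal sketch, assuming npm is on the PATH; the helper name npm_global_root is hypothetical, and it returns None when npm cannot be found or the command fails:

```python
import shutil
import subprocess


def npm_global_root():
    """Return the output of `npm root -g`, or None if npm is unavailable."""
    npm = shutil.which("npm")  # also resolves npm.cmd on Windows
    if npm is None:
        return None
    try:
        done = subprocess.run([npm, "root", "-g"], stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return done.stdout.decode(errors="replace").strip() or None


print(npm_global_root())
```

The returned string could then be passed as cwd to execjs.compile in place of the literal path.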

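2. alert does not exist

This error occurs because calling alert under execjs is meaningless: there is no browser to show the popup. Moreover, alert is normally defined on window, and we have substituted our own window. So we override alert at the start of the code (in effect, defining an alert ourselves).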

    # override the alert function, because the code calls it in one place;
    # alerting is meaningless in execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
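
The override is a plain string substitution, so it is worth knowing that str.replace without a count rewrites every "(function(){" it finds. The sketch below shows the difference on an invented two-wrapper script; whether the real script needs all wrappers or only the first one patched is not something this toy can tell:

```python
# Toy script standing in for the downloaded JS; the content is invented
reo = "(function(){\nalert('hi');\n})();\n(function(){\nvar x = 1;\n})();"

stub = "(function(){\nthis.alert=function(){};"
# count=1 patches only the first wrapper; without it, all wrappers are patched
patched_first = reo.replace("(function(){", stub, 1)
patched_all = reo.replace("(function(){", stub)

print(patched_first.count("this.alert"))
print(patched_all.count("this.alert"))
```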

v. Put the code together

    # target URL (the YouTube video address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the JavaScript part (the encrypted information is stored in it)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override the alert function, because the code calls it in one place;
    # alerting is meaningless in execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[-3].split(";")[0] + ";"
    # add jsdom to the execjs context because the code needs it (there may be a solution without jsdom, but I do not know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to run the JS code; cwd is the output of `npm root -g` (npm's global modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

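3. Analyze the decrypted result

i. Extract the key JSON

After running the code above, the decrypted result is stored in text. As the approach article showed, what really matters to us is the JSON passed to window.parent.sf.videoResult.show(), so we extract that JSON with a regular expression.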

    # get the result as JSON text (group 1 is the part captured inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)

ii. Parse the JSON

Python has many libraries that can parse JSON; here I use the standard json module (remember to import it).

    # use `json` to load the JSON text
    j = json.loads(result)
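
The extraction and parsing steps can be tried without the site by feeding them a mock string. The text below is invented and much shorter than a real decrypted result; only the show(...);; wrapper and the "url" list follow the article. The sketch reads the capture group directly and checks for a failed match first:

```python
import json
import re

# Mock of the decrypted text; the real one is longer and the values differ
text = ('window.parent.sf.videoResult.show('
        '{"url": [{"url": "https://example.com/a", "quality": "720"},'
        ' {"url": "https://example.com/b", "quality": "360"}]});;')

match = re.search(r'show\((.*?)\);;', text, re.I | re.M)
if match is None:
    raise ValueError("show(...) call not found in decrypted text")
# group(1) is only the part captured inside the parentheses
j = json.loads(match.group(1))
print(j["url"][1]["quality"])
```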

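iii. Get the download URL

Now for the last step. From the approach article and a JSON formatter we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (the entries differ in resolution and type).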

    # the selection of the video format (in this case, num=1 means the video is:
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
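
Since a fixed index depends on the order of the list, the entry could also be chosen by its fields, and the download step left open above could be filled in with the standard library. A sketch under assumptions: pick_format and download are hypothetical helpers, the mock values are invented, and only the field names "url", "quality" and "type" come from the comments above:

```python
import shutil
import urllib.request


def pick_format(j, quality, type_):
    """Return the first entry matching quality and type, or None."""
    for entry in j.get("url", []):
        if str(entry.get("quality")) == quality and \
                str(entry.get("type", "")).lower() == type_.lower():
            return entry
    return None


def download(downurl, path):
    """Stream the response body to a file without loading it into memory."""
    with urllib.request.urlopen(downurl) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)


# Mock data shaped like the fields named above; the values are invented
j = {"url": [{"url": "https://example.com/hd", "quality": "720", "type": "MP4"},
             {"url": "https://example.com/sd", "quality": "360", "type": "MP4"}]}
entry = pick_format(j, "360", "mp4")
print(entry["url"])
# download(entry["url"], "video.mp4")  # uncomment with a real link
```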

4. Full code

# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
#           - windows 10
#           - python 3.6.2
# @Dependence:
#           - jsdom in npm(windows also can use)
#           - requests, execjs, re, json in python
import requests
import execjs
import re
import json


def gethtml(url):
    # set the headers, or the website will not return any information
    # you may need to change the cookies here
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameters, which we can copy from Chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text


if __name__ == '__main__':
    # target URL (the YouTube video address)
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the JavaScript part (the encrypted information is stored in it)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override the alert function, because the code calls it in one place;
    # alerting is meaningless in execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps us find the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[-3].split(";")[0] + ";"
    # add jsdom to the execjs context because the code needs it (there may be a solution without jsdom, but I do not know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to run the JS code; cwd is the output of `npm root -g` (npm's global modules path on your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # get the result as JSON text (group 1 is the part captured inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # use `json` to load json
    j = json.loads(result)
    # the selection of the video format (in this case, num=1 means the video is:
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -

  • 102 lines in total
  • Development environment
# @Environment:
#           - windows 10
#           - python 3.6.2
  • Dependencies
# @Dependence:
#           - jsdom in npm(windows also can use)
#           - requests, execjs, re, json in python

-end-

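For web scraping
Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Author: //www.cnblogs.com/Eritque-arcus/ and //blog.csdn.net/qq_40832960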