Script Code Obfuscation, Python Edition: pyminifier (2)

November 6, 2019
Notes

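WeChat official account: 七夜安全博客, focusing on information security techniques and low-level system internals. For questions or suggestions, leave a message on the official account.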

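Preface

In the previous post we covered pyminifier's code minification and compression features. This second and final post covers its most important feature, code obfuscation, and walks through the obfuscation strategy the project uses. If you find it useful, please share it on your Moments. It runs to almost 5,000 characters, breaks down nearly every detail, and includes some of the key code, so it is on the long side; please be patient.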

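I. Obfuscation result

Before explaining the strategy, let's look at what the obfuscated output looks like. Nasty, isn't it? Haha. Comparing the output with my explanation below should make things sink in more deeply.

Original code

I wrote this snippet specifically so that it covers most commonly used syntax.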

import io
import tokenize

abvcddfdf = int("10")
def enumerate_keyword_args(tokens=None):
    keyword_args = {}
    inside_function = False
    dsfdsf,flag = inside_function,1
    a = str(flag)
    for index, tok in enumerate(tokens):
        token_type = tok[0]
        token_string = tok[1]
        a = str(token_string)
        b=a
        if token_type == tokenize.NAME:
            if token_string == "def":
                function_name = tokens[index+1][1]
                keyword_args.update({function_name: []})
            elif inside_function:
                if tokens[index+1][1] == '=': # keyword argument
                    print(api(text=token_string))
                    keyword_args[function_name].append(token_string)
def api(text):
    print(text)

def listified_tokenizer(source):
    io_obj = io.StringIO(source)
    return [list(a) for a in tokenize.generate_tokens(io_obj.readline)]

code = u'''
def api(text):
    print(text)
'''
abcd=1212
bcde=abcd
cdef=(abcd,bcde)
defg=[abcd,bcde,cdef]
efgh = {abcd:"cvcv","b":"12121"}
f12212="hhah"
f112122="hheeereah"
tokens_list = listified_tokenizer(code)
print(tokens_list)
enumerate_keyword_args(tokens_list)

Obfuscated result

#!/usr/bin/env python
#-*- coding:utf-8 -*-
흞=int
ݽ=None
ﮄ=False
ﻟ=str
嬯=enumerate
눅=list
import io
ﭢ=io.StringIO
import tokenize
ﰅ=tokenize.generate_tokens
ނ=tokenize.NAME

嘢 = 흞("10")
def ܪ(tokens=ݽ):
    蘩 = {}
    ﭷ = ﮄ
    dsfdsf,flag = ﭷ,1
    a = ﻟ(flag)
    for ﶨ, tok in 嬯(tokens):
        ﶗ = tok[0]
        ﯢ = tok[1]
        a = ﻟ(ﯢ)
        b=a
        if ﶗ == ނ:
            if ﯢ == "def":
                龉 = tokens[ﶨ+1][1]
                蘩.update({龉: []})
            elif ﭷ:
                if tokens[ﶨ+1][1] == '=': # keyword argument
                    print(ݖ(text=ﯢ))
                    蘩[龉].append(ﯢ)
def ݖ(ﲖ):
    print(ﲖ)

def ﰭ(source):
    د = ﭢ(source)
    return [눅(a) for a in ﰅ(د.readline)]

ﳵ = u'''
def api(text):
    print(text)
'''
횗=1212
ﮪ=횗
ﲊ=(횗,ﮪ)
딲=[횗,ﮪ,ﲊ]
ࢹ = {횗:"cvcv","b":"12121"}
ﮤ="hhah"
ﱄ="hheeereah"
狾 = ﰭ(ﳵ)
print(狾)
ܪ(狾)

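II. Obfuscation strategy

pyminifier's obfuscation strategy has five parts, targeting variable names, function names, class names, builtins, and external modules. Each one works in two steps: first determine what to obfuscate, then replace it with random characters.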
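To make the two-step idea concrete, here is a minimal sketch built on the standard tokenize module. It is not pyminifier's code: collect_assigned_names and rename_names are hypothetical helpers I made up for illustration, and the sketch deliberately ignores scopes, keyword arguments, and imports, which are exactly the hard cases the following sections deal with.

```python
import io
import tokenize


def collect_assigned_names(source):
    """Step 1: find NAME tokens that are directly followed by '='."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    names = []
    for current, following in zip(tokens, tokens[1:]):
        if (current.type == tokenize.NAME and following.string == "="
                and current.string not in names):
            names.append(current.string)
    return names


def rename_names(source, mapping):
    """Step 2: replace every NAME token that appears in mapping."""
    renamed = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = mapping.get(text, text)
        renamed.append((tok.type, text))
    # Given 2-tuples, untokenize() does not preserve the original spacing:
    # the layout changes, but the code stays valid.
    return tokenize.untokenize(renamed)


source = "total = 1\nresult = total + 2\n"
names = collect_assigned_names(source)
mapping = {name: "v%d" % i for i, name in enumerate(names)}
obfuscated = rename_names(source, mapping)
print(obfuscated)
```

The replacement names here are just v0, v1, and so on; pyminifier draws them from a shuffled character pool instead, which is covered near the end of this post.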

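1. Variable name obfuscation

Not every variable name can be obfuscated: the result has to stay safe to run, and if you go too far the program simply stops working. The function obfuscatable_variable filters the variable names and keeps only those that are safe to obfuscate.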

def obfuscatable_variable(tokens, index, ignore_length=False):

    tok = tokens[index]
    token_type = tok[0]
    token_string = tok[1]
    line = tok[4]
    if index > 0:
        prev_tok = tokens[index-1] # previous token
    else: # Pretend it's a newline (for simplicity)
        prev_tok = (54, '\n', (1, 1), (1, 2), '#\n')
    prev_tok_type = prev_tok[0]
    prev_tok_string = prev_tok[1]
    try:
        next_tok = tokens[index+1] # next token
    except IndexError: # Pretend it's a newline
        next_tok = (54, '\n', (1, 1), (1, 2), '#\n')
    next_tok_string = next_tok[1]
    if token_string == "=": # skip the tokens after an assignment =
        return '__skipline__'
    if token_type != tokenize.NAME: # not a name: ignore
        return None # Skip this token
    if token_string.startswith('__'): # leave names starting with __ alone, e.g. __init__
        return None
    if next_tok_string == ".": # names of already imported modules are ignored
        if token_string in imported_modules:
            return None
    if prev_tok_string == 'import': # imported package names are ignored
        return '__skipline__'
    if prev_tok_string == ".": # variables/functions of imported modules are ignored
        return '__skipnext__'
    if prev_tok_string == "for": # for-loop variables longer than 2 characters are obfuscated
        if len(token_string) > 2:
            return token_string
    if token_string == "for": # the for keyword is ignored
        return None
    if token_string in keyword_args.keys(): # function names are ignored
        return None
    if token_string in ["def", "class", 'if', 'elif', 'import']: # keywords are ignored
        return '__skipline__'
    if prev_tok_type != tokenize.INDENT and next_tok_string != '=':
        return '__skipline__'
    if not ignore_length:
        if len(token_string) < 3: # names shorter than 3 characters are ignored
            return None
    if token_string in RESERVED_WORDS: # reserved words are ignored too
        return None
    return token_string

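As the function shows, the following kinds of names are not obfuscated:

Tokens whose type is not tokenize.NAME, such as number, string, and operator tokens.
Names that start with __, such as __init__.
Imported third-party module names and their attributes; for example, after import os, os must not be changed.
for-loop variables whose names are 2 characters or shorter.
Function names (these get their own treatment, described later).
Keywords and reserved words, as well as names shorter than 3 characters.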
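One detail behind the filters above is easy to miss: the tokenizer gives keywords and identifiers the same token type, so checking for tokenize.NAME alone is not enough. The snippet below is a small illustration using only the standard library; the sample source line is made up.

```python
import io
import keyword
import tokenize

source = "for idx in items:\n    pass\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keywords and identifiers share the same token type (tokenize.NAME).
name_strings = [tok.string for tok in tokens if tok.type == tokenize.NAME]
print(name_strings)

# Only the non-keyword names are even candidates for renaming.
candidates = [name for name in name_strings if not keyword.iskeyword(name)]
print(candidates)
```

This is why obfuscatable_variable needs explicit checks for for, def, class and the RESERVED_WORDS list in addition to the token type.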

Once the names to obfuscate are known, they are replaced. This mainly involves the functions replace_obfuscatables and obfuscate_variable; the core code is:

if token_string == replace and prev_tok_string != '.': # not a variable of an imported module
    if inside_function: # inside a function
        if token_string not in keyword_args[inside_function]: # is it in the argument list?
            if not right_of_equal: # the line has no =, or the token is left of the =
                if not inside_parens: # the token is not inside ( )
                    return return_replacement(replacement) # e.g. a=123, or str in str.length()
                else:
                    if next_tok[1] != '=': # the token is inside ( ), e.g. text in api(text)
                        return return_replacement(replacement)
            elif not inside_parens: # right of the =, not inside ( ), e.g. b in a = b
                return return_replacement(replacement)
            else: # right of the =, inside ( )
                if next_tok[1] != '=': # e.g. a=api(text): text has to change
                    return return_replacement(replacement)
    elif not right_of_equal: # the line has no =, or the token is left of the =
        if not inside_parens:
            return return_replacement(replacement)
        else:
            if next_tok[1] != '=':
                return return_replacement(replacement)
    elif right_of_equal and not inside_parens: # e.g. b in a = b
        return return_replacement(replacement)

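The code above shows that variable renaming has to be scope-aware: a module-level variable and a variable inside a function are different things even when they share a name, so they are handled separately. Three variables are used to tell the cases apart:

inside_function means the variable is inside a function
right_of_equal means the variable is on the right-hand side of an =
inside_parens means the variable is inside ( )

You may wonder what right_of_equal and inside_parens are for. They exist to detect a parameter name being used as a keyword in a function call. For example: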

def api(text):
    print(text)

api(text="123")

In the call api(text="123"), the name text must not be obfuscated, otherwise the call raises an error.
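The failure is easy to reproduce. In the sketch below, xyz is a made-up replacement name standing in for whatever an obfuscator might generate; renaming only the keyword at the call site makes Python reject the call.

```python
def api(text):
    return text.upper()


# Calling with the real parameter name works.
print(api(text="abc"))

# If an obfuscator renamed only the keyword at the call site
# (the name "xyz" is invented for this example), the call breaks.
try:
    api(xyz="abc")
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```

A renamer therefore has to either leave keyword names alone, as the checks on next_tok[1] != '=' do, or rename the parameter and every keyword use of it consistently.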

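2. Function name obfuscation

obfuscatable_function determines which function names to obfuscate. The idea is simple: skip functions like __init__, and if the previous token is def, the current token is a function name.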

def obfuscatable_function(tokens, index, **kwargs):
    ......
    prev_tok_string = prev_tok[1]
    if token_type != tokenize.NAME:
        return None # Skip this token
    if token_string.startswith('__'): # Don't mess with specials
        return None
    if prev_tok_string == "def": # the current token is the function name
        return token_string

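Function names have to be replaced in two places: where the function is defined and where it is called. The definition is easy to identify. Calls fall roughly into two cases, static (class-level) functions and dynamic (instance) functions, and the main job is to confirm whether a replacement is needed. The code is in obfuscate_function: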

def obfuscate_function(tokens, index, replace, replacement, *args):

    def return_replacement(replacement):
        FUNC_REPLACEMENTS[replacement] = replace
        return replacement

    ......
    if token_string.startswith('__'):
        return None
    if token_string == replace:
        if prev_tok_string != '.':
            if token_string == replace: # function definition
                return return_replacement(replacement)
        else: # function call
            parent_name = tokens[index-2][1]
            if parent_name in CLASS_REPLACEMENTS: # classmethod
                # This should work for @classmethod methods
                return return_replacement(replacement)
            elif parent_name in VAR_REPLACEMENTS: # instance method
                # This covers regular ol' instance methods
                return return_replacement(replacement)

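At the end of the code, prev_tok_string decides between a definition and a call: prev_tok_string != '.' indicates a definition (strictly speaking, any occurrence that is not reached through attribute access, which also covers a plain call such as api(...)).

Whether parent_name is in CLASS_REPLACEMENTS or in VAR_REPLACEMENTS tells a static function from a dynamic one, but the code is somewhat redundant because both branches end up doing the same thing.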

3. Class name obfuscation

obfuscatable_class determines which class names to obfuscate; it only needs to check prev_tok_string == "class".

def obfuscatable_class(tokens, index, **kwargs):
    ......
    prev_tok_string = prev_tok[1]
    if token_type != tokenize.NAME:
        return None # Skip this token
    if token_string.startswith('__'): # Don't mess with specials
        return None
    # If the previous token is class, the current token is the class name
    if prev_tok_string == "class":
        return token_string

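For class name replacement the project takes a shortcut: it cannot obfuscate across modules or files. That restriction keeps things much simpler. The key code is in obfuscate_class, which basically replaces the name directly; nothing complicated.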

def obfuscate_class(tokens, index, replace, replacement, *args):

    def return_replacement(replacement):
        CLASS_REPLACEMENTS[replacement] = replace
        return replacement
    ......
    if prev_tok_string != '.': # cannot obfuscate across modules
        if token_string == replace:
            return return_replacement(replacement)

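4. builtin obfuscation

First the tokens are scanned for builtin functions and classes. The code ships with a builtins table, and enumerate_builtins compares each token against it to decide whether the token is a builtin.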

builtins = [
    'ArithmeticError',
    'AssertionError',
    'AttributeError',
    ......
    'str',
    'sum',
    'super',
    'tuple',
    'type',
    'unichr',
    'unicode',
    'vars',
    'xrange',
    'zip'
]

Builtins are obfuscated through assignment. For example, Python has the builtin str; the normal code looks like this:

sum = str(10)

After obfuscation:

xxxx= str
sum = xxxx(10)

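That is the principle. The implementation is in obfuscate_builtins, which turns every matching builtin function/class into an assignment and inserts those assignments at the front of the token list. One caveat: the new tokens must go after the interpreter line (#!/usr/bin/env python) and the encoding declaration ('# -*- coding: utf-8 -*-'), otherwise errors occur. The code: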

for tok in tokens[0:4]: # Will always be in the first four tokens
    line = tok[4]
    if analyze.shebang.match(line): # (e.g. '#!/usr/bin/env python')
        if not matched_shebang:
            matched_shebang = True
            skip_tokens += 1
    elif analyze.encoding.match(line): # (e.g. '# -*- coding: utf-8 -*-')
        if not matched_encoding:
            matched_encoding = True
            skip_tokens += 1
insert_in_next_line(tokens, skip_tokens, obfuscated_assignments)
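The same idea can be sketched at the line level instead of the token level. This is a simplified illustration and not pyminifier's implementation: insert_builtin_aliases and the two regular expressions are my own assumptions, and the alias s0 is a placeholder for a random name.

```python
import re

# Patterns for the two header lines that must stay at the top of the file.
SHEBANG = re.compile(r"^#!.*python")
ENCODING = re.compile(r"^#.*coding[:=]\s*[-\w.]+")


def insert_builtin_aliases(source, aliases):
    """Insert 'alias=builtin' lines after any shebang/encoding header."""
    lines = source.splitlines(keepends=True)
    skip = 0
    for line in lines[:2]:  # both headers can only appear in the first two lines
        if SHEBANG.match(line) or ENCODING.match(line):
            skip += 1
        else:
            break
    assignments = ["%s=%s\n" % (alias, builtin) for builtin, alias in aliases.items()]
    return "".join(lines[:skip] + assignments + lines[skip:])


source = "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nvalue = s0(10)\n"
result = insert_builtin_aliases(source, {"str": "s0"})
print(result)
```

If the alias line were placed above the encoding declaration, the declaration would no longer sit in the first two lines of the file and Python would stop honoring it, which is the error the original code guards against.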

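5. Third-party module and function obfuscation

For third-party modules and functions pyminifier again simplifies. The logic is in obfuscate_global_import_methods, and modules imported in either of the following two ways are ignored: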

import xxx as ppp
from xxx import ppp

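Only imports of the form import package are handled.

Enumerating modules

First, enumerate_global_imports enumerates all modules brought in with import. It ignores imports inside classes and functions and accepts only global imports. The core code: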

elif token_type == tokenize.NAME:
    if token_string in ["def", "class"]:
        function_count += 1
    if indentation == function_count - 1: # only equal once we have left the function
        function_count -= 1
    elif function_count >= indentation: # excludes imports inside functions and classes
        if token_string == "import":
            import_line = True
        elif token_string == "from":
            from_import = True
        elif import_line:
            if token_type == tokenize.NAME  and tokens[index+1][1] != 'as': # exclude import ... as
                if not from_import and token_string not in reserved_words: # exclude from ... import
                    if token_string not in imported_modules:
                        if tokens[index+1][1] == '.': # module.module
                            parent_module = token_string + '.'
                        else:
                            if parent_module:
                                module_string = (
                                    parent_module + token_string)
                                imported_modules.append(module_string)
                                parent_module = ''
                            else:
                                imported_modules.append(token_string)
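For comparison, the same selection rule (global, plain import only) can be expressed with the standard ast module. Note that this swaps in a different technique: pyminifier works on the token stream as shown above, while this sketch walks the syntax tree. The function name enumerate_plain_global_imports is hypothetical.

```python
import ast


def enumerate_plain_global_imports(source):
    """Return module names imported at module level via plain 'import x'."""
    modules = []
    for node in ast.parse(source).body:  # .body holds only top-level statements
        if isinstance(node, ast.Import):  # 'from x import y' is ast.ImportFrom
            for alias in node.names:
                if alias.asname is None and alias.name not in modules:
                    modules.append(alias.name)
    return modules


source = (
    "import os\n"
    "import os.path\n"
    "import json as js\n"
    "from sys import argv\n"
    "def helper():\n"
    "    import re\n"
)
print(enumerate_plain_global_imports(source))
```

Only os and os.path survive the filter: the aliased import, the from-import, and the import inside helper are all skipped, matching the three exclusions described above.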

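Traversing functions and obfuscating them

With the imported modules known, the tokens are traversed again to find the module functions called in the source file, which are then replaced through assignment just as before. For example, the original code: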

import os
os.path.exists("text")

The obfuscated code:

import os
ﳀ=os.path
ﳀ.exists("text")

The code that replaces the call itself is short. module_method looks like os.path, and this part produces ﳀ.exists("text"):

if token_string == module_method.split('.')[0]:
    if tokens[index+1][1] == '.':
        if tokens[index+2][1] == module_method.split('.')[1]:
            tokens[index][1] = replacement_dict[module_method]
            tokens[index+1][1] = ""
            tokens[index+2][1] = ""

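Next the replacement variable has to be defined, in the form ﳀ=os.path, and inserted below the import statement with insert_in_next_line. Note the index += 6 applied to the token index. The reason is simple: ﳀ=os.path\n turns into exactly 6 tokens.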

elif import_line:
    if token_string == module_method.split('.')[0]:
        # Insert the obfuscation assignment after the import
        ......
        else:
            line = "%s=%s\n" % ( # This ends up being 6 tokens
                replacement_dict[module_method], module_method)
        for indent in indents: # Fix indentation
            line = "%s%s" % (indent[1], line)
            index += 1
        insert_in_next_line(tokens, index, line)
        index += 6 # To make up for the six tokens we inserted
index += 1
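The count of six is easy to check with the standard tokenizer. In this small sketch the name x stands in for the random replacement name; the ENDMARKER token is filtered out because it marks the end of the input rather than belonging to the line.

```python
import io
import tokenize

line = "x=os.path\n"  # "x" stands in for the random replacement name
tokens = [
    tok for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type != tokenize.ENDMARKER  # end-of-input marker, not part of the line
]
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
print(len(tokens))
```

The six tokens are the name, the =, os, the dot, path, and the trailing NEWLINE.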

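Generating the obfuscation source

The strategies above give a rough picture of how pyminifier works, but one thing is still unexplained: where the replacement names come from. In other words, why does os.path get replaced by ﳀ in the line below?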

ﳀ=os.path

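Replacement names are generated in obfuscation_machine, which handles two cases.

Python 3 allows Unicode characters in identifiers, so there the data source is essentially Unicode characters, and the obfuscated code ends up full of letters from scripts all over the world, which is truly nasty to look at. For Python 2 the data source is upper- and lower-case ASCII letters. Once the data source exists, it is shuffled to make things more chaotic.

The code: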

# This generates a list of the letters a-z:
lowercase = list(map(chr, range(97, 123)))
# Same thing but ALL CAPS:
uppercase = list(map(chr, range(65, 90)))
if use_unicode:
    # Python 3 lets us have some *real* fun:
    allowed_categories = ('LC', 'Ll', 'Lu', 'Lo', 'Lu')
    # All the fun characters start at 1580 (hehe):
    big_list = list(map(chr, range(1580, HIGHEST_UNICODE)))
    max_chars = 1000 # Ought to be enough for anybody :)
    combined = []
    rtl_categories = ('AL', 'R') # AL == Arabic, R == Any right-to-left
    last_orientation = 'L'       # L = Any left-to-right
    # Find a good mix of left-to-right and right-to-left characters
    while len(combined) < max_chars:
        char = choice(big_list)
        if unicodedata.category(char) in allowed_categories:
            orientation = unicodedata.bidirectional(char)
            if last_orientation in rtl_categories:
                if orientation not in rtl_categories:
                    combined.append(char)
            else:
                if orientation in rtl_categories:
                    combined.append(char)
            last_orientation = orientation
else:
    combined = lowercase + uppercase
shuffle(combined) # Randomize it all to keep things interesting
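The two unicodedata calls used above can be tried on individual characters. The snippet below is only an illustration; the three sample characters are my own picks (the first two also appear in the obfuscated output earlier in this post).

```python
import unicodedata

for char in ("\u062f", "\u9f89", "a"):  # Arabic dal, a CJK ideograph, Latin a
    print(
        repr(char),
        unicodedata.category(char),       # general category, e.g. 'Lo' or 'Ll'
        unicodedata.bidirectional(char),  # 'AL'/'R' mean right-to-left
        char.isidentifier(),              # can it be a Python 3 name on its own?
    )

# A name built from such letters is a perfectly ordinary variable.
namespace = {}
exec("\u062f\u9f89 = 42", namespace)
print(namespace["\u062f\u9f89"])
```

Alternating right-to-left and left-to-right letters, as the loop above does, is what makes the rendered identifiers so hard to read in an editor.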

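With the data source in place, in what order are names produced?

This is where the permutations function comes in: it returns an iterator that yields arrangements of the source characters one after another.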

for perm in permutations(combined, identifier_length):
    perm = "".join(perm)
    if perm not in RESERVED_WORDS: # Can't replace reserved words
        yield perm
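Here is a self-contained version of the same generator pattern that can be run directly. It is a sketch under my own assumptions: name_source is a hypothetical name, the alphabet is left unshuffled so the output is predictable, and keyword.iskeyword stands in for pyminifier's RESERVED_WORDS list.

```python
import keyword
from itertools import islice, permutations
from string import ascii_lowercase


def name_source(alphabet, length):
    """Yield candidate identifiers of the given length, skipping keywords."""
    for perm in permutations(alphabet, length):
        name = "".join(perm)
        if not keyword.iskeyword(name):
            yield name


first_names = list(islice(name_source(ascii_lowercase, 2), 5))
print(first_names)
```

Because permutations never reuses an element, no character appears twice within one name, and every generated name is unique as long as the alphabet itself has no duplicates.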

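Summary

pyminifier is a decent starter project for learning script obfuscation, but don't use it in production: it has quite a few bugs and the obfuscation is not very strong. In upcoming posts I will continue covering script obfuscation techniques, not limited to Python.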
