Network traffic analysis for IR: Content deobfuscation
Introduction to obfuscation
Encoding and encryption techniques are used for a variety of purposes. Some of these are legitimate, like the use of encoding to enable passing of raw data in an ASCII-only protocol, while others are malicious.
Malware authors commonly make use of obfuscation technologies in their command-and-control traffic. In most cases, these authors don’t have a choice about whether or not to communicate over the network; however, they’re also aware that network analysts and incident responders commonly collect and monitor network traffic for indicators of compromise. By obfuscating that traffic, they make it harder to identify and extract valuable data from it.
Common types of obfuscation
Obfuscation refers to the practice of making data unreadable. In practice, there are two main types of obfuscation used by hackers and malware authors: encoding and encryption.
Encoding techniques were initially designed for legitimate purposes: making non-printable characters fit into ASCII-only protocols. However, they’ve also been adopted by malware authors to slightly raise the bar for those trying to identify and read command and control traffic. Encoding algorithms can be reversed by anyone who can identify the algorithm.
Encryption algorithms, on the other hand, require knowledge of a secret key for deobfuscation. If done properly, this means that the obfuscated data is completely protected from eavesdroppers. However, some commonly used encryption algorithms are extremely weak, making it possible for a network traffic analyst to extract the protected data.
There are a variety of different encoding and encryption algorithms in use for command-and-control traffic. However, there are only a few that are both commonly used and easily breakable.
Base64 encoding is an algorithm designed to make non-printable data printable. This is accomplished by mapping a set of three bytes to a set of four printable characters (alphanumeric plus two special symbols).
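In many cases, Base64 encoding is easy to identify: when the input length is not a multiple of three bytes, the encoded output is padded with one or two equal signs. However, this is not always the case, since inputs whose length is a multiple of three need no padding and some implementations strip the padding entirely.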
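The padding behavior described above can be checked with Python's standard `base64` module. The sketch below also includes a rough heuristic for spotting candidate Base64 tokens in captured traffic. The function name `looks_like_base64` is hypothetical, and the heuristic will produce false positives (for example, a short alphabetic word whose length is a multiple of four), so treat a match as a hint rather than proof.

```python
import base64
import binascii
import re

# Standard Base64 alphabet followed by zero, one, or two padding characters.
BASE64_RE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")


def looks_like_base64(token: str) -> bool:
    """Heuristic check: could this token be standard, padded Base64?"""
    if len(token) % 4 != 0 or not BASE64_RE.match(token):
        return False
    try:
        # validate=True rejects characters outside the standard alphabet.
        base64.b64decode(token, validate=True)
    except binascii.Error:
        return False
    return True


if __name__ == "__main__":
    # One, two, and three input bytes give two, one, and zero padding characters.
    for sample in (b"a", b"ab", b"abc"):
        print(sample, base64.b64encode(sample).decode())
    print(looks_like_base64("YWJj"), looks_like_base64("not base64!"))
```

A token that fails this check may still be Base64 with stripped padding or a non-standard alphabet.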
Decoding Base64 is easy unless the software is using a non-standard mapping of bytes to characters. The table above shows the standard mapping of characters to values for Base64 encoding calculations, but some malware authors use a different table to complicate analysis.
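As a minimal sketch of handling a non-standard mapping, one option is to translate each character of the custom alphabet back to the standard character with the same value and then decode normally. The helper `decode_custom_base64` and the swapped-case alphabet used below are made up for illustration; in a real investigation the alphabet would have to be recovered from the malware sample itself.

```python
import base64
import string

STANDARD_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

# Hypothetical custom alphabet: lowercase letters first, then uppercase.
EXAMPLE_CUSTOM_ALPHABET = (
    string.ascii_lowercase + string.ascii_uppercase + string.digits + "+/"
)


def decode_custom_base64(encoded: str, custom_alphabet: str) -> bytes:
    """Decode Base64 text that was produced with a non-standard alphabet."""
    if len(custom_alphabet) != 64 or len(set(custom_alphabet)) != 64:
        raise ValueError("alphabet must contain 64 unique characters")
    # Map each custom symbol to the standard symbol that has the same value.
    table = str.maketrans(custom_alphabet, STANDARD_ALPHABET)
    translated = encoded.translate(table)
    # Restore padding in case the sender stripped it.
    translated += "=" * (-len(translated) % 4)
    return base64.b64decode(translated)


if __name__ == "__main__":
    print(base64.b64decode("aGVsbG8="))  # standard alphabet
    print(decode_custom_base64("AgvSBg8=", EXAMPLE_CUSTOM_ALPHABET))
```

Both calls decode to the same bytes, since the second input is the first one re-expressed in the swapped-case alphabet.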
One example of Base64 being treated as if it protected sensitive data is the SMTP protocol. When credentials are passed using SMTP authentication mechanisms such as AUTH LOGIN or AUTH PLAIN, they are encoded using Base64. This means they do not appear as readable plaintext in the network traffic, but unless the session is also encrypted (for example, with TLS), they can be easily identified and decoded.
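To illustrate how little protection this offers, the sketch below decodes credentials as they would appear in an unencrypted SMTP session. It assumes the AUTH PLAIN mechanism, where the client sends the Base64 encoding of an authorization identity, username and password separated by null bytes. The function name and the sample credentials are invented for the example.

```python
import base64


def decode_auth_plain(blob: str):
    """Split an SMTP AUTH PLAIN argument into (authzid, username, password)."""
    raw = base64.b64decode(blob)
    parts = raw.split(b"\x00")
    if len(parts) != 3:
        raise ValueError("not a well-formed AUTH PLAIN response")
    return tuple(p.decode("utf-8", errors="replace") for p in parts)


if __name__ == "__main__":
    # In AUTH LOGIN, the server prompts are Base64 as well.
    print(base64.b64decode("VXNlcm5hbWU6"))

    # Build a sample AUTH PLAIN blob from made-up credentials, then decode it.
    sample = base64.b64encode(b"\x00alice\x00s3cret").decode()
    print(sample, decode_auth_plain(sample))
```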
ROT13 is an “encryption” algorithm often used by malware authors to obfuscate data. It is designed to protect text data using a substitution cipher. Each letter in the English alphabet is replaced with the letter 13 positions later, wrapping around at the end of the alphabet. For example, A would be replaced with N, Z with M and so on. Because the alphabet has 26 letters, applying ROT13 a second time restores the original text.
ROT13 can be easily identified since it maps letters to letters, as shown in the image above. As a result, the “encrypted” block of text will still contain only alphabetic ASCII characters.
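A short sketch of ROT13 in Python follows. It uses the `rot_13` text codec from the standard library alongside a hand-written `rotate` helper (a name chosen for this example) to show that the two agree and that non-letters pass through unchanged.

```python
import codecs
import string


def rotate(text: str, shift: int) -> str:
    """Rotate ASCII letters by `shift` positions, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def rot13(text: str) -> str:
    """ROT13 using the standard library codec."""
    return codecs.encode(text, "rot_13")


if __name__ == "__main__":
    message = "Hello, World!"
    print(rot13(message))         # letters change, punctuation does not
    print(rot13(rot13(message)))  # applying ROT13 twice restores the text
```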
While ROT13 is the most common rotation cipher, other step counts or substitution algorithms can be used. These can be identified and overcome using frequency analysis. Since letters are not equally used in the English language, the most common character in a sufficiently long ciphertext of a substitution cipher is likely to stand for E. This can help with the calculation of a rotation cipher’s step count or ease the cracking of any substitution cipher.
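The frequency-analysis idea can be sketched for rotation ciphers as follows. The code assumes the plaintext is English and that the most frequent ciphertext letter stands for E, which can easily fail on short or unusual text. Since there are only 26 possible shifts, listing every candidate and reading them by eye is a reasonable fallback. The function names are illustrative.

```python
from collections import Counter
import string


def rotate(text: str, shift: int) -> str:
    """Rotate ASCII letters by `shift` positions (negative shifts undo a rotation)."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def guess_shift(ciphertext: str) -> int:
    """Guess the step count, assuming the most common letter stands for 'e'."""
    letters = [c for c in ciphertext.lower() if c in string.ascii_lowercase]
    if not letters:
        raise ValueError("ciphertext contains no letters")
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord("e")) % 26


def all_candidates(ciphertext: str):
    """Brute-force fallback: every possible shift with its decryption."""
    return [(shift, rotate(ciphertext, -shift)) for shift in range(26)]


if __name__ == "__main__":
    secret = rotate("meet me here at eleven", 7)
    shift = guess_shift(secret)
    print(shift, rotate(secret, -shift))
```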
URL encoding is designed to allow non-printable or reserved characters to be included in a URL (which is useful for queries). Any character can be replaced with its “percent encoding”: the hexadecimal representation of its ASCII value preceded by a percent sign.
The table above shows some examples of common characters and their percent encoding equivalents. A malware author may replace characters with their equivalents in command and control traffic not only for printability but to complicate identification of strings of interest.
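A brief sketch of both directions in Python: `percent_encode_all` (a hypothetical helper) shows how a string of interest could be hidden by encoding every character, and `fully_unquote` (also hypothetical) uses `urllib.parse.unquote` repeatedly, on the assumption that an author may have applied the encoding more than once.

```python
from urllib.parse import unquote


def percent_encode_all(text: str) -> str:
    """Percent-encode every byte, including ones that do not need it."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


def fully_unquote(value: str, max_rounds: int = 5) -> str:
    """Undo percent encoding until the value stops changing (or rounds run out)."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            break
        value = decoded
    return value


if __name__ == "__main__":
    hidden = percent_encode_all("cmd.exe")
    print(hidden)
    print(fully_unquote(hidden))
    # "%25" is an encoded percent sign, so this value is encoded twice.
    print(fully_unquote("%2563md.exe"))
```

Decoding before searching a capture for strings of interest avoids missing matches that were hidden this way.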
XOR is a common operation used in encryption algorithms. In fact, the only provably secure encryption algorithm, the one-time pad, involves XORing each bit of the plaintext with a random key bit to produce the ciphertext. However, this is only secure if the key is truly random, is the same length as the plaintext and is only used once.
When XOR encryption is used by malware authors, they don’t meet these conditions. Typically, a single-byte or short key is used and repeated for the length of the plaintext. Under these conditions, XOR encryption is easily identifiable and breakable.
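To make the weakness concrete, here is a minimal sketch of repeating-key XOR and a brute-force search over all 256 single-byte keys. The search assumes the analyst can guess a marker that should appear in the plaintext; the DOS stub message found in Windows executables is used here purely as an example, and the function names are illustrative.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a key that repeats for the length of the data."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def brute_force_single_byte(ciphertext: bytes, marker: bytes):
    """Return every single-byte key whose decryption contains the marker."""
    return [
        key
        for key in range(256)
        if marker in xor_bytes(ciphertext, bytes([key]))
    ]


if __name__ == "__main__":
    plaintext = b"This program cannot be run in DOS mode"
    ciphertext = xor_bytes(plaintext, b"\x5a")
    # XOR is its own inverse, so the same function decrypts.
    print(xor_bytes(ciphertext, b"\x5a"))
    print(brute_force_single_byte(ciphertext, b"DOS mode"))
```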
The image above is extracted from a traffic capture of a malware sample that uses XOR encryption for content obfuscation. As shown, much of the content consists of the characters “mlvr” repeated over and over.
This is caused by the fact that the XOR encryption is used to encrypt an executable, and executables often include large stretches of null (0x00) bytes. XORing a key byte with a 0x00 plaintext byte results in a ciphertext byte equal to the key byte. As a result, the key is leaked here and can be used to decrypt the file. The only challenge is determining the correct alignment of the four key bytes (mlvr, lvrm, vrml or rmlv), but with only four options, even a brute-force search can be performed quickly.
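The alignment search described above can be sketched as follows. The code tries each rotation of the leaked bytes and keeps the one whose decryption starts with an expected file signature. Using `MZ` (the signature at the start of Windows executables) is an assumption for this example, as are the function names and the synthetic sample data; the actual capture is not reproduced here.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def rotations(key: bytes):
    """All cyclic rotations of the key, e.g. mlvr, lvrm, vrml, rmlv."""
    return [key[i:] + key[:i] for i in range(len(key))]


def find_key_alignment(ciphertext: bytes, leaked: bytes, magic: bytes = b"MZ"):
    """Return the rotation of the leaked key that decrypts to the expected magic."""
    for candidate in rotations(leaked):
        if xor_bytes(ciphertext, candidate).startswith(magic):
            return candidate
    return None


if __name__ == "__main__":
    # Synthetic stand-in for an executable: header bytes, then a run of nulls.
    fake_exe = b"MZ\x90\x00" + b"\x00" * 16 + b"This program cannot be run in DOS mode"
    ciphertext = xor_bytes(fake_exe, b"vrml")
    print(ciphertext[4:20])  # the null run exposes the repeating key
    key = find_key_alignment(ciphertext, b"mlvr")
    print(key, xor_bytes(ciphertext, key)[:2])
```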
Conclusion: Content deobfuscation in incident response
Obfuscating sensitive content in command-and-control traffic is a common trick for malware authors. If the data is properly encrypted, there is little that an incident responder can do to recover it from the traffic alone; however, many authors use encoding or weak encryption instead of proper algorithms.
As a result, identifying and decoding obfuscated content can be extremely valuable for incident responders. In general, people only work to hide and protect sensitive data, so looking for and deobfuscating weakly protected data can help an incident responder find the most valuable data in a network traffic capture. This information can provide useful intelligence about the operation being performed on the infected computer and/or the data being stolen by the attacker.