In the previous article, we reviewed how to analyze malicious PDF documents. In this final part of the article series, we continue analyzing PDF documents with other tools. We review the Origami framework, which can be used to inspect and extract various objects from PDF documents.

As a refresher, let’s review the essential keywords for PDF document analysis.

  1. /AA: Defines automatic actions (the additional-actions dictionary) embedded in the document, for example actions that run when the user opens the document. Note that events such as cursor movement can also be declared here to trigger a particular action.
  2. /AcroForm: Shows whether Adobe forms are used in the PDF document.
  3. /ObjStm: Defines an object stream, which can hide specific objects. We will see this in the later part of the series.
  4. /JS: JavaScript embedded within the document.
  5. /GoTo*: Redirects to the specified destination in the PDF file.
  6. /URI: Accesses the resource that a URL points to.
  7. /SubmitForm and /GoToR: Indicate that data can be sent to a URL.
  8. /Launch: This launches a program.
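
As a rough illustration of how these keywords can be used for triage, the sketch below counts how often each name appears in the raw bytes of a file. It is only an assumption-laden approximation: the function name `count_keywords` and the sample bytes are made up for this example, and a plain byte search will miss names that are hex-escaped or hidden inside compressed object streams.

```python
import re

# A PDF name ends at whitespace or a delimiter, so reject matches that
# continue with another letter or digit (e.g. /JS must not match /JSFoo).
NAME_END = rb"(?![A-Za-z0-9])"

# One pattern per keyword from the list above. "/GoTo*" is treated as a
# wildcard that also covers variants such as /GoToR.
PATTERNS = {
    "/AA": rb"/AA" + NAME_END,
    "/AcroForm": rb"/AcroForm" + NAME_END,
    "/ObjStm": rb"/ObjStm" + NAME_END,
    "/JS": rb"/JS" + NAME_END,
    "/GoTo*": rb"/GoTo[A-Za-z]*" + NAME_END,
    "/URI": rb"/URI" + NAME_END,
    "/SubmitForm": rb"/SubmitForm" + NAME_END,
    "/GoToR": rb"/GoToR" + NAME_END,
    "/Launch": rb"/Launch" + NAME_END,
}


def count_keywords(data: bytes) -> dict:
    """Return how many times each keyword occurs in the raw PDF bytes."""
    return {name: len(re.findall(pattern, data)) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    # Hypothetical fragment of a PDF, used only to exercise the function.
    sample = (
        b"<< /Type /Catalog /AcroForm 6 0 R >> "
        b"<< /S /JavaScript /JS 7 0 R >> "
        b"<< /S /GoToR /F (x.pdf) >>"
    )
    for name, count in count_keywords(sample).items():
        print(name, count)
```

To scan a real file, the bytes would be read with `open(path, "rb").read()` and passed to `count_keywords`. A non-zero count is only a hint that the document deserves a closer look, not proof that it is malicious.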

Let’s start using different utilities inside the Origami framework.

First, let’s look at PDF Walker, a GUI program included as part of the Origami framework. Below is the result when a PDF is loaded into PDF Walker.

As we can see, PDF Walker has extracted all the embedded objects from the PDF. Next we need to find a JavaScript object, so let’s first search for references to JavaScript.

This search gives us a reference to object 32.

To view this object, click Document > Jump to Object and type the object number.

This shows us the stream of object 32.

Note that PDF Walker identifies the encoding used in the PDF document and applies the necessary decoding. For this document, PDF Walker identifies FlateDecode and applies the corresponding filter.
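
To make the FlateDecode step less of a black box, here is a minimal Python sketch of the same idea. FlateDecode is zlib/deflate compression, so the standard `zlib` module can inflate such a stream. The function name `inflate_streams` is hypothetical, and the sketch assumes the stream uses FlateDecode as its only filter, with no predictor and no encryption; it does not reflect how PDF Walker implements decoding internally.

```python
import re
import zlib

# Capture everything between the "stream" keyword and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data: bytes) -> list:
    """Return the inflated content of every stream that zlib can decode."""
    decoded = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            decoded.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not a Flate stream (or damaged); skip it in this sketch.
            continue
    return decoded


if __name__ == "__main__":
    # Hypothetical one-object document built in memory for the demo.
    payload = b"app.alert('hi');"
    body = zlib.compress(payload)
    pdf = (
        b"1 0 obj\n<< /Filter /FlateDecode /Length "
        + str(len(body)).encode()
        + b" >>\nstream\n"
        + body
        + b"\nendstream\nendobj\n"
    )
    print(inflate_streams(pdf))
```

Streams that chain several filters (for example ASCIIHexDecode followed by FlateDecode) need each filter undone in order, which is one reason a dedicated tool is more convenient than a script like this.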

With the filter applied, we can see the decoded stream. To dump this stream, right-click it and choose to dump it.

Below is the decoded dump output.

Origami also includes a command-line tool, pdfextract, which automatically locates, decodes, and extracts JavaScript code. Note that pdfextract can also extract embedded images and file attachments. To instruct the tool to extract only JavaScript, we must supply the -j parameter.

It creates a directory <filename>.dump/scripts and dumps the extracted script inside it. Below is an example of JavaScript extracted from sample.pdf.

Now let’s explore another sample with both of these tools.

First, load the sample into PDF Walker.

Now search for JavaScript as we did earlier. This gives a reference to object 10.

Let’s jump to object 10.

It gives the following output.

As we can see, it references object 12, so let’s jump to object 12.

It gives the following output, which points to object 13.

Continuing the same process, let’s jump to object 13.

Below is the embedded object 13.

Now the stream can be decoded and then analyzed further.
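
The manual walk from object 10 to 12 to 13 can also be scripted. The following Python sketch indexes the objects in a file and follows the first indirect reference in each one until it reaches an object that carries a stream. The helper names (`index_objects`, `follow_chain`) and the toy document are hypothetical and are not taken from the sample. The approach is deliberately naive: it assumes uncompressed objects and simply follows the first unseen reference, so a real analysis would still check each hop by hand.

```python
import re

# "N G obj ... endobj" blocks and "N G R" indirect references.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map each object number to the raw bytes of its body."""
    return {int(m.group(1)): m.group(3) for m in OBJ_RE.finditer(data)}


def follow_chain(objects: dict, start: int) -> list:
    """Follow the first unseen reference from each object, starting at `start`."""
    chain = [start]
    seen = {start}
    current = start
    while current in objects:
        body = objects[current]
        if b"stream" in body:
            break  # reached an object that carries a stream
        refs = [int(num) for num, _gen in REF_RE.findall(body)]
        nxt = next((r for r in refs if r not in seen), None)
        if nxt is None:
            break  # dead end: no further references
        chain.append(nxt)
        seen.add(nxt)
        current = nxt
    return chain


if __name__ == "__main__":
    # Hypothetical toy document that mirrors a 10 -> 12 -> 13 chain.
    toy = (
        b"10 0 obj\n<< /Type /Catalog /OpenAction 12 0 R >>\nendobj\n"
        b"12 0 obj\n<< /S /JavaScript /JS 13 0 R >>\nendobj\n"
        b"13 0 obj\n<< /Length 4 >>\nstream\nABCD\nendstream\nendobj\n"
    )
    print(follow_chain(index_objects(toy), 10))
```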

Let’s analyze the same PDF using pdfextract. This time we will extract everything in the sample PDF, not just the JavaScript.

We can see that pdfextract has extracted 2 PDF streams and 2 scripts from the sample PDF file and dumped them to the listed locations.

After this, we can use SpiderMonkey to deobfuscate the script located in the sample.dump/scripts folder. Running SpiderMonkey splits the JavaScript into eval 1 and eval 2. Looking at the contents of eval.002.log, we find that it contains the deobfuscated JavaScript.

As discussed earlier, we can again see that the exploit targets the Collab.collectEmailInfo vulnerability. Note the use of a NOP sled in the different variables. To analyze further, we need to copy the shellcode in the variable brIW1yTY and convert it into an executable, which we will do using shellcode2exe.

Because the shellcode is written with %u escapes, we first need to convert the Unicode escapes to hex.
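
The conversion itself is mechanical and can be sketched in a few lines of Python. Each `%uXXXX` escape is one 16-bit value that ends up in memory in little-endian order, so `%uABCD` becomes the bytes `CD AB`. The function name `unescape_to_hex` is hypothetical, and the sketch assumes the input contains only `%uXXXX` and `%XX` escapes; any other characters are ignored.

```python
import re

ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_to_hex(escaped: str) -> str:
    """Convert a %u-escaped JavaScript string into a plain hex string."""
    out = bytearray()
    for wide, single in ESCAPE_RE.findall(escaped):
        if wide:
            # %uXXXX: a 16-bit code unit, stored low byte first.
            out += int(wide, 16).to_bytes(2, "little")
        else:
            # %XX: a single byte.
            out.append(int(single, 16))
    return out.hex()


if __name__ == "__main__":
    # Two NOP pairs followed by an arbitrary value to show the byte swap.
    print(unescape_to_hex("%u9090%u9090%uABCD"))
```

The resulting hex string is the kind of input that is then handed to a converter such as shellcode2exe.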

Below are the contents of Shellcode-hex.

Now let’s convert this into an exe using shellcode2exe.py.

It successfully converts the shellcode to an exe binary.

This exe can be analyzed further. For example, we can run a quick search for ‘HTTP’ in the binary.
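
A quick string search of this kind can be approximated in Python as well. The sketch below pulls runs of printable ASCII out of a binary, in the spirit of the `strings` utility, and keeps those containing a search term. The function name `find_strings`, the default minimum length, and the sample bytes are assumptions made for illustration; the URL in the demo is invented and does not come from the analyzed sample.

```python
import re


def find_strings(data: bytes, needle: bytes = b"http", min_len: int = 6) -> list:
    """Return printable ASCII runs of at least min_len bytes that contain needle."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [
        run.decode("ascii")
        for run in pattern.findall(data)
        if needle.lower() in run.lower()  # case-insensitive match
    ]


if __name__ == "__main__":
    # Hypothetical bytes standing in for the converted executable.
    blob = b"\x90\x90\xeb\x10http://example.invalid/a.exe\x00\xff\xffcmd\x00"
    print(find_strings(blob))
```

Shellcode often stores its strings encoded or builds them at run time, so an empty result does not mean the binary makes no network requests.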

That is all for PDF analysis using these tools. Other tools, such as PDF Stream Dumper, peepdf, and AnalyzePDF, can also be used to analyze malicious PDFs.

As you have seen in this article and the previous one, analysts must follow a procedure to properly identify, locate, extract, deobfuscate, and further analyze the scripts embedded in such malicious documents.