In the last part of this article series, we looked at the structure of a PDF document and the essential keywords that analysts can use in their investigations and that various tools rely on to characterize the document being analyzed.

Below is a refresher of the important keywords for PDF document analysis.

  1. /AA: This defines the additional actions embedded in the document that fire automatically on trigger events, for example when the user opens the document. Note that other events, such as cursor movement, can also be declared here to trigger a particular action.
  2. /AcroForm: This shows whether interactive forms (AcroForms) are used in the PDF document.
  3. /ObjStm: This defines an object stream, which can contain, and therefore hide, other objects. We will see this in a later part of the series.
  4. /JS: Embedded JavaScript within the document.
  5. /GoTo*: Redirects to the specified destination in the PDF file.
  6. /URI: A resource accessed via the given URL.
  7. /SubmitForm and /GoToR: These indicate that data can be sent to a URL.
  8. /Launch: This launches a program.
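
The keyword check described above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how pdfid is actually implemented: the keyword list, the regular expressions and the function names are all assumptions made for illustration. It scans the raw bytes for PDF names, decodes `#xx` hex escapes (a common way to obfuscate names such as /J#53 for /JS) and counts the keywords of interest.

```python
import re
from collections import Counter

# Keywords taken from the list above; /JavaScript is added because it
# usually appears alongside /JS. This list is illustrative, not pdfid's own.
KEYWORDS = [b"/AA", b"/AcroForm", b"/ObjStm", b"/JS", b"/JavaScript",
            b"/URI", b"/SubmitForm", b"/GoToR", b"/Launch"]

# A PDF name starts with "/" and runs until whitespace or a delimiter.
NAME_RE = re.compile(rb"/[^\s()<>\[\]{}/%]+")
# Inside a name, "#xx" is a hex-escaped byte (e.g. "/J#53" is "/JS").
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_name(name: bytes) -> bytes:
    """Decode #xx escapes so obfuscated names compare equal to plain ones."""
    return HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), name)


def count_keywords(data: bytes) -> Counter:
    """Count how often each keyword of interest occurs in the raw PDF bytes."""
    counts = Counter({keyword: 0 for keyword in KEYWORDS})
    for match in NAME_RE.finditer(data):
        name = normalize_name(match.group(0))
        if name in counts:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        for keyword, count in count_keywords(handle.read()).items():
            print(f"{keyword.decode():<12} {count}")
```

Because this only looks at the raw bytes, it misses names inside compressed streams and object streams, which is one reason a non-zero /ObjStm count deserves a closer look.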

Let’s start the analysis of PDF documents.

With pdfid, an analyst can perform a quick initial analysis of a PDF document. As stated above, this tool checks for the important keywords in the document and reports how often each one occurs. The utility is part of Didier Stevens's PDF tools.

Below is the pdfid utility in action on a malicious PDF file, samplepdf.pdf. Run pdfid on the document as shown below:

Below is the output for this document:

As we can see, pdfid found references to several keywords inside the PDF file. Let’s try to understand these references a bit better.

This PDF file is 1 page long (/Page) and has at least one instance of JavaScript (/JavaScript) and an automatic action (/AA) embedded in it. The document also contains an interactive form (/AcroForm).

We can also get the document's metadata by using the --extra switch with pdfid, as shown below:

The tool then reports this extra information:

As we can see, we get basic metadata from the document, such as the creation date, the modification date, and entropy values for the different sections/streams. However, the analyst must note that these fields can easily be changed by a malicious author and should not be trusted as the basis of any investigation.
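
For readers unfamiliar with the entropy figures, the sketch below shows how a Shannon entropy value in bits per byte can be computed. It is a generic illustration and assumes nothing about how pdfid calculates its own numbers; the function name is made up for this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the relative frequency p of each byte value.
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(data).values())


if __name__ == "__main__":
    print(shannon_entropy(b"aaaaaaaa"))        # a single repeated byte
    print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Values close to 8 suggest compressed or encrypted data, while plain text and PDF structure usually score noticeably lower.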

Now that we have found indicators that this file is malicious, we need to examine its content. We cannot do that with pdfid, but another utility, pdf-parser, can help us locate an object and extract its content from the PDF file.

Let’s see pdf-parser in action:

pdf-parser has a --search option that lets us search for objects. We saw earlier that pdfid found embedded JavaScript. Let’s try to locate that object with pdf-parser.

It gives us the following output:


This tells us that the utility found object 31, which contains the /JavaScript string. Inside it, we can see a reference to object 32. (As discussed in the previous article, the keyword ‘R’ denotes a reference to an object. In this case, object 31 refers to object 32, which means object 31 is going to execute the JavaScript stored in object 32.)
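
The search-and-follow step can be sketched in Python as follows. This is a simplified illustration rather than pdf-parser's own logic: the regular expressions, the function name and the sample bytes (which only mimic the shape of objects 31 and 32) are assumptions. The code lists every object whose body contains a keyword, together with the object numbers it references through the `N G R` syntax.

```python
import re

# "31 0 obj ... endobj": object number, generation number, body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "32 0 R": an indirect reference to object 32, generation 0.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def search_objects(data: bytes, keyword: bytes) -> list:
    """Return (object number, referenced object numbers) for each object
    whose body contains keyword. Objects hidden in /ObjStm are not seen."""
    hits = []
    for match in OBJ_RE.finditer(data):
        body = match.group(3)
        if keyword in body:
            refs = [int(number) for number, _gen in REF_RE.findall(body)]
            hits.append((int(match.group(1)), refs))
    return hits


if __name__ == "__main__":
    # Hypothetical sample shaped like the objects discussed above.
    sample = (b"31 0 obj\n<< /S /JavaScript /JS 32 0 R >>\nendobj\n"
              b"32 0 obj\n<< /Length 4 >>\nstream\nabcd\nendstream\nendobj\n")
    print(search_objects(sample, b"/JavaScript"))
```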

Now we need to find object 32, and for that pdf-parser provides the --object switch. We can use it as shown below:

We will see output like the following:

So this object (object 32) contains a stream of 1822 bytes, compressed with the FlateDecode filter.

To decompress the stream, pdf-parser has a --filter switch, along with a --raw switch that writes the output without escaping special characters. These options can be used as shown below:
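
FlateDecode is zlib/deflate compression, so the same decompression can be reproduced with Python's standard zlib module. The sketch below is a minimal illustration, not a replacement for pdf-parser: it assumes the object's raw bytes are already available, that FlateDecode is the only filter applied, and that the compressed data does not itself contain the word "endstream". The function name is hypothetical.

```python
import re
import zlib

# Capture everything between the "stream" keyword and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_first_stream(obj_bytes: bytes) -> bytes:
    """Decompress the first FlateDecode stream found in a raw PDF object."""
    match = STREAM_RE.search(obj_bytes)
    if match is None:
        raise ValueError("no stream found in object")
    # decompressobj() stops at the end of the zlib data, so the end-of-line
    # bytes that precede "endstream" are left in unused_data and ignored.
    return zlib.decompressobj().decompress(match.group(1))


if __name__ == "__main__":
    # Build a stand-in object so the example is self-contained.
    payload = b"var sc = unescape('%u9090%u9090');"
    compressed = zlib.compress(payload)
    obj = (b"32 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n"
           % len(compressed)) + compressed + b"\nendstream\nendobj\n"
    print(inflate_first_stream(obj).decode())
```

A more careful version would honour the /Length entry instead of searching for "endstream", and would handle chained filters.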

Below is the content of output.js

It looks like this JavaScript checks the version of the PDF viewer (Adobe Acrobat in this case).

From the above code, we can see that the variable ‘sc’ seems to hold the shellcode. The JavaScript checks the PDF viewer version, and if it is lower than 6.0, it tries to exploit a buffer overflow vulnerability in Collab.collectEmailInfo by passing a crafted variable (plin in this case) as the value of the msg parameter.

After that, it invokes the start() function through app.setTimeout, which executes a function after a specified delay (10 milliseconds in this case).

Since we know that the shellcode is inside the ‘sc’ variable and we want to analyze how it works, we need to build an exe from it. First, we copy the contents of the sc variable and remove all the garbage from it, as shown below.

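Then we need to convert the Unicode-escaped (%u) values to hex characters. A simple custom script can do this by turning each %uXXXX value into two \x-escaped bytes. Note that each %u value encodes a 16-bit little-endian word, so its two bytes have to be swapped rather than just replacing %u with \x.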
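
A minimal version of such a script might look like the sketch below. It is not the REMnux script, and the function name is made up for illustration. It assumes the input contains only %uXXXX escapes that need converting, and it swaps the two bytes of each value because the shellcode is stored as little-endian 16-bit words.

```python
import re

# One %uXXXX escape: exactly four hex digits after "%u".
UNICODE_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})")


def unicode_to_hex_escaped(text: str) -> str:
    """Convert %uXXXX escapes to \\x-escaped bytes in memory (little-endian) order."""
    def swap(match):
        word = match.group(1)
        # %uAABB is stored in memory as BB AA, so emit the low byte first.
        return "\\x" + word[2:] + "\\x" + word[:2]
    return UNICODE_ESCAPE_RE.sub(swap, text)


if __name__ == "__main__":
    print(unicode_to_hex_escaped("%u9090%u4143"))
```

For the input `%u9090%u4143` this prints `\x90\x90\x43\x41`.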

I will be using the unicode2hex-escaped script that ships with REMnux to make the conversion.


Below is the shell.txt file after conversion:

Now that we have the shellcode in hex format, we can use a utility such as shellcode2exe to convert the txt file into an exe file. Use shellcode2exe as shown below:

We will get output like the following:

Now that the exe is available, normal analysis with a debugger and disassembler can continue.

In this article, we have seen how to parse the structure of a PDF file: identifying essential keywords, enumerating objects, identifying vulnerabilities, extracting shellcode, and converting it into an exe for further analysis. In the next and last part of this article series, we will look at other tools for performing the analysis.