In the previous part of this article series, we looked at the structure of a PDF document and the essential keywords that analysts can use in investigations and that various tools rely on to characterize the document being analyzed.
Below is a refresher of the important keywords for PDF document analysis.
- /AA: This defines automatic actions embedded in the document that run in response to events, such as when the user opens the document. Events such as cursor movement can also be declared here to trigger a particular action.
- /AcroForm: This shows whether Adobe forms are used in PDF documents or not.
- /ObjStm: This defines an object stream, which can hide other objects. We will see this in a later part of the series.
- /GoTo*: Redirects the viewer to the specified destination in the PDF file.
- /URI: Accesses the resource that the URL points to.
- /SubmitForm and /GoToR: These can send data to a URL.
- /Launch: This launches a program.
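To make the idea of keyword triage concrete, here is a minimal Python sketch that counts the keywords from the list above in the raw bytes of a file. It is an illustration only, not how pdfid is implemented: the function name count_keywords and the keyword list are assumptions made for this example, and unlike pdfid it does not handle names obfuscated with hex escapes (such as /L#61unch).

```python
import re

# Keywords taken from the refresher list above (illustrative, not pdfid's full list).
KEYWORDS = [b"/AA", b"/AcroForm", b"/ObjStm", b"/GoTo", b"/GoToR",
            b"/URI", b"/SubmitForm", b"/Launch"]


def count_keywords(data: bytes) -> dict:
    """Count how often each keyword occurs as a complete PDF name in raw bytes."""
    counts = {}
    for keyword in KEYWORDS:
        # A PDF name ends at whitespace or a delimiter, so require that the
        # next byte is not a letter or digit. Without this, /AA would also
        # match /AAbc and /GoTo would also match /GoToR.
        pattern = re.compile(re.escape(keyword) + rb"(?![A-Za-z0-9])")
        counts[keyword.decode()] = len(pattern.findall(data))
    return counts
```

To try it on a file, read the file in binary mode and pass the bytes in, for example count_keywords(open("samplepdf.pdf", "rb").read()). A non-zero count for /Launch or /AA is only a hint that the document deserves a closer look, not proof that it is malicious.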
Let’s start the analysis of PDF documents.
Using pdfid, an analyst can perform a quick initial analysis of a PDF document. As stated above, this tool checks the document for the important keywords and reports how often each one occurs. The utility is part of Didier Stevens’s PDF tools. Let’s see it in action.
Below, pdfid is run on a malicious PDF file, samplepdf.pdf. Run the utility against the document, for example: pdfid.py samplepdf.pdf
The output for this document is below.
As we can see, pdfid found references to these keywords inside the PDF file. Let’s try to understand these references a bit better.
We can also get the document’s metadata by using the --extra switch with pdfid, for example: pdfid.py --extra samplepdf.pdf
The tool then reports this extra information.
As we can see, we can get metadata from the document such as the creation date, the modification date, and entropy values for different sections and streams. However, the analyst must note that a malicious author can easily change these fields, so they should not be trusted as the basis of any investigation.
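For readers unfamiliar with the entropy figure, the sketch below computes the standard Shannon entropy of a byte string in bits per byte. This is the textbook formula and is assumed, not confirmed, to match what the tool reports; the function name shannon_entropy is made up for this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the relative frequency p of each byte value.
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())
```

Values close to 8 suggest compressed or encrypted data, while plain text and PDF structure sit noticeably lower.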
Now that we have found indicators that this file is malicious, we need to examine its content. pdfid cannot do this, but another utility, pdf-parser, can locate individual objects and extract their content from the PDF file.
Let’s see pdf-parser in action:
It gives us the following output.
Now we need to look at object 32, and for that pdf-parser.py provides the --object switch. We can use it like this: pdf-parser.py --object 32 samplepdf.pdf
We will see output like below.
So, this object (object 32) contains a stream 1822 bytes long, and the stream is compressed and encoded using FlateDecode.
To decompress the stream, pdf-parser.py has a --filter switch, along with a --raw switch that keeps the output free of escaping for special characters. These options can be combined like this, for example: pdf-parser.py --object 32 --filter --raw samplepdf.pdf > output.js
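To show what the decompression step amounts to, here is a rough Python sketch that locates an object by number and inflates its FlateDecode stream with zlib. It is a simplification, not pdf-parser's implementation: the function name inflate_object_stream is hypothetical, and the code assumes generation number 0, a single FlateDecode filter, and no object streams or encryption.

```python
import re
import zlib


def inflate_object_stream(pdf: bytes, obj_num: int) -> bytes:
    """Return the zlib-inflated stream of object obj_num (generation 0 assumed)."""
    # Match "<obj_num> 0 obj ... endobj". The lookbehind stops 32 from
    # matching inside a larger number such as 132.
    obj_re = re.compile(
        rb"(?<!\d)" + str(obj_num).encode() + rb"\s+0\s+obj(.*?)endobj",
        re.DOTALL,
    )
    obj = obj_re.search(pdf)
    if obj is None:
        raise ValueError(f"object {obj_num} not found")

    # The stream data sits between "stream<EOL>" and "endstream".
    stream = re.search(rb"stream\r?\n(.*?)endstream", obj.group(1), re.DOTALL)
    if stream is None:
        raise ValueError(f"object {obj_num} has no stream")

    # decompressobj() ignores the end-of-line bytes left before "endstream".
    return zlib.decompressobj().decompress(stream.group(1))
```

For the sample in this article, the call would be inflate_object_stream(open("samplepdf.pdf", "rb").read(), 32). Malformed or deliberately obfuscated files can defeat such simple pattern matching, which is one reason to prefer a dedicated parser.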
Below is the content of output.js
This JavaScript appears to check the version of the PDF viewer (Adobe Acrobat in this case).
From the above code, we can see that the variable ‘sc’ seems to hold the shellcode. The script checks the PDF viewer version and, if it is lower than 6.0, tries to exploit a buffer overflow vulnerability in Collab.collectEmailInfo by passing it a crafted variable (plin in this case) as the value of the msg parameter.
After that, it invokes the start() function through app.setTimeout, which executes a function after a specified amount of time (10 milliseconds in this case).
Since we now know that the shellcode is inside the ‘sc’ variable, and we want to analyze how it works, we need to build an exe from it. To do that, we first copy the contents of the sc variable and strip everything that is not part of the shellcode, as shown below.
Then we need to convert the Unicode escapes to hex characters. A simple custom script can do this: each %uXXXX escape becomes two \x bytes with the low byte first, so %u4142 becomes \x42\x41.
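A custom conversion script of that kind might look like the Python sketch below. It is not the REMnux script, and the function names unescape_to_bytes and hex_escape are invented for this example. It assumes the input consists only of %uXXXX escapes and ignores anything else, including single-byte %XX escapes and literal characters.

```python
import re


def unescape_to_bytes(text: str) -> bytes:
    """Convert a run of %uXXXX escapes into the raw bytes they represent."""
    out = bytearray()
    for code_unit in re.findall(r"%u([0-9A-Fa-f]{4})", text):
        # Each %uXXXX is one 16-bit value stored little-endian in memory,
        # so the low byte comes first: %u4142 -> 0x42 0x41.
        out += int(code_unit, 16).to_bytes(2, "little")
    return bytes(out)


def hex_escape(data: bytes) -> str:
    """Render bytes as a \\xNN escaped string, one escape per byte."""
    return "".join(f"\\x{b:02x}" for b in data)
```

Calling hex_escape(unescape_to_bytes(text)) on the cleaned contents of the sc variable yields text in \x form, and the intermediate bytes object can also be written straight to a binary file if raw shellcode is more convenient.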
I will be using the unicode2hex-escaped script that ships with REMnux to make the conversion.
Below is the shell.txt file after conversion.
Now that we have the shellcode in hex format, we can use a utility such as shellcode2exe.py to convert the text file into an exe file. Use the shellcode2exe utility as shown below.
We will get output like the following.
With the exe available, normal analysis using a debugger and disassembler can continue.
In this article, we have seen how to parse the structure of a PDF file: identifying essential keywords, enumerating objects, identifying vulnerabilities, extracting shellcode, and converting it into an exe for further analysis. In the next and last part of this article series, we will take a look at other tools for performing the analysis.