Steganalysis: Your X-Ray Vision through Hidden Data
Steganography is often mistaken for cryptography, but the two operate very differently. Their main similarity is that both names were coined from Greek words.
steganos – covered
cryptos – secret
graphos – writing
That gives us hidden writing for steganography and secret writing for cryptography.
I advise you to read this previously written article before proceeding; in it, Soufiane Tahiri explained some basics. Just as reconnaissance has to be done for everything we do as security professionals, we also need to gather information on the files we will use. I’m using the “file” command in my Linux shell to gather information on some images here:
This shows that the JPEG images are stored in JFIF format. I tried the same on other image file types and the output was:
If we’ll be covering our data with some of these images, we need to have an idea of what information each image already holds. Since the output above is rather sparse, I went ahead and used exiftool to grab more metadata from the images.
The exiftool can be used to read metadata from files like so:
In my case, I have about 6 images to run that command on, so I will run the tool recursively to output all the results for image files in a folder into .txt files having the file names:
exiftool -r -w .txt lab/
Now I can read each of the text files with vim.
I don’t have a camera-taken photo. If you do, you can try the exiftool with it and you will get a lot more information like the camera type, date picture was taken, and more.
Now we’ll study the hexadecimal values of these image types so that we can spot changes after data has been embedded in them.
JPG: Both images above show that a regular JPG begins with 0xFFD8 and ends with 0xFFD9
PNG: This shows that PNG images should begin with hex value 0x89504E47 and end with 0xAE426082
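BMP: Bitmap images had inconsistent end values in my study, but they begin with 0x424D (ASCII “BM”). The bytes that follow hold the file size, so the 0x36 I saw next in my samples will vary from file to file.
GIF: Similar to bitmaps, end values for GIF images varied in my samples (the format itself defines 0x3B as its trailer byte), but they begin with 0x47494638 (ASCII “GIF8”).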
All this information helps us to some extent, as some ACTIVE steganography tools leave traces by adding extra bytes after the regular ending for the file type.
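The starting values listed above can also be checked programmatically. The following is a minimal Python sketch and not part of the tool built later in this article; the `SIGNATURES` table and the `identify_image` function are names I made up for illustration, and only the leading bytes are compared.

```python
# Leading "magic" bytes for the image types discussed above.
SIGNATURES = {
    "jpg": b"\xff\xd8",
    "png": b"\x89PNG",   # 0x89504E47
    "bmp": b"BM",        # 0x424D
    "gif": b"GIF8",      # 0x47494638
}


def identify_image(data):
    """Return the image type whose signature starts `data`, or None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# Example with hand-written headers instead of real files.
print(identify_image(b"\xff\xd8\xff\xe0"))  # jpg
print(identify_image(b"GIF89a"))            # gif
print(identify_image(b"plain text"))        # None
```

To use it on a real file, read the first few bytes with `open(path, "rb").read(8)` and pass them to `identify_image`.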
Common Methods of Steganography
- Null Ciphers
- Media File Steganography
Null ciphers are intended to confuse cryptanalysts, as they hide a message inside ordinary-looking text. The nth character of each word in a sentence can be used to derive the message. An example is this:
The messages derived from the example null ciphers are GUN, DATA, and gonzalez. In real life, these messages will be more obscure, as they wouldn’t be written in a large, conspicuous red font. The first and second examples hide their messages in the first character of each word, while the third hides it in the penultimate (second-to-last) character. Wikipedia gave this example:
Susan sAys GaIl Lies. MAtt leTs Susan fEel joVial. Elated (or) aNgry?
This example uses a pattern (1,2,3,1,2,3,…). For more obscurity, the nth character for the message in each word can be dynamic and still follow a pattern to be uncovered by the message receiver.
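A receiver who knows the pattern can recover the message mechanically. Here is a minimal Python sketch of that idea; `decode_null_cipher` is a hypothetical helper of mine, and skipping the parenthesised word “(or)” is an assumption taken from how the example reads.

```python
import re
from itertools import cycle


def decode_null_cipher(text, pattern):
    """Take the pattern-th letter (1-based) of each word, cycling the pattern."""
    # Assumption: parenthesised words such as "(or)" are not counted.
    text = re.sub(r"\([^)]*\)", " ", text)
    words = re.findall(r"[A-Za-z]+", text)
    letters = []
    for word, position in zip(words, cycle(pattern)):
        if position <= len(word):
            letters.append(word[position - 1])
    return "".join(letters).upper()


message = decode_null_cipher(
    "Susan sAys GaIl Lies. MAtt leTs Susan fEel joVial. Elated (or) aNgry?",
    [1, 2, 3],
)
print(message)  # SAILATSEVEN
```

The recovered letters read “sail at seven”. A dynamic pattern only changes the list passed as `pattern`.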
Steganography can also be implemented in many physical ways. In ancient times, a messenger’s head was shaved and tattooed with a message, which was hidden once the hair grew back. That is still done today.
In the TV series “Prison Break”, information was written on white paper with white ink, and it had to be shaded with a gray pencil to read through. I have a similar digital implementation of this as a pen.
See this Pen: http://codepen.io/bl4ckdu5t/pen/tnBFg
You can hover your mouse around the blue border, and if you do this patiently from the extreme left to right, you will get the text I have in it, which reads, “This is some text for some amazing contents and it is not to be seen by regular viewers.”
Media File Steganography
There are various ways by which data can be stored in image files or other media files. I will walk through some techniques and tools that can be really useful.
copy /b image.jpg + data.rar endfile.jpg
Data to be hidden should first be compressed into a zip or rar archive.
After this we run the command above, with ‘endfile.jpg’ being the name of the image file where we want the data to be stored. In my example below, I chose to use ‘output.jpg’:
Once we have the ‘output.jpg’ image, we’ll rename it to ‘output.rar’ and open it with WinRAR.
Now that we’ve successfully read the data in output.rar, we will change the extension back to jpg. It still works fine as an image, but we have to check the hex value to see if it is like a regular image as we have performed reconnaissance on regular images earlier. This is .jpg, so we expect a start of 0xFFD8 and end of 0xFFD9.
Dang! See what we have here: some extra values after FFD9. Also, we can see on the ASCII column there’s write.txt, which is the carrier of our information. Also, we have the information in the text file displayed (“This is some information to be hidden”).
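What `copy /b` does is plain byte concatenation, which is why the archive shows up right after FFD9. The Python sketch below illustrates this with stand-in bytes rather than real files; `append_payload` and `data_after_eoi` are hypothetical names, and treating the first FFD9 as the true end of the image is an assumption that does not hold for every JPEG.

```python
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def append_payload(cover, payload):
    """Concatenate payload bytes after the cover image bytes (like copy /b)."""
    return cover + payload


def data_after_eoi(data):
    """Return whatever follows the first FFD9 marker, or b'' if nothing does.

    Assumption: the first FFD9 is the real end of the image. A JPEG with an
    embedded thumbnail has an earlier FFD9, so this is only a rough check.
    """
    end = data.find(JPEG_EOI)
    if end == -1:
        return b""
    return data[end + len(JPEG_EOI):]


# Stand-ins for real files: a fake "image" and a fake archive.
cover = b"\xff\xd8" + b"\x00" * 16 + JPEG_EOI
payload = b"Rar!write.txt"
stego = append_payload(cover, payload)

print(data_after_eoi(cover))  # b''
print(data_after_eoi(stego))  # b'Rar!write.txt'
```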
We try the same process with a more packaged .NET GUI program (Steganography) and get the same result.
This is very convenient for those who get headaches around the CLI.
Back to my Linux box, let’s try some common Linux tools.
To embed data in an image with Steghide, we run:
steghide embed -ef sample.txt -cf image.jpg -sf output.jpg
This passively stores the sample.txt data into the output.jpg file with the image.jpg as cover.
ef = embed file
cf = cover file
sf = stegofile
I called Steghide’s embedding passive because when you inspect the hex values, there are no unusual changes to the file. I have just embedded data in a .jpg file, and it still looks like every other .jpg file, beginning with 0xFFD8 and ending with 0xFFD9.
Steghide also implements crypto steganography.
What’s crypto steganography?
This is just as the name implies. It combines cryptography with steganography by encrypting the data before it is hidden in the media file. Steghide uses rijndael-128 to encrypt by default. Other usable encryption algorithms can be seen by running:
With Steghide, we can also embed data in an audio file in WAV format.
steghide embed -e none -ef secret.txt -cf song.wav -sf output.wav
Outguess is another tool we can try. To embed secret.txt in image.jpg with it:
$ outguess -k "password" -d secret.txt image.jpg outguess.jpg
JPEG compression quality set to 75
Extracting usable bits: 32622 bits
Correctable message size: 20161 bits, 61.80%
Encoded ‘secret.txt’: 1136 bits, 142 bytes
Finding best embedding…
0: 581(49.7%)[51.1%], bias 474(0.82), saved: -1, total: 1.78%
3: 575(49.2%)[50.6%], bias 383(0.67), saved: 0, total: 1.76%
33: 572(49.0%)[50.4%], bias 346(0.60), saved: 0, total: 1.75%
105: 533(45.6%)[46.9%], bias 381(0.71), saved: 4, total: 1.63%
105, 914: Embedding data: 1136 in 32622
Bits embedded: 1168, changed: 533(45.6%)[46.9%], bias: 381, tot: 32543, skip: 31375
Foiling statistics: corrections: 217, failed: 2, offset: 51.654867 +- 138.564823
Total bits changed: 914 (change 533 + bias 381)
Storing bitmap into data…
Outguess also works smoothly and appends no data after the JPEG’s regular 0xFFD9 ending.
To extract embedded text:
$ outguess -k "password" -r outguess.jpg mysecret.txt
Building your own steganography tool
A lot of steganography tools use the LSB (Least Significant Bit) algorithm. Some use an enhanced LSB. LSB works best with BMP (bitmap) files because their pixel data is stored losslessly, so the changed bits survive when the file is saved. 24-bit BMP files are the best choice because every pixel offers three bytes (red, green, and blue) whose least significant bits can be changed. Some choose to use even smaller ones, like 8-bit.
A tool like Steghide, which we used, is not limited to BMP images and works fine with other image file types. It still implements a replacement technique, the same family that LSB belongs to.
Building a steganography tool requires that you know some techniques involved. I know of two major techniques, which are:
- Replacement Technique
- Appending Technique
Earlier in the article, I referred to replacement techniques as passive and appending techniques as active, because that’s the best way I can think of them. With the first Windows method (copy /b and WinRAR), we appended data to the image, as we saw in the hex values we inspected. For Steghide, we found no significant tampering in the hex code because it implements a replacement technique.
In a byte, the LSB is the 8th (rightmost) bit, which has a decimal value of 1, and the MSB (Most Significant Bit) is the 1st (leftmost) bit, which has a decimal value of 128.
Bit:   1 1 1 1 1 1 1 1
Value: 128 64 32 16 8 4 2 1
By tampering with this 8th bit in each byte of our image, we can store arbitrary content in it without making a significant change to the tampered file (cover file).
If we have a cover image with binary:
11001011 00101110 10100110
Tampering with the LSB will result in:
11001011 00101111 10100111
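The example above can be reproduced with a few lines of Python. This is a minimal sketch of sequential LSB replacement, not the code of any tool mentioned here; the function names are mine, and the message bits 1, 1, 1 are inferred from the before and after values shown.

```python
def embed_lsb(cover, bits):
    """Return a copy of `cover` whose first len(bits) bytes carry `bits` in their LSBs."""
    if len(bits) > len(cover):
        raise ValueError("cover is too small for the message")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        # Clear the least significant bit, then set it to the message bit.
        stego[i] = (stego[i] & 0b11111110) | bit
    return bytes(stego)


def extract_lsb(stego, count):
    """Read back the LSB of the first `count` bytes."""
    return [byte & 1 for byte in stego[:count]]


cover = bytes([0b11001011, 0b00101110, 0b10100110])
stego = embed_lsb(cover, [1, 1, 1])
print([format(b, "08b") for b in stego])
# ['11001011', '00101111', '10100111']
print(extract_lsb(stego, 3))  # [1, 1, 1]
```

Each byte changes by at most 1, which is why the difference is not noticeable in the cover file.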
A pressing question is, “Which bytes should be selected for LSB replacement?” Some algorithms select bytes sequentially and others use pseudo-random selection. Either way, the selected bytes will carry the bits of our embedded data.
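To make the difference between the two selection strategies concrete, here is a small Python sketch. It assumes the sender and receiver share a key that seeds the pseudo-random generator; all function names are hypothetical, and Python’s `random` module is used only for illustration since it is not meant for security purposes.

```python
import random


def choose_positions(cover_len, count, key=None):
    """Pick `count` distinct byte positions: sequential if key is None,
    otherwise pseudo-random positions derived from the shared key."""
    if count > cover_len:
        raise ValueError("cover is too small for the message")
    if key is None:
        return list(range(count))
    return random.Random(key).sample(range(cover_len), count)


def embed_bits(cover, bits, key=None):
    """Write each message bit into the LSB of a selected cover byte."""
    stego = bytearray(cover)
    for pos, bit in zip(choose_positions(len(cover), len(bits), key), bits):
        stego[pos] = (stego[pos] & 0xFE) | bit
    return bytes(stego)


def extract_bits(stego, count, key=None):
    """Read the LSBs back from the same positions, in the same order."""
    return [stego[pos] & 1 for pos in choose_positions(len(stego), count, key)]


cover = bytes(range(64))
bits = [1, 0, 1, 1, 0, 0, 1, 0]
sequential = embed_bits(cover, bits)
scattered = embed_bits(cover, bits, key="s3cret")
print(extract_bits(sequential, len(bits)) == bits)               # True
print(extract_bits(scattered, len(bits), key="s3cret") == bits)  # True
```

With sequential selection anyone can read the first bytes of the file to recover the bits, while with keyed selection the receiver needs the same key to find them.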
For the tool we will be building we will use the appending technique. The program will be called Stegman and it will be written in Python.
Before writing a program, I like to think through its workflow in plain words before a code implementation, so we will go through the two major functions to be performed by the program: embedding data and extracting embedded data.
The program checks the hexadecimal values of a JPG file to see if there is extra data after 0xFFD9, and for PNG files, it checks for data after 0x426082. If extra data is found, it means that data has already been embedded in the image. If none is found, it allows the user to embed data, which is appended after the regular hex ending for the image type.
In extraction cases, we again check for data after the expected hex endings and, if any exists, convert it from hex back to raw bytes and store it in a file specified by the user.
I’ll start by importing required modules for the program:
import sys, re, binascii, string
Next, we need a little function to get hexadecimal code from images:
def gethex(image):
    # Return the file's raw bytes as a lowercase hexadecimal string
    with open(image, 'rb') as f:
        data = f.read()
    return binascii.hexlify(data)
Another function to check if there is appended data to image hexadecimal:
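def extradatacheck(data, type):
    # hexlify() produces lowercase hex, so the patterns are lowercase too
    if type == 'png':
        pattern = br'(?<=426082)(.*)'
    elif type == 'jpg':
        pattern = br'(?<=ffd9)(.*)'
    match = re.search(pattern, data)
    # Return the hex found after the end marker, or an empty value if none
    return match.group(1) if match else b''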
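The extract and embed functions rely on those functions. In the check above, I used a regex lookbehind to check for characters after 0xFFD9 in the case of JPG images and 0x426082 in the case of PNG images, as our program is meant to work with only these two formats. Note that the check matches the first occurrence of the marker anywhere in the hex string, so an image that happens to contain those digits earlier (for example, a JPG with an embedded thumbnail) will be reported as carrying extra data.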
The Embed function:
def embed(embedFile, coverFile, stegFile):
    filetype = coverFile[-3:]
    stegtype = stegFile[-3:]
    if filetype != 'png' and filetype != 'jpg':
        print('Invalid format')
    elif filetype != stegtype:
        print('Output file has to be in the same format as cover image (%s)' % filetype.upper())
    else:
        # Read the payload as raw bytes
        with open(embedFile, 'rb') as f:
            data = f.read()
        info = gethex(coverFile)
        if extradatacheck(info, filetype):
            print('File already contains embedded data')
        else:
            # Append the payload's hex after the image's regular ending
            info += binascii.hexlify(data)
            print('Storing data to %s' % stegFile)
            with open(stegFile, 'wb') as f:
                f.write(binascii.unhexlify(info))
The function enforces that the user only makes use of PNG and JPG images. It also ensures the user stores the stego output in a format the same as the cover image. It then checks if the file already has embedded data based on our appending technique. If there is data already found, the program quits telling the user that it has found already embedded data.
The Extract function
def extract(stegFile, outFile):
    filetype = stegFile[-3:]
    data = gethex(stegFile)
    extra = extradatacheck(data, filetype)
    if extra:
        # Convert the appended hex back to raw bytes and save it
        with open(outFile, 'wb') as store:
            store.write(binascii.unhexlify(extra))
        print('Extracted data stored to %s' % outFile)
    else:
        print('File has no embedded data in it')
Once again we check to see if there is any appended data. If there is, we open the specified storage file and store the embedded file in it after it has been converted from hex to ASCII.
The complete tool can be downloaded with this article.
To see how the tool works, run:
python stegman.py -h
Advanced Steganography Case Study
Challenge 7 of the Python challenge presents a case where data has to be found in a given image. We have the following image to extract a word from in order to get to the next challenge.
This requires that we use PIL (the Python Imaging Library) to grab the data, which is hidden in the pixels of this image. I’ll give a little code hint on finding the hidden data, but I will not give a complete solution, to avoid spoiling the challenge.
>>> from PIL import Image
>>> img = Image.open('file.png')
>>> img.getpixel( (0, 50) )
(115, 115, 115, 255)
This is just a little of the information we can get from the case image. With more manipulation, we will get a resulting string that is plausible as a message.