Introduction

In this part of the series, we are going to examine the different ways to escape HTML characters in PHP in order to add security to your web project. We will also give a brief introduction to PHP’s Perl-compatible regular expressions and show how they can be used for input validation. We are also going to examine PHP 5’s built-in input validation and filtering methods (focusing mostly on filter_var).

Transforming HTML characters

Suppose our website has a search page that responds to GET parameters and contains the following snippet:

SNIPPET 1

A legitimate user might get a page resembling something like this:

However, any user can add tags to the query and, at the very least, drastically change the way your page is formatted. An attacker can also target particular browsers and distribute links with malicious GET parameters that load external JavaScript files.

Above is an example of how we can easily change both HTML and CSS on the page (a relatively harmless example).

htmlspecialchars

To combat this, we can use htmlspecialchars(), htmlentities() or strip_tags(). htmlspecialchars() takes a string and three optional parameters: ‘flags’, the ‘encoding’ to be used when converting the characters, and a ‘double encoding’ option. Double encoding is set to true by default; when it is turned off, PHP does not encode HTML entities that already exist in the string.

A sample usage would prevent such XSS vulnerabilities and show the tags instead of applying them:

SNIPPET 2
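As a minimal sketch of this kind of usage, the hypothetical helper below (render_search_heading() is not a name from this article) escapes a search query before placing it in the page. It assumes the value would normally arrive in a GET parameter such as q.

```php
<?php
// Hypothetical helper: builds the heading of a search results page.
function render_search_heading(string $query): string
{
    // Escape the user-supplied value so that tags are displayed, not applied.
    $safe = htmlspecialchars($query);
    return '<h1>Results for: ' . $safe . '</h1>';
}

// In a real page the value would come from $_GET, e.g. $_GET['q'] (assumed name).
echo render_search_heading('<script>alert(1)</script>'), PHP_EOL;
```

The browser then shows the tag as text instead of executing it.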

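However, in PHP versions before 8.1 (where the default flag is ENT_COMPAT), htmlspecialchars() only converts the ampersand, double quotes, and the less-than and greater-than signs by default. Since PHP 8.1 the default flags include ENT_QUOTES, so single quotes are converted as well.

Thus, on older versions we could still get undesired effects. For example, here is a way to inject attributes when the single quotes are not escaped.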

Suppose we have the following snippet:

SNIPPET 3

A legitimate request would look like this:

The line adds a link to the page that points to an HTML file (it would be dynamically generated) named after the search keyword and displays the keyword as the text of the anchor. The $query variable, which contains the input escaped with htmlspecialchars(), is used both in the anchor's attribute and in its text.

However, consider if the user tries to see whether the single quote is also escaped and types something like:

http://localhost:8079/Tests/index.php?q=Chocolate' style='font-size:5em

Then the user has successfully closed the attribute value of our anchor and added an arbitrary attribute. An attacker can then try to add inline JavaScript and keep on testing for ways to exploit the vulnerability.

Figure 1: After the single quote exploit

Figure 2: Legitimate request (the anchor before the exploit)

To fix this, we just pass the ENT_QUOTES flag.

SNIPPET 4
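The difference the flag makes can be sketched as follows. The helper render_keyword_link() and the exact markup are assumptions made for illustration; only the attack string is taken from the URL shown earlier.

```php
<?php
// Hypothetical helper: builds a link whose href is wrapped in single quotes.
function render_keyword_link(string $query, int $flags): string
{
    $safe = htmlspecialchars($query, $flags, 'UTF-8');
    return "<a href='" . $safe . ".html'>" . $safe . "</a>";
}

$attack = "Chocolate' style='font-size:5em";

// ENT_COMPAT leaves single quotes untouched, so the attribute is closed early.
echo render_keyword_link($attack, ENT_COMPAT), PHP_EOL;

// ENT_QUOTES encodes each single quote as &#039;, so the input stays inside href.
echo render_keyword_link($attack, ENT_QUOTES), PHP_EOL;
```

Passing the flags explicitly also makes the behaviour independent of the PHP version's defaults.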

After we escape the single quotes as well, this vulnerability vanishes.

To convert an escaped string back into markup, we use htmlspecialchars_decode():

htmlspecialchars_decode($query);
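A small round-trip sketch, using a made-up sample string, shows that decoding with the same flags restores the original markup:

```php
<?php
$original = "<a href='page.html'>Tom & Jerry</a>";

// Escape with ENT_QUOTES so single quotes are encoded too.
$escaped = htmlspecialchars($original, ENT_QUOTES, 'UTF-8');

// Decode with the same flag; otherwise &#039; would be left as-is.
$decoded = htmlspecialchars_decode($escaped, ENT_QUOTES);

echo $escaped, PHP_EOL;
var_dump($decoded === $original);
```

Decoded output is markup again, so it should only be sent to the browser when the content is trusted.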

Strip Tags

If you want to be more radical, you can remove all HTML and PHP tags from a string or remove only a selection of them. The built-in function strip_tags() takes a string in which to remove the tags and optionally another string that pinpoints which tags are allowed.

SNIPPET 5

The above code results in all tags being removed from the string.

Optionally, we can allow any tag we want, but strip_tags() does not filter the attributes of allowed tags, so we have to do some manual escaping, as users can enter whatever attributes they want. Some browsers have shipped built-in XSS protection, but it cannot be relied upon, and clients with older systems could still be targeted with such malicious links.

SNIPPET 6
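The behaviour of strip_tags() with and without an allow-list can be sketched like this (the input strings are made up for illustration):

```php
<?php
$input = '<p>Hello <b>world</b><script>alert(1)</script></p>';

// No allow-list: every tag is removed, but the text between the tags stays.
$plain = strip_tags($input);
echo $plain, PHP_EOL;

// Allow only <b>. Note that the attributes of allowed tags are kept untouched.
$partly = strip_tags('<b onmouseover="alert(1)">Hi</b><i>there</i>', '<b>');
echo $partly, PHP_EOL;
```

The second call shows why allowing tags is risky: the event-handler attribute survives on the allowed tag.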

On Chrome versions that ship the XSS Auditor, we get an error in the console telling us that the script was not executed, but this is not necessarily the response all users will get.

htmlentities

Another function you can use is htmlentities().

The difference between htmlspecialchars() and htmlentities() is that htmlentities() translates every character that has a named HTML entity equivalent into that entity. This means that it also applies to characters such as © (the copyright symbol, &copy;), € (the euro symbol, &euro;) and all others.

For example, if we use htmlspecialchars() and enter the euro sign we will get the following result:

However, if we use htmlentities, the euro character will be properly translated to its relevant HTML entity:

It is important to know that the default flag of htmlentities() before PHP 8.1 is ENT_COMPAT, which only converts double quotes; single quotes are not translated, just as with htmlspecialchars().

Therefore, you also have to use ENT_QUOTES where appropriate:

SNIPPET 7
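A short comparison of the two functions, using a made-up sample string, might look like the sketch below. The euro sign is written as a "\u{20AC}" escape (PHP 7 and later) so the example does not depend on the encoding of the source file.

```php
<?php
// "\u{20AC}" is the euro sign.
$price = "5 \u{20AC} <b>'sale'</b>";

// htmlspecialchars() leaves the euro sign alone and escapes only the markup characters.
$special = htmlspecialchars($price, ENT_QUOTES, 'UTF-8');

// htmlentities() also turns the euro sign into its named entity.
$entities = htmlentities($price, ENT_QUOTES, 'UTF-8');

echo $special, PHP_EOL;
echo $entities, PHP_EOL;
```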

Validating input

For most purposes the built-in function filter_var() can be used. It is available in PHP 5.2.0 and later. It takes a variable or a literal value and returns the filtered data on success and false on failure. We can use it for validation and sanitization of input.

Note that filter_var() offers two families of filters. The FILTER_VALIDATE_* filters check whether the input is valid and return it (converted to the matching type where applicable) or false; they do not remove anything. The FILTER_SANITIZE_* filters remove or encode illegal characters without checking whether the result is valid.

Below is an example of how it works:

SNIPPET 8
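A minimal sketch of such a check is shown below. The helper name describe_email() is hypothetical, and the messages simply mirror the style of the output quoted elsewhere in this article.

```php
<?php
// Hypothetical helper: validates an email address and reports the outcome.
function describe_email(string $input): string
{
    $email = filter_var($input, FILTER_VALIDATE_EMAIL);

    // Compare strictly against false: that is the failure value.
    if ($email === false) {
        return 'Your email is invalid. Redirect to form.';
    }
    return "Your email $email is legit. Save it into the database.";
}

echo describe_email('user@example.com'), PHP_EOL;
echo describe_email('not-an-email'), PHP_EOL;
```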

This script will display that the email is legit if it is a valid email or display that the email is invalid if it is not.

There is also a validation filter for Booleans (FILTER_VALIDATE_BOOLEAN), which returns true only when the input is one of the following strings:

  • “1”
  • “true”
  • “on”
  • “yes”

For every other string value it returns false.
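The list above can be checked with a few calls. The sketch also uses the FILTER_NULL_ON_FAILURE flag, which is not discussed in this article: with it, unrecognised input yields null instead of false, which helps tell "no" apart from "not a Boolean at all".

```php
<?php
var_dump(filter_var('yes', FILTER_VALIDATE_BOOLEAN));   // bool(true)
var_dump(filter_var('off', FILTER_VALIDATE_BOOLEAN));   // bool(false)
var_dump(filter_var('maybe', FILTER_VALIDATE_BOOLEAN)); // bool(false)

// With FILTER_NULL_ON_FAILURE, unrecognised input gives NULL rather than false.
var_dump(filter_var('maybe', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE)); // NULL
```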

Another validation filter is for floating point numbers (FILTER_VALIDATE_FLOAT), which succeeds when:

  • The input is a numeric floating point value (Example: 22.2)
  • The input is a string containing a floating point value (Example: “22.2”, ‘22.2’)

Optionally, you can pass the FILTER_FLAG_ALLOW_THOUSAND flag, which allows a thousands separator such as a comma (,). Note that FILTER_FLAG_ALLOW_SCIENTIFIC (scientific notation with e or E) and FILTER_FLAG_ALLOW_FRACTION are flags of the sanitizing filter FILTER_SANITIZE_NUMBER_FLOAT rather than of FILTER_VALIDATE_FLOAT.
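A few sample calls (with made-up inputs) illustrate the float filter and its thousands-separator flag:

```php
<?php
// A plain numeric string is converted to a float.
var_dump(filter_var('22.2', FILTER_VALIDATE_FLOAT));

// A thousands separator is rejected unless the flag is passed.
var_dump(filter_var('1,234.5', FILTER_VALIDATE_FLOAT));
var_dump(filter_var('1,234.5', FILTER_VALIDATE_FLOAT, FILTER_FLAG_ALLOW_THOUSAND));

// Non-numeric input fails validation.
var_dump(filter_var('abc', FILTER_VALIDATE_FLOAT));
```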

The FILTER_VALIDATE_INT filter returns the filtered integer, or false if the input is not a valid integer. There are flags to allow octal and hexadecimal numbers (FILTER_FLAG_ALLOW_OCTAL and FILTER_FLAG_ALLOW_HEX) and options (min_range and max_range) to accept only numbers within a specified range. Here is a sample:

Options are passed in a nested array. The parent array contains the ‘flags’ key and the ‘options’ key; ‘options’ holds a nested array in which each key is an option name and each value is the value that the option should have. Here is how we can validate that an integer is between 1 and 100 and allow hexadecimal values.

When we pass 120, the validation fails (false is returned) and we get a message that the number is invalid:

Your int 0 is invalid. Redirect to form.

If we enter 85, 85 is stored in $int and we get this statement:

Your int 85 is legit. Save it into the database.

Similarly, if we use a hexadecimal value below 100 (let’s say 0xA, which is 10), the integer also passes the validation:

$int = filter_var("0xA", FILTER_VALIDATE_INT, $options);
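The $options array used above could be built along the following lines. This is a sketch based on the description in the text (a range of 1 to 100 plus the hexadecimal flag), not necessarily the exact array from the original snippet.

```php
<?php
// 'options' holds named options; 'flags' holds one or more FILTER_FLAG_* constants.
$options = [
    'options' => ['min_range' => 1, 'max_range' => 100],
    'flags'   => FILTER_FLAG_ALLOW_HEX,
];

var_dump(filter_var('120', FILTER_VALIDATE_INT, $options)); // bool(false): out of range
var_dump(filter_var('85', FILTER_VALIDATE_INT, $options));  // int(85)
var_dump(filter_var('0xA', FILTER_VALIDATE_INT, $options)); // int(10)
```

Because 0 can be a valid integer in other ranges, it is safer to compare the result with === false than to rely on a loose truthiness check.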

We are going to show one last example with the URL filter:

SNIPPET 10

We are validating a URL and passing a flag (FILTER_FLAG_QUERY_REQUIRED) to allow only URLs with a query string attached to them (a GET parameter). We get the following response:

Your URL is invalid. Redirect to form.

If we instead try the following URL, we will get a positive response:

$url = filter_var("http://www.dimoff.biz/?id=1", FILTER_VALIDATE_URL, $options);

Your URL http://www.dimoff.biz/?id=1 is legit. Save it into the database.
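The two outcomes can be reproduced with a sketch like the one below. The failing URL (www.example.com) is a placeholder chosen for illustration, and the $options array is an assumption based on the description of the flag.

```php
<?php
// Require a query string in addition to a syntactically valid URL.
$options = ['flags' => FILTER_FLAG_QUERY_REQUIRED];

// No query string: validation fails.
var_dump(filter_var('http://www.example.com/', FILTER_VALIDATE_URL, $options));

// Query string present: the URL is returned unchanged.
var_dump(filter_var('http://www.dimoff.biz/?id=1', FILTER_VALIDATE_URL, $options));
```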

The drawback of this validation is that internationalized domain names always fail: only URLs made up of ASCII characters pass the test.

There are also filters to validate input against a regular expression (FILTER_VALIDATE_REGEXP) and to validate IP addresses (FILTER_VALIDATE_IP, both IPv4 and IPv6).

You can also look at filter_var_array(), which filters multiple variables held in an array at once.
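As a rough sketch of filter_var_array(), the example below validates three made-up form fields in one call. The field names and the age range are assumptions for illustration only.

```php
<?php
// Sample input, as it might arrive from a form (all values are strings).
$data = ['email' => 'user@example.com', 'age' => '42', 'site' => 'not a url'];

// One rule per key: either a filter constant or an array with filter and options.
$rules = [
    'email' => FILTER_VALIDATE_EMAIL,
    'age'   => [
        'filter'  => FILTER_VALIDATE_INT,
        'options' => ['min_range' => 1, 'max_range' => 120],
    ],
    'site'  => FILTER_VALIDATE_URL,
];

$result = filter_var_array($data, $rules);
var_dump($result);
```

Each key in the result holds the filtered value, or false where validation failed.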

3.1 Validation through regular expressions

There are times when the built-in validations are not sufficient or do not include the validation you require. In such cases you can use preg_replace, preg_match, preg_match_all or preg_grep to do the job.

For example, you may want to allow both Bulgarian and American zip codes. However, Bulgarian zip codes consist of 4 digits, whereas American zip codes consist of 5 digits.

To do this you can use regular expressions:

SNIPPET 11

This regular expression tests whether the input starts with a digit, repeated 4 or 5 times, and then ends. Here are some tests:

Your zip code 23135 is legit. Save it into the database.

Your zip code 2313 is legit. Save it into the database.

Your zip code 231 is invalid. Redirect to form.

Your zip code 231352 is invalid. Redirect to form.
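A check that produces these messages could be sketched as follows. The pattern is an assumption reconstructed from the description (4 or 5 digits and nothing else), and describe_zip() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: accepts 4-digit or 5-digit zip codes only.
function describe_zip(string $zip): string
{
    // The D modifier makes $ match only at the very end of the input,
    // so a value with a trailing newline is rejected as well.
    if (preg_match('/^[0-9]{4,5}$/D', $zip) === 1) {
        return "Your zip code $zip is legit. Save it into the database.";
    }
    return "Your zip code $zip is invalid. Redirect to form.";
}

foreach (['23135', '2313', '231', '231352'] as $zip) {
    echo describe_zip($zip), PHP_EOL;
}
```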

3.2 Regular Expressions 101

Regular expressions in PHP must start and end with the same delimiter (usually /expression/ is used).

^ anchors the match to the start of the input.

$ anchors the match to the end of the input (by default it also matches just before a trailing newline; the D modifier or \z rules that out).

A value in square brackets [ ] means one of a particular character, for example [Abc] means the input can either be A or b or c.

[A-Z] means a single character anywhere between A and Z, for example M or D.

Uppercase and lowercase characters differ, so you would have to use [A-Za-z] if you wanted any alphabetic character. Similarly, you could use [0-9] or \d (which is the same for ASCII input, but with the u modifier \d can also match digits from other scripts).

? means the character preceding it can be repeated 0 or one time.

+ means it can be repeated one or more times.

* means the item can be repeated 0 or more times.

Alternatively, you can provide an exact number of repetitions, as in [A-Z]{10} (exactly 10 characters), a minimum, as in [A-Z]{10,} (at least 10 characters), or a minimum and maximum, as in [0-9]{4,5}.

. stands for any character (.? would mean any character zero or one times).

To match characters that have a special meaning in regular expressions literally, you escape them with a backslash ( \ ). For example, #[A-Z\+]# (here # is used as the delimiter).

There are also some escaped characters with special meanings, such as \d for a digit and \s for a whitespace character.

Characters enclosed in parentheses are captured for future use. For example, you may try the following regular expression: preg_replace('/([0-9]{4,5})-([A-Z]+)/', '($1)$2', $input);

If you give it a string such as “5432-PA”, it will transform it to become (5432)PA.

preg_replace() takes a regular expression as its first argument, the replacement string as its second, and the string to search in as its third.

In the replacement string, $0 gives the whole matched text, $1 the first parenthesized group, $2 the second and so on.
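The references $0, $1 and $2 can be tried out with a short sketch (the second input string is made up for illustration):

```php
<?php
$pattern = '/([0-9]{4,5})-([A-Z]+)/';

// $1 and $2 refer to the first and second captured groups.
$grouped = preg_replace($pattern, '($1)$2', '5432-PA');
echo $grouped, PHP_EOL;

// $0 refers to the whole matched text; text outside the match is left alone.
$wrapped = preg_replace($pattern, '[$0]', 'code: 5432-PA');
echo $wrapped, PHP_EOL;
```

Single-quoted strings are used for the pattern and the replacement so that PHP does not try to interpret backslashes or dollar signs itself.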

Another way to filter input is to use the ^ symbol at the beginning of the square brackets in a preg_replace() call. The class then matches every character other than those listed after the caret (^), so everything except the listed characters gets replaced.

For example, preg_replace('/[^\w]/', '', $input); would cause input such as <script>alert()</script> to be filtered to "scriptalertscript".

\w matches all word characters (letters, digits and the underscore), so we are saying: replace all non-word characters with nothing.
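This allow-list approach can be sketched with two sample inputs (the second one is made up for illustration):

```php
<?php
// Remove every character that is not a letter, digit or underscore.
$clean = preg_replace('/[^\w]/', '', '<script>alert()</script>');
echo $clean, PHP_EOL;

// The underscore counts as a word character, so it is kept.
$name = preg_replace('/[^\w]/', '', 'user_name-01!');
echo $name, PHP_EOL;
```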

Conclusion

We have covered some essential practices when working with input and we hope that you can start creating applications that are a little bit more secure and robust, or refactor existing projects by making them more secure with filter_var, regular expressions, or by filtering possible HTML coming from inputs.