Google hacking is a time-honored tradition that goes back many years. Specific Google searches allow users to directly download documents that a company might not want to have publicly available. This kind of attack relies on a number of different Google searches, which are covered in this paper.
The one thing to remember is that security engineers should Google search their company's URLs regularly to ensure that confidential data is not directly accessible via Google. It is common to find music, video, spreadsheets and other data in Google because a company did not properly secure its web servers or, in some cases, because of the allow or disallow statements in its robots.txt file.
Finding the Manifest file
Each CloudFront S3 bucket has a manifest file that describes every document found on the origin server. The manifest file matters not just to CloudFront itself: it is essentially an index of every file in the CloudFront distribution system. CloudFront automatically propagates those files within the CloudFront network and keeps track of them by recording the date, time and object ID in the manifest file. When the manifest file is created, you want to set up a masking field so that only specific file types are served from it. If you wanted only MP3s or MP4s to be allowed in the manifest file, you would mask for those file types. This ensures that .doc, .xls, .ppt and other files are not part of the manifest file.
The first search is:
This gives access to the full sitemap that a company generates to be indexed in Google. The output is an unformatted XML file that shows all the data about the objects in the CloudFront S3 bucket, or in the CloudFront origin server as it was propagated to the cloudfront.net URL.
You can then open the sitemap.xml file for each of these buckets within the data propagated from the origin server, and start mapping how data is organized within the CloudFront system that the company is using by looking at each Contents tag.
Each object within this XML file lists where it is stored, including full file path and type, the last modified date, the ETag (a specific “hash” identifying the file's current version), and the storage class of the file. In this hack the storage class is usually standard, and the objects look to be set to public read only. This would be a normal setting for the object because the public needs to read these assets to access them. This file makes it exceedingly easy to deconstruct how the web server is built, the naming convention used for objects, and whether the objects are recent additions or simply support objects like navigation buttons. A hacker can fairly easily work out where all the good files are by studying the sitemap.xml file, and then separate what is legitimately shared with the public from internal company information buried in directories that have been set for public access but should not be.
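The same reading of the listing can be done by a defender against their own bucket. The following is a minimal sketch, assuming the XML follows the standard S3 bucket listing layout (Contents elements carrying Key, LastModified, ETag and StorageClass). The sample listing, the bucket name, the keys and the helper names parse_listing and flag_unexpected are all invented for illustration.

```python
import os
import xml.etree.ElementTree as ET

# Namespace used by S3 bucket listing XML.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical listing: bucket name, keys and ETags are invented for illustration.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents>
    <Key>images/nav/button.png</Key>
    <LastModified>2012-01-15T10:00:00.000Z</LastModified>
    <ETag>&quot;aaaa1111&quot;</ETag>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>media/track01.mp3</Key>
    <LastModified>2012-02-01T08:30:00.000Z</LastModified>
    <ETag>&quot;bbbb2222&quot;</ETag>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
  <Contents>
    <Key>private/forecast.xls</Key>
    <LastModified>2012-02-03T17:45:00.000Z</LastModified>
    <ETag>&quot;cccc3333&quot;</ETag>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>"""


def parse_listing(xml_text):
    """Return one dict per Contents element in the listing."""
    root = ET.fromstring(xml_text)
    objects = []
    for item in root.findall("s3:Contents", S3_NS):
        objects.append({
            "key": item.findtext("s3:Key", namespaces=S3_NS),
            "last_modified": item.findtext("s3:LastModified", namespaces=S3_NS),
            # The ETag value is wrapped in double quotes in the XML.
            "etag": (item.findtext("s3:ETag", default="", namespaces=S3_NS)).strip('"'),
            "storage_class": item.findtext("s3:StorageClass", namespaces=S3_NS),
        })
    return objects


def flag_unexpected(objects, allowed_extensions):
    """Return keys whose file extension is not in the allowed set."""
    return [
        obj["key"]
        for obj in objects
        if os.path.splitext(obj["key"])[1].lower() not in allowed_extensions
    ]


if __name__ == "__main__":
    listing = parse_listing(SAMPLE_LISTING)
    for key in flag_unexpected(listing, {".mp3", ".png"}):
        print("Unexpected public object:", key)
```

Running this kind of check against your own listing gives a quick inventory of objects whose file type does not belong in a public distribution.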
The second search looks for specific file types within specific directories. Software installations usually create and use specific directories. S3 buckets can be set for streaming video, or set for serving documents, and then added to CloudFront as specific origin servers. Amazon recommends separating the types of data to be served (static or streaming) in their S3 documentation when building out an S3 bucket.
site:cloudfront.net /tmp .mp3
This search looks at cloudfront.net, the /tmp directory and the file type .mp3. Specifically, we are looking for objects in CloudFront that sit in the temporary directory used by many web services and software programs, and then for MP3 files that we can download directly. The ability to directly access these files may bypass any front end sign-in process that a company uses to control access to them. If you have your MP3 files behind a paywall, this method allows direct access and does not involve the authentication stream that would be used to log in a customer.
You can also take a different approach by changing directories, file types, and other information within the search string to identify information that can be used or downloaded for later use. If the company has an S3 bucket that is being used as the origin server, and access logging is turned on, then your download will be tracked by the company. But turning on logging for the S3 bucket, based on my observations, is a rare occurrence. Companies prefer to rely on the HTTP server logs for access logging rather than logging directly on the S3 bucket. Using this and the sitemap XML file, you can “drunk walk” the origin server to find interesting information.
A third search technique looks for private or confidential files within the Amazon CloudFront distribution. Using any variation on private, confidential, secret, company confidential or other markings within the document, you can access interesting files that companies might not want you to notice directly.
site:cloudfront.net /private confidential
This search looks for a directory called private with files that have the word confidential in them. This is usually set up for application forms, franchise information or other data types like this that are form based. However, this can also lead to interesting files in the system that point to truly company confidential information that should not be searched by Google. This generally bypasses all the login or authentication measures put in place by a company at the web front end, allowing for nearly unregulated access to interesting files that are stored in CloudFront.
More interesting file types that can be pulled from this are financial data using the file type XLS, log files, databases, and other interesting information that should not be indexed by Google. You could also end up finding remarkably interesting files that will give you a lot of intelligence about a company, generally unseen and generally unnoticed by the company itself.
To protect yourself, you should Google your own CloudFront URL to make sure that it is not showing up in Google searches. When you build a web page that points to direct objects within CloudFront, Google will search that origin server and build a list of files that will be stored in the Google index. You can always request that specific items be removed from Google because they should not be there. Yet that often happens only after a company or user has Google searched themselves to see what is out there already. If a file is already in Google, you should assume that many people on the internet have already seen it, and then take appropriate measures to ensure that the data is secured, or made part of a breach notification effort if Personally Identifiable Information (PII) has been compromised.
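A self-audit of this kind amounts to running the same searches described earlier, restricted to your own distribution. The sketch below only builds the search strings. The distribution hostname, the helper name build_audit_queries and the directory and term lists are assumptions chosen for illustration, and the queries would be pasted into Google by hand.

```python
def build_audit_queries(distribution_domain, directories, terms):
    """Build site: search strings for one CloudFront distribution hostname."""
    # Start with the broad query that shows everything indexed for the host.
    queries = [f"site:{distribution_domain}"]
    # Then add one narrower query per directory and search term pair.
    for directory in directories:
        for term in terms:
            queries.append(f"site:{distribution_domain} {directory} {term}")
    return queries


if __name__ == "__main__":
    # Hypothetical distribution hostname; substitute your own.
    for query in build_audit_queries(
        "d111111abcdef8.cloudfront.net",
        ["/tmp", "/private"],
        [".mp3", ".xls", "confidential"],
    ):
        print(query)
```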
You should set your robots.txt file to disallow anything that points to a cloudfront.net URL. This is a simple safety measure for anyone who has a web server that is using objects in CloudFront. It will not overly influence how Google indexes the site. It simply drops the links to anything on cloudfront.net, and leaves the surrounding text and data structures around those objects alone. If you are tagging information, or have text wrapped around the object, as some picture sharing web sites do, then this will not change anything other than the location of the data object being called by the otherwise dynamic web site that is being run.
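One way to sanity check a set of robots.txt rules before deploying them is Python's standard urllib.robotparser module. The sketch below assumes the robots.txt is served from the distribution's own hostname, since crawlers evaluate robots.txt rules per host. The hostname, the file path and the helper name is_crawlable are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Rules that ask every crawler to stay away from the whole host.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# Rules that place no restriction on crawlers (empty Disallow).
ALLOW_ALL = """User-agent: *
Disallow:
"""


def is_crawlable(robots_text, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical object URL on a hypothetical distribution.
    url = "https://d111111abcdef8.cloudfront.net/private/report.xls"
    print("Blocked rules allow fetch:", is_crawlable(BLOCK_ALL, "Googlebot", url))
    print("Open rules allow fetch:", is_crawlable(ALLOW_ALL, "Googlebot", url))
```

Keep in mind that robots.txt is only a request to well-behaved crawlers. It does not stop anyone from fetching an object directly.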
Turn on logging on your S3 bucket so that you can correlate how people are accessing objects with what the CloudFront system serves. Most companies rely on their web service logs but, in my experience, do not turn on logging on the S3 bucket. By comparing the web service access log with the data gathered by the S3 bucket, it is easy to set up a correlation between the two logs to determine who is directly accessing the data via Google, and who is going through the proper authorization or web front end to access the information that is in CloudFront. If you are seeing a high level of direct accesses to confidential information in S3, then you will be able to work on security countermeasures to ensure that private data is truly private and not exposed within Google. It is also recommended to use IAM or other federated identity mechanisms to restrict access to objects. This will ensure that object access in CloudFront is at least part of a companywide authentication measure and that data cannot be directly accessed through Google.
Read the AWS documentation for CloudFront, including the API information, to learn more about how to secure CloudFront data from casual Google hackers. You should also understand the security and protection mechanisms in S3 if you are using S3 as your origin point for data. There are multiple ways of protecting data within S3 that most system administrators do not know about.
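The correlation itself can be very simple once both logs are parsed. The following sketch assumes each log has already been reduced to records with a client address and an object key. The field names ip and key, the sample records and the helper names are hypothetical, and real log formats will need their own parsing step first.

```python
from collections import Counter


def find_direct_accesses(front_end_records, bucket_records):
    """Return bucket log records with no matching front end log record."""
    # Every (client, object) pair that went through the web front end.
    authorized = {(rec["ip"], rec["key"]) for rec in front_end_records}
    return [
        rec for rec in bucket_records
        if (rec["ip"], rec["key"]) not in authorized
    ]


def count_by_key(records):
    """Count how many times each object key appears in the records."""
    return Counter(rec["key"] for rec in records)


if __name__ == "__main__":
    # Hypothetical, already-parsed log records.
    front_end = [{"ip": "198.51.100.7", "key": "media/track01.mp3"}]
    bucket = [
        {"ip": "198.51.100.7", "key": "media/track01.mp3"},
        {"ip": "203.0.113.9", "key": "private/forecast.xls"},
        {"ip": "203.0.113.44", "key": "private/forecast.xls"},
    ]
    direct = find_direct_accesses(front_end, bucket)
    for key, hits in count_by_key(direct).most_common():
        print(f"{key}: {hits} direct accesses")
```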
There are four types of access control within S3: the Identity and Access Management (IAM) system within Amazon AWS; access control lists; policies that can be set up for each Amazon S3 bucket used in CloudFront; and query string authentication, which sends the data request through an additional layer of security by embedding a security token in the query string itself.
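To make the idea of a token in the query string concrete, here is a generic sketch of a signed, expiring link built with an HMAC. It is an illustration of the concept only, not the AWS signing algorithm. The parameter names Expires and Signature, the helper names and the secret are assumptions made for the example.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(path, expires_at, secret):
    """Compute a hex HMAC-SHA256 over the path and the expiry time."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path, expires_at, secret):
    """Append an expiry time and a signature to the path as a query string."""
    query = urlencode({
        "Expires": expires_at,
        "Signature": _signature(path, expires_at, secret),
    })
    return f"{path}?{query}"


def verify_url(url, secret, now):
    """Return True only if the signature matches and the link has not expired."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["Expires"][0])
        signature = params["Signature"][0]
    except (KeyError, ValueError):
        return False  # token missing or malformed
    if now > expires_at:
        return False  # link has expired
    expected = _signature(parts.path, expires_at, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode())


if __name__ == "__main__":
    secret = b"hypothetical-shared-secret"
    link = sign_url("/private/report.xls", 1700000000, secret)
    print(link)
    print(verify_url(link, secret, now=1699999999))
```

A short expiry limits the damage if such a link ever becomes public, because the token stops working shortly after it is issued.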
It is highly recommended that the query string be part of a database request, and not actually embedded in the URL. It would be dangerous to include a query string that is public and can be indexed with [firstname.lastname@example.org]. You should always use good URL programming standards.
There are many ways to secure your data in the Amazon Cloud, but you should be aware of the limitations of storing objects in the cloud unprotected. It is very easy to set up standard security controls around the objects in your CloudFront distribution if you are using an Amazon S3 bucket as your origin server. You should at all times have a robots.txt file that disallows indexing of cloudfront.net by any search engine's spidering system. It is important to ensure that object ACLs and authentication measures are in place for company confidential or private customer information in order to prevent it from being exposed in Google with an origination point of cloudfront.net.
You should do a Google search of your own CloudFront URL to see what is being spidered by Google and, if it is truly confidential information, request to have it removed. You should also use masking in your CloudFront setup so that only the file types that need to be exposed are available in the sitemap. You could also set the sitemap to mask all file types so that the logical hierarchy of directories cannot be easily deconstructed by someone reading the XML file.