When scraping the Google search engine, we need to be careful that Google doesn’t detect our automated tool as a bot. If it does, it redirects us to a captcha page, and we won’t be able to perform any more searches until the captcha has been entered. We won’t take the time here to check whether the Google captcha can be broken so that captcha strings could be sent to the server automatically; we just need to be careful enough not to overdo it.

GGGoogleScan

GGGoogleScan is a Google scraper which performs automated searches and returns results of search queries in the form of URLs or hostnames. Datamining Google’s search index is useful for many applications. Despite this, Google makes it difficult for researchers to perform automatic search queries. The aim of GGGoogleScan is to make automated searches possible by avoiding the search activity that is detected as bot behaviour [1]. Basically, we can enumerate hostnames and URLs with the GGGoogleScan tool, which can prove a valuable resource later on.

This tool has a number of ways to avoid being detected as a bot. One of them is horizontal searching: instead of requesting many pages of results for a single search query (for example, results 1-50), we make a large number of different search queries, save the results, and request only a small number of result pages for each of them.

When we first run the tool, the help page will be displayed like this:

# ./gggooglescan
 _______ _____ _____                       __
|     __|   __|   __|.-----..-----..-----.|  |.-----..-----..----..---.-..-----.
|    |  |  |  |  |  ||  o  ||  o  ||  o  ||  ||  o__||__ --||  __||  o  ||     |
|_______|_____|_____||_____||_____||___  ||__||_____||_____||____||___._||__|__|
   G-G-Googlescan vo.4 (o4/2o1o)   |_____|          by urbanadventurer
------------.-------------------------------------------------------------------
Description | Enumerate hostnames and URLs using the Google Search Engine
Homepage    | www.morningstarsecurity.com/research/gggooglescan
Author      | Andrew Horton (urbanadventurer) from Security-Assessment.com
Features    | Antibot avoidance, Search within a country, Horizontal searching,
            | URL encoding and more
------------'-------------------------------------------------------------------

Usage: ./gggooglescan [OPTION]... [QUERY|-i]

 -i=INPUTFILE   File containing search queries
 -e=WORDLIST    Combine each word from a wordlist with the query (avoid deep
                queries), can combine with -i
 -c=CC          Search within a country, eg. au, uk or nz
 -d=NUM         Depth of results. Num of pages to return. Default: 5
 -g=IP          IP or hostname of a Google search website.
                Default: www.google.com
 -l=FILE        Log file, output is appended if the file already exists
 -o             Output hostnames, instead of URLs
 -u=AGENT|NUM   User Agent. Default is to randomly select one
 -t             List user agents
 -s=SECONDS     Sleep for SECONDS between each query. Default: 0
 -q             Quiet. Turn off comment lines.
 -x             Pass cmdline args to curl, eg. -x "--socks5 127.0.0.1:8118"

We can see that there are a number of options we can use with this tool. Usually, we want to use the -d option to set the number of result pages to return. That number shouldn’t be too large, so that we stay in horizontal search mode, where we’re not detected as a bot. We can also use the -s option to sleep for the specified number of seconds between requests, which further hides our activity. The -i and -e options specify input files: -i takes a file of complete search queries, while -e takes a wordlist whose entries are each combined with the query. In the end, we can use a command like this:

# gggooglescan -l output.log -d 1 -e wordlist "test"

The wordlist file is provided with GGGoogleScan by default and contains 97,070 arbitrary words. The above command will save all the found URLs in the output.log file and will run queries like the ones presented below:
- test entry1
- test entry2
- test …
- test entryN

Where the entry1, entry2, …, entryN are the line entries in the wordlist file. One of the queries of the above command was also the “test aaliyah’s” query, which returned the following results:

# test aaliyah's
# Query: test aaliyah's
# http://www.google.com/search?q=test%20aaliyah%27s&start=0&num=10

http://www.blogger.com/?tab=wj


http://lyfstylmusic.com/post/33840145930


http://www.whosampled.com/sample/view/66855/Silkie-Test_Aaliyah-Rock%20the%20Boat/


http://gmat.bellcurves.com/teacher-profile/aaliyah-williams


http://articles.nydailynews.com/2012-10-08/news/34326860_1_aaliyah-lil-wayne-tha-carter


http://globalgrind.com/style/sessilee-lopez-block-magazine-aaliyah-fashion-spread-photos


http://www.hiphophavoc.com/havoc/2012/10/lil-wayne-reflects-on-drake-producing-aaliyahs-posthumous-album/


http://www.hiphopdx.com/index/news/id.20675/title.timbaland-reacts-to-drake-executive-producing-aaliyahs-album


http://www.hiphopdx.com/index/news/


http://www.hiphopdx.com/index/news/id.21643/title.just-blaze-mourns-aaliyahs-passing-says-they-were-scheduled-to-collaborate


http://www.hiphopdx.com/index/news/


http://saveaaliyah.com/


http://www.facebook.com/permalink.php?story_fbid=10152021546610515&id=26702510514
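
The way the -e option combines the base query with each wordlist entry, and the shape of the search URL printed in the comment lines above, can be sketched in a few lines of shell. This is only an illustration and is not taken from the GGGoogleScan source: the function names urlencode, build_search_url and expand_queries are hypothetical, and the mapping of result page N to start=N*10 is an assumption inferred from the start=0&num=10 parameters visible in the output.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the gggooglescan source.

# Percent-encode every character except ASCII letters, digits and . ~ _ -
urlencode() (
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# Build the search URL for one query. Page numbering starting at 0 and
# 10 results per page are assumptions based on the URL shown in the output.
build_search_url() (
    query=$1
    page=${2:-0}
    per_page=${3:-10}
    printf 'http://www.google.com/search?q=%s&start=%d&num=%d\n' \
        "$(urlencode "$query")" "$(( page * per_page ))" "$per_page"
)

# Print one combined query per wordlist line: "<base query> <word>".
expand_queries() (
    base=$1
    wordlist=$2
    while IFS= read -r word || [ -n "$word" ]; do
        if [ -n "$word" ]; then
            printf '%s %s\n' "$base" "$word"
        fi
    done < "$wordlist"
)

build_search_url "test aaliyah's"
# prints http://www.google.com/search?q=test%20aaliyah%27s&start=0&num=10
```

Calling expand_queries "test" wordlist would print the “test entry1”, “test entry2”, … queries listed earlier, one per line.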

But how can we be sure that the tool presents the right results? We can simply run the same query in the Google search engine. The results of the same query can be seen in the picture below:

We can see that the links in the picture are the same as those obtained by the GGGoogleScan tool. Now we’re ready to use some of the more advanced functions of GGGoogleScan. We can restrict the results to a country with the -c option. If we wanted to search only for sites in the United Kingdom, we would form the search as follows (notice the -c uk option added to the command):

# gggooglescan -c uk -l output.log -d 1 -e wordlist "test"

We can also use the -x option to route requests through a proxy if we need to. Keep in mind that the arguments given to -x are passed to curl, the command-line program that transfers data to and from a server using a number of supported protocols, including HTTP and HTTPS, so they must follow curl’s own option format.

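If we would like to collect the links that Google has indexed for a particular website, going up to 100 result pages deep (the -d option counts pages of results), we could use a command like the following: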

# gggooglescan -l output.log -d 100 "site:www.target.com"

We can do more than that: we can download a list of Google dorks and scan with those. What we gain is automatic enumeration of the hostnames and URLs on a specific site that match one of the Google dorks. But first we must obtain a list of Google dorks. Google dorks are published on pages like http://www.exploit-db.com/google-dorks/, but the site doesn’t provide a button to download them all; we would need to go through the pages one by one and download them by hand, or write a script to do it for us.

There’s an easier way: we can download the SearchDiggity tool. After downloading and installing it, we can use it to run the same kind of searches, except that we’re limited by the Google API, which doesn’t allow many search requests per day. During a penetration test, however, we would like to check a new customer against all Google dorks at once. With GGGoogleScan this is possible, if we’re careful.

To obtain the list of Google dorks, we can go to the C:\Program Files\Stach & Liu\SearchDiggity folder and copy the “Google Queries.txt” file from there. This file contains all the Google dorks that SearchDiggity uses to do its job. The Google dorks in this file are represented as follows:

SLDB;;Custom;;passlist
SLDB;;Custom;;passlist.txt (a better way)
SLDB;;Custom;;passwd
SLDB;;Custom;;passwd / etc (reliable)
SLDB;;Custom;;people.lst

In order to understand that list, we also need to view the Queries menu of the SearchDiggity tool. That menu is presented in the picture below:

We can immediately see how the “Google Queries.txt” file is parsed to provide that menu. First comes the database name, followed by the ‘;;’ separator, then the category, and at the end the actual Google dork for that category (again separated from its category by ‘;;’). We can use this knowledge to quickly throw away the database name and the category, which leaves us with just the Google dorks that we can use as input to the GGGoogleScan tool.

To parse the “Google Queries.txt” file, let’s first rename it to queries.txt for easier manipulation. To grab only the Google queries, we can use the following command, which splits each line on the ‘;;’ separator, takes the last column, and saves it into the file queries2.txt:

# awk -F';;' '{ print $NF }' queries.txt > queries2.txt
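
A slightly more defensive variant of the same split is sketched below. It rests on two assumptions that the file itself would need to confirm: that a file copied from a Windows installation may carry carriage returns at the end of each line, and that the same dork may appear under more than one category. The function name extract_dorks is hypothetical.

```shell
#!/bin/sh
# extract_dorks: hypothetical helper name, shown for illustration only.
# Reads "database;;category;;dork" lines from the given files (or from
# standard input) and prints each distinct dork once, in original order.
extract_dorks() {
    awk -F';;' '
        { sub(/\r$/, "") }               # drop a trailing carriage return, if any
        NF && !seen[$NF]++ { print $NF } # skip blank lines and repeated dorks
    ' "$@"
}
```

It could then be used as extract_dorks queries.txt > queries2.txt in place of the plain awk command.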

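We can then pass the queries2.txt file to GGGoogleScan with the -e option, so that each Google dork is combined with the site: restriction for our target:

# gggooglescan -l output.log -d 10 -e queries2.txt "site:www.target.com"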

The reason this works is that www.target.com usually doesn’t return any pages for most of the queries submitted to the Google search engine. The search queries are too specific to certain environments to return results, and the target can use only some of the existing technologies, not all of them. An example of a search query is:

site:www.target.com intitle:guestbook inurl:guestbook "powered by Adva"

Is the above query really going to return some results? Probably not, because it’s way too complicated and way too specific to the Adva guestbook. But if the www.target.com uses the Adva guestbook it can certainly return some results, which are more than welcome.

With the above command we won’t know which queries were submitted that found a certain hostname or URL, but it doesn’t really matter. We have the URL, which we can enter into the web browser to inspect it, and most of the time it will be immediately clear what the problem is; therefore we don’t need to know the search query that was used, since the whole point is to find a vulnerability or an inconsistency.

We can also search the www.target.com site for any common extensions. A list of common extensions is provided below:

filetype:asmx
filetype:aspx
filetype:bak
filetype:cfg
filetype:cgi
filetype:csv
filetype:dat
filetype:exe
filetype:htm
filetype:ini
filetype:log
filetype:php
filetype:pHp
filetype:ppt
filetype:pwd
filetype:qbw
filetype:txt
filetype:wml
filetype:xml

We can save the extensions in the ext.txt file and then run the command as follows, which will get all the URLs where resources with any of the above extensions are located:

# gggooglescan -l output.log -d 10 -e ext.txt "site:www.target.com"
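
If the list of extensions changes often, the ext.txt file can be generated rather than typed by hand. The following is a minimal sketch; the function name write_filetype_list is hypothetical and only saves typing the filetype: prefix on every line.

```shell
#!/bin/sh
# write_filetype_list: hypothetical helper that prints one
# "filetype:EXT" line for each extension given as an argument.
write_filetype_list() {
    for ext in "$@"; do
        printf 'filetype:%s\n' "$ext"
    done
}
```

For example, write_filetype_list asmx aspx bak > ext.txt would create a file with the first three entries of the list above.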

There’s only one problem with GGGoogleScan: it doesn’t tell us when we’ve been blocked. The scan keeps going, but every request is redirected to the captcha page, so no results come back. We can detect this when no results are being written to the output file; in that case we can be pretty sure that Google has detected our automation tool and blocked us. A blocked request for “site:google.com php”, intercepted with Wireshark, can be seen below:

We can see that we were redirected to the http://www.google.com/sorry/ page, where the captcha is waiting to be filled out. The captcha can be seen on the picture below:

It would certainly be a good thing if the GGGoogleScan could detect this and wait for us to fill in the captcha and continue or at least detect this and stop the program, so we would know when it was blocked and could continue afterwards.
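
Until the tool reports this itself, the “no results are being written” symptom can be watched for from a second terminal. The sketch below is not part of GGGoogleScan; the helper names count_results and has_stalled are hypothetical. It assumes that the log file holds one URL per line, each starting with http, so it would not apply to hostname output produced with the -o option.

```shell
#!/bin/sh
# Hypothetical watchdog helpers; not part of gggooglescan.

# Count the result lines (those starting with "http") currently in the log.
# A missing log file counts as zero results.
count_results() {
    if [ -f "$1" ]; then
        grep -c '^http' "$1" || true
    else
        echo 0
    fi
}

# Succeed (exit status 0) when the log holds no more results than the
# previous count, which may mean that Google has started blocking us.
has_stalled() {
    [ "$(count_results "$1")" -le "$2" ]
}

# Example use while a scan is running (the 10-minute interval is arbitrary):
#   prev=$(count_results output.log)
#   sleep 600
#   has_stalled output.log "$prev" && echo "# no new results, possibly blocked" >&2
```

A stalled log is only a hint: a long run of queries that legitimately return nothing looks the same from the outside.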

After careful observation of the script, we can determine that the script has a simple way of handling the case when Google blocks the script and it can’t continue because a redirect is occurring. That piece of code is presented below:

if `grep -q "http://sorry.google.com/sorry/?continue=" $TMPFILE`; then
    if (( $COMMENTS )); then
        echo "# You're acting like a bot. Wait now." >&2
    fi
    sleep $BOT_SLEEP
    rm -f $TMPFILE
    break
fi

This looks right: if the returned request contains the string “http://sorry.google.com/sorry/?continue=” then the script will sleep for BOT_SLEEP time, which is 60 minutes, because it needs to wait for Google to unblock us. At that time, an error message “# You’re acting like a bot. Wait now.” will also be displayed to let us know that we’ve been blocked and the script can’t continue. But when running our own GGGoogleScan scenario and after being blocked, the following is printed on the screen:

# next page of results link missing

This certainly isn’t the error message that we should get, so what’s going on? The problem is that the script hasn’t been updated lately, and Google changed the redirection URL from “http://sorry.google.com/sorry/?continue=” (known by the script) to “http://www.google.com/sorry/?continue=” (used by Google), which is why the script can’t detect that we’ve been blocked. We must change the string in the script to the one Google now returns when it blocks us, so that the script detects it and waits 60 minutes before continuing. After rerunning the script, we can see that we have indeed been blocked and that the script now waits:

# You're acting like a bot. Wait now.
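
Instead of swapping one fixed string for another, the check could accept both forms of the redirect. The following is a sketch of that idea and not the script’s actual code: the function name is_blocked is hypothetical, and accepting https as well as http is an assumption that goes beyond the two URLs observed above.

```shell
#!/bin/sh
# is_blocked: hypothetical helper name, shown for illustration only.
# Succeeds when the saved response (the file the script calls $TMPFILE)
# points at the "sorry" captcha page on either the old host
# (sorry.google.com) or the new one (www.google.com).
is_blocked() {
    grep -Eq 'https?://(sorry|www)\.google\.com/sorry/\?continue=' "$1"
}
```

In the script, the condition would then read if is_blocked "$TMPFILE"; then, with the body of the block left unchanged.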

Conclusion

We’ve seen how we can use the GGGoogleScan tool to avoid the limit of 100 queries per day, which is as much as Google allows if we use the Google search engine API.

References:
[1] GGGoogleScan, accessible on http://www.morningstarsecurity.com/research/gggooglescan.