Information Gathering - Web Edition Module Cheat Sheet
Web reconnaissance is the first step in any security assessment or penetration testing engagement. It's akin to a
detective's initial investigation, meticulously gathering clues and evidence about a target before formulating a plan of
action. In the digital realm, this translates to accumulating information about a website or web application to identify
potential vulnerabilities, security misconfigurations, and valuable assets.
The primary goals of web reconnaissance revolve around gaining a comprehensive understanding of the target's digital
footprint. This includes:
Identifying Assets: Discovering all associated domains, subdomains, and IP addresses provides a map of the
target's online presence.
Uncovering Hidden Information: Web reconnaissance aims to uncover directories, files, and technologies that
are not readily apparent and could serve as entry points for an attacker.
Analyzing the Attack Surface: By identifying open ports, running services, and software versions, you can
assess the potential vulnerabilities and weaknesses of the target.
Gathering Intelligence: Collecting information about employees, email addresses, and technologies used can
aid in social engineering attacks or identifying specific vulnerabilities associated with certain software.
Web reconnaissance can be conducted using either active or passive techniques, each with its own advantages and
drawbacks:
Active Reconnaissance: Involves directly interacting with the target system, such as sending probes or requests. Risk of detection: higher. Examples: port scanning, vulnerability scanning, network mapping.
Passive Reconnaissance: Gathers information without directly interacting with the target, relying on publicly available data. Risk of detection: lower. Examples: search engine queries, WHOIS lookups, DNS enumeration, web archive analysis, social media analysis.
WHOIS
WHOIS is a query and response protocol used to retrieve information about domain names, IP addresses, and other
internet resources. It's essentially a directory service that details who owns a domain, when it was registered, contact
information, and more. In the context of web reconnaissance, WHOIS lookups can be a valuable source of information,
potentially revealing the identity of the website owner, their contact information, and other details that could be used for
further investigation or social engineering attacks.
For example, if you wanted to find out who owns the domain example.com, you could run the following command in your
terminal:
whois example.com
This would return a wealth of information, including the registrar, the registration and expiration dates, the nameservers, and contact information for the domain owner.
However, it's important to note that WHOIS data can be inaccurate or intentionally obscured, so it's always wise to
verify the information from multiple sources. Privacy services can also mask the true owner of a domain, making it
more difficult to obtain accurate information through WHOIS.
DNS
The Domain Name System (DNS) functions as the internet's GPS, translating user-friendly domain names into the
numerical IP addresses computers use to communicate. Like GPS converting a destination's name into coordinates,
DNS ensures your browser reaches the correct website by matching its name with its IP address. This eliminates
memorizing complex numerical addresses, making web navigation seamless and efficient.
The dig command allows you to query DNS servers directly, retrieving specific information about domain names. For
instance, if you want to find the IP address associated with example.com, you can execute the following command:
dig example.com A
This command instructs dig to query the DNS for the A record (which maps a hostname to an IPv4 address) of
example.com. The output will typically include the requested IP address, along with additional details about the query
and response. By mastering the dig command and understanding the various DNS record types, you gain the ability to
extract valuable information about a target's infrastructure and online presence.
DNS servers store various types of records, each serving a specific purpose:
A: Maps a hostname to an IPv4 address.
AAAA: Maps a hostname to an IPv6 address.
CNAME: Creates an alias, pointing one hostname to another.
MX: Specifies mail servers responsible for handling email for the domain.
NS: Identifies the authoritative nameservers for the domain.
TXT: Holds arbitrary text, commonly used for domain verification and email policies such as SPF.
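For instance, to list the mail servers handling email for a domain, you could query its MX records with dig (example.com again serving as a placeholder):
dig example.com MX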
Subdomains
Subdomains are essentially extensions of a primary domain name, often used to organize different sections or services
within a website. For example, a company might use mail.example.com for their email server or blog.example.com for
their blog.
From a reconnaissance perspective, subdomains are incredibly valuable. They can expose additional attack surfaces,
reveal hidden services, and provide clues about the internal structure of a target's network. Subdomains might host
development servers, staging environments, or even forgotten applications that haven't been properly secured.
The process of discovering subdomains is known as subdomain enumeration. There are two main approaches to
subdomain enumeration:
Active Enumeration: Directly interacts with the target's DNS servers or utilizes tools to probe for subdomains. Examples: brute-forcing, DNS zone transfers.
Passive Enumeration: Collects information about subdomains without directly interacting with the target, relying on public sources. Examples: Certificate Transparency (CT) logs, search engine queries.
Active enumeration can be more thorough but carries a higher risk of detection. Conversely, passive enumeration is
stealthier but may not uncover all subdomains. Combining both techniques can significantly increase the likelihood of
discovering a comprehensive list of subdomains associated with your target, expanding your understanding of their
online presence and potential vulnerabilities.
Subdomain Brute-Forcing
Subdomain brute-forcing is a proactive technique used in web reconnaissance to uncover subdomains that may not be
readily apparent through passive methods. It involves systematically generating many potential subdomain names and
testing them against the target's DNS server to see if they exist. This approach can unveil hidden subdomains that may
host valuable information, development servers, or vulnerable applications.
One of the most versatile tools for subdomain brute-forcing is dnsenum. This powerful command-line tool combines
various DNS enumeration techniques, including dictionary-based brute-forcing, to uncover subdomains associated with
your target.
To use dnsenum for subdomain brute-forcing, you'll typically provide it with the target domain and a wordlist containing
potential subdomain names. The tool will then systematically query the DNS server for each potential subdomain and
report any that exist.
For example, the following command would attempt to brute-force subdomains of example.com using a wordlist named
subdomains.txt:
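dnsenum example.com -f subdomains.txt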
Zone Transfers
DNS zone transfers, also known as AXFR (Asynchronous Full Transfer) requests, offer a potential goldmine of
information for web reconnaissance. A zone transfer is a mechanism for replicating DNS data across servers. When a
zone transfer is successful, it provides a complete copy of the DNS zone file, which contains a wealth of details about
the target domain.
This zone file lists all the domain's subdomains, their associated IP addresses, mail server configurations, and other
DNS records. This is akin to obtaining a blueprint of the target's DNS infrastructure for a reconnaissance expert.
To attempt a zone transfer, you can use the dig command with the axfr (full zone transfer) option. For example, to
request a zone transfer from the DNS server ns1.example.com for the domain example.com, you would execute:
dig @ns1.example.com example.com axfr
However, zone transfers are not always permitted. Many DNS servers are configured to restrict zone transfers to
authorized secondary servers only. Misconfigured servers, though, may allow zone transfers from any source,
inadvertently exposing sensitive information.
Virtual Hosts
Virtual hosting is a technique that allows multiple websites to share a single IP address. Each website is associated
with a unique hostname, which is used to direct incoming requests to the correct site. This can be a cost-effective way
for organizations to host multiple websites on a single server, but it can also create a challenge for web
reconnaissance.
Since multiple websites share the same IP address, simply scanning the IP won't reveal all the hosted sites. You need
a tool that can test different hostnames against the IP address to see which ones respond.
Gobuster is a versatile tool that can be used for various types of brute-forcing, including virtual host discovery. Its vhost
mode is designed to enumerate virtual hosts by sending requests to the target IP address with different hostnames. If a
virtual host is configured for a specific hostname, Gobuster will receive a response from the web server.
To use Gobuster to brute-force virtual hosts, you'll need a wordlist containing potential hostnames. Here's an example
command:
gobuster vhost -u http://192.0.2.1 -w hostnames.txt
In this example, -u specifies the target IP address, and -w specifies the wordlist file. Gobuster will then systematically try each hostname in the wordlist and report any that result in a valid response from the web server.
Certificate Transparency Logs
Certificate Transparency (CT) logs are public, append-only records of issued SSL/TLS certificates, and the hostnames listed in those certificates frequently expose subdomains. The crt.sh website provides a searchable interface for CT logs. To efficiently extract subdomains using crt.sh within your terminal, you can use a command like this:
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq -r '.[].name_value' | sed 's/\*\.//g' | sort -u
This command fetches JSON-formatted data from crt.sh for example.com (the % is a wildcard), extracts domain names
using jq, removes any wildcard prefixes (*.) with sed, and finally sorts and deduplicates the results.
Web Crawling
Web crawling is the automated exploration of a website's structure. A web crawler, or spider, systematically navigates
through web pages by following links, mimicking a user's browsing behavior. This process maps out the site's
architecture and gathers valuable information embedded within the pages.
A crucial file that guides web crawlers is robots.txt. This file resides in a website's root directory and dictates which
areas are off-limits for crawlers. Analyzing robots.txt can reveal hidden directories or sensitive areas that the website
owner doesn't want to be indexed by search engines.
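To quickly review a target's robots.txt yourself, a simple fetch is usually enough (example.com used here as a placeholder):
curl -s http://example.com/robots.txt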
Scrapy is a powerful and efficient Python framework for large-scale web crawling and scraping projects. It provides a
structured approach to defining crawling rules, extracting data, and handling various output formats.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Record every link found on the page under the "file" key
        yield from ({'file': response.urljoin(href)} for href in response.css('a::attr(href)').getall())
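Assuming the spider above is saved as example_spider.py (a hypothetical filename), it could be run and its results exported to JSON with Scrapy's runspider command:
scrapy runspider example_spider.py -o example_data.json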
After running the Scrapy spider, you'll have a file containing scraped data (e.g., example_data.json). You can analyze
these results using standard command-line tools. For instance, to extract all links:
jq -r '.[] | select(.file != null) | .file' example_data.json | sort -u
This command uses jq to pull each non-null file entry out of the JSON, then sorts and deduplicates the results. Extending the pipeline with awk, sort, and uniq -c additionally lets you isolate file extensions and count their occurrences. By scrutinizing the extracted data, you can identify patterns, anomalies, or sensitive files that might be of interest for further investigation.
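A minimal sketch of that extension count, assuming the same example_data.json layout as above:
jq -r '.[] | select(.file != null) | .file' example_data.json | awk -F. '{print $NF}' | sort | uniq -c | sort -rn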
Search Engine Discovery
Search engines support advanced operators (often called Google dorks) that narrow results down to exactly what you are looking for:
inurl: searches for a specific term in the URL of a page. Example: inurl:admin login
intitle: searches for a term within the title of a page. Example: intitle:"index of" /backup
"search term" searches for the exact phrase within quotation marks. Example: "internal error" site:example.com
By creatively combining these operators and crafting targeted queries, you can uncover sensitive documents, exposed
directories, login pages, and other valuable information that may aid in your reconnaissance efforts.
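As a rough illustration (with example.com standing in for the target), a combined query might look like this:
site:example.com intitle:"login" inurl:admin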
Web Archives
Web archives are digital repositories that store snapshots of websites across time, providing a historical record of their
evolution. Among these archives, the Wayback Machine is the most comprehensive and accessible resource for web
reconnaissance.
The Wayback Machine, a project by the Internet Archive, has been archiving the web for over two decades, capturing
billions of web pages from across the globe. This massive historical data collection can be an invaluable resource for
security researchers and investigators.
Historical Snapshots: View past versions of websites, including pages, content, and design changes. Use case: identify past website content or functionality that is no longer available.
Hidden Directories: Explore directories and files that may have been removed or hidden from the current version of the website. Use case: discover sensitive information or backups that were inadvertently left accessible in previous versions.
Content Changes: Track changes in website content, including text, images, and links. Use case: identify patterns in content updates and assess the evolution of a website's security posture.
By leveraging the Wayback Machine, you can gain a historical perspective on your target's online presence, potentially
revealing vulnerabilities that may have been overlooked in the current version of the website.
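One practical way to pull this history from the terminal is the Wayback Machine's CDX API. A minimal sketch (example.com as a placeholder; parameters as commonly documented) that lists archived URLs for a domain:
curl -s "https://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey" | sort -u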