Information Gathering - Web Edition
Web reconnaissance is the foundation of a thorough security assessment: the systematic collection of information about a target website or web application. Attackers leverage this information to tailor their attacks, allowing them to target specific weaknesses and bypass security measures. Conversely, defenders use recon to proactively identify and patch vulnerabilities before malicious actors can leverage them.
Types of Reconnaissance
Web reconnaissance encompasses two fundamental methodologies: active and passive
reconnaissance. Each approach offers distinct advantages and challenges, and
understanding their differences is crucial for adequate information gathering.
Active Reconnaissance
In active reconnaissance, the attacker directly interacts with the target system to gather information. This interaction can take various forms, such as port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, and web spidering.
Passive Reconnaissance
In contrast, passive reconnaissance involves gathering information about the target without directly interacting with it. It relies on analysing publicly available information and resources, such as WHOIS databases, DNS records, search engine results, web archives, social media, and public code repositories.
Passive reconnaissance is generally considered stealthier and less likely to trigger alarms
than active reconnaissance. However, it may yield less comprehensive information, as it
relies on what's already publicly accessible.
In this module, we will delve into the essential tools and techniques used in web
reconnaissance, starting with WHOIS. Understanding the WHOIS protocol provides a
gateway to accessing vital information about domain registrations, ownership details, and
the digital infrastructure of targets. This foundational knowledge sets the stage for more
advanced recon methods we'll explore later.
WHOIS
WHOIS is a widely used query and response protocol designed to access databases that
store information about registered internet resources. Primarily associated with domain
names, WHOIS can also provide details about IP address blocks and autonomous systems.
Think of it as a giant phonebook for the internet, letting you look up who owns or is
responsible for various online assets.
whois inlanefreight.com
[...]
Domain Name: inlanefreight.com
Registry Domain ID: 2420436757_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.registrar.amazon
Registrar URL: https://registrar.amazon.com
Updated Date: 2023-07-03T01:11:15Z
Creation Date: 2019-08-05T22:43:09Z
[...]
History of WHOIS
The history of WHOIS is intrinsically linked to the vision and dedication of Elizabeth Feinler,
a computer scientist who played a pivotal role in shaping the early internet.
In the 1970s, Feinler and her team at the Stanford Research Institute's Network Information
Center (NIC) recognised the need for a system to track and manage the growing number of
network resources on the ARPANET, the precursor to the modern internet. Their solution
was the creation of the WHOIS directory, a rudimentary yet groundbreaking database that
stored information about network users, hostnames, and domain names.
Key figures like Randy Bush and Jon Postel contributed to the development of the Regional Internet Registry (RIR) system, which divided the responsibility of managing internet resources into regional zones.
This decentralisation improved scalability and resilience, allowing WHOIS to keep pace with
the internet's rapid expansion.
With the formation of the Internet Corporation for Assigned Names and Numbers (ICANN) in 1998, oversight of the domain name system became more centralised. This helped to standardise WHOIS data formats, improve accuracy, and resolve domain disputes arising from issues like cybersquatting, trademark infringement, or conflicts over unused domains. ICANN's Uniform Domain-Name Dispute-Resolution Policy (UDRP) provides a framework for resolving such conflicts through arbitration.
Growing concerns over the privacy of registrant data later led registrars to redact much of the personal information that WHOIS records once exposed. The implementation of the General Data Protection Regulation (GDPR) in 2018 accelerated this trend, requiring WHOIS operators to comply with strict data protection rules.
For web reconnaissance, WHOIS records can be a rich source of information:
Identifying Key Personnel: WHOIS records often reveal the names, email addresses, and phone numbers of individuals responsible for managing the domain. This information can be leveraged for social engineering attacks or to identify potential targets for phishing campaigns.
Discovering Network Infrastructure: Technical details like name servers and IP addresses provide clues about the target's network infrastructure. This can help penetration testers identify potential entry points or misconfigurations.
Historical Data Analysis: Accessing historical WHOIS records through services like WhoisFreaks can reveal changes in ownership, contact information, or technical details over time. This can be useful for tracking the evolution of the target's digital presence.
Utilising WHOIS
Let's consider three scenarios to help illustrate the value of WHOIS data.
In the first scenario, a security analyst investigating a suspected phishing email runs a WHOIS lookup on the domain used in the email's link and notes the following:
Registration Date: The domain was registered just a few days ago.
Registrant: The registrant's information is hidden behind a privacy service.
Name Servers: The name servers are associated with a known bulletproof hosting provider often used for malicious activities.
This combination of factors raises significant red flags for the analyst. The recent registration
date, hidden registrant information, and suspicious hosting strongly suggest a phishing
campaign. The analyst promptly alerts the company's IT department to block the domain and
warns employees about the scam.
Further investigation into the hosting provider and associated IP addresses may uncover
additional phishing domains or infrastructure the threat actor uses.
In the second scenario, a malware researcher analysing a new strain of malware runs a WHOIS lookup on the domain its command-and-control (C2) server uses. Based on the registration and hosting details in the record, the researcher concludes that the C2 server is likely hosted on a compromised or "bulletproof" server. The researcher then uses the WHOIS data to identify the hosting provider and notify them of the malicious activity.
In the third scenario, a threat intelligence analyst reviews the WHOIS records of domains previously linked to a threat actor's campaigns and notices several recurring patterns:
Registration Dates: The domains were registered in clusters, often shortly before major attacks.
Registrants: The registrants use various aliases and fake identities.
Name Servers: The domains often share the same name servers, suggesting a common infrastructure.
Takedown History: Many domains have been taken down after attacks, indicating previous law enforcement or security interventions.
These insights allow analysts to create a detailed profile of the threat actor's tactics,
techniques, and procedures (TTPs). The report includes indicators of compromise (IOCs)
based on the WHOIS data, which other organisations can use to detect and block future
attacks.
Using WHOIS
Before using the whois command, you'll need to ensure it's installed on your Linux system. It's a utility available through most Linux package managers, and if it's not already present, it can be installed with:
sudo apt update
sudo apt install whois -y
The simplest way to access WHOIS data is through the whois command-line tool. Let's
perform a WHOIS lookup on facebook.com :
whois facebook.com
[...]
Registry Registrant ID:
Registrant Name: Domain Admin
Registrant Organization: Meta Platforms, Inc.
[...]
The WHOIS output for facebook.com reveals several key details:
1. Domain Registration :
Registrar : RegistrarSafe, LLC
Creation Date : 1997-03-29
Expiry Date : 2033-03-30
These details indicate that the domain is registered with RegistrarSafe, LLC, and has been
active for a considerable period, suggesting its legitimacy and established online presence.
The distant expiry date further reinforces its longevity.
2. Domain Owner :
Registrant/Admin/Tech Organization : Meta Platforms, Inc.
Registrant/Admin/Tech Contact : Domain Admin
This information identifies Meta Platforms, Inc. as the organization behind facebook.com ,
and "Domain Admin" as the point of contact for domain-related matters. This is consistent
with the expectation that Facebook, a prominent social media platform, is owned by Meta
Platforms, Inc.
3. Domain Status :
clientDeleteProhibited , clientTransferProhibited ,
clientUpdateProhibited , serverDeleteProhibited ,
serverTransferProhibited , and serverUpdateProhibited
These statuses indicate that the domain is protected against unauthorized changes,
transfers, or deletions on both the client and server sides. This highlights a strong emphasis
on security and control over the domain.
4. Name Servers :
A.NS.FACEBOOK.COM , B.NS.FACEBOOK.COM , C.NS.FACEBOOK.COM ,
D.NS.FACEBOOK.COM
These name servers are all within the facebook.com domain, suggesting that Meta
Platforms, Inc. manages its DNS infrastructure. It is common practice for large organizations
to maintain control and reliability over their DNS resolution.
Overall, the WHOIS output for facebook.com aligns with expectations for a well-established
and secure domain owned by a large organization like Meta Platforms, Inc.
While the WHOIS record provides contact information for domain-related issues, it might not
be directly helpful in identifying individual employees or specific vulnerabilities. This
highlights the need to combine WHOIS data with other reconnaissance techniques to
understand the target's digital footprint comprehensively.
DNS
The Domain Name System ( DNS ) acts as the internet's GPS, guiding your online journey
from memorable landmarks (domain names) to precise numerical coordinates (IP
addresses). Much like how GPS translates a destination name into latitude and longitude for
navigation, DNS translates human-readable domain names (like www.example.com ) into the
numerical IP addresses (like 192.0.2.1 ) that computers use to communicate.
Imagine navigating a city by memorizing the exact latitude and longitude of every location
you want to visit. It would be incredibly cumbersome and inefficient. DNS eliminates this
complexity by allowing us to use easy-to-remember domain names instead. When you type
a domain name into your browser, DNS acts as your navigator, swiftly finding the
corresponding IP address and directing your request to the correct destination on the
internet.
Without DNS, navigating the online world would be akin to driving without a map or GPS – a
frustrating and error-prone endeavour.
[Diagram: DNS resolution flow - your computer first checks its local cache; if the IP is not found, it sends a DNS query to the resolver, which checks its own cache and, failing that, performs a recursive lookup before returning the IP address to your computer, which then connects to the website.]
1. Your Computer Asks for Directions (DNS Query) : When you enter the domain
name, your computer first checks its memory (cache) to see if it remembers the IP
address from a previous visit. If not, it reaches out to a DNS resolver, usually provided
by your internet service provider (ISP).
2. The DNS Resolver Checks its Map (Recursive Lookup) : The resolver also has a
cache, and if it doesn't find the IP address there, it starts a journey through the DNS
hierarchy. It begins by asking a root name server, which is like the librarian of the
internet.
3. Root Name Server Points the Way : The root server doesn't know the exact address
but knows who does – the Top-Level Domain (TLD) name server responsible for the
domain's ending (e.g., .com, .org). It points the resolver in the right direction.
4. TLD Name Server Narrows It Down : The TLD name server is like a regional map. It
knows which authoritative name server is responsible for the specific domain you're
looking for (e.g., example.com ) and sends the resolver there.
5. Authoritative Name Server Delivers the Address : The authoritative name server
is the final stop. It's like the street address of the website you want. It holds the correct
IP address and sends it back to the resolver.
6. The DNS Resolver Returns the Information : The resolver receives the IP address
and gives it to your computer. It also remembers it for a while (caches it), in case you
want to revisit the website soon.
7. Your Computer Connects : Now that your computer knows the IP address, it can
connect directly to the web server hosting the website, and you can start browsing.
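If you want to watch this resolution chain happen yourself, the dig utility (covered in more detail later in this module) can walk the hierarchy from the root servers downwards with its +trace option, for example:
dig +trace www.example.com
Each stage of the output corresponds to one of the steps above: the root servers, the TLD name servers, and finally the authoritative name servers for the domain.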
The Hosts File
Alongside DNS, there is a much simpler mechanism for mapping names to addresses: the hosts file. This plain text file (located at /etc/hosts on Linux and macOS, and at C:\Windows\System32\drivers\etc\hosts on Windows) maps hostnames to IP addresses manually and is consulted before any DNS lookup is made. For example:
127.0.0.1 localhost
192.168.1.10 devserver.local
To edit the hosts file, open it with a text editor using administrative/root privileges. Add new
entries as needed, and then save the file. The changes take effect immediately without
requiring a system restart.
The hosts file is also handy for local development and for blocking unwanted sites; for example, you can point a development hostname at your own machine, or send an unwanted domain to a non-routable address:
127.0.0.1 myapp.local
0.0.0.0 unwanted-site.com
A DNS zone is a distinct portion of the domain namespace that a specific organisation or administrator manages. The zone file, a text file residing on a DNS server, defines the resource records (discussed below) within this zone, providing crucial information for translating domain names into IP addresses.
To illustrate, here's a simplified example of what a zone file for example.com might look like:
@ IN NS ns1.example.com.
@ IN NS ns2.example.com.
@ IN MX 10 mail.example.com.
www IN A 192.0.2.1
mail IN A 198.51.100.1
ftp IN CNAME www.example.com.
This file defines the authoritative name servers ( NS records), mail server ( MX record), and
IP addresses ( A records) for various hosts within the example.com domain.
DNS servers store various resource records, each serving a specific purpose in the domain
name resolution process. Let's explore some of the most common DNS concepts:
Root Name Server: The top-level servers in the DNS hierarchy. There are 13 root servers worldwide, named A-M (e.g., a.root-servers.net).
TLD Name Server: Servers responsible for specific top-level domains (e.g., .com, .org). Examples include Verisign for .com and PIR for .org.
Authoritative Name Server: The server that holds the actual records (such as the IP address) for a domain. Often managed by hosting providers or domain registrars.
DNS Record Types: The different types of information stored in DNS, such as A, AAAA, CNAME, MX, NS, and TXT records.
Now that we've explored the fundamental concepts of DNS, let's dive deeper into the
building blocks of DNS information – the various record types. These records store different
types of data associated with domain names, each serving a specific purpose:
The " IN " in the examples stands for "Internet." It's a class field in DNS records that
specifies the protocol family. In most cases, you'll see " IN " used, as it denotes the Internet
protocol suite (IP) used for most domain names. Other class values exist (e.g., CH for
Chaosnet, HS for Hesiod) but are rarely used in modern DNS configurations.
In essence, " IN " is simply a convention that indicates that the record applies to the
standard internet protocols we use today. While it might seem like an extra detail,
understanding its meaning provides a deeper understanding of DNS record structure.
Digging DNS
Having established a solid understanding of DNS fundamentals and its various record types,
let's now transition to the practical. This section will explore the tools and techniques for
leveraging DNS for web reconnaissance.
DNS Tools
DNS reconnaissance involves utilizing specialized tools designed to query DNS servers and
extract valuable information. Here are some of the most popular and versatile tools in the
arsenal of web recon professionals:
dig domain.com: Performs a default A record lookup for the domain.
dig domain.com A: Retrieves the IPv4 address (A record) associated with the domain.
dig domain.com AAAA: Retrieves the IPv6 address (AAAA record) associated with the domain.
dig domain.com MX: Finds the mail servers (MX records) responsible for the domain.
dig domain.com NS: Identifies the authoritative name servers for the domain.
dig domain.com TXT: Retrieves any TXT records associated with the domain.
dig domain.com CNAME: Retrieves the canonical name (CNAME) record for the domain.
dig domain.com SOA: Retrieves the start of authority (SOA) record for the domain.
dig @1.1.1.1 domain.com: Queries a specific name server; in this case, 1.1.1.1.
dig +noall +answer domain.com: Displays only the answer section of the query output.
dig domain.com ANY: Retrieves all available DNS records for the domain (note: many DNS servers ignore ANY queries to reduce load and prevent abuse, as per RFC 8482).
Caution: Some servers can detect and block excessive DNS queries. Use caution and
respect rate limits. Always obtain permission before performing extensive DNS
reconnaissance on a target.
Groping DNS
dig google.com
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 0 IN A 142.251.47.142
This output is the result of a DNS query using the dig command for the domain google.com . The command was executed on a system running DiG version 9.18.24-0ubuntu0.22.04.1-Ubuntu. The output can be broken down into four key sections:
1. Header
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16449 : This line
indicates the type of query ( QUERY ), the successful status ( NOERROR ), and a
unique identifier ( 16449 ) for this specific query.
;; flags: qr rd ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0,
ADDITIONAL: 0 : This describes the flags in the DNS header:
qr : Query Response flag - indicates this is a response.
rd : Recursion Desired flag - means recursion was requested.
ad : Authentic Data flag - means the resolver considers the data
authentic.
The remaining numbers indicate the number of entries in each section
of the DNS response: 1 question, 1 answer, 0 authority records, and 0
additional records.
;; WARNING: recursion requested but not available : This indicates that
recursion was requested, but the server does not support it.
2. Question Section
;google.com. IN A : This line specifies the question: "What is the IPv4 address
(A record) for google.com ?"
3. Answer Section
google.com. 0 IN A 142.251.47.142 : This is the answer to the query. It
indicates that the IP address associated with google.com is 142.251.47.142 .
The ' 0 ' represents the TTL (time-to-live), indicating how long the result can be
cached before being refreshed.
4. Footer
;; Query time: 0 msec : This shows the time it took for the query to be
processed and the response to be received (0 milliseconds).
;; SERVER: 172.23.176.1#53(172.23.176.1) (UDP) : This identifies the DNS
server that provided the answer and the protocol used (UDP).
;; WHEN: Thu Jun 13 10:45:58 SAST 2024 : This is the timestamp of when the
query was made.
;; MSG SIZE rcvd: 54 : This indicates the size of the DNS message received
(54 bytes).
An opt pseudosection can sometimes exist in a dig query. This is due to Extension
Mechanisms for DNS ( EDNS ), which allows for additional features such as larger message
sizes and DNS Security Extensions ( DNSSEC ) support.
If you just want the answer to the question, without any of the other information, you can
query dig using +short :
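In its simplest form the command looks like the following, with domain.com standing in as a placeholder for the target; the two addresses below are an example of the trimmed-down output it returns:
dig +short domain.com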
104.18.20.126
104.18.21.126
Subdomains
When exploring DNS records, we've primarily focused on the main domain (e.g.,
example.com ) and its associated information. However, beneath the surface of this primary
domain lies a potential network of subdomains. These subdomains are extensions of the
main domain, often created to organise and separate different sections or functionalities of a
website. For instance, a company might use blog.example.com for its blog,
shop.example.com for its online store, or mail.example.com for its email services.
Subdomain Enumeration
Subdomain enumeration is the process of systematically identifying and listing these
subdomains. From a DNS perspective, subdomains are typically represented by A (or AAAA
for IPv6) records, which map the subdomain name to its corresponding IP address.
Additionally, CNAME records might be used to create aliases for subdomains, pointing them
to other domains or subdomains. There are two main approaches to subdomain
enumeration:
Active enumeration interacts directly with the target's DNS infrastructure, most commonly by brute-forcing likely subdomain names against the domain's name servers or by attempting DNS zone transfers (both covered later in this module). Passive enumeration relies instead on external sources of information. Certificate Transparency (CT) logs, discussed in a later section, are one rich source. Another passive approach involves utilising search engines like Google or DuckDuckGo. By employing specialised search operators (e.g., site: ), you can filter results to show only subdomains related to the target domain.
Additionally, various online databases and tools aggregate DNS data from multiple sources,
allowing you to search for subdomains without directly interacting with the target.
Each of these methods has its strengths and weaknesses. Active enumeration offers more
control and potential for comprehensive discovery but can be more detectable. Passive
enumeration is stealthier but might not uncover all existing subdomains. Combining both
approaches provides a more thorough and effective subdomain enumeration strategy.
Subdomain Bruteforcing
dnsenum: Comprehensive DNS enumeration tool that supports dictionary and brute-force attacks for discovering subdomains.
fierce: User-friendly tool for recursive subdomain discovery, featuring wildcard detection and an easy-to-use interface.
dnsrecon: Versatile tool that combines multiple DNS reconnaissance techniques and offers customisable output formats.
amass: Actively maintained tool focused on subdomain discovery, known for its integration with other tools and extensive data sources.
assetfinder: Simple yet effective tool for finding subdomains using various techniques, ideal for quick and lightweight scans.
puredns: Powerful and flexible DNS brute-forcing tool, capable of resolving and filtering results effectively.
DNSEnum
dnsenum is a versatile and widely-used command-line tool written in Perl. It is a
comprehensive toolkit for DNS reconnaissance, providing various functionalities to gather
information about a target domain's DNS infrastructure and potential subdomains. The tool
offers several key functions:
DNS Record Enumeration : dnsenum can retrieve various DNS records, including A,
AAAA, NS, MX, and TXT records, providing a comprehensive overview of the target's
DNS configuration.
Zone Transfer Attempts : The tool automatically attempts zone transfers from
discovered name servers. While most servers are configured to prevent unauthorised
zone transfers, a successful attempt can reveal a treasure trove of DNS information.
Subdomain Brute-Forcing : dnsenum supports brute-force enumeration of
subdomains using a wordlist. This involves systematically testing potential subdomain
names against the target domain to identify valid ones.
Google Scraping : The tool can scrape Google search results to find additional
subdomains that might not be listed in DNS records directly.
Reverse Lookup : dnsenum can perform reverse DNS lookups to identify domains
associated with a given IP address, potentially revealing other websites hosted on the
same server.
WHOIS Lookups : The tool can also perform WHOIS queries to gather information about
domain ownership and registration details.
Let's see dnsenum in action by demonstrating how to enumerate subdomains for our target,
inlanefreight.com . In this demonstration, we'll use the subdomains-top1million-
5000.txt wordlist from SecLists, which contains the top 5000 most common subdomains.
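A typical invocation might look like the following; the wordlist path is an assumption and should be adjusted to wherever SecLists is installed on your system:
dnsenum --enum inlanefreight.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt -r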
In this command, --enum is a shortcut that enables several enumeration options at once (including WHOIS lookups and Google scraping), -f points dnsenum at the wordlist to use for brute-forcing, and -r recursively brute-forces any subdomains that are discovered.
dnsenum VERSION:1.2.6
----- inlanefreight.com -----
Host's addresses:
__________________
inlanefreight.com. 300 IN A
134.209.24.248
[...]
www.inlanefreight.com. 300 IN A
134.209.24.248
support.inlanefreight.com. 300 IN A
134.209.24.248
[...]
done.
While brute-forcing can be a fruitful approach, there's a less invasive and potentially more
efficient method for uncovering subdomains – DNS zone transfers. This mechanism,
designed for replicating DNS records between name servers, can inadvertently become a
goldmine of information for prying eyes if misconfigured.
[Diagram: zone transfer sequence - the secondary server sends an AXFR zone transfer request to the primary server, the primary server responds by transferring the zone's DNS records, and the secondary server acknowledges receipt.]
1. Zone Transfer Request (AXFR) : The secondary DNS server initiates the process by
sending a zone transfer request to the primary server. This request typically uses the
AXFR (Full Zone Transfer) type.
2. SOA Record Transfer : Upon receiving the request (and potentially authenticating the
secondary server), the primary server responds by sending its Start of Authority (SOA)
record. The SOA record contains vital information about the zone, including its serial
number, which helps the secondary server determine if its zone data is current.
3. DNS Records Transmission : The primary server then transfers all the DNS records in
the zone to the secondary server, one by one. This includes records like A, AAAA, MX,
CNAME, NS, and others that define the domain's subdomains, mail servers, name
servers, and other configurations.
4. Zone Transfer Complete : Once all records have been transmitted, the primary server
signals the end of the zone transfer. This notification informs the secondary server that
it has received a complete copy of the zone data.
5. Acknowledgement (ACK) : The secondary server sends an acknowledgement message
to the primary server, confirming the successful receipt and processing of the zone
data. This completes the zone transfer process.
The Zone Transfer Vulnerability
While zone transfers are essential for legitimate DNS management, a misconfigured DNS
server can transform this process into a significant security vulnerability. The core issue lies
in the access controls governing who can initiate a zone transfer.
In the early days of the internet, allowing any client to request a zone transfer from a DNS
server was common practice. This open approach simplified administration but opened a
gaping security hole. It meant that anyone, including malicious actors, could ask a DNS
server for a complete copy of its zone file, which contains a wealth of sensitive information.
Subdomains : A complete list of subdomains, many of which might not be linked from
the main website or easily discoverable through other means. These hidden
subdomains could host development servers, staging environments, administrative
panels, or other sensitive resources.
IP Addresses : The IP addresses associated with each subdomain, providing potential
targets for further reconnaissance or attacks.
Name Server Records : Details about the authoritative name servers for the domain,
revealing the hosting provider and potential misconfigurations.
Remediation
Fortunately, awareness of this vulnerability has grown, and most DNS server administrators
have mitigated the risk. Modern DNS servers are typically configured to allow zone transfers
only to trusted secondary servers, ensuring that sensitive zone data remains confidential.
However, misconfigurations can still occur due to human error or outdated practices. This is
why attempting a zone transfer (with proper authorisation) remains a valuable
reconnaissance technique. Even if unsuccessful, the attempt can reveal information about
the DNS server's configuration and security posture.
dig axfr @nsztm1.digi.ninja zonetransfer.me
This command instructs dig to request a full zone transfer ( axfr ) from the DNS server responsible for zonetransfer.me . If the server is misconfigured and allows the transfer, you'll receive a complete list of DNS records for the domain, including all subdomains.
Virtual Hosts
Once the DNS directs traffic to the correct server, the web server configuration becomes
crucial in determining how the incoming requests are handled. Web servers like Apache,
Nginx, or IIS are designed to host multiple websites or applications on a single server. They
achieve this through virtual hosting, which allows them to differentiate between domains,
subdomains, or even separate websites with distinct content.
The key difference between VHosts and subdomains lies in where they are defined: subdomains are typically created as DNS records pointing at an IP address, while virtual hosts are defined in the web server's configuration and are selected based on the hostname sent in the request, whether or not a matching DNS record exists.
If a virtual host does not have a DNS record, you can still access it by modifying the hosts
file on your local machine. The hosts file allows you to map a domain name to an IP
address manually, bypassing DNS resolution.
Websites often have subdomains that are not public and won't appear in DNS records.
These subdomains are only accessible internally or through specific configurations. VHost
fuzzing is a technique to discover public and non-public subdomains and VHosts by
testing various hostnames against a known IP address.
Virtual hosts can also be configured to use different domains, not just subdomains. For
example:
<VirtualHost *:80>
ServerName www.example2.org
DocumentRoot /var/www/example2
</VirtualHost>
<VirtualHost *:80>
ServerName www.another-example.net
DocumentRoot /var/www/another-example
</VirtualHost>
[Diagram: the web server inspects the Host header of the incoming HTTP request, selects the matching virtual host configuration, and returns that site's content in the HTTP response for the browser to display.]
In essence, the Host header functions as a switch, enabling the web server to dynamically
determine which website to serve based on the domain name requested by the browser.
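You can see this behaviour for yourself with curl by sending a request to the server's IP address while setting the Host header manually; the IP address and hostname below are placeholders:
curl -s -H "Host: dev.inlanefreight.com" http://192.0.2.1/
If a virtual host with that ServerName exists, the server returns its content; otherwise, you typically receive the default site. Web servers support several types of virtual hosting: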
1. Name-Based Virtual Hosting : This method relies solely on the HTTP Host header
to distinguish between websites. It is the most common and flexible method, as it
doesn't require multiple IP addresses. It’s cost-effective, easy to set up, and supports
most modern web servers. However, it requires the web server to support name-based
virtual hosting and can have limitations with certain protocols like SSL/TLS .
2. IP-Based Virtual Hosting : This type of hosting assigns a unique IP address to each
website hosted on the server. The server determines which website to serve based on
the IP address to which the request was sent. It doesn't rely on the Host header , can
be used with any protocol, and offers better isolation between websites. Still, it requires
multiple IP addresses, which can be expensive and less scalable.
3. Port-Based Virtual Hosting : Different websites are associated with different ports
on the same IP address. For example, one website might be accessible on port 80,
while another is on port 8080. Port-based virtual hosting can be used when IP
addresses are limited, but it’s not as common or user-friendly as name-based virtual
hosting and might require users to specify the port number in the URL.
gobuster
Gobuster is a versatile tool commonly used for directory and file brute-forcing, but it also
excels at virtual host discovery. It systematically sends HTTP requests with different Host
headers to a target IP address and then analyses the responses to identify valid virtual
hosts.
There are a couple of things you need to prepare to brute force Host headers:
1. Target Identification : First, identify the target web server's IP address. This can
be done through DNS lookups or other reconnaissance techniques.
2. Wordlist Preparation : Prepare a wordlist containing potential virtual host names.
You can use a pre-compiled wordlist, such as SecLists, or create a custom one based
on your target's industry, naming conventions, or other relevant information.
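With the target and wordlist ready, a typical Gobuster virtual host scan (shown here with placeholder values) looks something like this:
gobuster vhost -u http://<target_IP_address> -w <wordlist_file> --append-domain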
The -u flag specifies the target URL (replace <target_IP_address> with the actual
IP).
The -w flag specifies the wordlist file (replace <wordlist_file> with the path to your
wordlist).
The --append-domain flag appends the base domain to each word in the wordlist.
In newer versions of Gobuster, the --append-domain flag is required to append the base
domain to each word in the wordlist when performing virtual host discovery. This flag
ensures that Gobuster correctly constructs the full virtual hostnames, which is essential for
the accurate enumeration of potential subdomains.
In older versions of Gobuster, this functionality was handled differently, and the --append-
domain flag was not necessary. Users of older versions might not find this flag available or
needed, as the tool appended the base domain by default or employed a different
mechanism for virtual host generation.
Gobuster will output potential virtual hosts as it discovers them. Analyse the results
carefully, noting any unusual or interesting findings. Further investigation might be needed to
confirm the existence and functionality of the discovered virtual hosts.
There are a couple of other arguments that are worth knowing:
Consider using the -t flag to increase the number of threads for faster scanning.
The -k flag can ignore SSL/TLS certificate errors.
You can use the -o flag to save the output to a file for later analysis.
Virtual host discovery can generate significant traffic and might be detected by intrusion
detection systems (IDS) or web application firewalls (WAF). Exercise caution and obtain
proper authorization before scanning any targets.
In the sprawling mass of the internet, trust is a fragile commodity. One of the cornerstones of
this trust is the Secure Sockets Layer/Transport Layer Security ( SSL/TLS ) protocol,
which encrypts communication between your browser and a website. At the heart of
SSL/TLS lies the digital certificate , a small file that verifies a website's identity and
allows for secure, encrypted communication.
However, the process of issuing and managing these certificates isn't foolproof. Attackers
can exploit rogue or mis-issued certificates to impersonate legitimate websites, intercept
sensitive data, or spread malware. This is where Certificate Transparency (CT) logs come
into play.
Root Hash (hash of Hash 1 and Hash 2)
├── Hash 1 (hash of Cert 1 and Cert 2)
│   ├── Cert 1
│   └── Cert 2 (blog.inlanefreight.com)
└── Hash 2 (hash of Cert 3 and Cert 4)
    ├── Cert 3 (dev.inlanefreight.com)
    └── Cert 4 (api.inlanefreight.com)
Root Hash : The topmost node, a single hash representing the entire log's state.
Hash 1 & Hash 2 : Intermediate nodes, each a hash of two child nodes (either
certificates or other hashes).
Cert 1 - Cert 4 : Leaf nodes representing individual SSL/TLS certificates for different
subdomains of inlanefreight.com .
This structure allows for efficient verification of any certificate in the log. By providing the Merkle path (a series of hashes) for a particular certificate, anyone can verify that it is included in the log without downloading the entire log. For instance, to verify Cert 2 (blog.inlanefreight.com) , you would need the hash of Cert 1 (to recompute Hash 1) and Hash 2 (to combine with Hash 1 and recompute the Root Hash), which can then be compared against the published Root Hash.
This process ensures that even if a single bit of data in a certificate or the log itself is altered,
the root hash will change, immediately signaling tampering. This makes CT logs an
invaluable tool for maintaining the integrity and trustworthiness of SSL/TLS certificates,
ultimately enhancing internet security.
Furthermore, CT logs can unveil subdomains associated with old or expired certificates.
These subdomains might host outdated software or configurations, making them potentially
vulnerable to exploitation.
In essence, CT logs provide a reliable and efficient way to discover subdomains without the
need for exhaustive brute-forcing or relying on the completeness of wordlists. They offer a
unique window into a domain's history and can reveal subdomains that might otherwise
remain hidden, significantly enhancing your reconnaissance capabilities.
Searching CT Logs
There are two popular options for searching CT logs: crt.sh, a free web service (with a simple API) for looking up certificates by domain, and Censys, a broader internet-scanning platform that also indexes certificate data.
crt.sh lookup
While crt.sh offers a convenient web interface, you can also leverage its API for
automated searches directly from your terminal. Let's see how to find all 'dev' subdomains
on facebook.com using curl and jq :
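One way to do this (the exact filter is a matter of preference) is to request crt.sh's JSON output for the domain and extract the matching certificate names with jq:
curl -s "https://crt.sh/?q=facebook.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u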
*.dev.facebook.com
*.newdev.facebook.com
*.secure.dev.facebook.com
dev.facebook.com
devvm1958.ftw3.facebook.com
facebook-amex-dev.facebook.com
facebook-amex-sign-enc-dev.facebook.com
newdev.facebook.com
secure.dev.facebook.com
Fingerprinting
Fingerprinting focuses on identifying the technologies behind a website or web application: the web server software, operating system, frameworks, content management systems, and other components. This knowledge is valuable for several reasons:
Targeted Attacks : By knowing the specific technologies in use, attackers can focus their efforts on exploits and vulnerabilities that are known to affect those systems. This significantly increases the chances of a successful compromise.
Identifying Misconfigurations : Fingerprinting can expose misconfigured or
outdated software, default settings, or other weaknesses that might not be apparent
through other reconnaissance methods.
Prioritising Targets : When faced with multiple potential targets, fingerprinting
helps prioritise efforts by identifying systems more likely to be vulnerable or hold
valuable information.
Building a Comprehensive Profile : Combining fingerprint data with other
reconnaissance findings creates a holistic view of the target's infrastructure, aiding in
understanding its overall security posture and potential attack vectors.
Fingerprinting Techniques
There are several techniques used for web server and technology fingerprinting:
Banner Grabbing : Banner grabbing involves analysing the banners presented by web
servers and other services. These banners often reveal the server software, version
numbers, and other details.
Analysing HTTP Headers : HTTP headers transmitted with every web page request
and response contain a wealth of information. The Server header typically discloses
the web server software, while the X-Powered-By header might reveal additional
technologies like scripting languages or frameworks.
Probing for Specific Responses : Sending specially crafted requests to the target
can elicit unique responses that reveal specific technologies or versions. For example,
certain error messages or behaviours are characteristic of particular web servers or
software components.
Analysing Page Content : A web page's content, including its structure, scripts, and
other elements, can often provide clues about the underlying technologies. There may
be a copyright header that indicates specific software being used, for example.
A variety of tools exist that automate the fingerprinting process, combining various
techniques to identify web servers, operating systems, content management systems, and
other technologies:
Wappalyzer: Browser extension and online service for website technology profiling. Identifies a wide range of web technologies, including CMSs, frameworks, analytics tools, and more.
BuiltWith: Web technology profiler that provides detailed reports on a website's technology stack. Offers both free and paid plans with varying levels of detail.
WhatWeb: Command-line tool for website fingerprinting. Uses a vast database of signatures to identify various web technologies.
Nmap: Versatile network scanner that can be used for various reconnaissance tasks, including service and OS fingerprinting. Can be used with scripts (NSE) to perform more specialised fingerprinting.
Netcraft: Offers a range of web security services, including website fingerprinting and security reporting. Provides detailed reports on a website's technology, hosting provider, and security posture.
wafw00f: Command-line tool specifically designed for identifying Web Application Firewalls (WAFs). Helps determine if a WAF is present and, if so, its type and configuration.
Fingerprinting inlanefreight.com
Let's apply our fingerprinting knowledge to uncover the digital DNA of our purpose-built host,
inlanefreight.com . We'll leverage both manual and automated techniques to gather
information about its web server, technologies, and potential vulnerabilities.
Banner Grabbing
Our first step is to gather information directly from the web server itself. We can do this using
the curl command with the -I flag (or --head ) to fetch only the HTTP headers, not the
entire page content.
curl -I inlanefreight.com
The output includes the server banner, revealing the web server software and version number. In this case, the response is a redirect to the HTTPS version of the site, so we follow it:
curl -I https://inlanefreight.com
This time we receive a more interesting set of headers: the server redirects us once more, and the response shows that it is WordPress performing the redirection to https://www.inlanefreight.com/ :
curl -I https://www.inlanefreight.com
HTTP/1.1 200 OK
Date: Fri, 31 May 2024 12:12:26 GMT
Server: Apache/2.4.41 (Ubuntu)
Link: <https://www.inlanefreight.com/index.php/wp-json/>;
rel="https://api.w.org/"
Link: <https://www.inlanefreight.com/index.php/wp-json/wp/v2/pages/7>;
rel="alternate"; type="application/json"
Link: <https://www.inlanefreight.com/>; rel=shortlink
Content-Type: text/html; charset=UTF-8
The response includes a few more interesting headers, among them paths containing wp-json . The wp- prefix is commonly associated with WordPress.
Wafw00f
Web Application Firewalls ( WAFs ) are security solutions designed to protect web
applications from various attacks. Before proceeding with further fingerprinting, it's crucial to
determine if inlanefreight.com employs a WAF, as it could interfere with our probes or
potentially block our requests.
To detect the presence of a WAF, we'll use the wafw00f tool. If it isn't already installed, it can be installed with pip3:
pip3 install wafw00f
Once it's installed, pass the domain you want to check as an argument to the tool:
wafw00f inlanefreight.com
[...]
~ WAFW00F : v2.2.0 ~
The Web Application Firewall Fingerprinting Toolkit
[...]
The wafw00f scan on inlanefreight.com reveals that the website is protected by the
Wordfence Web Application Firewall ( WAF ), developed by Defiant.
This means the site has an additional security layer that could block or filter our
reconnaissance attempts. In a real-world scenario, it would be crucial to keep this in mind as
you proceed with further investigation, as you might need to adapt techniques to bypass or
evade the WAF's detection mechanisms.
Nikto
Nikto is a powerful open-source web server scanner. In addition to its primary function as a
vulnerability assessment tool, Nikto's fingerprinting capabilities provide insights into a
website's technology stack.
Nikto is pre-installed on pwnbox, but if you need to install it on your own system, it is straightforward to set up.
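On Debian-based distributions, one simple approach (assuming Nikto is available in your package repositories, as it is on Kali) is:
sudo apt update && sudo apt install -y nikto
Alternatively, you can clone the project from GitHub (https://github.com/sullo/nikto) and run the nikto.pl script from its program directory.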
To scan inlanefreight.com using Nikto , only running the fingerprinting modules, execute the following command:
nikto -h inlanefreight.com -Tuning b
The -h flag specifies the target host, and the -Tuning b flag tells Nikto to run only its Software Identification modules.
Nikto will then initiate a series of tests, attempting to identify outdated software, insecure
files or configurations, and other potential security risks.
- Nikto v2.5.0
---------------------------------------------------------------------------
+ Multiple IPs found: 134.209.24.248, 2a03:b0c0:1:e0::32c:b001
+ Target IP: 134.209.24.248
+ Target Hostname: www.inlanefreight.com
+ Target Port: 443
---------------------------------------------------------------------------
+ SSL Info: Subject: /CN=inlanefreight.com
Altnames: inlanefreight.com, www.inlanefreight.com
Ciphers: TLS_AES_256_GCM_SHA384
Issuer: /C=US/O=Let's Encrypt/CN=R3
+ Start Time: 2024-05-31 13:35:54 (GMT0)
---------------------------------------------------------------------------
+ Server: Apache/2.4.41 (Ubuntu)
+ /: Link header found with value: ARRAY(0x558e78790248). See:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Link
+ /: The site uses TLS and the Strict-Transport-Security HTTP header is
not defined. See: https://developer.mozilla.org/en-
US/docs/Web/HTTP/Headers/Strict-Transport-Security
+ /: The X-Content-Type-Options header is not set. This could allow the
user agent to render the content of the site in a different fashion to the
MIME type. See: https://www.netsparker.com/web-vulnerability-
scanner/vulnerabilities/missing-content-type-header/
+ /index.php?: Uncommon header 'x-redirect-by' found, with contents:
WordPress.
+ No CGI Directories found (use '-C all' to force check all possible dirs)
+ /: The Content-Encoding header is set to "deflate" which may mean that
the server is vulnerable to the BREACH attack. See:
http://breachattack.com/
+ Apache/2.4.41 appears to be outdated (current is at least 2.4.59).
Apache 2.2.34 is the EOL for the 2.x branch.
+ /: Web Server returns a valid response with junk HTTP methods which may
cause false positives.
+ /license.txt: License file found may identify site software.
+ /: A Wordpress installation was found.
+ /wp-login.php?action=register: Cookie wordpress_test_cookie created
without the httponly flag. See: https://developer.mozilla.org/en-
US/docs/Web/HTTP/Cookies
+ /wp-login.php:X-Frame-Options header is deprecated and has been replaced
with the Content-Security-Policy HTTP header with the frame-ancestors
directive instead. See: https://developer.mozilla.org/en-
US/docs/Web/HTTP/Headers/X-Frame-Options
+ /wp-login.php: Wordpress login found.
+ 1316 requests: 0 error(s) and 12 item(s) reported on remote host
+ End Time: 2024-05-31 13:47:27 (GMT0) (693 seconds)
---------------------------------------------------------------------------
+ 1 host(s) tested
Crawling
Crawling , often called spidering , is the automated process of systematically
browsing the World Wide Web . Similar to how a spider navigates its web, a web crawler
follows links from one page to another, collecting information. These crawlers are essentially
bots that use pre-defined algorithms to discover and index web pages, making them
accessible through search engines or for other purposes like data analysis and web
reconnaissance.
1. Homepage : You start with the homepage containing link1 , link2 , and link3 .
Homepage
├── link1
├── link2
└── link3
2. Visiting link1 : Visiting link1 shows the homepage, link2 , and also link4 and
link5 .
link1 Page
├── Homepage
├── link2
├── link4
└── link5
3. Continuing the Crawl : The crawler continues to follow these links systematically,
gathering all accessible pages and their links.
This example illustrates how a web crawler discovers and collects information by
systematically following links, distinguishing it from fuzzing which involves guessing potential
links.
Breadth-First Crawling
[Diagram: breadth-first crawling - starting from the seed URL, the crawler first visits all the pages linked directly from it (Page 1, Page 2, Page 3) before moving on to the pages those link to (Page 4 through Page 7).]
Breadth-first crawling prioritizes exploring a website's width before going deep. It starts
by crawling all the links on the seed page, then moves on to the links on those pages, and
so on. This is useful for getting a broad overview of a website's structure and content.
Depth-First Crawling
[Diagram: depth-first crawling - starting from the seed URL, the crawler follows a single chain of links as deep as it can before backtracking to explore other branches.]
In contrast, depth-first crawling prioritizes depth over breadth. It follows a single path of
links as far as possible before backtracking and exploring other paths. This can be useful for
finding specific content or reaching deep into a website's structure.
The choice of strategy depends on the specific goals of the crawling process.
A single piece of information, like a comment mentioning a specific software version, might
not seem significant on its own. However, when combined with other findings—such as an
outdated version listed in metadata or a potentially vulnerable configuration file discovered
through crawling—it can transform into a critical indicator of a potential vulnerability.
The true value of extracted data lies in connecting the dots and constructing a
comprehensive picture of the target's digital landscape.
For instance, a list of extracted links might initially appear mundane. But upon closer
examination, you notice a pattern: several URLs point to a directory named /files/ . This
triggers your curiosity, and you decide to manually visit the directory. To your surprise, you
find that directory browsing is enabled, exposing a host of files, including backup archives,
internal documents, and potentially sensitive data. This discovery wouldn't have been
possible by merely looking at individual links in isolation; the contextual analysis led you to
this critical finding.
Similarly, seemingly innocuous comments can gain significance when correlated with other
discoveries. A comment mentioning a "file server" might not raise any red flags initially.
However, when combined with the aforementioned discovery of the /files/ directory, it
reinforces the possibility that the file server is publicly accessible, potentially exposing
sensitive information or confidential data.
Therefore, it's essential to approach data analysis holistically, considering the relationships
between different data points and their potential implications for your reconnaissance goals.
robots.txt
Imagine you're a guest at a grand house party. While you're free to mingle and explore, there
might be certain rooms marked "Private" that you're expected to avoid. This is akin to how
robots.txt functions in the world of web crawling. It acts as a virtual " etiquette guide "
for bots, outlining which areas of a website they are allowed to access and which are off-
limits.
What is robots.txt?
Technically, robots.txt is a simple text file placed in the root directory of a website (e.g.,
www.example.com/robots.txt ). It adheres to the Robots Exclusion Standard, guidelines
for how web crawlers should behave when visiting a website. This file contains instructions in
the form of "directives" that tell bots which parts of the website they can and cannot crawl.
User-agent: *
Disallow: /private/
This directive tells all user-agents ( * is a wildcard) that they are not allowed to access any
URLs that start with /private/ . Other directives can allow access to specific directories or
files, set crawl delays to avoid overloading a server or provide links to sitemaps for efficient
crawling.
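Because robots.txt is simply a text file served from a predictable location, you can retrieve it directly during reconnaissance, for example:
curl -s https://www.example.com/robots.txt
Each robots.txt file is built from a few simple components: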
1. User-agent : This line specifies which crawler or bot the following rules apply to. A
wildcard ( * ) indicates that the rules apply to all bots. Specific user agents can also be
targeted, such as "Googlebot" (Google's crawler) or "Bingbot" (Microsoft's crawler).
2. Directives : These lines provide specific instructions to the identified user-agent. Common directives include Disallow (paths the crawler should not access), Allow (paths it is explicitly permitted to access), Crawl-delay (the number of seconds to wait between requests), and Sitemap (the location of the site's XML sitemap).
Analyzing robots.txt
Here's an example of a robots.txt file:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
User-agent: Googlebot
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
All user agents are disallowed from accessing the /admin/ and /private/
directories.
All user agents are allowed to access the /public/ directory.
The Googlebot (Google's web crawler) is specifically instructed to wait 10 seconds
between requests.
The sitemap, located at https://www.example.com/sitemap.xml , is provided for
easier crawling and indexing.
By analyzing this robots.txt, we can infer that the website likely has an admin panel located
at /admin/ and some private content in the /private/ directory.
Well-Known URIs
The .well-known standard, defined in RFC 8615, serves as a standardized directory within
a website's root domain. This designated location, typically accessible via the /.well-
known/ path on a web server, centralizes a website's critical metadata, including
configuration files and information related to its services, protocols, and security
mechanisms.
By establishing a consistent location for such data, .well-known simplifies the discovery
and access process for various stakeholders, including web browsers, applications, and
security tools. This streamlined approach enables clients to automatically locate and retrieve
specific configuration files by constructing the appropriate URL. For instance, to access a
website's security policy, a client would request https://example.com/.well-
known/security.txt .
The IANA registry lists many .well-known URIs, such as security.txt (a standard location for a site's security contact information), change-password (a well-known URL for its password-change page), and openid-configuration (OpenID Connect discovery metadata). Each entry in the registry offers specific guidelines and requirements for implementation, ensuring a standardized approach to leveraging the .well-known mechanism for various applications.
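The openid-configuration URI is particularly useful during reconnaissance. OpenID Connect providers publish their discovery metadata at the path /.well-known/openid-configuration , so for a hypothetical provider you could retrieve it with:
curl -s https://example.com/.well-known/openid-configuration | jq .
A response might look like the following: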
{
"issuer": "https://example.com",
"authorization_endpoint": "https://example.com/oauth2/authorize",
"token_endpoint": "https://example.com/oauth2/token",
"userinfo_endpoint": "https://example.com/oauth2/userinfo",
"jwks_uri": "https://example.com/oauth2/jwks",
"response_types_supported": ["code", "token", "id_token"],
"subject_types_supported": ["public"],
"id_token_signing_alg_values_supported": ["RS256"],
"scopes_supported": ["openid", "profile", "email"]
}
1. Endpoint Discovery :
Authorization Endpoint : Identifying the URL for user authorization requests.
Token Endpoint : Finding the URL where tokens are issued.
Userinfo Endpoint : Locating the endpoint that provides user information.
2. JWKS URI : The jwks_uri reveals the JSON Web Key Set ( JWKS ), detailing the
cryptographic keys used by the server.
3. Supported Scopes and Response Types : Understanding which scopes and response
types are supported helps in mapping out the functionality and limitations of the OpenID
Connect implementation.
4. Algorithm Details : Information about supported signing algorithms can be crucial for
understanding the security measures in place.
Exploring the IANA Registry and experimenting with the various .well-known URIs is an
invaluable approach to uncovering additional web reconnaissance opportunities. As
demonstrated with the openid-configuration endpoint above, these standardized URIs
provide structured access to critical metadata and configuration details, enabling security
professionals to comprehensively map out a website's security landscape.
Creepy Crawlies
Web crawling is vast and intricate, but you don't have to embark on this journey alone. A
plethora of web crawling tools are available to assist you, each with its own strengths and
specialties. These tools automate the crawling process, making it faster and more efficient,
allowing you to focus on analyzing the extracted data.
Scrapy
We will leverage Scrapy and a custom spider tailored for reconnaissance on
inlanefreight.com . If you are interested in more information on crawling/spidering
techniques, refer to the " Using Web Proxies" module, as it forms part of CBBH as well.
Installing Scrapy
Before we begin, ensure you have Scrapy installed on your system. If you don't, you can easily install it using pip, the Python package installer:
pip3 install scrapy
This command will download and install Scrapy along with its dependencies, preparing your environment for building our spider.
ReconSpider
First, run this command in your terminal to download the custom scrapy spider,
ReconSpider , and extract it to the current working directory.
wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
unzip ReconSpider.zip
With the files extracted, you can run ReconSpider.py using the following command:
python3 ReconSpider.py inlanefreight.com
Replace inlanefreight.com with the domain you want to spider. The spider will crawl the target and collect valuable information.
results.json
After running ReconSpider.py , the data will be saved in a JSON file, results.json . This
file can be explored using any text editor. Below is the structure of the JSON file produced:
{
"emails": [
"[email protected]",
"[email protected]",
...
],
"links": [
"https://www.themeansar.com",
"https://www.inlanefreight.com/index.php/offices/",
...
],
"external_files": [
"https://www.inlanefreight.com/wp-
content/uploads/2020/09/goals.pdf",
...
],
"js_files": [
"https://www.inlanefreight.com/wp-includes/js/jquery/jquery-
migrate.min.js?ver=3.3.2",
...
],
"form_fields": [],
"images": [
"https://www.inlanefreight.com/wp-
content/uploads/2021/03/AboutUs_01-1024x810.png",
...
],
"videos": [],
"audio": [],
"comments": [
"<!-- #masthead -->",
...
]
}
Each key in the JSON file represents a different type of data extracted from the target website: email addresses, internal and external links, external files (such as PDFs), JavaScript files, form fields, images, videos, audio files, and HTML comments.
By exploring this JSON structure, you can gain valuable insights into the web application's
architecture, content, and potential points of interest for further investigation.
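Because the results are plain JSON, you can also carve out individual sections from the command line with jq; for example, to list only the discovered email addresses (using the key names shown above):
jq -r '.emails[]' results.json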
Search Engine Discovery
Search engines serve as our guides in the vast landscape of the internet, helping us
navigate through the seemingly endless expanse of information. However, beyond their
primary function of answering everyday queries, search engines also hold a treasure trove of
data that can be invaluable for web reconnaissance and information gathering. This practice,
known as search engine discovery or OSINT (Open Source Intelligence) gathering, involves
using search engines as powerful tools to uncover information about target websites,
organisations, and individuals.
At its core, search engine discovery leverages the immense power of search algorithms to
extract data that may not be readily visible on websites. Security professionals and
researchers can delve deep into the indexed web by employing specialised search
operators, techniques, and tools, uncovering everything from employee information and
sensitive documents to hidden login pages and exposed credentials.
Search engine discovery is a particularly attractive reconnaissance technique for several reasons:
Open Source: The information gathered is publicly accessible, making it a legal and ethical way to gain insights into a target.
Breadth of Information: Search engines index a vast portion of the web, offering a wide range of potential information sources.
Ease of Use: Search engines are user-friendly and require no specialised technical skills.
Cost-Effective: It's a free and readily available resource for information gathering.
The information you can pull together from Search Engines can be applied in several
different ways as well:
Security Assessment : Identifying vulnerabilities, exposed data, and potential attack
vectors.
Competitive Intelligence : Gathering information about competitors' products,
services, and strategies.
Investigative Journalism : Uncovering hidden connections, financial transactions,
and unethical practices.
Threat Intelligence : Identifying emerging threats, tracking malicious actors, and
predicting potential attacks.
However, it's important to note that search engine discovery has limitations. Search engines
do not index all information, and some data may be deliberately hidden or protected.
Search Operators
Search operators are like search engines' secret codes. These special commands and
modifiers unlock a new level of precision and control, allowing you to pinpoint specific types
of information amidst the vastness of the indexed web.
While the exact syntax may vary slightly between search engines, the underlying principles
remain consistent. Commonly used operators include site: (limit results to a specific domain),
inurl: (match terms appearing in the URL), intitle: (match terms in the page title), and
filetype: or ext: (match specific file extensions). Let's see how these operators are
combined in practice.
Google Dorking
Google Dorking, also known as Google Hacking, is a technique that leverages the power of
search operators to uncover sensitive information, security vulnerabilities, or hidden content
on websites, using Google Search.
Here are some common examples of Google Dorks (a short query-generation sketch follows
them); for many more, refer to the Google Hacking Database:
Finding Login Pages:
site:example.com inurl:login
site:example.com (inurl:login OR inurl:admin)
Identifying Exposed Files:
site:example.com filetype:pdf
site:example.com (filetype:xls OR filetype:docx)
Uncovering Configuration Files:
site:example.com inurl:config.php
site:example.com (ext:conf OR ext:cnf) (searches for extensions commonly used for configuration files)
Locating Database Backups:
site:example.com inurl:backup
site:example.com filetype:sql
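Since the same handful of dorks tends to be reused against every new target, it can help to generate them from a template. The Python sketch below is a hypothetical helper; the dork templates and the example.com placeholder are assumptions, and it only prints queries for manual use, since scripting the searches themselves may violate a search engine's terms of service.

from urllib.parse import quote_plus

TARGET = "example.com"  # replace with the domain in scope

# Common dork templates; {d} is substituted with the target domain
DORKS = [
    "site:{d} inurl:login",
    "site:{d} (inurl:login OR inurl:admin)",
    "site:{d} filetype:pdf",
    "site:{d} (ext:conf OR ext:cnf)",
    "site:{d} inurl:backup",
    "site:{d} filetype:sql",
]

for template in DORKS:
    query = template.format(d=TARGET)
    # Print the raw query and a Google search URL that can be opened in a browser
    print(query)
    print("    https://www.google.com/search?q=" + quote_plus(query))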
Web Archives
In the fast-paced digital world, websites come and go, leaving only fleeting traces of their
existence behind. However, thanks to the Internet Archive's Wayback Machine, we have a
unique opportunity to revisit the past and explore the digital footprints of websites as they
once were.
The Wayback Machine is a digital archive of the World Wide Web and other information on
the Internet. Founded by the Internet Archive, a non-profit organization, it has been archiving
websites since 1996.
It allows users to "go back in time" and view snapshots of websites as they appeared at
various points in their history. These snapshots, known as captures or archives, provide a
glimpse into the past versions of a website, including its design, content, and functionality.
The Wayback Machine builds this archive in three main stages:
1. Crawling : The Wayback Machine employs automated web crawlers, often called
"bots," to browse the internet systematically. These bots follow links from one webpage
to another, like how you would click hyperlinks to explore a website. However, instead
of just reading the content, these bots download copies of the webpages they
encounter.
2. Archiving : The downloaded webpages, along with their associated resources like
images, stylesheets, and scripts, are stored in the Wayback Machine's vast archive.
Each captured webpage is linked to a specific date and time, creating a historical
snapshot of the website at that moment. This archiving process happens at regular
intervals, sometimes daily, weekly, or monthly, depending on the website's popularity
and frequency of updates.
3. Accessing : Users can access these archived snapshots through the Wayback
Machine's interface. By entering a website's URL and selecting a date, you can view
how the website looked at that specific point in time. The Wayback Machine allows you to
browse individual pages and provides tools to search for specific terms within the
archived content or download entire archived websites for offline analysis; a small
scripted example of querying the archive follows this list.
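Snapshots can also be enumerated programmatically through the Internet Archive's CDX API. The Python sketch below is a minimal example under a few assumptions: example.com is a placeholder target, the limit of 10 captures is arbitrary, and the JSON response (a header row followed by capture rows) matches the API's current behaviour.

import json
import urllib.request
from urllib.parse import urlencode

TARGET = "example.com"  # replace with the domain in scope

# Ask the Wayback Machine CDX API for a handful of captures of the target
params = urlencode({"url": TARGET, "output": "json", "limit": 10})
with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}", timeout=30) as resp:
    rows = json.load(resp)

# The first row is a header; each following row describes one capture
for row in rows[1:]:
    timestamp, original = row[1], row[2]
    # Archived copies are browsable at /web/<timestamp>/<original URL>
    print(f"{timestamp}  https://web.archive.org/web/{timestamp}/{original}")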
The frequency with which the Wayback Machine archives a website varies. Some websites
might be archived multiple times a day, while others might only have a few snapshots spread
out over several years. Factors that influence this frequency include the website's popularity,
its rate of change, and the resources available to the Internet Archive.
It's important to note that the Wayback Machine does not capture every single webpage
online. It prioritizes websites deemed to be of cultural, historical, or research value.
Additionally, website owners can request that their content be excluded from the Wayback
Machine, although this is not always guaranteed.
Why the Wayback Machine Matters for Web Reconnaissance
The Wayback Machine is a treasure trove for web reconnaissance, offering information that
can be instrumental in various scenarios. Its significance lies in its ability to unveil a
website's past, providing insights that may not be readily apparent in its current state:
old pages, files, and subdomains that have since been removed may still be reachable in the
archive, and comparing snapshots over time reveals how a site's structure and technology
have changed.
Automating Recon
While manual reconnaissance can be effective, it is also time-consuming and prone to
human error. Automating web reconnaissance tasks can significantly enhance efficiency and
accuracy, allowing you to gather information at scale and identify potential vulnerabilities
more rapidly. The key advantages of automation include the following (a short scripted
example follows the list):
Efficiency : Automated tools can perform repetitive tasks much faster than humans,
freeing up valuable time for analysis and decision-making.
Scalability : Automation allows you to scale your reconnaissance efforts across a
large number of targets or domains, uncovering a broader scope of information.
Consistency : Automated tools follow predefined rules and procedures, ensuring
consistent and reproducible results and minimising the risk of human error.
Comprehensive Coverage : Automation can be programmed to perform a wide range of
reconnaissance tasks, including DNS enumeration, subdomain discovery, web
crawling, port scanning, and more, ensuring thorough coverage of potential attack
vectors.
Integration : Many automation frameworks allow for easy integration with other tools
and platforms, creating a seamless workflow from reconnaissance to vulnerability
assessment and exploitation.
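Before turning to full frameworks, a tiny script already illustrates these benefits. The Python sketch below is an illustrative example rather than one of the tools covered in this module; it assumes the system whois client is installed, that the listed domains are placeholders you are authorised to test, and that their robots.txt files are served over HTTPS.

import subprocess
import urllib.request

DOMAINS = ["example.com", "example.org"]  # replace with in-scope targets

for domain in DOMAINS:
    # WHOIS lookup via the system whois client
    whois = subprocess.run(["whois", domain], capture_output=True, text=True)
    print(f"=== {domain}: WHOIS (first 5 lines) ===")
    print("\n".join(whois.stdout.splitlines()[:5]))

    # Fetch robots.txt, which often reveals paths the owner would rather keep quiet
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            print(f"=== {domain}: robots.txt (first 500 bytes) ===")
            print(resp.read(500).decode(errors="replace"))
    except OSError as err:
        print(f"[!] Could not fetch robots.txt for {domain}: {err}")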
Reconnaissance Frameworks
These frameworks aim to provide a complete suite of tools for web reconnaissance; in this
section, we will focus on FinalRecon.
FinalRecon
FinalRecon offers a wealth of recon information, as reflected in the options of its help menu:
optional arguments:
-h, --help show this help message and exit
--url URL Target URL
--headers Header Information
--sslinfo SSL Certificate Information
--whois Whois Lookup
--crawl Crawl Target
--dns DNS Enumeration
--sub Sub-Domain Enumeration
--dir Directory Search
--wayback Wayback URLs
--ps Fast Port Scan
--full Full Recon
Extra Options:
-nb Hide Banner
-dt DT Number of threads for directory enum [ Default : 30 ]
-pt PT Number of threads for port scan [ Default : 50 ]
-T T Request Timeout [ Default : 30.0 ]
-w W Path to Wordlist [ Default : wordlists/dirb_common.txt ]
-r Allow Redirect [ Default : False ]
-s Toggle SSL Verification [ Default : True ]
-sp SP Specify SSL Port [ Default : 443 ]
-d D Custom DNS Servers [ Default : 1.1.1.1 ]
-e E File Extensions [ Example : txt, xml, php ]
-o O Export Format [ Default : txt ]
-cd CD Change export directory [ Default : ~/.local/share/finalrecon ]
-k K Add API key [ Example : shodan@key ]
To get started, you will first clone the FinalRecon repository from GitHub using git clone
https://github.com/thewhiteh4t/FinalRecon.git . This will create a new directory
named "FinalRecon" containing all the necessary files.
Next, navigate into the newly created directory with cd FinalRecon . Once inside, you will
install the required Python dependencies using pip3 install -r requirements.txt . This
ensures that FinalRecon has all the libraries and modules it needs to function correctly.
To ensure that the main script is executable, you will need to change the file permissions
using chmod +x ./finalrecon.py . This allows you to run the script directly from your
terminal.
Finally, you can verify that FinalRecon is installed correctly and get an overview of its
available options by running ./finalrecon.py --help . This displays the help message shown
earlier, detailing the tool's modules and their respective options.
For instance, if we want FinalRecon to gather header information and perform a Whois
lookup for inlanefreight.com, we would use the corresponding flags (--headers and
--whois), so the command would be:
./finalrecon.py --headers --whois --url http://inlanefreight.com
______ __ __ __ ______ __
/\ ___\/\ \ /\ "-.\ \ /\ __ \ /\ \
\ \ __\\ \ \\ \ \-. \\ \ __ \\ \ \____
\ \_\ \ \_\\ \_\\"\_\\ \_\ \_\\ \_____\
\/_/ \/_/ \/_/ \/_/ \/_/\/_/ \/_____/
______ ______ ______ ______ __ __
/\ == \ /\ ___\ /\ ___\ /\ __ \ /\ "-.\ \
\ \ __< \ \ __\ \ \ \____\ \ \/\ \\ \ \-. \
\ \_\ \_\\ \_____\\ \_____\\ \_____\\ \_\\"\_\
\/_/ /_/ \/_____/ \/_____/ \/_____/ \/_/ \/_/
[!] Headers :
Skills Assessment
To complete the skills assessment, answer the questions below. You will need to apply a
variety of skills learned in this module, including:
Using whois
Analysing robots.txt
Performing subdomain bruteforcing
Crawling and analysing results