Web Scraping in Java With Jsoup
Last Updated :
21 Mar, 2024
Web scraping is the process of extracting data from websites, and it is a practical way to collect data from online sources. Jsoup is a Java library that makes handling HTML content easier. Let's learn how to build a basic web scraper with Jsoup.
Prerequisites
To follow along, you need Java and Maven installed, since the steps below build and run a Maven project.
Concept
Jsoup helps us to read HTML documents. It lets us follow the document's structure and extract the data we want. We use CSS selectors or DOM traversal methods for this. With Jsoup, we go to a website, get its HTML, and take out things like text, links or images.
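The two approaches mentioned above, CSS selectors and DOM traversal, can be sketched on an in-memory HTML string so that no network access is needed. The class name SelectorSketch, its helper methods, and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical sample markup, used only for illustration.
    static final String SAMPLE_HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div id=\"menu\"><a href=\"/home\">Home</a><a href=\"/docs\">Docs</a></div>"
        + "<p class=\"intro\">Hello <b>Jsoup</b></p>"
        + "</body></html>";

    // CSS selector: collect the href of every link inside the element with id "menu".
    static List<String> menuLinks(Document doc) {
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("#menu a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // DOM traversal: find the <b> element, then step up to its parent paragraph.
    static String introText(Document doc) {
        Element bold = doc.select("b").first();
        return bold == null ? "" : bold.parent().text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(doc.title());
        System.out.println(menuLinks(doc));
        System.out.println(introText(doc));
    }
}
```

Parsing a string with Jsoup.parse is a convenient way to try out selectors before pointing the scraper at a live site.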
Step-by-Step Implementation
Now, let's create a basic Java project using Maven.
Step 1: Create a Java Maven project
Open the cmd/terminal and run the following command to create a new Maven project.
mvn archetype:generate -DgroupId=com.example -DartifactId=java-jsoup1 -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

This command generates a basic Maven project structure in a new java-jsoup1 folder. When it finishes, Maven reports BUILD SUCCESS.
Step 2: Add Jsoup Dependency
Open the pom.xml file in the project folder, add the Jsoup dependency to it, and save the file.
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.14.3</version>
    </dependency>
</dependencies>
Step 3: Create a Java File
In the src/main/java/com/example folder, create a Java file named MyScrapper.java.
Java
package com.example;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MyScrapper {
    public static void main(String[] args)
    {
        try {
            // Fetch the page and parse its HTML into a Document
            Document doc = Jsoup.connect("https://geeksforgeeks.org/").get();

            // Select every <a> with an href and every <img> with a src
            Elements links = doc.select("a[href]");
            Elements images = doc.select("img[src]");

            // Print the target of each link
            System.out.println("Links: ");
            for (Element link : links) {
                System.out.println(link.attr("href"));
            }

            System.out.println("\n-----\n");

            // Print the source of each image
            System.out.println("Images:");
            for (Element image : images) {
                System.out.println(image.attr("src"));
            }
        }
        catch (IOException e) {
            // Thrown when the connection or the download fails
            e.printStackTrace();
        }
    }
}
Explanation:
- First, we declare that the class is part of the com.example package.
- After that, we import the classes we need from the Jsoup library and the java.io package.
- Then we create a class named MyScrapper with the main method, which is the entry point of the program. The scraping code sits inside a try block, and the catch block handles the IOException that can occur if the connection or download fails.
- We use Jsoup.connect("https://geeksforgeeks.org/").get() to establish a connection to the website and fetch its HTML content as a Document object.
- Selecting Elements:
  - doc.select("a[href]") selects all anchor elements (<a>) with an href attribute and stores them in the links variable.
  - doc.select("img[src]") selects all image elements (<img>) with a src attribute and stores them in the images variable.
- Finally, the extracted values are printed to the console.
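The href and src values printed by the program are the raw attribute values, so some of them may be relative paths such as /about. A minimal sketch of resolving them to absolute URLs with Jsoup's absUrl method follows; the class name AbsoluteLinksSketch, the sample HTML, and the example.com base URI are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinksSketch {

    // Returns every link target resolved against the given base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // absUrl returns an empty string when no absolute URL can be built
            String url = link.absUrl("href");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute link
        String html = "<a href=\"/about\">About</a><a href=\"https://example.org/x\">X</a>";
        System.out.println(absoluteLinks(html, "https://example.com/"));
    }
}
```

When a page is fetched with Jsoup.connect(...).get(), the document's base URI should already be the fetched URL, so calling link.absUrl("href") in place of link.attr("href") in MyScrapper is likely all that is needed.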
Step 4: Run the Program
To run the project, use the following Maven commands.
mvn compile
mvn exec:java -Dexec.mainClass="com.example.MyScrapper"
Note: The project can also be run directly by pressing the run icon in the IDE.
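Output: The program prints the href value of every link found on the page, then a ----- separator, then the src value of every image.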
This article showed how to make a basic web scraper using Jsoup in Java. Make sure to follow website rules and scrape responsibly and fairly.
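One possible way to put responsible scraping into practice is to identify the scraper with a User-Agent header, set a timeout, and pause between requests. The sketch below is only an illustration: the class name PoliteFetchSketch, the user agent string, the timeout, the pause length, and the example.com URL are all assumptions, not values taken from this article or required by Jsoup.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.List;

public class PoliteFetchSketch {

    // Hypothetical values chosen for illustration; adjust them to the target site's rules.
    static final String USER_AGENT = "MyScrapperBot/1.0 (contact: you@example.com)";
    static final int TIMEOUT_MS = 10_000;
    static final long PAUSE_MS = 2_000;

    // Builds a connection that identifies itself and has a timeout; no request is sent yet.
    static Connection configure(String url) {
        return Jsoup.connect(url).userAgent(USER_AGENT).timeout(TIMEOUT_MS);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = List.of("https://example.com/"); // placeholder URL
        for (String url : urls) {
            try {
                Document doc = configure(url).get();
                System.out.println(doc.title());
            } catch (IOException e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
            // Pause between requests to avoid overloading the server
            Thread.sleep(PAUSE_MS);
        }
    }
}
```

Checking the site's terms of use and its robots.txt file before running a scraper like this is still up to the author of the scraper.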