Working With HTML Elements With Jsoup
In a previous article I covered how to parse HTML documents with jsoup, including how to add the library to your project and how to get started coding with it, so I will not repeat that here. This article is dedicated to working with HTML elements with jsoup in Java, and it picks up where the code in the previous article left off.
An HTML start tag and its matching end tag (or a single self-closing tag) define a section of the document that can hold other tags, comments, CDATA, and text. We call each such section an element. An element therefore consists of a tag name, the attributes of the tag, and child nodes. In jsoup, Node is the superclass of Element. Comments, CDATA sections, and plain text are also represented by subclasses of Node. Note that attributes are not nodes in jsoup: they are kept in a separate Attributes object that belongs to the element.
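To make the node hierarchy concrete, here is a minimal sketch that parses a small piece of markup and lists the type of each direct child node of a div. The HTML string and the class name NodeKindsSketch are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class NodeKindsSketch {
    // Hypothetical sample markup: a div holding text, an element and a comment.
    static final String HTML =
        "<div id=\"box\">plain text<span>child</span><!-- note --></div>";

    // Returns the simple class name of every direct child node of the div.
    static List<String> childNodeTypes() {
        Document doc = Jsoup.parse(HTML);
        Element div = doc.getElementById("box");
        List<String> types = new ArrayList<>();
        for (Node child : div.childNodes()) {
            types.add(child.getClass().getSimpleName());
        }
        return types;
    }

    public static void main(String[] args) {
        // childNodes() sees all three nodes: [TextNode, Element, Comment]
        System.out.println(childNodeTypes());
        // children() only returns child elements, so the span alone: 1
        System.out.println(Jsoup.parse(HTML).getElementById("box").children().size());
    }
}
```

The difference between childNodes() and children() is worth remembering: the first returns every kind of node, the second only elements.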
Creating Elements
To create an element, create a new object of the Element class and pass the tag name to the constructor as a string. So, our code should look like the following:
Element div1 = new Element("div");
If we want to see some output we can write our code like the following:
import org.jsoup.nodes.Element;

public class JavaSoup {
    public static void main(String[] args) {
        Element div1 = new Element("div");
        System.out.println(div1);
    }
}
Outputs:
<div></div>
There is another constructor that also sets a base URI for the element. Unlike the first constructor, the two constructors that take a base URI do not accept the tag name as a string. Instead, they require a Tag object, which we can obtain with Tag.valueOf():
Element div1 = new Element(Tag.valueOf("div"), "http://example.com");
Adding Attributes To Elements
We can set attributes by invoking the attr() method on an element. Alternatively, we can supply them when the element is created, using the third constructor. It accepts a Tag object as the first argument, the base URI as the second, and an Attributes object as the third.
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

public class JavaSoup {
    public static void main(String[] args) {
        // Collect the attributes first, then pass them to the constructor.
        Attributes atrs = new Attributes();
        atrs.put("class", "myclass");
        atrs.put("id", "div1");
        Element div1 = new Element(Tag.valueOf("div"), "http://example.com", atrs);
        System.out.println(div1);
    }
}
Outputs:
<div class="myclass" id="div1"></div>
If you want to set attributes later, invoke the attr() method on the element. The value can be either a string or a boolean. Add the following line just before the System.out.println(div1) call in the code above.
div1.attr("name", "no-name");
Outputs:
<div class="myclass" id="div1" name="no-name"></div>
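As a small sketch of the string and boolean forms of attr(), plus reading and removing attributes, consider the following. The input element, its attribute names, and the class name AttrSketch are chosen only for illustration.

```java
import org.jsoup.nodes.Element;

class AttrSketch {
    // Builds a detached <input> with one string and one boolean attribute.
    static Element build() {
        Element input = new Element("input");
        input.attr("name", "email");   // string value
        input.attr("disabled", true);  // boolean attribute: present without a value
        return input;
    }

    public static void main(String[] args) {
        Element input = build();
        System.out.println(input.attr("name"));              // email
        System.out.println(input.hasAttr("disabled"));       // true
        System.out.println(input.attr("missing").isEmpty()); // true: absent attributes read as ""

        input.attr("disabled", false); // passing false removes the boolean attribute
        input.removeAttr("name");      // removes an attribute by key
        System.out.println(input.hasAttr("disabled") || input.hasAttr("name")); // false
    }
}
```

Because attr() returns an empty string for a missing attribute rather than null, hasAttr() is the safer way to test for presence.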
Adding Classes
Often we need to add classes to elements. We could do that with the attr() method, but that is not a good idea. The class attribute can contain multiple class names, and attr() replaces the whole attribute value, so any classes that were already there are lost. Instead, add classes with the addClass() method.
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

public class JavaSoup {
    public static void main(String[] args) {
        Attributes atrs = new Attributes();
        atrs.put("class", "myclass");
        atrs.put("id", "div1");
        Element div1 = new Element(Tag.valueOf("div"), "http://example.com", atrs);
        // Appends to the existing class list instead of replacing it.
        div1.addClass("cls2");
        System.out.println(div1);
    }
}
Outputs:
<div class="myclass cls2" id="div1"></div>
Look at the output. It did not overwrite the previous class(es).
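The contrast between the two approaches can be sketched side by side. The class name ClassNamesSketch and the helper method names are hypothetical; the jsoup calls used are attr(), addClass(), removeClass(), hasClass(), and className().

```java
import org.jsoup.nodes.Element;

class ClassNamesSketch {
    // Setting "class" twice with attr(): the second call replaces the first value.
    static String overwriteWithAttr() {
        Element div = new Element("div");
        div.attr("class", "myclass");
        div.attr("class", "cls2");
        return div.className();
    }

    // addClass() appends, and adding the same name again does not duplicate it.
    static String appendWithAddClass() {
        Element div = new Element("div");
        div.addClass("myclass");
        div.addClass("cls2");
        div.addClass("cls2");
        return div.className();
    }

    // removeClass() takes a single name out and leaves the rest alone.
    static Element afterRemove() {
        Element div = new Element("div");
        div.addClass("myclass").addClass("cls2");
        div.removeClass("myclass");
        return div;
    }

    public static void main(String[] args) {
        System.out.println(overwriteWithAttr());               // cls2
        System.out.println(appendWithAddClass());              // myclass cls2
        System.out.println(afterRemove().hasClass("cls2"));    // true
        System.out.println(afterRemove().hasClass("myclass")); // false
    }
}
```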
Inserting Text Into an Element
To insert plain text into an element, invoke the appendText() method on the element with the text you want to put there.
import org.jsoup.nodes.Element;

public class JavaSoup {
    public static void main(String[] args) {
        Element div1 = new Element("div");
        div1.appendText("I am a piece of plain text - I am also a text node when parsed");
        System.out.println(div1);
    }
}
Outputs:
<div>
I am a piece of plain text - I am also a text node when parsed
</div>
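One detail worth illustrating: appendText() treats its argument as literal text, so anything that looks like markup is not parsed. If you want a string to be parsed as HTML, jsoup offers append() for that. The following is a minimal sketch of the difference; the class name TextVersusHtmlSketch and the sample strings are made up for this example.

```java
import org.jsoup.nodes.Element;

class TextVersusHtmlSketch {
    // The argument is stored as one text node, angle brackets included.
    static Element withAppendText() {
        Element div = new Element("div");
        div.appendText("<b>bold?</b>");
        return div;
    }

    // The argument is parsed as HTML, so a real <b> child element is created.
    static Element withAppend() {
        Element div = new Element("div");
        div.append("<b>bold</b>");
        return div;
    }

    public static void main(String[] args) {
        System.out.println(withAppendText().children().size()); // 0
        System.out.println(withAppendText().text());            // <b>bold?</b>
        System.out.println(withAppend().children().size());     // 1
        System.out.println(withAppend().text());                // bold
    }
}
```

Using appendText() for untrusted strings is the safer choice, because the text can never turn into markup.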
Inserting Children Into an Element
Other nodes can be inserted into an element with the insertChildren() method, which takes the index to insert at, followed by the nodes to insert. Let's add one text node and one element to our element.
import org.jsoup.nodes.Element;

public class JavaSoup {
    public static void main(String[] args) {
        Element div1 = new Element("div");
        div1.appendText("I am a piece of plain text - I am also a text node when parsed");
        Element span1 = new Element("span");
        span1.appendText("This is a plain text inside the span tag");
        // The text node sits at index 0, so index 1 places the span after it.
        div1.insertChildren(1, span1);
        System.out.println(div1);
    }
}
Outputs:
<div>
I am a piece of plain text - I am also a text node when parsed
<span>This is a plain text inside the span tag</span>
</div>
Note that the span was inserted after the text node, because we asked for it to be placed at index 1 and the text node already occupied index 0.
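Besides insertChildren(), Element also has appendChild() and prependChild() for the common cases of adding at the end or at the start. The sketch below combines them and records the resulting child order. The tag choices and the class name ChildOrderSketch are arbitrary, picked only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

class ChildOrderSketch {
    // Builds a div step by step and returns the node names of its children in order.
    static List<String> childNodeNames() {
        Element div = new Element("div");
        div.appendText("text");                       // [#text]
        div.appendChild(new Element("span"));         // [#text, span]
        div.prependChild(new Element("em"));          // [em, #text, span]
        div.insertChildren(1, new Element("strong")); // [em, strong, #text, span]

        List<String> names = new ArrayList<>();
        for (Node child : div.childNodes()) {
            names.add(child.nodeName()); // text nodes report the name "#text"
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(childNodeNames()); // [em, strong, #text, span]
    }
}
```

The index passed to insertChildren() counts all child nodes, including text nodes, not just elements.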
Finding Elements
There are various mechanisms for finding elements inside an HTML document. The lookup methods are available both on the document object and on individual elements. To make this more interesting, let's retrieve a real page and work on it. Our initial code looks like the following:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JavaSoup {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://techcrunch.com").get();
        // rest of the code goes here.
    }
}
We can invoke the following methods on the document object and on the element objects to find other elements inside of them.
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
For example, suppose we want to get all the anchor elements from the TechCrunch home page.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JavaSoup {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://techcrunch.com").get();
        // Collect every <a> element in the document.
        Elements links = doc.getElementsByTag("a");
    }
}
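The other lookup methods work the same way. Here is a minimal sketch that runs them against a small in-memory document instead of a live page, so the results are predictable. The markup and the class name FindersSketch are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FindersSketch {
    // Hypothetical sample markup used instead of a fetched page.
    static final String HTML =
        "<div id=\"main\">"
      + "<p class=\"intro\">Hello</p>"
      + "<p class=\"intro lead\">World</p>"
      + "<a href=\"/about\">About</a>"
      + "<a name=\"top\">Top</a>"
      + "</div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.getElementById("main").tagName());    // div
        System.out.println(doc.getElementsByTag("a").size());        // 2
        // Matches every element that has "intro" among its classes.
        System.out.println(doc.getElementsByClass("intro").size());  // 2
        // Matches elements that have the attribute, whatever its value.
        System.out.println(doc.getElementsByAttribute("href").size()); // 1
        // getElementById returns null when nothing matches.
        System.out.println(doc.getElementById("missing") == null);   // true
    }
}
```

Note that getElementById() returns a single element or null, while the getElementsBy... methods return an Elements collection that is simply empty when nothing matches.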
Also, we can find elements with the help of CSS selectors. To do so we need to invoke the select() method on the document object or on the element objects.
Let's find all the post links on the home page of TechCrunch. So, what are the rules of the game we are going to play? At the time of writing, the page's markup follows these two rules:
- All the post links are enclosed in an h2 tag that has the class post-title.
- The a tag inside that h2 contains the link. We can get the link from the href attribute of that element.
From rule number 1, our CSS selector starts as h2[class='post-title']. Adding rule number 2, it becomes h2[class='post-title'] a. Let's try this in code.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JavaSoup {
    public static void main(String[] args) throws Exception {
        String cssSelector = "h2[class='post-title'] a";
        Document doc = Jsoup.connect("https://techcrunch.com").get();
        // Anchors that are descendants of the matching h2 elements.
        Elements post_links = doc.select(cssSelector);
        System.out.println(post_links);
    }
}
Outputs:
<a href="https://techcrunch.com/2017/08/28/who-is-new-uber-ceo-dara-khosrowshahi/" data-omni-sm="gbl_river_headline,1">Who is new Uber CEO Dara Khosrowshahi?</a>
<a href="https://techcrunch.com/2017/08/27/spacexs-hyperloop-pod-speed-competition-winner-tops-200-mph/" data-omni-sm="gbl_river_headline,2">SpaceX’s Hyperloop Pod speed competition winner tops 200 MPH</a>
...
...
...
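A side note on the selector itself: the attribute form [class='post-title'] compares the whole class attribute value, so it stops matching as soon as an h2 carries a second class. The class selector h2.post-title matches whenever post-title is among the classes. The sketch below shows the difference on a made-up fragment; the markup and the class name ClassSelectorSketch are assumptions for illustration, not TechCrunch's actual markup.

```java
import org.jsoup.Jsoup;

class ClassSelectorSketch {
    // Hypothetical markup: the second heading has an extra class.
    static final String HTML =
        "<h2 class=\"post-title\"><a href=\"/first\">First</a></h2>"
      + "<h2 class=\"post-title featured\"><a href=\"/second\">Second</a></h2>";

    // Counts how many elements the given CSS selector finds in the sample.
    static int countMatches(String cssSelector) {
        return Jsoup.parse(HTML).select(cssSelector).size();
    }

    public static void main(String[] args) {
        // Exact attribute value match: only the first heading qualifies.
        System.out.println(countMatches("h2[class='post-title'] a")); // 1
        // Class selector: both headings qualify.
        System.out.println(countMatches("h2.post-title a"));          // 2
    }
}
```

Which form is right depends on the page. The class selector is usually the more robust choice when a site adds or removes extra classes over time.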
The printed result still contains markup, but we only want the links themselves. To get them, we need to read the value of the href attribute from each anchor element.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JavaSoup {
    public static void main(String[] args) throws Exception {
        String cssSelector = "h2[class='post-title'] a";
        Document doc = Jsoup.connect("https://techcrunch.com").get();
        Elements post_links = doc.select(cssSelector);
        // Print only the link target of each anchor, not its markup.
        for (Element el : post_links) {
            System.out.println(el.attr("href"));
        }
    }
}
Outputs:
https://techcrunch.com/2017/08/28/who-is-new-uber-ceo-dara-khosrowshahi/
https://techcrunch.com/2017/08/27/spacexs-hyperloop-pod-speed-competition-winner-tops-200-mph/
https://techcrunch.com/2017/08/27/china-doubles-down-on-real-name-registration-laws-forbidding-anonymous-online-posts/
https://techcrunch.com/2017/08/27/coast-guard-asks-people-stranded-by-harvey-to-call-instead-of-posting-on-social-media/
https://techcrunch.com/2017/08/27/breaking-uber-has-selected-a-ceo/
...
...
...
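The links above happen to be absolute. On many pages href values are relative, and attr("href") returns them exactly as written. jsoup can resolve them against the document's base URI with absUrl(), or with the abs: prefix on the attribute key. A minimal sketch, assuming a made-up link and the placeholder base URI https://example.com/ (the class name AbsoluteLinksSketch is also hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinksSketch {
    // Hypothetical markup with a relative link.
    static final String HTML = "<a href=\"/2017/08/28/some-post/\">Some post</a>";
    // Placeholder base URI; when a page is fetched with Jsoup.connect(),
    // jsoup sets the base URI from the page's URL instead.
    static final String BASE_URI = "https://example.com/";

    static Element firstLink() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();
        System.out.println(link.attr("href"));     // /2017/08/28/some-post/
        System.out.println(link.absUrl("href"));   // https://example.com/2017/08/28/some-post/
        System.out.println(link.attr("abs:href")); // same as absUrl("href")
    }
}
```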
Navigating Elements
Element provides helper methods for moving from one element to related ones: its siblings, its parent, and its children.
- siblingElements()
- firstElementSibling()
- lastElementSibling()
- nextElementSibling()
- previousElementSibling()
- parent()
- children()
- child(int index)
These are some of the Element methods for navigating the document tree.
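A short sketch of these navigation methods on a three-item list makes their behavior easier to see. The markup and the class name NavigationSketch are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class NavigationSketch {
    // Hypothetical sample list, written without whitespace between tags.
    static final String HTML =
        "<ul id=\"menu\"><li>News</li><li>Startups</li><li>Mobile</li></ul>";

    static Element list() {
        return Jsoup.parse(HTML).getElementById("menu");
    }

    // The middle list item, used as the starting point for navigation.
    static Element second() {
        return list().child(1); // child(int) is zero-based
    }

    public static void main(String[] args) {
        Element second = second();
        System.out.println(second.text());                          // Startups
        System.out.println(second.parent().id());                   // menu
        System.out.println(second.previousElementSibling().text()); // News
        System.out.println(second.nextElementSibling().text());     // Mobile
        System.out.println(second.firstElementSibling().text());    // News
        System.out.println(second.lastElementSibling().text());     // Mobile
        System.out.println(second.siblingElements().size());        // 2 (excludes itself)
        System.out.println(list().children().size());               // 3
        // There is no sibling before the first item, so the result is null.
        System.out.println(list().child(0).previousElementSibling() == null); // true
    }
}
```

Since the sibling methods return null at the edges, check the result before chaining further calls on it.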
To get data out of an element, use the following methods.
- attr()
- attributes()
- id()
- className()
- classNames()
- text()
- html()
- outerHtml()
- data()
- tag()
- tagName()
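Here is a minimal sketch of several of these accessors applied to a small in-memory document. The markup and the class name ElementDataSketch are made up for this example. Note that data() is meant for elements such as script and style, whose content is stored as data rather than as text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataSketch {
    // Hypothetical sample markup: one script and one paragraph.
    static final String HTML =
        "<script>var x = 1;</script>"
      + "<p id=\"intro\" class=\"lead big\" data-lang=\"en\">Hello <b>world</b></p>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element p = doc.getElementById("intro");

        System.out.println(p.id());              // intro
        System.out.println(p.tagName());         // p
        System.out.println(p.className());       // lead big
        System.out.println(p.classNames());      // the same classes as a Set
        System.out.println(p.attr("data-lang")); // en
        System.out.println(p.text());            // Hello world
        System.out.println(p.html());            // inner HTML, markup of children included
        System.out.println(p.outerHtml());       // the element's own tag plus its inner HTML

        Element script = doc.select("script").first();
        System.out.println(script.data());       // var x = 1;
    }
}
```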
Let's try some methods to navigate over the elements and get some data out of them.
I am going to find the primary navigation menu as an element. The primary nav menu is wrapped inside a nav tag with class name nav-primary.
String cssSelector = "nav[class='nav-primary']";
Document doc = Jsoup.connect("https://techcrunch.com").get();
Elements navs = doc.select(cssSelector);
Element nav = navs.get(0);
The first child of this nav element is a list that contains other menu elements. So, we can navigate to the list element with the following code.
Element list = nav.children().get(0);
Inside the li items of this list we find anchor (a) tags. We want all the menu names, which are the plain text inside these anchor tags. So, we invoke select() on the list element with the selector li a, and then invoke the text() method on each anchor element to get the menu title. Our final code looks like the following:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JavaSoup {
    public static void main(String[] args) throws Exception {
        String cssSelector = "nav[class='nav-primary']";
        Document doc = Jsoup.connect("https://techcrunch.com").get();
        Elements navs = doc.select(cssSelector);
        Element nav = navs.get(0);
        // The first child of the nav is the list that holds the menu items.
        Element list = nav.children().get(0);
        Elements anchors = list.select("li a");
        for (Element a : anchors) {
            System.out.println(a.text());
        }
    }
}
Outputs:
News
Startups
Mobile
Gadgets
Enterprise
Social
Europe
Asia
Crunch Network
Unicorn Leaderboard
Gift Guides
All Topics
All Galleries
All Timelines
...
...
...
This list of menu titles contains all the top level and child level menus. Here’s some homework: separate the top level menus from their child menus, and print each level of submenu indented by one more space than its parent.
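One caveat about the code above: navs.get(0) throws an IndexOutOfBoundsException if the selector matches nothing, which can easily happen when a site changes its markup. A more defensive variant can use selectFirst(), which returns null when there is no match. The following is only a sketch run against made-up markup; the helper name menuTitles and the class name SafeLookupSketch are hypothetical.

```java
import java.util.Collections;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SafeLookupSketch {
    // Hypothetical stand-in for a page's primary navigation.
    static final String HTML =
        "<nav class=\"nav-primary\"><ul>"
      + "<li><a href=\"/news\">News</a></li>"
      + "<li><a href=\"/startups\">Startups</a></li>"
      + "</ul></nav>";

    // Returns the link texts inside the first element matching navSelector,
    // or an empty list when the selector matches nothing.
    static List<String> menuTitles(Document doc, String navSelector) {
        Element nav = doc.selectFirst(navSelector);
        if (nav == null) {
            return Collections.emptyList();
        }
        return nav.select("li a").eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(menuTitles(doc, "nav.nav-primary")); // [News, Startups]
        System.out.println(menuTitles(doc, "nav.missing"));     // []
    }
}
```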
Conclusion
There are many other methods on the Element class, and a single article cannot cover all of them, nor would that be very useful. I have covered the most important aspects of the Element class and its functionality, and you should practice them. I also recommend looking at the official jsoup documentation and API reference.