By Md. Sabuj Sarker | 8/25/2017 | General |Beginners

Understanding the Uniform Resource Locator (URL) for Programmers

URLs are one of the most important elements of real-life programming, across all programming languages. Almost all applications need some kind of interaction with the web. Every resource or component on the web is identified, accessed, or communicated with through a URL. We even use URLs to talk to other applications that run on the local machine and expose their services through a port. Many developers misunderstand URLs and work with them blindly, just trying to get the job done. But it is very important to understand URLs properly if you want to build an application that does not break because of bugs in URL processing. Many developers also take URLs lightly and depend entirely on third-party libraries. I am writing this article for all programmers who work with some kind of web technology and who still need a little more knowledge and explanation of URLs.

Breaking a URL Into Parts

Let's play divide and conquer. Here is an example of a typical URL that we are going to start with:

https://en.wikipedia.org

Without knowing any rule, we can divide the URL into the following parts:

  • https
  • en.wikipedia.org

I have left out the :// intentionally. It is a separator between the two parts: the colon ends the scheme, and the double slash introduces the host.

Scheme/Protocol

What is HTTPS? HTTPS is the secure, encrypted version of HTTP. HTTP is a protocol by which server software and client software talk to each other. A protocol is a set of predefined rules. Every application that uses HTTP(S) must implement those rules to communicate with the others. The short name of the protocol, as it appears in the URL, is usually called the scheme. The scheme is a mandatory URL component.

Host

en.wikipedia.org is a domain or subdomain. In technical terms we call it the host. Every host name eventually resolves to an IP address. A host can be a domain or subdomain name, or it can be an IP address written directly. The IP address can be an IPv4 address or an IPv6 address; an IPv6 address is enclosed in square brackets inside a URL. For the web URLs discussed here, the host is also a mandatory component.
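To see the scheme and host as a program sees them, here is a minimal sketch in Python. It assumes the standard library's urllib.parse module; the choice of language is mine, and the IPv6 URL is a made-up example, not one taken from this article.

```python
from urllib.parse import urlsplit

# Split the example URL from this article into its components.
parts = urlsplit("https://en.wikipedia.org")

scheme = parts.scheme    # "https"
host = parts.hostname    # "en.wikipedia.org"

# Hypothetical example: an IPv6 host is written in square brackets in the
# URL, and hostname returns it without the brackets.
ipv6_host = urlsplit("http://[::1]/").hostname    # "::1"

print(scheme, host, ipv6_host)
```

Note that hostname is always returned in lower case, so it is a convenient value to compare against.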

The URL we just used brings us to the homepage of English Wikipedia. Let's try a bigger URL this time.

https://en.wikipedia.org/wiki/Internet

  • https
  • en.wikipedia.org
  • wiki/Internet

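I have left out the '/' between the host name and 'wiki/Internet' intentionally. The first forward slash after the host name separates the host from the path. Be aware that URL parsers usually treat this slash as part of the path and report it as /wiki/Internet.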

Path

Imagine the internet as a green village where only robots can walk. Also, imagine that every internet domain name is a house. Now, if you are a robot and you want to reach different houses, you must know which path to follow. The path component of a URL does the same thing. It tells you which direction to go to get what you want. The path is an optional URL component.

In our case wiki/Internet is the path of the URL.
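A short sketch of reading the path in Python, again assuming urllib.parse from the standard library. One detail to keep in mind: the parser returns the path with its leading slash, so I strip the empty pieces when splitting it into segments. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, unquote

parts = urlsplit("https://en.wikipedia.org/wiki/Internet")

path = parts.path    # "/wiki/Internet" (leading slash included)

# Break the path into its segments, skipping the empty string that the
# leading slash produces, and decode any percent-encoded characters.
segments = [unquote(s) for s in path.split("/") if s]    # ["wiki", "Internet"]

# A URL without a path simply yields an empty string.
empty_path = urlsplit("https://en.wikipedia.org").path   # ""

print(path, segments, repr(empty_path))
```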

We also see URLs that contain question marks, percent signs, equals signs and a lot of other things. Let's work with such a URL now.

https://en.wikipedia.org/w/index.php?search=Programmers%27+Joke&title=Special:Search&go=Go&searchToken=2rcj4uhsfa5n6eh6jfblmxg6j

This is the URL that I was brought to when I searched Wikipedia for Programmers' Joke. If I divide the URL into components then I get:

  • https
  • en.wikipedia.org
  • w/index.php
  • search=Programmers%27+Joke&title=Special:Search&go=Go&searchToken=2rcj4uhsfa5n6eh6jfblmxg6j

We’ve got something new now: the last component.

Query

The query in a URL contains one or more key-value pairs in which the value is optional. A key is separated from its value by an equals sign (=), and the pairs are separated from each other by an ampersand (&). Characters that cannot appear literally are encoded: in the URL above, %27 stands for the apostrophe and + stands for the space. We encounter queries mostly on search result pages when we use some kind of search engine. The query comes after the path, and the two are separated by a question mark (?). A form submitted with the GET method generates a query string. In other words, if you want to submit such a form programmatically, you need to build a query string from the names and values of the form fields. A query string is also an optional component of any URL.
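Both directions, reading a query string and building one as a GET form would, can be sketched in Python with urllib.parse. I have shortened the Wikipedia URL by dropping the searchToken parameter to keep the lines readable, and the form dictionary is a made-up stand-in for real form fields.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# The search URL from this article, without the searchToken parameter.
url = ("https://en.wikipedia.org/w/index.php"
       "?search=Programmers%27+Joke&title=Special:Search&go=Go")

query = urlsplit(url).query

# parse_qs decodes %27 and + and maps every key to a list of values,
# because the same key may appear more than once in a query.
params = parse_qs(query)
search_term = params["search"][0]    # "Programmers' Joke"

# Going the other way: build a query string from form names and values.
form = {"search": "Programmers' Joke", "go": "Go"}
built = urlencode(form)              # "search=Programmers%27+Joke&go=Go"

print(search_term)
print(built)
```

Letting the library do the encoding is the safer choice here, since hand-built query strings are a common place for escaping bugs.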

Let's see another type of URL:

https://en.wikipedia.org/wiki/Internet#Protocols

Dividing the URL into components we get:

  • https
  • en.wikipedia.org
  • wiki/Internet
  • Protocols

The # is a separator between the path/query and the URL fragment.

Fragment

A URL fragment is an identifier that points to a specific element of an HTML/XHTML page. A fragment might come after the host (if no other component follows the host), after the path (if no query is present), or after the query string. The separator used for it is the hash sign (#), also called the pound sign. In practice, the string used as a URL fragment is the id attribute of an element in that HTML page. Fragments are mainly used for navigating easily to a specific part of a page. A URL fragment is also an optional component of the URL system.
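Here is a small Python sketch of pulling the fragment out of a URL, assuming urllib.parse from the standard library. The second URL, with the fragment "top" directly after the host, is a hypothetical example of my own.

```python
from urllib.parse import urlsplit, urldefrag

url = "https://en.wikipedia.org/wiki/Internet#Protocols"

# The fragment is returned without the # separator.
fragment = urlsplit(url).fragment    # "Protocols"

# urldefrag splits a URL into the part before the # and the fragment.
base, frag = urldefrag(url)
# base == "https://en.wikipedia.org/wiki/Internet", frag == "Protocols"

# Hypothetical URL: a fragment can follow the host directly.
after_host = urlsplit("https://en.wikipedia.org#top")

print(fragment, base, after_host.hostname, after_host.fragment)
```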

Have you ever encountered a URL like this?

http://example.com:8080/some-path

This is not an uncommon URL pattern. I doubt there is a professional programmer who has never seen such a URL. If we break the URL into parts we get:

  • http
  • example.com
  • 8080
  • some-path

Port

A port is a whole number (from 0 to 65535) that is used together with the host to connect to a service. A port is a gate through which data passes from one system or device to another. We cannot connect to a host or server that does not listen on a port. Network schemes such as HTTP have a default port number. For example, the default port of HTTP is 80 and the default port of HTTPS is 443. Because of this default, we do not need to specify the port in most URLs. But if we want to connect through a different port, we must specify it after the host, separated by a colon (:). The port is optional whenever we are using the default port of the scheme.
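The fallback to a default port can be sketched in a few lines of Python. The helper name effective_port and the small DEFAULT_PORTS table are my own illustrative choices, not part of any library; only urlsplit comes from the standard library.

```python
from urllib.parse import urlsplit

# Illustrative table with only the two schemes discussed in this article.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the explicit port of a URL, or the scheme's default port."""
    parts = urlsplit(url)
    if parts.port is not None:
        # The port was written in the URL, e.g. ":8080".
        return parts.port
    # No explicit port: fall back to the default (None if unknown scheme).
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://example.com:8080/some-path"))
print(effective_port("https://en.wikipedia.org"))
```

For a scheme that is missing from the table the helper returns None, which leaves the decision to the caller instead of guessing.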

There are other components that we usually don’t encounter. The other components are: user and password.

User & Password

If you need to log in to a website that uses the HTTP Basic Authentication mechanism, you can include the username and password in the URL. When you do so, the browser will usually not prompt you for them. The username and password come right after the scheme separator (://). The username and password are separated by a colon (:), and the pair is separated from the host by an at sign (@). Remember that the username and password are optional. You can even provide only the username if your password is empty or optional. Keep in mind that RFC 3986, the URI standard, deprecates putting the password in the URL because it exposes the password in clear text. An example is given below:

http://yourusername:yourpassword@example.com/protected-path
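Reading these two components back in Python might look like the sketch below, assuming urllib.parse from the standard library. The second URL, with only a username, is a variation I made on the article's example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://yourusername:yourpassword@example.com/protected-path")

username = parts.username    # "yourusername"
password = parts.password    # "yourpassword"
host = parts.hostname        # "example.com" (credentials are not included)

# Variation with the username only: the password is reported as None.
only_user = urlsplit("http://yourusername@example.com/protected-path")

print(username, password, host, only_user.password)
```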

Formal Representation of URL

A URL can be represented formally as follows:

scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]
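To make the template concrete, here is a rough Python sketch that assembles a URL by following it literally, one optional bracket at a time. The function build_url and its parameter names are hypothetical, it does no percent-encoding, and it assumes the caller passes components that are already valid.

```python
def build_url(scheme, host=None, path="", user=None, password=None,
              port=None, query=None, fragment=None):
    """Assemble a URL by following the formal representation literally."""
    url = scheme + ":"
    if host is not None:
        url += "//"
        if user is not None:
            url += user
            if password is not None:
                url += ":" + password
            url += "@"
        url += host
        if port is not None:
            url += ":" + str(port)
    if path:
        # The template writes the path with its leading slash.
        url += path if path.startswith("/") else "/" + path
    if query is not None:
        url += "?" + query
    if fragment is not None:
        url += "#" + fragment
    return url

print(build_url("http", "example.com", "some-path", port=8080))
print(build_url("https", "en.wikipedia.org", "wiki/Internet",
                fragment="Protocols"))
```

In real code the standard library's urllib.parse.urlunsplit does a similar job from a five-part tuple, so a hand-written builder like this is mainly useful for seeing how the pieces fit together.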

I hope this article will help you understand URLs better and help you code better. I am going to write more articles on how to parse and use URLs with different programming languages.

 
