Website scraping with Scrapy
Scrapy is a tool created for website scraping. It is an ideal alternative to BeautifulSoup and hand-crafted scrapers: it is easy to install and use, and it provides parallelism out of the box.
About Scrapy
Scrapy is not simply a library like BeautifulSoup, it is a tool. If I go a step further, Scrapy is the tool for website scraping with Python. It has a large user base and a big community that cares about the project. This can be seen in the port to Python 3: in 2015 there was no version you could install and use with Python 3, you needed Python 2 to use Scrapy.
When you install it, you can see that there are a lot of dependencies which get installed along the way to provide the required functionality.
And because of all this, Scrapy can relieve you of tasks like parallelism or exporting information. It has built-in solutions for splitting up downloading and parsing to make them more efficient, and for exporting the results in the format you like.
It comes with a REPL (Read-Eval-Print Loop), called the Scrapy shell, where you can load a website and experiment with different queries to extract the required information without launching your project every time -- and to see errors and missing fields right away. I recommend that you start off with this REPL, and once you can extract everything you need, write your project.
To extract contents you can use CSS selectors, as with BeautifulSoup, or XPath selectors. I prefer XPath in this case because it makes scraping the contents more readable. The drawback is that you need to know how XPath works to write and read the code. Consequently, in this article I will write XPath queries to extract information.
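To make the difference concrete, here is a minimal comparison in the Scrapy shell, using the ckt-article class that appears later in this article; both queries return a list of selectors:
>>> response.css('div.ckt-article')
>>> response.xpath('//div[@class="ckt-article"]')
Note that the CSS form matches any div that has ckt-article among its classes, while this XPath form matches the exact class attribute value.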
Installing
To get the examples working, you need Python 3 and Scrapy installed. To obtain Scrapy, simply execute the following command:
pip install scrapy
I will use version 1.2.1 throughout this article. I mention this because Scrapy is maintained continuously and some functionality can change between versions.
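If you are unsure which version you have installed, Scrapy can tell you from the command line:
scrapy version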
An example
In this example we will create a simple project to navigate through the blog, gather each article's title and URL, and export the results to a CSV file.
First of all we will use the REPL to write our code to extract the information. To do this launch the shell:
scrapy shell 'http://www.discoversdk.com/blog'
I have put quotes around the URL to avoid any parsing problems. It is a good practice to do this when you start the shell.
To extract articles, we need to find all elements which have the ckt-article class attribute:
>>> articles = response.xpath('//div[@class="ckt-article"]')
>>> len(articles)
10
Now we need to find the URLs of the blog articles within these elements. They reside in h2 tags, so we extract those:
>>> h2s = [article.xpath('.//h2') for article in articles]
>>> len(h2s)
10
The dot (.) is needed in the previous extraction: it makes the XPath relative to the current article element. Without it we would find every h2 tag on the page, which is not what we want.
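As a quick illustration of the difference, you can compare the two forms on a single article selector (the actual counts depend on the page, so treat this as a sketch):
>>> len(articles[0].xpath('.//h2'))    # h2 tags inside the first article only
>>> len(articles[0].xpath('//h2'))     # every h2 tag on the whole page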
After this we need to get only the a elements inside these h2s and extract the URL and the text of these anchors:
>>> entries = [(h2.xpath('./a/text()'), h2.xpath('./a/@href')) for h2 in h2s]
>>> len(entries)
10
In the code snippet above I extract pairs (2-tuples) into the list of entries. One such pair contains the article title and the URL. Now let's look at the contents of the entries:
>>> entries[0]
([<Selector xpath='./a/text()' data=' HTTP in Angular 2'>], [<Selector xpath='./a/@href' data='/blog/http-in-angular-2'>])
OK, it looks like something is missing. Yes, the xpath method returns a list of selectors. To get the contents we need to call extract on them. But this is only half the battle: as you can see, each element of a pair is itself a list, and we do not want lists here, so we take only the first element of each:
>>> entries = [(h2.xpath('./a/text()').extract()[0], h2.xpath('./a/@href').extract()[0]) for h2 in h2s]
>>> entries[0]
(' HTTP in Angular 2', '/blog/http-in-angular-2')
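As a side note, Scrapy's selector lists also offer extract_first(), which returns the first extracted value directly (or None when nothing matched), so the comprehension above could also be written like this:
>>> entries = [(h2.xpath('./a/text()').extract_first(), h2.xpath('./a/@href').extract_first()) for h2 in h2s]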
Finally we need the next button, if it is present and not disabled, so that we can navigate to the next page:
>>> next_btn = response.xpath('//a[@class="ckt-next-btn "]/@href').extract()
>>> next_btn[0]
'/blog/page/2'
I have stored the extracted result for the next button in a variable and printed the first element of the list afterwards. This matters when there is no next button or it is disabled: if the XPath expression finds nothing, accessing the first element of the empty list would raise an exception.
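In other words, before indexing into the result you should check that the list is not empty. A minimal sketch of that guard (next_url is just an illustrative name):
>>> if next_btn:
...     next_url = next_btn[0]   # safe: we only index when the list is non-empty
... else:
...     next_url = None          # no (enabled) next button on this page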
Now one thing to notice is that the links are missing the first part of the URL. Fortunately we do not have to add it manually; we can use the urllib module from the Python standard library to do this for us:
>>> from urllib.parse import urljoin
>>> urljoin(response.url, next_btn[0])
'http://www.discoversdk.com/blog/page/2'
>>> entries = [(h2.xpath('./a/text()').extract()[0], urljoin(response.url, h2.xpath('./a/@href').extract()[0])) for h2 in h2s]
>>> entries[0]
(' HTTP in Angular 2', 'http://www.discoversdk.com/blog/http-in-angular-2')
The project
Now it is time to create our project to extract information from the blog:
scrapy startproject discoversdk
You can start your first spider with:
cd discoversdk
scrapy genspider example example.com
As the output states, we go to the discoversdk folder and create a new spider:
scrapy genspider blog 'www.discoversdk.com/blog'
Created spider 'blog' using template 'basic' in module:
discoversdk.spiders.blog
The URL we provided for the genspider command will be used as the start_urls and the allowed_domains for the spider.
Now we are ready to write our spider. Navigate to the discoversdk/spiders folder and open the blog.py file for editing. There is a basic spider generated already which does nothing (the pass inside the parse method denotes this):
# -*- coding: utf-8 -*-
import scrapy


class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["www.discoversdk.com/blog"]
    start_urls = ['http://www.discoversdk.com/blog/']

    def parse(self, response):
        pass
Actually this is what we need. The parse method, as its name suggests, parses the contents. Every time Scrapy crawls a URL, the result is passed to the parse method inside the response object, which contains the request you sent and the resulting contents of the site. There are options to direct responses to other methods, but this is not discussed in detail in this article.
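For completeness, here is a minimal sketch of how such routing could look; parse_article is a hypothetical callback used only for illustration, and urljoin comes from urllib.parse as in the previous section:
    def parse(self, response):
        for href in response.xpath('//div[@class="ckt-article"]/h2/a/@href').extract():
            # hand each article page over to a different callback instead of parse
            yield scrapy.Request(urljoin(response.url, href), callback=self.parse_article)

    def parse_article(self, response):
        # parse the individual article page here
        pass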
Because we have already gathered the information we will extract in the previous section, I skip the details and give you my solution:
# add this import at the top of blog.py, next to the scrapy import
from urllib.parse import urljoin

def parse(self, response):
    anchors = response.xpath('//div[@class="ckt-article"]/h2/a')
    for a in anchors:
        yield dict(title=a.xpath('./text()').extract()[0],
                   url=urljoin(response.url, a.xpath('./@href').extract()[0]))

    next_btn = response.xpath('//a[@class="ckt-next-btn "]/@href').extract()
    if next_btn:
        yield scrapy.Request(urljoin(response.url, next_btn[0]))
The interesting parts are the yield statements. The first one, in the for loop, creates a dictionary for each title-URL combination. Yielding it tells Scrapy that we have found an item we want to export. Creating a dict on the fly with this constructor function is a fine solution for such simple results. As your exports get more complex (you have many fields, or you gather information from different pages), you will want to create an Item class, which acts like a dictionary but lets you restrict the keys and saves you from building dictionaries on the fly. And because it is a class, you can reuse it in different spiders. The second yield tells Scrapy to parse the next page, which can be reached through the provided URL. When Scrapy finds and loads the page at the end of the URL, it converts it to a Response object and calls the parse method again.
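A minimal sketch of such an Item class for this example could look like the following (BlogEntry is just an illustrative name; the class would typically live in the project's items.py):
import scrapy


class BlogEntry(scrapy.Item):
    # the two fields we export in this article
    title = scrapy.Field()
    url = scrapy.Field()
In the parse method you would then yield BlogEntry(title=..., url=...) instead of a plain dict.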
The crawl (the run of the spider) finishes when there are no more URLs to gather and all gathered URLs have been parsed. In this example that happens when we reach the last page of entries and have extracted the articles from that page too.
We have the spider ready, so let's start it. To do this, navigate to the folder created when you generated the project (the outermost discoversdk folder if you followed along) and execute the following command:
scrapy crawl blog
The parameter blog is the name of the spider to launch and we named our spider blog.
Now you will see some output messages. The default log level is DEBUG, so you see all kinds of information. Among others, you will see lines like this one:
2016-11-19 16:00:38 [scrapy] DEBUG: Scraped from <200 http://www.discoversdk.com/blog/>
{'url': 'http://www.discoversdk.com/blog/python-vs-javascript', 'title': ' Python vs JavaScript'}
This is an example of a dictionary we created and yielded in the parse method. Currently this is the only place we see our results.
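If the DEBUG output is too noisy for you, you can raise the log level when launching the crawl, for example:
scrapy crawl blog -L INFO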
I mentioned that you can export results to different kinds of files out of the box. We will export our results to a CSV file called blog_entries.csv. To do this execute the following:
scrapy crawl blog -o blog_entries.csv
If you look at the execution folder, you will see that there is a new file blog_entries.csv which contains the exported results in a CSV format.
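Other built-in export formats work the same way: Scrapy picks the format from the file extension, so a JSON export, for example, only needs a different file name:
scrapy crawl blog -o blog_entries.json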
Conclusion
Scrapy is a great tool for website scraping -- and we have only scraped the surface here. There are ways to navigate through the links on a webpage, extract information in more generalized ways, and scrape multiple sites.
I suggest that you take a deeper look at Scrapy if you want to do website scraping, because it gives you features that can be troublesome to implement manually. Naturally, this toolset comes with a learning curve if you want to utilize everything it offers.
And always remember to honor the Terms of Service and Privacy Policy of the websites you want to scrape: sometimes they prohibit content scraping (mostly aimed at spiders and robots), and even a simple script is effectively a bot.
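On a related note, two settings in the project's settings.py help you stay polite; treat the values below as a sketch, not a recommendation:
# settings.py
ROBOTSTXT_OBEY = True      # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1         # wait a second between requests to the same site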