By Gabor Laszlo Hajba | 11/28/2016 | Product Comparisons | Beginners

BeautifulSoup4 vs Scrapy

In this article I will compare two solutions for website scraping with Python.

I introduced BeautifulSoup4 and Scrapy previously with full reviews of each. Now it’s time to compare them and help you decide: which one should you use for your projects?

As mentioned previously, BeautifulSoup is a content extractor: it has to be given the source of a website before it can do any parsing. Scrapy, in contrast, is a full website scraping framework: it crawls your target site and downloads its pages before extracting, and you don’t have to write much code to achieve this.

You could say: "Then Scrapy is the tool I always want to use". Well, this works most of the time but sometimes I prefer BeautifulSoup over Scrapy, and I’ll explain why.

A year ago...

One year ago there was a point where I chose BeautifulSoup without hesitating: Python 3. Scrapy did not support this interpreter at the time (prior to version 1.1), so if I wanted to scrape with Python 3, there was no option but to use BeautifulSoup together with a content downloader (such as requests) to achieve my goal!
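To illustrate, here is a minimal sketch of that combination; the URL and the h2 tag are placeholders for whatever site and elements you actually want to extract:

import requests
from bs4 import BeautifulSoup

# Download the page ourselves, then hand the markup to BeautifulSoup
response = requests.get('https://www.example.com')  # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
for heading in soup.find_all('h2'):  # extract whichever elements you need
    print(heading.get_text(strip=True))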

Today

Today it is different, because Scrapy is available for Python 3 as well. Hands down, it would be easier to use the framework than the content extractor, but sometimes the simpler solution is better: you write the code yourself and you can clearly steer the workflow. With Scrapy a lot of things are hidden from your eyes, and you need to read through the documentation and follow different forums to find the right solution for your problem.

One example would be LinkedIn. You may want to scrape some information from this website with Scrapy (note that LinkedIn's policy prohibits scraping; I mention it solely for the sake of my example). Scrapy handles cookies for you out of the box, but LinkedIn is a tricky site: some cookies included in its responses should not be added to your next request. They contain the text "delete me" (or something similar), which tells the server validating the request's cookies that something is not OK with the caller -- that it is not a regular browser.

With BeautifulSoup and requests you can customize this behavior. Naturally this involves more coding, but you have everything at your fingertips to control exactly which cookies you send. requests can, of course, handle cookies automatically for you, but you can customize them too.
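A minimal sketch of what that customization could look like, assuming a hypothetical site that marks poisoned cookies with "delete me" (the URLs below are placeholders):

import requests

session = requests.Session()
session.get('https://www.example.com/login')  # placeholder URL

# Drop every cookie the server marked for deletion before the next request
for cookie in list(session.cookies):
    if 'delete me' in (cookie.value or ''):
        session.cookies.clear(cookie.domain, cookie.path, cookie.name)

response = session.get('https://www.example.com/profile')  # placeholder URL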

But if you want to get data from LinkedIn, use their API: that is the way they suggest, and you will not get banned.

Throttling

Most websites do not use "tricky" cookies like the one we have just seen, but they do count the incoming requests from each IP address in a given time-frame. If the count is too high, the IP address is blocked for some period. And because Scrapy issues multiple requests at a time, you can easily run into this problem.

If this happens, try setting CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP in the settings.py file to a smaller number; I suggest 1 or 2 instead of the default of 8 concurrent requests per domain. Another option is to increase the time between requests: change the DOWNLOAD_DELAY parameter, again in settings.py, to a bigger value like 3-5 seconds.
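In settings.py that could look like this (the exact values are suggestions, not defaults):

# settings.py -- throttle the spider to be gentle with the target site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # limit parallel requests to one domain
DOWNLOAD_DELAY = 3                  # wait 3 seconds between two requests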

Naturally this makes your scraping slower, but you do not bring the target website down: it can keep handling the requests of other users, and you do not get blocked -- which in the end makes your scraping faster.

With BeautifulSoup there is no such problem by default, because such applications run in a single thread and block until the server's response arrives and the parsing is done. Still, this can sometimes cause trouble: with a fast internet connection and computer you can gather and process information quickly enough to get blocked anyway. To avoid this you have to throttle manually, for example by adding some sleep to your code between getting the website contents and extracting the information:

from time import sleep
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape(url):
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    sleep(3)  # wait 3 seconds before processing the downloaded content
    for article in soup.find_all('div', class_='ckt-article'):
        print(article)  # or extract whatever you need from each article

In the example above we wait 3 seconds before we extract the contents. I know this can be slow, but it is often faster than the almost endless loop of "working for 5 minutes -- banned for 24 hours -- working for 5 minutes -- banned ...".

JavaScript / AJAX / XHR

One point that is often problematic when you want to scrape websites is content loaded dynamically with JavaScript. Neither of these tools can handle JavaScript calls by default, but you can add this capability to both with some custom implementation.

It sometimes takes a lot of time to find out which calls you have to look for when preparing your scraping, and sometimes even more work is required to write the extraction code. But remember: even AJAX and XHR calls are plain HTTP requests, so they can be handled with Scrapy (by yielding a Request with the right parser function as its callback) and with BeautifulSoup (by writing a function or code block that gathers the information by calling the same URL the JavaScript calls), as the sketch below shows.
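Here is a minimal sketch of the Scrapy variant; the spider name, the site, and the API endpoint are made up for illustration, and the endpoint is assumed to return a JSON list:

import json
import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['https://www.example.com/articles']  # placeholder page

    def parse(self, response):
        # The page fills its article list via an XHR call -- request it directly
        yield scrapy.Request('https://www.example.com/api/articles',  # placeholder endpoint
                             callback=self.parse_api)

    def parse_api(self, response):
        for item in json.loads(response.text):  # the endpoint returns JSON
            yield {'title': item['title']}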

Just the one or the other?

If you ask yourself: "Which one shall I use? I like how BeautifulSoup handles parsing, but I love how Scrapy does most of the work with less code", then my answer is: use both. Because Scrapy is a website scraper, it uses content extractors. And because BeautifulSoup is a content extractor, you can include it in your Scrapy project and do the extraction with this library instead of the built-in selectors:

import scrapy
from bs4 import BeautifulSoup

class MixedSpider(scrapy.Spider):  # illustrative wrapper around the snippet
    name = 'mixed'
    start_urls = ['https://www.example.com']  # placeholder URL

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')  # BS4 instead of Scrapy selectors
        yield {'title': soup.title.string}

As you can see, you can mix the two approaches, so you do not have to learn a new extraction syntax if you just want to switch over to Scrapy.

Conclusion

We have seen that even though Scrapy is more powerful, sometimes you want to write your own code to handle everything that can go wrong. Sometimes you have to use BeautifulSoup in a custom, hand-crafted solution instead of an already existing power-tool. And you do not have to sacrifice BeautifulSoup if you switch to Scrapy: you can use both together.

If you want to start with website scraping, I suggest you get started with Scrapy, because in 90% of cases it gives you everything you need, and you can customize your solution easily. If something is missing, there are ways to extend the existing features to fulfill your needs -- and if that does not help, switch to a hand-crafted solution.

But always remember to honor the Terms of Service and Privacy Policy of the websites you want to scrape: they sometimes prohibit content scraping (mostly by spiders and robots), and even a simple script is practically a bot.
