My Web Scraping Journey and Why Scrapy Stands Out

Scraping the web is relatively easy nowadays, thanks to companies that lower the barrier to entry with their services and open-source their excellent tools to the developer community. Given the plethora of options out there, the one I ultimately fell in love with was Scrapy, a web scraping framework written in Python by the excellent engineering team at Zyte.

Foundation

It all starts with creating a new Scrapy project in a virtualenv. I decided to keep things simple by using Pipenv to manage my Scrapy project's environment.

For this post, I'm just going to use a random project name: web_weaver

Once you've settled on a name for your Scrapy project, go ahead and create it using the following command:

scrapy startproject web_weaver
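If you're using Pipenv as I did, the surrounding commands look roughly like this (a sketch; adjust names and paths to taste):

```shell
pipenv install scrapy                     # creates the Pipfile and installs Scrapy into a virtualenv
pipenv run scrapy startproject web_weaver # generates the project skeleton
cd web_weaver
```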

Next, feel free to add whatever Python packages interest you. That said, Scrapy bundles many libraries out of the box, so make sure you're not adding something redundant! In my case, the most notable packages I added via the Pipfile are the following:

  1. SQLAlchemy
  2. Alembic
  3. psycopg
  4. ipython
  5. boto3
  6. Pillow
  7. scrapy-splash
  8. shub

The first three packages were essential for configuring and interacting with the PostgreSQL database created to store data during crawl sessions. SQLAlchemy is the premier ORM in the Python world, Alembic handles database migrations on top of SQLAlchemy, and psycopg is the go-to PostgreSQL adapter that lets my Scrapy project and the database communicate with each other.
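To make the division of labor concrete, here is a minimal sketch of an item pipeline that persists scraped items through SQLAlchemy. The model, field names, and pipeline class are hypothetical, and I use an in-memory SQLite URL so the sketch is self-contained; in a real project you would pass a postgresql+psycopg URL instead.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Product(Base):
    """Hypothetical table for scraped products."""
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    url = Column(String, unique=True)

class DatabasePipeline:
    """Scrapy-style item pipeline: one process_item call per scraped item."""

    # In-memory SQLite keeps the sketch runnable; swap in
    # "postgresql+psycopg://user:pass@host/dbname" for a real project.
    def __init__(self, db_url="sqlite:///:memory:"):
        self.engine = create_engine(db_url)
        Base.metadata.create_all(self.engine)  # Alembic would manage this in production
        self.Session = sessionmaker(bind=self.engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            session.add(Product(name=item["name"], url=item["url"]))
            session.commit()
        finally:
            session.close()
        return item
```

In a real project you would register the pipeline in ITEM_PIPELINES and let Alembic own the schema migrations rather than calling create_all.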

Furthermore, the boto3 package, maintained by AWS, enables developers "...to create, configure, and manage AWS services..." by providing "...an object-oriented API as well as low-level access to AWS services." In fact, Scrapy uses boto3 with your AWS keys (typically generated in the AWS IAM service) so that any and all files downloaded or generated during the crawl session can be exported directly to your S3 bucket. Commonly, these are your media files (i.e., images, videos, etc.) and feed exports, which come in JSON, JSON Lines, CSV, or XML format.
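As a rough illustration, pointing feed exports at S3 is just a matter of settings; the bucket name and paths below are placeholders:

```python
# settings.py -- sketch; bucket name, paths, and credentials are placeholders
AWS_ACCESS_KEY_ID = "AKIA..."        # typically generated in AWS IAM
AWS_SECRET_ACCESS_KEY = "..."

# Scrapy's FEEDS setting accepts s3:// URIs; %(name)s and %(time)s are
# built-in placeholders for the spider name and crawl timestamp.
FEEDS = {
    "s3://my-bucket/exports/%(name)s/%(time)s.json": {"format": "json"},
}
```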

Note: JSON Lines should be used when batching crawl sessions that collect massive amounts of data. For example, instead of producing one JSON file with thousands of objects, you can configure Scrapy to output JSON Lines files that each contain a few hundred products. Not only is the data split across multiple files with a customizable naming convention, but downstream programs can read these files far more efficiently: a single JSON file requires the entire JSON array to be loaded into memory, whereas JSON Lines has no enclosing array and instead puts one JSON object on each line, so records can be streamed one at a time. More information can be found in the Scrapy docs.
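The memory argument is easy to see in plain Python. A JSON Lines feed can be consumed line by line, with each line parsed independently (the sample records below are made up):

```python
import io
import json

# A JSON Lines feed: one complete JSON object per line, no enclosing array.
# io.StringIO stands in for an open .jl file.
feed = io.StringIO(
    '{"name": "widget", "price": 9.99}\n'
    '{"name": "gadget", "price": 4.50}\n'
)

# Each line parses on its own, so a reader never needs the whole
# file in memory at once -- unlike json.load() on one giant array.
products = [json.loads(line) for line in feed]
```

On the writing side, Scrapy can also split a feed into fixed-size batches via the FEED_EXPORT_BATCH_ITEM_COUNT setting, with a batch number available as a placeholder in the feed URI.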

Pillow is a powerful image processing library. In my experience, however, I have never interacted with this package directly inside my Scrapy project. Rather, Scrapy's Images Pipeline requires it for generating thumbnail/variant image sizes and normalizing images to JPEG/RGB format.
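For reference, enabling the Images Pipeline and its thumbnail generation is again settings-driven; the bucket and thumbnail sizes below are placeholders:

```python
# settings.py -- sketch; storage path and sizes are placeholders
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "s3://my-bucket/images/"  # local paths work too

# Pillow does the resizing behind the scenes for each named size.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}
```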

Not all websites are built the same

One of the biggest lessons I learned from the many months of teaching myself advanced web scraping techniques is the following:

Do not ever start writing a new spider before thoroughly inspecting your target website's (1) HTML page code/structure and (2) Network Activity.

Not only can you end up lost, rewriting the same lines of code over and over again, but a thorough inspection can save you a lot of unnecessary work, period.

After scraping several websites, you'll begin to notice that each one falls into one of several categories, listed here in order of complexity (i.e., simple to complex):

  • Data is loaded from a [publicly accessible] JSON API
  • Content is present in the HTML from the page's initial load, as with the average web page.
  • Data is loaded dynamically from an inaccessible API source, which requires either pre-rendering JavaScript or using a headless browser.

This is where additional libraries, like scrapy-splash and scrapy-playwright, become crucial to getting your spider working.

Closing remarks

Scrapy is an amazing framework that I strongly encourage you to try out and learn for yourself. If you have any experience building web applications, learning how to scrape a website (even your own!) will give you a completely different perspective on, well, just about everything.

Finally, while I am not affiliated with Zyte, I have been following them for several years now, and I am excited to try out their new Zyte API service. It promises to greatly reduce the number of bans and increase the quality of the structured data extracted, at a decent price.