Getting Started with Scrapy

Installation

To get started with Scrapy, you need to install it. This can be done easily using pip:

pip install Scrapy

Creating a New Scrapy Project

Once Scrapy is installed, you can create a new project by running the following command in your terminal:

scrapy startproject myproject

This command creates a directory structure that includes the following:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/

Defining Items

In Scrapy, an item is a simple container for the scraped data. You define items in items.py. Here’s an example of how to define an item for a book scraping project:

import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    availability = scrapy.Field()

Creating a Spider

Spiders are classes that you define to scrape information from a website. Create a new spider in the spiders directory. Here’s an example spider that scrapes book data from books.toscrape.com, a demo site built for scraping practice:

import scrapy
from myproject.items import BookItem

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        'http://books.toscrape.com/',
    ]

    def parse(self, response):
        for book in response.css('article.product_pod'):
            item = BookItem()
            item['title'] = book.css('h3 a::attr(title)').get()
            # The listing pages on this demo site carry no author element,
            # so this selector is illustrative and returns None here.
            item['author'] = book.css('p.author::text').get()
            item['price'] = book.css('p.price_color::text').get()
            # default='' guards against calling .strip() on None.
            item['availability'] = book.css('p.availability::text').get(default='').strip()
            yield item

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
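Note that the extracted next_page link is relative (for example, catalogue/page-2.html); response.follow resolves it against the current page URL before scheduling the request. A minimal sketch of that resolution, using only the standard library:

```python
from urllib.parse import urljoin

# next_page as extracted from li.next a::attr(href) on the first listing page.
current_url = 'http://books.toscrape.com/'
next_page = 'catalogue/page-2.html'

# response.follow performs an equivalent join internally.
absolute = urljoin(current_url, next_page)
print(absolute)  # http://books.toscrape.com/catalogue/page-2.html
```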

Running the Spider

To run your spider, navigate to your project’s root directory and execute the following command:

scrapy crawl books -O books.json

This command runs the books spider and writes the scraped data to a file named books.json. The uppercase -O flag overwrites the file on each run; the lowercase -o flag appends instead, which can produce invalid JSON across repeated runs.
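Exporting to a .json feed produces a single JSON array, which you can load back with the standard library. A small sketch (the string below stands in for real exported output):

```python
import json

# A stand-in for the contents of books.json produced by the crawl.
exported = '[{"title": "A Light in the Attic", "price": "£51.77"}]'

books = json.loads(exported)
print(len(books), books[0]["title"])
```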

Scrapy Settings

You can customize the behavior of your Scrapy project by modifying the settings.py file. Here are some important settings:

Setting          Description
USER_AGENT       Defines the user agent string that the spider sends with each request.
DOWNLOAD_DELAY   Sets a delay (in seconds) between requests to avoid overwhelming the server.
ITEM_PIPELINES   Defines which item pipelines run and in what priority order.

Example of modifying settings.py:

# settings.py

USER_AGENT = 'myproject (+http://www.yourdomain.com)'
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
   'myproject.pipelines.MyPipeline': 300,
}
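Beyond a fixed DOWNLOAD_DELAY, Scrapy's AutoThrottle extension can adjust the delay dynamically based on server load. It is enabled through settings such as the following (values here are illustrative):

```python
# settings.py (optional additions)

# Enable AutoThrottle to adapt the request delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```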

Item Pipelines

Item pipelines process the scraped data after it has been extracted. You define your pipelines in pipelines.py. Here’s an example that writes each scraped item as one JSON object per line (the JSON Lines format):

import json

class MyPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts; open the output file.
        # A .jl extension signals JSON Lines (one object per line),
        # which is what process_item writes below.
        self.file = open('books.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes; close the file.
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as a single JSON object on its own line.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
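Scrapy calls these three methods for you during a crawl, but the pipeline is plain Python and can be exercised by hand, which is handy for unit testing. A sketch with a hypothetical variant that takes a configurable path (writing to a temporary file instead of a fixed filename):

```python
import json
import tempfile

class JsonLinesPipeline:
    """Same shape as MyPipeline above, but with a configurable output path."""
    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

# Drive the pipeline directly, in the same order Scrapy would.
with tempfile.NamedTemporaryFile('r', suffix='.jl', delete=False) as tmp:
    out_path = tmp.name

pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'A Light in the Attic'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path) as f:
    lines = f.read().splitlines()
print(lines[0])
```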

Best Practices for Web Scraping with Scrapy

  1. Respect robots.txt: Check the robots.txt file of the website you are scraping to ensure that your actions comply with the site's scraping policy. Scrapy's ROBOTSTXT_OBEY setting, enabled by default in new projects, honors it automatically.
  2. Set a user agent: Configure USER_AGENT in your settings; some sites block Scrapy's default user agent, and a descriptive one identifies your crawler.
  3. Implement error handling: Use Scrapy's built-in logging, errback callbacks, and retry middleware to manage exceptions and retries.
  4. Throttle requests: Use the DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED settings to avoid overwhelming the target website.
  5. Store data efficiently: Choose the right storage format for your scraped data, such as JSON, CSV, or a database, depending on your needs.
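The first rule can also be checked programmatically with the standard library's urllib.robotparser. This sketch parses a sample robots.txt inline rather than fetching one over the network (the rules and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt; real sites serve this at /robots.txt.
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the request.
print(parser.can_fetch('mybot', 'http://example.com/catalogue/'))   # True
print(parser.can_fetch('mybot', 'http://example.com/private/data')) # False
```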

Conclusion

Scrapy is a powerful framework for web scraping that allows developers to extract data efficiently from websites. By following the steps outlined in this tutorial, you can create your own web scraper and implement best practices to ensure ethical scraping.
