Colly provides a simple API for defining how requests are handled, HTML is parsed, and data is extracted. This tutorial guides you through setting up a basic web scraper, demonstrates how to scrape data from a sample website, and discusses best practices for web scraping in Go.

Getting Started

Installation

Before you begin, ensure you have Go installed on your machine. You can install Colly using the Go package manager. Open your terminal and run:

go get -u github.com/gocolly/colly/v2

Creating a Basic Scraper

Let's create a simple web scraper that extracts article titles and links from a blog. Start by creating a new Go file, scraper.go, and add the following code:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Set up a callback to extract each article title and link
    c.OnHTML("h2.entry-title a", func(e *colly.HTMLElement) {
        title := e.Text
        link := e.Attr("href")
        fmt.Printf("Title: %s, Link: %s\n", title, link)
    })

    // Set up a callback to follow pagination links
    c.OnHTML("div.pagination a.next", func(e *colly.HTMLElement) {
        // Resolve relative hrefs against the current page URL
        nextLink := e.Request.AbsoluteURL(e.Attr("href"))
        fmt.Println("Next page:", nextLink)
        if err := c.Visit(nextLink); err != nil {
            log.Println("Failed to visit next page:", err)
        }
    })

    // Start the scraping process
    err := c.Visit("https://example-blog.com")
    if err != nil {
        log.Fatal(err)
    }
}

Code Explanation

  1. Collector Initialization: We create a new collector instance with colly.NewCollector(), which is responsible for making requests and handling responses.
  2. Element Callbacks: We define callbacks for specific HTML elements. In this case, we are targeting h2.entry-title a, which represents the article titles and their links. The OnHTML method allows us to specify how to handle these elements when they are found.
  3. Pagination Handling: We also set up a callback to handle pagination by looking for the div.pagination a.next element. If found, we extract the link to the next page and visit it.
  4. Starting the Scraper: Finally, we initiate the scraping process by visiting the target URL.

Running the Scraper

To run the scraper, execute the following command in your terminal:

go run scraper.go

You should see the titles and links of articles printed in your terminal.

Handling Errors and Rate Limiting

When scraping websites, it's essential to handle errors gracefully and respect the target site's policies. Colly provides built-in support for rate limiting and error handling.

Error Handling

You can set up an error callback to log errors during the scraping process:

c.OnError(func(r *colly.Response, err error) {
    log.Printf("Request to %s failed with status %d: %v", r.Request.URL, r.StatusCode, err)
})

Rate Limiting

To avoid overwhelming the target server, you can set a delay between requests by registering a colly.LimitRule with the c.Limit method (this requires adding "time" to your imports). Limit returns an error, so check it:

if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*example-blog.com*",
    Parallelism: 2,
    Delay:       2 * time.Second,
}); err != nil {
    log.Fatal(err)
}

This rule allows at most 2 concurrent requests to domains matching the glob, with a 2-second delay between requests.

Storing Scraped Data

After extracting data, you might want to store it in a structured format. For this example, we will save the scraped data in a CSV file. You can use the encoding/csv package to achieve this.

CSV Example

Here’s how you can modify the previous example to save the scraped data:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Create a CSV file
    file, err := os.Create("articles.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    // Write header
    writer.Write([]string{"Title", "Link"})

    // Extract each article title and link and write it to the CSV
    c.OnHTML("h2.entry-title a", func(e *colly.HTMLElement) {
        title := e.Text
        link := e.Attr("href")
        fmt.Printf("Title: %s, Link: %s\n", title, link)

        // Write to CSV
        writer.Write([]string{title, link})
    })

    // Start the scraping process
    err = c.Visit("https://example-blog.com")
    if err != nil {
        log.Fatal(err)
    }
}

Best Practices for Web Scraping

  1. Respect robots.txt: Always check the site's robots.txt file to understand which pages can be scraped.
  2. Rate Limiting: Implement rate limiting to avoid overwhelming the server with requests.
  3. Error Handling: Gracefully handle errors to ensure your scraper can recover from issues.
  4. Data Storage: Choose an appropriate format for storing scraped data, such as CSV, JSON, or a database.

Conclusion

In this tutorial, we explored how to build a web scraper in Go using the Colly framework. We covered the basics of setting up a scraper, handling pagination, managing errors, and storing data in CSV format. By following best practices, you can create efficient and respectful web scrapers.

Learn more with useful resources: