
Building Web Scrapers in Go with Colly
Colly is a popular web scraping framework for Go that provides a simple API for defining how to handle requests, parse HTML, and extract data. This tutorial walks you through setting up a basic web scraper, demonstrates how to scrape data from a sample website, and discusses best practices for web scraping in Go.
Getting Started
Installation
Before you begin, ensure you have Go installed on your machine. You can install Colly using the Go package manager. Open your terminal and run:
go get -u github.com/gocolly/colly/v2

Creating a Basic Scraper
Let's create a simple web scraper that extracts article titles and links from a blog. Start by creating a new Go file, scraper.go, and add the following code:
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Create a new collector
	c := colly.NewCollector()

	// Set up a callback for each article title link that is found
	c.OnHTML("h2.entry-title a", func(e *colly.HTMLElement) {
		title := e.Text
		link := e.Attr("href")
		fmt.Printf("Title: %s, Link: %s\n", title, link)
	})

	// Set up a callback to follow the pagination link to the next page
	c.OnHTML("div.pagination a.next", func(e *colly.HTMLElement) {
		nextLink := e.Attr("href")
		fmt.Println("Next page:", nextLink)
		// e.Request.Visit resolves relative URLs against the current page
		e.Request.Visit(nextLink)
	})

	// Start the scraping process
	err := c.Visit("https://example-blog.com")
	if err != nil {
		log.Fatal(err)
	}
}

Code Explanation
- Collector Initialization: We create a new collector instance with colly.NewCollector(), which is responsible for making requests and handling responses.
- Element Callbacks: We define callbacks for specific HTML elements. In this case, we are targeting h2.entry-title a, which matches the article titles and their links. The OnHTML method allows us to specify how to handle these elements when they are found.
- Pagination Handling: We also set up a callback to handle pagination by looking for the div.pagination a.next element. If found, we extract the link to the next page and visit it.
- Starting the Scraper: Finally, we initiate the scraping process by visiting the target URL.
Running the Scraper
To run the scraper, execute the following command in your terminal:
go run scraper.go

You should see the titles and links of articles printed in your terminal.
Handling Errors and Rate Limiting
When scraping websites, it's essential to handle errors gracefully and respect the target site's policies. Colly provides built-in support for rate limiting and error handling.
Error Handling
You can set up an error callback to log errors during the scraping process:
c.OnError(func(r *colly.Response, err error) {
	log.Printf("Request to %s failed: %v", r.Request.URL, err)
})

Rate Limiting
To avoid overwhelming the target server, you can set a delay between requests with the c.Limit method. Note that the rule type is colly.LimitRule, that the Delay field requires importing the time package, and that c.Limit returns an error worth checking:

err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*example-blog.com*",
	Parallelism: 2,
	Delay:       2 * time.Second,
})
if err != nil {
	log.Fatal(err)
}

This configuration allows a maximum of 2 concurrent requests per matching domain, with a 2-second delay between requests.
Storing Scraped Data
After extracting data, you might want to store it in a structured format. For this example, we will save the scraped data in a CSV file. You can use the encoding/csv package to achieve this.
CSV Example
Here’s how you can modify the previous example to save the scraped data:
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Create a new collector
	c := colly.NewCollector()

	// Create a CSV file
	file, err := os.Create("articles.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	// Write the header row
	writer.Write([]string{"Title", "Link"})

	// Set up a callback for each article title link that is found
	c.OnHTML("h2.entry-title a", func(e *colly.HTMLElement) {
		title := e.Text
		link := e.Attr("href")
		fmt.Printf("Title: %s, Link: %s\n", title, link)
		// Append the record to the CSV file
		writer.Write([]string{title, link})
	})

	// Start the scraping process
	err = c.Visit("https://example-blog.com")
	if err != nil {
		log.Fatal(err)
	}
}

Best Practices for Web Scraping
- Respect Robots.txt: Always check the site's robots.txt file to understand which pages can be scraped.
- Rate Limiting: Implement rate limiting to avoid overwhelming the server with requests.
- Error Handling: Gracefully handle errors to ensure your scraper can recover from issues.
- Data Storage: Choose an appropriate format for storing scraped data, such as CSV, JSON, or a database.
Conclusion
In this tutorial, we explored how to build a web scraper in Go using the Colly framework. We covered the basics of setting up a scraper, handling pagination, managing errors, and storing data in CSV format. By following best practices, you can create efficient and respectful web scrapers.