Getting started with the go-colly scraper (scraping English quotes from famous people)


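I recently came across a quote-library website: https://www.goodreads.com

That led me to Colly, a Go scraping framework with more than 6K stars on GitHub.

Project goal

I first want to scrape all of the quotes from goodreads and save them to a file. I then want to parse the scraped quotes and format each one for clean Markdown rendering, so that it is displayed on a web page like this:

Each quote has three elements: the quote type, the quote body, and the author or source.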

“We are what we pretend to be, so we must be careful about what we pretend to be.”

Kurt Vonnegut, Mother Night

“Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”

Neil Gaiman, Fables & Reflections
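The three elements listed above map naturally onto a small Go type. This is only a minimal sketch, not the code used in the draft program: the `Quote` type, its field names, and the `Markdown` method are hypothetical, and the admonition shortcode layout is an assumption based on the output format the draft writes.

```go
package main

import (
	"fmt"
	"strings"
)

// Quote is a hypothetical model of one scraped entry.
type Quote struct {
	Kind   string // admonition type, for example "quote"
	Text   string // the quote body
	Source string // the author or source
}

// Markdown renders the quote as an admonition shortcode block
// (layout assumed from the draft program's output).
func (q Quote) Markdown() string {
	var b strings.Builder
	// %% emits a literal percent sign inside the format string.
	fmt.Fprintf(&b, "{{%% admonition %s \"%s\" %%}}\n", q.Kind, q.Source)
	b.WriteString(q.Text)
	b.WriteString("\n{{% /admonition %}}\n")
	return b.String()
}

func main() {
	q := Quote{
		Kind:   "quote",
		Text:   "“Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”",
		Source: "Neil Gaiman, Fables & Reflections",
	}
	fmt.Print(q.Markdown())
}
```

Keeping the rendering in one method means the scraping code only has to fill in the three fields, and the output format can change without touching the scraper.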

Preparation

Brief introduction

Lightning Fast and Elegant Scraping Framework for Gophers.

Colly provides a clean interface to write any kind of crawler/scraper/spider.

With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

go-colly Git repository URL

gocolly/colly : https://github.com/gocolly/colly

Installation

$ go get -u github.com/gocolly/colly/...

Go environment

$ go version
go version go1.12.8 linux/amd64

Optionally, you can run export GO111MODULE=on.

Quick start

Draft:

package main

import (
    "fmt"
    "os"
    "regexp"
    "strings"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
)

func main() {

    fileName := "quote.md"
    file, errFile := os.Create(fileName)
    if errFile != nil {
        // The builtin println does not interpret %s, so format with fmt instead.
        fmt.Fprintf(os.Stderr, "create file error: %s\n", errFile.Error())
        panic(errFile)
    }
    defer func() {
        err := file.Close()
        if err != nil {
            println("file close error")
        }
    }()

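    c := colly.NewCollector()
    // Route requests through a local HTTP proxy; remove this block if no proxy is needed.
    errProxy := c.SetProxy("http://127.0.0.1:1080/")
    if errProxy != nil {
        fmt.Fprintf(os.Stderr, "colly set proxy error: %s\n", errProxy.Error())
        panic(errProxy)
    }
    // AllowedDomains takes host names without a scheme:
    // c.AllowedDomains = []string{"www.goodreads.com"}
    c.AllowURLRevisit = true
    // Send a random User-Agent header with each request.
    extensions.RandomUserAgent(c)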

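    // Handle every element with the quoteText class.
    c.OnHTML(".quoteText",
        func(e *colly.HTMLElement) {
            // Keep only the part before the "―" separator as the quote body.
            text := strings.TrimSpace(strings.Split(e.Text, "―")[0])
            // The author or title may span several lines; flatten it to one.
            author := TrimSpaceNewlineInString(strings.TrimSpace(e.ChildText(".authorOrTitle")))

            fileWriteForMarkdown(file, text, author)
        })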

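    // Follow the "next page" link to continue through the result pages.
    c.OnHTML(".next_page", func(e *colly.HTMLElement) {
        href := e.Attr("href")
        if href == "" {
            return // nothing to follow, e.g. a disabled "next" control
        }
        next := e.Request.AbsoluteURL(href)
        println("visit: ", next)
        if errHrefVisit := c.Visit(next); errHrefVisit != nil {
            panic(errHrefVisit)
        }
    })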

    errVisit := c.Visit("https://www.goodreads.com/quotes/tag/philosophy")
    if errVisit != nil {
        panic(errVisit)
    }

}

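// newlineRe is compiled once instead of on every call.
var newlineRe = regexp.MustCompile(`\n`)

// TrimSpaceNewlineInString replaces every newline in s with a space.
func TrimSpaceNewlineInString(s string) string {
    return newlineRe.ReplaceAllString(s, " ")
}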

// fileWriteForMarkdown writes one quote as an admonition shortcode block.
// lines[0] is the quote text and lines[1] is the author or source.
func fileWriteForMarkdown(file *os.File, lines ...string) {
    if len(lines) < 2 {
        println("file write error: expected quote text and author")
        return
    }
    // %% emits a literal percent sign inside the format string.
    _, err := fmt.Fprintf(file,
        "\n{{%% admonition quote \"%s\" %%}}\n%s\n{{%% /admonition %%}}\n",
        lines[1], lines[0])
    if err != nil {
        println("file write error ", err.Error())
    }
}

// fileWriteDirect writes each given string to the file as-is, with no
// Markdown formatting (typically the quote text followed by the author).
func fileWriteDirect(file *os.File, lines ...string) {
    for _, line := range lines {
        if _, err := file.WriteString(line); err != nil {
            println("file write error ", err.Error())
        }
    }
}
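The string handling inside the .quoteText callback is easy to try out without touching the network. The sketch below is an illustration only and differs from the draft: it takes the author from the text after the "―" separator instead of from the .authorOrTitle child element, and it collapses all whitespace runs rather than replacing newlines only. The `splitQuote` name and the sample input layout are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuote separates the quote body from the attribution that follows
// the "―" separator and collapses whitespace runs in both parts.
// If the separator is missing, source is returned empty.
func splitQuote(raw string) (text, source string) {
	parts := strings.SplitN(raw, "―", 2)
	// strings.Fields splits on any whitespace, including newlines.
	text = strings.Join(strings.Fields(parts[0]), " ")
	if len(parts) == 2 {
		source = strings.Join(strings.Fields(parts[1]), " ")
	}
	return text, source
}

func main() {
	// Assumed shape of the element text: body, separator, multi-line author.
	raw := "\n  “We are what we pretend to be, so we must be careful about what we pretend to be.”\n  ―\n  Kurt Vonnegut,\n    Mother Night\n"
	text, source := splitQuote(raw)
	fmt.Println(text)
	fmt.Println(source)
}
```

Because the function is pure, it can be covered by ordinary unit tests, which is harder to do for logic that lives inside a colly callback.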