Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (optional, Linux only)
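
Because uvloop is only installed on Linux, code that wants the faster event loop usually treats it as optional. The following is a minimal sketch of that general pattern, not a description of how gain enables uvloop internally; the names new_loop and main are made up for illustration.

```python
import asyncio

try:
    # Optional dependency: only present if `pip install uvloop` succeeded.
    import uvloop
    new_loop = uvloop.new_event_loop
except ImportError:
    # Fall back to the standard asyncio event loop.
    new_loop = asyncio.new_event_loop


async def main():
    # Stand-in for real asynchronous work.
    await asyncio.sleep(0)
    return 'done'


loop = new_loop()
try:
    result = loop.run_until_complete(main())
finally:
    loop.close()
```

The same script runs unchanged on systems without uvloop; it simply uses the default loop there.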

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append each post title on its own line.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'] + '\n')


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d from being treated as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
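
To see which links each of the two Parser patterns above is meant to pick up, the regular expressions can be tried on their own with the standard re module. This is only a sketch: the sample URLs and the classify helper are hypothetical, and the use of re.fullmatch is an assumption, since how gain applies the patterns internally is not shown here.

```python
import re

# The two patterns from the spider above, written as raw strings.
PAGE_PATTERN = r'https://blog.scrapinghub.com/page/\d+/'
POST_PATTERN = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'


def classify(url):
    """Return 'page', 'post', or None depending on which pattern matches."""
    if re.fullmatch(PAGE_PATTERN, url):
        return 'page'
    if re.fullmatch(POST_PATTERN, url):
        return 'post'
    return None


# Hypothetical URLs, chosen only to exercise each branch.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath/',
    'https://blog.scrapinghub.com/about/',
]
labels = [classify(url) for url in samples]
```

In the spider, the first pattern has no Item attached, so it only describes listing pages to follow, while the second is paired with Post so that matching pages are parsed and saved.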

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider, as shown above.
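
Both spiders set concurrency = 5, which presumably caps how many requests are in flight at once. The sketch below illustrates that idea with asyncio.Semaphore; it is not gain's implementation, the crawl and fetch names are hypothetical, and asyncio.sleep stands in for a real HTTP request. It uses asyncio.run, so it needs Python 3.7 or later.

```python
import asyncio


async def crawl(urls, concurrency=5):
    """Process urls with at most `concurrency` fetches running at once."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0  # fetches currently holding the semaphore
    peak = 0       # highest value in_flight ever reached

    async def fetch(url):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a real HTTP request
            in_flight -= 1
            return url

    results = await asyncio.gather(*(fetch(url) for url in urls))
    return results, peak


# Hypothetical URLs; nothing is actually downloaded.
urls = ['https://example.com/page/%d/' % i for i in range(20)]
results, peak = asyncio.run(crawl(urls, concurrency=5))
```

All 20 tasks are created immediately, but the semaphore keeps the number of simultaneous fetches at or below the limit, so peak never exceeds 5.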

Run python spider.py to start the crawl.

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.

Owner

Jiuli Gao
