GoogleSpider

Crawls Google search results for a given keyword.

Config

Database

Data is currently stored in MongoDB. The database settings are on lines 15-19 of the setting.py file and can be changed to match your environment.

# MONGODB
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""
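
As a rough illustration of how these settings could be combined into a connection string, the sketch below builds a standard `mongodb://` URI using only the standard library. The helper name `build_mongo_uri` is hypothetical and not part of this project, and it assumes that an empty user name means the server does not require authentication.

```python
from urllib.parse import quote_plus

# Values copied from the settings above.
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""


def build_mongo_uri(ip, port, db, user="", password=""):
    """Hypothetical helper: build a mongodb:// URI from the settings."""
    credentials = ""
    if user:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        credentials = "%s:%s@" % (quote_plus(user), quote_plus(password))
    return "mongodb://%s%s:%d/%s" % (credentials, ip, port, db)


uri = build_mongo_uri(MONGO_IP, MONGO_PORT, MONGO_DB,
                      MONGO_USER_NAME, MONGO_USER_PASS)
print(uri)  # mongodb://localhost:27017/Google_spider
```

A URI of this form can be passed to a MongoDB client such as `pymongo.MongoClient`.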

Log

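import os

LOG_NAME = os.path.basename(os.getcwd())  # defaults to the working directory name
LOG_PATH = "log/%s.log" % LOG_NAME  # log file path
LOG_LEVEL = "DEBUG"
LOG_COLOR = True  # colorize log output
LOG_IS_WRITE_TO_CONSOLE = True  # print log records to the console
LOG_IS_WRITE_TO_FILE = True  # write log records to LOG_PATH
LOG_MODE = "w"  # file open mode
LOG_MAX_BYTES = 10 * 1024 * 1024  # maximum bytes per log file
LOG_BACKUP_COUNT = 20  # number of rotated log files to keep
LOG_ENCODING = "utf8"  # log file encoding
OTHERS_LOG_LEVAL = "ERROR"  # log level for third-party loggers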
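
To make the role of these options more concrete, here is a minimal sketch of how settings like these could be mapped onto Python's standard `logging` module. The helper `build_logger` is hypothetical and is not the project's actual logging code; `LOG_COLOR` and `OTHERS_LOG_LEVAL` are left out of the sketch.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def build_logger(name, path, level="DEBUG", mode="w",
                 max_bytes=10 * 1024 * 1024, backup_count=20,
                 encoding="utf8", to_console=True, to_file=True):
    """Hypothetical helper: create a logger from LOG_*-style settings."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    formatter = logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
    if to_file:
        # Create the log directory (e.g. "log/") if it does not exist yet.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        # Note: RotatingFileHandler forces append mode when maxBytes > 0,
        # so mode="w" only takes effect when rotation is disabled.
        file_handler = RotatingFileHandler(
            path, mode=mode, maxBytes=max_bytes,
            backupCount=backup_count, encoding=encoding)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    if to_console:
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
```

With the values above, a call such as `build_logger(LOG_NAME, LOG_PATH, LOG_LEVEL, LOG_MODE)` would write to both the console and a rotating file under `log/`.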

Spider

Download interval
```
SPIDER_SLEEP_TIME = [0, 1]
```
Maximum number of retries per request (100 by default)
```
SPIDER_MAX_RETRY_TIMES = 100
```
Note

If an illegal interface is encountered during crawling, a 'user agent -- illegal interface' exception is thrown. The crawler task then retries until the data is crawled successfully or the retry limit (100 by default) is exceeded.
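
The retry behaviour described in the note can be sketched roughly as follows. This is not the project's implementation: `fetch_with_retry` is a hypothetical helper, and it assumes that the two values in `SPIDER_SLEEP_TIME` are the lower and upper bounds, in seconds, of a random delay between attempts.

```python
import random
import time

SPIDER_SLEEP_TIME = [0, 1]
SPIDER_MAX_RETRY_TIMES = 100


def fetch_with_retry(fetch, max_retries=SPIDER_MAX_RETRY_TIMES,
                     sleep_range=SPIDER_SLEEP_TIME, sleep=time.sleep):
    """Hypothetical helper: call fetch() until it succeeds or retries run out."""
    last_error = None
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception as error:  # a real crawler would catch a narrower type
            last_error = error
            if attempt < max_retries:
                # Assumption: wait a random time within the configured range.
                sleep(random.uniform(*sleep_range))
    raise RuntimeError("gave up after %d retries" % max_retries) from last_error
```

Passing the `sleep` function as a parameter keeps the sketch easy to test, since the delay can be replaced with a no-op.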

Data structure

| key | value type | example |
| --- | --- | --- |
| title | str | "Donald Trump - Wikipedia" |
| keyword | str | "Trump" |
| url | str | "https://en.wikipedia.org/wiki/Donald_Trump" |
| text | str | "Donald Trump - Wikipedia 1 hour ago · Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States ... Vice President: Mike Pence In office January 20, 2017 – January 20, 2021: In office; January 20, 2017 – January 20, 2021 Occupation: Politician; businessman; television presenter Parents: Fred Trump; Mary Anne MacLeod" |
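
For readers who want to work with stored records in code, the fields above could be described with a small type, as in the sketch below. The names `SearchResult` and `has_result_fields` are hypothetical and do not come from the project; the check only confirms that the four documented string fields are present and tolerates extra keys such as MongoDB's `_id`.

```python
from typing import TypedDict


class SearchResult(TypedDict):
    """Hypothetical type describing one stored search result."""
    title: str
    keyword: str
    url: str
    text: str


def has_result_fields(record):
    """Return True if record contains the four documented string fields."""
    return all(isinstance(record.get(key), str)
               for key in SearchResult.__annotations__)


example = SearchResult(
    title="Donald Trump - Wikipedia",
    keyword="Trump",
    url="https://en.wikipedia.org/wiki/Donald_Trump",
    text="Donald Trump - Wikipedia 1 hour ago ...",  # shortened for the example
)
print(has_result_fields(example))
```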

Quick start

Crawl 3 pages of results for the keyword 'Trump':

from spiders.google_curl import GoogleCurl

spider = GoogleCurl('Trump', 3)
spider.start()

The first parameter is the search keyword, and the second parameter is the number of pages to crawl.
