A training task for web scraping using Python multithreading and a list of available proxy servers that is refreshed at run time.

Last update: Feb 10, 2022

Overview

Parallel web scraping

The project is a training task in web scraping. It combines Python multithreading with a list of available proxy servers that is refreshed at run time.

Goal

The script extracts the names and prices of the top 100 crypto coins and stores them in a database.

Disclaimer

The task is quite contrived and serves mainly for study purposes. There are numerous mature sources of both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with one level of nesting are scraped. The crawl gathers the internal links from the main page and then loops over them.
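The link-gathering step can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the names `LinkCollector`, `extract_internal_links`, and `crawl_one_level` are hypothetical, and `fetch` stands for any function that takes a URL and returns the page HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(html, base_url):
    """Return absolute same-host links from html, in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == base_host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def crawl_one_level(main_url, fetch):
    """Fetch the main page, then every internal link found on it (one level deep)."""
    pages = {main_url: fetch(main_url)}
    for link in extract_internal_links(pages[main_url], main_url):
        if link not in pages:
            pages[link] = fetch(link)
    return pages
```

Passing `fetch` in as a parameter keeps the crawl logic independent of how a page is actually downloaded, so the same loop works with a direct request or a proxied one.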
To avoid getting banned by the remote server, a mechanism that routes requests through proxy servers was implemented.
Free public proxy servers are commonly assumed to be unreliable in terms of availability. To overcome this issue:
- a separate scraping script extracts a list of free public proxy servers from a website.
- on each launch of the script, the list of 10 proxy servers is refreshed with currently available ones.
- during script execution, some proxy servers become unavailable, so each scraping request walks through the list until it finds a live proxy server to execute the request.
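The proxy handling described above might look roughly like this. It is a sketch under assumptions, not the project's implementation: the function names are hypothetical, proxies are assumed to be plain `host:port` strings, and the standard `urllib.request.ProxyHandler` is used where the real script may rely on a different HTTP library.

```python
import http.client
import urllib.request


def fetch_through_proxy(url, proxy, timeout=5.0):
    """Fetch url through one proxy given as 'host:port' (assumed format)."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def refresh_proxy_list(candidates, is_alive, limit=10):
    """Keep the first `limit` candidates for which is_alive(proxy) is true."""
    alive = []
    for proxy in candidates:
        if is_alive(proxy):
            alive.append(proxy)
            if len(alive) == limit:
                break
    return alive


def fetch_with_failover(url, proxies, fetch=fetch_through_proxy):
    """Try each proxy in turn and return the first successful response."""
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except (OSError, http.client.HTTPException):
            # Dead or misbehaving proxy: move on to the next one.
            continue
    raise RuntimeError(f"no live proxy could fetch {url}")
```

Here `refresh_proxy_list` corresponds to the update at launch, and `fetch_with_failover` to the per-request search for a live proxy. A failed proxy is simply skipped for that request rather than removed from the list, which is one of several reasonable choices.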
To speed up the scraping of the 101 web pages in total, multithreading is used. The work is divided among 4 threads running almost simultaneously.
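One common way to split such work among 4 threads is a thread pool. The sketch below assumes `concurrent.futures.ThreadPoolExecutor`; the original script may manage its threads differently, and `scrape_all` and `scrape_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, scrape_one, workers=4):
    """Run scrape_one(url) for every URL on a pool of `workers` threads.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_one, urls))
```

Threads suit this task because each request spends most of its time waiting on the network, and that waiting can overlap across threads even though Python runs only one thread's bytecode at a time.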
The extracted data is written directly to a database.
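The README does not say which database is used. As an illustration only, the storage step could be sketched with SQLite from the standard library; the table name `coins`, its columns, and the helper names are assumptions.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the coins table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coins ("
        "name TEXT PRIMARY KEY, price REAL NOT NULL)"
    )
    return conn


def save_coins(conn, rows):
    """Insert or update (name, price) pairs in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)", rows
        )
```

By default a `sqlite3` connection may only be used from the thread that created it, so in a multithreaded scraper the simplest arrangement is to collect the results from the worker threads and call `save_coins` from the main thread.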

Owner

Kushal Shingote
