A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲

Table of Contents

  1. Overview
  2. Demo
  3. Prerequisites
  4. Instructions
  5. Frequently Asked Questions

Overview

Problem

  1. WhatsApp's built-in chat exports are limited to a maximum of 40,000 messages
  2. Exports skip the text portion of media messages, replacing the entire message with <Media omitted> instead of, for example, <Media omitted> My favorite selfie of us 😻🐶🤳
  3. Exports are limited to the .txt file format

Solution

WhatSoup solves these problems by loading the entire chat history in a browser, scraping the chat messages (text only, no media), and exporting them to .txt, .csv, or .html file formats.

Example output:

WhatsApp Chat with Bob Ross.txt

02/14/2021, 02:04 PM - Eddy Harrington: Hey Bob 👋 Let's move to Signal!
02/14/2021, 02:05 PM - Bob Ross: You can do anything you want. This is your world.
02/15/2021, 08:30 AM - Eddy Harrington: How about we use WhatSoup 🍲 to backup our cherished chats?
02/15/2021, 08:30 AM - Bob Ross: However you think it should be, that’s exactly how it should be.
02/15/2021, 08:31 AM - Eddy Harrington: You're the best, Bob ❤
02/19/2021, 11:24 AM - Bob Ross: <Media omitted> My latest happy 🌲 painting for you.
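
The line format shown above is regular enough to post-process. As a minimal sketch (not WhatSoup's own exporter), the following parses exported .txt lines and converts them to CSV text. The function names and the CSV column names are assumptions made for illustration, and continuation lines of multi-line messages are simply skipped here.

```python
import csv
import io
import re
from datetime import datetime

# One exported line looks like: "MM/DD/YYYY, HH:MM AM/PM - Sender: Message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2} [AP]M) - ([^:]+): (.*)$"
)


def parse_line(line):
    """Return (timestamp, sender, message), or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    stamp, sender, message = match.groups()
    return datetime.strptime(stamp, "%m/%d/%Y, %I:%M %p"), sender, message


def to_csv(lines):
    """Convert exported .txt lines to CSV text with a header row (hypothetical columns)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["timestamp", "sender", "message"])
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:  # non-matching lines are skipped in this sketch
            stamp, sender, message = parsed
            writer.writerow([stamp.isoformat(), sender, message])
    return buffer.getvalue()


sample = [
    "02/14/2021, 02:04 PM - Eddy Harrington: Hey Bob 👋 Let's move to Signal!",
    "02/14/2021, 02:05 PM - Bob Ross: You can do anything you want. This is your world.",
]
print(to_csv(sample))
```

Parsing the timestamp into a datetime makes it easy to sort or filter messages by date before writing them out again.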

Demo

Watch the video on YouTube

Prerequisites

  • You have a WhatsApp account
  • You have Chrome browser installed
  • You have some familiarity with setting up and running Python scripts
  • Your terminal supports Unicode (UTF-8) characters (for chat emojis)

Instructions

  1. Make sure the language in your WhatsApp chat settings is set to English. This needs to be done on your phone (instructions here). You can change it back afterwards, but for now the script relies on certain HTML elements/attributes that contain English characters/words.

  2. Clone the repo:

    git clone https://github.com/eddyharrington/WhatSoup.git
    
  3. Create a virtual environment:

    # Windows
    python -m venv env
    
    # Linux & Mac
    python3 -m venv env
    
  4. Activate the virtual environment:

    # Windows
    env\Scripts\activate
    
    # Linux & Mac
    source env/bin/activate
    
  5. Install the dependencies:

    # Windows
    pip install -r requirements.txt
    
    # Linux & Mac
    python3 -m pip install -r requirements.txt
    
  6. Set up your environment:

  • Download ChromeDriver and extract it to a local folder (such as the env folder)

  • Get your Chrome browser Profile Path by opening Chrome and entering chrome://version into the URL bar

  • Create a .env file with entries for DRIVER_PATH and CHROME_PROFILE that specify the paths to your ChromeDriver and your Chrome Profile from the steps above:

    # Windows
    DRIVER_PATH = 'C:\path-to-your-driver\chromedriver.exe'
    CHROME_PROFILE = 'C:\Users\your-username\AppData\Local\Google\Chrome\User Data'
    
    # Linux & Mac
    DRIVER_PATH = '/Users/your-username/path-to-your-driver/chromedriver'
    CHROME_PROFILE = '/Users/your-username/Library/Application Support/Google/Chrome/Default'
    
  7. Run the script:

    # Windows
    python whatsoup.py
    
    # Linux & Mac
    python3 whatsoup.py
    

    Note for Mac users: you may get blocked when trying to run the script the first time with a message about chromedriver not being from an identified developer. This is normal. Follow these instructions to grant chromedriver an exception, then re-run the script.
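
Before running the script, it can help to confirm that the two .env entries point at paths that actually exist. The sketch below is a standalone pre-flight check using only the standard library. It is independent of however WhatSoup itself loads the file, and the function names `read_env` and `missing_settings` are hypothetical.

```python
from pathlib import Path


def read_env(text):
    """Parse KEY = 'value' lines; blank lines and # comments are ignored."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and single or double quotes
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_settings(values, required=("DRIVER_PATH", "CHROME_PROFILE")):
    """Return the required keys that are absent or point to a non-existent path."""
    problems = []
    for key in required:
        path = values.get(key)
        if not path or not Path(path).exists():
            problems.append(key)
    return problems


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.is_file():
        # An empty list means both paths were found
        print(missing_settings(read_env(env_file.read_text(encoding="utf-8"))))
    else:
        print(".env file not found")
```

Run it from the WhatSoup directory. Any key it prints needs to be corrected in the .env file.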

Frequently Asked Questions

Does it download pictures / media?

No.

How large a chat can I load/export?

The most demanding part of the process is loading the entire chat in the browser. Performance depends heavily on how much memory your computer has and how well Chrome handles the large DOM. For reference, my largest chat (~50k messages) uses about 10 GB of RAM. If you load more than the current record, let me know and add yourself to the leader board.

WhatSoup Largest Chat Leader Board

#    Name   Date         Message Count   Time
🥇   Eddy   2021-02-28   47,550          28,139 sec / 7.8 hrs
🥈   ?      ?            ?               ?
🥉   ?      ?            ?               ?

How long does it take to load/export?

It depends on the chat size and your computer's performance, but the table below gives a ballpark range. For large chats, I recommend turning off your PC's sleep/power-saving settings and running the script in the evening or before bed so it loads overnight.

# of msgs in chat history   Load time
500                         1 min
5,000                       12 min
10,000                      35 min
25,000                      3.5 hrs
50,000                      8 hrs
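
For chat sizes that fall between the rows above, a rough figure can be read off by linear interpolation. The sketch below assumes load time grows linearly between neighboring rows, which is a simplification rather than a measured result. The function name `estimate_minutes` is hypothetical.

```python
from bisect import bisect_left

# Ballpark figures from the table above: (message count, load time in minutes)
LOAD_TIMES = [(500, 1), (5_000, 12), (10_000, 35), (25_000, 210), (50_000, 480)]


def estimate_minutes(message_count):
    """Linearly interpolate between table rows; clamp outside the table's range."""
    counts = [count for count, _ in LOAD_TIMES]
    if message_count <= counts[0]:
        return float(LOAD_TIMES[0][1])
    if message_count >= counts[-1]:
        # No data beyond the last row, so this is only a lower bound
        return float(LOAD_TIMES[-1][1])
    upper = bisect_left(counts, message_count)
    (x0, y0), (x1, y1) = LOAD_TIMES[upper - 1], LOAD_TIMES[upper]
    return y0 + (y1 - y0) * (message_count - x0) / (x1 - x0)


print(estimate_minutes(7_500))  # 23.5
```

Actual times vary with hardware, so treat the result as an order-of-magnitude guide when deciding whether to leave the script running overnight.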

Why is it so slow?!

Browsers are easily bottlenecked when loading massive amounts of rich data in WhatsApp, a WebSocket application that is constantly sending/receiving information and changing the HTML/DOM.

I'm open to ideas but most of the things I tried didn't help performance:

  • Chrome vs Firefox
  • Headless browsing
  • Disabling images
  • Removing elements from DOM
  • Changing 'experimental' browser settings to allocate more memory

Can I...

  1. Use Firefox instead of Chrome? Yes, not out of the box though. There are a few Selenium differences and nuances to get it working, which I can share if there's interest. TODO.

  2. Use headless? Yes, but I only got this to work with Firefox and not Chrome.

  3. Use WhatSoup to scrape a local WhatsApp HTML file? Yes, you'd just need to bypass a few functions from main() and load the HTML file into Selenium's driver, then run the scraping/exporting functions as in the example below. If there's enough interest I can look into adding this to WhatSoup myself. TODO.

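    # Load and scrape data from a local HTML file
    def local_scrape(driver):
        # driver.get() expects a URL, so use the file:// scheme with forward slashes
        driver.get('file:///C:/your-WhatSoup-dir/source.html')
        # scrape_chat() and scrape_is_exported() are WhatSoup's own functions
        scraped = scrape_chat(driver)
        scrape_is_exported("source", scraped)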
    
  4. Contribute to WhatSoup? Please do!
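
On item 3 above: Selenium's driver.get() expects a URL, so a local HTML file is normally addressed with the file:// scheme rather than a bare path. Here is a small sketch of building that URL with the standard library. The helper name `to_file_url` is hypothetical, and the directory name is the same placeholder used in the example.

```python
from pathlib import Path, PureWindowsPath


def to_file_url(path):
    """Return a file:// URL for a local path, resolved against the current directory."""
    return Path(path).resolve().as_uri()


# Pure paths let a Windows-style path be converted on any operating system
print(PureWindowsPath(r"C:\your-WhatSoup-dir\source.html").as_uri())
# file:///C:/your-WhatSoup-dir/source.html
```

The resulting string can be passed straight to driver.get(), which also avoids backslash escape problems in ordinary Python string literals.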

Owner
Eddy Harrington