MongoDB utility to inflate the contents of a small collection into a new, larger collection

Overview

MongoDB Data Inflater ("data-inflater")

The data-inflater tool is a MongoDB utility to automate the creation of a new large database collection using data sourced from an existing smaller database collection.

By default, the utility will use the Atlas 'sample data set' database collection sample_mflix.movies as the source collection. However, most users will provide parameters to the utility to specify the use of their own database and source collection. If you do want to use the Atlas sample data set, see the sample data manual page for more information.

The data-inflater utility issues multiple concurrent aggregation processes, each copying batches of records in parallel for increased performance. The resulting collection will contain documents with duplicated data but with new unique _id field values. The distribution of field values in the new collection will approximately mirror the distribution in the source collection. You should therefore supply at least a few different documents (ideally a few hundred or a few thousand) in the source collection.
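The copy step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the utility's actual code: the function names `plan_batches` and `build_copy_pipeline` are hypothetical, and the pipeline shape (`$limit`, `$unset`, `$merge`) is one plausible way to copy documents while letting the server assign a fresh _id to each copy.

```python
from typing import Any, Dict, List


def plan_batches(total_docs: int, workers: int) -> List[int]:
    """Split total_docs into per-worker batch sizes that sum to total_docs."""
    if total_docs < 0 or workers < 1:
        raise ValueError("total_docs must be >= 0 and workers >= 1")
    base, extra = divmod(total_docs, workers)
    # The first `extra` workers each copy one additional document.
    return [base + (1 if i < extra else 0) for i in range(workers)]


def build_copy_pipeline(target_coll: str, batch_size: int) -> List[Dict[str, Any]]:
    """Build a pipeline that copies at most batch_size documents into target_coll.

    Hypothetical sketch: the real utility may use different stages.
    """
    return [
        {"$limit": batch_size},
        # Removing _id means $merge inserts each copy with a newly generated _id.
        {"$unset": "_id"},
        {"$merge": {"into": target_coll}},
    ]


if __name__ == "__main__":
    # Example: spread 1,000,000 target documents across 8 concurrent workers.
    for size in plan_batches(1_000_000, 8):
        print(build_copy_pipeline("myDestColl", size))
```

With PyMongo, each worker could pass its pipeline to `aggregate()` on the source collection, for example from a thread or process pool. Note that `$limit` can never return more documents than the source holds, so a real run would repeat the pipeline until the target reaches the requested size.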

If you are running a sharded cluster, the utility will ensure the target collection is sharded with a shard key and, where it can, it will pre-split the chunks to avoid needless balancer overhead later. For example, if you use the --shardkey parameter to specify a field (e.g. product_name) as the range-based shard key, the utility will introspect the spread of values for that field in the source collection before creating the target collection. The utility will then create pre-split chunks in the new empty target collection before any data is copied to it, to maximise performance.
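To make the pre-splitting idea concrete, here is a minimal sketch of how chunk boundaries might be derived from a sample of shard key values. It is an assumption about the general technique, not the utility's implementation: `split_points` is a hypothetical helper, and it works on distinct values only, ignoring how often each value occurs.

```python
from typing import List, Sequence


def split_points(sample_values: Sequence[str], chunks: int) -> List[str]:
    """Return up to chunks - 1 boundary values that divide the sample evenly."""
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    ordered = sorted(set(sample_values))
    if chunks == 1 or len(ordered) < 2:
        return []
    points: List[str] = []
    for i in range(1, chunks):
        # Take the distinct value sitting at the i-th quantile position.
        candidate = ordered[(i * len(ordered)) // chunks]
        # Skip the minimum value and repeats, which would create empty chunks.
        if candidate != ordered[0] and (not points or candidate != points[-1]):
            points.append(candidate)
    return points


if __name__ == "__main__":
    names = ["apple", "banana", "cherry", "date", "elder", "fig", "grape", "kiwi"]
    print(split_points(names, 4))
```

Each returned value would then serve as a chunk boundary for the shard key field (product_name in the example above) when the empty target collection is split.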

How To Run

In a running MongoDB cluster (self-managed or running in Atlas), ensure you have created and populated a source collection with at least one sample record (ideally more, with field values that vary across documents to reflect the shape and variance you want).

Ensure Python3 (version 3.8 or greater) and the MongoDB Python Driver (PyMongo) are already installed on your workstation. Example to install PyMongo:

pip3 install --user pymongo
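If you want to confirm these prerequisites before running the utility, a small check along the following lines may help. It is a sketch only; the function names are hypothetical and are not part of data-inflater.

```python
import importlib.util
import sys


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.8 or newer."""
    return tuple(version_info[:2]) >= (3, 8)


def pymongo_is_installed() -> bool:
    """Return True if the pymongo package can be found on the import path."""
    return importlib.util.find_spec("pymongo") is not None


if __name__ == "__main__":
    print("Python 3.8+:", python_is_supported())
    print("PyMongo installed:", pymongo_is_installed())
```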

Ensure the .py script is executable and then execute the following to view the utility's help instructions and the full list of parameters that you can provide:

./data-inflater.py -h

Execute the following to connect to a locally running single-server database (default port) and copy and expand the data from an existing source collection, mydb.mySrcColl, to a new collection, mydb.myDestColl, which will contain 1 million records:

./data-inflater.py --url 'mongodb://localhost:27017' -d 'mydb' -c 'mySrcColl' -t 'myDestColl' -s 1000000

Execute the following to connect to an Atlas cluster (ensure you've already loaded the Atlas sample data set), to inflate the data from the source movies collection to the new movies_big collection, which will contain 100 million records (note, first change the URL username, password and hostname shown, to match the URL of your Atlas cluster):

./data-inflater.py --url 'mongodb+srv://usr:[email protected]/'
Owner
Paul Done