Weird Sort-and-Compress Thing

A weird integer sorting + compression algorithm inspired by a conversation with Luthingx (it probably already exists by some name I don't know yet). There's a lot still to improve about this algorithm, so be careful where you use it.

How it works

Here's an example for the following list:

l = [1, 2, 2, 2, 3]

The algorithm starts with a counting sort, building a dictionary with each unique number as the key and the number of occurrences of that number in the list as the value:

d = {1: 1, 2: 3, 3: 1}
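The counting step can be sketched roughly like this. The function name `count_occurrences` is made up for illustration, and using `collections.Counter` is an assumption, not necessarily how the original code does it:

```python
from collections import Counter

def count_occurrences(values):
    """Map each unique integer to how often it appears, with keys in ascending order."""
    counts = Counter(values)
    # Rebuild the dict in sorted key order so later steps can walk it front to back.
    return {key: counts[key] for key in sorted(counts)}

l = [1, 2, 2, 2, 3]
d = count_occurrences(l)
print(d)  # {1: 1, 2: 3, 3: 1}
```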

To decrease the space needed to store the keys in memory, we'll only store the first key as it is, and then, for each of the following keys, the difference between it and the previous one:

d2 = [(1, 1), (1, 3), (1, 1)]
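A minimal sketch of this delta step, assuming non-negative integer keys. The name `delta_encode` is hypothetical, and treating the first key as "difference from 0" is just one convenient way to keep it unchanged:

```python
def delta_encode(counts):
    """Turn {key: count} into [(key_delta, count), ...] with keys in ascending order."""
    pairs = []
    previous = 0  # assumption: the first key is stored as its difference from 0, i.e. as-is
    for key in sorted(counts):
        pairs.append((key - previous, counts[key]))
        previous = key
    return pairs

d = {1: 1, 2: 3, 3: 1}
d2 = delta_encode(d)
print(d2)  # [(1, 1), (1, 3), (1, 1)]
```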

Now, the minimum amount of memory we need to store any key in d2 is 1 bit, since 1 is the maximum difference between any two consecutive keys. The same applies to the values, except that storing any value here takes 2 bits, since the maximum value is 3 (11 in binary). So we can store this list as a sequence of 3-bit elements, like this:

d2_bin = ["101", "111", "101"]
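Here's one way the bit widths and the fixed-width elements could be computed. This is a sketch: `bit_widths` and `to_bit_strings` are made-up names, and reserving at least 1 bit per key even when every difference is 0 is an assumption:

```python
def bit_widths(pairs):
    """Return (key_bits, value_bits): the widths needed for the largest delta and count."""
    key_bits = max(max(k for k, _ in pairs).bit_length(), 1)    # at least 1 bit per key
    value_bits = max(max(v for _, v in pairs).bit_length(), 1)  # at least 1 bit per value
    return key_bits, value_bits

def to_bit_strings(pairs):
    """Render every (key_delta, count) pair as a zero-padded, fixed-width bit string."""
    key_bits, value_bits = bit_widths(pairs)
    return [format(k, f"0{key_bits}b") + format(v, f"0{value_bits}b") for k, v in pairs]

d2 = [(1, 1), (1, 3), (1, 1)]
print(bit_widths(d2))      # (1, 2)
print(to_bit_strings(d2))  # ['101', '111', '101']
```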

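We can now return the list as a single number (all the elements concatenated into one integer, 101111101 in binary for this example), along with a pair of integers containing the number of bits in each key and the number of bits in each value (here 1 and 2), which allows the list to be decompressed.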
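Packing and unpacking could look something like the sketch below. `pack` and `unpack` are hypothetical names, and decoding from the low end of the integer is a choice made here, not something this README specifies. It relies on every count being at least 1, so no element is all zeros and the loop can stop once the integer reaches 0:

```python
def pack(pairs, key_bits, value_bits):
    """Concatenate (key_delta, count) pairs into one integer, first pair in the highest bits."""
    packed = 0
    for key_delta, count in pairs:
        packed = (packed << key_bits) | key_delta
        packed = (packed << value_bits) | count
    return packed

def unpack(packed, key_bits, value_bits):
    """Rebuild the sorted list from the packed integer and the two bit widths."""
    key_mask = (1 << key_bits) - 1
    value_mask = (1 << value_bits) - 1
    chunks = []
    while packed:  # every element is nonzero because its count is >= 1
        count = packed & value_mask
        packed >>= value_bits
        key_delta = packed & key_mask
        packed >>= key_bits
        chunks.append((key_delta, count))
    result, key = [], 0
    for key_delta, count in reversed(chunks):  # chunks were read last-to-first
        key += key_delta                       # undo the delta encoding
        result.extend([key] * count)           # undo the counting step
    return result

packed = pack([(1, 1), (1, 3), (1, 1)], 1, 2)
print(bin(packed))           # 0b101111101
print(unpack(packed, 1, 2))  # [1, 2, 2, 2, 3]
```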

Memory efficiency

Here's the sum of the bit lengths of all numbers in a list of 100 elements, filled with random values in the range 0 to 50 and generated 20 times, vs. the number of bits in the resulting compressed integer (assuming that all the numbers in the array, duplicates included, are stored in contiguous memory):

And for 1000 numbers from 0 to 50, also generated 20 times:

4724 => 358
4827 => 309
4818 => 308
4801 => 309
4763 => 309
4763 => 309
4801 => 359
4757 => 359
4766 => 309
4794 => 309
4769 => 309
4789 => 359
4887 => 359
4787 => 309
4761 => 309
4749 => 309
4844 => 308
4798 => 359
4799 => 308
4763 => 359
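A comparison like this one can be reproduced roughly with the sketch below. It's not the script that produced the numbers above: the helper names, the fixed seed, and the exact way bits are counted (Python's `int.bit_length`, which reports 0 bits for the value 0) are all assumptions, so the output will be in the same ballpark without matching line by line:

```python
import random
from collections import Counter

def raw_bits(values):
    """Sum of the bit lengths of all values, duplicates included."""
    return sum(v.bit_length() for v in values)

def compressed_bits(values):
    """Bits used by the fixed-width (key_delta, count) elements, excluding the two widths."""
    counts = Counter(values)
    keys = sorted(counts)
    deltas = [keys[0]] + [b - a for a, b in zip(keys, keys[1:])]
    key_bits = max(max(deltas).bit_length(), 1)     # at least 1 bit per key
    value_bits = max(counts.values()).bit_length()  # counts are always >= 1
    return len(keys) * (key_bits + value_bits)

rng = random.Random(0)  # the seed is an arbitrary choice
for _ in range(20):
    sample = [rng.randint(0, 50) for _ in range(1000)]
    print(f"{raw_bits(sample)} => {compressed_bits(sample)}")
```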

Owner

Douglas
