A fast Python implementation of the SimHash algorithm.

Overview

FLoC SimHash

This Python package provides hashing algorithms for computing cohort ids of users based on their browsing history. As such, it may be used to compute cohort ids of users following Google's Federated Learning of Cohorts (FLoC) proposal.

The FLoC proposal is an important part of The Privacy Sandbox, which is Google's replacement for third-party cookies. FLoC will enable interest-based advertising, thus preserving an important source of monetization for today's web.

The main idea, as outlined in the FLoC whitepaper, is to replace user cookie ids, which enable user targeting across multiple sites, with cohort ids. A cohort consists of a set of users sharing similar browsing behaviour. By targeting a given cohort, advertisers can ensure that relevant ads are shown, while user privacy is preserved by a "hiding in the pack" mechanism.

The FLoC whitepaper mentions several mechanisms to map users to cohorts, with varying amounts of centralized information. The algorithms currently being implemented in Google Chrome as a proof of concept (POC) are based on SimHash, a type of locality-sensitive hashing initially introduced for detecting near-duplicate documents.

Installation

The floc-simhash package is available on PyPI. Install it with pip as follows.

pip install floc-simhash

The package requires python>=3.7 and will install scikit-learn as a dependency.

Usage

The package provides two main classes.

  • SimHash, applying the SimHash algorithm on the md5 hashes of tokens in the given document.

  • SimHashTransformer, applying the SimHash algorithm to a document vectorization as part of a scikit-learn pipeline.

Finally, there is a third class available:

  • SortingSimHash, which performs the SortingLSH algorithm by first applying SimHash and then clipping the resulting hashes to a given precision.
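
To make the idea of clipping concrete, here is a minimal sketch that is not taken from the package. It assumes that "clipping to a given precision" means keeping only the most significant bits of the hash, so that users whose hashes share a prefix land in the same group. The function name clip_hash and its parameters are hypothetical.

```python
def clip_hash(hex_hash, n_bits, precision):
    """Keep only the `precision` most significant bits of an n_bits hash.

    Assumption: this mirrors "clipping to a given precision"; the actual
    SortingSimHash class may clip differently.
    """
    if not 0 < precision <= n_bits:
        raise ValueError("precision must be in 1..n_bits")
    # Shift out the low-order bits, leaving the top `precision` bits.
    value = int(hex_hash, 16) >> (n_bits - precision)
    # Zero-pad to the number of hex digits needed for `precision` bits.
    width = (precision + 3) // 4
    return format(value, "0{}x".format(width))


print(clip_hash("d98c7e93", n_bits=32, precision=16))  # d98c
print(clip_hash("d10b79b3", n_bits=32, precision=4))   # d
```

With a lower precision, more hashes collapse onto the same clipped value, so groups become larger.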

Individual document-based SimHash

The SimHash class provides a way to calculate the SimHash of any given document, without using any information coming from other documents.

In this case, the document hash is computed by looking at md5 hashes of individual tokens. We use:

  • The implementation of the md5 hashing algorithm available in the hashlib module in the Python standard library.

  • Bitwise arithmetic for fast computations of the document hash from the individual hashed tokens.
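
The two ingredients above can be illustrated with a small sketch of a classic token-based SimHash. It is not the package's implementation: the function name simhash_md5, the bit order and the tie-breaking rule (a bit is set only when strictly more than half of the tokens set it) are assumptions, so its output need not match the values printed by the SimHash class.

```python
import hashlib


def simhash_md5(document, n_bits=128, tokenizer=str.split):
    """Illustrative SimHash over md5 token hashes (not the package's code)."""
    if not 0 < n_bits <= 128:
        raise ValueError("md5 yields 128 bits, so n_bits must be in 1..128")
    tokens = tokenizer(document)
    # counts[i] is the number of tokens whose hash has bit i set.
    counts = [0] * n_bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Keep the n_bits most significant bits of the 128-bit digest.
        value = int.from_bytes(digest, "big") >> (128 - n_bits)
        for i in range(n_bits):
            counts[i] += (value >> i) & 1
    # Majority vote per bit position (assumed rule: strict majority).
    result = 0
    for i, count in enumerate(counts):
        if 2 * count > len(tokens):
            result |= 1 << i
    # Zero-padded hexadecimal string, one digit per 4 bits.
    width = (n_bits + 3) // 4
    return format(result, "0{}x".format(width))


print(simhash_md5("Lorem ipsum dolor sit amet consectetur adipiscing elit"))
```

For a document with a single token, the majority vote reduces to the md5 hash of that token, which is a quick sanity check for the sketch.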

The program below, for example, will print the following hexadecimal string: cf48b038108e698418650807001800c5.

from floc_simhash import SimHash

document = "Lorem ipsum dolor sit amet consectetur adipiscing elit"
hashed_document = SimHash(n_bits=128).hash(document)

print(hashed_document)

A more cohort-oriented example: the following program computes the cohort id of a user by applying SimHash to a document formed by the pipe-separated list of domains in the user's browsing history.

from floc_simhash import SimHash

document = "google.com|hybridtheory.com|youtube.com|reddit.com"
hasher = SimHash(n_bits=128, tokenizer=lambda x: x.split("|"))
hashed_document = hasher.hash(document)

print(hashed_document)

The code above will print the hexadecimal string: 14dd1064800880b40025764cd0014715.

Providing your own tokenizer

The SimHash constructor will split the given document according to white space by default. However, it is possible to pass any callable that parses a string into a list of strings in the tokenizer parameter. We have provided an example above where we pass tokenizer=lambda x: x.split("|").

For more complex tokenization, one could pass the word tokenizer from NLTK, which would be a good choice for computing hashes of text documents.
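
As a dependency-free illustration of a custom tokenizer, the sketch below defines a callable that maps a string to a list of strings, which is the shape the tokenizer parameter is described as accepting. The name domain_tokenizer and its normalization rules are hypothetical choices for this example.

```python
import re


def domain_tokenizer(document):
    """Split on pipes, commas or whitespace; lowercase and drop empty tokens."""
    parts = re.split(r"[|,\s]+", document.strip().lower())
    return [part for part in parts if part]


print(domain_tokenizer("Google.com| youtube.com,,reddit.com"))
# ['google.com', 'youtube.com', 'reddit.com']
```

Such a function could then be passed as tokenizer=domain_tokenizer in place of the lambda used earlier.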

Using the SimHashTransformer in scikit-learn pipelines

The approach to SimHash outlined in the FLoC Whitepaper consists of choosing random unit vectors and working on already vectorized data.

The choice of a random unit vector is equivalent to choosing a random hyperplane through the origin in feature space. Choosing p random hyperplanes partitions the feature space into at most 2^p regions. A p-bit SimHash of a vector then encodes the region to which it belongs.
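
The hyperplane construction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SimHashTransformer code: the helper names are hypothetical, the normals are drawn from a Gaussian without normalizing them (only the sign of the dot product matters, so the length of the normal is irrelevant), and vectors are dense lists rather than sparse matrices.

```python
import random


def random_hyperplanes(n_features, n_bits, seed=0):
    """Draw n_bits random normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_bits)]


def simhash_vector(vector, hyperplanes):
    """Encode which side of each hyperplane the vector lies on, one bit each."""
    result = 0
    for normal in hyperplanes:
        dot = sum(w * x for w, x in zip(normal, vector))
        # Append a 1 bit when the vector is on the positive side.
        result = (result << 1) | (1 if dot > 0 else 0)
    return result


planes = random_hyperplanes(n_features=4, n_bits=8, seed=42)
user_a = [1, 1, 0, 1]
user_b = [2, 2, 0, 2]  # same direction as user_a, hence same region

print(hex(simhash_vector(user_a, planes)))
print(simhash_vector(user_a, planes) == simhash_vector(user_b, planes))  # True
```

Vectors pointing in the same direction always receive the same hash; vectors at a small angle differ only when a hyperplane happens to fall between them.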

It is reasonable to expect similar documents to have the same hash, provided the vectorization respects the given notion of similarity.

Two vectorizations are discussed in the aforementioned whitepaper: one-hot and tf-idf; they are available in scikit-learn.

The SimHashTransformer supplies a transformer (implementing the fit and transform methods) that can be used directly on the output of either of these two vectorizers in order to obtain hashes.

For example, given a 1-d array X of strings, each being the "|"-separated list of domains visited by one user, the following code will store in y the cohort id of each user, using one-hot encoding and a 32-bit SimHash.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from floc_simhash import SimHashTransformer


X = [
    "google.com|hybridtheory.com|youtube.com|reddit.com",
    "google.com|youtube.com|reddit.com",
    "github.com",
    "google.com|github.com",
]

one_hot_simhash = Pipeline(
    [
        ("vect", CountVectorizer(tokenizer=lambda x: x.split("|"), binary=True)),
        ("simhash", SimHashTransformer(n_bits=32)),
    ]
)

y = one_hot_simhash.fit_transform(X)

After running this code, the value of y would look similar to the following (expect same lengths; actual hash values depend on the choice of random vectors during fit):

['0xd98c7e93' '0xd10b79b3' '0x1085154d' '0x59cd150d']

Caveats

  • The implementation works on the sparse matrices output by CountVectorizer and TfidfTransformer, in order to manage memory efficiently.

  • At the moment, the choice of precision in the numpy arrays results in overflow errors for p >= 64. While we are waiting for implementation details of the FLoC POCs, the first indications hint at choices around p = 50.

Development

This project uses poetry for managing dependencies.

To clone the repository and run the unit tests, execute the following steps in an environment with python>=3.7.

git clone https://github.com/hybridtheory/floc-simhash.git
cd floc-simhash
poetry install
pytest

The unit tests are property-based, using the hypothesis library. This allows for algorithm verification against hundreds or thousands of randomly generated inputs.

Since running many examples may lengthen the test suite runtime, we also use pytest-xdist in order to parallelize the tests. For example, the following call will run up to 1000 examples for each test with parallelism 4.

pytest -n 4 --hypothesis-profile=ci

Owner

Hybrid Theory (formerly Affectv)