An Indexer that works out-of-the-box when you have fewer than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out-of-the-box when you have fewer than 100K stored Documents. U100K means "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or roughly 20 to 120 QPS for batched queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to perform nearest-neighbour search. It returns the full Documents as-is, so there is no need to chain it with another key-value indexer to retrieve Documents.
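The idea of exhaustive search that returns the stored items themselves can be sketched in plain Python. This is only an illustration of the technique, not the Indexer's code: the dict-based documents, the `exhaustive_search` helper, and the choice of cosine distance are all assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(query, stored, limit=2):
    """Score the query against every stored item and return the closest ones.

    Hypothetical helper: every stored embedding is compared, so no
    candidate is skipped, and the full stored items are returned.
    """
    scored = [(cosine_distance(query, doc["embedding"]), doc) for doc in stored]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [doc for _, doc in scored[:limit]]


# Toy 3-dim stand-ins for stored Documents (real embeddings would be 768-dim).
stored_docs = [
    {"id": "a", "text": "first", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "text": "second", "embedding": [0.0, 1.0, 0.0]},
    {"id": "c", "text": "third", "embedding": [0.9, 0.1, 0.0]},
]

matches = exhaustive_search([1.0, 0.0, 0.0], stored_docs, limit=2)
print([m["id"] for m in matches])  # ['a', 'c']
```

Because each match is the stored item itself, its other fields (here `text`) are available without a second lookup, which mirrors why no separate key-value indexer is needed.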

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K stored Documents
Always returns full Documents
No extra dependencies

Cons

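Slow query time beyond 100K stored Documents: query time grows roughly in proportion to the amount of stored data (see Performance)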

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
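The QPS range quoted in the overview can be derived from the 100,000 row of the table. The short calculation below is a sketch that assumes each reported query time covers the whole batch, so throughput is batch size divided by time; the helper name is made up for the example.

```python
# Query times in seconds at 100,000 stored Documents, copied from the table.
query_seconds = {1: 0.297, 8: 0.332, 64: 0.536}


def queries_per_second(batch_size, seconds):
    """Throughput, assuming `seconds` is the time for the whole batch."""
    return batch_size / seconds


qps = {size: queries_per_second(size, t) for size, t in query_seconds.items()}

for size, value in qps.items():
    print(f"batch size {size}: {value:.1f} QPS")
# batch size 1: 3.4 QPS
# batch size 8: 24.1 QPS
# batch size 64: 119.4 QPS
```

Under that assumption, batches of 8 and 64 give about 24 and 119 QPS, which matches the 20 to 120 QPS range stated for batched queries.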

Tips

To change the workspace, pass metas to the constructor:

U100KIndexer(metas={'workspace': './my'})

Or pass uses_metas when you use it in a Flow: .add(..., uses_metas={'workspace': './my'})

Owner

Jina AI