Inverted index creation and query search mechanism on Wikipedia pages.

Overview

WikiPedia Search Engine

Step 1 : Installing Requirements

Install "stemming" module for python using pip.

Step 2 : Parsing the Data

To parse the data, run the file "wikiIndexer.py"

To run the file, the syntax is: "python WikipediaIndexer.py "

It will parse the whole dump and file the index files in the 'indexFiles' directory. It also creates the document to title mapping file in the current directory named 'docTitleMap.txt' which will be used by the search module later. (NOTE : As of current code, it pushes the index to disk for every 5000 documents encountered. This can be changed by changing the value of ' WRITE_PAGES_TO_FILE' macro in config.py)

Step 3 : Merging the Indexes and Creating Secondary Indexes

This task is done by WikipediaIndexer.py itself. There isn't any need for separate command. It takes the index files from 'pathOfFolder' directory and populates the 'finalIndex' directory with indexes of given chunk size and creates a secondary index named 'secondaryIndex.txt' in the same folder. (NOTE : As of the current code, the chunk size is kept 20000)

Step 4 : Running the Search Engine

To search for queries, run the file 'search.py'. It loads the index from 'finalIndex' (both primary and secondary). It also uses the file 'titleoffset.txt' (which must be in the working directory) to display titles corresponding to the docIDs. After it loads up, it gives the user a prompt to enter the query. After that the result of query is displayed. (top K results).

The format of result is:

	
   
     : 
    
      (Here the DocID is from the XML Dump)
	...
	
     
    
   

To specify normal queries, type them normally. For Field Queries, follow the format:

	f1:
   
     f2:
    
      ...
	where f1,f2 are fields : t - title, b - body, r - references, i - infobox, c - categories, e - external links

    
   
Owner
Piyush Atri
Piyush Atri
Jina allows you to build deep learning-powered search-as-a-service in just minutes

Cloud-native neural search framework for any kind of data

Jina AI 17k Dec 31, 2022
Pythonic search engine based on PyLucene.

Lupyne is a search engine based on PyLucene, the Python extension for accessing Java Lucene. Lucene is a relatively low-level toolkit, and PyLucene wr

A. Coady 83 Jan 02, 2023
MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch.

MeiliSearch FastAPI MeiliSearch FastAPI provides FastAPI routes for interacting with MeiliSearch. Installation Using a virtual environmnet is recommen

Paul Sanders 29 Nov 18, 2022
ForFinder is a search tool for folder and files

ForFinder is a search tool for folder and files. You can use that when you Source Code Analysis at your project's local files or other projects that you are download. Enter a root path and keyword to

Çağrı Aliş 7 Oct 25, 2022
A play store search application programming interface ( API )

Play-Store-API A play store search application programming interface ( API ) Made with Python3

Fayas Noushad 8 Oct 21, 2022
Google Drive file searcher

Google Drive file searcher

Hafitz Setya 25 Dec 09, 2022
Pythonic Lucene - A simplified python impelementaiton of Apache Lucene

A simplified python impelementaiton of Apache Lucene, mabye helps to understand how an enterprise search engine really works.

Mahdi Sadeghzadeh Ghamsary 2 Sep 12, 2022
A simple search engine that allow searching for chess games

A simple search engine that allow searching for chess games based on queries about opening names & opening moves. Built with Python 3.10 and python-chess.

Tyler Hoang 1 Jun 17, 2022
Deep Image Search - AI-Based Image Search Engine

Deep Image Search is an AI-based image search engine that includes deep transfer learning features Extraction and tree-based vectorized search technique.

144 Jan 05, 2023
Full-text multi-table search application for Django. Easy to install and use, with good performance.

django-watson django-watson is a fast multi-model full-text search plugin for Django. It is easy to install and use, and provides high quality search

Dave Hall 1.1k Jan 03, 2023
User-friendly, tiny source code searcher written by pure Python.

User-friendly, tiny source code searcher written in pure Python. Example Usages Cat is equivalent in the regular expression as '^Cat$' bor class Cat

Furkan Onder 106 Nov 02, 2022
a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot

Drive Search Bot This is a Telegram bot writen in Python for searching files in Drive. Based on SearchX-bot How to deploy? Clone this repo: git clone

Hafitz Setya 25 Dec 09, 2022
A library for fast parse & import of Windows Prefetch into Elasticsearch.

prefetch2es Fast import of Windows Prefetch(.pf) into Elasticsearch. prefetch2es uses C library libscca. Usage When using from the commandline interfa

S.Nakano 5 Nov 24, 2022
A search engine to query social media insights with political theme

social-insights Social insights is an open source big data project that generates insights about various interesting topics happening every day. Curre

UMass GDSC 10 Feb 28, 2022
Full text search for flask.

flask-msearch Installation To install flask-msearch: pip install flask-msearch # when MSEARCH_BACKEND = "whoosh" pip install whoosh blinker # when MSE

honmaple 197 Dec 29, 2022
🔍 Messages Searcher is make for search custom message in all channels in guild and dm.

🔍 Messages Searcher is make for search custom message in all channels in guild and dm.

Kaneki 33 Dec 31, 2022
ElasticSearch ODM (Object Document Mapper) for Python - pip install esengine

esengine - The Elasticsearch Object Document Mapper esengine is an ODM (Object Document Mapper) it maps Python classes in to Elasticsearch index/doc_t

SEEK International AI 109 Nov 22, 2022
Modular search for Django

Haystack Author: Daniel Lindsley Date: 2013/07/28 Haystack provides modular search for Django. It features a unified, familiar API that allows you to

Haystack Search 3.4k Jan 04, 2023
Google Search Engine Results Pages (SERP) in locally, no API key, no signup required

Local SERP Google Search Engine Results Pages (SERP) in locally, no API key, no signup required Make sure the chromedriver and required package are in

theblackcat102 4 Jun 29, 2021
Inverted index creation and query search mechanism on Wikipedia pages.

WikiPedia Search Engine Step 1 : Installing Requirements Install "stemming" module for python using pip. Step 2 : Parsing the Data To parse the data,

Piyush Atri 1 Nov 27, 2021