Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Overview

Product Reviews Summarizer

Version 1.0.0

A quick guide on installation of important libraries and running the code.

The project has three .ipynb files - Data Scraper.ipynb, cosine-similarity-wo-tf-idf.ipynb, and cosine-similarity-w-tf-idf.ipynb.


Data Scraper

For the Data Scraper python script, we need to import the following three libraries - requests, BeautifulSoup, and pandas. The installation process can be viewed by clicking on the respective library names.

Splash

In this project, instead of using the default web browser to scrape data, we have created a splash container using docker. Splash is a light-weight javascript rendering service with an HTTP API. For easy installation, you can watch this amazing video by John Watson Rooney on YouTube.

https://www.youtube.com/watch?v=8q2K41QC2nQ&t=361s

Note: You need to make sure that you give the Splash Localhost URL to the requests.get().

Running the code

After you have installed and configured everything, you can run the code by providing the URL of your choice. Suppose, you are taking a product from Amazon, make sure to go to All Reviews page and go to page #2. Copy this URL upto the last '=' and paste it as an f-string in the code. Add a '{x}' after the '='. The code is ready to run. It will scrape the product name, review title, star rating, and the review body from each page, until the last page is encountered, and save it in .xlsx format.

Note: Specify the required output name and destination.


cosine-similarity-wo-tf-idf

For the cosine similarity model, first we need to download the pretrained GloVe Word Embeddings. Run the Load GloVe Word Embeddings section in the script once. It is only required if the kernel is restarted.

For this script, we need to import the following libraries - numpy, pandas, nltk, nltk.tokenize, nltk.corpus, re, sklearn.metrics.pairwise, networkx, transformers, and time. Also run the nltk.download('punkt') and nltk.download('stopwords') lines to download them.

Next step is to load the data as a dataframe. Make sure to give the correct address. Pre-processing of the reviews is done for efficient results. The pre-processing steps include converting to string datatype, converting alphabetical characters to lowercase, removing stopwords, replacing non-alphabetical characters with blank character and tokenizing the sentences.

The pre-processed data is then grouped based on star ratings and sent to the cosine similarity and pagerank algorithm. The top 10 ranked sentences after the applying the pagerank algorithm are sent to huggingface transformers to create an extractive summary (min_lenght = 75, max_length = 300). The summary, along with the product name, star rating, no of reviews, % of total reviews, and the top 5 frequent words along with the count are saved in .xlsx format.

Note: Specify the required output name and destination.


cosine-similarity-w-tf-idf

For this model, along with the above libraries, we need to import the following additional libraries - spacy, and heapq. The cosine similarity algorithm has a time complexity of O(n^2). In order to have a fast execution, in this method, we are using tf-idf measure to score the frequent words, and hence the corresponding sentences. Only the top 1000 sentences are then sent to the cosine similarity algorithm. Usage of the tf-idf measure, ensures that each product, irrespective of the number of sentences in the reviews, gives an output within 120 seconds. This method makes sure no important feature is lost, giving similar results as the previous method but in considerately less time.


Contributors

© Parv Bhatt © Namratha Sri Mateti © Dominic Thomas


Owner
Parv Bhatt
Masters in Data Analytics Student at Penn State University
Parv Bhatt
Treemap visualisation of Maya scene files

Ever wondered which nodes are responsible for that 600 mb+ Maya scene file? Features Fast, resizable UI Parsing at 50 mb/sec Dependency-free, single-f

Marcus Ottosson 76 Nov 12, 2022
Concept Modeling: Topic Modeling on Images and Text

Concept is a technique that leverages CLIP and BERTopic-based techniques to perform Concept Modeling on images.

Maarten Grootendorst 120 Dec 27, 2022
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
A linter to manage all your python exceptions and try/except blocks (limited only for those who like dinosaurs).

Manage your exceptions in Python like a PRO Currently in BETA. Inspired by this blog post. I shared the building process of this tool here. “For those

Guilherme Latrova 353 Dec 31, 2022
Code of paper: A Recurrent Vision-and-Language BERT for Navigation

Recurrent VLN-BERT Code of the Recurrent-VLN-BERT paper: A Recurrent Vision-and-Language BERT for Navigation Yicong Hong, Qi Wu, Yuankai Qi, Cristian

YicongHong 109 Dec 21, 2022
A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

Roberto Sanchez 0 Aug 04, 2021
Python SDK for working with Voicegain Speech-to-Text

Voicegain Speech-to-Text Python SDK Python SDK for the Voicegain Speech-to-Text API. This API allows for large vocabulary speech-to-text transcription

Voicegain 3 Dec 14, 2022
Text to speech for Vietnamese, ez to use, ez to update

Chào mọi người, đây là dự án mở nhằm giúp việc đọc được trở nên dễ dàng hơn. Rất cảm ơn đội ngũ Zalo đã cung cấp hạ tầng để mình có thể tạo ra app này

Trần Cao Minh Bách 32 Jul 29, 2022
Machine learning models from Singapore's NLP research community

SG-NLP Machine learning models from Singapore's natural language processing (NLP) research community. sgnlp is a Python package that allows you to eas

AI Singapore | AI Makerspace 21 Dec 17, 2022
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022
A CSRankings-like index for speech researchers

Speech Rankings This project mimics CSRankings to generate an ordered list of researchers in speech/spoken language processing along with their possib

Mutian He 19 Nov 26, 2022
Translates basic English sentences into the Huna language (hoo-NAH)

huna-translator The Huna Language Translates basic English sentences into the Huna language (hoo-NAH). The Huna constructed language was developed in

Miles Smith 0 Jan 20, 2022
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Yongliang Shen 45 Nov 29, 2022
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

172 Dec 29, 2022
Bnagla hand written document digiiztion

Bnagla hand written document digiiztion This repo addresses the problem of digiizing hand written documents in Bangla. Documents have definite fields

Mushfiqur Rahman 1 Dec 10, 2021
Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

CMU Locus Lab 3.5k Jan 03, 2023
A website which allows you to play with the GPT-2 transformer

transformers A website which allows you to play with the GPT-2 model Built with ❤️ by raphtlw Table of contents Model Setup About Contributors Model T

raphtlw 2 Jan 27, 2022