Pizza Orders Data Pipeline Usecase Solved by SQL, Sqoop, HDFS, Hive, Airflow.

Overview

PizzaOrders_DataPipeline

There is a Tony who is owning a New Pizza shop.
He knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire
so he had one more genius idea to combine with it - he was going to Uberize it - and so Pizza Runner was launched!

Tony started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Tony’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

Now he wants to know how is his business going on he needs some answers to his questions from the data. but the data which is stored is not in an appropriate format. He Approaches a Data Engineer to process and store the data for him and get the answers to his question

The data are stored in the different CSV files

  • customer_orders.csv
    Columns=>order_id,customer_id,pizza_id,exclusions,extras,order_time
  • pizza_names.csv
    Columns=> pizza_id,pizza_name
  • pizza_recipes.csv
    Columns=>pizza_id,toppings
  • pizza_toppings.csv
    Columns=>topping_id,topping_name
  • runner_orders.csv
    Columns=>order_id,runner_id,pickup_time,distance,duration,cancellation
  • runners.csv
    Columns=> runner_id,registration_date

The Answers the Tony wanted for

  1. How many pizzas were ordered?
  2. How many unique customer orders were made?
  3. How many successful orders were delivered by each runner?
  4. How many of each type of pizza was delivered?
  5. How many Vegetarian and Meatlovers were ordered by each customer?
  6. What was the maximum number of pizzas delivered in a single order?
  7. For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
  8. How many pizzas were delivered that had both exclusions and extras?
  9. What was the total volume of pizzas ordered for each hour of the day?
  10. Wh/at was the volume of orders for each day of the week?

Requirements

  1. Store the data In MY SQL table
  2. Using Sqoop Store the Data in Hive
  3. Using the PySpark the get the Results for the question
  4. Store the Results in Seperate Table
  5. Automate entire process in the Airflow

AirFlow Output

Owner
Melwin Varghese P
Melwin Varghese P
DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in cluste

Amazon Web Services - Labs 53 Dec 08, 2022
Python for Data Analysis, 2nd Edition

Python for Data Analysis, 2nd Edition Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media Buy

Wes McKinney 18.6k Jan 08, 2023
Port of dplyr and other related R packages in python, using pipda.

Unlike other similar packages in python that just mimic the piping syntax, datar follows the API designs from the original packages as much as possible, and is tested thoroughly with the cases from t

179 Dec 21, 2022
MoRecon - A tool for reconstructing missing frames in motion capture data.

MoRecon - A tool for reconstructing missing frames in motion capture data.

Yuki Nishidate 38 Dec 03, 2022
Working Time Statistics of working hours and working conditions by industry and company

Working Time Statistics of working hours and working conditions by industry and company

Feng Ruohang 88 Nov 04, 2022
An Aspiring Drop-In Replacement for NumPy at Scale

Legate NumPy is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime. Using Legate NumPy you do things like run the f

Legate 502 Jan 03, 2023
Efficient matrix representations for working with tabular data

Efficient matrix representations for working with tabular data

QuantCo 70 Dec 14, 2022
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
Investigating EV charging data

Investigating EV charging data Introduction: Got an opportunity to work with a home monitoring technology company over the last 6 months whose goal wa

Yash 2 Apr 07, 2022
Nobel Data Analysis

Nobel_Data_Analysis This project is for analyzing a set of data about people who have won the Nobel Prize in different fields and different countries

Mohammed Hassan El Sayed 1 Jan 24, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Data imputations library to preprocess datasets with missing data

Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.

Elton Law 329 Dec 05, 2022
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.

Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i

Ahmed Fasih 1 Dec 22, 2021
OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere.

opendrift OpenDrift is a software for modeling the trajectories and fate of objects or substances drifting in the ocean, or even in the atmosphere. Do

OpenDrift 167 Dec 13, 2022
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
ForecastGA is a Python tool to forecast Google Analytics data using several popular time series models.

ForecastGA is a tool that combines a couple of popular libraries, Atspy and googleanalytics, with a few enhancements.

JR Oakes 36 Jan 03, 2023
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G

13 Nov 01, 2022
Data science/Analysis Health Care Portfolio

Health-Care-DS-Projects Data Science/Analysis Health Care Portfolio Consists Of 3 Projects: Mexico Covid-19 project, analyze the patient medical histo

Mohamed Abd El-Mohsen 1 Feb 13, 2022
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022