A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime Data - Batch Processing:

RDBMS Data Extraction Implementation

This project implements an ETL solution that extracts, transforms, and loads Chicago crime data from an RDS instance into other AWS services.

  • The project contains an Airflow DAG script, two PySpark application scripts, and a bootstrap actions script, each explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance from that EC2 instance, and the data is then loaded into the new table.
    • The create&Load.sql file contains the SQL for this table creation and data loading step.
    • The RDS credentials are stored as a secret in AWS Secrets Manager so that the RDS instance can be reached without exposing the password. Password rotation every 30 days is also configured for security.
  • The DAG described below loads the data prepared in these steps into the rest of the AWS environment.
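
As a rough illustration of how a script might read the stored secret, the sketch below fetches it by name and converts it into connection arguments. The key names (host, port, username, password) follow the layout Secrets Manager normally uses for RDS secrets, but they are an assumption here, as are the function names and the secret name in the usage note; the project's real secret may differ.

```python
import json


def parse_rds_secret(secret_string):
    """Turn the JSON text of an RDS secret into connection keyword arguments."""
    secret = json.loads(secret_string)
    return {
        "host": secret["host"],
        "port": int(secret.get("port", 3306)),  # MySQL default port if none is stored
        "user": secret["username"],
        "password": secret["password"],
    }


def get_rds_connection_args(secret_id, client=None):
    """Fetch the secret by name and return connection arguments.

    `client` is any object with a Secrets Manager style get_secret_value
    method. When it is omitted, a boto3 client is created, which requires
    boto3 and AWS credentials to be available.
    """
    if client is None:
        import boto3  # imported lazily so the parsing helper works without boto3

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return parse_rds_secret(response["SecretString"])
```

A call such as get_rds_connection_args("crime-rds-secret") (a hypothetical secret name) returns a dictionary that can be passed to a MySQL client. Because the password is fetched on every run, the 30-day rotation needs no code change.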

Implementation:

  • The Airflow DAG is placed in the s3://yavula-da-capstone/dag/ location of the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow (MWAA) console in a specific VPC.
  • The DAG is scheduled to run daily, with SLA monitoring that triggers an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • A run usually finishes in 32-34 minutes. If it takes longer than that, something has interrupted the DAG, and the logs can be checked to find the cause.
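
The daily schedule and the 36-minute limit could be expressed roughly as follows. This is a minimal sketch assuming Apache Airflow 2.x, where `sla` is a task-level argument and `sla_miss_callback` is set on the DAG. The DAG id, start date, and callback are hypothetical, and the project's actual alarm mechanism is not shown in the source.

```python
from datetime import datetime, timedelta

# Assumption: the 36-minute limit from the text is applied as a per-task SLA.
ETL_SLA = timedelta(minutes=36)

DEFAULT_ARGS = {
    "owner": "airflow",
    "depends_on_past": False,
    "sla": ETL_SLA,
}


def log_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called by Airflow when a task misses its SLA; here it only prints a message."""
    print(f"SLA missed in DAG {dag.dag_id}: {task_list}")


def build_dag():
    """Create the daily DAG. Requires Apache Airflow 2.x to be installed."""
    from airflow import DAG  # lazy import so the constants above load without Airflow

    return DAG(
        dag_id="crime_etl_sketch",        # hypothetical name
        default_args=DEFAULT_ARGS,
        schedule_interval="@daily",
        start_date=datetime(2022, 1, 1),  # hypothetical start date
        catchup=False,
        sla_miss_callback=log_sla_miss,
    )
```

Airflow measures a task's SLA from the scheduled start of the DAG run, not from the start of the task. Giving every task the same 36-minute value therefore acts as a limit on the whole ETL run, leaving a small margin over the usual 32-34 minutes.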

emr_job_flow_manual_steps_dag.py

This script defines the Airflow DAG.

Description

  • The script defines the tasks Airflow uses to create an EMR cluster on AWS for the processing described in the following sections.
  • It runs the steps that execute the Spark script on EMR, together with the bootstrap actions in bootstrap_actions.sh. That script is stored in an S3 bucket and installs required packages such as boto3 on the EMR instances.
  • A step sensor is added to watch this process. It periodically checks whether the last step has completed, been skipped, or been terminated.
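
The cluster creation, Spark steps, and step sensor described above might look like the sketch below. It follows the shape of the EMR RunJobFlow request and the Amazon provider package for Airflow 2.x. The S3 paths, cluster name, release label, instance type, and task ids are all hypothetical placeholders, not values taken from the project.

```python
# Hypothetical S3 locations; replace with the real bucket and keys.
BOOTSTRAP_SCRIPT = "s3://<your-bucket>/scripts/bootstrap_actions.sh"
INGEST_SCRIPT = "s3://<your-bucket>/scripts/spark_ingest_script.py"
PROCESS_SCRIPT = "s3://<your-bucket>/scripts/spark_process_script.py"

JOB_FLOW_OVERRIDES = {
    "Name": "crime-etl-cluster",   # hypothetical cluster name
    "ReleaseLabel": "emr-6.9.0",   # assumed EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    # The bootstrap script runs on every node before any step starts.
    "BootstrapActions": [
        {"Name": "Install Python packages",
         "ScriptBootstrapAction": {"Path": BOOTSTRAP_SCRIPT}}
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}


def spark_step(name, script):
    """Build one EMR step that submits a PySpark script with spark-submit."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script],
        },
    }


SPARK_STEPS = [
    spark_step("Ingest crime data", INGEST_SCRIPT),
    spark_step("Process crime data", PROCESS_SCRIPT),
]


def build_emr_tasks(dag):
    """Wire up create-cluster, add-steps, and step-sensor tasks (needs Airflow 2.x)."""
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    cluster_id = "{{ task_instance.xcom_pull(task_ids='create_job_flow', key='return_value') }}"
    create = EmrCreateJobFlowOperator(
        task_id="create_job_flow", job_flow_overrides=JOB_FLOW_OVERRIDES, dag=dag
    )
    add = EmrAddStepsOperator(
        task_id="add_steps", job_flow_id=cluster_id, steps=SPARK_STEPS, dag=dag
    )
    # Watch the last submitted step; the sensor polls until it reaches a final state.
    watch = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=cluster_id,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[-1] }}",
        dag=dag,
    )
    create >> add >> watch
    return create, add, watch
```

Keeping the cluster and step definitions as plain dictionaries means they can be checked or reused without Airflow installed. Only build_emr_tasks needs the Amazon provider package.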

spark_ingest_script.py

This Spark script, which is uploaded to S3 manually, ingests the required data from a table on an RDS instance, stores it in a raw S3 bucket, and catalogs it in Glue.

Description

  • The ingest script connects to the RDS instance using mysql-connector.
  • It reads the required crime data from the table into a Spark DataFrame, which is then written to AWS S3 and registered in the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
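
The S3 layout above can be sketched as a small path helper. The bucket / key / db-name / table-name order comes from the list above. The trailing snapshot_date=... partition folder, the Parquet format, and the function names are assumptions made for illustration, since the source does not say how the latest partition is named.

```python
from datetime import date


def snapshot_path(bucket, key, db_name, table_name, snapshot_date):
    """Build s3://bucket/key/db-name/table-name/snapshot_date=YYYY-MM-DD/."""
    partition = f"snapshot_date={snapshot_date.isoformat()}"  # assumed partition naming
    return f"s3://{bucket}/{key}/{db_name}/{table_name}/{partition}/"


def write_snapshot(df, path):
    """Write a Spark DataFrame to the snapshot path as Parquet (an assumed format)."""
    df.write.mode("overwrite").parquet(path)
```

For example, snapshot_path("my-raw-bucket", "raw", "crimedb", "crimes", date(2022, 1, 5)) gives s3://my-raw-bucket/raw/crimedb/crimes/snapshot_date=2022-01-05/. A Glue table can then be pointed at the newest such folder.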

spark_process_script.py

This Spark script, which is uploaded to S3 manually, queries the latest target table, filters the required crime details from it, stores the query results in a new final table, and saves them to the latest partition.

Description

  • The Spark script takes the ingested crime data as input and runs query processing on it.
  • It queries the required crime data from the table, processes it into a Spark DataFrame, and writes the result to AWS S3 and the Glue Data Catalog.
  • S3 file structure where the snapshot data is saved:
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition
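
One way the filtering step could be written is as a Spark SQL query over the ingested table. The sketch below is an assumption-heavy illustration. The column names primary_type and arrest, the boolean type of arrest, and the count per crime type are guesses based on the public Chicago crime dataset and the Business Analysis section, not confirmed details of the script.

```python
# Assumed columns: primary_type (crime type) and arrest (boolean).
UNRESOLVED_CRIMES_QUERY = """
SELECT primary_type, COUNT(*) AS open_cases
FROM {table}
WHERE arrest = false
GROUP BY primary_type
ORDER BY open_cases DESC
"""


def build_query(table):
    """Fill the table name into the query for crimes with no arrest yet."""
    return UNRESOLVED_CRIMES_QUERY.format(table=table).strip()


def run_query(spark, table):
    """Run the query with an existing SparkSession and return the result DataFrame."""
    return spark.sql(build_query(table))
```

The resulting DataFrame can be written to S3 in the same way as the ingested snapshot. If arrest is stored as text rather than a boolean, the WHERE clause would need to compare against the stored string instead.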

bootstrap_actions.sh

This script is run by the EMR bootstrap actions.

Description

It installs the packages and dependencies on the cluster that the Spark scripts need in order to run.

Deployment

  • This bootstrap script is uploaded manually to an S3 bucket.
  • The Airflow DAG references this S3 location in its bootstrap actions so that EMR knows where to find the script.

Business Analysis

The final processed table contains the crime type details for all crimes for which no arrest has been made yet. This analysis can be viewed in Athena and has also been imported into QuickSight SPICE to explore the different types of crime and compare them.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph: data analytics, data engineering, business intelligence, ML, big data, and cloud.