Challenge proposed by IGTI in its Cloud Data Engineer bootcamp

Last update: Jan 23, 2022

Related tags

Data Analysis igti-desafio-4-cde

Overview

Module 4 Challenge - Cloud Data Engineer Bootcamp - IGTI

Objectives

- Create the infrastructure as code
- Using a Kubernetes cluster on Azure:
  - Ingest the Enade 2017 data with Python into the data lake on Azure
  - Transform the data from the bronze layer to the silver layer using the Delta format
  - Enrich the data from the silver layer to the gold layer using the Delta format
- Use Azure Synapse Serverless SQL Pool to serve the data
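As an illustration of the bronze, silver and gold layering listed above, the sketch below builds ADLS Gen2 `abfss://` URIs for one dataset in each layer. The storage account name, container name and dataset path are hypothetical placeholders and are not taken from this repository; the actual layout is defined by the scripts and jobs in the repo.

```python
# Sketch only: the account, container and dataset names are hypothetical.
LAYERS = ("bronze", "silver", "gold")


def layer_uri(account: str, container: str, layer: str, dataset: str) -> str:
    """Return the abfss:// URI of a dataset inside one layer of the lake."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{layer}/{dataset}"


if __name__ == "__main__":
    # Print one URI per layer for a hypothetical Enade 2017 dataset.
    for layer in LAYERS:
        print(layer_uri("mylakeaccount", "datalake", layer, "enade2017"))
```

Keeping the layer name as the first path segment is only one possible convention; a separate container per layer would work just as well.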

Architecture

Steps

Create the infrastructure

source infra/00-variables

bash infra/01-create-rg.sh

bash infra/02-create-cluster-k8s.sh

bash infra/03-create-lake.sh

bash infra/04-create-synapse.sh

bash infra/05-access-assignments.sh

Prepare k8s

Download the kubeconfig file

bash infra/02-get-kubeconfig.sh

To shorten the commands, use an alias

alias k=kubectl

Create the namespaces

k create namespace processing
k create namespace ingestion

Create the Service Account and Role Binding

k apply -f k8s/crb-spark.yaml

Create the secrets

k create secret generic azure-service-account --from-env-file=.env --namespace processing
k create secret generic azure-service-account --from-env-file=.env --namespace ingestion
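Because both secrets are built from the same `.env` file, it can help to check that file before running the commands above. The sketch below parses `KEY=VALUE` lines and reports the key names only. It is a simplified check that is deliberately stricter than kubectl (it rejects lines without `=`) and does not claim to reproduce kubectl's own parser. The helper name `parse_env_file` is hypothetical.

```python
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    entries: dict[str, str] = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            # Stricter than kubectl on purpose: every entry must carry a value.
            raise ValueError(f"line {number}: expected KEY=VALUE, got {raw!r}")
        entries[key.strip()] = value
    return entries


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        # Only key names are printed, so secret values never reach the terminal.
        print("keys found:", sorted(parse_env_file(env_path.read_text())))
```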

Install the Spark Operator

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator

helm repo update

helm install spark spark-operator/spark-operator --set image.tag=v1beta2-1.2.3-3.1.1 --namespace processing

Ingestion app

Ingestion Image

docker build ingestion -f ingestion/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4-ingestion --network=host

docker push otaciliopsf/cde-bootcamp:desafio-mod4-ingestion

Apply ingestion job

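# first, give the pod a new unique name
# (write to a temp file: redirecting straight back into the input file can truncate it before it is read)
cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))"\
> k8s/ingestion-job.yaml.tmp && mv k8s/ingestion-job.yaml.tmp k8s/ingestion-job.yaml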

k apply -f k8s/ingestion-job.yaml
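The renaming one-liner above swaps the last 8 characters of `metadata.name` for the first 8 characters of a new UUID, so each run creates a differently named resource. The same logic, written as a standalone function for readability, might look like the sketch below. The function name and the example manifest name are hypothetical; the real name lives in `k8s/ingestion-job.yaml`.

```python
import uuid

SUFFIX_LEN = 8  # the first group of a UUID string is exactly 8 hex characters


def with_fresh_suffix(name: str) -> str:
    """Replace the last 8 characters of `name` with a new random hex fragment."""
    if len(name) <= SUFFIX_LEN:
        raise ValueError("name is too short to carry an 8-character suffix")
    return name[:-SUFFIX_LEN] + str(uuid.uuid4())[:SUFFIX_LEN]


if __name__ == "__main__":
    # "ingestion-job-00000000" is a hypothetical manifest name.
    print(with_fresh_suffix("ingestion-job-00000000"))
```

This assumes the manifest name already ends in an 8-character placeholder or suffix; otherwise the slice would cut into the meaningful part of the name.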

Logs

ING_POD_NAME=$(cat k8s/ingestion-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")

k logs $ING_POD_NAME -n ingestion --follow

Spark

Build the job image

docker build spark -f spark/Dockerfile -t otaciliopsf/cde-bootcamp:desafio-mod4

docker push otaciliopsf/cde-bootcamp:desafio-mod4

Apply job

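# first, give the Spark Application a new unique name
# (write to a temp file: redirecting straight back into the input file can truncate it before it is read)
cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);y['metadata']['name']=y['metadata']['name'][:-8]+str(uuid.uuid4())[:8];print(yaml.dump(y))"\
> k8s/spark-job.yaml.tmp && mv k8s/spark-job.yaml.tmp k8s/spark-job.yaml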

k apply -f k8s/spark-job.yaml

Logs

SPARK_APP_NAME=$(cat k8s/spark-job.yaml |\
python3 -c "import sys,yaml,uuid;y=yaml.safe_load(sys.stdin);print(y['metadata']['name'])")'-driver'

k logs $SPARK_APP_NAME -n processing --follow

Azure Synapse Serverless SQL Pool

Access the Synapse workspace through the generated link

bash infra/04-get-workspace-url.sh

To start using it, follow these steps

Run the contents of the create-synapse-view.sql script in the Synapse workspace to create the view over the table in the lake

Done. Synapse is now ready to receive queries.

Cleaning up the resources

bash infra/99-delete-service-principal.sh

bash infra/99-delete-rg.sh

Conclusion

By following the steps above, it is possible to run queries directly against the gold layer of the Delta lake using Synapse.

Owner

Otacilio Filho
