Introduction

This repository shows how to integrate Zeppelin with Airflow. The philosophy behind the integration is to make the transition from the development stage to the production stage as smooth as possible.
Zeppelin is good at data pipeline development (Spark, Flink, Hive, Python, Shell, etc.), while Airflow is the de facto standard for job orchestration.

How to run it

Step 1. Initialize the environment.

Run the following commands to initialize the environment. They do two things:

Download Spark, which is used by Zeppelin
Download the Zeppelin Airflow plugins

git clone https://github.com/zjffdu/zeppelin_airflow.git
cd zeppelin_airflow
./init.sh

Step 2. Start Zeppelin + Airflow via docker-compose

docker-compose -f docker-compose-LocalExecutor.yml up -d

Step 3. Use Zeppelin + Airflow

Open http://localhost:8085 for Zeppelin and http://localhost:8083 for Airflow.
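The containers can take a little while to come up after docker-compose returns, so it may help to wait until both ports answer before opening them. The following is a minimal sketch, not part of this repository: the probe and wait_for helpers are hypothetical names, and the script assumes curl is installed on the host.

```shell
# Hypothetical helper: probe one URL. Assumes curl is installed.
probe() {
  curl -fsS -o /dev/null "$1"
}

# Poll a URL until it answers or the attempts run out.
# Usage: wait_for URL [TRIES] [DELAY_SECONDS]
wait_for() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if probe "$url" 2>/dev/null; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not reachable: $url" >&2
  return 1
}

# Example (ports taken from step 3):
#   wait_for http://localhost:8085 && wait_for http://localhost:8083
```

Keeping the probe in its own function makes it easy to swap curl for another tool such as wget if curl is not available.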

Airflow contains one DAG, zeppelin_example. This DAG runs three Zeppelin notes:

Python Tutorial/01. IPython Basics
Spark Tutorial/02. Spark Basics Features
Spark Tutorial/03. Spark SQL (PySpark)

Once you enable the DAG, Airflow runs these Zeppelin notes.

Zeppelin does not run these notes directly; instead, it clones each note and runs the clone.
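To illustrate the clone-then-run idea, here is a rough sketch that uses Zeppelin's notebook REST API from the shell. It is an assumption about how such a flow could look, not the plugin's actual implementation: http_post, note_id_from_response and clone_and_run are hypothetical helper names, the note ID in the example is a placeholder, and the script assumes curl is installed and that Zeppelin is reachable without authentication.

```shell
# Base URL of Zeppelin, taken from step 3. Override via the environment if needed.
ZEPPELIN_URL=${ZEPPELIN_URL:-http://localhost:8085}

# Hypothetical helper: POST a JSON body to a URL and print the response.
http_post() {
  curl -fsS -X POST -H 'Content-Type: application/json' -d "$2" "$1"
}

# Read a clone response on stdin and print its "body" field,
# which holds the ID of the newly created note.
note_id_from_response() {
  sed -n 's/.*"body"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Clone a note, run all paragraphs of the clone, and print the clone's ID.
# Usage: clone_and_run SOURCE_NOTE_ID NEW_NOTE_NAME
# Note: NEW_NOTE_NAME is pasted into JSON as-is, so it must not contain quotes.
clone_and_run() {
  src=$1
  name=$2
  resp=$(http_post "$ZEPPELIN_URL/api/notebook/$src" "{\"name\": \"$name\"}") || return 1
  clone=$(printf '%s\n' "$resp" | note_id_from_response)
  [ -n "$clone" ] || return 1
  http_post "$ZEPPELIN_URL/api/notebook/job/$clone" '{}' >/dev/null || return 1
  echo "$clone"
}

# Example (SOURCE_NOTE_ID is a placeholder for a real note ID):
#   clone_and_run SOURCE_NOTE_ID "copy for scheduled run"
```

Running a clone leaves the original note untouched, so a scheduled run does not overwrite the output of the note you are still editing.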

More features are coming soon; stay tuned.

Owner

Jeff Zhang
