DAT4 - General Assembly's Data Science course in Washington, DC

Related tags

Deep LearningDAT4
Overview

DAT4 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15).

Instructors: Sinan Ozdemir and Kevin Markham (Data School blog, email newsletter, YouTube channel)

Teaching Assistant: Brandon Burroughs

Office hours: 1-3pm on Saturday and Sunday (Starbucks at 15th & K), 5:15-6:30pm on Monday (GA)

Course Project information

Monday Wednesday
12/15: Introduction 12/17: Python
12/22: Getting Data 12/24: No Class
12/29: No Class 12/31: No Class
1/5: Git and GitHub 1/7: Pandas
Milestone: Question and Data Set
1/12: Numpy, Machine Learning, KNN 1/14: scikit-learn, Model Evaluation Procedures
1/19: No Class 1/21: Linear Regression
1/26: Logistic Regression,
Preview of Other Models
1/28: Model Evaluation Metrics
Milestone: Data Exploration and Analysis Plan
2/2: Working a Data Problem 2/4: Clustering and Visualization
Milestone: Deadline for Topic Changes
2/9: Naive Bayes 2/11: Natural Language Processing
2/16: No Class 2/18: Decision Trees
Milestone: First Draft
2/23: Ensembling 2/25: Databases and MapReduce
3/2: Recommenders 3/4: Advanced scikit-learn
Milestone: Second Draft (Optional)
3/9: Course Review 3/11: Project Presentations
3/16: Project Presentations

Installation and Setup

  • Install the Anaconda distribution of Python 2.7x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "DAT4 team" and add your photo!

Class 1: Introduction

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Data science overview (slides)
  • Tools: check for proper setup of Anaconda, overview of Slack

Homework:

  • Resolve any installation issues before next class.

Optional:

Class 2: Python

Homework:

Optional:

Resources:

Class 3: Getting Data

Homework:

  • Think about your project question, and start looking for data that will help you to answer your question.
  • Prepare for our next class on Git and GitHub:
    • You'll need to know some command line basics, so please work through GA's excellent command line tutorial and then take this brief quiz.
    • Check for proper setup of Git by running git clone https://github.com/justmarkham/DAT-project-examples.git. If that doesn't work, you probably need to install Git.
    • Create a GitHub account. (You don't need to download anything from GitHub.)

Optional:

  • If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!

Resources:

Class 4: Git and GitHub

  • Special guest: Nick DePrey presenting his class project from DAT2
  • Git and GitHub (slides)

Homework:

  • Project milestone: Submit your question and data set to your folder in DAT4-students before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)

Optional:

  • Clone this repo (DAT4) for easy access to the course files.

Resources:

Class 5: Pandas

Homework:

Optional:

Resources:

  • For more on Pandas plotting, read the visualization page from the official Pandas documentation.
  • To learn how to customize your plots further, browse through this notebook on matplotlib.
  • To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and Columbia's Data Mining class has an excellent slide deck.

Class 6: Numpy, Machine Learning, KNN

  • Numpy (code)
  • "Human learning" with iris data (code, solution)
  • Machine Learning and K-Nearest Neighbors (slides)

Homework:

  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

Class 7: scikit-learn, Model Evaluation Procedures

Homework:

Optional:

  • Practice what we learned in class today!
    • If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
    • If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
    • Either way, you can submit your commented code to DAT4-students, and we'll give you feedback.

Resources:

Class 8: Linear Regression

Homework:

Optional:

  • Similar to last class, your optional exercise is to practice what we have been learning in class, either on your project data or on another dataset.

Resources:

Class 9: Logistic Regression, Preview of Other Models

Resources:

Class 10: Model Evaluation Metrics

  • Finishing model evaluation procedures (slides, code)
    • Review of test set approach
    • Cross-validation
  • Model evaluation metrics (slides)
    • Regression:
      • Root Mean Squared Error (code)
    • Classification:

Homework:

Optional:

Resources:

Class 11: Working a Data Problem

  • Today we will work on a real world data problem! Our data is stock data over 7 months of a fictional company ZYX including twitter sentiment, volume and stock price. Our goal is to create a predictive model that predicts forward returns.

  • Project overview (slides)

    • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...

Class 12: Clustering and Visualization

  • The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
  • The code for today focuses on two main examples:
    • We will investigate simple clustering using the iris data set.
    • We will take a look at a harder example, using Pandora songs as data. See data.

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Monday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
    • Confusion matrix: Kevin's guide roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! Your rough draft is due in two weeks!

Resources:

Class 13: Naive Bayes

Resources:

Homework:

  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install textblob and pip install lda.

Class 14: Natural Language Processing

  • Overview of Natural Language Processing (slides)
  • Real World Examples
  • Natural Language Processing (code)
  • NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
  • Alternative: TextBlob

Resources:

Class 15: Decision Trees

Homework:

  • By next Wednesday (before class), review the project drafts of your two assigned peers according to these guidelines. You should upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT4-students. If your last name is Smith and you are reviewing Jones, you should name your file smith_reviews_jones.md.

Resources:

Installing Graphviz (optional):

  • Mac:
  • Windows:
    • Download and install MSI file
    • Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin

Class 16: Ensembling

Resources:

Class 17: Databases and MapReduce

Resources:

Class 18: Recommenders

  • Recommendation Engines slides
  • Recommendation Engine Example code

Resources:

Class 19: Advanced scikit-learn

Homework:

Resources:

Class 20: Course Review

Resources:

Class 21: Project Presentations

Class 22: Project Presentations

Owner
Kevin Markham
Founder of Data School
Kevin Markham
A basic neural network for image segmentation.

Unet_erythema_detection A basic neural network for image segmentation. 前期准备 1.在logs文件夹中下载h5权重文件,百度网盘链接在logs文件夹中 2.将所有原图 放置在“/dataset_1/JPEGImages/”文件夹

1 Jan 16, 2022
Source code for CVPR 2021 paper "Riggable 3D Face Reconstruction via In-Network Optimization"

Riggable 3D Face Reconstruction via In-Network Optimization Source code for CVPR 2021 paper "Riggable 3D Face Reconstruction via In-Network Optimizati

130 Jan 02, 2023
一个运行在 𝐞𝐥𝐞𝐜𝐕𝟐𝐏 或 𝐪𝐢𝐧𝐠𝐥𝐨𝐧𝐠 等定时面板的签到项目

定时面板上的签到盒 一个运行在 𝐞𝐥𝐞𝐜𝐕𝟐𝐏 或 𝐪𝐢𝐧𝐠𝐥𝐨𝐧𝐠 等定时面板的签到项目 𝐞𝐥𝐞𝐜𝐕𝟐𝐏 𝐪𝐢𝐧𝐠𝐥𝐨𝐧𝐠 特别声明 本仓库发布的脚本及其中涉及的任何解锁和解密分析脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合

Leon 1.1k Dec 30, 2022
ROS-UGV-Control-Interface - Control interface which can be used in any UGV

ROS-UGV-Control-Interface Cam Closed: Cam Opened:

Ahmet Fatih Akcan 1 Nov 04, 2022
MarcoPolo is a clustering-free approach to the exploration of bimodally expressed genes along with group information in single-cell RNA-seq data

MarcoPolo is a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering Overview MarcoPolo

Chanwoo Kim 13 Dec 18, 2022
LIMEcraft: Handcrafted superpixel selectionand inspection for Visual eXplanations

LIMEcraft LIMEcraft: Handcrafted superpixel selectionand inspection for Visual eXplanations The LIMEcraft algorithm is an explanatory method based on

MI^2 DataLab 4 Aug 01, 2022
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

============================================================================================================ `MILA will stop developing Theano https:

9.6k Jan 06, 2023
Algorithmic Trading using RNN

Deep-Trading This an implementation adapted from Rachnog Neural networks for algorithmic trading. Part One — Simple time series forecasting and this c

Hazem Nomer 29 Sep 04, 2022
This repository is an unoffical PyTorch implementation of Medical segmentation in 3D and 2D.

Pytorch Medical Segmentation Read Chinese Introduction:Here! Recent Updates 2021.1.8 The train and test codes are released. 2021.2.6 A bug in dice was

EasyCV-Ellis 618 Dec 27, 2022
SimplEx - Explaining Latent Representations with a Corpus of Examples

SimplEx - Explaining Latent Representations with a Corpus of Examples Code Author: Jonathan Crabbé ( Jonathan Crabbé 14 Dec 15, 2022

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

ood-text-emnlp Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them" Files fine_tune.py is used to finetune the GPT-2 mo

Udit Arora 19 Oct 28, 2022
ZSL-KG is a general-purpose zero-shot learning framework with a novel transformer graph convolutional network (TrGCN) to learn class representation from common sense knowledge graphs.

ZSL-KG is a general-purpose zero-shot learning framework with a novel transformer graph convolutional network (TrGCN) to learn class representa

Bats Research 94 Nov 21, 2022
A library that allows for inference on probabilistic models

Bean Machine Overview Bean Machine is a probabilistic programming language for inference over statistical models written in the Python language using

Meta Research 234 Dec 29, 2022
Python parser for DTED data.

DTED Parser This is a package written in pure python (with help from numpy) to parse and investigate Digital Terrain Elevation Data (DTED) files. This

Ben Bonenfant 12 Dec 18, 2022
A synthetic texture-invariant dataset for object detection of UAVs

A synthetic dataset for object detection of UAVs This repository contains a synthetic datasets accompanying the paper Sim2Air - Synthetic aerial datas

LARICS Lab 10 Aug 13, 2022
Lightwood is Legos for Machine Learning.

Lightwood is like Legos for Machine Learning. A Pytorch based framework that breaks down machine learning problems into smaller blocks that can be glu

MindsDB Inc 312 Jan 08, 2023
Free-duolingo-plus - Duolingo account creator that uses your invite code to get you free duolingo plus

free-duolingo-plus duolingo account creator that uses your invite code to get yo

1 Jan 06, 2022
Chainer Implementation of Semantic Segmentation using Adversarial Networks

Semantic Segmentation using Adversarial Networks Requirements Chainer (1.23.0) Differences Use of FCN-VGG16 instead of Dilated8 as Segmentor. Caution

Taiki Oyama 99 Jun 28, 2022