t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology.

Related tags

Data Analysistreesne
Overview

tree-SNE

t-SNE and hierarchical clustering are popular methods of exploratory data analysis, particularly in biology. Building on recent advances in speeding up t-SNE and obtaining finer-grained structure, we combine the two to create tree-SNE, a hierarchical clustering and visualization algorithm based on stacked one-dimensional t-SNE embeddings. We also introduce alpha-clustering, which recommends the optimal cluster assignment, without foreknowledge of the number of clusters, based off of the cluster stability across multiple scales. We demonstrate the effectiveness of tree-SNE and alpha-clustering on images of handwritten digits, mass cytometry (CyTOF) data from blood cells, and single-cell RNA-sequencing (scRNA-seq) data from retinal cells. Furthermore, to demonstrate the validity of the visualization, we use alpha-clustering to obtain unsupervised clustering results competitive with the state of the art on several image data sets.

ArXiv preprint: https://arxiv.org/abs/2002.05687

Prerequisites

Install Fit-SNE from https://github.com/KlugerLab/FIt-SNE and add the FIt-SNE directory that you cloned to your PYTHONPATH environmental variable. This lets tree-SNE access the Python file used to interface with FIt-SNE. This can be done one of several ways:

  • run export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE in your terminal before running your Python script using tree-SNE
  • add export PYTHONPATH="$PYTHONPATH":/path/to/FIt-SNE to your .bash_profile
  • add the line import sys; sys.path.append('/path/to/FIt-SNE/') to your Python script before calling import tree_sne

Also make sure to have Numpy, Scipy, Sklearn, and Matplotlib installed.

We've tested with Python 3.6+.

Test/Example

Run example.py to make sure everything is set up right. This will run tree-SNE on the USPS handwritten digit dataset, run alpha-clustering, calculate the NMI, and display the tree. You can refer to this file for calling conventions. Note the top line adding FIt-SNE to the Python path.

Sample Usage

Assuming you have a 2D Numpy array containing your data in a variable X. To build a tree-SNE plot with 30 layers, cluster on each layer, and determine the optimal clustering via alpha-clustering (note does not require preknowledge of the number of clusters):

from tree_sne import TreeSNE

tree = TreeSNE()
embeddings, layer_clusters, best_clusters = tree.fit(X, n_layers = 30)

The embeddings variable will contain each data point's embedding in each layer, with embeddings.shape of (n_points, n_layers, n_features). For now, n_features will always be 1, as we haven't yet implemented stacked 2D t-SNE embeddings. The variable layer_clusters will contain cluster assignments for each point in each layer of the embedding, and best_clusters will contain optimal cluster assignments for the data.

To display the tree using our code with cluster labels, run:

from display_tree import display_tree_mnist
import numpy as np

display_tree_mnist(embeddings, true_labels = best_clusters, legend_labels = list(np.unique(best_clusters)), distinct = True)

Alternatively, some labels you provide can be used instead of best_clusters. We realize this is messy but until we refactor this is what we have. We're sorry. You don't have to use our display code if you don't want to, and we'll improve it soon.

If your data has more clusters, reduce the conservativeness parameter to TreeSNE. Typical values range from 1 to 2. It should never drop below 1 according to our theory motivation for its implementation, and we've only had to decrease it when trying to find 100 clusters, in which case we set it to 1.3. n_layers and conservativeness are the only two parameters that we think users may want to adjust, at least for the time being. Once we've refactored we'll write more documentation. Note that conservativeness only effects alpha-clustering and does not actually change the tree-SNE embedding itself.

MNIST tree-SNE example plot

Authors

Acknowledgments

The authors thank Stefan Steinerberger for inspiration, support, and advice; George Linderman for enabling one-dimensional t-SNE with degrees of freedom < 1 in the FIt-SNE package; Scott Gigante for data pre-processing and helpful discussions of visualizations and alpha-clustering; Smita Krishnaswamy for encouragement and feedback; and Ariel Jaffe for discussing the Nyström method and its relationship to subsampled spectral clustering.

Owner
Isaac Robinson
Yale computer science and math major interested in entrepreneurship
Isaac Robinson
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
Analyze the Gravitational wave data stored at LIGO/VIRGO observatories

Gravitational-Wave-Analysis This project showcases how to analyze the Gravitational wave data stored at LIGO/VIRGO observatories, using Python program

1 Jan 23, 2022
Basis Set Format Converter

Basis Set Format Converter Repository for the online tool that allows you to enter a basis set in the form of text input for a variety of Quantum Chem

Manas Sharma 3 Jun 27, 2022
Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Spark-DeltaLake-Demo Reliable, Scalable Machine Learning (2022) This project was completed in an attempt to become better acquainted with the latest b

8 Mar 21, 2022
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu

Sean Breckenridge 27 Dec 28, 2022
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
TheMachineScraper 🐱‍👤 is an Information Grabber built for Machine Analysis

TheMachineScraper 🐱‍👤 is a tool made purely for analysing machine data for any reason.

doop 5 Dec 01, 2022
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022
Top 50 best selling books on amazon

It's a dashboard that shows the detailed information about each book in the top 50 best selling books on amazon over the last ten years

Nahla Tarek 1 Nov 18, 2021
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

PandasVault ⁠— Advanced Pandas Functions and Code Snippets The only Pandas utility package you would ever need. It has no exotic external dependencies

Derek Snow 374 Jan 07, 2023
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
ICLR 2022 Paper submission trend analysis

Visualize ICLR 2022 OpenReview Data

Jintang Li 75 Dec 06, 2022
Pizza Orders Data Pipeline Usecase Solved by SQL, Sqoop, HDFS, Hive, Airflow.

PizzaOrders_DataPipeline There is a Tony who is owning a New Pizza shop. He knew that pizza alone was not going to help him get seed funding to expand

Melwin Varghese P 4 Jun 05, 2022
MotorcycleParts DataAnalysis python

We work with the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.

NASEEM A P 1 Jan 12, 2022
:truck: Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

To launch a live notebook server to test optimus using binder or Colab, click on one of the following badges: Optimus is the missing framework to prof

Iron 1.3k Dec 30, 2022
PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Emmanuel Boateng Sifah 1 Jan 19, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022
📊 Python Flask game that consolidates data from Nasdaq, allowing the user to practice buying and selling stocks.

Web Trader Web Trader is a trading website that consolidates data from Nasdaq, allowing the user to search up the ticker symbol and price of any stock

Paulina Khew 21 Aug 30, 2022