Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Overview

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon k-means by addressing its main shortcoming, namely, the need to predetermine the number of clusters, K. Typically, K is found in the following manner:

  1. settle on some metric,
  2. evaluate that metric at multiple values of K,
  3. use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).
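
For reference, the conventional workflow listed above might look like the following minimal sketch. It assumes scikit-learn, uses the within-cluster sum of squares (`inertia_`) as the metric, and runs on synthetic data from `make_blobs`. The helper name `elbow_k` and the `min_rel_drop` threshold are illustrative assumptions, not part of k-means or of this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, for illustration only.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

def elbow_k(X, k_max=8, min_rel_drop=0.3):
    """Greedy elbow rule: return the first K whose successor K+1
    lowers the inertia by less than min_rel_drop (relative)."""
    # Steps 1 and 2: evaluate the chosen metric at K = 1..k_max.
    inertias = [
        KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_
        for k in range(1, k_max + 1)
    ]
    # Step 3: stop at the first "bend" in the curve.
    for idx in range(len(inertias) - 1):
        rel_drop = (inertias[idx] - inertias[idx + 1]) / inertias[idx]
        if rel_drop < min_rel_drop:
            return idx + 1, inertias  # inertias[idx] belongs to K = idx + 1
    return k_max, inertias

k_elbow, inertias = elbow_k(X)
print(k_elbow, [round(v, 1) for v in inertias])
```

The answer depends on the arbitrary threshold, which is one reason a greedy stopping rule can be fragile.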

There must be a better way. Below, we detail three improvements that the Buckshot++ algorithm makes to k-means.

  1. Not all metrics are created equal. Since k-means doesn't prescribe which metric to use for finding K, we compared several commonly implemented metrics and found that some of them are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
  2. In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
  3. Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.

Details of the Buckshot++ algorithm

ALGORITHM: Buckshot++
INPUTS: population of N vectors
B := number of bootstrap samples
F := max number of clusters to try
M := cluster quality metric
OUTPUT: the optimal K for kmeans

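Take B bootstrap samples, each containing a fraction 1/B of the population (N/B vectors).
for each bootstrap sample do
  for each k from 2 to F do
    Compute kmeans with k centers on the sample.
    Compute the metric M on the resulting clusters.
  Collect the metric values into one vector (indexed by k) for this sample.
Compute the centroid (element-wise mean) of the B metric vectors.
Return the k at the argmax of the centroid vector.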
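
The pseudocode above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the buckshotpp module: the function name `buckshot_pp_sketch` is hypothetical, the metric M is fixed to scikit-learn's `silhouette_score`, each sample is assumed to hold about N/B vectors drawn with replacement, and the loop over k is assumed to run once per sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def buckshot_pp_sketch(X, B=5, F=8, seed=0):
    """Hypothetical helper: return (best_k, mean_scores) for k = 2..F."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = max(n // B, F + 1)  # assumption: each sample is ~N/B vectors
    ks = list(range(2, F + 1))
    scores = np.empty((B, len(ks)))   # one metric vector per bootstrap sample
    for b in range(B):
        idx = rng.choice(n, size=sample_size, replace=True)  # bootstrap draw
        sample = X[idx]
        for j, k in enumerate(ks):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(sample)
            scores[b, j] = silhouette_score(sample, labels)
    mean_scores = scores.mean(axis=0)  # "centroid" of the metric vectors
    best_k = ks[int(np.argmax(mean_scores))]
    return best_k, mean_scores

# Synthetic data with three well-separated groups, for illustration only.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)
best_k, mean_scores = buckshot_pp_sketch(X)
print(best_k)
```

Averaging the per-sample score vectors before taking the argmax is what lets a single unrepresentative sample be outvoted by the others.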

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which essentially finds cluster centers by performing hierarchical clustering on a sample and then runs k-means with those cluster centers as the initial centers. Hierarchical clustering has relatively high time complexity, which is why Buckshot performs it only on a sample. The key difference between hierarchical clustering and k-means is that the former is more deterministic/stable but less scalable than the latter, as the next table shows.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']}
                   , index=['Computational Complexity', 'Sources of Instability'])
tbl
                          k-means                                       hierarchical
Computational Complexity  O(N * k * d * i)                              O(N^2 * logN)
Sources of Instability    random initial means; local minimum; outlier  outlier

Hierarchical clustering's higher time complexity means that, for large inputs, running k-means multiple times can still be faster than running hierarchical clustering just once. The Buckshot algorithm runs hierarchical clustering just once on a small sample in order to initialize cluster centers for k-means. Because O(N^2 * logN) grows quickly, the sample must be small for the computation to remain feasible. A key critique of Buckshot, however, is that a small sample can fail to reveal the right structure.
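
To get a feel for the gap, the two expressions from the table can be evaluated for some concrete sizes. This is only a rough sketch: the values of N, k, d and i are made-up assumptions, and the asymptotic expressions drop constant factors, so the ratio is an order-of-magnitude indication rather than a runtime prediction.

```python
import math

def kmeans_cost(n, k, d, i):
    """Operation count suggested by O(N * k * d * i)."""
    return n * k * d * i

def hierarchical_cost(n):
    """Operation count suggested by O(N^2 * logN), using log base 2."""
    return n ** 2 * math.log2(n)

# Illustrative sizes (assumptions): 100,000 vectors, 20 clusters,
# 100 dimensions, 30 iterations.
n, k, d, i = 100_000, 20, 100, 30
one_kmeans = kmeans_cost(n, k, d, i)
one_hier = hierarchical_cost(n)
runs = one_hier / one_kmeans  # k-means runs that fit into one hierarchical run

print(f"k-means: {one_kmeans:.2e}, hierarchical: {one_hier:.2e}, ratio: {runs:.1f}")
```

Under these assumptions, roughly 27 full k-means runs cost about as much as a single hierarchical run.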

Buckshot++'s key innovation lies in the bootstrap sampling step, in which B samples are drawn, each a fraction 1/B of the population. While Buckshot runs hierarchical clustering on a single sample, Buckshot++ runs k-means on multiple bootstrap samples. Running k-means many times can still finish sooner than running hierarchical clustering just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

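[Figure: silhouette score versus K, with one green curve per bootstrap sample and a red curve for their average]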

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Remember the sources of instability for k-means listed in the table above? Outliers are one of them. The concept of an outlier has a somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that's far from other observations distance-wise. In clustering, a far-away observation is its own well-separated cluster. Our interpretation is that "rare" is the operative word, and that outliers are singleton clusters that exert undue influence on the formation of other clusters. Look at how bagging led to a more stable estimate of the optimal number of clusters in the graph above.
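
A quick calculation illustrates why small bootstrap samples tend to leave a given outlier out. The sketch below assumes each sample consists of N/B draws with replacement; the function name and the sizes are illustrative only.

```python
def inclusion_probability(n, m):
    """P(a given point appears at least once in m draws
    with replacement from a population of n points)."""
    return 1.0 - (1.0 - 1.0 / n) ** m

n = 1000  # illustrative population size
full = inclusion_probability(n, n)         # classic bootstrap of size N
small = inclusion_probability(n, n // 10)  # sample of size N/B with B = 10

print(f"size N: {full:.3f}, size N/10: {small:.3f}")
```

With these numbers a single rare point lands in a full-size bootstrap sample about 63% of the time, but in a sample of size N/10 only about 9.5% of the time, so most of the per-sample curves are computed without it.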

Not all metrics are created equal

Two of the internal clustering metrics implemented in scikit-learn are the Silhouette Coefficient and the Calinski-Harabasz criterion. Comparing the Silhouette plotted above with the Calinski plotted below, it's clear that Calinski is far more extreme, perhaps implausibly extreme.

plot_mult_samples(metrics_byK, 'calinski')

[Figure: Calinski-Harabasz score versus K across bootstrap samples]
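
Part of the contrast comes from the scales of the two metrics: the Silhouette Coefficient is bounded between -1 and 1, whereas the Calinski-Harabasz score is an unbounded ratio. The following minimal sketch computes both with scikit-learn on synthetic data (an assumption made for illustration; it does not use the news headlines dataset).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with three well-separated groups, for illustration only.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (silhouette_score(X, labels),          # bounded in [-1, 1]
                  calinski_harabasz_score(X, labels))   # unbounded above

for k, (sil, ch) in results.items():
    print(f"k={k}  silhouette={sil:.3f}  calinski_harabasz={ch:.1f}")
```

Because the Calinski-Harabasz score has no fixed upper bound, its curves can swing over a much wider range from one K or one sample to the next.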

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Adjusted Mutual Information (a widely used external metric) using the code below. The score is at most 1 for a perfect match and is close to 0 when the agreement is no better than chance. Using the K resulting from Buckshot++, we obtained an Adjusted Mutual Information of about 0.64, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(n_clusters=20).fit(X)  # K comes from the inflection point of the silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_) 
mutual_info
0.6435601965984835
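
Two properties make this score convenient for comparing a clustering against ground truth: it ignores how the clusters happen to be numbered, and it drops below 1 as soon as the partitions disagree. The toy labels below are made up for illustration.

```python
from sklearn.metrics import adjusted_mutual_info_score

truth     = [0, 0, 0, 1, 1, 1, 2, 2, 2]
relabeled = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same partition, different cluster ids
partial   = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # one item assigned to the wrong group

ami_relabeled = adjusted_mutual_info_score(truth, relabeled)
ami_partial = adjusted_mutual_info_score(truth, partial)

print(ami_relabeled, ami_partial)
```

This is why the cluster ids produced by k-means can be compared directly against the "STORY" field without first matching cluster numbers to story ids.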

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results
predictedCluster document
25 0 SAC Capital Starts Anew as Point72
50 0 Zebra Technologies to Acquire Enterprise Busin...
23 0 Fine Tuning: Good Wife just gets better
21 0 Boulder's Wealth May Be A Factor For Lowest Ob...
6 0 Power restored to nuclear plant in Waterford, ...
73 0 Electricity out as Millstone shifts to diesel
59 1 Twitter's head of media Chloe Sladden steps do...
28 1 Twitter's revolving door: media head Chloe Sla...
12 1 Twitter Exec Exodus Continues with Media Chief...
67 2 Sony Xperia C3 arrives with 5MP selfie camera,...
30 2 Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45 2 Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90 2 Sony Xperia Z4 Concept Emerges as Fan Imagines...
78 2 If you hate the word 'selfie' look away now, t...
71 3 Twitter Executive Quits Amid Stalling Growth
47 3 Twitter COO quits, signalling management shake-up
52 3 Twitter Loses a Powerful Executive
31 3 Second Twitter executive quits hours after Row...
20 3 Twitter COO resigns as growth lags
61 3 Twitter COO Rowghani resigns amid lacklustre g...
57 4 'Goodbye Twitter' COO Ali Rowghani, says bye t...
69 4 Twitter chief operating officer resigns as use...
66 4 UPDATE 3-Twitter chief operating officer resig...
86 4 Twitter chief operating officer Ali Rowghani h...
76 4 Ali Rowghani, Twitter's COO, resigns after mon...
49 4 Twitter COO Ali Rowghani Just Announced Via Tw...
13 4 Twitter COO Ali Rowghani Exits
35 4 Second Twitter exec resigns with goodbye tweet...
39 5 Why almost everything you've been told about u...
77 5 Why Fargo Works So Well as a TV Show
0 6 'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4 6 'Mad Men' end in sight for Weiner
36 6 Weiner reflects on the beginning of the end of...
42 7 Giant mystery crater in Siberia has scientists...
85 7 Mysterious giant crater in the earth discovere...
60 7 Massive Crater Discovered in Siberia
92 7 Massive mystery crater at 'end of the world'
16 7 Mysterious crater in Siberia spawns wild Inter...
43 8 Inflation rise stalls wage hopes in the UK
82 8 The Least Obese City in the Country
19 8 Real wages could resume fall as "Easter effect...
55 8 UK Inflation Rise To 1.8% Delays Real Wage Ris...
26 8 Virginia's Governor Challenges Abortion Clinic...
51 8 BREAKING NEWS: Transport costs lead to hike in...
8 8 Cable prices climb 4 times faster than inflati...
79 9 Despite Safety Issues, GM's Sales Still Increa...
17 9 Chrysler Group LLC reports June 2014 US sales ...
40 9 GM June Sales Up 9 Percent, Best June Since 2007
87 9 Ford sales fall, GM barely even; Jeep powers C...
18 10 Gov. McAuliffe Makes Health Announcements
48 10 Microsoft wants Windows XP dead and has announ...
74 10 McAuliffe puts focus on women's health
7 11 Sony makes duckfacing official with Xperia C3,...
54 11 Sony to announce 'Selfie' phone on July 8th wi...
27 11 Sony prepares to launch a smartphone that has ...
91 11 Sony Xperia C3 launches as "world's best selfi...
88 11 Sony unveils Xperia C3 smartphone with LED fla...
11 11 Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44 12 UK CPI rises to 1.8% in April, core CPI hits 2%
75 12 Rising CO2 Levels Will Lower Nutritional Value...
1 12 Here's How Climate Change Will Make Food Less ...
81 12 Rising CO2 levels also make our food less nutr...
80 13 Nutrition in Crops Are Cut down Drastically by...
2 13 Rising carbon dioxide levels reduce nutrients ...
68 13 With carbon dioxide levels up, nutrients in cr...
64 14 Inflation back up: Modest rise to 1.8% in Apri...
83 14 US plants prepare for long-term nuclear waste ...
22 14 Nuclear Plant Operators Deal With Radioactive ...
32 14 US plants prepare long-term nuclear waste stor...
84 15 'Mad Men' takes off on its final flight
3 15 'Mad Men' mixology
5 15 'Mad Men': 7 things to know for Season 7
9 15 Mad Men - the (Blaxploitation) Movie
37 15 TV Review: Mad Men Season 7
46 15 'Mad Men': Season 7 Premiere Guide (Video)
70 15 10 Things You Never Knew About 'Mad Men'!
53 15 'Mad Men' Season 7 Spoilers: Everything We Kno...
72 15 Rich Sommer from AMC's 'Mad Men' Season Premiere
63 16 Fargo (FX) Season Finale 2014 "Morton's Fork"
56 16 Before 'Fargo's' season finale, a sequel (or p...
65 16 'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62 17 Google Glass headsets get new designs in colla...
41 17 Google's first fashionable Glass frames are de...
89 17 Google Glass Still Trying To Look Cool
34 17 Net-a-Porter Embraces Google Glass
15 18 Routine pelvic exams not recommended under new...
14 18 Doctors group nixes routine pelvic exams
38 18 Metro Detroit doctors wary of recommendation a...
10 18 Doctors against having frequent pelvic exams
58 19 Technology stocks falling for 2nd day in a row
24 19 UPDATE 5-JPMorgan profit weaker than expected ...
29 19 JPMorgan profit weaker than expected
33 19 Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

  • Accurate method of estimating the number of clusters (a clearly best Silhouette emerged every time, while typical elbow heuristic searches can be hit or miss).
  • Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
  • Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
Owner
John Jung
Senior Machine Learning Engineer