Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Overview

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python πŸ“Š

Last updated on January 30, 2022 by Thomas J. Nicoletti

I would like to preface this document by stating this is my second major project using Python. From my first project to now, I certainly improved upon my understanding of and proficiency with Python, though I still have a long journey ahead of me. I aim to keep learning more and more everyday, and hope this project provides some benefit to the greater applied social science community.

The purpose of this data mining script is to use random forest classification, in conjunction with factor analysis and other analytic techniques, to automatically yield feature importance metrics and related output for a driver analysis. Driver analysis quantifies the importance of independent variables (i.e., drivers) in predicting some outcome variable. Within this repository is a basic, simulated dataset created by me, containing five independent variables and one outcome variable. I am by no means an expert in simulating datasets, so I encourage everyone to use real-world data as a stress test for this statistical tool.

This tool will communicate with users using simple inputs via the Command Prompt. Once all mandatory and optional inputs are received, the analysis will run and send relevant information to the source folder; this potentially includes text files, images, and data files useful for model comprehension and validation, as well as statistically- and conceptually-informed decision-making. The most useful outputs will include the automatically generated feature importance plot and feature quadrant chart.

πŸ’» Installation and Preparation

Please note that excerpts of code provided below are examples based on the driver.py script. As a self-taught programmer, I suggest reading through my insights, mixing them with a quick Google search and your own experiences, and then delving into the script itself.

For this project, I used Python 3.9, the Microsoft Windows operating system, and Microsoft Excel. As such, these act as the prerequisites for utilizing this repository successfully without any additional troubleshooting. Going forward, please ensure everything you download or install for this project ends up in the correct location (e.g., the same source folder).

Use pip to install relevant packages to the proper source folder using the Command Prompt and correct PATH. For example:

pip install numpy
pip install pandas

Please be sure to install each of the following packages: easygui, matplotlib, numpy, pandas, seaborn, string, factor_analyzer, scipy, sklearn, and statsmodels. If required, use the first section of the script to determine lacking dependencies, and proceed accordingly.

πŸ“‘ Script Breakdown

The script begins by calling relevant libraries in Python, as well as defining Mahalanobis distance, which is used to identify multivariate outliers in a later step of this project. Additionally, the Command Prompt will read a simple set of instructions for the user, including important information regarding categorical features, the location of the outcome variable within the dataset, and a required revision for missing data. Furthermore, the script will allow the user to specify a random seed for easy replication of this driver analysis at a later date:

import easygui
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
...
def mahalanobis(x = None, data = None, cov = None):
	mu = x - np.mean(data)
    ...
	return mah.diagonal()
...
seed = int(input('Please enter your numerical random seed for replication purposes: '))
np.random.seed(seed)
text = open('random_seed.txt', 'w')

The script has an entire section dedicated to understanding your dataset, including a quick process for uploading your data file, removing missing data, adding an outlier status variable, determining the final sample size, classifying variables, and so on:

df = pd.read_csv(easygui.fileopenbox())
df.dropna(inplace = True)
df['Mahalanobis'] = mahalanobis(x = df, data = df.iloc[:, :len(df.columns)], cov = None)
df['PValue'] = 1 - chi2.cdf(df['Mahalanobis'], len(df.columns) - 1)
...
n = df.shape[0]
text = open('sample_size.txt', 'w')
...
x = df.iloc[:, :-1]
y = np.ravel(df.iloc[:, -1])
feat = df.columns[:-1]
mean = np.array(df.describe().loc['mean'][:-1])

The script then checks for relevant statistical assumptions needed before determining if your dataset is appropriate for factor analysis. This includes Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin Test. Additionally, a scree plot is produced using principal components analysis to assist in factor analysis decision-making. Once all of this is reviewed, the user will provide relevant inputs regarding their driver analysis model:

bart = calculate_bartlett_sphericity(x)
bart = (str(round(bart[1], 2)))
text = open('sphericity.txt', 'w')
...
kmo = calculate_kmo(x)
kmo = (str(round(kmo[1], 2)))
text = open('factorability.txt', 'w')
...
pca = PCA()
pca.fit(x)
comp = np.arange(pca.n_components_)
plt.figure()

When it comes to choosing whether to run random forest classification on the original variables or transformed factors, the above information is critical. The user will be able to decide both A) whether or not to use factor analysis, and B) how many factors should be used in extraction if applicable. Additionally, if the user opts for the factor analysis route, they will also be able to determine whether all the factors or just the highest loading variable per factor should be used (please see lines 139-150 in the script). The following optional factor analysis and mandatory core analyses will run based on user specifications from the previous step:

fa = FactorAnalysis(n_components = factor, max_iter = 3000, rotation = 'varimax')
...
x = fa.transform(x)
...
load = pd.DataFrame(fa.components_.T.round(2), columns = cols, index = feat)
load.to_csv('factor_loadings.csv')
...
vif = pd.Series(variance_inflation_factor(x.values, i) for i in range(x.shape[1]))
vif = pd.DataFrame(np.array(vif.round(2)), columns = ['Variable Inflation Factor'], index = feat)
vif.T.to_csv('variable_inflation_factors.csv')
clf = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_features = 'auto', bootstrap = True, oob_score = True, class_weight = 'balanced').fit(x, y)
oob = str(round(clf.oob_score_, 2)).ljust(4, '0')
pred = clf.predict_proba(x)
loss = str(round(log_loss(y, pred), 2)).ljust(4, '0')
perf = pd.DataFrame({'Out-of-Bag Score': oob, 'Log Loss': loss}, index = ['Estimate'])
perf.to_csv('model_performance.csv')

Please note, the only current rotation method available in Python for factor analysis is varimax, as far as I know. If another rotation method is preferred, I would opt out of the factor analysis route, or try implementing your own solution from scratch. From these results, the feature importance plot and its respective feature quadrant chart can be graphed and saved automatically to the source folder. This is an especially useful and efficient data visualization tool to help express which variable(s) are most important in predicting your outcome. It also saves you quite a bit of time compared to graphing it yourself!

imp = clf.feature_importances_
sort = np.argsort(imp)
plt.figure()
plt.barh(range(len(sort)), imp[sort], color = 'mediumaquamarine', align = 'center')
plt.title('Feature Importance Plot')
plt.xlabel('Derived Importance β†’')
...
imps = []
score = []
for i, feat in enumerate(imp[sort]):
  imps.append(round(feat / imp[sort].mean() * 100, 0))
for i, feat in enumerate(mean[sort]):
  score.append(round(feat / mean[sort].mean() * 100, 0))
quad = pd.DataFrame({'Rescaled Observed Score β†’': score, 'Rescaled Derived Importance β†’': imps,
  'Feature': x.columns[sort]})

To run the script, I suggest using a batch file located in the source folder as follows:

python driver.py
PAUSE

Although the entire script is not reflected in the above breakdown, this information should prove helpful in getting the user accustomed to what this script aims to achieve. If any additional information and/or explanations are desired, please do not hesitate in reaching out!

πŸ“‹ Next Steps

Although I feel this project is solid in its current state, I think one area of improvement would fall in the realm of optimizing the script and making it more pythonic. I am also quite interested in hearing feedback from users, including their field of practice, which variables they used for their analyses, and how satisfied they were with this statistical tool overall.

πŸ’‘ Community Contribution

I am always happy to receive feedback, recommendations, and/or requests from anyone, especially new learners. Please click here for information about the license for this project.

❔ Project Support

Please let me know if you plan to make changes to this project, or adapt the script to a project of your own interest. We can certainly collaborate to make this process as painless as possible!

πŸ“š Additional Resources

  • My current work in market research introduced me to the idea of driver analysis and its usefulness; this statistical tool was created with that space in mind, though it is certainly applicable to all applied areas of business and social science
  • To learn more about calculating random forest classification in Python, click here to access scikit-learn
  • To learn more about calculating factor analysis in Python, click here to access scikit-learn
  • For easy-to-use text editing software, check out Sublime Text for Python and Atom for Markdown
Owner
Thomas
With a passion for research, I am eager to build upon my knowledge of statistical programming. My current areas of focus include data mining and psychometrics.
Thomas
This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

1 Dec 28, 2021
Includes all files needed to satisfy hw02 requirements

HW 02 Data Sets Mean Scale Score for Asian and Hispanic Students, Grades 3 - 8 This dataset provides insights into the New York City education system

7 Oct 28, 2021
Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Fast Laplacian Eigenmaps in python Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMS

17 Jul 09, 2022
Learn machine learning the fun way, with Oracle and RedBull Racing

Red Bull Racing Analytics Hands-On Labs Introduction Are you interested in learning machine learning (ML)? How about doing this in the context of the

Oracle DevRel 55 Oct 24, 2022
Average time per match by division

HW_02 Unzip matches.rar to access .json files for matches. Get an API key to access their data at: https://developer.riotgames.com/ Average time per m

11 Jan 07, 2022
A pipeline that creates consensus sequences from a Nanopore reads. I

A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.

Ada Madejska 2 May 15, 2022
Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. πŸ˜ƒ Motiv

Souvik Pratiher 31 Dec 16, 2022
pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Matthew Johnson 527 Dec 04, 2022
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

Bell Eapen 14 Jan 02, 2023
My solution to the book A Collection of Data Science Take-Home Challenges

DS-Take-Home Solution to the book "A Collection of Data Science Take-Home Challenges". Note: Please don't contact me for the dataset. This repository

Jifu Zhao 1.5k Jan 03, 2023
Spectacular AI SDK fuses data from cameras and IMU sensors and outputs an accurate 6-degree-of-freedom pose of a device.

Spectacular AI SDK examples Spectacular AI SDK fuses data from cameras and IMU sensors (accelerometer and gyroscope) and outputs an accurate 6-degree-

Spectacular AI 94 Jan 04, 2023
Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python This project is a good starting point for those who have little

Himanshu Kumar singh 2 Dec 04, 2021
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

cleanX CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological

Candace Makeda Moore, MD 20 Jan 05, 2023
NumPy aware dynamic Python compiler using LLVM

Numba A Just-In-Time Compiler for Numerical Functions in Python Numba is an open source, NumPy-aware optimizing compiler for Python sponsored by Anaco

Numba 8.2k Jan 07, 2023
Example Of Splunk Search Query With Python And Splunk Python SDK

SSQAuto (Splunk Search Query Automation) Example Of Splunk Search Query With Python And Splunk Python SDK installation: ➜ ~ git clone https://github.c

AmirHoseinTangsiriNET 1 Nov 14, 2021
A meta plugin for processing timelapse data timepoint by timepoint in napari

napari-time-slicer A meta plugin for processing timelapse data timepoint by timepoint. It enables a list of napari plugins to process 2D+t or 3D+t dat

Robert Haase 2 Oct 13, 2022
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
CS50 pset9: Using flask API to create a web application to exchange stocks' shares.

C$50 Finance In this guide we want to implement a website via which users can β€œregister”, β€œlogin” β€œbuy” and β€œsell” stocks, like below: Background If y

1 Jan 24, 2022
Data processing with Pandas.

Processing-data-with-python This is a simple example showing how to use Pandas to create a dataframe and the processing data with python. The jupyter

1 Jan 23, 2022