I'm currently a research associate with an M.S. in Computer Science, with a focus (and minor) in Statistics and Statistical Modeling. My graduate research focused on Machine Learning (ML) and Artificial Intelligence (AI); my interests lie in exploring the intersections between unsupervised learning, statistical learning theory, and empirical analysis. I also enjoy building and contributing to software for scientific computing and reproducible research. Topic areas that interest me include clustering, dimensionality reduction, topology, manifold learning, and density estimation. I have supplemental research interests, background knowledge, or experience in random graph modeling, Bayesian statistics, computational geometry, reinforcement learning (such as adversarial learning!), and high performance computing. A few selected research projects are listed below.
I first started doing research part time in 2013 at the Air Force Institute of Technology with a heavily multidisciplinary team, the Low Orbitals Radar and Electromagnetism (LORE) group, either 1) conducting research for an independent project under the supervision of Dr. Andrew Terzuoli or 2) supporting graduate students' research efforts in the group. I worked actively with the group until 2016, after which I maintained an advisory role until early 2017.
In 2015, I started working for the Machine Learning and Complex Systems Lab as part of a research-based independent study shortly after taking an introductory ML/data analysis course taught by Derek Doran. I received a graduate research assistant position in the same lab shortly after, working towards an M.S. in Computer Science.
Since late 2016, I've been involved in a small group associated with the Air Force Research Laboratory that does applied topological data analysis (TDA) with the Mapper framework. I originally worked for the group in a very part-time capacity, but since Fall 2018 I have been doing research there full time.
My computational experience is diverse. In 2015, I started using the R Project for Statistical Computing for statistical modeling, and I continue to prefer R for research. In my undergraduate years, I used both C++ (primarily C++11) and ANSI C89/C90 extensively for a myriad of projects (see below). Part of my undergraduate research delved into a project involving computational geometry, which required a final implementation written in Compute Unified Device Architecture (CUDA) and OpenCL. Of course, I'm proficient in both Python and Java.
The primary result of the Mapper framework is the geometric realization of a simplicial complex, depicting topological relationships and structures suitable for visualizing, analyzing, and comparing high-dimensional data. As an unsupervised tool that may be used for exploring or modeling heterogeneous types of data, Mapper naturally relies on a number of parameters which explicitly control the quality of the resulting construction; one such critical parameter controls the entire relational component of the output complex. In practice, there is little guidance on what values may provide "better" or more "stable" sets of simplices. In this effort, we provide a new algorithm that enables efficient computation of successive Mapper realizations with respect to this crucial parameter. Our results not only enhance the exploratory/confirmatory aspect of Mapper, but also give tractability to recent theoretical extensions of Mapper related to persistence and stability.
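To make the role of that relational parameter concrete, here is a minimal sketch of the baseline Mapper construction in base R: a filter function, an overlapping interval cover, per-preimage clustering, and the nerve of the resulting clusters. The interval count, overlap gain, and cut height below are illustrative placeholders, and this is the standard single-scale construction, not the multiscale algorithm described above.

```r
# Minimal Mapper sketch in base R: filter -> overlapping cover -> per-bin
# clustering -> nerve (edges between clusters that share points).
mapper_sketch <- function(X, n_intervals = 10, overlap = 0.35, h = 0.5) {
  f <- prcomp(X)$x[, 1]                      # filter: first principal component
  rng <- range(f)
  len <- diff(rng) / n_intervals
  starts <- rng[1] + (seq_len(n_intervals) - 1) * len
  nodes <- list()
  for (i in seq_len(n_intervals)) {
    lo <- starts[i] - overlap * len          # widen each interval by the overlap gain
    hi <- starts[i] + len + overlap * len
    idx <- which(f >= lo & f <= hi)
    if (length(idx) < 2) next
    cl <- cutree(hclust(dist(X[idx, , drop = FALSE])), h = h)
    for (k in unique(cl)) nodes[[length(nodes) + 1]] <- idx[cl == k]
  }
  # Nerve: connect any two clusters whose point sets intersect
  edges <- which(outer(seq_along(nodes), seq_along(nodes), Vectorize(function(a, b)
    a < b && length(intersect(nodes[[a]], nodes[[b]])) > 0)), arr.ind = TRUE)
  list(vertices = nodes, edges = edges)
}

m <- mapper_sketch(as.matrix(iris[, 1:4]))
length(m$vertices); nrow(m$edges)
```

Changing `n_intervals` or `overlap` changes which clusters intersect, and therefore the entire edge set of the output complex; that sensitivity is exactly what motivates computing successive realizations efficiently.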
With the rapid development and widespread deployment of sensors dedicated to location-acquisition, new types of models have emerged to predict macroscopic patterns that manifest in large data sets representing "significant" group behavior. Partially due to the immense scale of geospatial data, current approaches to discover these macroscopic patterns are primarily driven by inherently heuristic detection methods. Although useful in practice, the inductive bias adopted by such mainstream detection schemes is often unstated or simply unknown. Inspired by recent theoretical advances in efficient non-parametric density level set estimation techniques, in this research effort we describe a semi-supervised framework for automating point of interest discovery in geospatial contexts. We outline the flexibility and utility of our approach through numerous examples, and give a systematic framework for incorporating semi-supervised information while retaining finite-sample estimation guarantees.
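As a rough illustration of the underlying idea (not the semi-supervised estimator or its finite-sample guarantees), the sketch below estimates an upper density level set over synthetic 2-D "geospatial" points with a kernel density estimate from the MASS package and flags points above an illustrative density threshold as candidate points of interest.

```r
# Sketch: estimate a density level set over 2-D points and flag the points in
# the upper level set as candidate points of interest. All data are synthetic.
library(MASS)  # for kde2d()

set.seed(1)
pts <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),  # hypothetical "POI" cluster
             matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2),  # second cluster
             cbind(runif(100, -1, 4), runif(100, -1, 4)))       # background noise

dens <- kde2d(pts[, 1], pts[, 2], n = 100)        # kernel density estimate on a grid
# Density value at each observation, read off the nearest grid cell
ix <- findInterval(pts[, 1], dens$x)
iy <- findInterval(pts[, 2], dens$y)
d_at_pts <- dens$z[cbind(pmax(ix, 1), pmax(iy, 1))]

lambda <- quantile(d_at_pts, 0.75)                # illustrative level, not a tuned value
in_level_set <- d_at_pts >= lambda                # points inside the estimated upper level set

plot(pts, col = ifelse(in_level_set, "red", "grey"), pch = 16,
     main = "Estimated upper density level set")
```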
Density-based clustering techniques have become extremely popular in the past decade. It's often conjectured that these methods succeed because of their ability to identify "natural groups" in data. These groups are often non-convex in shape, deviating from the typical premise of 'minimal variance' that underlies parametric, model-based approaches, and often appear in very large data sets. As the era of "Big Data" continues, access to scalable, easy-to-use implementations of these density-based methods is paramount. In this research effort, we provide fast, state-of-the-art density-based algorithms in the form of an open-source R package. We also provide several related density-based clustering tools to help make state-of-the-art density-based clustering accessible to people with large, computationally difficult problems.
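Assuming the package in question is the `dbscan` package on CRAN, a typical usage pattern looks like the following; the `eps` and `minPts` values here are illustrative, not recommendations.

```r
# Illustrative usage, assuming the R package in question is `dbscan` (CRAN).
library(dbscan)

X <- as.matrix(iris[, 1:4])

# Common heuristic for choosing eps: look for the "knee" in the
# k-nearest-neighbor distance plot (k = minPts - 1 is a typical choice).
kNNdistplot(X, k = 4)

fit_db  <- dbscan(X, eps = 0.5, minPts = 5)   # flat, density-based clustering
fit_hdb <- hdbscan(X, minPts = 5)             # hierarchical variant; no eps required

table(fit_db$cluster)    # cluster 0 denotes noise points
plot(fit_hdb)            # condensed cluster hierarchy
```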
The Iterative Closest Point (ICP) problem is a well-studied problem that seeks to align a given query point cloud to a fixed reference point cloud. Computationally, ICP is dominated by its first phase, a pairwise distance minimization. The "brute-force" approach, an embarrassingly parallel problem amenable to GPU acceleration, involves calculating the pairwise distance from every point in the query set to every point in the reference set. This, however, still requires linear runtime complexity per thread, rendering the trivial solution unsuitable for e.g. real-time applications. Alternative spatial indexing data structures utilizing branch-and-bound (B&B) properties have been proposed as a means of reducing the algorithmic complexity of the ICP problem; however, they were originally developed for serial applications, and it is well known that direct conversion to their parallel equivalents often results in slower runtime performance than GPU-employed brute-force approaches due to frequent suboptimal memory access patterns and conditional computations. In this application-motivated effort, we propose a novel two-step method which exposes the intrinsic parallelism of the ICP problem, yet retains a number of the B&B properties. Our solution involves an O(log n) approximate search, followed by a fast vectorized search we call the Delaunay Traversal, which we show empirically finishes in O(k) time on average, where k << n, and generally exhibits extremely small growth factors. We demonstrate the superiority of our method compared to traditional B&B and brute-force implementations using a variety of benchmark data sets, and demonstrate its usefulness in the context of Autonomous Aerial Refueling.
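For reference, here is what a baseline (non-accelerated) ICP iteration looks like in R: a brute-force closest-point search followed by a closed-form rigid update via the SVD (Kabsch). This is the O(nm) reference approach the two-step method improves upon, not the Delaunay Traversal itself.

```r
# One baseline ICP iteration: brute-force correspondences + rigid SVD update.
icp_step <- function(Q, P) {
  # Q: query point cloud (n x d), P: reference point cloud (m x d)
  # Brute-force closest-point search (O(n*m); simple but wasteful)
  D  <- as.matrix(dist(rbind(Q, P)))[seq_len(nrow(Q)), nrow(Q) + seq_len(nrow(P))]
  nn <- max.col(-D)                                  # closest reference point per query point
  M  <- P[nn, , drop = FALSE]
  # Closed-form rigid update (Kabsch / SVD); reflections not handled in this sketch
  qc <- colMeans(Q); mc <- colMeans(M)
  S  <- crossprod(sweep(Q, 2, qc), sweep(M, 2, mc))
  sv <- svd(S)
  R  <- sv$v %*% t(sv$u)
  tvec <- mc - as.vector(R %*% qc)
  Q_new <- t(R %*% t(Q)) + matrix(tvec, nrow(Q), ncol(Q), byrow = TRUE)
  list(Q_new = Q_new, correspondences = nn)
}

# Toy usage: recover a small rigid offset between two 3-D point clouds
set.seed(1)
Q <- matrix(rnorm(150), ncol = 3)
P <- Q + matrix(c(0.2, -0.1, 0.05), nrow(Q), 3, byrow = TRUE)
out <- icp_step(Q, P)
mean(rowSums((out$Q_new - P)^2))   # alignment error after one iteration
```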
School | Degree | Graduation Year |
---|---|---|
Wright State University | Master of Science in Computer Science | 2018 |
Wright State University | Bachelor of Science in Computer Science, Minor in Statistics | 2015 |
Selected Coursework | | |
---|---|---|
CEG 7900: Network Science | CS 7830: Machine Learning | CS 3250: Computational Tools and Techniques for Data Analysis |
STT 7020: Applied Stochastic Processes | CS 7230: Information Theory | CS 4850: Foundations of Artificial Intelligence |
STT 3600/3610: Applied Statistics I & II | STT 4610: Theoretical Statistics I | CS 7200: Algorithm Design and Analysis |
As I read more into theoretical foundations of density-based clustering, my research began to intersect Topology Theory and Manifold Learning. In 2017, I began to research these connections in a minor capacity with a local research group studying the intersection of TDA and machine learning. The research primarily involved understanding the basic foundations of Topology towards extending the Mapper framework, a popular and very general method which has been used successfully for data analysis.
Starting in Fall 2018, I was hired full time to begin enhancing/extending Mapper, and to assist the team in using Mapper on real-world applications. My primary research towards this end has been two-fold: (1) enabling efficient construction of mappers in a multiscale setting, and (2) understanding the full range of use cases for the Mapper framework. For more details, see below.
NOTE: This paper is still in development, and is made available in the spirit of transparent research. Some equations may be incorrect and there may be notational errors.
I was hired by Dr. Steven Arnold under NASA's 10-week LERCIP program to apply Machine Learning to a specific Materials Science problem. The first phase of the research project involved training a fairly trivial feed-forward Artificial Neural Network (ANN) to act as a surrogate model for the Generalized Method of Cells (GMC) technique. The second (non-trivial) phase of the project involved creating a systematic procedure for interpreting various aspects of the data produced by the surrogate model using a non-parametric Optimal Experimental Design (OED)-motivated optimization procedure, recently made possible by the Approximate Coordinate Exchange algorithm.
NOTE: A technical report and associated journal article are currently being developed to fully capture and report the results of the research project. Preliminary draft versions of both the code and the report(s) are available upon request, to U.S. citizens only.
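As a rough sketch of the first (surrogate-fitting) phase, the snippet below trains a small single-hidden-layer feed-forward network with the `nnet` package on a synthetic stand-in for an expensive simulation; the "simulator", design ranges, and network size are placeholders, not the actual GMC model or NASA data.

```r
# Minimal sketch of the surrogate-model phase: fit a small feed-forward network
# to stand in for an expensive simulation. Everything below is synthetic.
library(nnet)

set.seed(42)
expensive_sim <- function(x) sin(2 * pi * x[, 1]) + 0.5 * x[, 2]^2   # placeholder for GMC
X <- cbind(runif(500), runif(500))                                    # design points
y <- expensive_sim(X) + rnorm(500, sd = 0.05)

surrogate <- nnet(X, y, size = 10, linout = TRUE, maxit = 500, trace = FALSE)

# Cheap predictions from the surrogate at new inputs
X_new <- cbind(runif(5), runif(5))
cbind(truth = expensive_sim(X_new), surrogate = predict(surrogate, X_new))
```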
My graduate research involved a large, multifaceted project aimed at modeling real-world traffic networks at a macroscopic scale. The goal of the project was to turn raw positioning/track information into a dynamic network representation, and then model that representation. The project involved researching density-based clustering algorithms, cluster validation measures, non-parametric density estimation techniques, Markov chain Monte Carlo optimization techniques, and random graph modeling (stochastic block models).
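As a toy illustration of the random graph modeling component, the snippet below samples a network with planted communities from a stochastic block model using `igraph`; the block sizes and connection probabilities are illustrative only, not values from the project.

```r
# Toy stochastic block model (SBM) example: sample a network with planted
# communities, then recover them with a simple community-detection baseline.
library(igraph)

set.seed(7)
B <- matrix(c(0.30, 0.02, 0.02,
              0.02, 0.25, 0.02,
              0.02, 0.02, 0.20), nrow = 3, byrow = TRUE)  # within/between-block edge probabilities
g <- sample_sbm(n = 90, pref.matrix = B, block.sizes = c(30, 30, 30))

cl <- cluster_fast_greedy(g)   # baseline community detection on the sampled graph
table(membership(cl))          # sizes of the recovered blocks
```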
I submitted a successful funding proposal under the Google Summer of Code (GSOC) initiative to the R Project for Statistical Computing to explore, develop, and unify recent developments related to the theory of density-based clustering. This involved a mixture of code development, which culminated in an R package, as well as deeper research to further understand the theory and utility of the cluster tree. There was also a WSU newsroom piece that describes the proposal in a non-technical way.
As part of a heavily multidisciplinary team, I worked on several exploratory or educational projects involving computational, statistical, and physics-based problems. Much of the work involved assisting the Air Force graduate students with their research. In that time, I studied topics including branch-and-bound spatial indexing data structures (k-d trees, cover trees, locality sensitive hashing), the ICP problem, finite mixture modeling, Markov chain modeling, and general parameter estimation techniques (EM/MAP estimation).
Various random graph models such as Erdős–Rényi models and Exponential Random Graph Models (ERGMs), entropy measures over networks, density-based clustering techniques (DBSCAN and OPTICS), time series models (ARMA and ARIMA)
Gauss–Newton method, approximation algorithms for unsplittable flow problems and (by extension) basic graph theory, link-analysis techniques for SEO (PageRank), asynchronous vs. synchronous client-server communication strategies with AJAX and NodeJS/PHP servers, XML Schema and XML technologies (XLink, XPath, etc.)