dc.description.abstract | A data stream is a sequence of observations arriving continuously at a fast pace. Data streams
pose unique computational challenges, namely single-pass processing of incoming observations, large storage requirements, and the need to account for concept drift. Concept drift is a phenomenon in which the characteristics
of the data evolve over time. It renders models built in a conventional setup outdated
for predictions on current data. Predictive machine learning methods are expected to account for
these challenges while processing data streams that have become ubiquitous due to the pervasive
presence of sensors in the Internet of Things era.
The prevalence of information and communication technologies for sensor data collection, a rapid decrease in data storage cost, and the widespread availability of computing power enable
the analysis of “big data” for monitoring, planning, and operational purposes. The domain of ‘Intelligent Systems’ involves the use of advancements in communication and computation technologies to address challenges in data-driven systems. This leads to the production of high-velocity,
information-rich data streams. These streams operate in dynamic environments and do not satisfy
the assumption of a (time-)stationary distribution, which is often an important requirement for
the analysis of temporal data.
In this dissertation, we aim to develop new prediction methodologies for data streams from
sensors with applications in the transportation and human activity recognition domains. We focus
on the following problems:
1. Dynamic Concept Drift Detection in Data Streams with Limited Labeling
2. Dynamic Demand Forecasting in Bikeshare Networks
For the first problem in a classification context, we use optimal transport theory to develop a novel
algorithm for detecting concept drift in partially labeled data streams. We develop a summarization measure to reduce the storage requirements of a data stream. We demonstrate the performance
of the algorithm on synthetic benchmark datasets and real datasets containing sensor observations
from the transportation domain. This approach can help transportation researchers develop adaptive systems for safer driving with minimal user feedback. It can also aid transportation planners in
assessing changes in the mobility preferences of a population using sensor data. The key contributions
of this approach are a novel algorithm for drift detection and
a data-driven approach for estimating the threshold that critically determines the performance of a drift detection algorithm. As accuracy alone is an unsuitable metric for comparing drift
detection algorithms in limited labeling setups, we propose a novel measure that accounts for the
predictive performance and the labeling requirements of a method.
For the first problem in general predictive contexts, we develop a novel algorithm for detecting concept drift in partially labeled data streams using theory from symbolic data analysis. We devise a
novel drift detection metric using theory from symbolic data analysis and statistical learning. We
demonstrate the performance of the proposed algorithm on synthetic data and a real-life human activity
recognition dataset. It can be applied to assisted living for the elderly, where a drift detected
in real time could help update the predictive system to detect falls and injuries more accurately.
The key contribution of this method is the development of a novel drift detection metric that is
more sensitive to drifts in features with more predictive power, thus improving upon existing drift
tracking metrics that are equally sensitive to drifts in all features. This method is applicable to
both regression and classification problems.
For the second problem, we focus on demand forecasting in a bike-share system. We devise an
algorithm that uses spatial clustering to reduce the high dimensionality of the problem, followed
by building time series models in a streaming setup. An accurate forecast can help the bikeshare authorities achieve timely rebalancing across stations to meet demand effectively. The
key contribution of this method is the development of lightweight models for bike demand forecasting that are more suitable for edge computing environments with limited computing power
compared to deep learning models with high computational overheads.
The key contribution of this dissertation is the development of new algorithms with demonstrated applicability to real-world problems. We also offer insights into choosing the right
algorithm based on the application context. The findings of this research contribute to the domains of
streaming data, transportation, human activity recognition and sensor data analysis. | en_US |