TY - GEN
T1 - Parallel Continuous Outlier Mining in Streaming Data
AU - Toliopoulos, Theodoros
AU - Gounaris, Anastasios
AU - Tsichlas, Kostas
AU - Papadopoulos, Apostolos
AU - Sampaio, Sandra
PY - 2019
Y1 - 2019
N2 - In this work, we focus on distance-based outliers in a metric space, where whether an entity is an outlier depends on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques targets a massively parallel setting. Our proposal fills this gap and studies the transfer of state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups of up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open source.
AB - In this work, we focus on distance-based outliers in a metric space, where whether an entity is an outlier depends on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques targets a massively parallel setting. Our proposal fills this gap and studies the transfer of state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups of up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open source.
KW - Anomaly detection
KW - Flink
KW - Streams
UR - http://www.scopus.com/inward/record.url?scp=85062839224&partnerID=8YFLogxK
U2 - 10.1109/dsaa.2018.00033
DO - 10.1109/dsaa.2018.00033
M3 - Conference contribution
SN - 9781538650905
T3 - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)
SP - 227
EP - 236
BT - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
A2 - Eliassi-Rad, Tina
A2 - Wang, Wei
A2 - Cattuto, Ciro
A2 - Provost, Foster
A2 - Ghani, Rayid
A2 - Bonchi, Francesco
PB - IEEE Computer Society
ER -