Abstract
To cope with today's large data volumes, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced various forms of multi-language integration, e.g., for Python and R. However, such integration results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtimes by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
Original language | English |
---|---|
Title of host publication | SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing |
Place of Publication | New York, NY |
Publisher | Association for Computing Machinery |
Pages | 288-300 |
Number of pages | 13 |
ISBN (Electronic) | 9781450360111 |
Publication status | Published - 11 Oct 2018 |
Event | ACM Symposium on Cloud Computing 2018, Cape Rey Beach Resort, Carlsbad, California, United States. Duration: 11 Oct 2018 → 13 Oct 2018. https://acmsocc.github.io/2018/ |
Publication series
Name | SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing |
---|---|
Conference
Conference | ACM Symposium on Cloud Computing 2018 |
---|---|
Country/Territory | United States |
City | Carlsbad |
Period | 11 Oct 2018 → 13 Oct 2018 |
Keywords
- Dataflow Engines
- Language Integration
- Data Exchange