A key feature of the STREAMLINE project is to unify batch and stream processing. This is achieved by extending Apache Flink.
Today’s big data analytics systems cater for either “data at rest” or “data in motion.” As a result, enterprises are left to devise costly strategies to support and integrate disparate systems. To alleviate this burden, STREAMLINE aims to reduce complexity, enable faster results, and reduce cost by supporting analysis on “big data at rest” and “fast data in motion” in a single system. STREAMLINE research and innovation actions include carrying out groundbreaking research in the areas of distributed systems, data management, and machine learning, with the key goal of achieving sustainable innovation through technology transfer to an established and growing open source project. STREAMLINE will achieve this sustainability by building upon and feeding back into Apache Flink. Furthermore, STREAMLINE will demonstrate innovation impact in three reactive and proactive analytics applications: the first focused on customer retention, the second on targeted advertisement, and the third on multilingual Web processing.
Figure 1: Magic Triangle
STREAMLINE research and innovation objectives address the lack of established technologies for high-accuracy, reactive predictive methods that are easy to develop, maintain, and operate. Our vision manifests in the magic triangle (Figure 1), whose dimensions are skills shortage, delayed information processing, and lack of appropriate analytics. STREAMLINE aims to improve on all three dimensions: by streamlining data analysis, by reducing software complexity through a novel architecture and interaction paradigm, and by reducing the time it takes to arrive at actionable intelligence through innovative approximate methods and interactivity. We will reduce the skill requirements through a declarative language that enables data scientists to express what they want as an analysis result without describing how the result should be efficiently computed. This alleviates the major burden data scientists face today, i.e., the need to know and operate two different systems and their respective management and tuning parameters. Achieving this will enable more people to easily create data analysis programs for ever-faster and increasingly bigger datasets. This effectively means that STREAMLINE reduces cost, improves performance, and creates new functionality for data analysis and related business applications.