Work Package 1
WP1 is a research and development work package, where each task follows the methodology from requirements analysis, research and design to prototyping and development, functional testing, integration, system testing, and evaluation. In this work package the combined data-at-rest and data-in-motion processing platform will be defined, designed and implemented as an extension to the current version of Apache Flink.
In summary, the achievement of the following objectives is envisaged:
• To create novel distributed query optimization techniques that work best for the combined analysis of streaming data and data at rest. (Task 1.1)
• To implement novel fault-tolerance schemes that allow stateful streaming. (Task 1.2)
• To exploit Flink’s design principles that naturally lend themselves to build a unified batch-stream processing architecture. (Task 1.3)
• To provide novel methods for incremental computation as a feature for quicker computations of streaming algorithms. (Task 1.4)
• To support Streaming SQL Joins through View Maintenance with Changelogs. (Task 1.5)
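To make the view-maintenance idea behind Task 1.5 concrete, the following sketch maintains a materialized inner-join view over two changelog streams, where each event either inserts ('+') or retracts ('-') a row. This is our illustrative pseudocode in Python, not the actual Flink implementation; class and method names are assumptions for exposition.

```python
from collections import defaultdict

class ChangelogJoinView:
    """Maintains a materialized inner-join view of two changelog streams.

    Each input event is (op, key, payload) with op in {'+', '-'}: an
    insertion into or retraction from one side of the join. The view is
    updated incrementally and itself emits a changelog of the join result.
    Retractions are assumed to match prior insertions.
    """

    def __init__(self):
        self.left = defaultdict(list)   # key -> payloads seen on the left
        self.right = defaultdict(list)  # key -> payloads seen on the right
        self.view = []                  # current join result: (key, l, r) rows

    def on_left(self, op, key, payload):
        return self._apply(op, key, payload, self.left, self.right, left_side=True)

    def on_right(self, op, key, payload):
        return self._apply(op, key, payload, self.right, self.left, left_side=False)

    def _apply(self, op, key, payload, mine, other, left_side):
        out = []  # changelog emitted for the join view
        if op == '+':
            mine[key].append(payload)
            for match in other[key]:
                row = (key, payload, match) if left_side else (key, match, payload)
                self.view.append(row)
                out.append(('+', row))
        else:  # retraction: remove the payload and retract its join rows
            mine[key].remove(payload)
            for match in other[key]:
                row = (key, payload, match) if left_side else (key, match, payload)
                self.view.remove(row)
                out.append(('-', row))
        return out
```

The key property, which a changelog-driven Streaming SQL join must preserve, is that the emitted output changelog keeps the downstream view consistent with re-running the join from scratch.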
A measurable outcome of this work will be improved performance (latency, throughput) for data analysis problems that combine the analysis of data in motion with data at rest. This outcome will be demonstrated by benchmarking the performance of the STREAMLINE extensions to Apache Flink against a state-of-the-art lambda architecture in WP3, after integration of the research. Other measurable outcomes are the performance improvements of the individual research tasks on stateful streaming, unified processing and incremental computations, again with respect to throughput and latency.
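As a minimal illustration of the incremental-computation principle behind Task 1.4, the sketch below (a generic textbook example, not project code) maintains running aggregates in O(1) work per stream element instead of recomputing over all elements seen so far:

```python
class IncrementalStats:
    """Running mean and variance over a stream (Welford's algorithm).

    Each update touches only O(1) state, avoiding a full recomputation
    over all elements seen so far -- the essence of incremental
    computation for streaming algorithms.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of the elements seen so far.
        return self.m2 / self.n if self.n else 0.0
```

The same pattern, carrying a small summary state forward instead of re-scanning history, is what makes streaming aggregations and windowed computations fast.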
Work Package 2
In the chain of the technical WPs 1-3, WP2 builds on top of the platform-specific developments in WP1 to define application-specific components, which will in turn be integrated in the use case interactive environment in WP3. This work package is a research and development work package, where each task follows the methodology from requirements analysis, research and design to prototyping and development, functional testing, integration, system testing, and evaluation. In this WP we build upon the next-generation stream processing engine Apache Flink to provide a machine learning library serving the direct needs of the application partners' business cases. To achieve this goal, we will exploit recent approaches to approximate data structures for data streams to aid longer-term storage, delayed messages in sliding windows and stream joins. Moreover, we will use the same techniques in decentralized collaborative filtering to adapt the underlying distributed topology to the data locality present in event streams. Our aim in this WP is to go beyond state-of-the-art platforms, e.g., Esper and Spark, by building a system with higher scalability and throughput. Apache Flink is a promising candidate as it is primarily designed for multi-server distributed operation and promises better low-level data management than Spark.
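One family of approximate data structures mentioned above is frequency sketches. The following minimal Count-Min Sketch, shown as an illustrative example rather than the project's implementation, estimates item frequencies in an unbounded stream using fixed, sub-linear space:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for a stream in fixed space.

    Uses `depth` hash rows of `width` counters. Estimates never
    undercount; they overcount only through hash collisions, which
    larger width/depth make increasingly unlikely.
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One independent-ish hash per row, derived by salting SHA-256.
        for i in range(self.depth):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield i, int(digest, 16) % self.width

    def add(self, item, count=1):
        for i, b in self._buckets(item):
            self.rows[i][b] += count

    def estimate(self, item):
        # The minimum across rows bounds the overcount from collisions.
        return min(self.rows[i][b] for i, b in self._buckets(item))
```

Structures like this let a stream processor keep approximate long-term statistics (e.g., per-item popularity for collaborative filtering) without storing the stream itself.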
In summary, the achievement of the following objectives is envisaged:
• To provide a data analysis framework for data in motion and data at rest ready to be used by data scientists. (Task 2.1)
• To define and build different window operation semantics. (Task 2.2)
• To develop a machine learning library for mining streaming data. (Task 2.3)
The measurable outcomes produced by this WP will be:
• Accuracy of the prediction (classification, regression or recommendation) measured by batch measures (AUC: Area Under the Receiver Operating Characteristic curve; RMSE: Root Mean Square Error; NDCG: Normalized Discounted Cumulative Gain; …) and temporal online measures.
• Latency and throughput of the analytics applications.
• Gain over competitors, measured as speedup over out-of-the-box and optimized test implementations in distributed streaming systems, primarily Spark.
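Two of the batch accuracy measures listed above are standard and easy to state precisely. The sketch below gives reference definitions of RMSE and NDCG as used for evaluation; it is a plain restatement of the textbook formulas, not evaluation code from the project:

```python
import math

def rmse(preds, targets):
    """Root Mean Square Error for regression-style predictions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def dcg(relevances):
    """Discounted Cumulative Gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalized DCG of a ranked result list: DCG over ideal-order DCG."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```

NDCG is 1.0 exactly when the ranking places items in non-increasing order of relevance, which makes it a natural quality score for recommendation lists.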
Work Package 3
WP3 focuses on lowering the skills barrier for big data analytics. Like WP1 and WP2, this work package is a research and development work package, where each task follows the methodology from requirements analysis, research and design to prototyping and development, functional testing, integration, system testing, and evaluation.
In WP3 we will implement methods and techniques to increase the usefulness and usability of Apache Flink with respect to our use cases in WP5, and beyond STREAMLINE. Data-intensive platforms have been widely adopted for large-scale data analytics. These systems let users write parallel computations using a set of data analysis operators, without having to worry about work distribution and fault tolerance. Although current frameworks provide numerous abstractions for accessing a cluster's computational resources, they often do not abstract from the physical characteristics of the underlying system (e.g., having to decide when to cache data in Spark), and they do not provide clear abstractions and easy-to-use interactive interfaces for end users. Also, current frameworks often do not enable secure access to data, which is of paramount importance in multi-tenant environments. As a result, it is difficult for users who are not expert systems programmers to work with these platforms. It is therefore crucial to provide a high-level abstraction that lets non-expert users work with data-intensive platforms interactively, declaratively and efficiently. Additionally, installing and configuring data-intensive platforms is time-consuming work that requires systems expertise. Hence, we need to provide deployment software to help users set up the environment quickly on demand.
The major objectives of WP3 are:
• To provide deployment software to help users set up the environment quickly (T3.1)
• To provide a high-level declarative language to increase the usability of the platform (T3.2)
• To provide an interactive environment for Flink (T3.3)
A measurable outcome of this WP is an improvement in the effort and time needed to deploy and develop applications with Flink, as well as the ability to obtain results of interactive (SQL-like) queries within a few seconds on large data sets.
Work Package 4
During the first year of the project, two use case partners have expressed a strong interest in recommendation methods on streaming data. The use cases are briefly introduced in the following. Although they originate from quite distinct business areas, they share some distinctive features that make them a quite promising test bed for the design, implementation and validation of a recommendation module in STREAMLINE/Flink.
• Both approaches concern item-to-item recommendation, which thus complements nicely the existing recommendation efforts in STREAMLINE that were mostly focused on collaborative filtering / user similarities.
• Both approaches attempt to identify groups of similar items based on a textual description, rather than on a well-defined set of quantitative or qualitative features.
• Finally, external sources play an important role in the two use cases. External sources, essentially extracted from the Web, supply information likely to reinforce the accuracy of the recommendation results.
Text-centric methodologies have attracted a lot of attention in the recommendation community, due to the growing importance of item datasets collected from unstructured sources. Social networks, mobile devices and the IoT all contribute data streams that often consist of short, textual, sometimes multilingual descriptions, raising the challenge of building relevant matching techniques. We therefore propose to devote our efforts in WP4 to elaborating innovative text-based, item-to-item recommendation algorithms, with the hope that this will help make Flink even more attractive as an ML execution environment.
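A common baseline for such text-based item-to-item matching is TF-IDF weighting combined with cosine similarity. The sketch below is a simplified, single-machine illustration of that baseline technique only; the WP4 algorithms themselves will be distributed and go beyond it:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for short item descriptions.

    Terms occurring in many descriptions get low weight (smoothed IDF),
    so matching hinges on the distinctive words of each item.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

For item-to-item recommendation, the items most similar to a given one are simply those with the highest cosine score against its description vector.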
Altice Labs use case
As part of the Altice Labs TV recommendation task, the need for using information sources beyond the activity logs has emerged. The reason is that most of the items are new, and the item cold-start problem can best be handled by using other internal (e.g., program catalogue) and external sources, including IMDb or other Internet resources. Currently, the TV recommendation engine used in the PT IPTV solution (called MEO) is very simple, restricted to the most-watched channels and basic related shows. To motivate clients it is important to enhance the engine to recommend not just what other people with similar consumption profiles watch, but also new channels and programs as soon as they become available. A catalog with descriptions of the shows (including synopsis, actors, director, etc.), ratings (e.g., from IMDb) and other complementary information (e.g., social networks) will help to better understand user preferences and therefore enhance the recommendation engine. The challenge lies in extracting useful information from these external sources, reasoning over it, and weighting all the factors appropriately. Another challenge is to match this information with the different profiles of the client, which depend on the time of day and even the day of the week. TV recommendations should be about what the client would like to see, and that should depend on:
– Programs already watched by a certain profile (note that a client may have different profiles during the day, since the account may represent different people); besides the profile, the time factor is important (e.g., programs watched long ago may matter less than those watched yesterday, but a program watched many times may still outweigh a more recent one)
– Related programs (based on several criteria: what others have watched, common information such as genre, actors, etc.)
– New channels and programs of similar types
– New seasons and episodes of a program.
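The interplay between recency and viewing frequency described in the first point above can be captured with an exponentially time-decayed score per program. The sketch below is illustrative only; the half-life value is a hypothetical parameter, not one chosen by Altice Labs:

```python
from collections import defaultdict

def profile_scores(watch_events, half_life_days=30.0):
    """Time-decayed program scores for one viewing profile.

    Each viewing of a program contributes 0.5 ** (age / half_life), so
    recent viewings count more, but many older viewings of the same
    program can still outweigh a single recent one.
    watch_events: iterable of (program, age_in_days) pairs.
    """
    scores = defaultdict(float)
    for program, age in watch_events:
        scores[program] += 0.5 ** (age / half_life_days)
    return dict(scores)
```

Per-profile scores like these can then be blended with catalog-based similarity (genre, actors) and with new-content signals to rank recommendations.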
IMR use case
As part of IMR's development, a new B2C electronic commerce use case has emerged. Bomerce (beta version launched in January 2017) is a web and mobile application that aims at addressing the main shortcomings of online shopping. Many people, when trying to explore the list of online offers for a product or a service, are confronted with the silo curse of proprietary eCommerce sites. A customer who wishes to compare several offers for the same product must open several browser tabs and independently access a specific site in each one. This makes the comparison of offers difficult, and often requires additional tools such as a spreadsheet that summarizes the offers. Moreover, some useful and natural inspections cannot be achieved with this clumsy mechanism: asking a friend or third party for advice; finding similar and competitive offers on other eCommerce sites; being notified of promotions for the tracked product or similar ones; etc.
Bomerce essentially acts as an "agnostic" shopping cart, independent of eCommerce platforms, offering customers a neutral place where product offers can be stored, compared and discovered. A challenge in this context is to offer the recommendation services that eCommerce sites propose, which rely on internal information about their consumers' purchases. Two kinds of recommendation services are envisaged in Bomerce:
• Recommendation of similar products (possibly sponsored)
• Recommendation of products that are typically purchased together (e.g., an electronic device and the appropriate batteries)
Similarity features extracted from the textual product descriptions can support the first type of recommendation. The second is trickier in the context of Bomerce. Our main option is to bootstrap the recommender model with data crawled from a short list of representative eCommerce sites (this model is published and can be extracted) and to update it using the inputs of the Bomerce customers. This involves, for instance, continuous refinement of the model based on the flow of user interactions (clicks and other inputs), a requirement addressed by Task 3.1.
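A simple way to bootstrap the "purchased together" recommender from crawled transaction data is pairwise co-occurrence counting over baskets. The sketch below is illustrative only; the data layout and function names are our assumptions, not the Bomerce implementation:

```python
from collections import defaultdict
from itertools import combinations

def copurchase_counts(baskets):
    """Counts how often pairs of products appear in the same basket.

    Bootstrapped from crawled transaction data, such counts can seed a
    'frequently bought together' model that is later refined online
    with the customers' own interaction stream.
    """
    counts = defaultdict(int)
    for basket in baskets:
        # Canonical (sorted) pairs so (a, b) and (b, a) share one counter.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts

def bought_together(counts, product, top_k=3):
    """Top products co-purchased with `product`, by co-occurrence count."""
    related = {}
    for (a, b), c in counts.items():
        if a == product:
            related[b] = c
        elif b == product:
            related[a] = c
    return sorted(related, key=related.get, reverse=True)[:top_k]
```

In a streaming setting the same counters would be updated incrementally as new interaction events arrive, rather than recomputed from scratch.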
The measurable outcomes of this WP will be:
– Scaling of text processing, classification and recommendation to enable a critical mass of text metadata to be created and utilized
– A method to measure business KPIs, to determine where they are useful to augment the predictive power of current models
– Text categorization trained over 10M+ objects classified into 1000+ categories
– An increase in TV program recommendation quality over the collaborative filtering methods of WP2
– Improved business KPIs for Altice Labs and Bomerce clients
Work Package 5
This work package aims at ensuring seamless integration between the technological advances of the project and important real-world applications of the data economy, by deploying business-driven use cases following agile development cycles and evaluating them through appropriate research methodologies.
In order to demonstrate the innovation and business value of our research, we apply Flink STREAMLINE to challenging business-driven big data industrial problems, which require advanced analytics and real-time processing techniques beyond the state of the art, as developed in WP1 through WP4. Each industrial partner will deploy one real-world industrial application based on its business needs. This work package will focus on the design, integration, implementation and evaluation of those applications. The applications consist of Quadruple-play (ALB), Media Content (first year: NMusic), Gaming (Rovio) and Retail (IMR) and are described in detail in section 1.3.1. The task description below is abstract with respect to the use cases, but the work described will be instantiated for each of them during project execution. Development will follow the three-stage cyclic development and corresponding milestones described in section 3.1.1 – Prototype, Pilot and Production Platform.
The major objectives of WP5 are:
1. To elicit well-defined, business-driven technical requirements for the four industrial applications, which will drive the technological work packages WP1 through WP4, and to produce software development specifications detailing the design of each use case (Task 5.1).
2. To integrate Flink STREAMLINE into production platforms of the industrial partners (Task 5.2).
3. To successfully implement and demonstrate the four industrial applications (Task 5.3).
4. To evaluate the developed applications through trials involving real data and user groups (Task 5.4).
The major measurable outcome of WP5 is an improvement of the industrial partners' business performance (cf. Quantifiable Targets in Figure 5). In particular, the four final applications should translate into New Services, Performance Improvement, Cost Reduction and Business Growth, as measured by the KPIs defined in section 2.1.2 in real-world field trials on production platforms. Design and implementation will be documented in deliverables D5.1, D5.3 and D5.5, and the detailed evaluation through KPI measurement in deliverables D5.2, D5.4 and D5.6.
Work Package 6
The key objective of this work package is to develop an effective strategy for disseminating and exploiting results obtained throughout the project amongst research and industrial communities.
In summary, the achievement of the following objectives is envisaged:
• To raise public awareness on the project, its results and progress within defined target groups using effective communication means and strategies (e.g., web site, social media channels);
• To effectively share STREAMLINE results and related information among interested parties and external stakeholders, mainly organizations with the capacity for adding a multiplier effect to the project dissemination;
• To present the project progress, technologies and results in scientific and research publications (in journals and magazines) and through participation in relevant events (conferences, workshops, symposia and exhibitions);
• To ensure high level outreach within the community of potential users, mainly SMEs, and maintain a strong community around STREAMLINE (e.g., meetups, hackathons);
• To set up an exploitation strategy to support market adoption of the solutions developed in the project;
• To exchange experience with projects and other relevant actions working in similar or complementary domains in order to join efforts, minimize duplications and maximize the final impact.
Moreover, WP6 will explore potential paths to exploitation of the STREAMLINE platform including:
• Application of the STREAMLINE platform to the pilots, yielding improvements in respective partners’ core business regarding profiling, recommendations, customer retention and general business insights;
• Engagement of the developer community and SMEs by presenting potential applications of STREAMLINE solutions;
• Preparation of a commercial roadmap, helping and encouraging stakeholders to use STREAMLINE and Flink tools, detailing potential commercial applications, requirements and resources for STREAMLINE deployment and integration;
• Ensuring that IPR issues are handled properly throughout the project.
A measurable outcome will consist of roadmaps for dissemination and commercial usage. Additionally, there will be status reports on dissemination activities produced at the end of each year in the project.
Work Package 7
The purpose of WP7 is to ensure the orderly execution of the project, consisting of financial and administrative coordination as well as technical coordination, risk management and quality assurance.
WP7 aims at ensuring efficient management of the project, including overall technical coordination, maintaining contacts with the contracting authority (the European Commission) and providing an appropriate level of reporting, either at the end of the reporting periods or whenever the need arises. WP7 will also provide strategic impulse to other activities within the project, such as the dissemination and exploitation activities (WP6), including the open source community and service extension strategy, which will clearly link to the overall strategic guidance of the project.
WP7 is devoted to project management. In particular, the specific objectives of this WP are:
• Coordinating the work package leaders, and structuring and restructuring the work packages where necessary, ensuring that all work meets functional requirements.
• Managing efficient interaction with the EC, handling contractual matters, delivery of results and any possible project amendments.
• Providing efficient preparation and coordination of technical and managerial STREAMLINE documentation.
• Organizing and managing STREAMLINE meetings.
• Providing precise quality control to all the project deliverables and other public documents.
• Motivating, fostering, and coordinating partners’ work and cooperation within the consortium.
• Providing risk management and implementing recovery plans in case any risk is detected during the project.
• Monitoring resource expenditures.
• Overseeing the promotion of gender equality and other ethical issues within STREAMLINE.