VISION

To address the ever-increasing deluge of data collected and processed by computing systems, there is a trend toward processing data as close as possible to their source (edge computing). Both the EU and Gartner predict that by 2025 around 80% of enterprise data will be generated and processed outside the traditional cloud. In fact, edge computing is becoming even more attractive with the advent of energy-efficient micro-servers and powerful embedded devices with significant storage and processing capabilities. Another driver toward a cloud-edge computing continuum is 5G, both as a client of continuum resources and as a communication service enabler.

Moreover, the advent of the cloud-edge computing paradigm aggravates the already challenging task of managing a loose federation of heterogeneous and distributed resources, this time at an extreme scale, making traditional human-in-the-loop management unrealistic. To achieve dynamic and flexible system and application management with minimal user involvement, the concept of autonomic computing systems was proposed long ago as “computing systems that can manage themselves given high-level objectives from administrators”. However, the scale, heterogeneity, high dynamicity, and intrinsic variability of the continuum render traditional rule-based approaches to autonomic management insufficient.

The vision of the MLSysOps project is to address these challenges and enable autonomic, efficient, and adaptive end-to-end system management on the heterogeneous and dynamic edge-cloud continuum by using AI models. To this end, MLSysOps will introduce an AI-driven control and management framework that interfaces with off-the-shelf management mechanisms, and will employ a hierarchical, distributed, explainable, and evolving AI architecture for autonomic system operation. Specifically, the MLSysOps framework will employ a multi-level hierarchical ML agent approach, in which agents at the lower levels will be responsible for the training and evolution of ML models that control the local compute and storage nodes, whereas agents at the higher levels will coordinate AI-driven control and ML training within a layer or across layers. The MLSysOps platform will also develop software-based mechanisms that provide security and privacy by integrating and extending Zero Trust Network Access (ZTNA), a technology that operates on an ML-based, adaptive trust model.
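
The multi-level agent hierarchy described above can be illustrated with a minimal sketch. It is not the MLSysOps implementation or API: the class names (NodeAgent, CoordinatorAgent), the methods, and the use of an exponential moving average as a stand-in for a learned model are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class NodeAgent:
    """Hypothetical lower-level agent that owns the model for one node."""
    name: str
    estimate: float = 0.0  # stand-in for the state of a learned model
    alpha: float = 0.5     # smoothing factor of the toy update rule

    def observe(self, load: float) -> None:
        # Toy "training" step: exponential moving average of observed load.
        self.estimate = self.alpha * load + (1 - self.alpha) * self.estimate

    def report(self) -> float:
        # Summary passed upward to the coordinating agent.
        return self.estimate


@dataclass
class CoordinatorAgent:
    """Hypothetical higher-level agent coordinating the agents below it."""
    name: str
    children: list = field(default_factory=list)

    def report(self) -> float:
        # Aggregate the summaries of all children (nodes or coordinators).
        return mean(child.report() for child in self.children)

    def most_loaded(self):
        # A coordination decision could start from the busiest child.
        return max(self.children, key=lambda child: child.report())
```

Because a coordinator exposes the same report() method as a node agent, coordinators can be nested to form further levels, for example a cloud-level CoordinatorAgent whose only child is an edge-level CoordinatorAgent holding several NodeAgent instances.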

MLSysOps will demonstrate its efficacy through two well-defined use cases in precision agriculture and smart cities, utilizing cloud, smart-edge, and deep-edge infrastructures. The use cases correspond to dynamic, high-impact applications with widely heterogeneous demands.