IT decision-makers are using artificial intelligence (AI) and machine learning (ML) in big data projects and state-of-the-art data science models to achieve business objectives and gain efficiencies in use cases across financial services, health care, government and other sectors.
The health care industry, for example, is expected to spend roughly $23 billion globally on big data analytics in the next three years, according to P&S Intelligence. Medical and life sciences organizations are embarking on AI and ML initiatives to unlock complex data sets with the goal of preventing diseases, speeding recovery and improving patient outcomes. Financial services institutions are using these systems to bolster fraud-detection efficacy, and federal governments are applying them to public data sharing to support R&D and improved public services. The list goes on and on.
The sensitive nature of the data used in deep learning projects, along with data ownership issues and regulatory requirements such as the General Data Protection Regulation (GDPR), HIPAA and financial data privacy rules, requires organizations to go to great lengths to keep information private and secure. As a result, data sets that could be tremendously valuable in concert with other initiatives (or organizations) are often locked away and guarded, creating data silos.
However, as a variety of industries begin to spread their wings with AI and ML technology, we’re seeing a groundswell of demand for innovative, trusted and inclusive solutions to the data collaboration problem. Organizations are asking for a way to execute deep-learning algorithms on data sets from multiple parties while ensuring that the source data is not shared or compromised, and that only the results are shared with approved parties.
Here are some key industry insights into this trend, including some history. Information for this eWEEK Data Points article is provided by Nikhil M. Deshpande, Director of AI and Security Solutions Engineering at Intel.
Data Point No. 1: Previous Approaches
A few years back, attempts were made to address this challenge by moving data to the compute mechanism. This approach involved moving data sets from various parties’ edge nodes to a centralized aggregation engine. The data was then run through the aggregation engine at a central location in a Trusted Execution Environment (TEE)–an isolated, private execution environment within a processor, such as that provided by Intel SGX–so only the output or results of the query could be shared, while the data itself was kept private.
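As a rough illustration of this earlier "move data to the compute" pattern, the Python sketch below pools hypothetical data sets at a central engine and releases only an aggregate result; the party names, fields and query are invented for the example, and the TEE is represented only by a comment rather than real Intel SGX APIs.

```python
# Conceptual sketch of the centralized-aggregation approach: each party
# ships its full data set to a central engine, which runs the computation
# and releases only the aggregate result.

def central_aggregation_engine(datasets):
    """In the real design this runs inside a hardware TEE; here it is plain Python."""
    # All raw records are pooled at one location -- the core privacy burden.
    pooled = [record for party_data in datasets for record in party_data]
    # Only a summary statistic (the "result") leaves the engine.
    values = [record["value"] for record in pooled]
    return {"count": len(values), "mean": sum(values) / len(values)}

# Hypothetical data sets from two participating organizations.
hospital_a = [{"value": 4.1}, {"value": 5.3}]
hospital_b = [{"value": 3.9}, {"value": 6.2}, {"value": 5.0}]

# Moving the data: both full data sets travel to the central engine.
result = central_aggregation_engine([hospital_a, hospital_b])
print(result)  # only the aggregate is shared with approved parties
```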
Data Point No. 2: Persistent Challenges
This centralized data aggregation model led to a new set of challenges. Moving data from one site to another can be a significant burden on an organization, whether because of the sheer size of a data set or because data privacy and storage regulations simply make it impossible. Additionally, this approach brought many data normalization challenges. For example, data sets from various health care institutions often come in different file formats, with fields of information that don’t match up with those of other parties. Without a common schema across all participating data sets, aggregation could be incredibly arduous or even impossible. Lastly, “moving data to the compute” required a tremendous amount of upfront commitment and cooperation from IT personnel at each organization involved.
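To make the normalization problem concrete, here is a minimal sketch that maps two hypothetical institutions' record formats onto a shared schema; the field names, units and the common schema itself are assumptions made for the example.

```python
# Two institutions export the "same" information with different fields and units.
records_clinic_a = [{"patient_age": 54, "glucose_mgdl": 110}]
records_clinic_b = [{"age_years": "61", "glucose_mmol_l": 6.4}]

def to_common_schema_a(rec):
    # Clinic A already matches the agreed common schema.
    return {"age": rec["patient_age"], "glucose_mgdl": rec["glucose_mgdl"]}

def to_common_schema_b(rec):
    # Clinic B needs type and unit conversion (mmol/L -> mg/dL) to line up.
    return {"age": int(rec["age_years"]),
            "glucose_mgdl": rec["glucose_mmol_l"] * 18.0}

normalized = ([to_common_schema_a(r) for r in records_clinic_a] +
              [to_common_schema_b(r) for r in records_clinic_b])
print(normalized)  # one schema, ready for aggregation
```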
The overall goal of this early approach was to address the privacy and security problems that were so prevalent in big data collaboration projects. While it provided some benefits, it turned out to be a less-than-optimal method. However, it led to a new approach called Federated Machine Learning.
Data Point No. 3: Federated Machine Learning Emerges
Federated Machine Learning is a distributed machine-learning approach that enables model training on large bodies of decentralized data, ensuring secure, multi-party collaboration on big data projects without compromising the data of any parties involved. Google first coined the term in a paper published back in 2016, and since then the model has been the subject of active research and investment by Google, Intel and other industry leaders, as well as academia.
Data Point No. 4: How Federated Machine Learning Works
In this approach, the data isn’t moved at all. Contrary to previous techniques, the compute actually moves to the data. Federated Machine Learning brings processing mechanisms to the data source for ongoing training and inferencing at the source, instead of requiring that participating organizations migrate data to one centralized location. As a result, processing is done by each organization onsite (or in-network), and only the resulting model updates are sent to a centralized location, where the shared model is updated through aggregation.
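A minimal, framework-free sketch of that flow might look like the following: each participant trains locally on its own data and sends back only a model update, which the central aggregator averages in the spirit of federated averaging. The toy linear model, learning rate and data sets are all hypothetical.

```python
import random

# Toy global model: one weight for a linear predictor y = w * x.
global_w = 0.0

def local_training(w, data, lr=0.01, epochs=5):
    """Runs at the participant's site; the raw data never leaves this function."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # gradient of the squared error
            w -= lr * grad
    return w  # only the updated weight (the "result") is shared

# Hypothetical private data sets held by three organizations (true slope ~3).
parties = [[(x, 3 * x + random.uniform(-0.5, 0.5)) for x in range(1, 6)]
           for _ in range(3)]

for round_num in range(10):
    # Compute moves to the data: each party trains on-site from the global model.
    local_updates = [local_training(global_w, data) for data in parties]
    # The central aggregator only averages the returned weights.
    global_w = sum(local_updates) / len(local_updates)

print(round(global_w, 3))  # should approach 3.0 without pooling any raw data
```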
Federated Machine Learning addresses some major data collaboration privacy issues, but we’re still left with some questions to answer. For instance, the data might remain private, but is the aggregation model secure from theft or tampering that could lead to data leakage? Or if the model itself is secure, are the communication links between federated nodes and the aggregator secure from interference?
Data Point No. 5: The Role of Hardware in Federated Machine Learning
To answer these questions, we have to look at the role hardware technologies play in the Federated Machine Learning process. As big data projects leveraging AI and ML continue to take off, participants must be protected through security layers down to the silicon. These federated learning systems deploy hardware-based TEEs at the participants’ edge nodes as well as at the central aggregation engine (where the aggregation model resides). This would ensure that model training at the edge and the aggregated model itself are computed inside a trusted environment, protecting the confidentiality and integrity of code and data. The communications between the edge nodes and the aggregation engine would also be protected from tampering. This would remove many of the issues that arise from moving data to the compute.
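The sketch below illustrates just one piece of that picture, integrity-protected communication between an edge node and the aggregator. It uses a simple HMAC over the model update as a stand-in for the attestation and protected channels a real TEE deployment would provide; it does not use actual Intel SGX APIs, and the shared key handling is deliberately simplified for the example.

```python
import hashlib
import hmac
import json

# Stand-in for a key established between the edge TEE and the aggregation TEE
# (in practice this would come from remote attestation, not a hard-coded value).
SHARED_KEY = b"example-session-key"

def edge_send_update(weights):
    """Inside the edge TEE: sign the model update so in-transit tampering is detectable."""
    payload = json.dumps(weights).encode()
    tag = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return payload, tag

def aggregator_receive(payload, tag):
    """Inside the aggregation TEE: verify integrity before using the update."""
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        raise ValueError("update rejected: possible tampering in transit")
    return json.loads(payload)

payload, tag = edge_send_update({"w": 2.97})
print(aggregator_receive(payload, tag))  # verified and accepted

try:
    aggregator_receive(payload + b"tampered", tag)
except ValueError as err:
    print(err)  # a modified update is rejected before it reaches the model
```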
In a Federated Machine Learning model, both compute and data would be protected at the hardware level across the entire system, within TEEs. This helps the parties involved gain confidence in the privacy and security of both the dataset and the machine learning model, protecting against confidentiality leaks and data integrity attacks.
Data Point No. 6: The Road to Widespread Adoption
As we look ahead, we can expect hardware TEE-enabled Federated Machine Learning to produce major breakthroughs in big data collaboration. Imagine a future where this technology facilitates a trusted, global data-sharing playing field that enables organizations to unlock previously untapped data sets for collaborative analysis with other organizations. Access to large, reliable datasets is essential to the development and deployment of robust and trusted AI/ML solutions across every industry.
There’s no doubt that creating a trusted, decentralized data collaboration model will generate far-reaching benefits, but there’s still a significant amount of work to be done in order to reach widespread commercial adoption. The industry needs technology leaders in computing hardware and blockchain, world governments, regulatory bodies around the globe, standards organizations, public and private participants, and more to collaborate with one another.
As with any machine learning application, access to data will be a key to success. Organizations within many data science disciplines must work together to develop a common schema across various data sets, ensuring the availability of high-quality, unbiased data.
If you have a suggestion for an eWEEK Data Points article, email cpreimesberger@eweek.com.