The power of remote engine execution for ETL/ELT data pipelines


Business leaders risk compromising their competitive edge if they do not proactively implement generative AI (gen AI). However, businesses scaling AI face entry barriers. Organizations require reliable data for robust AI models and accurate insights, yet the current technology landscape presents unparalleled data quality challenges.

According to International Data Corporation (IDC), stored data is set to increase by 250% by 2025, with data rapidly propagating on-premises and across clouds, applications and locations with compromised quality. This situation will exacerbate data silos, increase costs and complicate the governance of AI and data workloads. 

The explosion of data volume across formats and locations, together with the pressure to scale AI, presents a daunting task for those responsible for deploying AI. Data must be combined and harmonized from multiple sources into a unified, coherent format before being used with AI models. Unified, governed data can also be put to use for various analytical, operational and decision-making purposes. This process is known as data integration, one of the key components of a strong data fabric. End users cannot trust their AI output without a proficient data integration strategy to integrate and govern the organization’s data.

The next level of data integration

Data integration is vital to modern data fabric architectures, especially since an organization’s data typically spans hybrid, multicloud environments and multiple formats. With data residing in various disparate locations, data integration tools have evolved to support multiple deployment models. With the increasing adoption of cloud and AI, fully managed deployments for integrating data from diverse, disparate sources have become popular. For example, fully managed deployments on IBM Cloud enable users to take a hands-off approach with a serverless service and benefit from application efficiencies like automatic maintenance, updates and installation.

Another deployment option is the self-managed approach, such as a software application deployed on-premises, which offers users full control over their business-critical data, thus lowering data privacy, security and sovereignty risks.

The remote execution engine is a technical development that takes data integration to the next level. It combines the strengths of fully managed and self-managed deployment models to give end users the utmost flexibility.

There are several styles of data integration. Two of the more popular methods, extract, transform, load (ETL) and extract, load, transform (ELT), are both highly performant and scalable. Data engineers build data pipelines, also called data integration tasks or jobs, as incremental steps that perform data operations, and then orchestrate these pipelines in an overall workflow. ETL/ELT tools typically have two components: a design time (to design data integration jobs) and a runtime (to execute data integration jobs).

From a deployment perspective, the two components have been packaged together, until now. Remote engine execution is revolutionary in the sense that it decouples design time and runtime, creating a separation between the control plane and the data plane where data integration jobs are run. The remote engine manifests as a container that can be run on any container management platform or natively on any cloud container service. The remote execution engine can run data integration jobs for cloud to cloud, cloud to on-premises, and on-premises to cloud workloads. This enables you to keep the design time fully managed while you deploy the engine (runtime) in a customer-managed environment: on any cloud (for example, in your VPC), in any data center and in any geography.

This innovative flexibility keeps data integration jobs closest to the business data with the customer-managed runtime. It prevents the fully managed design time from touching that data, improving security and performance while retaining the application efficiency benefits of a fully managed model.
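To make the separation concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the DataStage API: the names `JobSpec`, `RunReport`, `ControlPlane` and `RemoteEngine` are hypothetical, and plain dictionaries stand in for real data stores. The point it tries to show is that the control plane holds only the job design and receives only run metadata, while the rows never leave the engine's environment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobSpec:
    """Design-time artifact: describes what to do, holds no data."""
    name: str
    source: str
    target: str
    transform: str  # name of a transform registered on the engine


@dataclass(frozen=True)
class RunReport:
    """Run metadata sent back to the control plane (no row contents)."""
    job: str
    rows_read: int
    rows_written: int


@dataclass
class ControlPlane:
    """Hypothetical fully managed design time."""
    jobs: Dict[str, JobSpec] = field(default_factory=dict)
    reports: List[RunReport] = field(default_factory=list)

    def publish(self, spec: JobSpec) -> None:
        self.jobs[spec.name] = spec

    def fetch(self, name: str) -> JobSpec:
        return self.jobs[name]

    def record(self, report: RunReport) -> None:
        self.reports.append(report)


class RemoteEngine:
    """Hypothetical customer-managed runtime that sits next to the data."""

    def __init__(self, stores: Dict[str, list],
                 transforms: Dict[str, Callable[[dict], dict]]) -> None:
        self.stores = stores          # local data, never sent upstream
        self.transforms = transforms  # transforms available on this engine

    def run(self, control: ControlPlane, job_name: str) -> None:
        spec = control.fetch(job_name)            # pull the design only
        rows = self.stores[spec.source]
        out = [self.transforms[spec.transform](r) for r in rows]
        self.stores[spec.target] = out            # data stays local
        control.record(RunReport(spec.name, len(rows), len(out)))


# Usage: design once on the control plane, run on the remote engine.
control = ControlPlane()
control.publish(JobSpec("mask_accounts", "raw", "clean", "mask"))
engine = RemoteEngine(
    stores={"raw": [{"account": "12345678", "amount": 40}]},
    transforms={"mask": lambda r: {**r, "account": "****" + r["account"][-4:]}},
)
engine.run(control, "mask_accounts")
```

In this sketch the same `JobSpec` could be run by any number of engines in different environments, which is the "design once, run anywhere" property in miniature.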

The remote engine allows ETL/ELT jobs to be designed once and run anywhere. To reiterate, the remote engine’s ability to provide ultimate deployment flexibility has compounding benefits:

  • Users reduce data movement by executing pipelines where data lives.
  • Users lower egress costs.
  • Users minimize network latency.
  • As a result, users boost pipeline performance while ensuring data security and controls.

While there are several business use cases where this technology is advantageous, let’s examine these three: 

1. Hybrid cloud data integration

Traditional data integration solutions often face latency and scalability challenges when integrating data across hybrid cloud environments. With a remote engine, users can run data pipelines anywhere, pulling from on-premises and cloud-based data sources, while still maintaining high performance. This enables organizations to use the scalability and cost-effectiveness of cloud resources while keeping sensitive data on-premises for compliance or security reasons.

Use case scenario: Consider a financial institution that needs to aggregate customer transaction data from both on-premises databases and cloud-based SaaS applications. With a remote runtime, they can deploy ETL/ELT pipelines within their virtual private cloud (VPC) to process sensitive data from on-premises sources while still accessing and integrating data from cloud-based sources. This hybrid approach helps to ensure compliance with regulatory requirements while taking advantage of the scalability and agility of cloud resources.

2. Multicloud data orchestration and cost savings

Organizations are increasingly adopting multicloud strategies to avoid vendor lock-in and to use best-in-class services from different cloud providers. However, orchestrating data pipelines across multiple clouds can be complex and expensive due to ingress and egress operating expenses (OpEx). Because the remote runtime engine supports any flavor of containers or Kubernetes, it simplifies multicloud data orchestration by allowing users to deploy on any cloud platform and with ideal cost flexibility.

Transformation styles like TETL (transform, extract, transform, load) and SQL Pushdown also synergize well with a remote engine runtime to capitalize on source/target resources and limit data movement, thus further reducing costs. With a multicloud data strategy, organizations need to optimize for data gravity and data locality. In TETL, transformations are initially executed within the source database to process as much data as possible locally before following the traditional ETL process. Similarly, SQL Pushdown for ELT pushes transformations to the target database, allowing data to be extracted, loaded, and then transformed within or near the target database. Used alongside a remote runtime engine, these integration patterns minimize data movement, latency and egress fees and improve pipeline performance, while still giving users flexibility in designing pipelines for their use case.
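The difference between transforming in the engine and pushing the transformation down can be sketched in a few lines of Python. This is an illustration only: the standard-library `sqlite3` module stands in for a target database, and the table names `raw_orders` and `orders_by_region` and both function names are hypothetical. It is not how any particular product generates pushdown SQL.

```python
import sqlite3

# In-memory SQLite database standing in for the target system.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")
target.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("east", 10.0), ("east", 15.0), ("west", 7.5)],
)


def transform_in_engine(conn):
    """ETL style: pull every row out to the engine, then aggregate there."""
    rows = conn.execute("SELECT region, amount FROM raw_orders").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)  # second value: rows moved out of the database


def transform_with_pushdown(conn):
    """ELT with pushdown: the database runs the aggregation itself."""
    conn.execute("DROP TABLE IF EXISTS orders_by_region")
    conn.execute(
        "CREATE TABLE orders_by_region AS "
        "SELECT region, SUM(amount) AS total FROM raw_orders GROUP BY region"
    )
    # Only the small result set is read back, here just to compare outputs.
    result = conn.execute(
        "SELECT region, total FROM orders_by_region"
    ).fetchall()
    return dict(result)


engine_totals, rows_moved = transform_in_engine(target)
pushdown_totals = transform_with_pushdown(target)
```

Both paths produce the same totals, but the first moves every source row to the engine while the second sends only a SQL statement to the database. On real data volumes that is where the savings in data movement and egress would come from.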

Use case scenario: Suppose that a retail company uses a combination of Amazon Web Services (AWS) for hosting their e-commerce platform and Google Cloud Platform (GCP) for running AI/ML workloads. With a remote runtime, they can deploy ETL/ELT pipelines on both AWS and GCP, enabling seamless data integration and orchestration across multiple clouds. This ensures flexibility and interoperability while using the unique capabilities of each cloud provider.

3. Edge computing data processing

Edge computing is becoming increasingly prevalent, especially in industries such as manufacturing, healthcare and IoT. However, traditional ETL deployments are often centralized, making it challenging to process data at the edge where it is generated. The remote execution concept unlocks the potential for edge data processing by allowing users to deploy lightweight, containerized ETL/ELT engines directly on edge devices or within edge computing environments.

Use case scenario: A manufacturing company needs to perform near real-time analysis of sensor data collected from machines on the factory floor. With a remote engine, they can deploy runtimes on edge computing devices within the factory premises. This enables them to preprocess and analyze data locally, reducing latency and bandwidth requirements, while still maintaining centralized control and management of data pipelines from the cloud.
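As a rough illustration of local preprocessing in this scenario, the Python sketch below collapses raw sensor readings into one summary record per machine, so that only the summaries would need to leave the factory. The `(machine_id, temperature)` schema, the machine names and the function `summarize_at_edge` are assumptions made for the example, not part of any product.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Reading = Tuple[str, float]  # (machine_id, temperature); hypothetical schema


def summarize_at_edge(readings: Iterable[Reading]) -> List[Dict[str, object]]:
    """Collapse raw readings into one summary record per machine."""
    by_machine: Dict[str, List[float]] = {}
    for machine_id, value in readings:
        by_machine.setdefault(machine_id, []).append(value)
    # Sorted by machine id so the output order is deterministic.
    return [
        {"machine": m, "count": len(v), "mean": mean(v), "max": max(v)}
        for m, v in sorted(by_machine.items())
    ]


# Three raw readings become two summary records.
raw = [("press-1", 70.0), ("press-1", 74.0), ("lathe-2", 55.0)]
summary = summarize_at_edge(raw)
```

The reduction grows with the sampling rate: thousands of readings per machine per window still produce a single record per machine.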

Unlock the power of the remote engine with DataStage-aaS Anywhere

The remote engine helps take an enterprise’s data integration strategy to the next level by providing ultimate deployment flexibility, enabling users to run data pipelines wherever their data resides. Organizations can harness the full potential of their data while reducing risk and lowering costs. Embracing this deployment model empowers developers to design data pipelines once and run them anywhere, building resilient and agile data architectures that drive business growth. Users can work from a single design canvas and toggle between different integration patterns (ETL, ELT with SQL Pushdown, or TETL), without any manual pipeline reconfiguration, to best suit their use case.

IBM® DataStage®-aaS Anywhere benefits customers by using a remote engine, which enables data engineers of any skill level to run their data pipelines within any cloud or on-premises environment. In an era of increasingly siloed data and the rapid growth of AI technologies, it’s important to prioritize secure and accessible data foundations. Get a head start on building a trusted data architecture with DataStage-aaS Anywhere, the NextGen solution built by the trusted IBM DataStage team.


The post The power of remote engine execution for ETL/ELT data pipelines appeared first on IBM Blog.