Introducing a cloud environment into an organization can be an overwhelming process. There are many aspects to consider, including how to incorporate the new cloud environment into your network infrastructure, how to authenticate users across environments, and how file sharing will work between on-premise and cloud environments. But one of the most common and foundational questions is: How will we move our data to the cloud? And, relatedly: What might our future-state cloud data environment look like? This article addresses those core questions, looking at:
- Cloud migration
- Data management when moving to the cloud
- Cloud data access
1. Cloud Migration
Key considerations when planning a cloud data migration are:
- Where will your data be stored in the cloud, and what database options are there?
- What cloud architecture will you need, in terms of data lakes, data warehouses, etc.?
- How will you move your data from on-premise to the cloud, and how will you transform it in the cloud environment?
- What security considerations do you need to take into account?
1.a. Cloud Data Storage. There are two primary types of cloud data storage:
- Cloud Storage
- Database Storage
Cloud Storage can be thought of as simply a folder structure, not unlike a typical network file share structure (\\share\folder\file). Cloud storage is often referred to as a ‘bucket’ (e.g. an Amazon S3 bucket).
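As a simple illustration, the short Python sketch below (using boto3, the AWS SDK; the bucket and object names are hypothetical) shows how a 'folder' in cloud storage is really just a key prefix within a bucket:

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Upload a local file into a "folder" (really just a key prefix) in the bucket.
s3.upload_file(
    Filename="trades_2024-01-31.csv",
    Bucket="example-landing-bucket",  # hypothetical bucket name
    Key="landing/trading-system/trades_2024-01-31.csv",
)

# List everything under that prefix, much like browsing \\share\folder.
response = s3.list_objects_v2(
    Bucket="example-landing-bucket",
    Prefix="landing/trading-system/",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```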
Database Storage has multiple options: Relational, Columnar, or Key/Value Pair (a brief sketch contrasting them follows this list).
- Relational is the most typical database architecture, and the one most people will be familiar with. Data is stored in tables, tables are connected through primary/foreign key relationships, and an SQL-compliant language is used to access data and manage database objects (e.g. tables, views, indexes).
- Columnar, an approach that allows high performance when working with large datasets, can look and feel like Relational (e.g. tables, views), but data is not indexed, keys (primary/foreign) are not enforced through SQL, and data is stored and accessed at a columnar level.
- Key/Value Pair has a completely different structure from Relational or Columnar: data is saved as pairs of keys (the field names, e.g. FirstName, LastName) and values (the data, e.g. John, Smith). Because a fixed structure is not imposed, large amounts of incoming data (e.g. streaming data) can be stored efficiently.
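To make the contrast concrete, here is a minimal Python sketch (table and field names are illustrative) that stores the same record relationally, with an enforced schema and primary key, and as schema-free key/value pairs. Columnar storage is omitted because, from the query side, it looks much like the relational example:

```python
import sqlite3

# Relational: a fixed schema, rows in a table, and an enforced primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (1, "John", "Smith"))
print(conn.execute("SELECT first_name, last_name FROM customer").fetchall())

# Key/Value Pair: no imposed schema; each record is simply a set of
# key/value pairs, so new fields can arrive (e.g. from a stream) without
# altering any table definition.
record = {"CustomerId": 1, "FirstName": "John", "LastName": "Smith"}
record["LoyaltyTier"] = "Gold"  # an extra field accepted without schema change
print(record)
```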
1.b. Cloud Data Structure. Data in the cloud is typically organized into the following layers (an illustrative layout sketch follows the list):
- Landing (staging area for data). The Landing area provides a central location to store data from on-premise and external sources. Its design should take into account the structure (e.g. folder location), details of the data source (e.g. name, purpose, source), and timing (e.g. update cycle). Cloud Storage is typically used for the Landing model and data.
- Data Lake (data segregated by subject). The Data Lake is a subject-based repository of resolved data. From Landing to Data Lake, the pipeline (data processing) might incorporate data resolution (e.g. data types, data relationships, data completeness, other data quality), data matching across dimensions (e.g. customer, product), and data deduplication. The model and data for the Data Lake can exist either in Cloud or Database Storage (most likely Columnar).
- Data Warehouse (dimensionally modeled data for reporting and analytics). The Data Warehouse is also a subject-based repository but, unlike a Data Lake, data is joined for downstream reporting and analytics in the form of facts (e.g. values, amounts) and dimensions (e.g. by customer, by entity, by business line, by product). The model and data for the Data Warehouse usually exist in Database Storage (most likely Columnar for large datasets).
- Data Mart (aggregated data for reporting). Data Marts contain aggregated data, selected from the Data Warehouse to be used for reporting purposes. The model and data for the Data Mart usually exist in Database Storage (most likely Columnar for large datasets).
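One possible layout, purely illustrative, keeps Landing and Data Lake in Cloud Storage prefixes and places the Data Warehouse and Data Mart in database schemas; every name below is a hypothetical example:

```python
from datetime import date

# Landing: organized by source system and load date so each drop is traceable.
landing_key = f"landing/trading-system/{date.today():%Y/%m/%d}/trades.csv"

# Data Lake: organized by subject area after resolution and deduplication.
lake_key = "lake/trades/trades_resolved.parquet"

# Data Warehouse and Data Mart: dimensional and aggregated tables in a database.
warehouse_tables = ["dw.fact_trade", "dw.dim_customer", "dw.dim_product"]
mart_tables = ["mart.trade_summary_by_customer"]

print(landing_key, lake_key, warehouse_tables, mart_tables)
```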
1.c. On-Premise and Cloud Data Movement. Data movement (‘pipelines’) may occur through multiple processes across and within environments:
- On-Premise to Landing data movement. From the on-premise environment, data is usually replicated as-is to the Landing area (see 1.b above). Examples include data from trading systems, accounting or sub-ledger systems, and risk systems.
- External to Landing data movement. External data is sourced into the Landing area via FTP/SFTP or APIs. Examples include market data, instrument data, party data and, increasingly, alternative data.
- Intra-Cloud data movement. Moving data from Landing to Data Lake, Data Lake to Data Warehouse, and Data Warehouse to Data Mart is handled by ETL/ELT applications and/or custom pipelines that move and transform data within the cloud environment (a minimal pipeline sketch follows this list).
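As a minimal sketch of one intra-cloud pipeline step, the Python snippet below moves a single file from Landing to the Data Lake, applying simple resolution steps along the way. It assumes boto3 and pandas (with a Parquet engine such as pyarrow) are available; the bucket, keys, and column names are hypothetical:

```python
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "example-data-bucket"  # hypothetical bucket name

# Extract: read a raw drop from the Landing area.
raw = s3.get_object(Bucket=bucket, Key="landing/trading-system/2024/01/31/trades.csv")
trades = pd.read_csv(io.BytesIO(raw["Body"].read()))

# Transform: basic resolution on the way to the Data Lake (types, deduplication).
trades["trade_date"] = pd.to_datetime(trades["trade_date"])
trades = trades.drop_duplicates(subset=["trade_id"])

# Load: write the resolved data to the Data Lake area as Parquet.
buffer = io.BytesIO()
trades.to_parquet(buffer, index=False)
s3.put_object(Bucket=bucket, Key="lake/trades/2024-01-31.parquet", Body=buffer.getvalue())
```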
1.d. Cloud Data Security. Security of the data is a paramount consideration for the overall strategy and architecture of any cloud migration. The following states of data should be considered when moving to the cloud:
- At Rest. When data is landed in Cloud Storage, it should be encrypted at rest so that it is not stored in its original, readable form.
- In Transit. During the transit process, a secured method should be used for data access (e.g. HTTPS, SSL, TLS, FTPS).
- In Use. Data encryption (and decryption) and secured access methods should be used when accessing data from on-premise, external, or cloud environments.
Combining these methods provides optimal data security; a brief sketch of applying at-rest and in-transit protection follows.
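As one hedged example of combining these protections on AWS: boto3 uses HTTPS endpoints by default (covering data in transit), and server-side encryption can be requested when writing an object (covering data at rest). The bucket name and KMS key alias below are hypothetical:

```python
import boto3

# boto3 talks to S3 over HTTPS by default, covering encryption in transit.
s3 = boto3.client("s3")

# Request server-side encryption so the object is encrypted at rest;
# SSE-KMS with a customer-managed key is one common choice.
with open("positions.csv", "rb") as f:
    s3.put_object(
        Bucket="example-landing-bucket",          # hypothetical bucket
        Key="landing/risk-system/positions.csv",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/example-landing-key",  # hypothetical KMS key alias
    )
```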
2. Data Management when Moving to the Cloud
Data Catalog and Metadata. What data is available in your cloud environment, and who owns it? A data catalog is key to understanding what data is available, what its source is, where it resides, and who owns it. The catalog should include technical information such as formats, as well as information such as classification, ownership, and any jurisdictional aspects, especially those that might drive regulatory compliance. As data assets do not all have equal value, a data catalog strategy should be developed for your cloud migration that determines catalog granularity based on business value. Business definitions (a data dictionary) are an additional key piece of metadata.
The design should include automatic capture of metadata where possible, including aspects such as ownership (e.g. data stewards), provenance (e.g. source system), rights-of-use (e.g. data use information, contractual or other obligations), and timeliness. Metadata should also include metrics and SLAs to support automated data quality efforts, including attribute-based data quality statistics collected during the data movement process. It should also include lifecycle information to support automated adherence to retention and archiving policies.
Any related data can also be associated in the metadata to aid discovery and avoid duplication. Any information about taxonomies (hierarchical classifications) or associated ontologies (e.g. geographic, sector, departmental, product, lifecycle) should also be added.
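To show how these metadata elements might hang together, here is a minimal, illustrative catalog entry expressed as a Python dataclass; the field names and values are hypothetical and would normally live in a dedicated catalog tool:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A minimal catalog record mirroring the metadata discussed above."""
    name: str
    location: str                  # where the data physically resides
    source_system: str             # provenance
    owner: str                     # data steward / ownership
    classification: str            # e.g. Public, Internal, Confidential
    jurisdiction: str              # drives any regulatory considerations
    rights_of_use: str             # contractual or other usage obligations
    update_cycle: str              # timeliness, e.g. daily, intraday
    retention_years: int           # supports retention and archiving policies
    business_definition: str = ""  # data dictionary entry
    related_assets: list = field(default_factory=list)  # aids discovery

entry = CatalogEntry(
    name="trades",
    location="s3://example-data-bucket/lake/trades/",
    source_system="trading-system",
    owner="front-office-data-steward",
    classification="Confidential",
    jurisdiction="EU",
    rights_of_use="internal-analytics-only",
    update_cycle="daily",
    retention_years=7,
    business_definition="Executed trades, resolved and deduplicated by trade_id",
    related_assets=["lake/positions"],
)
```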
3. Cloud Data Access
How is data accessed in the cloud? Is the data secure? From a control perspective, who has accessed it?
Querying Data. Across the environment, access to data should be managed through an interface structure to provide consistent access, security, and logging. This might be an API layer between the user and the data that incorporates both in-transit and in-use encryption, while also taking into account PII (personally identifiable information) considerations and any rights-of-use limitations.
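A framework-agnostic sketch of such an access layer is shown below: every query passes through one function that checks the caller's entitlements and masks PII fields. The role name, field names, and masking rule are all hypothetical:

```python
# Fields treated as PII in this illustrative example.
PII_FIELDS = {"first_name", "last_name", "email"}

def query_data(user_roles: set, rows: list) -> list:
    """Return rows with PII masked unless the caller holds the 'pii_reader' role."""
    if "pii_reader" in user_roles:
        return rows
    return [
        {key: ("***" if key in PII_FIELDS else value) for key, value in row.items()}
        for row in rows
    ]

rows = [{"customer_id": 1, "first_name": "John", "balance": 1200.0}]
print(query_data({"analyst"}, rows))      # PII masked
print(query_data({"pii_reader"}, rows))   # PII visible
```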
Logging Data Access. Centralized logging is key for determining who has accessed data, when and where the access occurred and, ideally, the purpose of the access. From an audit and control perspective, this becomes difficult when access logs are scattered across many locations or are incomplete.
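A minimal sketch of centralized access logging, assuming the standard Python logging module ships its records to a single aggregation point, might look like this (the user, dataset, and purpose values are illustrative):

```python
import json
import logging
from datetime import datetime, timezone

# One logger for all data access, so audit queries hit a single, consistent log.
logging.basicConfig(level=logging.INFO)
access_log = logging.getLogger("data_access")

def log_access(user: str, dataset: str, purpose: str, source_ip: str) -> None:
    """Emit a structured record of who accessed what, when, from where, and why."""
    access_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "purpose": purpose,
        "source_ip": source_ip,
    }))

log_access("jsmith", "lake/trades", "monthly risk report", "10.0.0.12")
```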
Prior to moving to the cloud, developing a robust strategy and plan is critical to ensuring a fit-for-purpose solution. Defining and prioritizing requirements for your cloud migration based on the above guidelines can assist in determining a strategic direction and the overall architecture. And let’s not forget fundamentals such as testing: the overall plan needs to include a test plan to validate the implementation of each requirement — be it technology or data — and ensure a successful solution.