This Guidance demonstrates how to build a scientific data management system that integrates both laboratory instrument data and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. It starts once an experiment or project is initiated and electronic lab notebooks (ELNs) or lab information management systems (LIMS) notify a metadata catalog on AWS. After data is collected from instruments, the data moves to a data store that is associated with the metadata catalog. Bioinformatics results are captured in the data store, with all new files linked to the ELN or LIMS through the metadata store. All data is governed and discoverable by searching the metadata to find data assets, or by configuring a natural language search with a chat interface.
Architecture Diagram
Overview
This architecture diagram shows an overview of how you can accelerate the launch of a scientific data management system that integrates both your laboratory instruments and software with cloud data governance, data discovery, and bioinformatics pipelines, capturing key metadata events along the way. For more details on each component, open the other tab.
Goal:
This Guidance provides details on the implementation best practices for metadata enrichment, search, and bioinformatics processing as part of Building Digitally Connected Labs with AWS.
Conceptual overview:
When a scientist or technician sets up an experiment in the electronic lab notebook (ELN) or lab information management system (LIMS), this notifies the metadata catalog of the new experiment and configures the data store to receive that instrument data. Later, when lab instrument data is collected, the data moves to the data store that was pre-associated with the metadata store.
Bioinformatics processing steps and output files are captured within the data store, and all new files are linked to the ELN or LIMS through the metadata store.
Data is governed and discoverable by metadata search, or by natural language search through a chat interface.
Main Architecture
This architecture diagram shows the main architecture and provides more details about each component. For more details and architectural considerations, visit the Implementation Guide.
Step 1
A scientist or technician sets up project or experiment metadata within the ELN, LIMS, or other experiment or testing database.
Step 2
Upon the creation of an experiment in the ELN, the ELN creates an event that calls an AWS API to send an Experiment ID. With the Experiment ID received, an AWS Lambda function calls the ELN's API to retrieve all of the experiment metadata that is relevant to contextualize the data for discovery.
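As an illustration of this step, the following minimal Lambda handler sketch assumes an API Gateway proxy integration that passes the Experiment ID in the request body and a hypothetical ELN REST endpoint (ELN_API_URL) that returns experiment metadata as JSON; the endpoint, key, and field names are placeholders, not part of this Guidance.

```python
import json
import os
import urllib.request

# Hypothetical ELN REST endpoint and API key, supplied through environment variables.
ELN_API_URL = os.environ.get("ELN_API_URL", "https://eln.example.com/api/v1/experiments")
ELN_API_KEY = os.environ.get("ELN_API_KEY", "replace-me")


def handler(event, context):
    """Receive an Experiment ID from API Gateway and pull its metadata from the ELN."""
    body = json.loads(event.get("body") or "{}")
    experiment_id = body["experimentId"]

    # Call the ELN's API to retrieve the experiment metadata used to contextualize the data.
    request = urllib.request.Request(
        f"{ELN_API_URL}/{experiment_id}",
        headers={"Authorization": f"Bearer {ELN_API_KEY}"},
    )
    with urllib.request.urlopen(request) as response:
        experiment_metadata = json.loads(response.read())

    # Downstream steps create the data store and catalog entries from this metadata.
    return {"statusCode": 200, "body": json.dumps(experiment_metadata)}
```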
Step 3
The Lambda function sets up Amazon Simple Storage Service (Amazon S3) buckets as a scientific data store. The setup includes the naming of related folders, based on the unique Experiment ID. At this step, the data stores are empty. In addition, next-generation sequencing (NGS) data can be stored in AWS HealthOmics.
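A minimal sketch of this provisioning using boto3 is shown below; the bucket names, prefix layout, and sequence store name are illustrative assumptions rather than values prescribed by this Guidance.

```python
import boto3

s3 = boto3.client("s3")
omics = boto3.client("omics")


def provision_data_store(experiment_id: str) -> None:
    """Create empty, experiment-scoped prefixes in the data store buckets."""
    # Assumed bucket names; substitute the buckets used by your deployment.
    for bucket in ("sdms-landing-zone", "sdms-raw", "sdms-processed", "sdms-final"):
        # A zero-byte object with a trailing slash acts as an empty "folder" for the experiment.
        s3.put_object(Bucket=bucket, Key=f"{experiment_id}/")

    # Optional: a HealthOmics sequence store for NGS data tied to the same experiment.
    omics.create_sequence_store(
        name=f"sequences-{experiment_id}",
        description=f"NGS data for experiment {experiment_id}",
    )
```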
Step 4
The Lambda function writes the metadata that it has collected into an Amazon DataZone metadata catalog. This is done by creating Amazon DataZone data assets, adding metadata forms to those assets, and assigning the metadata to those fields.
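The sketch below shows one way this step might register an experiment as an Amazon DataZone asset with an attached metadata form; the domain ID, project ID, asset type, and form type are placeholders for identifiers defined in your own DataZone domain.

```python
import json

import boto3

datazone = boto3.client("datazone")


def register_experiment_asset(experiment_id: str, metadata: dict) -> str:
    """Create a DataZone data asset and attach ELN metadata to it as a form."""
    response = datazone.create_asset(
        domainIdentifier="dzd_EXAMPLE",          # placeholder DataZone domain ID
        owningProjectIdentifier="prj_EXAMPLE",   # placeholder DataZone project ID
        name=f"experiment-{experiment_id}",
        typeIdentifier="ExperimentAssetType",    # placeholder custom asset type in your domain
        formsInput=[
            {
                "formName": "ExperimentMetadataForm",        # placeholder custom form
                "typeIdentifier": "ExperimentMetadataFormType",
                "content": json.dumps(metadata),             # ELN metadata as form content
            }
        ],
    )
    return response["id"]
```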
Step 5
The Lambda function calls the ELN's API to add the location of the new data asset to the experiment entity within the ELN.
Step 6
Scientists run instruments to collect data and save the data to a network-accessible folder that is monitored by AWS DataSync.
Step 7
DataSync ingests the data into the landing zone bucket within the data store.
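As a hedged sketch of this transfer, the following boto3 calls create a DataSync task from an on-premises SMB share to the landing zone bucket and start an execution; the hostnames, ARNs, credentials, and bucket name are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Placeholder identifiers; substitute the agent, role, and share used in your environment.
AGENT_ARN = "arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE"
BUCKET_ROLE_ARN = "arn:aws:iam::111122223333:role/datasync-landing-zone-access"

# Source: the network-accessible folder that instruments write to.
source = datasync.create_location_smb(
    ServerHostname="instrument-share.example.local",
    Subdirectory="/instrument-output",
    User="datasync-user",
    Password="replace-me",
    AgentArns=[AGENT_ARN],
)

# Destination: the landing zone bucket within the data store.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::sdms-landing-zone",
    S3Config={"BucketAccessRoleArn": BUCKET_ROLE_ARN},
)

# A task ties the two locations together; executions perform the actual transfer.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="instrument-folder-to-landing-zone",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```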
Step 8
The writing of the file to the data store invokes an event, initiating AWS Step Functions. Step Functions imports the instrument data from the landing zone into the relevant raw bucket within Amazon S3, and optionally into the relevant read set within the HealthOmics sequence store.
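A sketch of the file-promotion logic that a Step Functions task (for example, a Lambda state) might run in this step follows; the bucket names, sequence store ID, and role ARN are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")
omics = boto3.client("omics")

RAW_BUCKET = "sdms-raw"                       # assumed raw bucket name
SEQUENCE_STORE_ID = "1234567890"              # assumed HealthOmics sequence store ID
IMPORT_ROLE_ARN = "arn:aws:iam::111122223333:role/omics-import"  # assumed import role


def promote_landing_file(landing_bucket: str, key: str, experiment_id: str) -> None:
    """Copy a newly landed instrument file into the raw bucket and, for FASTQ files,
    import it as a read set in the HealthOmics sequence store."""
    s3.copy_object(
        Bucket=RAW_BUCKET,
        Key=f"{experiment_id}/{key}",
        CopySource={"Bucket": landing_bucket, "Key": key},
    )

    if key.endswith((".fastq", ".fastq.gz")):
        omics.start_read_set_import_job(
            sequenceStoreId=SEQUENCE_STORE_ID,
            roleArn=IMPORT_ROLE_ARN,
            sources=[
                {
                    "sourceFiles": {"source1": f"s3://{landing_bucket}/{key}"},
                    "sourceFileType": "FASTQ",
                    "subjectId": experiment_id,   # illustrative mapping of IDs
                    "sampleId": experiment_id,
                }
            ],
        )
```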
Step 9
Step Functions adds the file names, creation date, and other metadata that is extracted from the files to the metadata store for the data assets of the relevant experiment.
Step 10
An Amazon S3 bucket event initiates a bioinformatics pipeline run using the raw bucket as the data source. Bioinformatics output files are written to the processed bucket.
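One possible way to start this pipeline run is with a HealthOmics workflow, as sketched below; the workflow ID, role ARN, parameter name, and output location are placeholders, and an AWS Batch based pipeline could be started in a similarly event-driven fashion.

```python
import boto3

omics = boto3.client("omics")


def start_pipeline(raw_bucket: str, experiment_id: str) -> str:
    """Kick off a bioinformatics workflow against the raw bucket for one experiment."""
    run = omics.start_run(
        workflowId="1234567",                 # placeholder private or Ready2Run workflow ID
        workflowType="PRIVATE",
        roleArn="arn:aws:iam::111122223333:role/omics-workflow-run",  # placeholder role
        name=f"pipeline-{experiment_id}",
        parameters={"input_prefix": f"s3://{raw_bucket}/{experiment_id}/"},
        outputUri=f"s3://sdms-processed/{experiment_id}/",  # assumed processed bucket
    )
    return run["id"]
```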
Step 11
Step Functions adds the file names, creation date, and other metadata that is extracted from the files to the metadata store for the data assets of the relevant experiment.
Step 12
Optionally, AWS Storage Gateway mounts onto the local network so that users can access the processed bucket for local analysis or report generation.
Step 13
Locally generated files are written to the final bucket. Step Functions posts the file name and date to the processed data asset in the metadata store.
Step 14
Amazon DataZone is a data management portal to discover, analyze, and report on data. In Amazon DataZone, research scientists and business users can search for lab datasets using keywords that are found in the metadata store, which originated from the ELN.
For example, these may be searches for sample ID, experiment ID, group, platform, file names, dates, or keywords within the experimental description. These searches return a list of data assets, each a collection of Amazon S3 objects, that are associated with those keywords.
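Such a keyword search could also be issued programmatically against the metadata catalog with the DataZone Search API, as in this sketch; the domain ID is a placeholder.

```python
import boto3

datazone = boto3.client("datazone")


def find_experiment_assets(keyword: str) -> list[str]:
    """Return the names of data assets whose metadata matches a keyword such as a sample ID."""
    response = datazone.search(
        domainIdentifier="dzd_EXAMPLE",  # placeholder DataZone domain ID
        searchScope="ASSET",
        searchText=keyword,
    )
    return [item["assetItem"]["name"] for item in response.get("items", [])]
```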
Step 15
Amazon Kendra, a managed service for indexing data and conducting semantic search, can be used to index the contents of the data files. Amazon Kendra APIs may be used within a custom discovery portal application that your organization creates to enable this search.
In combination with Amazon Kendra, large language models (LLMs), a subset of foundation models (FMs), can be used to generate summaries of search results and create conversational experiences.
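A minimal sketch of how this combination might work is shown below, pairing an assumed Amazon Kendra index with a Bedrock-hosted LLM to summarize search results; the index ID and model ID are placeholders.

```python
import boto3

kendra = boto3.client("kendra")
bedrock = boto3.client("bedrock-runtime")


def summarized_search(question: str) -> str:
    """Retrieve matching file content from Kendra, then summarize it with an LLM."""
    results = kendra.query(
        IndexId="11111111-2222-3333-4444-555555555555",  # placeholder Kendra index ID
        QueryText=question,
    )
    excerpts = "\n".join(
        item["DocumentExcerpt"]["Text"]
        for item in results.get("ResultItems", [])[:5]
        if item.get("DocumentExcerpt")
    )

    # Ask a Bedrock-hosted model (placeholder model ID) to summarize the excerpts.
    reply = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": f"Summarize these lab search results for: {question}\n\n{excerpts}"}],
        }],
    )
    return reply["output"]["message"]["content"][0]["text"]
```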
Step 16
Data and metadata can be semantically searched by a user to discover, gain access to, and analyze datasets.
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance was configured with Amazon API Gateway and Step Functions, two AWS services that are purpose-built to help you run and monitor your research systems effectively, gain insights into operations, and continually improve your processes. Specifically, API Gateway creates RESTful APIs to enable two-way communication between AWS and your lab software. It acts as a front door for lab software and AWS to share logic and metadata, which enables up-to-date contextualization of datasets in ELN and AWS. Step Functions is a visual workflow service to automate microservice processes between the data store, metadata store, and lab software, creating an orchestration event that removes the need for manual updates of metadata, and keeps the ELN, the data store, and the metadata store in sync with one another.
Security
Amazon DataZone and Storage Gateway work in concert to improve your security posture, protecting your data, systems, and assets. Amazon DataZone lets users access data in accordance with their organization’s security and compliance regulations, providing unified access controls to scientific data across multiple data domains and third-party data stores. Storage Gateway supports data integrity efforts with encryption, audit logging, and write-once, read-many (WORM) storage from on-premises applications to the data mesh. It provides lab users access to cloud-backed files for use in report generation or local analysis, while making it easy to maintain metadata tagging in the data mesh.
Reliability
Amazon S3 and DataSync are built to ensure your workloads perform their intended functions correctly and consistently while allowing you to recover quickly from failure. Amazon S3 is a highly available and durable object store with cross-Region options for global organizations. DataSync provides managed data transfer with advanced features, including bandwidth throttling, migration scheduling, task filtering, and task reporting. By liberating data from on-premises file stores, DataSync and Amazon S3 provide a reusable transfer and storage architecture that can scale from small to large.
Performance Efficiency
AWS Batch and HealthOmics both help you monitor performance and maintain efficiency for your workloads as business needs evolve. AWS Batch offers a flexible, high-performance computing configuration and virtually unlimited scale, allowing bioinformatics groups to tune and scale infrastructure as life science workloads dictate. It brings instant access to virtually unlimited computing resources to accelerate genomics, proteomics, cell imaging, electron microscopy, and high-throughput simulation.
HealthOmics allows for Ready2Run workflows or bring-your-own private bioinformatics workflows to simplify the deployment of high-performance compute workflows. It includes pre-built workflows designed by industry-leading third-party software companies along with common, open-source pipelines to help you get started quickly.
Cost Optimization
The Amazon S3 Intelligent-Tiering storage class delivers automatic storage cost savings when data access patterns change through the lifecycle of instrument data, aligning costs with the way that scientific data is used. For example, you can move raw instrument data to lower-access-frequency storage classes once that data has been processed. Another way cost is optimized with this Guidance is with HealthOmics sequence stores. These are genomics-aware data stores that support large-scale analysis and collaborative research across entire populations, reducing long-term storage costs by automatically moving data objects that have not been accessed within 30 days to an archive storage class. HealthOmics also allows petabytes of omics data to be stored efficiently and cost-effectively, enabling scientific discovery at population scale.
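As one hedged example of this pattern, the lifecycle configuration below (using an assumed raw bucket name and transition window) moves raw instrument data into S3 Intelligent-Tiering after it lands so that infrequently accessed objects age into lower-cost tiers automatically.

```python
import boto3

s3 = boto3.client("s3")

# Assumed raw data bucket; transition objects to Intelligent-Tiering once processing has settled.
s3.put_bucket_lifecycle_configuration(
    Bucket="sdms-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-instrument-data-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```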
Sustainability
DataSync and HealthOmics work in tandem to minimize the environmental impacts of running cloud workloads. For example, DataSync rapidly migrates instrument files to the cloud for data storage and archival, relieving the need for an expanding on-premises data center. And HealthOmics automatically provisions and scales your compute infrastructure, removing the need to manage servers and returning unused compute capacity to the service, reducing the amount of wasted resources.
Implementation Resources
A detailed implementation guide is provided for you to experiment with and use within your AWS account. Each stage of the Guidance, including deployment, usage, and cleanup, is covered to prepare it for deployment.
Related Content
Guidance for Development, Automation, Implementation, and Monitoring of Bioinformatics Workflows on AWS
Guidance for Digital Connected Lab on AWS
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.