How AWS Lake Formation works

AWS Lake Formation is a fully managed data and analytics service that lets you set up a secure Data Lake quickly, in days.

But, what is a Data Lake?

In simple terms, a Data Lake is a repository where you can store all types of (related) data, including structured, semi-structured, and unstructured data. You can include raw data as well as transformed data in that Data Lake.
An easy way to visualize it is to compare it with a natural lake that takes in many streams of water from different sources. Some of that water may be clean, some dirty, some rich in various minerals, and some may even be runoff from polluted sources.

The purpose of a Data Lake is to create a central place where all of your (related) data exists. That way, your applications (typically analytics applications) do not have to integrate with multiple sources. A Data Lake, if done right, can speed up analytics and lead to better insights.

A Data Lake may sound like a simple data dump (which it is) that would be easy to set up, but in practice it takes a lot of effort, resources, and time to build the data flows from the various sources into the Data Lake. Tasks such as integration, loading and unloading data, security (access, encryption, etc.), data cleansing, data-format handling, and cataloging can cause a lot of headaches and turn out to be costly, in both money and time. This is where the AWS Lake Formation service can help: it can expedite the process of setting up a Data Lake for your applications and analytics.

Conceptual view of how AWS Lake Formation works:

[Diagram: conceptual view of AWS Lake Formation. Image courtesy of AWS.]

Key features and benefits that AWS Lake Formation brings to you

Set up Data Lakes easily and quickly
- Import data with ease, whether from AWS data sources or from on-premises sources
- Use blueprints to load data; they handle the complexity under the hood for you
Automated Data Catalogs that help enable self-service for your project / organization
- Lake Formation automatically discovers the AWS data sources to which it has been given access
- Lake Formation makes use of AWS Glue workflows and shares the AWS Glue Data Catalog
Automation of data cleansing and deduplication
- Lake Formation uses machine learning to find redundant records in large volumes of data in a fairly short period of time
Simplification of security management, including access controls, encryption, and logging
Uses S3 as the base
- Lake Formation keeps the data in standard formats, stored on S3. Because no proprietary format is used, you are not locked into AWS.
- Lake Formation provides APIs and a CLI, enabling integration with your custom applications.
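
As a rough illustration of the API integration mentioned in the last point, here is a minimal Python sketch that registers an S3 location with Lake Formation through boto3. The bucket name and the helper function names are hypothetical, made up for this example. `register_resource` is the boto3 Lake Formation client call; actually running it assumes boto3 is installed and that your AWS credentials are allowed to administer the Data Lake.

```python
def s3_location_arn(bucket, prefix=""):
    """Build the S3 ARN for a bucket and an optional key prefix."""
    prefix = prefix.strip("/")
    if prefix:
        return f"arn:aws:s3:::{bucket}/{prefix}"
    return f"arn:aws:s3:::{bucket}"


def build_register_request(bucket, prefix=""):
    """Assemble the keyword arguments for register_resource (no AWS call)."""
    return {
        "ResourceArn": s3_location_arn(bucket, prefix),
        # Assumption: let Lake Formation use its service-linked role
        # instead of passing a custom RoleArn.
        "UseServiceLinkedRole": True,
    }


def register_location(bucket, prefix=""):
    """Register the S3 location with Lake Formation (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("lakeformation")
    return client.register_resource(**build_register_request(bucket, prefix))


if __name__ == "__main__":
    # Hypothetical bucket; this only prints the request, it does not call AWS.
    print(build_register_request("example-data-lake-bucket", "raw/"))
```

Keeping the request building separate from the call makes it easy to inspect what would be sent before touching a real account.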

You can seamlessly integrate the following services with Lake Formation, including permission management for them:

Amazon Athena
Amazon EMR
AWS Glue
Amazon Redshift Spectrum
Amazon QuickSight
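
To give a feel for what permission management looks like in practice, the sketch below grants read access on one catalog table to a principal, for example a role that runs Amazon Athena queries. The role ARN, database name, and table name are hypothetical placeholders, and the helper names are made up for this example. `grant_permissions` is the boto3 Lake Formation client call; running it assumes boto3 is installed and your credentials have Data Lake administrator rights.

```python
def build_select_grant(principal_arn, database, table):
    """Assemble the keyword arguments for grant_permissions (no AWS call)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],  # read-only access to the table
    }


def grant_select(principal_arn, database, table):
    """Grant SELECT on a catalog table (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("lakeformation")
    return client.grant_permissions(
        **build_select_grant(principal_arn, database, table)
    )


if __name__ == "__main__":
    # Hypothetical identifiers; this only prints the request.
    print(
        build_select_grant(
            "arn:aws:iam::123456789012:role/ExampleAnalystRole",
            "example_sales_db",
            "example_orders",
        )
    )
```

Because the grant is recorded in Lake Formation rather than in each service, the same table permission would apply whichever integrated service the principal uses to query the data.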

Pricing

There is no additional charge for Lake Formation. You pay for the underlying resources used, such as:

AWS Glue
Amazon S3
Amazon EMR
Amazon Athena
Amazon Redshift

External Links

AWS Lake Formation site

September 23, 2020

Every Bit Cloud
