Getting Started with Data Lakes (Part 4)

Nidhi Vichare
4 minute read
September 24, 2020
AWS
Cloud
Big Data
Analytics
Data Lakes
Data Pipelines
Data Engineering
Architecture

Data Lake

Final Considerations

Getting started with a Data Lake is a choice that must be made by carefully considering the following:

  1. Confirm that a data lake is the best choice
  • Evaluate and be sure that your data and use cases truly are well-suited to a data lake.
  • Continue using your relational database technologies for what they are good at, while introducing a data lake to your multi-platform architecture when it becomes appropriate.
  • The existence of many types of data sources, in varying formats, is usually a great indicator of a data lake use case. Conversely, if all of your data sources are relational, extracting that data into a data lake may not be the best option unless you are looking to house point-in-time history.
  2. Start with a small, practical project. A few of the easier ways to get started with a data lake include:
  • Begin storing new “unusual” datasets. This affords time to do longer-term planning and learning while beginning to accumulate data.
  • Data warehouse staging area. By relocating the data warehouse staging area to the data lake, you can reduce storage needs in the relational platform and/or take advantage of big data processing tools such as Spark (see the first sketch after this list). This works particularly well for low-latency datasets.
  • Data warehouse archival area. Data that ages past a date threshold can be offloaded to the data lake for historical retention, where it remains available for querying, typically via data virtualization techniques (see the second sketch after this list).
  3. Address ‘readiness’ considerations
  • Be prepared to handle the trade-offs of schema-on-read vs. schema-on-write solutions. Remember, structure needs to be applied at some point, so every decision becomes a “pay me now or pay me later” proposition that requires the right balance for your use cases (see the third sketch after this list).
  • Additionally, you can expect to introduce new technologies and new development patterns.
  4. Do a POC whenever possible
  • Conducting a technical proof of concept (POC) is something you should become accustomed to doing routinely when introducing new features, new technologies, or new processes. POCs can also be referred to as sandbox solutions in a data lake.
  • Features and functionality change rapidly in cloud services, so a POC can help you be agile and learn what needs to be done for the full-fledged solution. There is always a risk that a POC is pushed into being production-ready too quickly; therefore, factor in the effort for operationalizing POC/prototype/sandbox efforts.
  5. Don’t shortchange planning
  • Although agility and less up-front work are touted as advantages of a data lake, that does not mean there is no up-front planning. You absolutely must manage, plan, organize, secure, and catalog a data lake so that it doesn’t turn into the dreaded “data swamp.”
  • Techniques for ingesting raw data quickly and retrieving curated data effectively are iterative in nature.
  6. Implement the right level of discipline
  • Finding the right balance of agility (to get things done today) versus process and rigor (to implement a sound, maintainable, extensible architecture) is a challenge. The answer isn’t the same for every company.
  • As schema-on-read approaches become more prevalent, governance and data cataloging are just as important as they have always been, perhaps even more so.
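
To make the staging-area idea concrete, here is a minimal PySpark sketch that lands a raw extract in the data lake as partitioned Parquet. The bucket, paths, and column names are illustrative assumptions, not details from this series.

```python
# Minimal sketch: use the data lake as the warehouse staging area.
# The S3 bucket, paths, and "load_date" column are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-to-data-lake").getOrCreate()

# Read the raw extract exactly as it was delivered (CSV here, but any format works).
raw = (spark.read
       .option("header", "true")
       .csv("s3://example-lake/raw/orders/2020-09-24/"))

# Tag each row with a load date so downstream jobs can process increments by partition.
staged = raw.withColumn("load_date", F.lit("2020-09-24"))

# Persist to the staging zone as Parquet, partitioned by load date.
(staged.write
 .mode("append")
 .partitionBy("load_date")
 .parquet("s3://example-lake/staging/orders/"))
```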
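
The archival pattern can be sketched the same way: pull rows that have aged past the retention threshold out of the warehouse and park them in an archive zone, where they stay queryable. The JDBC connection details, table names, credentials, and S3 paths below are assumptions for illustration only.

```python
# Minimal sketch: offload aged warehouse rows to the data lake archive zone.
# JDBC details, table names, credentials, and S3 paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("archive-to-data-lake").getOrCreate()

# Pull only the rows that have aged past the retention threshold.
aged = (spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://warehouse.example.com:5432/dw")
        .option("dbtable", "(SELECT * FROM sales.orders WHERE order_date < DATE '2018-01-01') aged")
        .option("user", "etl_user")
        .option("password", "REDACTED")
        .load())

# Partition by year so archive queries can prune efficiently.
aged = aged.withColumn("order_year", F.year("order_date"))
(aged.write
 .mode("append")
 .partitionBy("order_year")
 .parquet("s3://example-lake/archive/orders/"))

# The archive remains queryable, here via Spark SQL; an external-table layer
# (e.g., Athena or Redshift Spectrum) could serve the same virtualization role.
spark.read.parquet("s3://example-lake/archive/orders/").createOrReplaceTempView("orders_archive")
spark.sql("SELECT order_year, COUNT(*) AS n FROM orders_archive GROUP BY order_year").show()
```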
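
The “pay me now or pay me later” trade-off can also be shown in a few lines. In the sketch below, schema-on-write declares and enforces structure before the curated copy lands, while schema-on-read stores the raw files as-is and applies structure at query time; the paths and fields are assumptions for illustration.

```python
# Minimal sketch contrasting schema on write with schema on read.
# Paths and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-tradeoff").getOrCreate()

# Schema on write ("pay me now"): structure is declared and enforced before the
# curated copy lands, so every downstream consumer can rely on it.
write_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("amount", DoubleType(), True),
    StructField("event_time", TimestampType(), True),
])
events = spark.read.schema(write_schema).json("s3://example-lake/raw/events/")
events.write.mode("overwrite").parquet("s3://example-lake/curated/events/")

# Schema on read ("pay me later"): the raw JSON stays as-is and each consumer
# infers or applies structure at query time, deferring the modeling cost.
raw_events = spark.read.json("s3://example-lake/raw/events/")
raw_events.selectExpr("event_id", "CAST(amount AS DOUBLE) AS amount").show(5)
```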

Link to Data Lakes Part 1

Reference: Thanks to AWS

Further Reading

🔗 Read more about Snowflake here

🔗 Read more about Cassandra here

🔗 Read more about Elasticsearch here

🔗 Read more about Kafka here

🔗 Read more about Spark here

🔗 Read more about Data Lakes Part 1 here

🔗 Read more about Data Lakes Part 2 here

🔗 Read more about Data Lakes Part 3 here

🔗 Read more about Redshift vs Snowflake here

🔗 Read more about Best Practices on Database Design here