Picture credit: Unsplash
As humans we always remember the important events of our lives - the "Good" and the "Not so good". Our life journey is a tapestry woven together with colorful threads, each representing unique personal experiences. Have you ever compared the front and back of a tapestry? The front is colorful and pretty, while the back lacks clarity and detail. As we reflect back on our lives, we connect the dots of the experiences that helped make that beautiful tapestry. In this paper, I would like to share a set of data engineering problems that my team experienced, and the solutions we devised, as colorful threads woven into a coherent and elegant whole.
As the leader of the Data Engineering team, I managed a group of data engineers, data analysts, and architects who were responsible for supplying business stakeholders with the analytic insights that create competitive advantage. Customer requirements, market opportunities, and product evolution drove a constant series of new questions.
The processes that worked like a charm a few years ago had to be re-engineered and modernized to drive faster business outcomes. I started noticing that the data processes driving new features moved more slowly, constrained by development cycles, limited resources, and cross-functional dependencies. The gap between what the customer needed and the data IT could acquire was often a source of conflict. Inevitably, this mismatch between expectations and capabilities caused dissatisfaction and prevented us from fully realizing the strategic benefit of our data.
I will take you through my journey of overcoming obstacles by embracing hybrid cloud environments and modern tools and technologies for digital transformation, so we could reap the benefits of a solid, long-term solution. As we pivoted our focus to modern applications, modern infrastructure, and clouds to deliver a better customer experience, reliability was the #1 goal, and the key focus areas were Availability, Performance, and Security. I drove initiatives for Big Data projects and Hybrid-Cloud transformation with a focus on Distributed Data Systems, Agile, DevOps, DataOps, Analytics, SecOps, and Business Intelligence.
Figure: DataOps Components
- Requirements kept changing
- Many Silos
- A Variety of Data Formats
- Data Errors
- Data Pipeline Management
- Manual Processes
Agile, DevOps, and DataOps methodologies directly impact the productivity, quality, and reliability of data analytics.
Let's take a moment to define DataOps, DevOps, and Continuous Delivery before we get into the specific details.
DataOps (the new normal) is a combination of tools and methods which streamline the development of new analytics while ensuring data quality. DataOps helps shorten the cycle time for producing analytic value and innovation. My DataOps team is a collaborative, cross-functional analytics team that values analytics that work; we measure the performance of data analytics by the insights we deliver. The DataOps team embraces change and seeks to constantly understand evolving customer needs.
DevOps (the new normal) improves collaboration between employees from the planning through the deployment phases of software. It seeks to reduce time to deployment, decrease time to market, minimize defects, and shorten the time required to fix problems. The DevOps approach brings together software development and operations to align development more closely with business objectives, shorten development cycles, and increase deployment frequency.
Continuous delivery (the new normal) requires automation from planning all the way through to delivery/deployment. Previously, software development, testing, quality assurance, and customers could each be running in different environments, which hampered their ability to communicate and led to misunderstandings. I also included Continuous Tuning and Continuous Cost Optimization (tuning our Infrastructure-as-Code scripts, reviewing TCO).
Figure: Grow your leadership flame
The software development pipeline has these steps: Planning, resourcing, development, testing, quality assurance, and deployment.
Adoption of DataOps helped reduce the time to produce insights and improved the quality of our data analytics processes. Asking the next business question became easier, which lowered the marginal cost of new analytics. The biggest benefit was the boost to team morale as decision making became more data driven. Finally, we achieved greater team efficiency by following agile processes, reuse, and refactoring. We were handling change successfully and went beyond data engineering to preserve the value of our data.
We wanted to accelerate cloud adoption because we were trying to respond to new spikes in customer demand and offer SaaS services to our customers in an agile way.
There is a renewed focus on the reliability, scalability, and performance of systems that are generating revenue. CI/CD paired with "Observability" is the next evolutionary step in delivering new functionality while keeping up with reliability. Observability is a property of a system and application that enables an external observer to monitor, troubleshoot, and understand what is happening to that application in real time, even if the conditions that caused the behavior were previously unknown. I would recommend Datadog, Elastic, Sumo Logic, and Weavescope.
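As a minimal sketch of what makes application telemetry observable (rather than the configuration of any specific backend), the snippet below emits structured JSON logs with a correlation ID that tools like Datadog or Elastic can index and join against metrics and traces. The service name and field names are hypothetical.

```python
import json
import logging
import time
import uuid

# Structured (JSON) logs let an observability backend index and correlate
# events. The service name and field names here are illustrative only.
logger = logging.getLogger("order-api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(level, message, **fields):
    """Serialize one log event as a single JSON line."""
    event = {"ts": time.time(), "level": level, "message": message, **fields}
    logger.log(getattr(logging, level), json.dumps(event))

# One correlation ID per request lets logs, metrics, and traces be joined.
request_id = str(uuid.uuid4())
log_event("INFO", "request received", request_id=request_id, route="/orders")
log_event("ERROR", "downstream timeout", request_id=request_id,
          dependency="inventory-service", latency_ms=2150)
```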
We re-architected our backend distributed systems and built a flexible stack that will keep working for the business as it evolves. We made a shift from an on-premises private data center to AWS Cloud. We are dealing with a whole different operating model, and the shift is in managing these cloud and on-premises workloads in an orchestrated way. Our organization adopted new methodologies, new processes, new tools, and new skill sets, and teams had to be recombined into different types of teams, cross-functional teams, because cloud environments are ever-changing. The cloud enabled us to scale with hardware, while agile development delivered rapid evolution and change of our software, i.e., our applications.
Platform capabilities, i.e., observability and reliability, are integrated end to end. Discipline started with the software delivery process, i.e., development and the CI/CD (continuous integration/continuous delivery) pipeline. Development and system observability kept track of all the KPIs related to deployment frequency, deployment history, production changes (successful/failed/rolled back), and how quickly we were able to recover from those failures.
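As a hedged illustration, the sketch below computes those deployment KPIs (deployment frequency, change failure rate, and mean time to recovery) from a change log. The record structure and field names are assumptions, not our actual tooling; in practice the records would come from the CI/CD system's API.

```python
from datetime import datetime

# Hypothetical deployment records pulled from a CI/CD system.
deployments = [
    {"at": datetime(2021, 6, 1, 10), "status": "success"},
    {"at": datetime(2021, 6, 2, 15), "status": "failed",
     "recovered_at": datetime(2021, 6, 2, 16, 30)},
    {"at": datetime(2021, 6, 4, 9), "status": "rolled_back",
     "recovered_at": datetime(2021, 6, 4, 9, 45)},
    {"at": datetime(2021, 6, 7, 11), "status": "success"},
]

window_days = 7
frequency = len(deployments) / window_days                      # deployments per day
failures = [d for d in deployments if d["status"] != "success"]
change_failure_rate = len(failures) / len(deployments)          # failed or rolled back
mttr_hours = sum((d["recovered_at"] - d["at"]).total_seconds() / 3600
                 for d in failures) / len(failures)             # mean time to recover

print(f"deploys/day={frequency:.2f}  "
      f"change failure rate={change_failure_rate:.0%}  MTTR={mttr_hours:.1f}h")
```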
We had adopted Kubernetes five years ago, along with microservices-based applications. The modernization of our development stack allowed us to pivot to AWS fairly quickly. I expect cloud adoption to keep accelerating across multiple clouds, built on Kubernetes, microservices-based APIs, and service meshes. With modern methodologies like Agile, DevOps, DataOps, and Observability to deliver reliable, secure, and performant applications, we are in great shape to scale in production.
As we manage reliability, we have also strengthened SecOps (security across the stack). The logs/metrics/traces data that helps with reliable performance and availability is also used to build secure applications. This cross-functional and streamlined approach to tooling, as well as to processes, has helped us deliver and continuously tune our applications so they produce more actionable and valuable insights.
Network data comes in structured, unstructured, and semi-structured formats. Depending on the use case, the data was stored in relational databases (Oracle, MySQL, Aurora Serverless), a data warehouse (Teradata, Snowflake), a data lake (Snowflake, S3, MapR), distributed NoSQL clusters (Cassandra, MongoDB), search/analytics (Elasticsearch), and the S3 object store.
I have written a detailed blog, File Formats, that discusses the data format options.
Data formats: CSV, JSON, Avro, Parquet, ORC, Protobuf, Arrow
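As one hedged example of moving between these formats, the snippet below converts a raw CSV drop into columnar, compressed Parquet with pyarrow. The file paths are hypothetical; the same pattern applies to any of the row-to-columnar conversions above.

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Read a raw CSV extract (path is illustrative) into an in-memory Arrow table.
table = pv.read_csv("raw/network_events.csv")

# Persist it as compressed, columnar Parquet for analytics engines.
pq.write_table(table, "curated/network_events.parquet", compression="snappy")

# Parquet keeps the schema with the data, so readers get typed columns back.
print(pq.read_table("curated/network_events.parquet").schema)
```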
In addition, asynchronous ingestion of large datasets such as syslogs and case attachments adds business context to the network data captured in a continuous stream. Different data consumers have different goals: the product teams have different analytics needs than Support, Operations, or DevOps, not to mention Marketing and Business Intelligence. This usually leads to hybrid data management systems consisting of different technologies, as no single solution can meet all of the business objectives. The need to support such diverse and disparate data sources and access mechanisms across a wide variety of use cases presents real challenges for an effective and robust data ingestion strategy. To a large degree, the mechanism to use for data ingestion is determined by the use case and the velocity of the data.
Schema-on-write helps queries execute faster because the data is already loaded in a strict format and you can easily find the desired data. However, it requires much more preliminary preparation and continuous transformation of incoming data, and the cost of making changes to the schema is high, so such changes should be avoided.
Schema-on-read offers fast data ingestion because the data does not have to conform to any internal schema: ingestion is just copying or moving files. This type of data handling is more flexible for big data, unstructured data, or frequent schema changes, and it can prevent some human mistakes. It is generally recommended to keep data in its original raw format (just in case) and also store it in another format optimized for further processing. The ETL may contain invalid transformations, and the original raw data allows you to go back to the source and process it correctly. On the downside, since the schema must be defined when querying the data, SQL queries tend to be complex; they take time to write, and even more time to run.
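A small sketch of the schema-on-read pattern described above: land the raw JSON untouched, apply a schema only at read time, and also persist an optimized columnar copy. The paths and field names are assumptions for illustration.

```python
import json
import pandas as pd

# Schema-on-read: ingestion is just landing the raw file; no schema is
# enforced at write time. (Path and fields are illustrative.)
raw_path = "landing/syslog_batch.json"

# At read/query time we decide which fields matter and how to type them.
records = [json.loads(line) for line in open(raw_path)]
df = pd.DataFrame.from_records(records)
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
df["severity"] = df["severity"].astype("category")

# Keep the raw file as-is, and also write an optimized columnar copy
# for repeated downstream processing.
df.to_parquet("curated/syslog_batch.parquet", index=False)
```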
We had used Oracle for practically all workloads. Performance requirements and SLAs drove the re-architecture, and we switched to a horizontally scalable, multi-region Cassandra implementation for all of our core services. Oracle can achieve what you want, but in a roundabout way: it allows for no-downtime expansion of a cluster, but uses complex coordination mechanisms and special processes like Dynamic Database Services to do so. In Cassandra, this is simply a matter of adding a node, a foundational operation that doesn't require special coordination or a separate service. The ability to scale linearly is built into the core of the technology.
The design decisions behind Cassandra enable the advantages over Oracle described above. Instead of retrofitting an existing technology that was designed for a different purpose, Cassandra is designed specifically to achieve the functionality that modern applications require. Also consider the disk architecture: RAC uses a shared-disk architecture, which requires mechanisms for management and orchestration to avoid data loss and maintain service availability. Cassandra shares no resources (a fundamental aspect of the architecture, not a feature), so you completely bypass this problem and never even attempt to use shared disks, which are an anti-pattern for truly distributed workloads.
Furthermore, at a certain scale, Oracle RAC deployments start to look more and more like Cassandra. Scaling an Oracle system requires simplifying the data model, which takes away the benefits of relational abstraction, and denormalizing, which takes away consistency. It also requires asynchronous replication across regions to avoid latency. It is almost as if you are trying to make the Oracle system look like the Cassandra system under the hood. If you keep the complexity that allows Oracle to do much more than Cassandra (at least functionally), you introduce elements that hinder scale and availability, so you have to drop that RDBMS functionality in favor of scale and availability, negating the Oracle benefits entirely. For example, you cannot randomly access any data using a join if you want predictable performance, so in Oracle you denormalize to avoid joins. Why not just denormalize and use Cassandra, which has better built-in mechanisms for distribution?
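To make the denormalization point concrete, here is a hedged sketch using the DataStax Python driver: one denormalized table per query pattern, partitioned by how the data is read, so the read path never needs a join. The keyspace, table, and column names are hypothetical, and the replication settings are a local-development simplification.

```python
from cassandra.cluster import Cluster

# Connect to a local node; production would use the cluster's contact points
# and an appropriate load-balancing policy.
session = Cluster(["127.0.0.1"]).connect()

# Keyspace with simplistic replication, for illustration only.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Denormalized, query-driven table: "alerts for a device, newest first".
# The partition key (device_id) spreads data across nodes; the clustering
# key (event_time DESC) pre-sorts rows so the read needs no join or sort.
session.execute("""
    CREATE TABLE IF NOT EXISTS telemetry.alerts_by_device (
        device_id   text,
        event_time  timestamp,
        severity    text,
        message     text,
        PRIMARY KEY ((device_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# The read path is a single-partition query, which stays fast as nodes are added.
rows = session.execute(
    "SELECT event_time, severity, message "
    "FROM telemetry.alerts_by_device WHERE device_id = %s LIMIT 10",
    ("router-042",),
)
for row in rows:
    print(row.event_time, row.severity, row.message)
```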
In Oracle, availability is a process of recovering from catastrophic failures. This is because of the coordination involved and the consistency guarantees that an ACID database provides; because of those guarantees, users are looking at a failover process. However fast RAC can recover, it is still a fundamentally flawed architecture. Cassandra's shared-nothing architecture allows for complete disaster avoidance, not just disaster recovery.
In a nutshell, Oracle is better for RDBMS use cases, but trying to use it for big data systems is only a stopgap until you can put a better NoSQL strategy in place with a truly distributed system like Cassandra. Otherwise you will have to deal with rising costs and complexity. While Oracle RAC does its best to hide the complexity required to back modern applications, Cassandra doesn't need to hide its complexity, because the functionality maps directly to the goals of your system.
Monitoring system performance, understanding query access patterns, and continuously tuning enabled us to be ready for production in 3 months.
Elasticsearch-Logstash-Kibana (ELK) log analytics opens up insights to view and correlate events that can lead to system slowness, corruption, or even system failure. Just to name a few top events from our list (see the query sketch after the list):
- Large batch warnings
- Tombstone warnings
- Excessive GC and/or long pauses
- Memtable Flush Writers warnings
- Compaction warnings and errors
- Connection failures
- Unwanted data access
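Below is a hedged sketch of how such events can be pulled out of the log index with the Elasticsearch Python client (8.x style). The index pattern, field names, and warning message wording are assumptions about how the Cassandra logs are shipped into ELK.

```python
from elasticsearch import Elasticsearch

# Endpoint, index pattern, and log field names are assumptions about the ELK setup.
es = Elasticsearch("http://localhost:9200")

# Find recent Cassandra tombstone and large-batch warnings.
resp = es.search(
    index="cassandra-logs-*",
    query={
        "bool": {
            "must": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
            "should": [
                {"match_phrase": {"message": "tombstone"}},
                {"match_phrase": {"message": "Batch of prepared statements"}},
            ],
            "minimum_should_match": 1,
        }
    },
    size=20,
)

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("message"))
```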
Switching to Cassandra can be a daunting task if you are migrating your production workloads across multiple regions in the cloud and on-premises. Start with a small cluster in a single region, test your application to ensure the data model is working correctly, set up your first observability cluster, i.e., ELK or Grafana/Prometheus, and configure webhooks to get notifications. Finally, I found this mindmap to be a great tool for demystifying Cassandra.
Redundancy has a cost, and a reliable system has to pay it to achieve resilience by eliminating every single point of failure. Reliability is availability over time, considering the full range of real-world conditions that can occur. At a high level, we chose to have the following policies in place:
Have an observability plan for Kubernetes metrics to report on and track in real time (a sketch using the Kubernetes Python client follows the list):
- Number of Kubernetes clusters
- Number of nodes
- Number of nodes by cluster
- Number of pods
- Number of pods by node
- Nodes by uptime (min to max)
- Pods by uptime (min to max)
- Number of applications/services
- Number of applications/services by cluster
- Resource utilization by node (e.g. CPU)
- Disk writes per node
- Disk reads per node
- Network errors per node
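As a starting point, a few of these counts can be pulled directly from the Kubernetes API. The sketch below uses the official Python client and assumes your local kubeconfig points at the cluster you want to inventory (one cluster at a time; the richer utilization and I/O metrics would come from a metrics pipeline such as Prometheus).

```python
from collections import Counter
from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

nodes = v1.list_node().items
pods = v1.list_pod_for_all_namespaces().items

print(f"nodes: {len(nodes)}")
print(f"pods:  {len(pods)}")

# Pods per node: a rough view of scheduling balance.
pods_per_node = Counter(p.spec.node_name for p in pods if p.spec.node_name)
for node_name, count in sorted(pods_per_node.items()):
    print(f"  {node_name}: {count} pods")

# Node Ready condition gives a quick health signal to alert on.
for n in nodes:
    ready = next((c.status for c in n.status.conditions if c.type == "Ready"), "Unknown")
    print(f"  {n.metadata.name} Ready={ready}")
```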
When we embarked on our transformation journey, there were numerous challenges, including the time-consuming and tedious work of mapping database schemas and application interdependencies for a successful migration to the cloud. A hybrid framework enabled a smooth testing and migration process, allowing us to work with disparate data sources and applications that co-existed, no matter where they were hosted. We were able to increase developer velocity, enable security at every step, automate every workflow, and redefine collaboration by using the right set of tools, and we now reap the benefits of a hybrid cloud environment (OpenStack Private Cloud and AWS Cloud). Data and insights become more critical with growing business needs; transitioning gave us real-time access to the entire data ecosystem and let us derive real-time insights. We were able to frequently adjust our data strategies to stay aligned with regulatory and compliance requirements. The new hybrid environment allowed us to unify and effectively utilize the data that is most useful for real-time insights, decision making, trendspotting, and analytics needs.
With a solid plan for CI/CD and modernized tools like CircleCI, Flux, Jira, GitHub Enterprise, Confluence, Webex Teams, and the observability tools mentioned above to increase developer velocity, we were able to accelerate and successfully complete the cloud migration.
AWS Cloud gave us wings to complete the data migration project using DMS and powerful purpose built databases for our use cases. I will be writing another paper to share my experience with CDC from Cassandra to AWS(S3 and Aurora-MySQL).
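For the DMS piece, a minimal sketch with boto3 is shown below. The task ARN is a placeholder, and the replication instance, source/target endpoints, and task itself are assumed to have been created already (via the console or infrastructure-as-code); this only starts the task and polls its progress.

```python
import boto3

dms = boto3.client("dms", region_name="us-west-2")

# Placeholder ARN; the replication instance, endpoints, and task are assumed
# to exist already.
task_arn = "arn:aws:dms:us-west-2:123456789012:task:EXAMPLE"

# Kick off the full load (use "resume-processing" to continue CDC later).
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)

# Poll task status and full-load progress.
tasks = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)["ReplicationTasks"]
for t in tasks:
    print(t["Status"], t.get("ReplicationTaskStats", {}).get("FullLoadProgressPercent"))
```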
The problems with frequent changes, silos, errors, pipeline management, and velocity went away with automation, observability, decentralization of databases, an S3-based data lake, and managed services like ECS, serverless Lambda, Aurora, Elastic, SQS, and Kinesis. So the journey continues, and as I work through solving more complex challenges, I will continue to share my solutions and experiences with you.
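As a closing illustration of the serverless pattern mentioned above, here is a hedged sketch of a Lambda handler that drains a Kinesis batch into the S3 data lake. The bucket name and key layout are hypothetical.

```python
import base64
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

def handler(event, context):
    """Triggered by a Kinesis stream; writes each batch to the S3 data lake."""
    records = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        records.append(json.loads(payload))

    # Partition by ingest date so downstream queries can prune by day.
    now = datetime.now(timezone.utc)
    key = f"raw/events/dt={now:%Y-%m-%d}/{context.aws_request_id}.json"
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body="\n".join(json.dumps(r) for r in records).encode("utf-8"),
    )
    return {"written": len(records), "key": key}
```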
🔗 Read more about Snowflake here
🔗 Read more about Cassandra here
🔗 Read more about Elasticsearch here
🔗 Read more about Kafka here
🔗 Read more about Spark here
🔗 Read more about Data Lakes Part 1 here
🔗 Read more about Data Lakes Part 2 here
🔗 Read more about Data Lakes Part 3 here
🔗 Read more about Redshift vs Snowflake here
🔗 Read more about Best Practices on Database Design here