Data is the focus nowadays; the era of hardware, software, mobile, and cloud is fading away. We have become hungry to get insight from data and to innovate around it. That is why being data-driven is so important for organizations, and why there has been so much advancement in Machine Learning, Data Science, Artificial Intelligence (AI), and Deep Learning (DL).
However, people sometimes still misunderstand the basic concepts and the laws of physics when building Data Lakes and Microservices. You still need to get back to the basics of distributed systems, computer science, and software engineering. Tools are only as good as the people who use them. Empathy and creativity are human powers that will never be replaced by AI.
Basically, all your dreams are limited by the laws of physics. We can't escape the network, bandwidth, hardware limitations, data transport, and organizational structure. We can dream big, but if the technology is not there or is used incorrectly, you will still end up in a mess.
So let's get back to the basics.
The Network is Scary, so Don't Distribute Your Objects
If you are into computer science, you will know about the 8 Fallacies of Distributed Computing. Lots of things can go wrong when you do anything over the network: sending data, communicating, remote calls, REST APIs, messaging, etc. Martin Fowler states this perfectly in his classic book PoEAA (Patterns of Enterprise Application Architecture).
First Law of Distributed Object Design: "don't distribute your objects"
The 8 fallacies are:
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
The network is homogeneous
These fallacies apply to any distributed system. There are a lot of patterns, standards, and techniques built around them. You can see a lot of innovation and great ideas from Leslie Lamport here, such as consensus, logical clocks (which later led to vector clocks), and other basic distributed algorithms that enable the internet of today. Those ideas also power blockchain technologies.
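To make one of those ideas concrete, here is a minimal sketch of a Lamport logical clock in Python. The class and method names are my own choices for illustration, not taken from any library. The rule is simple: every local event increments the counter, and on receiving a message a node jumps ahead of the sender's timestamp, so a cause always carries a smaller timestamp than its effect.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event happened: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned value is stamped on the message."""
        return self.tick()

    def receive(self, message_time):
        """Move past both our own time and the sender's timestamp."""
        self.time = max(self.time, message_time) + 1
        return self.time


# Two nodes with no shared wall clock.
node_a = LamportClock()
node_b = LamportClock()

node_a.tick()                     # local work on A
stamp = node_a.send()             # A sends a message carrying its timestamp
received_at = node_b.receive(stamp)  # B orders the receive after the send
```

Note that the clock only gives an ordering consistent with causality. It does not tell you whether two events were concurrent, which is the gap vector clocks fill.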
The cloud is also a networked application, and the network is really unreliable. Microservices are a distributed architecture, and messaging is a networked concept too. It all comes from the roots of operating systems in computer science: message passing, locking, caching, data storage, etc. So you need to know the basics before jumping into distributed systems. Every decision you make comes with a trade-off. That's why architects and consultants are famous for the answer "It depends."
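Since the network is unreliable, one basic habit is to treat every remote call as something that can fail. Below is a minimal Python sketch of retry with exponential backoff and jitter. The names `call_with_retry` and `FlakyService` are hypothetical and made up for illustration; a real system would also need timeouts and idempotent operations, which this sketch leaves out.

```python
import random
import time


def call_with_retry(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a network-style error, back off and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff plus jitter, so many clients
            # do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


class FlakyService:
    """Hypothetical remote service that fails its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return "payload"


service = FlakyService(failures=2)
# sleep is swapped for a no-op here so the example finishes instantly.
result = call_with_retry(service.fetch, sleep=lambda seconds: None)
```

Even this small helper shows the trade-off: retries hide short outages, but they add latency and extra load, and they are only safe when the remote operation can be repeated without side effects.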
A lot of the jargon and hype is driven by vendors trying to sell things that are actually just new branding on the same approach. It was Web Services, SOAP, etc., and now they call it Microservices, NoSQL, etc. I'm not saying these are bad things. But please don't ever forget the history of software, or you will end up making the same mistakes again.
Big Data and Data Lake
Given the advancement in hardware and technology, hardware has become cheaper, storage has become cheaper, and people can easily add more nodes for horizontal scaling. Hadoop brought a breakthrough 10 years ago, like Java did for the software industry 25 years ago. Basically, big data is just scalable storage + scalable processing. That's it. What you do with it is up to you.
You are responsible for creating value or insight out of it. As I said, there is no silver bullet and no quick win here. It all depends on you: your creativity, intuition, vision, and execution. An idea without execution is just a dream. Think big and start small. Deliver an MVP, get rapid feedback from customers, learn, and iterate. Do it over and over again and you will be doing what they call Lean Startup.
The big concept applied in big data platforms is that large storage requires large compute power. You can't bring all the data into a single node and process it there, the way it was done in the past. You need to do ELT, not ETL.
Code is always smaller than the huge data you have in the cluster. So why not send the code/computation closer to the data and process it there with high performance? You will avoid bandwidth issues and get faster results. Remember the fallacies of distributed computing.
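Here is a small Python sketch of that idea. Everything in it is a toy assumption: the `partitions` dictionary stands in for data spread over three cluster nodes, and the serialized JSON size is only a rough proxy for network traffic. Each node computes a tiny partial result next to its own data, and only those partial results travel to the coordinator.

```python
import json

# Hypothetical cluster: each node owns one partition of 10,000 values.
partitions = {
    "node-a": [float(i) for i in range(10_000)],
    "node-b": [float(i) for i in range(10_000)],
    "node-c": [float(i) for i in range(10_000)],
}


def local_summary(values):
    """Runs on the node that owns the data; returns a tiny partial result."""
    return {"total": sum(values), "count": len(values)}


def combine(summaries):
    """Runs on the coordinator; merges partial results into a global mean."""
    total = sum(s["total"] for s in summaries)
    count = sum(s["count"] for s in summaries)
    return total / count


# Ship the computation to the data: one summary per node crosses the network.
summaries = [local_summary(values) for values in partitions.values()]
mean_value = combine(summaries)

# Rough proxy for traffic: serialized size of what would cross the wire.
bytes_if_data_moves = sum(len(json.dumps(v)) for v in partitions.values())
bytes_if_code_moves = sum(len(json.dumps(s)) for s in summaries)
```

The mean is the same either way, but moving the raw partitions means shipping every value, while moving the computation ships a few dozen bytes per node. The gap grows with the size of the data, not with the size of the code.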
Still, with great big data technology comes great responsibility. In the wrong hands, big data will create a gigantic mess. Just dumping all the data into a single data lake will completely waste your company's resources, time, and money. You will generate a lot of headaches and less productivity. This giant big data monolith approach brings large organizations to their knees. You still need data governance and a smart way of thinking about how to create a better data ecosystem.
Domain-driven design and Data Mesh
Given the problem above, let's take a look at a different way of doing things. Basically, even when you are doing Big Data and Analytics, you should follow software engineering rules and principles.
Good Modularity and Decomposition
Domain-driven design (DDD) by Eric Evans clearly embraces this approach. You should read the blue book. It's a beautiful piece of thinking. It is also closely related to Service Oriented Architecture.
Don Box defined and explained the following 4 tenets:
Boundaries are explicit
Services are autonomous
Services share schema and contract, not class
Service compatibility is determined based on policy
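The third tenet, sharing schema and contract rather than class, can be sketched in Python as follows. All names here (`ORDER_PLACED_V1`, `Order`, `Invoice`, `validate`) are hypothetical and exist only for this illustration. The producer and the consumer each keep their own internal class; the only thing they agree on is the shape of the message.

```python
import json
from dataclasses import dataclass

# Hypothetical contract for an "order placed" message: field name -> type.
ORDER_PLACED_V1 = {"order_id": str, "amount_cents": int, "currency": str}


def validate(payload, schema):
    """Reject payloads that miss a field or carry the wrong type."""
    for field, expected in schema.items():
        if not isinstance(payload.get(field), expected):
            raise ValueError(f"contract violation on field {field!r}")
    return payload


# Producer side: its internal class never leaves the service boundary.
@dataclass
class Order:
    order_id: str
    amount_cents: int
    currency: str
    internal_margin: float  # private detail, not part of the contract

    def to_message(self):
        body = {
            "order_id": self.order_id,
            "amount_cents": self.amount_cents,
            "currency": self.currency,
        }
        return json.dumps(validate(body, ORDER_PLACED_V1))


# Consumer side: builds its own model from the contract only.
@dataclass
class Invoice:
    reference: str
    total: float

    @classmethod
    def from_message(cls, raw):
        data = validate(json.loads(raw), ORDER_PLACED_V1)
        return cls(reference=data["order_id"], total=data["amount_cents"] / 100)


message = Order("A-1", 1999, "USD", internal_margin=0.3).to_message()
invoice = Invoice.from_message(message)
```

Because the consumer never imports `Order`, the producer can rename or restructure its internals freely as long as the message still satisfies the contract. That is the explicit boundary and the autonomy the tenets ask for.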
But once again, vendors drive the industry with the term Microservices. The issue is that people tend to think logical decomposition (services) is the same as physical separation (deployment, container, node, or server). That introduces the network between the Microservices, through synchronous (blocking) REST APIs and also asynchronous messaging. It creates accidental complexity. And in the cloud everything comes with a cost. You need to keep the data aligned with its source.
The Data Mesh approach truly embraces DDD. But again, this is a logical architecture; it is more about people, team organization, and a data-product approach. The physical part, the data platform, can be shared between domains. That makes for an easier, loosely coupled architecture and allows agility.
Cloud Native Data Platform
The old-school approach to big data is to think it should be deployed on physical servers. Apache Hadoop was invented in the pre-cloud era, so that was valid thinking at the time. But it needs to evolve. Scalable storage + scalable processing won't work anymore on the cloud. We need to embrace the elasticity and agility of the cloud to get the most value from it. We can also save a lot of cost and get the real benefit from it.
If you are still thinking in the old way and expect it to work on the cloud, then you need to unlearn what you have learned. Cloud is not just a technology transformation but also a digital and organizational transformation.
So right now we don't want scalable storage anymore, and we don't want scalable compute or processing anymore either. What we want is:
Unlimited storage + Elastic Compute
You should not limit your thinking to physical architecture anymore. Think more logically and bring the benefit to your organization.
Storage in the cloud, like AWS S3, is cheap. You can theoretically get unlimited storage. The S3 API is also the de facto standard right now. You can use it on premises too, with MinIO object storage. That enables hybrid and multi-cloud architecture.
But be careful about the cost of data migration. Even in the cloud you need to keep your options open and not lock yourself into a single platform. You can always rebuild your compute engine, because it is stateless and therefore disposable, through containerization, orchestration, resource management, and other cloud native tools.
The strategy for your data agility is to use:
Open data format
Open data transport
Agile Analytics (ML and AI) Pipeline
Let's dive deep into each point, one by one.
Open Data Format
You can use an open data format like Apache Avro, JSON, plain text, or Apache Parquet for columnar storage. The usual standard for analytics is Apache Parquet as the columnar file format and Apache Arrow as the in-memory columnar format. That way you can move your data, and the format keeps it compatible with other analytics tools. You can bring your data on premises or even move it to another cloud storage that suits your business strategy. But egress will still cost you money, depending on how large your data is.
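A back-of-the-envelope sketch in Python shows how that egress cost scales with data size. The price used here is a made-up placeholder, not a real quote from any provider, and real pricing usually has tiers and free allowances that this ignores. Plug in the numbers from your own provider's price list.

```python
def egress_cost(data_tb, price_per_gb):
    """Estimate the one-off cost of moving data_tb terabytes out of a cloud."""
    gigabytes = data_tb * 1000  # decimal terabytes to gigabytes
    return gigabytes * price_per_gb


# Placeholder price per GB, purely hypothetical.
PLACEHOLDER_PRICE_PER_GB = 0.05

# Estimated cost for a small, medium, and large data lake.
estimates = {
    size_tb: egress_cost(size_tb, PLACEHOLDER_PRICE_PER_GB)
    for size_tb in (1, 50, 500)
}
```

The cost grows linearly with volume, so a migration that is trivial at 1 TB becomes a real budget line at 500 TB. That is worth knowing before you decide how much lock-in you can accept.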
So we should avoid lock-in, right? Well, as always, the answer is "It depends."
No one can tell you exactly how much you should avoid it or embrace it. All we can do is give you suggestions based on your cost-benefit analysis and return on investment. I love how Gregor Hohpe explains the rationale behind this in his article in a very comprehensive manner. If you are an executive, I really recommend you read his cloud strategy book. It gives you solid preparation for a cloud strategy based on your business and digital strategy.
Open Data Transport
People sometimes treat this as a byproduct.
Self-driving Data Governance
Agile Analytics Pipeline
ML, AI and DL. Data Scientist....