For years, as the processing power of application servers exploded, databases lagged due to their limited capacity and speed. However, thanks to the various Big Data frameworks, the database world has seen an overhaul.
Among these frameworks is Apache Hadoop. The number of applications running on Hadoop clusters grows day by day, because Hadoop is not only simple and efficient but also works very well in distributed environments. It is designed to run efficiently on thousands of machines and over massive datasets.
What is Apache Hadoop?
What is meant by “Hadoop”? Or, more importantly, what does “Hadoop” stand for? The name is read as High Availability Distributed Object Oriented Platform, and that is exactly what Hadoop technology offers developers: high availability through the parallel, object-oriented distribution of tasks.
Hadoop is an open source Java framework developed in 2006 by Doug Cutting and Michael J. Cafarella, and maintained by the Apache Software Foundation under the Apache 2.0 license. It is a software ecosystem that allows companies to process huge amounts of data in a short time. This is accomplished by facilitating the use of parallel computer processing on a massive scale.
Hadoop is very popular because it is both accessible and easy to learn. Economical and practical, it offers modules with many options. It can easily scale and be deployed across multiple machines to accommodate nearly any size data set. Moreover, its method of storing and processing data presents an attractive enterprise solution in the context of growing storage.
Today, Hadoop has been implemented in several industries to meet their specific needs. Yahoo was among the first companies to adopt Hadoop. Since then, several large companies, including Facebook, Twitter, and Adobe, have implemented this architecture to benefit their organization.
What are the components of Hadoop?
Hadoop is not just an application; it is above all a platform whose various integrated components enable the storage and processing of distributed data. The Hadoop ecosystem includes basic components essential to the framework, as well as complementary components that add functionality.
Hadoop works in the form of clusters: it uses not one but several servers (called nodes) in order to support very large workloads. Within a cluster, there are:
- The master node (main node), which administers all the servers’ resources, tracks the completion of each job and manages the worker nodes.
- The worker nodes, which perform the requested computations on portions of the data.
For critical workloads, one can have multiple master nodes to ensure high availability.
The basic elements of Hadoop are:
HDFS
HDFS is the acronym for Hadoop Distributed File System, and it is Hadoop’s storage unit. It is a file system specially designed to store huge datasets, in a variety of formats, across many machines.
There are two components in HDFS:
- NameNode: NameNode is the master daemon (main program). There is only one active NameNode. It manages DataNodes and stores all metadata.
- DataNode: DataNode is the slave daemon (secondary program). There can be multiple DataNodes. It stores the actual data.
HDFS divides data into blocks of at most 128 MB by default. The default block size can be changed depending on processing speed and data distribution needs.
Let’s take an example: a 300 MB file is divided into 3 blocks of 128 MB, 128 MB and 44 MB. The final block only holds the remaining data, so it does not need to take up a full 128 MB. This is how data is stored in a distributed manner in HDFS.
This is how the distributed nature of HDFS comes into its own!
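To make the storage side concrete, here is a minimal Java sketch using Hadoop’s FileSystem API to copy a file into HDFS and list the blocks it was split into, along with the DataNodes holding them. The NameNode address and file paths are placeholders chosen for the example, not values from a real cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
    public static void main(String[] args) throws Exception {
        // The NameNode address below is a placeholder; use your cluster's fs.defaultFS.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; both paths are hypothetical.
        Path local = new Path("/tmp/data-300mb.csv");
        Path remote = new Path("/user/demo/data-300mb.csv");
        fs.copyFromLocalFile(local, remote);

        // A ~300 MB file with the default 128 MB block size is stored
        // as three blocks: 128 MB, 128 MB and ~44 MB.
        FileStatus status = fs.getFileStatus(remote);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block lists the DataNodes that hold one of its replicas.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Run against the 300 MB example above, this would print three block entries, the last one roughly 44 MB long.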
YARN
YARN is the acronym for Yet Another Resource Negotiator. It manages the cluster’s nodes and acts as the resource management unit of Hadoop, allocating CPU, memory, and other resources to the different applications. It can be seen as the program that administers resource management across the cluster.
YARN has two components:
- ResourceManager (master): This is the master daemon (main program). It manages the allocation of resources such as CPU, memory, and network bandwidth.
- NodeManager (slave): This is the slave daemon (secondary program); it reports resource usage back to the ResourceManager.
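As a small illustration of this split of roles, the hedged Java sketch below uses the YARN client API to ask the ResourceManager for a report of the running NodeManagers and the memory and CPU each one offers. It assumes a yarn-site.xml describing a real cluster is on the classpath and uses the Hadoop 3.x accessor names.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml from the classpath;
        // the ResourceManager address it points to is cluster-specific.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a report on every running NodeManager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            // getMemorySize() is the Hadoop 3.x accessor; older versions expose getMemory().
            System.out.println(node.getNodeId()
                    + " memory(MB)=" + node.getCapability().getMemorySize()
                    + " vcores=" + node.getCapability().getVirtualCores());
        }
        yarnClient.stop();
    }
}
```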
MapReduce
MapReduce makes it possible to process large volumes of data in parallel across a distributed cluster. Since it is a programming model, MapReduce can be used from many programming languages: Java, Scala, R, Python, C++, etc.
It is mainly used with the Hadoop framework, because MapReduce relies on Hadoop’s cluster mode of operation to take advantage of the large amount of data available.
This matters because the defining characteristic of MapReduce is parallel processing: when the processing is done, the partial results coming from the multiple servers of the cluster are retrieved and aggregated to obtain a final result.
Also taking advantage of the Hadoop ecosystem, MapReduce uses the HDFS distributed file system, which allows each server in the cluster to read and write disk-based data. Each file is therefore accessible by any node of the cluster, which makes it possible to execute a job independently of the others.
Thus MapReduce can process very large volumes of data, both because the work is spread across several servers in the cluster and because each of those servers uses the same file system.
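The classic illustration of this model is word counting. The sketch below is the standard WordCount pattern written against the Hadoop MapReduce Java API: the map phase runs in parallel on each input block, and the reduce phase aggregates the partial counts into the final result. The input and output locations are simply the HDFS paths passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each worker emits (word, 1) for every word in its block of input.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the counts produced in parallel are aggregated per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Input and output are HDFS paths passed on the command line, e.g.
        //   hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner is a common optimization: partial sums are computed on each worker before the shuffle, which reduces the volume of data exchanged between nodes.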
Common
Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules. It is considered as the base/core of the framework, as it provides essential services and basic processes such as the abstraction of the underlying operating system and its file system.
It also contains the Java Archive (JAR) files and scripts needed to get started with Hadoop and provides source code and documentation, as well as a contribution section that includes various Hadoop community projects.
Hadoop Common is also known as Hadoop Core.
Other elements of the Hadoop ecosystem
Hadoop is also an ecosystem: popular applications that are not strictly part of the core Hadoop modules, but are frequently used in conjunction with them, include:
- Hive is data warehouse software that runs on Hadoop and lets users work with data stored in HDFS using an SQL-like query language called HiveQL (a short connection sketch follows this list).
- Impala is the open source native analytical database for Hadoop.
- Pig is a tool typically used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data streams. Pig allows you to perform operations such as joining, filtering, sorting and loading data.
- HBase and Cassandra are two particularly popular non-relational (NoSQL) databases on Hadoop.
- ZooKeeper is a centralized service enabling highly reliable distributed processing. In particular, it is used in conjunction with Apache Kafka for streaming data in real time.
- Sqoop is a tool designed to efficiently transfer data between Hadoop and structured datastores such as relational databases.
- Oozie is a workflow scheduling system for managing Hadoop tasks. Oozie Workflow tasks are directed acyclic graphs (DAGs) of actions.
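To give a feel for how one of these tools is used from application code, here is a small, hypothetical sketch that connects to HiveServer2 over JDBC and runs a HiveQL query. The server address, credentials, table and column names are invented for the example, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 URL, database, table and column names are hypothetical;
        // requires the hive-jdbc driver (org.apache.hive.jdbc.HiveDriver) on the classpath.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but is executed over files stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS visits FROM web_logs GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("visits"));
            }
        }
    }
}
```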
What are the benefits of Apache Hadoop?
Hadoop makes it possible to meet the challenge of the sheer scale of Big Data thanks to its many strengths, among which we can mention:
- Resilience: Data stored on one cluster node is replicated to other nodes, ensuring Hadoop’s fault tolerance. If a node goes down, the other servers in the cluster still have a backup copy of the data.
- Scalability: Unlike traditional systems that have limited storage capacity, Hadoop is scalable because it operates in a distributed environment. If needed, the configuration can be easily extended by installing other servers, and the storage capacity can thus reach several petabytes.
- Moderate cost: Since Hadoop is an open source framework requiring no license, the costs of this solution are significantly lower than those of traditional relational databases. Furthermore, the use of inexpensive standard hardware explains the moderate cost of this solution.
- Speed: The distributed file system, concurrent processing and the MapReduce model make it possible to execute the most complex queries in seconds.
- Data diversity: HDFS can store different data formats: structured, unstructured (e.g. videos) or semi-structured (e.g. XML files). When storing data, it is not necessary to validate it against a predefined schema: the data can be loaded in any format, and on retrieval it is parsed and used by applying whatever schema is required. This flexibility makes it possible to derive different insights from the same data.
In short, Hadoop is used to store and process big data and to analyze both structured and unstructured data, through a collection of components that scale from a single server to thousands of machines.
What are the limitations of Apache Hadoop?
Although Hadoop is an extremely powerful Big Data tool, it still has some limitations.
- It is limited in its ability to process data in real time, because it is designed for large-scale processing tasks that can take hours or days to run.
- It has difficulty handling data with variable structure or unstructured data, such as images, videos and geolocation data.
- It may have scalability issues for memory-intensive processing tasks.
- It can be difficult to configure and manage, especially for users who are not experienced in cluster administration.
Despite these limitations, for which alternatives and specialized tools exist, Apache Hadoop remains an essential building block of corporate data architectures. Whether for running large processing jobs or for meeting durable data storage needs, the Hadoop ecosystem has not said its last word.