Even though Hadoop is still a relatively new technology compared to established data warehouse products, it has found its way into many companies in recent years, where it plays an irreplaceable role in the processing and analysis of massive volumes of data. Apache Hadoop is an open-source platform with an active community of developers who, in addition to maintaining and optimizing existing components, continually deliver new features. These gradually find their way into the well-known distributions, be it Cloudera, Hortonworks (which has since merged with Cloudera), or MapR. To deliver modern solutions to our customers, we have to follow these trends and understand their advantages and disadvantages. So, what are the two big innovations in the HDFS and Hive components?
HDFS Erasure Coding
HDFS, the Hadoop Distributed File System, is a key component of most Hadoop installations. It is a highly robust storage system in which disk capacity is spread across many interconnected machines forming a computing cluster. Stored files are divided into blocks of equal size, which by default are replicated three times and placed on different servers. This guarantees data retention even if up to two servers fail, and it supports one of Hadoop’s central ideas: move the computation to the data rather than the data to the computation. With three replicas, the system can choose the optimal node to perform a computation on and can process one file block by block in parallel.
The downside of this concept is that only 33% of the raw cluster capacity is actually usable. Especially for data that we do not need for computations and are merely archiving, this wastes space. Hadoop 3 has therefore introduced Erasure Coding. With this encoding, each block is further divided into smaller cells. The cells are grouped into stripes of a fixed size, and parity cells are computed and added to each group. If some cells in a group are lost, the parity information is sufficient to reconstruct the missing data. Instead of simple replication, we get a mechanism that still guarantees safe data storage while allowing the same HDFS disk capacity to hold more data. Usable capacity can thus rise to roughly 70% of the raw capacity.
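To see where these numbers come from, here is a small back-of-the-envelope sketch in Python. The Reed–Solomon 6+3 and 10+4 layouts are two of the schemes commonly shipped with Hadoop 3, but the exact set of available policies depends on the distribution.

```python
# Rough capacity arithmetic behind the 33% vs. ~70% figures.
def overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored on disk per logical byte of data."""
    return (data_units + parity_units) / data_units

schemes = {
    "3x replication": overhead(1, 2),      # 1 copy of the data + 2 extra replicas
    "Reed-Solomon 6+3": overhead(6, 3),    # 6 data cells + 3 parity cells per group
    "Reed-Solomon 10+4": overhead(10, 4),  # 10 data cells + 4 parity cells per group
}

for name, factor in schemes.items():
    usable = 100 / factor  # share of raw capacity that holds real data
    print(f"{name}: {factor:.2f}x storage overhead, ~{usable:.0f}% of raw capacity usable")
```

A 6+3 group survives the loss of any three cells and a 10+4 group any four, so durability is comparable to, or better than, triple replication, which tolerates the loss of two replicas.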
HDFS Erasure Coding is enabled per directory, so it can be turned on for selected data only: one cluster can hold both data that we actively transform and analyse and data that we merely archive.
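As a rough sketch of how that per-directory switch can be scripted, the snippet below drives the `hdfs ec` command-line tool from Python. It assumes a Hadoop 3 client on the PATH, a cluster where the built-in RS-6-3-1024k policy is available, and an illustrative /data/archive directory; adjust both the path and the policy name to your environment.

```python
import subprocess

def hdfs_ec(*args: str) -> None:
    """Run an `hdfs ec` subcommand and raise if it fails."""
    subprocess.run(["hdfs", "ec", *args], check=True)

# The policy has to be enabled on the cluster before it can be assigned.
hdfs_ec("-enablePolicy", "-policy", "RS-6-3-1024k")

# Attach the policy to an archive directory; files written there afterwards
# are stored as data and parity cells instead of three full replicas.
hdfs_ec("-setPolicy", "-path", "/data/archive", "-policy", "RS-6-3-1024k")

# Check which policy is now in effect for the directory.
hdfs_ec("-getPolicy", "-path", "/data/archive")
```

Note that files already present in the directory keep their original layout; existing archives only benefit from the policy after they are rewritten, for example by copying them into the directory once the policy has been set.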
Transactions in Hive
Hive 3 also comes with interesting new features. Apache Hive is a system for managing large volumes of data stored in HDFS and querying them with SQL. New in this version is support for row-level transactions. In other words, in addition to SELECT and INSERT, we can now use UPDATE and DELETE. This is a big change, mainly because one of the basic principles of Hadoop was “write once, read many”: write the data once and then only query it or transform it into a new data set.
Looking back, the previous version of Hive supported transactions at the level of partitions, constructs that allow data to be organized by specific keys; for example, data from individual days can be stored in separate folders. In practice, however, partition-level transactions meant that changing a single value required overwriting the entire partition.
Why did the Hive developers decide to introduce this change? To support several important use cases. The first is correcting existing records: if we care about data quality, sooner or later a situation arises where some records in a table need to be fixed. Similarly, in connection with GDPR for example, situations arise where we need to delete a user’s data from the database. Last but not least, row-level transactions better support streaming applications, which by the nature of their data often need to update existing records. If individual rows can be edited, there is no need to rewrite unnecessarily large amounts of data.
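As an illustrative sketch of what this looks like from a client, the snippet below uses the third-party PyHive library to talk to HiveServer2. The host name and the customers table are made up, and the table property shown is the usual way to mark an ORC table as transactional.

```python
from pyhive import hive  # third-party HiveServer2 client, used here for illustration

# Connection details are illustrative.
conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cur = conn.cursor()

# Row-level DML requires an ORC-backed table flagged as transactional.
cur.execute("""
    CREATE TABLE IF NOT EXISTS customers (
        id BIGINT,
        email STRING,
        city STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# Correct a single record in place instead of rebuilding the whole data set.
cur.execute("UPDATE customers SET city = 'Prague' WHERE id = 42")

# Remove one user's data, e.g. to honour a GDPR erasure request.
cur.execute("DELETE FROM customers WHERE id = 42")
```

The server side must be configured for ACID as well, typically by enabling hive.support.concurrency and using the DbTxnManager as hive.txn.manager; on distributions where managed tables are transactional by default, the explicit table property is redundant.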
In practice, when using this feature, it is necessary to take into account how Hive executes transactions. The key component of the whole system is the transaction manager, which assigns sequential numbers to individual transactions. On write, the data is stored in two types of files. Base files contain the most recently consolidated version of the data. New rows, updates, and deletions are written as delta files, which record how the data in the base file has been modified since the last consolidation. When a user reads the data, the transaction manager provides information about which delta files need to be read to return the most up-to-date values.
The problem with this architecture is that read performance declines as the number of delta files grows. Hive therefore triggers compaction, governed by configurable parameters, which rewrites the delta files into a single base file. This is a relatively expensive operation. Additionally, the size of the table on disk grows with each update, even if the number of rows stays the same. It should also be noted that transaction processing in Hive offers only a subset of the properties we take for granted in traditional relational databases: COMMIT and ROLLBACK statements are not supported, for example, nor can the isolation level be set. Furthermore, transaction support is limited to tables that store data in the ORC file format.
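Compaction normally runs in the background once thresholds such as the number or relative size of accumulated delta files are exceeded, but it can also be requested by hand. A minimal sketch, assuming the same illustrative connection and customers table as above:

```python
from pyhive import hive  # same illustrative HiveServer2 client as above

cur = hive.connect(host="hive-server.example.com", port=10000, database="default").cursor()

# Ask for a major compaction: base and delta files are rewritten into a new base file.
cur.execute("ALTER TABLE customers COMPACT 'major'")

# Compactions are queued and executed asynchronously in the background,
# so check on their state rather than expecting an immediate effect.
cur.execute("SHOW COMPACTIONS")
for row in cur.fetchall():
    print(row)
```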
In this article, we have described more efficient data storage in HDFS and transaction support in Hive. This is by no means everything that is new in the latest version of Hadoop, and technology in the big data world continues to develop rapidly. Remember, though, that every new feature has its undisputed use cases as well as pitfalls to keep in mind. Adopting any new feature in a project should be preceded by analysis and verification that it will deliver the desired result.
Author: Tomáš Duda
Big Data Engineer