Saturday, April 23, 2016

Working with Compressed Data in HDFS

I needed to process a large amount of data stored on an external hard drive. The data sits in a single folder containing a number of files, each a few gigabytes in size. The total size of the data is 4.5TB uncompressed, while my machine's hard disk is only 2TB. Intuitively, I could copy the data from the external hard disk straight to HDFS, which is feasible. However, such an approach would consume at least 9TB if the replication factor in HDFS is 2.

One might argue, "why should I care if the HDFS has a huge capacity?". The simplest answer is to consider the two following situations:
(1) The Hadoop cluster is shared between you and others, so everyone using the cluster competes for HDFS space. "This is my case."
(2) Suppose you have a small cluster and the capacity of HDFS is smaller than the size of the data you need to process. In that case, it is simply impossible to fit your data into the cluster without compression.

For the above-mentioned reasons, running Hadoop jobs on compressed data can save your resources and keep your colleagues happy.
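Before deciding to compress, it is worth checking how much free space the cluster actually has and what the default replication factor is. The two commands below are a quick sketch of that check, assuming a Hadoop 2.x client on the path; adjust them for your own cluster.

hdfs getconf -confKey dfs.replication      # default replication factor of the cluster
hadoop fs -df -h /                         # overall HDFS capacity and usage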

Now let's assume a couple of things first. You have massive data that does not fit on your single machine's storage. The data is sliced into a number of files. The following illustration shows the hierarchical directory layout of the data.

----Folder
       | File
       | File
       | Nested Folder
                    | File
                    | File
       | File
       | File
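A quick way to double-check this layout (and count the files) before compressing, assuming the data lives under ./largeData as in the script below:

find ./largeData -type f | head     # peek at the file hierarchy
find ./largeData -type f | wc -l    # total number of files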


First, I compressed the data files using the .bz2 format. The size of the data after compression is around 400GB. Now the data can easily fit on the machine's hard disk, and most importantly, copying it to HDFS will take about 2x of its compressed size (i.e., 800GB). I used the following script to compress the data and copy it to HDFS in one pass.

for i in "$(pwd)"/largeData/*;
do
        # compress in place; bzip2 creates "$i.bz2" and removes the original file
        bzip2 "$i"
        # copy the compressed file to HDFS
        hadoop fs -copyFromLocal "${i}.bz2" /hdfsDir
done
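To verify the copy, I compare the compressed size on the local disk against what landed in HDFS; something along these lines, using the same /hdfsDir target as above:

du -sh largeData/                   # compressed size on the local disk
hadoop fs -du -s -h /hdfsDir        # total size stored under the HDFS directory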


Unfortunately, due to a power cut in my working area, all machines and the Hadoop cluster were shut down. Thus, I wrote another script that compresses and copies the data only for files that do not already exist on HDFS.

#  Louai Alarabi - 02/15/2016
#  Check the files recursively inside the current folder. If a file does not
#  already exist in the HDFS, do the following:
#  1) Compress it to bz2
#  2) Copy the compressed file to the HDFS
#  3) Finally, after all the files are copied, start the Hadoop job
#
hadoop fs -mkdir -p /myHDFSdata
echo "Check files in the target directory..."
files=$(hadoop fs -ls /myHDFSdata/)
for i in $(find "$(pwd)" -name "*.txt");
do
        filename=$(basename "$i" | sed 's/.txt//g')
        echo "$filename"
        if echo "$files" | grep -q "$filename"; then
                echo "$filename exists"
        else
                echo "$filename missing"
                echo "compress file $i"
                bzip2 "$i"
                echo "copy file to the hdfs $i"
                hadoop fs -copyFromLocal "${i}.bz2" /myHDFSdata/
        fi
done
echo "Done compressing and copying the data to the HDFS"
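Once everything is in HDFS, the Hadoop job can read the .bz2 files directly; Hadoop ships with a bzip2 codec, and bzip2 is splittable, so the compressed input can still be divided across mappers. A minimal sketch of launching the job, where myJob.jar and MyJob are placeholders for your own jar and driver class:

hadoop jar myJob.jar MyJob /myHDFSdata /myHDFSoutput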