Saturday, April 23, 2016

Working with Compressed Data in HDFS

I needed to process a large amount of data stored on an external hard disk. The data sits in a single folder that contains a number of files, each a few gigabytes in size. The total size of the data is 4.5TB uncompressed, while my machine's hard disk is only 2TB. The obvious approach is to copy the data from the external hard disk straight into HDFS, which is feasible, but it would consume at least 9TB of HDFS space with a replication factor of 2.

One might argue, "why should I care if HDFS has a huge capacity?" The simplest answer is to consider the following two situations:
(1) The Hadoop cluster is shared between you and others, so everyone using the cluster competes for HDFS space. This is my case.
(2) You have a small cluster and the HDFS capacity is smaller than the size of the data you need to process. Clearly, it is impossible to fit your data in such a cluster without compression.

For the above-mentioned reasons, running Hadoop jobs on compressed data can save resources and keep your colleagues happy.
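Before copying anything, it also helps to know how much free space the cluster actually has. A quick way to check from the command line (the root path here is just an example):

# report HDFS capacity, used, and available space in human-readable units
hadoop fs -df -h /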

Now let's assume a couple of things first. You have massive data that does not fit on your single machine's storage, and the data is sliced into a number of files. The following illustration shows the hierarchical directory layout of the data.

----Folder
       | File 1
       | File 2
       | Nested Folder
                    | File 3
                    | File 4
       | File 5
       | File 6
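To see how big the job is before compressing anything, you can count the files and measure the total uncompressed size of such a nested folder from the shell (the folder name largeData is just an example):

# count all files, including those in nested folders
find ./largeData -type f | wc -l
# total uncompressed size of the folder
du -sh ./largeData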


First, I compressed the data files using the bz2 format. The size of the data after compression is around 400GB. Now the data can easily fit on the machine's hard disk, and most importantly, copying it to HDFS will occupy about 2x its compressed size (around 800GB) with a replication factor of 2. I used the following script to compress the files and copy them to HDFS in one pass.

for i in "$(pwd)"/largeData/*
do
             # compress the file in place; bzip2 replaces "$i" with "$i.bz2"
             bzip2 "$i"
             # copy the compressed file into HDFS
             hadoop fs -copyFromLocal "${i}.bz2" /hdfsDir
done
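Once the compressed files are in HDFS, Hadoop jobs can read them directly: the built-in BZip2Codec decompresses .bz2 input transparently, and bz2 is a splittable format, so large files can still be processed by several mappers in parallel. As a minimal sketch (the examples jar path and the output directory are assumptions), the standard word-count example can consume the compressed directory as-is:

# run a MapReduce job directly on the .bz2 files in /hdfsDir
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /hdfsDir /hdfsDir-wordcount-output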


Unfortunately, due to a power cut in my working area, all machines and the Hadoop cluster were shut down. So I wrote another script that compresses and copies each file only if it does not already exist in HDFS.

#  Louai Alarabi - 02/15/2016
#  Recursively check the files inside the current folder; if a file does not
#  already exist in HDFS, then:
#  1) compress it to bz2
#  2) copy the compressed file to HDFS
#  3) finally, after all the files are copied, start the Hadoop job
#
hadoop fs -mkdir -p /myHDFSdata
echo "Check files in the target directory..."
files=$(hadoop fs -ls /myHDFSdata/)
for i in $(find `pwd` -name "*.txt");
do
        filename=$(basename "$i" | sed 's/\.txt$//')
        echo "$filename"
        if echo "$files" | grep -q "$filename"; then
                echo "$filename exists"
        else
                echo "$filename is missing"
                echo "compress file $i"
                bzip2 "$i"
                echo "copy file to the hdfs $i"
                hadoop fs -copyFromLocal "${i}.bz2" /myHDFSdata/
        fi
done
echo "Done compressing and copying the data to HDFS"
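When the script finishes, a quick sanity check (assuming the same /myHDFSdata directory as above) is to count the compressed files and look at the total space they occupy in HDFS:

# number of .bz2 files that made it into HDFS
hadoop fs -ls /myHDFSdata | grep -c '\.bz2$'
# total size in HDFS (multiply by the replication factor for raw disk usage)
hadoop fs -du -s -h /myHDFSdata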

Friday, March 11, 2016

Working with Git Branches

There are numerous reasons why somebody might want to create a new branch. For example, you might need to develop a new feature without reflecting the changes in the master branch until it is ready. Another reason could be fixing a bug and then merging the modifications back into the master branch. Whatever the reason, here are the main commands you need to know to work with git branching:

1- Show remote git URL:
$ git remote show origin 

2- Show git working branch:
$ git branch 

3- Create a new branch from an existing one:
$ git branch <branchName>

4- Switch between branches.
$ git checkout <branchName>

5- Push the newly created branch to the remote server (only needed the first time):
$ git push -u origin <branchName>

6- After that, submitting changes works just like working with a single branch (see the combined example after the command list):
$ git add --all 
$ git commit -m ' your message.' 
$ git pull 
$ git push 

7- To clone a specific branch
$ git clone -b <branch> <remote> 

8- Merge changes from one branch, e.g., <fix>, into the master branch (the add/commit step below is only needed if you have to resolve merge conflicts):

$ git checkout master
$ git pull origin master 
$ git merge fix
$ git add --all 
$ git commit -m "message "
$ git push origin master 
Example (for step 7, cloning a specific branch):
$ git clone -b fixingBug louai@server/project/repository/projectName
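
Putting steps 3 through 6 together, a typical session for a hypothetical branch called newFeature might look like this:

$ git branch newFeature            # create the branch from the current one
$ git checkout newFeature          # switch to the new branch
$ git add --all
$ git commit -m 'add the new feature'
$ git push -u origin newFeature    # the first push publishes the branch on the remote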


Reference : https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell