Friday, November 6, 2015

Sort or Uniq Very Large Dataset


What is this script for?
The purpose of this sorting implementation is to have a script that runs on a typical Linux machine with minimal third-party software dependencies and handles a large amount of data. The script can scale up and use the system's resources to the maximum. It can also be used for finding unique elements; just change the 'OPERATION' variable to 'UNIQ' (uncomment one line, as shown just below).
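For example, switching from sort mode to uniq mode only requires flipping these two lines in the configuration section of the script (comment out the SORT line and uncomment the UNIQ line):

#OPERATION="SORT"      # for sorting input file
OPERATION="UNIQ"       # for finding unique elements in input file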

Even on a file that is not especially large, the script can be faster than running Linux's 'sort' command directly. 'sort' implements merge sort, which has time complexity O(n log n). Because this script sorts smaller chunks of the file at a time, each chunk sort works on a much smaller n, those chunk sorts run in parallel across cores, and the final merge pass is comparatively cheap.
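As a rough back-of-envelope illustration (the figures are assumed purely for illustration): sorting 10^9 lines in a single pass costs about n × log2(n) ≈ 10^9 × 30 comparisons, while sorting 2,000 chunks of 500,000 lines each costs about 10^9 × log2(500,000) ≈ 10^9 × 19 comparisons in total, and those chunk sorts are spread across all available cores before the cheaper merge pass.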


Download script here

How to use?
Copy the script to the folder containing the input data and run the following command:
$ ./gol_sort.sh <input_filename>
Note: Make sure you have bash 4.0 or a later version (run echo $BASH_VERSION to check).
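For example, assuming the script is saved as gol_sort.sh next to an input file named data.txt (both names are just placeholders):

$ chmod +x gol_sort.sh
$ ./gol_sort.sh data.txt
$ head data.txt.sort      # sorted output; data.txt.uniq when OPERATION="UNIQ"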

Script:
### Objective: Program to sort the entries in a multi TB file (pseudo map-reduce). 
### Author: Dr. Suresh Golconda (suresh_golconda@yahoo.com)
### Version: 2.0

### Usage: $ ./gol_sort.sh <inputfile>

## Requirement: 
##     Make sure you are running BASH 4.0 or a later version. Have not spent enough time to adapt it to older versions.

## Assumptions:
# 1. There is no other file in the folder whose name starts with "_tmp"
# 2. There is enough free disk space to create a copy of the input file and to store the output file. In the worst case I would suggest having free disk space equal to twice the size of the input file.
# 3. Two instances of the script must not be run simultaneously in the same folder, as the temporary (_tmp*) files they generate may conflict. Will fix this in the next version.

## Performance tuning
# To make best use of the script, set 'temp_filesize' smaller than the number of lines in your input file.
# 'temp_filesize' is the number of lines of input data that are processed as a unit.
# Set an optimal value for 'temp_filesize' based on the available RAM, the number of cores, and the number of lines in the input file (rough example below).
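# Rough illustrative sizing (assumed figures, not measurements): with
# temp_filesize=500000, an input of about 15 billion lines is split into
# roughly 30,000 chunk files; for a ~1 TB input that is about 33 MB per
# chunk, so each per-chunk sort fits comfortably in RAM.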


## Notes:
# If for any reason the script is terminated (e.g. Ctrl-C), do not forget to manually delete all _tmp* files before restarting the script.
# If the input data is numeric, uncomment the "extra_sort_flags='-n'" line below.

# Future work: 
# Use a random number plus the '_tmp' prefix to allow two instances of the script to run simultaneously in the same folder
# Estimate RAM and disk requirements.
# Document failure symptoms (e.g. leftover temp files)
# The script should take an argument to select 'numeric' or 'string' input.

# Troubleshooting: 
# ----------------
# Q: The sort step fails with a message indicating that the mysubsort1() / mysubuniq1() function cannot be found.
# A: Make sure you are running bash 4.0 or a later version

# Q: Errors about 'expr'.
# A: You can ignore these for now; they only affect printing the time-taken summary.


###---- Default value: You may want to change
parallel=`getconf _NPROCESSORS_ONLN`   # To use a fixed number of cores, overwrite this variable with the number of cores to use
temp_filesize=500000   #size in terms of number of lines in each temp file
OPERATION="SORT"       # for sorting input file
#OPERATION="UNIQ"      # for finding unique elements in input file
#extra_sort_flags='-n' # Uncomment this line if the input data is numeric


##---- Sanity-check-1: Minimum bash version 
min_bash_major_version_num=4
major_version_num=${BASH_VERSION:0:1}
if (( major_version_num < min_bash_major_version_num)); then
    echo "Please run on bash version $min_bash_major_version_num or later. Your bash version $BASH_VERSION"
    echo "To check your present version of bash> echo \$BASH_VERSION"
   exit
fi
  
###---- Sanity-check-2: correct number of arguments
if [ "$#" -ne 1 ]; then
    pres_script_name=`basename "$0"`
    echo "Usage: $pres_script_name <input_file>"
    exit
fi
filename=$1

if [ "$OPERATION" == 'SORT' ]; then
    echo "Attempting to sort..."
    output_filename=${filename}.sort
fi

if [ "$OPERATION" == 'UNIQ' ]; then
    echo "Attempting to find uniq..."
    output_filename=${filename}.uniq
fi


##### Functions definition ########## 
## sort elements using unix 'sort' command
function mysubsort1 {
    subfilename=$1
    subfilename_tmp=$1".u"

    sort $extra_sort_flags ${subfilename} > ${subfilename_tmp}
    mv ${subfilename_tmp} ${subfilename}
}
export -f mysubsort1

## find unique elements using linux's sort| uniq commands
mysubuniq1()
{
    subfilename=$1
    subfilename_tmp=$1".u"

    sort  $extra_sort_flags ${subfilename} | uniq > ${subfilename_tmp}
    mv ${subfilename_tmp} ${subfilename}
}
export -f mysubuniq1




### Actual logic ####
echo "1/4: splitting inputfile, cores available ${parallel}"
tm1=`date +%s000`  ## Splitting input file
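# The split below writes fixed-size chunks named _tmpaaaaaaa, _tmpaaaaaab, ...
# (7-character suffixes from '-a 7'), each holding ${temp_filesize} lines
# (the last chunk may be shorter).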
    split -a 7 -l${temp_filesize} ${filename} '_tmp';
tm2=`date +%s000`



echo "2/4: processing individual files" ## processing individual files
if [ "$OPERATION" == 'SORT' ]; then # if sorting input
    ls _tmp* | xargs -P ${parallel} -I {} bash -c 'mysubsort1 "$@"' _ {}
fi
if [ "$OPERATION" == 'UNIQ' ]; then # if finding uniq elements
    ls _tmp* | xargs -P ${parallel} -I {} bash -c 'mysubuniq1 "$@"' _ {}
fi
tm3=`date +%s000`



echo "3/4: merging results"  ## merging results
if [ "$OPERATION" == 'SORT' ]; then # if sorting
    sort  $extra_sort_flags -m _tmp* -o ${output_filename}
fi
if [ "$OPERATION" == 'UNIQ' ]; then # if finding uniq elements
    sort  $extra_sort_flags -m _tmp* | uniq > ${output_filename}
fi
tm4=`date +%s000`


echo "4/4: cleaning"  ## cleaning
    rm _tmp*
tm5=`date +%s000`



printf "Time taken in ms: total (%'.d), splitting (%'.d), sorting/uniq (%'.d), merging (%'.d), cleaning (%'.d)\n" `expr $tm5 - $tm1` `expr $tm2 - $tm1` `expr $tm3 - $tm2` `expr $tm4 - $tm3` `expr $tm5 - $tm4`
                                                
---------------- 

Description:
  • Implements a pseudo map-reduce approach using a plain bash script.
  • Automatically detects the number of cores on the machine and tries to use all of them.
  • Scales up fairly easily; it is limited mainly by the free hard-drive space available for temporary files.
  • Has a few parameters for tuning performance. Will describe these when I come back to this blog.
  • Uses multiple cores without third-party tools such as GNU parallel.
  • Is not limited by RAM size, but the size of the split files has to be set for best performance.
  • Tested on a 1 TB file with 15 billion records on a single machine (8 cores, 4 GB free RAM).
  • Each line of the input file can be a string (e.g. a sentence) or a number. By default the script treats lines as strings; it can be modified to sort numerics (see the snippet after this list).
  • Prints a summary of the time taken in each of the 4 steps.
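For numeric input, the only change needed is to uncomment the extra_sort_flags line in the configuration section of the script, so that every underlying sort call compares numerically:

extra_sort_flags='-n'   # pass -n to each 'sort' invocation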

Limitations:
  • Not tested with a wide variety of strings containing special characters.
  • Need to evaluate whether there is any limitation on the length of each line.
  • Two instances of this script cannot be run simultaneously in the same folder, as their temporary files would conflict.
  • Input data is assumed to be in a single file; the script could be modified to accept a pattern matching a list of files.


Notes:
  • On a typical multi-core machine the performance is limited by disk speed. At present the whole data set has to be read at most twice. The splitting step (writing chunks back to disk) could probably be avoided by streaming the contents of the source file to each core; a few techniques were tried, but performance decreased.
  • If you get out-of-memory errors while sorting the individual files, decrease the number of lines put in each split file.
  • If you get out-of-memory errors while merging, increase the number of lines put in each split file (a rough sizing sketch follows this list).
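As a rough sizing sketch (the variable names and the 50% head-room factor below are assumptions for illustration, not part of the script), temp_filesize can be estimated from the average line length and the memory available to each parallel sort:

# sample the input to estimate average bytes per line (data.txt is a placeholder name)
avg_line_bytes=$(( $(head -n 1000 data.txt | wc -c) / 1000 ))
# available memory in kB (Linux-specific)
free_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
# give each of the ${parallel} per-chunk sorts roughly half of its share of RAM
temp_filesize=$(( free_kb * 1024 / 2 / parallel / avg_line_bytes ))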

Troubleshooting:
Q: The script fails with a message stating that it cannot find the mysubsort1() / mysubuniq1() function.
A: Make sure you are running bash 4.0 or a later version. Use echo $BASH_VERSION to check your version.

Q: Errors about 'expr'.
A: You can ignore these for now; they only affect printing the time-taken summary.



Future work: 
  • Use a random number plus the '_tmp' prefix so that two instances of the script can run simultaneously in the same folder.
  • Estimate RAM and disk requirements.
  • Document failure symptoms (e.g. leftover temp files).
  • The script should take an argument to select 'numeric' or 'string' input.
  • An early evaluation of using a hash map to sort individual files did not perform well; the approach needs to be re-evaluated.
  • Early attempts to sort the contents of the file while splitting failed; need to try again. This would cut the time spent writing a copy of the input file back to disk.

Other projects page: https://sites.google.com/site/golcondaprofile/projects

Comments:

  1. Thanks for sharing... Very helpful and interesting idea..

  2. Great job.Looking forward for updated versions.

  3. I'm testing sorting 100GB of data files.
    Can I access the drive to download your script file. I have sent an access request to your share script with email: dang.n.***@gmail
