
Tag Archives: hdfs

  • Install Single Node Hadoop on Mac


    Operating System: Mac OS X Yosemite
    Hadoop Version: 2.7.2

    Prerequisites
    We need to enable SSH to localhost without a passphrase.

    Go to System Preferences → Sharing and enable “Remote Login”.

    Now, in a terminal window, verify that the following succeeds without prompting for a passphrase:
    $ ssh localhost
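If ssh localhost still prompts for a passphrase, one common fix is to generate a passphrase-less key pair and authorize it for the local account. A minimal sketch, assuming the default key path and no existing keys you care about:

```shell
# One-time setup: create a passphrase-less key only if none exists yet
mkdir -p ~/.ssh && chmod 700 ~/.ssh
if [ ! -f ~/.ssh/id_rsa ]; then
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
fi
# Authorize the public key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```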

    Download Hadoop Distribution
    Download the Hadoop 2.7.2 distribution from http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.7.2/

    Hadoop Configuration Files

    Go to the directory where your hadoop distribution is installed.

    Then edit the following files:
    hadoop_distro/etc/hadoop/hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    

    hadoop_distro/etc/hadoop/core-site.xml

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
    

    hadoop_distro/etc/hadoop/yarn-site.xml

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>
    

    hadoop_distro/etc/hadoop/mapred-site.xml (if only mapred-site.xml.template exists, copy it to mapred-site.xml first)

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    

    Format HDFS
    $ bin/hdfs namenode -format

    Start HDFS
    $ sbin/start-dfs.sh

    Start YARN
    $ sbin/start-yarn.sh

  • Advanced Map Reduce Program: Patent Citation


    Problem

    Derive Reverse Patent Citation.

     

    Dataset

    Visit http://nber.org/patents/

     

    Download the zip file: http://nber.org/patents/acite75_99.zip

    Unzipping it should yield cite75_99.txt.

    Each line of this file is a pair: a citing patent and the patent it cites.

    This post explores the problem of reversing that relationship: for each patent, we want the list of patents that cite it.
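Before writing any MapReduce code, the reversal can be sketched on a toy sample with standard Unix tools (the pairs below are made up, not from the real dataset). This mirrors what the job will do: swap each (citing, cited) pair so the cited patent becomes the key, bring equal keys together, then concatenate the citing patents per key:

```shell
# Toy input: each line is "CITING,CITED"
# 1) swap each pair so CITED becomes the key
# 2) sort so equal keys are adjacent (the shuffle phase, in spirit)
# 3) concatenate all CITING values per CITED key (the reduce phase)
printf '1,5\n2,5\n3,6\n' \
  | awk -F',' '{print $2","$1}' \
  | sort -t',' -k1,1 \
  | awk -F',' '{a[$1] = a[$1] ? a[$1] "," $2 : $2} END {for (k in a) print k "\t" a[k]}' \
  | sort
# prints:
# 5	1,2
# 6	3
```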

     

    Number of Records

    $ wc cite75_99.txt

    16522439 16522439 264075431 cite75_99.txt

     

    16.5 million records
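As a reminder, wc reports lines, words, and bytes, in that order; the equal line and word counts above confirm there is exactly one comma-separated pair (no spaces) per line. A quick check of the field order on a toy file (path is arbitrary):

```shell
# wc prints lines, words, bytes, then the filename
printf 'a,b\nc,d\n' > /tmp/wc_demo.txt
wc /tmp/wc_demo.txt
# the three numbers are: 2 lines, 2 words, 8 bytes
```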

     

    Map Reduce Program

     

    package mapreduce;

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PatentCitation {

        public static class PatentCitationMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each input line is "citing,cited"; the line contains no tab,
                // so KeyValueTextInputFormat delivers the whole line as the key.
                String[] citation = key.toString().split(",");
                Text citing = new Text(citation[0]);
                Text cited = new Text(citation[1]);
                // Emit with the cited patent as the key to reverse the relation.
                context.write(cited, citing);
            }
        }

        public static class PatentCitationReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Concatenate all citing patents for this cited patent into a CSV list.
                StringBuilder csv = new StringBuilder();
                for (Text citing : values) {
                    if (csv.length() > 0) {
                        csv.append(",");
                    }
                    csv.append(citing.toString());
                }
                context.write(key, new Text(csv.toString()));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Hadoop Patent Citation Example");
            job.setJarByClass(PatentCitation.class);
            job.setMapperClass(PatentCitationMapper.class);
            job.setReducerClass(PatentCitationReducer.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            // args[0] is the input directory, args[1] the output directory
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
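One subtlety: KeyValueTextInputFormat splits each input line at the first tab character. Since cite75_99.txt contains no tabs, the entire citing,cited line arrives as the map key with an empty value, which is why the mapper splits key.toString() on the comma rather than using value. The same split rule can be mimicked with awk on a made-up sample line (a toy illustration, not Hadoop itself):

```shell
# A line with no tab: the whole line lands in the "key", the "value" is empty
printf '3858241,956203\n' | awk -F'\t' '{print "key=[" $1 "] value=[" $2 "]"}'
# prints: key=[3858241,956203] value=[]
```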
    
    

    Package this class and its inner classes as patentcitation.jar (compile against the Hadoop client libraries on the classpath).

     

    Set up the environment

    Either download the Hadoop distribution from the Apache web site or use a sandbox such as the Hortonworks Sandbox.

     

    Create HDFS Directories

    [root@sandbox]# hadoop fs -mkdir /user/captain

    [root@sandbox]# hadoop fs -mkdir /user/captain/input

     

    We have created an input directory in HDFS. Let us copy the patent citation file into HDFS.

    [root@sandbox]# hdfs dfs -copyFromLocal input/cite75_99.txt /user/captain/input

     

    Run the Map Reduce Program

    yarn jar patentcitation.jar PatentCitation /user/captain/input /user/captain/output

     

    15/09/09 19:45:22 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox:8188/ws/v1/timeline/

    15/09/09 19:45:22 INFO client.RMProxy: Connecting to ResourceManager at sandbox/192.168.112.132:8050

    15/09/09 19:45:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

    15/09/09 19:45:23 INFO input.FileInputFormat: Total input paths to process : 1

    15/09/09 19:45:23 INFO mapreduce.JobSubmitter: number of splits:2

    15/09/09 19:45:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441826432501_0004

    15/09/09 19:45:23 INFO impl.YarnClientImpl: Submitted application application_1441826432501_0004

    15/09/09 19:45:23 INFO mapreduce.Job: The url to track the job: http://sandbox:8088/proxy/application_1441826432501_0004/

    15/09/09 19:45:23 INFO mapreduce.Job: Running job: job_1441826432501_0004

    15/09/09 19:45:29 INFO mapreduce.Job: Job job_1441826432501_0004 running in uber mode : false

    15/09/09 19:45:29 INFO mapreduce.Job:  map 0% reduce 0%

    15/09/09 19:45:40 INFO mapreduce.Job:  map 17% reduce 0%

    15/09/09 19:45:43 INFO mapreduce.Job:  map 24% reduce 0%

    15/09/09 19:45:47 INFO mapreduce.Job:  map 29% reduce 0%

    15/09/09 19:45:50 INFO mapreduce.Job:  map 33% reduce 0%

    15/09/09 19:45:53 INFO mapreduce.Job:  map 41% reduce 0%

    15/09/09 19:45:56 INFO mapreduce.Job:  map 50% reduce 0%

    15/09/09 19:45:59 INFO mapreduce.Job:  map 58% reduce 0%

    15/09/09 19:46:02 INFO mapreduce.Job:  map 65% reduce 0%

    15/09/09 19:46:05 INFO mapreduce.Job:  map 69% reduce 0%

    15/09/09 19:46:08 INFO mapreduce.Job:  map 77% reduce 0%

    15/09/09 19:46:11 INFO mapreduce.Job:  map 89% reduce 0%

    15/09/09 19:46:13 INFO mapreduce.Job:  map 92% reduce 0%

    15/09/09 19:46:14 INFO mapreduce.Job:  map 98% reduce 0%

    15/09/09 19:46:15 INFO mapreduce.Job:  map 100% reduce 0%

    15/09/09 19:46:24 INFO mapreduce.Job:  map 100% reduce 68%

    15/09/09 19:46:27 INFO mapreduce.Job:  map 100% reduce 69%

    15/09/09 19:46:30 INFO mapreduce.Job:  map 100% reduce 71%

    15/09/09 19:46:33 INFO mapreduce.Job:  map 100% reduce 73%

    15/09/09 19:46:36 INFO mapreduce.Job:  map 100% reduce 75%

    15/09/09 19:46:39 INFO mapreduce.Job:  map 100% reduce 77%

    15/09/09 19:46:42 INFO mapreduce.Job:  map 100% reduce 79%

    15/09/09 19:46:45 INFO mapreduce.Job:  map 100% reduce 81%

    15/09/09 19:46:48 INFO mapreduce.Job:  map 100% reduce 83%

    15/09/09 19:46:51 INFO mapreduce.Job:  map 100% reduce 85%

    15/09/09 19:46:54 INFO mapreduce.Job:  map 100% reduce 88%

    15/09/09 19:46:57 INFO mapreduce.Job:  map 100% reduce 91%

    15/09/09 19:47:00 INFO mapreduce.Job:  map 100% reduce 93%

    15/09/09 19:47:03 INFO mapreduce.Job:  map 100% reduce 96%

    15/09/09 19:47:06 INFO mapreduce.Job:  map 100% reduce 98%

    15/09/09 19:47:09 INFO mapreduce.Job:  map 100% reduce 100%

    15/09/09 19:47:11 INFO mapreduce.Job: Job job_1441826432501_0004 completed successfully

    15/09/09 19:47:11 INFO mapreduce.Job: Counters: 49

    File System Counters

    FILE: Number of bytes read=594240702

    FILE: Number of bytes written=891741626

    FILE: Number of read operations=0

    FILE: Number of large read operations=0

    FILE: Number of write operations=0

    HDFS: Number of bytes read=264206769

    HDFS: Number of bytes written=158078539

    HDFS: Number of read operations=9

    HDFS: Number of large read operations=0

    HDFS: Number of write operations=2

    Job Counters

    Launched map tasks=2

    Launched reduce tasks=1

    Data-local map tasks=2

    Total time spent by all maps in occupied slots (ms)=85024

    Total time spent by all reduces in occupied slots (ms)=55354

    Total time spent by all map tasks (ms)=85024

    Total time spent by all reduce tasks (ms)=55354

    Total vcore-seconds taken by all map tasks=85024

    Total vcore-seconds taken by all reduce tasks=55354

    Total megabyte-seconds taken by all map tasks=21256000

    Total megabyte-seconds taken by all reduce tasks=13838500

    Map-Reduce Framework

    Map input records=16522439

    Map output records=16522439

    Map output bytes=264075431

    Map output materialized bytes=297120321

    Input split bytes=266

    Combine input records=0

    Combine output records=0

    Reduce input groups=3258984

    Reduce shuffle bytes=297120321

    Reduce input records=16522439

    Reduce output records=3258984

    Spilled Records=49567317

    Shuffled Maps =2

    Failed Shuffles=0

    Merged Map outputs=2

    GC time elapsed (ms)=3356

    CPU time spent (ms)=136790

    Physical memory (bytes) snapshot=552968192

    Virtual memory (bytes) snapshot=2486214656

    Total committed heap usage (bytes)=412090368

    Shuffle Errors

    BAD_ID=0

    CONNECTION=0

    IO_ERROR=0

    WRONG_LENGTH=0

    WRONG_MAP=0

    WRONG_REDUCE=0

    File Input Format Counters

    Bytes Read=264206503

    File Output Format Counters

    Bytes Written=158078539

     

     

    Copy the result file to local directory

    # hdfs dfs -ls /user/captain/output

    Found 2 items

    -rw-r--r--   1 root hdfs          0 2015-09-09 19:47 /user/captain/output/_SUCCESS

    -rw-r--r--   1 root hdfs  158078539 2015-09-09 19:47 /user/captain/output/part-r-00000

    # hdfs dfs -copyToLocal /user/captain/output/part-r-00000 output/

    15/09/09 19:50:45 WARN hdfs.DFSClient: DFSInputStream has been closed already

     

    View the results. Note that the first line below is the quoted header record from the input CSV; the job processed it like any other line.

    # vi part-r-00000

     

    "CITED" "CITING"

    1       3964859,4647229

    10000   4539112

    100000  5031388

    1000006 4714284

    1000007 4766693

    1000011 5033339

    1000017 3908629

    1000026 4043055

    1000033 4975983,4190903

    1000043 4091523

    1000044 4082383,4055371

    1000045 4290571

    1000046 5525001,5918892

    1000049 5996916

    1000051 4541310

    1000054 4946631

    1000065 4748968

    1000067 4944640,5071294,5312208

    1000070 5009029,4928425

    1000073 4107819,5474494

    1000076 5845593,4867716

    1000083 5322091,5566726

    1000084 4182197,4683770

    1000086 4686189,4839046,4217220,4178246

    1000089 5505607,5540869,5505610,5544405,5571464,5505611,5277853,5807591,5395228,5503546

    1000094 5713167,4897975,4920718
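Each output line is tab-separated: the cited patent, then a comma-separated list of the patents citing it. Counting how many patents cite a given one is then a one-liner with awk, shown here on a truncated made-up sample line in place of the real part-r-00000:

```shell
# Each output line: CITED<TAB>CITING1,CITING2,...
printf '1000089\t5505607,5540869,5505610\n' \
  | awk -F'\t' '{n = split($2, a, ","); print $1, n}'
# prints: 1000089 3
```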