Advanced Map Reduce Program: Patent Citation

Problem

Derive the reverse patent citation list: for each patent, find all the patents that cite it.

 

Dataset

Visit http://nber.org/patents/

 

Download the zip file: http://nber.org/patents/acite75_99.zip

Unzipping it should yield cite75_99.txt

Each line of this file is a citation pair: a citing patent followed by the patent it cites.

This post explores the problem of reversing that relationship. For each patent, we want to determine all of its references, that is, the patents that cite it.
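Before turning to Hadoop, here is a small plain-Java sketch of the transformation itself, run on a few in-memory citation pairs (the class name is made up for illustration; the sample pairs match records from the dataset):

```java
import java.util.*;

// Toy illustration, independent of Hadoop: reverse a citation list in memory.
public class ReverseCitationsDemo {
    // Each input line is "CITING,CITED"; the result maps CITED -> all CITING patents.
    public static Map<String, List<String>> reverse(List<String> lines) {
        Map<String, List<String>> citedBy = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            citedBy.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
        }
        return citedBy;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("3964859,1", "4647229,1", "4539112,10000");
        System.out.println(reverse(lines)); // {1=[3964859, 4647229], 10000=[4539112]}
    }
}
```

The MapReduce job below does exactly this, except the grouping by cited patent is performed by the shuffle phase rather than by an in-memory map.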

 

Number of Records

$ wc cite75_99.txt

16522439 16522439 264075431 cite75_99.txt

 

16.5 million records

 

Map Reduce Program

 

package mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PatentCitation {

    public static class PatentCitationMapper extends Mapper<Text, Text, Text, Text> {
        // KeyValueTextInputFormat splits key from value on a tab by default;
        // since the input is comma-separated, the whole line "CITING,CITED"
        // arrives as the key and the value is empty.
        @Override
        public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            String[] citation = key.toString().split(",");
            Text citing = new Text(citation[0]);
            Text cited = new Text(citation[1]);
            // Emit (cited, citing) to invert the relationship.
            context.write(cited, citing);
        }
    }

    public static class PatentCitationReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Build a comma-separated list of all patents citing this one.
            StringBuilder csv = new StringBuilder();
            for (Text value : values) {
                if (csv.length() > 0) {
                    csv.append(",");
                }
                csv.append(value.toString());
            }
            context.write(key, new Text(csv.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Hadoop Patent Citation Example");
        job.setJarByClass(PatentCitation.class);
        job.setMapperClass(PatentCitationMapper.class);
        job.setReducerClass(PatentCitationReducer.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] is the input directory, args[1] the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package this class as patentcitation.jar

 

Set up the environment

Either download the Hadoop distribution from the Apache web site or use a sandbox such as the Hortonworks Sandbox.

 

Create HDFS Directories

[root@sandbox]# hadoop fs -mkdir /user/captain

[root@sandbox]# hadoop fs -mkdir /user/captain/input

 

We have created an input directory in HDFS. Let us copy the patent citation file into HDFS.

[root@sandbox]# hdfs dfs -copyFromLocal input/cite75_99.txt /user/captain/input

 

Run the Map Reduce Program

yarn jar patentcitation.jar PatentCitation /user/captain/input /user/captain/output

 

15/09/09 19:45:22 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox:8188/ws/v1/timeline/

15/09/09 19:45:22 INFO client.RMProxy: Connecting to ResourceManager at sandbox/192.168.112.132:8050

15/09/09 19:45:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

15/09/09 19:45:23 INFO input.FileInputFormat: Total input paths to process : 1

15/09/09 19:45:23 INFO mapreduce.JobSubmitter: number of splits:2

15/09/09 19:45:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441826432501_0004

15/09/09 19:45:23 INFO impl.YarnClientImpl: Submitted application application_1441826432501_0004

15/09/09 19:45:23 INFO mapreduce.Job: The url to track the job: http://sandbox:8088/proxy/application_1441826432501_0004/

15/09/09 19:45:23 INFO mapreduce.Job: Running job: job_1441826432501_0004

15/09/09 19:45:29 INFO mapreduce.Job: Job job_1441826432501_0004 running in uber mode : false

15/09/09 19:45:29 INFO mapreduce.Job:  map 0% reduce 0%

15/09/09 19:45:40 INFO mapreduce.Job:  map 17% reduce 0%

15/09/09 19:45:43 INFO mapreduce.Job:  map 24% reduce 0%

15/09/09 19:45:47 INFO mapreduce.Job:  map 29% reduce 0%

15/09/09 19:45:50 INFO mapreduce.Job:  map 33% reduce 0%

15/09/09 19:45:53 INFO mapreduce.Job:  map 41% reduce 0%

15/09/09 19:45:56 INFO mapreduce.Job:  map 50% reduce 0%

15/09/09 19:45:59 INFO mapreduce.Job:  map 58% reduce 0%

15/09/09 19:46:02 INFO mapreduce.Job:  map 65% reduce 0%

15/09/09 19:46:05 INFO mapreduce.Job:  map 69% reduce 0%

15/09/09 19:46:08 INFO mapreduce.Job:  map 77% reduce 0%

15/09/09 19:46:11 INFO mapreduce.Job:  map 89% reduce 0%

15/09/09 19:46:13 INFO mapreduce.Job:  map 92% reduce 0%

15/09/09 19:46:14 INFO mapreduce.Job:  map 98% reduce 0%

15/09/09 19:46:15 INFO mapreduce.Job:  map 100% reduce 0%

15/09/09 19:46:24 INFO mapreduce.Job:  map 100% reduce 68%

15/09/09 19:46:27 INFO mapreduce.Job:  map 100% reduce 69%

15/09/09 19:46:30 INFO mapreduce.Job:  map 100% reduce 71%

15/09/09 19:46:33 INFO mapreduce.Job:  map 100% reduce 73%

15/09/09 19:46:36 INFO mapreduce.Job:  map 100% reduce 75%

15/09/09 19:46:39 INFO mapreduce.Job:  map 100% reduce 77%

15/09/09 19:46:42 INFO mapreduce.Job:  map 100% reduce 79%

15/09/09 19:46:45 INFO mapreduce.Job:  map 100% reduce 81%

15/09/09 19:46:48 INFO mapreduce.Job:  map 100% reduce 83%

15/09/09 19:46:51 INFO mapreduce.Job:  map 100% reduce 85%

15/09/09 19:46:54 INFO mapreduce.Job:  map 100% reduce 88%

15/09/09 19:46:57 INFO mapreduce.Job:  map 100% reduce 91%

15/09/09 19:47:00 INFO mapreduce.Job:  map 100% reduce 93%

15/09/09 19:47:03 INFO mapreduce.Job:  map 100% reduce 96%

15/09/09 19:47:06 INFO mapreduce.Job:  map 100% reduce 98%

15/09/09 19:47:09 INFO mapreduce.Job:  map 100% reduce 100%

15/09/09 19:47:11 INFO mapreduce.Job: Job job_1441826432501_0004 completed successfully

15/09/09 19:47:11 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=594240702

FILE: Number of bytes written=891741626

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=264206769

HDFS: Number of bytes written=158078539

HDFS: Number of read operations=9

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=85024

Total time spent by all reduces in occupied slots (ms)=55354

Total time spent by all map tasks (ms)=85024

Total time spent by all reduce tasks (ms)=55354

Total vcore-seconds taken by all map tasks=85024

Total vcore-seconds taken by all reduce tasks=55354

Total megabyte-seconds taken by all map tasks=21256000

Total megabyte-seconds taken by all reduce tasks=13838500

Map-Reduce Framework

Map input records=16522439

Map output records=16522439

Map output bytes=264075431

Map output materialized bytes=297120321

Input split bytes=266

Combine input records=0

Combine output records=0

Reduce input groups=3258984

Reduce shuffle bytes=297120321

Reduce input records=16522439

Reduce output records=3258984

Spilled Records=49567317

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=3356

CPU time spent (ms)=136790

Physical memory (bytes) snapshot=552968192

Virtual memory (bytes) snapshot=2486214656

Total committed heap usage (bytes)=412090368

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=264206503

File Output Format Counters

Bytes Written=158078539

 

 

Copy the result file to a local directory

# hdfs dfs -ls /user/captain/output

Found 2 items

-rw-r--r--   1 root hdfs          0 2015-09-09 19:47 /user/captain/output/_SUCCESS

-rw-r--r--   1 root hdfs  158078539 2015-09-09 19:47 /user/captain/output/part-r-00000

# hdfs dfs -copyToLocal /user/captain/output/part-r-00000 output/

15/09/09 19:50:45 WARN hdfs.DFSClient: DFSInputStream has been closed already

 

View the results

# vi part-r-00000

 

(The first record below is the header line of the input file, reversed along with the data.)

"CITED"  "CITING"

1       3964859,4647229

10000   4539112

100000  5031388

1000006 4714284

1000007 4766693

1000011 5033339

1000017 3908629

1000026 4043055

1000033 4975983,4190903

1000043 4091523

1000044 4082383,4055371

1000045 4290571

1000046 5525001,5918892

1000049 5996916

1000051 4541310

1000054 4946631

1000065 4748968

1000067 4944640,5071294,5312208

1000070 5009029,4928425

1000073 4107819,5474494

1000076 5845593,4867716

1000083 5322091,5566726

1000084 4182197,4683770

1000086 4686189,4839046,4217220,4178246

1000089 5505607,5540869,5505610,5544405,5571464,5505611,5277853,5807591,5395228,5503546

1000094 5713167,4897975,4920718
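Each output line is the cited patent, a tab, and a comma-separated list of the patents that cite it. If you want to process this file further, a minimal parser sketch for one line looks like this (the class name is made up for illustration):

```java
import java.util.*;

// Sketch of a helper that parses one line of part-r-00000:
// "<cited>\t<citing1>,<citing2>,..."
public class OutputLineParser {
    public static Map.Entry<String, List<String>> parse(String line) {
        String[] kv = line.split("\t", 2);              // key and value are tab-separated
        List<String> citing = Arrays.asList(kv[1].split(","));
        return Map.entry(kv[0], citing);
    }

    public static void main(String[] args) {
        Map.Entry<String, List<String>> e = parse("1000086\t4686189,4839046,4217220,4178246");
        System.out.println(e.getKey() + " is cited by " + e.getValue().size() + " patents");
        // 1000086 is cited by 4 patents
    }
}
```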

See Also

Advanced Spark Example: Patent Citation