
Tag Archives: big data

  • Strata Hadoop New York City Best Presentations


    Strata Hadoop was held in New York City from September 29 to October 1, 2015. This conference catered to the business side of the big data world.

    There were some great speakers at the conference. The top presentations are listed here.

Please visit http://strataconf.com/big-data-conference-ny-2015/public/schedule/proceedings for slides from the majority of the speakers.

    Top Presentations

    Advanced Data Science with Spark Streaming

    Download (PDF, 650KB)

    Big Data at Netflix

(Slide download unavailable: 404 Not Found.)

    Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud, a real-world case study

    Download (PDF, 5.76MB)

    The business case for Spark, Kafka, and friends

    Download (PDF, 2.97MB)

  • Is Apache Spark ready for Petabyte Scale?


    Download (PDF, 6.76MB)


    Slides courtesy of the Linux Foundation

Ashwin Shankar and Cheolsoo Park from Netflix gave an excellent presentation at the Apache Big Data Conference in Budapest this week on how Netflix uses Apache Spark at petabyte scale.

    There are very few companies that are operating at petabyte scale. Netflix is one of them.

Validation from Netflix at this scale is a great boon for the open source framework, which is gaining immense popularity and adoption in the big data community.

Netflix encountered many issues when running Spark on AWS. They opened many bug reports and contributed solutions to the problems they found.

Netflix has explored Spark on both Mesos and YARN. Their Spark on YARN deployment involved 1,000+ nodes and 100+ TB of memory.

The presentation details several significant problems that Spark on YARN exposed at this scale; see the slides above for the specifics.

This is strong validation from Netflix of Apache Spark for mainstream big data processing.

  • Strata Hadoop NYC – Day 3 Highlights


Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop

    https://www.oreilly.com/ideas/kudu-resolving-transactional-and-analytic-trade-offs-in-hadoop

    Informatica $1M Big Data Ready Challenge

    https://www.informatica.com/ready/big-data-ready.html#fbid=e2_LGhL7Mis

Forbes: Why Expanding Signal Hunting Skills Is Crucial To Big Data Success

    http://www.forbes.com/sites/danwoods/2015/10/01/why-expanding-signal-hunting-skills-is-crucial-to-big-data-success/

    IBM: Day One Highlights

    http://www.ibmbigdatahub.com/blog/highlights-day-one-strata-hadoop-world-2015

One-Click Installs: Moving to Simplicity with Hadoop

    http://siliconangle.com/blog/2015/10/01/one-click-installs-moving-to-simplicity-with-hadoop-bigdatanyc/

TIBCO Software Extends Cloud BI Reach to Apache Spark

    http://talkincloud.com/saas-software-service/tibco-sofware-extends-cloud-bi-reach-apache-spark?utm_campaign=PostBeyond&utm_medium=%234948&utm_source=Twitter&utm_term=TIBCO+Sofware+Extends+Cloud+BI+Reach+to+Apache+Spark

    Big Data and the Creative Destruction of Today’s Business Models

    Big data—enormous data sets virtually impossible to process with conventional technology—offers a big advantage for companies that learn how to harness it. Renowned 20th-century economist Joseph Schumpeter said, “Innovations imply, by virtue of their nature, a big step and a big change … and hardly any ‘ways of doing things’ which have been optimal before remain so afterward.”
    See more at https://www.atkearney.com/analytics/ideas-insights/article/-/asset_publisher/hZFiG2E3WrIP/content/big-data-and-the-creative-destruction-of-today-s-business-models/10192?_101_INSTANCE_hZFiG2E3WrIP_redirect=%2Fanalytics%2Fideas-insights

    Deloitte: Analytics Trends 2015

    http://www2.deloitte.com/us/en/pages/deloitte-analytics/articles/analytics-trends-2015.html

Capgemini: Insight-driven Operations – The Missing Link Between Big Data Architecture and Operations

    Traditional business intelligence architectures, which rely only on well-known relational systems, excel at processing standard business-generated data. See more at https://www.capgemini.com/global-technology-partners/cloudera/insight-driven-operations?utm_source=twitter&utm_medium=pdf&utm_campaign=cloudera_nyc

But today's environment increasingly demands insights derived from larger and more complex datasets. As a result, traditional architectures need to adapt to meet modern needs.


    Booth Pictures

Zaloni

Snowflake Computing

NetApp

    Press Releases

    SAS: Data Management transforms companies from data hoarders to data masters

    http://www.sas.com/en_us/news/press-releases/2015/september/sas-data-management-update.html

  • Advanced Example: Hadoop Map Reduce and Apache Spark


    Objective

Solve an advanced problem using Hadoop MapReduce, then solve the same problem using Apache Spark.

    Hadoop Map Reduce Program

    http://blog.hampisoftware.com/?p=20


    Apache Spark based solution

    http://blog.hampisoftware.com/?p=41


    Differences between the two solutions

    Boilerplate code

In the MapReduce code, you have to write a mapper class and a reducer class.

In Spark, you do not have to write much boilerplate code, and if you use Java 8, lambda expressions cut the boilerplate further (see the sketch below).

In Spark, there is no need for a mapper class or a reducer class. You can perform transformations such as map, flatMap, mapToPair, and reduceByKey directly on an RDD.
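To make this concrete, here is a minimal sketch (not the exact code from this post) of the same reverse-citation pipeline written with Java 8 lambdas; the class name is illustrative, and note that no mapper or reducer class is declared:

    package spark;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PatentCitationLambdas {

      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Patent Citation (lambdas)");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);

        // Reverse each "citing,cited" input line into a (cited, citing) pair
        JavaPairRDD<String, String> reversed = lines.mapToPair(s -> {
          String[] pair = s.split(",");
          return new Tuple2<>(pair[1], pair[0]);
        });

        // Concatenate all citing patents per cited patent, then sort by key
        JavaPairRDD<String, String> cited =
            reversed.reduceByKey((a, b) -> a + "," + b).sortByKey();

        // Drop the tuple parentheses before saving
        cited.map(t -> t._1() + " " + t._2()).saveAsTextFile(args[1]);

        sc.stop();
      }
    }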


    Performance

The Spark-based solution is much faster than the MapReduce solution: around 45 seconds for 16 million records in a virtual machine sandbox.

    Scala

If you write Scala, the Apache Spark solution can be condensed into a few lines of code. In this post, we have solved the patent citation problem using Java.

  • Advanced Spark Java Example: Patent Citation

    Background

If you have not reviewed the earlier post on the Hadoop MapReduce example with patent citation, now is the time; that post provides the problem context.

    In this post, we will use Apache Spark to solve the reverse patent citation problem.

    Environment Setup

Just like the MapReduce exercise, use the latest sandbox from one of the Hadoop vendors such as Hortonworks, Cloudera, or MapR; they ship with Spark integrated.

Add the input file to HDFS as described in the MapReduce post. The file should be placed in an HDFS directory called /user/captain/input.
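If you prefer to script the upload instead of using the command line, here is a minimal sketch using the Hadoop FileSystem Java API; the local source path is a placeholder for wherever you downloaded the file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadInput {
      public static void main(String[] args) throws Exception {
        // Picks up the sandbox HDFS settings from the Hadoop config on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The local source path is hypothetical; adjust to where you saved the file
        fs.copyFromLocalFile(new Path("/root/cite75_99.txt"),
                             new Path("/user/captain/input/cite75_99.txt"));
        fs.close();
      }
    }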

    Program in Apache Spark using Java

package spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class PatentCitation {

  public static void main(String[] args) {
    String appName = "Patent Citation";
    SparkConf sparkConf = new SparkConf().setAppName(appName);
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

    JavaRDD<String> patentCitationFile = javaSparkContext.textFile(args[0]);

    // Map each "citing,cited" line to a reversed (cited, citing) pair
    JavaPairRDD<String, String> lines = patentCitationFile.mapToPair(
        new PairFunction<String, String, String>() {
          @Override
          public Tuple2<String, String> call(String s) throws Exception {
            String[] pair = s.split(",");
            return new Tuple2<String, String>(pair[1], pair[0]);
          }
        });

    // Reduce by key: concatenate all citing patents for each cited patent
    JavaPairRDD<String, String> cited = lines.reduceByKey(
        new Function2<String, String, String>() {
          @Override
          public String call(String s1, String s2) throws Exception {
            return s1 + "," + s2;
          }
        }).sortByKey();

    // Format each tuple as "cited citing1,citing2,..." for storage
    JavaRDD<String> formatted = cited.map(
        new Function<Tuple2<String, String>, String>() {
          @Override
          public String call(Tuple2<String, String> theTuple) throws Exception {
            return theTuple._1() + " " + theTuple._2();
          }
        });

    // Store the formatted RDD in a text file
    formatted.saveAsTextFile(args[1]);
  }
}
    
    

    Description of the program

In the first step, we read the input file line by line and split each line at the comma. This gives us two strings: the first represents the citing patent and the second represents the cited patent.

Since our problem is the reverse, finding all the citing patents for a particular patent, we create a tuple with the order reversed: (cited, citing).

In the second step, we reduce the RDD by key. The reduce function receives two string values for a key at a time, and we simply join them with a comma. For example, the input lines 3964859,1 and 4647229,1 become the single output line 1 3964859,4647229.

Now our RDD is ready for storage. The challenge is that storing the tuples directly would yield parentheses at the beginning and end of each line, so we run one more map transformation that formats each tuple as plain text.

Finally, we save the RDD to a text file.


    Program Run

[sandbox spark]# spark/bin/spark-submit --class spark.PatentCitation --master local --executor-memory 1G --num-executors 10 patentcitation_spark.jar /user/captain/input/cite75_99.txt /user/captain/output


    15/09/15 19:18:45 INFO SparkContext: Running Spark version 1.3.1

15/09/15 19:18:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    15/09/15 19:18:45 INFO SecurityManager: Changing view acls to: root

    15/09/15 19:18:45 INFO SecurityManager: Changing modify acls to: root

    15/09/15 19:18:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)

    15/09/15 19:18:45 INFO Slf4jLogger: Slf4jLogger started

    15/09/15 19:18:45 INFO Remoting: Starting remoting

    15/09/15 19:18:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@sandbox:36543]

15/09/15 19:18:46 INFO Utils: Successfully started service 'sparkDriver' on port 36543.

    15/09/15 19:18:46 INFO SparkEnv: Registering MapOutputTracker

    15/09/15 19:18:46 INFO SparkEnv: Registering BlockManagerMaster

    15/09/15 19:18:46 INFO DiskBlockManager: Created local directory at /tmp/spark-c1f7c70b-349d-4984-b6bb-544080fc4d07/blockmgr-bc783dc8-c929-422c-bbae-1759e45dc59e

    15/09/15 19:18:46 INFO MemoryStore: MemoryStore started with capacity 265.4 MB

    15/09/15 19:18:46 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e0895997-d0e1-4dfe-9fe6-01df51e9a65c/httpd-6a121c63-3f76-47cd-b083-77d722fd95bf

    15/09/15 19:18:46 INFO HttpServer: Starting HTTP Server

    15/09/15 19:18:46 INFO Server: jetty-8.y.z-SNAPSHOT

    15/09/15 19:18:46 INFO AbstractConnector: Started SocketConnector@0.0.0.0:33490

15/09/15 19:18:46 INFO Utils: Successfully started service 'HTTP file server' on port 33490.

    15/09/15 19:18:46 INFO SparkEnv: Registering OutputCommitCoordinator

    15/09/15 19:18:46 INFO Server: jetty-8.y.z-SNAPSHOT

    15/09/15 19:18:46 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040

15/09/15 19:18:46 INFO Utils: Successfully started service 'SparkUI' on port 4040.

    15/09/15 19:18:46 INFO SparkUI: Started SparkUI at http://sandbox:4040

    15/09/15 19:18:46 INFO SparkContext: Added JAR file:/root/anil/spark/patentcitation_spark.jar at http://192.168.112.132:33490/jars/patentcitation_spark.jar with timestamp 1442344726418

    15/09/15 19:18:46 INFO Executor: Starting executor ID <driver> on host localhost

    15/09/15 19:18:46 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@sandbox:36543/user/HeartbeatReceiver

    15/09/15 19:18:46 INFO NettyBlockTransferService: Server created on 56529

    15/09/15 19:18:46 INFO BlockManagerMaster: Trying to register BlockManager

    15/09/15 19:18:46 INFO BlockManagerMasterActor: Registering block manager localhost:56529 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 56529)

    15/09/15 19:18:46 INFO BlockManagerMaster: Registered BlockManager

    15/09/15 19:18:46 INFO MemoryStore: ensureFreeSpace(169818) called with curMem=0, maxMem=278302556

    15/09/15 19:18:46 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 165.8 KB, free 265.2 MB)

    15/09/15 19:18:47 INFO MemoryStore: ensureFreeSpace(34017) called with curMem=169818, maxMem=278302556

    15/09/15 19:18:47 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 33.2 KB, free 265.2 MB)

    15/09/15 19:18:47 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:56529 (size: 33.2 KB, free: 265.4 MB)

    15/09/15 19:18:47 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0

    15/09/15 19:18:47 INFO SparkContext: Created broadcast 0 from textFile at PatentCitation.java:18

    15/09/15 19:18:47 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

    15/09/15 19:18:47 INFO FileInputFormat: Total input paths to process : 1

    15/09/15 19:18:47 INFO SparkContext: Starting job: sortByKey at PatentCitation.java:28

    15/09/15 19:18:47 INFO DAGScheduler: Registering RDD 2 (mapToPair at PatentCitation.java:19)

    15/09/15 19:18:47 INFO DAGScheduler: Got job 0 (sortByKey at PatentCitation.java:28) with 2 output partitions (allowLocal=false)

    15/09/15 19:18:47 INFO DAGScheduler: Final stage: Stage 1(sortByKey at PatentCitation.java:28)

    15/09/15 19:18:47 INFO DAGScheduler: Parents of final stage: List(Stage 0)

    15/09/15 19:18:47 INFO DAGScheduler: Missing parents: List(Stage 0)

    15/09/15 19:18:47 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at mapToPair at PatentCitation.java:19), which has no missing parents

    15/09/15 19:18:47 INFO MemoryStore: ensureFreeSpace(3736) called with curMem=203835, maxMem=278302556

    15/09/15 19:18:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.6 KB, free 265.2 MB)

    15/09/15 19:18:47 INFO MemoryStore: ensureFreeSpace(2723) called with curMem=207571, maxMem=278302556

    15/09/15 19:18:47 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.7 KB, free 265.2 MB)

    15/09/15 19:18:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:56529 (size: 2.7 KB, free: 265.4 MB)

    15/09/15 19:18:47 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0

    15/09/15 19:18:47 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839

    15/09/15 19:18:47 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at mapToPair at PatentCitation.java:19)

    15/09/15 19:18:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks

    15/09/15 19:18:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1389 bytes)

    15/09/15 19:18:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

    15/09/15 19:18:47 INFO Executor: Fetching http://192.168.112.132:33490/jars/patentcitation_spark.jar with timestamp 1442344726418

    15/09/15 19:18:47 INFO Utils: Fetching http://192.168.112.132:33490/jars/patentcitation_spark.jar to /tmp/spark-d5e8a049-e4b6-45d9-b533-6e7c1295ed19/userFiles-2ba84816-9443-4d34-b8f4-360e6a69aeaf/fetchFileTemp4125688246233447293.tmp

    15/09/15 19:18:47 INFO Executor: Adding file:/tmp/spark-d5e8a049-e4b6-45d9-b533-6e7c1295ed19/userFiles-2ba84816-9443-4d34-b8f4-360e6a69aeaf/patentcitation_spark.jar to class loader

    15/09/15 19:18:47 INFO HadoopRDD: Input split: hdfs://sandbox:8020/user/captain/input/cite75_99.txt:0+134217728

    15/09/15 19:18:47 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

    15/09/15 19:18:47 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

    15/09/15 19:18:47 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

    15/09/15 19:18:47 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

    15/09/15 19:18:47 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

    15/09/15 19:18:49 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:18:51 INFO ExternalSorter: Thread 65 spilling in-memory map of 86.6 MB to disk (2 times so far)

    15/09/15 19:18:53 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:18:55 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (4 times so far)

    15/09/15 19:18:57 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (5 times so far)

    15/09/15 19:19:00 INFO ExternalSorter: Thread 65 spilling in-memory map of 85.6 MB to disk (6 times so far)

    15/09/15 19:19:02 INFO ExternalSorter: Thread 65 spilling in-memory map of 85.1 MB to disk (7 times so far)

    15/09/15 19:19:04 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (8 times so far)

    15/09/15 19:19:06 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (9 times so far)

    15/09/15 19:19:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2004 bytes result sent to driver

    15/09/15 19:19:22 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1389 bytes)

    15/09/15 19:19:22 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)

    15/09/15 19:19:22 INFO HadoopRDD: Input split: hdfs://sandbox:8020/user/captain/input/cite75_99.txt:134217728+129857703

    15/09/15 19:19:22 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 34712 ms on localhost (1/2)

    15/09/15 19:19:23 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:19:25 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (2 times so far)

    15/09/15 19:19:28 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:19:31 INFO ExternalSorter: Thread 65 spilling in-memory map of 85.1 MB to disk (4 times so far)

    15/09/15 19:19:33 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (5 times so far)

    15/09/15 19:19:35 INFO ExternalSorter: Thread 65 spilling in-memory map of 85.1 MB to disk (6 times so far)

    15/09/15 19:19:37 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (7 times so far)

    15/09/15 19:19:40 INFO ExternalSorter: Thread 65 spilling in-memory map of 87.4 MB to disk (8 times so far)

    15/09/15 19:19:42 INFO ExternalSorter: Thread 65 spilling in-memory map of 84.2 MB to disk (9 times so far)

    15/09/15 19:19:53 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2004 bytes result sent to driver

    15/09/15 19:19:53 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 30861 ms on localhost (2/2)

    15/09/15 19:19:53 INFO DAGScheduler: Stage 0 (mapToPair at PatentCitation.java:19) finished in 65.576 s

    15/09/15 19:19:53 INFO DAGScheduler: looking for newly runnable stages

    15/09/15 19:19:53 INFO DAGScheduler: running: Set()

    15/09/15 19:19:53 INFO DAGScheduler: waiting: Set(Stage 1)

    15/09/15 19:19:53 INFO DAGScheduler: failed: Set()

    15/09/15 19:19:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

    15/09/15 19:19:53 INFO DAGScheduler: Missing parents for Stage 1: List()

    15/09/15 19:19:53 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[5] at sortByKey at PatentCitation.java:28), which is now runnable

    15/09/15 19:19:53 INFO MemoryStore: ensureFreeSpace(3312) called with curMem=210294, maxMem=278302556

    15/09/15 19:19:53 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.2 KB, free 265.2 MB)

    15/09/15 19:19:53 INFO MemoryStore: ensureFreeSpace(2378) called with curMem=213606, maxMem=278302556

    15/09/15 19:19:53 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 265.2 MB)

    15/09/15 19:19:53 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:56529 (size: 2.3 KB, free: 265.4 MB)

    15/09/15 19:19:53 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0

    15/09/15 19:19:53 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839

    15/09/15 19:19:53 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[5] at sortByKey at PatentCitation.java:28)

    15/09/15 19:19:53 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

    15/09/15 19:19:53 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1124 bytes)

    15/09/15 19:19:53 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)

    15/09/15 19:19:53 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:19:53 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms

    15/09/15 19:19:54 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:19:56 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.7 MB to disk (2 times so far)

    15/09/15 19:19:59 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:20:01 INFO BlockManager: Removing broadcast 1

    15/09/15 19:20:01 INFO BlockManager: Removing block broadcast_1_piece0

    15/09/15 19:20:01 INFO MemoryStore: Block broadcast_1_piece0 of size 2723 dropped from memory (free 278089295)

    15/09/15 19:20:01 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:56529 in memory (size: 2.7 KB, free: 265.4 MB)

    15/09/15 19:20:01 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0

    15/09/15 19:20:01 INFO BlockManager: Removing block broadcast_1

    15/09/15 19:20:01 INFO MemoryStore: Block broadcast_1 of size 3736 dropped from memory (free 278093031)

    15/09/15 19:20:01 INFO ContextCleaner: Cleaned broadcast 1

    15/09/15 19:20:01 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (4 times so far)

    15/09/15 19:20:06 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1682 bytes result sent to driver

    15/09/15 19:20:06 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1124 bytes)

    15/09/15 19:20:06 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)

    15/09/15 19:20:06 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:20:06 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

    15/09/15 19:20:06 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 13527 ms on localhost (1/2)

    15/09/15 19:20:07 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:20:10 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (2 times so far)

    15/09/15 19:20:12 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:20:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1685 bytes result sent to driver

    15/09/15 19:20:19 INFO DAGScheduler: Stage 1 (sortByKey at PatentCitation.java:28) finished in 25.877 s

    15/09/15 19:20:19 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 12352 ms on localhost (2/2)

    15/09/15 19:20:19 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool

    15/09/15 19:20:19 INFO DAGScheduler: Job 0 finished: sortByKey at PatentCitation.java:28, took 91.526647 s

    15/09/15 19:20:19 INFO FileOutputCommitter: File Output Committer Algorithm version is 1

    15/09/15 19:20:19 INFO SparkContext: Starting job: saveAsTextFile at PatentCitation.java:43

    15/09/15 19:20:19 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 157 bytes

    15/09/15 19:20:19 INFO DAGScheduler: Registering RDD 3 (reduceByKey at PatentCitation.java:28)

    15/09/15 19:20:19 INFO DAGScheduler: Got job 1 (saveAsTextFile at PatentCitation.java:43) with 2 output partitions (allowLocal=false)

    15/09/15 19:20:19 INFO DAGScheduler: Final stage: Stage 4(saveAsTextFile at PatentCitation.java:43)

    15/09/15 19:20:19 INFO DAGScheduler: Parents of final stage: List(Stage 3)

    15/09/15 19:20:19 INFO DAGScheduler: Missing parents: List(Stage 3)

    15/09/15 19:20:19 INFO DAGScheduler: Submitting Stage 3 (ShuffledRDD[3] at reduceByKey at PatentCitation.java:28), which has no missing parents

    15/09/15 19:20:19 INFO MemoryStore: ensureFreeSpace(3128) called with curMem=209525, maxMem=278302556

    15/09/15 19:20:19 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.1 KB, free 265.2 MB)

    15/09/15 19:20:19 INFO MemoryStore: ensureFreeSpace(2252) called with curMem=212653, maxMem=278302556

    15/09/15 19:20:19 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 265.2 MB)

    15/09/15 19:20:19 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:56529 (size: 2.2 KB, free: 265.4 MB)

    15/09/15 19:20:19 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0

    15/09/15 19:20:19 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:839

    15/09/15 19:20:19 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (ShuffledRDD[3] at reduceByKey at PatentCitation.java:28)

    15/09/15 19:20:19 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks

    15/09/15 19:20:19 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 4, localhost, PROCESS_LOCAL, 1113 bytes)

    15/09/15 19:20:19 INFO Executor: Running task 0.0 in stage 3.0 (TID 4)

    15/09/15 19:20:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:20:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms

    15/09/15 19:20:21 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:20:23 INFO BlockManager: Removing broadcast 2

    15/09/15 19:20:23 INFO BlockManager: Removing block broadcast_2

    15/09/15 19:20:23 INFO MemoryStore: Block broadcast_2 of size 3312 dropped from memory (free 278090963)

    15/09/15 19:20:23 INFO BlockManager: Removing block broadcast_2_piece0

    15/09/15 19:20:23 INFO MemoryStore: Block broadcast_2_piece0 of size 2378 dropped from memory (free 278093341)

    15/09/15 19:20:23 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:56529 in memory (size: 2.3 KB, free: 265.4 MB)

    15/09/15 19:20:23 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0

    15/09/15 19:20:23 INFO ContextCleaner: Cleaned broadcast 2

    15/09/15 19:20:23 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.7 MB to disk (2 times so far)

    15/09/15 19:20:25 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:20:27 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (4 times so far)

    15/09/15 19:20:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 4). 1098 bytes result sent to driver

    15/09/15 19:20:35 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 5, localhost, PROCESS_LOCAL, 1113 bytes)

    15/09/15 19:20:35 INFO Executor: Running task 1.0 in stage 3.0 (TID 5)

    15/09/15 19:20:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 4) in 16394 ms on localhost (1/2)

    15/09/15 19:20:35 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:20:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms

    15/09/15 19:20:37 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:20:39 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (2 times so far)

    15/09/15 19:20:41 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory map of 83.6 MB to disk (3 times so far)

    15/09/15 19:20:50 INFO Executor: Finished task 1.0 in stage 3.0 (TID 5). 1098 bytes result sent to driver

    15/09/15 19:20:50 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 5) in 14507 ms on localhost (2/2)

    15/09/15 19:20:50 INFO DAGScheduler: Stage 3 (reduceByKey at PatentCitation.java:28) finished in 30.901 s

    15/09/15 19:20:50 INFO DAGScheduler: looking for newly runnable stages

    15/09/15 19:20:50 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

    15/09/15 19:20:50 INFO DAGScheduler: running: Set()

    15/09/15 19:20:50 INFO DAGScheduler: waiting: Set(Stage 4)

    15/09/15 19:20:50 INFO DAGScheduler: failed: Set()

    15/09/15 19:20:50 INFO DAGScheduler: Missing parents for Stage 4: List()

    15/09/15 19:20:50 INFO DAGScheduler: Submitting Stage 4 (MapPartitionsRDD[8] at saveAsTextFile at PatentCitation.java:43), which is now runnable

    15/09/15 19:20:50 INFO MemoryStore: ensureFreeSpace(152496) called with curMem=209215, maxMem=278302556

    15/09/15 19:20:50 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 148.9 KB, free 265.1 MB)

    15/09/15 19:20:50 INFO MemoryStore: ensureFreeSpace(97600) called with curMem=361711, maxMem=278302556

    15/09/15 19:20:50 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 95.3 KB, free 265.0 MB)

    15/09/15 19:20:50 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:56529 (size: 95.3 KB, free: 265.3 MB)

    15/09/15 19:20:50 INFO BlockManagerMaster: Updated info of block broadcast_4_piece0

    15/09/15 19:20:50 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:839

    15/09/15 19:20:50 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MapPartitionsRDD[8] at saveAsTextFile at PatentCitation.java:43)

    15/09/15 19:20:50 INFO TaskSchedulerImpl: Adding task set 4.0 with 2 tasks

    15/09/15 19:20:50 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 6, localhost, PROCESS_LOCAL, 1124 bytes)

    15/09/15 19:20:50 INFO Executor: Running task 0.0 in stage 4.0 (TID 6)

    15/09/15 19:20:50 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:20:50 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms

    15/09/15 19:20:51 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (1 time so far)

    15/09/15 19:20:52 INFO ExternalSorter: Thread 65 spilling in-memory map of 85.4 MB to disk (2 times so far)

    15/09/15 19:20:53 INFO ExternalSorter: Thread 65 spilling in-memory map of 84.2 MB to disk (3 times so far)

    15/09/15 19:20:54 INFO FileOutputCommitter: File Output Committer Algorithm version is 1

    15/09/15 19:20:55 INFO BlockManager: Removing broadcast 3

    15/09/15 19:20:55 INFO BlockManager: Removing block broadcast_3_piece0

    15/09/15 19:20:55 INFO MemoryStore: Block broadcast_3_piece0 of size 2252 dropped from memory (free 277845497)

    15/09/15 19:20:55 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:56529 in memory (size: 2.2 KB, free: 265.3 MB)

    15/09/15 19:20:55 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0

    15/09/15 19:20:55 INFO BlockManager: Removing block broadcast_3

    15/09/15 19:20:55 INFO MemoryStore: Block broadcast_3 of size 3128 dropped from memory (free 277848625)

    15/09/15 19:20:55 INFO ContextCleaner: Cleaned broadcast 3

15/09/15 19:20:57 INFO FileOutputCommitter: Saved output of task 'attempt_201509151920_0004_m_000000_6' to hdfs://sandbox:8020/user/captain/output/_temporary/0/task_201509151920_0004_m_000000

    15/09/15 19:20:57 INFO SparkHadoopMapRedUtil: attempt_201509151920_0004_m_000000_6: Committed

    15/09/15 19:20:57 INFO Executor: Finished task 0.0 in stage 4.0 (TID 6). 1828 bytes result sent to driver

    15/09/15 19:20:57 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 7, localhost, PROCESS_LOCAL, 1124 bytes)

    15/09/15 19:20:57 INFO Executor: Running task 1.0 in stage 4.0 (TID 7)

    15/09/15 19:20:57 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 6) in 7135 ms on localhost (1/2)

    15/09/15 19:20:57 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks

    15/09/15 19:20:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms

    15/09/15 19:20:58 INFO ExternalSorter: Thread 65 spilling in-memory map of 84.8 MB to disk (1 time so far)

    15/09/15 19:20:58 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (2 times so far)

    15/09/15 19:20:59 INFO ExternalSorter: Thread 65 spilling in-memory map of 87.6 MB to disk (3 times so far)

    15/09/15 19:21:01 INFO ExternalSorter: Thread 65 spilling in-memory map of 83.6 MB to disk (4 times so far)

    15/09/15 19:21:02 INFO FileOutputCommitter: File Output Committer Algorithm version is 1

15/09/15 19:21:05 INFO FileOutputCommitter: Saved output of task 'attempt_201509151920_0004_m_000001_7' to hdfs://sandbox:8020/user/captain/output/_temporary/0/task_201509151920_0004_m_000001

    15/09/15 19:21:05 INFO SparkHadoopMapRedUtil: attempt_201509151920_0004_m_000001_7: Committed

    15/09/15 19:21:05 INFO Executor: Finished task 1.0 in stage 4.0 (TID 7). 1828 bytes result sent to driver

    15/09/15 19:21:05 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 7) in 8171 ms on localhost (2/2)

    15/09/15 19:21:05 INFO DAGScheduler: Stage 4 (saveAsTextFile at PatentCitation.java:43) finished in 15.306 s

    15/09/15 19:21:05 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool

    15/09/15 19:21:05 INFO DAGScheduler: Job 1 finished: saveAsTextFile at PatentCitation.java:43, took 46.343546 s


    Results

    [sandbox spark]# vi part-00000

"CITED" "CITING"

    1 3964859,4647229

    10000 4539112

    100000 5031388

    1000006 4714284

    1000007 4766693

    1000011 5033339

    1000017 3908629

    1000026 4043055

    1000033 4975983,4190903

    1000043 4091523

    1000044 4055371,4082383

    1000045 4290571

    1000046 5918892,5525001

    1000049 5996916

    1000051 4541310

    1000054 4946631

    1000065 4748968

    1000067 5312208,4944640,5071294

    1000070 4928425,5009029

    1000073 5474494,4107819

    1000076 5845593,4867716

    1000083 5322091,5566726

    1000084 4683770,4182197

    1000086 4217220,4686189,4178246,4839046

    1000089 5277853,5571464,5807591,5395228,5503546,5505607,5505610,5505611,5540869,5544405

    1000094 4897975,4920718,5713167

    10001 4228735

    1000102 5120183,5791855

    1000108 4399627

    1000118 5029926,5058767,4919466

    100012 4641675,4838290

    1000120 4546821

    1000122 4048703,4297867

    1000125 4573823

    1000129 4748545

    1000133 5809659,4922621

    1000134 4977842,4175499,4373460

    100014 5179741

    1000145 3961423

    1000149 5876147,5193932

    1000150 4217940,4271874

    1000152 3962981,4032016,4269599

    1000168 4479408

    1000172 4865014,4974551,4653469

    1000173 5303740,5303742,4312704,4346730

    1000178 4830222

    1000179 5182923

    1000181 5185948

    1000185 4082391,3871143,5793021

    1000187 4470573

    1000188 4134380,4023548

    1000193 5092745

    1000195 4922643

    1000199 4166638,4846484,4664399

    1000201 4697655

    1000203 4541584

  • Introduction to Apache Zookeeper

    Apache Zookeeper


Apache Zookeeper is a:

• centralized
• high-performance
• coordination system

for distributed applications.

Apache Zookeeper provides the coordination building blocks, such as naming, configuration management, group membership, leader election, and distributed locks, that distributed systems need.

    Applications using Apache Zookeeper

    • Apache Hadoop
    • Apache HBase
    • Apache Kafka
    • Apache Accumulo
    • Apache Mesos
    • Apache Solr
    • Neo4j


    Zookeeper Primitives and Recipes

Zookeeper provides primitives for distributed coordination. Rather than exposing these primitives directly to client applications, it exposes a file-system-like API.

Recipes are implementations of these primitives built on Zookeeper. Recipes provide the operations on Zookeeper data nodes (called ZNodes).

    The ZNodes are organized in a hierarchical tree model similar to a file system.

    ZNodes

[Diagram: zookeeper_tree, a sample znode hierarchy]

In this diagram:

the /employees znode is the parent of all znodes representing employees; for example, Matt is stored as the znode employee-1.

the /dept znode is the parent of all znodes representing departments; for example, HR is stored as the znode dept-1.

the /offices znode is the parent of all znodes representing offices; for example, Boston is stored as the znode office-1.

ZNodes may or may not contain data; when a znode has data, it is stored as a byte array.

    The leaf nodes in the tree represent the data. Every time data is added, a znode is added. A znode is removed when data is deleted.

    There are 4 modes for Zookeeper ZNodes:

    1. Persistent
    2. Ephemeral
    3. Persistent_Sequential
    4. Ephemeral_Sequential

Persistent Nodes are znodes that can be deleted only by an explicit request. They survive service restarts because they are persisted to disk.

Ephemeral Nodes are znodes that exist only as long as the session that created them is active. When the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children.

Sequence: When creating a znode, you can also request that ZooKeeper append a monotonically increasing counter to the end of the path. This counter is unique to the parent znode and has a format of %010d, that is, 10 digits with 0 (zero) padding; for example, the first sequential child created under /tasks/task- would be named /tasks/task-0000000000.
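As a small hedged sketch (assuming a connected ZooKeeper handle named zk, an existing /tasks parent znode, and the imports shown in the API sketch below), sequential creation looks like this:

    // Each create call returns the actual path with the counter appended
    String first = zk.create("/tasks/task-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    // first is e.g. "/tasks/task-0000000000"
    String second = zk.create("/tasks/task-", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
    // second is e.g. "/tasks/task-0000000001"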

    The Curator framework also defines the following recipe: a persistent ephemeral node is an ephemeral node that attempts to stay present in ZooKeeper, even through connection and session interruptions.

    Zookeeper API

There are six primary operations exposed by the API (a minimal Java sketch follows the list):

• create /path data – creates a znode named /path containing data
• delete /path – deletes the znode /path
• exists /path – checks whether /path exists
• setData /path data – sets the data of znode /path to data
• getData /path – returns the data in /path
• getChildren /path – returns the list of children under /path
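Here is a minimal sketch of these operations using the standard ZooKeeper Java client; the paths and data are illustrative, and the no-op lambda watcher assumes Java 8:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkBasics {
      public static void main(String[] args) throws Exception {
        // Connect to a local server with a 30s session timeout and a no-op watcher
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> {});

        // create /path data
        zk.create("/blog_testing", "test_data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists /path (returns a Stat, or null if the znode does not exist)
        boolean present = zk.exists("/blog_testing", false) != null;

        // getData /path
        byte[] data = zk.getData("/blog_testing", false, null);
        System.out.println(present + " " + new String(data));

        // setData /path data (version -1 means "any version")
        zk.setData("/blog_testing", "updated_text".getBytes(), -1);

        // getChildren /path
        System.out.println(zk.getChildren("/", false));

        // delete /path
        zk.delete("/blog_testing", -1);

        zk.close();
      }
    }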

    Installing Zookeeper

Download the stable version of Zookeeper from https://zookeeper.apache.org/releases.html:

$> tar xvzf zookeeper-3.4.6.tar.gz
$> cd zookeeper-3.4.6/conf


Create a zoo.cfg file with the following settings:

    tickTime=2000
    dataDir=/home/xyz/zookeeper/data
    clientPort=2181


Here, tickTime is Zookeeper's basic time unit in milliseconds, dataDir is where snapshots are stored, and clientPort is the port clients connect to. Remember to change the dataDir value to a directory that is writable by the zookeeper process.


    $> cd zookeeper-3.4.6/bin
    $>./zkServer.sh start
     JMX enabled by default
     Using config: /home/zyx/zookeeper-3.4.6/bin/../conf/zoo.cfg
     Starting zookeeper ... STARTED


Now that the zookeeper server has started, it is time to interact with it.

    In another terminal/command window, go to the bin directory of your zookeeper installation.

    bin$ ./zkCli.sh -server 127.0.0.1:2181
    Connecting to 127.0.0.1:2181
    2015-09-09 21:22:29,700 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
    2015-09-09 21:22:29,704 [myid:] - INFO [main:Environment@100] - Client environment:host.name=xxx..xxxx.xxx
    2015-09-09 21:22:29,704 [myid:] - INFO [main:Environment@100] - Client environment:java.version=1.7.0_51
    2015-09-09 21:22:29,707 [myid:] - INFO [main:Environment@100] - Client environment:java.vendor=Oracle Corporation
    2015-09-09 21:22:29,707 [myid:] - INFO [main:Environment@100] - Client environment:java.home=/Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre
    2015-09-09 21:22:29,707 [myid:] - INFO [main:Environment@100] - Client environment:java.class.path=/Users/......
    2015-09-09 21:22:29,727 [myid:] - INFO [main:Environment@100] - Client environment:java.library.path=/Users/xyz/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
    2015-09-09 21:22:29,727 [myid:] - INFO [main:Environment@100] - Client environment:java.io.tmpdir=/var/folders/dt/p17rgljd56v_jd0hy9s73l3w0000gn/T/
    2015-09-09 21:22:29,728 [myid:] - INFO [main:Environment@100] - Client environment:java.compiler=<NA>
    2015-09-09 21:22:29,728 [myid:] - INFO [main:Environment@100] - Client environment:os.name=Mac OS X
    2015-09-09 21:22:29,728 [myid:] - INFO [main:Environment@100] - Client environment:os.arch=x86_64
    2015-09-09 21:22:29,728 [myid:] - INFO [main:Environment@100] - Client environment:os.version=10.10.5 
    2015-09-09 21:22:29,729 [myid:] - INFO [main:Environment@100] - Client environment:user.home=/Users/xyz
    2015-09-09 21:22:29,729 [myid:] - INFO [main:Environment@100] - Client environment:user.dir=/Users/xyz/zookeeper/zookeeper-3.4.6/bin
    2015-09-09 21:22:29,731 [myid:] - INFO [main:ZooKeeper@438] - Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@1d88a478
    Welcome to ZooKeeper!
    2015-09-09 21:22:29,766 [myid:] - INFO [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@975] - Opening socket connection to server 127.0.0.1/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
    JLine support is enabled
    2015-09-09 21:22:29,775 [myid:] - INFO [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@852] - Socket connection established to 127.0.0.1/127.0.0.1:2181, initiating session
    2015-09-09 21:22:29,806 [myid:] - INFO [main-SendThread(127.0.0.1:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server 127.0.0.1/127.0.0.1:2181, sessionid = 0x14fb50eae000000, negotiated timeout = 30000
    WATCHER::
    WatchedEvent state:SyncConnected type:None path:null

Type help to list all the available commands. After that, we are going to use the ls, create, get, set, and delete commands.

    [zk: 127.0.0.1:2181(CONNECTED) 0] help
    ZooKeeper -server host:port cmd args
    connect host:port
    get path [watch]
    ls path [watch]
    set path data [version]
    rmr path
    delquota [-n|-b] path
    quit
    printwatches on|off
    create [-s] [-e] path data acl
    stat path [watch]
    close
    ls2 path [watch]
    history
    listquota path
    setAcl path acl
    getAcl path
    sync path
    redo cmdno
    addauth scheme auth
    delete path [version]
    setquota -n|-b val path
    [zk: 127.0.0.1:2181(CONNECTED) 1]
    
    
    [zk: 127.0.0.1:2181(CONNECTED) 1] ls /
    [zookeeper]
    [zk: 127.0.0.1:2181(CONNECTED) 2] create /blog_testing test_data
    Created /blog_testing
    [zk: 127.0.0.1:2181(CONNECTED) 3] ls /
    [blog_testing, zookeeper]
    [zk: 127.0.0.1:2181(CONNECTED) 4] get /blog_testing
    test_data
    cZxid = 0x2
    ctime = Wed Sep 09 21:48:02 CDT 2015
    mZxid = 0x2
    mtime = Wed Sep 09 21:48:02 CDT 2015
    pZxid = 0x2
    cversion = 0
    dataVersion = 0
    aclVersion = 0
    ephemeralOwner = 0x0
    dataLength = 9
    numChildren = 0
    [zk: 127.0.0.1:2181(CONNECTED) 5] set /blog_testing updated_text
    cZxid = 0x2
    ctime = Wed Sep 09 21:48:02 CDT 2015
    mZxid = 0x3
    mtime = Wed Sep 09 21:48:42 CDT 2015
    pZxid = 0x2
    cversion = 0
    dataVersion = 1
    aclVersion = 0
    ephemeralOwner = 0x0
    dataLength = 12
    numChildren = 0
    [zk: 127.0.0.1:2181(CONNECTED) 6] get /blog_testing
    updated_text
    cZxid = 0x2
    ctime = Wed Sep 09 21:48:02 CDT 2015
    mZxid = 0x3
    mtime = Wed Sep 09 21:48:42 CDT 2015
    pZxid = 0x2
    cversion = 0
    dataVersion = 1
    aclVersion = 0
    ephemeralOwner = 0x0
    dataLength = 12
    numChildren = 0
    [zk: 127.0.0.1:2181(CONNECTED) 7] delete /blog_testing
    [zk: 127.0.0.1:2181(CONNECTED) 8] ls /
    [zookeeper]
    [zk: 127.0.0.1:2181(CONNECTED) 9]
    


To shut down the zookeeper server, run the following in the bin directory:

    $>./zkServer.sh stop
    JMX enabled by default
    Using config: /Users/xyz/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
    Stopping zookeeper ... STOPPED
    
    

    Zookeeper Programming

If you want to write programs that interact with Zookeeper, you should definitely use the Apache Curator framework; a minimal sketch follows.
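As an illustration (a sketch, not Curator's full API surface), here is how the create/read/update/delete sequence from the zkCli session above might look with Curator; the connect string and paths are assumptions matching the local setup:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CuratorExample {
      public static void main(String[] args) throws Exception {
        // Retry policy: start at 1s between attempts, back off exponentially, stop after 3 retries
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        client.create().creatingParentsIfNeeded()
              .forPath("/blog_testing", "test_data".getBytes());
        byte[] data = client.getData().forPath("/blog_testing");
        System.out.println(new String(data));
        client.setData().forPath("/blog_testing", "updated_text".getBytes());
        client.delete().forPath("/blog_testing");

        client.close();
      }
    }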


    Unit Testing with Zookeeper

    The Apache Curator project provides an embedded zookeeper instance that can be used for unit testing.

import org.apache.curator.test.TestingServer;

// Start an in-process Zookeeper server for the test
TestingServer testingServer = new TestingServer();
testingServer.start();
String zookeeperConnectionStr = testingServer.getConnectString();
// ... run tests against zookeeperConnectionStr, then:
testingServer.close();


    Stay Tuned!

  • Zookeeper driven big data infrastructure

    Background

Any big data infrastructure operating at scale requires the following technologies:

    • Hadoop
    • Enterprise Search
    • Enterprise Messaging

    Managing these three verticals is a mammoth task.


    Coordination

When your big data infrastructure scales with business needs, you should choose management technologies that are common across multiple areas. This way, you minimize the number of complementary technologies operating in your big data infrastructure.

    One such technology used in management and coordination is Apache Zookeeper.

    When you use the following technologies in your big data infrastructure, you can use Apache Zookeeper for coordination:

    • Hadoop
    • Apache Solr for Enterprise Search
    • Apache Kafka for Enterprise Messaging

[Diagram: zookeeper_bigdata, Zookeeper coordinating Hadoop, Solr, and Kafka]

    As depicted in the diagram, Zookeeper can be the central pivot for managing your big data infrastructure.


    Please do not hesitate to review my Introduction to Zookeeper post.

    Stay Tuned!


    References

    Introduction to Zookeeper