Monday, 22 April 2013

MySQL Applier For Hadoop: Real time data export from MySQL to HDFS


MySQL replication enables data to be replicated from one MySQL database server (the master) to one or more MySQL database servers (the slaves). However, imagine how many more use cases could be served if the slave (to which data is replicated) were not restricted to being a MySQL server, but could be any other database server or platform, with replication events applied in real time!

This is what the new Hadoop Applier empowers you to do.

An example of such a slave could be a data warehouse system such as Apache Hive, which uses HDFS as its data store. If you have a Hive metastore associated with HDFS (Hadoop Distributed File System), the Hadoop Applier can populate Hive tables in real time. Data is exported from MySQL to text files in HDFS, and therefore into Hive tables. It is as simple as running a 'CREATE TABLE' statement in HiveQL to define a table structure matching the one in MySQL (and yes, you can use any row and column delimiters you want), and then running the Hadoop Applier to start real-time data replication.
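As a minimal sketch (the table and column names here are purely illustrative), a MySQL table such as employees (id INT, name VARCHAR(50)) could be mirrored in Hive with a statement along these lines, using a comma as the field delimiter. Depending on how you run the Applier, you may also want a leading time_stamp column, as in the example that comes up in the comments below:

hive -e "
CREATE TABLE employees (id INT, name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
"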

The motivation to develop the Hadoop Applier is that currently there is no tool available to perform this real-time transfer. Existing solutions to import data into HDFS include Apache Sqoop, which is well proven and enables batch transfers, but which consequently requires periodic re-imports to keep the data up to date. It reads the source MySQL database via a JDBC connector or a fastpath connector and performs a bulk data transfer, which can create overhead on your operational systems and make other queries slow. And in a case where there are only a few changes compared to the size of the data, Sqoop may take disproportionately long to load it.
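For reference, a typical Sqoop batch import looks something like the sketch below (the connection string, credentials, and table name are placeholders) and has to be re-run whenever the source data changes:

sqoop import \
  --connect jdbc:mysql://db.example.com/mydb \
  --username dbuser \
  --table customers \
  --target-dir /user/hive/warehouse/customers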

The Hadoop Applier, on the other hand, reads from the binary log and inserts data in real time, applying the events as they happen on the MySQL server; other queries therefore continue to execute unaffected. No bulk transfers required! The Hadoop Applier takes only the changes and inserts them, which is a lot faster.

Hadoop Applier can thus be a solution when you need to rapidly acquire new data from MySQL for real-time processing within Hadoop.

Introducing The Applier: 

The Hadoop Applier replicates events from the MySQL binary log to provide real-time integration of MySQL with Hadoop and related frameworks which work on top of HDFS. There are many use cases for the integration of unstructured data stored in Apache Hadoop and structured data from relational databases such as MySQL.

The Hadoop Applier provides real-time connectivity between MySQL and Hadoop/HDFS (Hadoop Distributed File System), which can be used for big data analytics: for purposes like sentiment analysis, marketing campaign analysis, customer churn modeling, fraud detection, risk modelling and many more. You can read more about the role of the Hadoop Applier in big data in the blog by Mat Keep. Many widely used systems, such as Apache Hive, use HDFS as a data store.


The diagram below represents the integration:
Replication via the Hadoop Applier happens by reading binary log events and writing them into a file in HDFS (Hadoop Distributed File System) as soon as they happen on the MySQL master. "Events" describe database changes such as table creation operations or changes to table data.

As soon as an INSERT query is executed on the MySQL master, it is passed to the Hadoop Applier, and the data is written into a text file in HDFS. Once the data is in HDFS files, other Hadoop ecosystem platforms and databases can consume it for their own applications.
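As a rough sketch of the flow (the database, table, and values are hypothetical; the warehouse path shown is the Applier's default):

# insert a row on the MySQL master while happlier is running
mysql -u root -e "INSERT INTO mydb.mytable VALUES (1, 'hello');"

# the row appears as delimited text in the corresponding datafile in HDFS
hadoop fs -cat /user/hive/warehouse/mydb.db/mytable/datafile1.txt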

Hadoop Applier can be downloaded from http://labs.mysql.com/
 
Prerequisites:

These are the packages you require in order to run the Hadoop Applier on your machine (a quick way to sanity-check them is sketched after the list):

- Hadoop Applier package from http://labs.mysql.com
- Hadoop 1.0.4 (that is what I used for the demo in the next post)
- Java version 6 or later (since Hadoop is written in Java)
- libhdfs (it comes precompiled with Hadoop distros, ${HADOOP_HOME}/libhdfs/libhdfs.so)
- cmake 2.6 or greater
- libmysqlclient 5.6
- gcc 4.6.3
- MySQL Server 5.6
- FindHDFS.cmake (cmake file to find the libhdfs library while compiling; you can get a copy online)
- FindJNI.cmake (optional, check if you already have one: $ locate FindJNI.cmake)
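A quick sanity check for most of these on a Linux box might look like the following (the paths assume the defaults above and may differ on your system):

cmake --version                        # expect 2.6 or greater
gcc --version                          # expect around 4.6.3
java -version                          # expect Java 6 or later
mysql --version                        # expect a 5.6 client
ls ${HADOOP_HOME}/libhdfs/libhdfs.so   # libhdfs shipped with Hadoop
locate FindHDFS.cmake FindJNI.cmake    # cmake helper modules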

To use the Hadoop Applier with Hive, you will also need to install Hive, which you can download here.

Please use the comments section of this blog to share your opinion on Hadoop Applier, and let us know more about your requirements.

Read the next post in order to learn how to install and configure the Hadoop Applier, and the implementation details.


53 comments:

  1. Can the Hadoop Applier synchronize updated records, or does it track only insertion of new records?

    ReplyDelete
    Replies
    1. Hi,

      For this release, the applier only tracks insertion of new records.

      We have considered adding support for deletes, updates and DDLs as well, but they are more complicated to handle and we are not sure how much interest there is in this currently.

      Can you please elaborate your use case?

      Delete
    2. Thanks for quick response.

      In general, there is a 'customers' table and our users can register on the website (new records appear in the table); they can also change their statuses, like confirming their accounts or deactivating them (existing records are updated).
      And we need to run some reports which gather analytics about users with different statuses.
      That's why it's important for us to be able to have the latest updates of users' statuses in Hadoop.

      Currently we are trying to use the Sqoop tool (http://sqoop.apache.org/) in order to set up an incremental import workflow based on the lastmodified_date field and then merge the latest updates with the initial data in Hadoop.
      But as of Sqoop 1.4.2 there are a lot of bugs which currently do not allow us to build such a workflow in an automated way (without human interference).

      So currently we are very interested in 'updates' feature of Hadoop Applier.

      It would also be great to have the option to migrate the initial MySQL schema into Hive tables (I mean, without creating Hive tables by hand in advance),
      because it's a bit frustrating to recreate (mirror) the whole database schema in Hive by hand if we are going to migrate a couple of databases with hundreds of tables.
      For instance, Sqoop has the ability to fetch metadata from the database, map column types to appropriate Hive types, and generate Hive tables by itself.

      Delete
    3. Hi,
      Thank you for giving the details.

      The use case is very valid and interesting, and this will help us
      shape the future direction for Hadoop Applier.
      Great to see your interest in it, stay tuned to get the latest updates.

      Delete
  2. Getting the following error

    # make -j4

    /opt/mysql/server-5.6/include/sql_common.h:26:18: fatal error: hash.h: No such file or directory

    ReplyDelete
    Replies
    1. Hi Luis,

      Please make sure that the MySQL server code is built. MySQL Server generates some header files (which are required by the Hadoop Applier) during compilation of the server code, hash.h being one of them. Hence the error.

      Please reply on this thread in case you face any other issues.

      Delete
  3. Hey there,
    I have the same problem. What do you mean by "make sure that the MySQL server code is built"? I downloaded the binary version of MySQL server but couldn't find hash.h in the include dir.
    Can you assist me with this please.

    ReplyDelete
    Replies
    1. Hi Amr!

      Thank you for trying out the Applier!

      Unfortunately, downloading the binary version of MySQL server code will not help. You need any one of the following:
      - MySQL Server source code (http://dev.mysql.com/downloads/mysql/#downloads , select 'source code' from the drop down menu)

      - MySQL connector C (Download from http://dev.mysql.com/downloads/connector/c/#downloads)

      Now, what I meant by "make sure that the MySQL server code is built" is: If you are using the MySQL server code for the headers and libraries, run the following commands on the source code you downloaded:
      - cmake .
      - make
      This will make sure that all the required header files and the library are in place when you build and link the Hadoop Applier.

      Hope this helps.
      Please reply on the thread in case you still face issues.

      Delete
  4. Thanks. Will try it out and let you guys know how it goes.

    ReplyDelete
  5. Ok so the compilation and all went successfully. But the replication isn't happening.

    my mysql database looks like this:
    database:
    mysql_replicate_test

    desc rep_test;
    +--------+--------------+------+-----+---------+----------------+
    | Field  | Type         | Null | Key | Default | Extra          |
    +--------+--------------+------+-----+---------+----------------+
    | row_id | int(11)      | NO   | PRI | NULL    | auto_increment |
    | msg    | varchar(200) | YES  |     | NULL    |                |
    +--------+--------------+------+-----+---------+----------------+


    my hive db looks like this:
    database: mysql_replicate_test

    create table rep_test
    > ( time_stamp INT, row_id INT, msg STRING )
    > row format delimited
    > fields terminated by ','
    > stored as textfile;

    my.cnf file:
    #
    # The MySQL database server configuration file.
    #
    # You can copy this to one of:
    # - "/etc/mysql/my.cnf" to set global options,
    # - "~/.my.cnf" to set user-specific options.
    #
    # One can use all long options that the program supports.
    # Run program with --help to get a list of available options and with
    # --print-defaults to see which it would actually understand and use.
    #
    # For explanations see
    # http://dev.mysql.com/doc/mysql/en/server-system-variables.html

    # This will be passed to all mysql clients
    # It has been reported that passwords should be enclosed with ticks/quotes
    # escpecially if they contain "#" chars...
    # Remember to edit /etc/mysql/debian.cnf when changing the socket location.
    [client]
    port = 3306

    # Here is entries for some specific programs
    # The following values assume you have at least 32M ram

    # This was formally known as [safe_mysqld]. Both versions are currently parsed.
    [mysqld_safe]
    nice = 0

    [mysqld]
    #
    # * Basic Settings
    #
    server-id = 1
    log_bin = /var/log/mysql/master-bin.log
    expire_logs_days = 10
    max_binlog_size = 100M
    binlog-format = row #Very important if you want to receive write, update and delete row events
    user = mysql
    port = 3306
    basedir = /usr/local/mysql
    datadir = /usr/local/mysql/data
    tmpdir = /tmp
    skip-external-locking

    #
    # Instead of skip-networking the default is now to listen only on
    # localhost which is more compatible and is not less secure.
    bind-address = 0.0.0.0
    #
    # * Fine Tuning
    #
    # * IMPORTANT: Additional settings that can override those from this file!
    # The files must end with '.cnf', otherwise they'll be ignored.
    #
    federated



    my hadoop config:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://amr-Lenovo-G580:54310</value>
      <description>The name of the default file system. A URI whose
      scheme and authority determine the FileSystem implementation. The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class. The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hduser/tmp</value>
    </property>

    <property>
      <name>dfs.replication</name>
      <value>2</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.</description>
    </property>

    <property>
      <name>dfs.support.broken.append</name>
      <value>true</value>
    </property>

    <property>
      <name>mapred.job.tracker</name>
      <value>amr-Lenovo-G580:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.</description>
    </property>

    hadoop version: Hadoop 1.2.0


    mysql test run command:
    ./mtr --start --suite=rpl --mysqld=--binlog_checksum=NONE

    happlier command:
    ./happlier mysql://root@localhost:13000 hdfs://amr-Lenovo-G580:54310

    Any idea how to debug this?

    ReplyDelete
    Replies
    1. Hi Amr!

      Great that the compilation works fine now! Thanks for trying this out.

      For this issue, I request you to check the following:

      1. I understand that you cannot see the data in Hive, but is the data replicated into HDFS? I.e., check the HDFS file system (you can do it from the command line or the Hadoop web GUI).

      If yes, then check the base dir into which the db structure is created. This should match the data warehouse directory in Hive.

      By default the base dir into which Hadoop Applier writes, is set as /user/hive/warehouse.

      The hive configuration file (hive-default.xml.template) should have the property:

      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
      </property>

      The value here, "/user/hive/warehouse" should match the value you set while running the happlier executable as the base dir.

      2. Please note that replication starts from the first insert you make into the MySQL table. Note that the executable 'happlier' should be running when the insert is made into the table on the MySQL server.


      So when you execute ./happlier mysql://root@localhost:13000 hdfs://amr-Lenovo-G580:54310
      The output should be this:

      "The default data warehouse directory in HDFS will be set to /user/hive/warehouse
      Change the default warehouse directory (Y or N)?"

      Please give a Y or N option, Suppose you give N, output will be
      "Connected to HDFS File System
      The data warehouse directory is set as /user/hive/warehouse."

      Now, whenever an insert is done, the success message from happlier will be:
      "Written N bytes to datafile in the following directory: 13000 hdfs://amr-Lenovo-G580:54310/user/hive/warehouse/mysql_replicate_test.db/rep_test."

      Please have a look at the demo here, it might be of some help:
      http://www.youtube.com/watch?v=mZRAtCu3M1g

      In case this does not help, I request you to please paste the output of happlier when a row insert is done on MySQL.

      Hope this helps.

      - Shubhangi

      Delete
  6. Thanks,
    So i still can't get the data to be replicated into hdfs.
    hadoop fs -lsr / :
    drwxr-xr-x - hduser supergroup 0 2013-07-27 16:10 /user/hive/warehouse/mysql_replicate_test.db
    drwxr-xr-x - hduser supergroup 0 2013-07-27 16:10 /user/hive/warehouse/mysql_replicate_test.db/rep_test

    The database i created is there and all but the connector isn't working

    Here is the output of the executable:

    The default data warehouse directory in HDFS will be set to /usr/hive/warehouse
    Change the default data warehouse directory? (Y or N) Y
    Enter the absolute path to the data warehouse directory :/user/hive/warehouse
    Connected to HDFS file system
    The data warehouse directory is set as /user/hive/warehouse
    .
    .
    .

    Nothing gets printed after this, even when i insert to the tables.


    I added some debugging code to the connector, basically i print statements after each step of the connection is done, here is what i got:

    user: root, password: , host: localhost, binlog: , port: 13000
    version: 5.6.12-log
    CHECKSUMS ...
    ALL CHECKSUMS PASSED

    It looks like the connection part is going ok, i'm suspecting that mysql isn't sending the binlog event to the connector. Is there any special mysql config i need to do in order to make this happen ?

    ReplyDelete
  7. Hi Amr,
    Thanks for trying hadoop_applier.
    I think you need to set the variable binlog_format (in MySQL) to ROW before inserting into MySQL. You can use this command to do that:

    Set session binlog_format='ROW';

    Please let me know if this does not work.

    ReplyDelete
  8. Sorry, still not working :(

    Is there a way to make sure that mysql is actually sending the proper events via telnet or something similar ?

    ReplyDelete
    Replies
    1. Hi Amr,

      You can check via the examples which are a part of the repo:

      home$ make binlog-browser
      home/examples$ ./binlog-browser mysql://root@127.0.0.1:13000

      see if you could get events listed here.

      Some comments:
      Please note that the server you start using mtr does not use the conf settings you might have specified in INSTALLDIR/my.cnf; and hence your binlog format settings there are not used here.

      You may please use the command Neha specified, or else give it as an option during the time server starts:
      ./mtr --start --suite=rpl --mysqld=--binlog_format=ROW --mysqld=--binlog_checksum=NONE


      Also, the debugging code gives the correct output, i.e.

      user: root, password: , host: localhost, binlog: , port: 13000

      The binlog name is initially empty, and that is expected. This is because Hadoop Applier does not get the name of the binlog file when it connects to the server and registers as a slave. It is only sent when it is registered as a slave and it requests for the binlog dump; i.e. requests a binlog stream from the server (via the COM_BINLOG_DUMP command).

      Delete
  9. Thank you all, it's working now :D

    The problem was that when i was trying to insert to mysql, i was connecting to mysql on port 3306 rather than port 13000 which caused the bin log event not to be triggered.

    That said, i couldn't compile the binlog-browser. Don't really need it now but here is the error i got:

    make binlog-browser
    g++ binlog-browser.cpp -o binlog-browser
    binlog-browser.cpp: In function ‘bool check_event_db(mysql::Binary_log_event**)’:
    binlog-browser.cpp:267:8: error: ‘WRITE_ROWS_EVENT_V1’ was not declared in this scope
    binlog-browser.cpp:269:8: error: ‘UPDATE_ROWS_EVENT_V1’ was not declared in this scope
    binlog-browser.cpp:271:8: error: ‘DELETE_ROWS_EVENT_V1’ was not declared in this scope
    binlog-browser.cpp:280:33: error: ‘STMT_END_F’ is not a member of ‘mysql::Row_event’
    binlog-browser.cpp: In function ‘int main(int, char**)’:
    binlog-browser.cpp:636:48: error: ‘str_error’ was not declared in this scope
    binlog-browser.cpp:688:51: error: ‘str_error’ was not declared in this scope
    binlog-browser.cpp:728:38: error: ‘WRITE_ROWS_EVENT_V1’ is not a member of ‘mysql’
    binlog-browser.cpp:730:38: error: ‘UPDATE_ROWS_EVENT_V1’ is not a member of ‘mysql’
    binlog-browser.cpp:732:38: error: ‘DELETE_ROWS_EVENT_V1’ is not a member of ‘mysql’
    binlog-browser.cpp:752:37: error: ‘STMT_END_F’ is not a member of ‘mysql::Row_event’
    binlog-browser.cpp:796:21: error: ‘class mysql::Binary_log_event’ has no member named ‘print_long_info’
    binlog-browser.cpp:796:52: error: ‘class mysql::Binary_log_event’ has no member named ‘print_event_info’
    binlog-browser.cpp:814:18: error: ‘class mysql::Binary_log_event’ has no member named ‘print_long_info’
    binlog-browser.cpp:814:49: error: ‘class mysql::Binary_log_event’ has no member named ‘print_event_info’
    make: *** [binlog-browser] Error 1

    I'm guessing it's not seeing the mysql include files. Will check it once i got some time.

    Again, thanks for the help.

    ReplyDelete
    Replies
    1. Hi Amr,

      Great you could find the issue!

      W.r.t. the binlog-browser, I think the problem isn't that the header files can't be found, but that the outdated header files (from the previous release) of the applier are being used, namely 'binlog_event.h'.

      Please note, the Hadoop Applier is the second release of mysql-replication-listener (you can have a quick look on Launchpad here).

      I notice that the error "no member found" is for the additions done in the current release.

      But, in that scenario, I wonder why 'make happlier' doesn't report the same errors. Did you do something different to compile it?

      Delete
  10. Nope. Just browsed to the directory and ran the make command. Thanks for the info about the release though. I was getting kinda confused about the two names.

    I will make sure i don't have mixed versions and try again.

    ReplyDelete
  11. I'm running into a more interesting problem now. When I start the happlier along with a script that parses a large XML file and inserts it into MySQL, the applier will exit randomly.
    Also, when I check the exit status ("echo $?") it prints 0, which means that it was a normal exit.

    Is there any case where the happlier will exit with no external interaction? Or exit at all, for that matter?

    ReplyDelete
    Replies
    1. Hi Amr,

      Sorry for the late reply.
      No, the happlier is a continuous process, and should not ideally exit on starting another process which would insert data into MySQL. If it does, there would be an error code or message.

      Could you share some more information on what commands you issue?

      Delete
  12. More interesting finding: happlier will exit exactly after writing 740243. Any idea what this magic number means?

    ReplyDelete
    Replies
    1. Hi Amr!

      Sorry for the late reply.
      I am, however, unsure of this behavior. 740243 is not a magic number relevant to happlier; I wonder if it comes from the script, though.


      Delete
  13. Very Interesting. Is there any work around to make this compatible with versions prior to 5.6?

    ReplyDelete
    Replies
    1. Hi,

      Thank you for the interest in the product.
      Yes, there are workarounds possible.
      The only change from MySQL 5.5 affecting the applier is the new field types supporting fractional timestamps, added in version 5.6.

      1. A quick and short way, if you require compatibility with MySQL 5.5 only (this would work for 5.6 too, but these fields will not be supported), is to apply this patch:

      === modified file 'src/value.cpp'
      --- src/value.cpp
      +++ src/value.cpp 2013-10-30 04:51:41 +0000
      @@ -148,6 +148,8 @@
      }
      break;
      }
      +//excluding the new field types introduced in 5.6 version
      +#if 0
      case MYSQL_TYPE_TIME2:
      if (metadata > MAX_TIME_WIDTH)
      length= 3 + (metadata - MAX_TIME_WIDTH)/2;
      @@ -166,6 +168,7 @@
      else
      length= 5;
      break;
      +#endif
      default:
      length= UINT_MAX;
      }

      2. A better way, however, is to support both, which requires detecting the library version (libmysqlclient) at build time and using it as a flag for conditional compilation of the above field types.

      Thank you,
      Shubhangi

      Delete
  14. For Binary_log_event objects, is the object being pointed to by m_header leaked?
    An object is allocated as "m_waiting_event" in Binlog_tcp_driver::wait_for_next_event, and is then passed to the event, but I don't see where it might be deleted.

    ReplyDelete
    Replies
    1. Hi!

      Thank you for pointing out the issue, and thank you for trying out the applier!

      In case of m_header, it is an object in Binary_log_event, so the memory is freed when the destructor is called. The caller has the responsibility to free memory used to set m_header. In the current code, the tcp_driver sets it using m_waiting_event (memory assigned from heap), and file_driver sets it using m_event_log_header (memory assigned in stack).

      There is a memory leak when the tcp_driver is used to connect to the MySQL server ( which is m_waiting_event). It should be assigned memory in the constructor(s) for Binlog_tcp_driver instead of wait_for_next_event, and freed in the disconnect method.

      A patch to address this has been committed, and will be published in the next release.


      Thanks once again,
      Shubhangi

      Delete
  15. Hi Shubhangi,

    Thank you for your post.

    I'm stepping through the prerequisites you stated above but having troubles determining which exact file (specific file name) to download and install.

    I currently have MySQL 5.6.2 installed and it runs great.

    But when I get to Hadoop 1.0.4, there are mainly 3 variations to download:

    1.2.X - current stable version, 1.2 release
    2.2.X - current stable 2.x version
    0.23.X - similar to 2.X.X but missing NN HA.

    Does only one of them work with Apache Hive and the Hadoop Applier? And which file should I download if we are to only use 1.2.X?

    http://www.motorlogy.com/apache/hadoop/common/stable1/


    There are many items like this in your prerequisites. Can you kindly point me to the right files so I can get this set up? Is there anywhere I can get tutorials on how to use MySQL, Hadoop, and Apache Hive together?

    I really appreciate what you have done here and thank you in advance for your help. I'm really interested in Big Data and would like to learn.

    Thanks,
    Jason

    ReplyDelete
    Replies
    1. Hi Jason!

      Thank you for trying out the Applier.
      - MySQL 5.6.2: Great! Just make sure that when you run it along with the Applier,
      you set binlog_checksum=NONE before starting the server.

      - Hadoop version: You can use any. I have tested it with 1.0.4, but you can
      try the latest stable versions. From the link you provided, you
      can download whichever one suits your platform. You might need
      to edit the file 'FindHDFS.cmake', if necessary,
      to have HDFS_LIB_PATHS set as a path to libhdfs.so, and
      HDFS_INCLUDE_DIRS have the path pointing to the location
      of hdfs.h.

      For 1.x versions, library path is $ENV{HADOOP_HOME}/c++/Linux-i386-32/lib ,
      and header files are contained in $ENV{HADOOP_HOME}/src/c++/libhdfs.
      For 2.x releases, header files and libraries can be found in
      $ENV{HADOOP_HOME}/lib/native, and $ENV{HADOOP_HOME}/include
      respectively.
      (Details in part 2 of this blog http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html )

      - Other prerequisites and tutorials:
      MySQL is very well documented here:
      " http://dev.mysql.com/doc/refman/5.6/en/index.html "
      Using Apache Hive and Hadoop is very well documented here:
      " https://cwiki.apache.org/confluence/display/Hive/GettingStarted "
      " http://hadoop.apache.org/docs/current/ "

      Usage of these three together is documented in the second part of the blog:
      " http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-2.html "

      Hope this helps.
      Please reply to the thread for any other clarifications, happy to see you trying it out!

      Thanks,
      Shubhangi

      Delete
  16. Thanks for the insight.

    We run a large production environment with OLTP in MySQL, using application-level sharding. We also use statement-based replication. Currently, we use MySQL backups to bulk load HDFS every day, but this takes a long time, and we're looking for real-time options.
    Because we use OLTP SQL, we need update and delete statement support, and because we're using statement-based replication (row-based has a bug which makes it not work for our particular use case), the Hadoop Applier doesn't currently do much for us.
    What is the status of the project? Is it being worked on actively? Is there a roadmap?

    ReplyDelete
    Replies
    1. Hi Jon!

      Sorry for the delay in the reply.
      Thank you for sharing your use case with us.

      Making the Applier work with statement-based replication is difficult for us, because the resulting rows are not guaranteed to be the same when statements are replayed from the binary logs, especially for unsafe queries.
      Also, we would like to know the RBR-specific bug which is blocking your work; that would help us get better insight.

      We have considered adding update and delete, but there are no concrete plans yet.

      Thank you once again,
      Shubhangi

      Delete
  17. Replies
    1. Hi Abhijeet!

      Thank you for your interest in the Applier!

      The steps are mentioned in the next blog:
      http://innovating-technology.blogspot.in/2013/04/mysql-hadoop-applier-part-2.html

      Thanks,
      Shubhangi

      Delete
  18. Hi shubhangi

    I got the following error when I entered make happlier:

    [ 77%] Built target replication_static
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:44: CMakeFiles/happlier.dir/depend.make: No such file or directory
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:47: CMakeFiles/happlier.dir/progress.make: No such file or directory
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:50: CMakeFiles/happlier.dir/flags.make: No such file or directory
    make[3]: *** No rule to make target `CMakeFiles/happlier.dir/flags.make'. Stop.
    make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
    make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
    make: *** [happlier] Error 2

    Note: I am using Cloudera.



    ReplyDelete
  19. Hi Mahesh,

    Thank you trying out the Applier!

    Sorry, I tried, but I am not able to reproduce the problem.
    Can you please give the exact steps you followed to build the code? Did you use an IDE? Please also mention the directory path in which you executed the build command.

    Thank you,
    Shubhangi

    P.S. Please note this might be a bug, and you may report it on bugs.mysql.com under the category 'MySQL Server: Binlog'.

    ReplyDelete
  20. Hi Shubhangi,

    Thank you for the reply

    I just followed your steps and this is my directory path:

    cloudera@dn66:~/mysql-hadoop-applier-0.1.0$ pwd
    /home/cloudera/mysql-hadoop-applier-0.1.0
    cloudera@dn66:~/mysql-hadoop-applier-0.1.0$ make happlier
    [ 77%] Built target replication_static
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:44: CMakeFiles/happlier.dir/depend.make: No such file or directory
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:47: CMakeFiles/happlier.dir/progress.make: No such file or directory
    examples/mysql2hdfs/CMakeFiles/happlier.dir/build.make:50: CMakeFiles/happlier.dir/flags.make: No such file or directory
    make[3]: *** No rule to make target `CMakeFiles/happlier.dir/flags.make'. Stop.
    make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
    make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
    make: *** [happlier] Error 2
    cloudera@dn66:~/mysql-hadoop-applier-0.1.0$ ls
    CHANGELOG.txt cmake_install.cmake CPackConfig.cmake examples lib README src
    CMakeCache.txt CMakeLists.txt CPackSourceConfig.cmake FindHDFS.cmake Makefile release tests
    CMakeFiles COPYING CTestTestfile.cmake include MyCmake source_downloads
    cloudera@dn66:~/mysql-hadoop-applier-0.1.0$


    ReplyDelete
    Replies
    1. Hi Mahesh,

      Sorry for the delay in the reply. I have not been able to reproduce the issue yet.
      I shall look into it once again.

      Thanks,
      Shubhangi

      Delete
    2. Hi Shubha,
      Thanks for your reply.....am also trying to solve this.


      Thanks,
      Mahesh

      Delete
  21. Hi Shubhangi,

    I am using Cloudera CDH 5 and MySQL 5.6.17. Hadoop and Hive are running on a VM.
    How can I resolve the following error?

    happlier mysql://root@localhost:3306 hdfs://localhost:8020
    The default data warehouse directory in HDFS will be set to /usr/hive/warehouse
    Change the default data warehouse directory? (Y or N) N
    loadFileSystems error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    hdfsBuilderConnect(forceNewInstance=0, nn=localhost, port=8020, kerbTicketCachePath=(NULL), userName=(NULL)) error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    Couldnot connect to HDFS file system

    thanks
    Karan

    ReplyDelete
    Replies
    1. Hi Karan,

      Thank you for trying out the applier.

      The exception says 'NoClassDefFound', and I suspect the classpath is not set correctly.
      For Hadoop versions 2.0.0 and above, the classpath doesn't support wildcard characters. If you add the jars explicitly to the CLASSPATH, your app will work.

      You could use a simple shell loop such as the one here:

      source /usr/lib/bigtop-utils/bigtop-detect-javahome
      export CLASSPATH=/etc/hadoop/conf
      for file in `ls /usr/lib/hadoop/client/*.jar`
      do
      export CLASSPATH=$CLASSPATH:$file
      done
      export LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server/"


      Hope that helps.

      Thank you,
      Shubhangi

      Delete
  22. Thanks for sharing this tutorial, it's very helpful for my school task.

    Asrizal Wahdan Wilsa Sharing Media

    ReplyDelete
  23. Hi Shubhangi,

    The applier runs for a few minutes and then I get the following error:
    Written 215 bytes to datafile in the following directory: hdfs://localhost:8020/user/hive/warehouse/test_db.db/test_insert
    Written 319 bytes to datafile in the following directory: hdfs://localhost:8020/user/hive/warehouse/test_db.db/test_insert
    14/05/15 14:55:39 INFO hdfs.DFSClient: Exception in createBlockOutputStream
    java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1344)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1193)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
    14/05/15 14:55:39 WARN hdfs.DFSClient: DataStreamer Exception
    java.lang.NullPointerException
    at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:279)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:579)
    FSDataOutputStream#close error:
    java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
    hdfsOpenFile(datafile1.txt): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2458)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2340)
    ...
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:316)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161)
    Failed to open datafile1.txt for writing!
    hdfsOpenFile(datafile1.txt): FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.

    At the same time I see errors in the namenode log file:
    2014-05-15 14:55:39,128 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt f
    or DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.
    2014-05-15 14:55:39,128 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mysql (auth:SIMPLE) cause:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hive/warehouse/test_db.db/test_insert/datafile1.txt for DFSClient_NONMAPREDUCE_1866428327_1 on client 127.0.0.1 because current leaseholder is trying to recreate file.



    Any idea what can be the root cause?
    thanks
    --karan

    ReplyDelete
    Replies
    1. Hi Karan,
      Sorry for the delay in the reply.
      It seems the datanode did not allow the transfer. I am not sure of the root cause: can you try using " hdfs dfs -put " for the same, to verify first? If it works fine, that implies the Applier is facing some issue.

      Thanks,
      Shubhangi

      Delete
  24. Hi Shubhangi,

    I got the following error,

    [root@dn66 mysql-hadoop-applier-0.1.0]# make happlier
    [ 77%] Built target replication_static
    Linking CXX executable happlier
    /usr/bin/ld: warning: libmawt.so, needed by /usr/lib/java/jre/lib/amd64/libjawt.so, not found (try using -rpath or -rpath-link)
    CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::HDFSSchema(std::basic_string, std::allocator > const&, int, std::basic_string, std::allocator > const&, std::basic_string, std::allocator > const&)':
    hdfs_schema.cpp:(.text+0xa0): undefined reference to `hdfsConnect'
    hdfs_schema.cpp:(.text+0xd8): undefined reference to `hdfsConnectAsUser'
    CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::~HDFSSchema()':
    hdfs_schema.cpp:(.text+0x334): undefined reference to `hdfsDisconnect'
    CMakeFiles/happlier.dir/hdfs_schema.cpp.o: In function `HDFSSchema::HDFS_data_insert(std::basic_string, std::allocator > const&, char const*)':
    hdfs_schema.cpp:(.text+0x537): undefined reference to `hdfsSetWorkingDirectory'
    hdfs_schema.cpp:(.text+0x5f4): undefined reference to `hdfsExists'
    hdfs_schema.cpp:(.text+0x62d): undefined reference to `hdfsOpenFile'
    hdfs_schema.cpp:(.text+0x663): undefined reference to `hdfsOpenFile'
    hdfs_schema.cpp:(.text+0x6d5): undefined reference to `hdfsWrite'
    hdfs_schema.cpp:(.text+0x777): undefined reference to `hdfsFlush'
    hdfs_schema.cpp:(.text+0x7cc): undefined reference to `hdfsCloseFile'
    /usr/lib/java/jre/lib/amd64/libjawt.so: undefined reference to `awt_Unlock@SUNWprivate_1.1'
    /usr/lib/java/jre/lib/amd64/libjawt.so: undefined reference to `awt_GetComponent@SUNWprivate_1.1'
    /usr/lib/java/jre/lib/amd64/libjawt.so: undefined reference to `awt_Lock@SUNWprivate_1.1'
    /usr/lib/java/jre/lib/amd64/libjawt.so: undefined reference to `awt_GetDrawingSurface@SUNWprivate_1.1'
    /usr/lib/java/jre/lib/amd64/libjawt.so: undefined reference to `awt_FreeDrawingSurface@SUNWprivate_1.1'
    collect2: ld returned 1 exit status
    make[3]: *** [examples/mysql2hdfs/happlier] Error 1
    make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
    make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
    make: *** [happlier] Error 2

    help me

    ReplyDelete
    Replies
    1. Hi Mahesh,

      Thank you for trying out the applier.

      Other users have reported the same issue.
      One of them resolved it on a Linux VM by manually linking $JAVA_HOME/jre/lib//xawt/libmawt.so to $JAVA_HOME/jre/lib/libmawt.so.
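      For example, something along these lines (the amd64 sub-directory is an assumption here; adjust it to your JRE layout):

      ln -s $JAVA_HOME/jre/lib/amd64/xawt/libmawt.so $JAVA_HOME/jre/lib/libmawt.so   # amd64 path is an assumption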

      Also, can you please check for the following:
      1. Do you have JAVA_HOME set?
      2. Do you have CLASSPATH set to point to the jars required to run Hadoop itself?
      (command: export CLASSPATH=$(hadoop classpath) )
      3. Can you please try running Hadoop and check if it runs fine?

      Maybe installing the Oracle JDK (I use 1.7.0_03) instead of OpenJDK would help.

      Hope this helps!

      Thank you,
      Shubhangi

      Delete
  25. Hi shubangi,

    Thank you for your reply
    I solved that error; now I get a new error:

    [root@dn66 mysql-hadoop-applier-0.1.0]# make happlier
    [ 77%] Built target replication_static
    Linking CXX executable happlier
    /opt/hadoop-2.3.0/lib/native/libhdfs.so: could not read symbols: File in wrong format
    collect2: ld returned 1 exit status
    make[3]: *** [examples/mysql2hdfs/happlier] Error 1
    make[2]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/all] Error 2
    make[1]: *** [examples/mysql2hdfs/CMakeFiles/happlier.dir/rule] Error 2
    make: *** [happlier] Error 2

    Thank you,
    Mahesh

    ReplyDelete
    Replies
    1. Can you please tell me how to solve this error?

      thank you
      Mahesh,

      Delete
    2. Hi Mahesh,

      Good to know the previous error is resolved!

      This error may be occurring because of a mismatch in the library versions used by the Applier. Can you please make sure that libhdfs is the 32-bit version?

      Hope that helps,
      Shubhangi

      Delete
  26. Hi Shubha,

    I am using the 64-bit version.

    Thank you
    Mahesh

    ReplyDelete
  27. Hi Shubangi,

    I could not put a file with the hdfs command at the same time. I also see the following insufficient space error in the datanode:

    2014-05-28 14:21:16,826 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor. Cause is
    2014-05-28 14:21:16,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-2086983221-10.6.49.105-1398121375219:blk_1073741981_81479 received exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Insufficient space for appending to FinalizedReplica, blk_1073741981_81479, FINALIZED
    getNumBytes() = 3220492
    getBytesOnDisk() = 3220492
    getVisibleLength()= 3220492
    getVolume() = /var/lib/hadoop-hdfs/cache/hdfs/dfs/data/current
    getBlockFile() = /var/lib/hadoop-hdfs/cache/hdfs/dfs/data/current/BP-2086983221-10.6.49.105-1398121375219/current/finalized/subdir16/subdir56/blk_1073741981
    unlinked =true
    2014-05-28 14:21:16,826 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: ops-dbops1002.idanalytics.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57952 dest: /127.0.0.1:50010
    org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Insufficient space for appending to FinalizedReplica, blk_1073741981_81479, FINALIZED

    Monitoring HDFS shows 1.9G of available space before I start the happlier:

    -bash-4.1$ hadoop fs -df -h /user
    Filesystem Size Used Available Use%
    hdfs://localhost:8020 5.9 G 54.3 M 1.9 G 1%

    A few minutes later the 1.9G has been used, but datafile1.txt is only 2MB.
    When I terminate the happlier, the space becomes available again.
    Do you know why hadoop or happlier is holding the space?
    Thanks
    --Karan

    ReplyDelete
  28. Hi,
    Does the Hadoop Applier support PARTITION on Hive? Currently we use Sqoop plus a shell script to replicate data from MySQL to HDFS (Hive) and add partitions based on date and hour. This is necessary for us to use Impala; otherwise Impala may crash because of lack of memory.

    I am not sure whether the Hadoop Applier meets our needs or provides a way to run a shell script along with it.

    ReplyDelete
  29. Hi shubhangi,

    Thank you so much. Finally I configured the hadoop applier and it is working fine.
    An issue I am facing is that when I re-execute the following commands,

    ./mtr --start --suite=rpl --mysqld=--binlog_format='ROW' --mysqld=--binlog_checksum=NONE
    mysql -u root -p --socket=/usr/local/mysql/mysql-test/var/tmp/mysqld.1.sock

    the database which I created gets deleted.

    Thanks

    Mahesh

    ReplyDelete
  30. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. And can it only synchronize MySQL 5.6? How about MySQL 5.5, or MariaDB 5.x and MariaDB 10.x?
      Thanks very much.

      Delete