Tuesday, August 19, 2014

Flume implementation using SpoolDirectory Source, HDFS Sink, File Channel


Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Steps:
1. Create a directory on the local filesystem to receive the log files from the mount location.
For example: /home/user/flume/source
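This is a plain local directory on the Flume host; a minimal way to create it (using the example path above) is:

$mkdir -p /home/user/flume/source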

2. Create an HDFS directory into which Flume will copy the files from the spool directory.
For example: /user/hdpuser/flumelogs
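Assuming the hdfs client is on the PATH, the target directory can be created the same way:

$hdfs dfs -mkdir -p /user/hdpuser/flumelogs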

3. Create the flume-conf.conf file in the /home/user/flume/ folder:
#Single agent flume configuration

#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1

#configuring the source
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source

#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume/.flume/file-channel/checkpoint
#log data files are created in the data directory while the flume-agent runs
a1.channels.chan1.dataDirs = /home/user/flume/.flume/file-channel/data
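#Optional explicit channel sizing (illustrative values, not part of the original setup).
#Per the notes at the end: sink batch size <= transactionCapacity <= capacity.
#a1.channels.chan1.capacity = 1000000
#a1.channels.chan1.transactionCapacity = 10000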

#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#roll the log file in HDFS based on the roll size
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.serializer = text
#a1.sinks.sink1.serializer.appendNewline = false

a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1
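With useLocalTimeStamp = true, the %y and %m escapes in hdfs.path are resolved from the agent's local clock, so a file written in August 2014 would land under a path like:

/user/hdpuser/flumelogs/year=2014/month=08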

4. Run the flume-agent.
At the command prompt, run the commands below:
$cd /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1
--- this creates a .flumespool directory inside /home/user/flume/source, which holds Flume's tracking metadata while the flume-agent streams the files.
Note: the log message "Spooling Directory Source runner has shutdown" means there is currently no file to stream.

5. Copy a log file into /home/user/flume/source.
The flume-agent picks up that file and streams it into HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m). Once a file has been fully ingested, the source renames it with a .COMPLETED suffix.
6. Open your HDFS file browser and verify that the file is there.
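Alternatively, list the ingested files from the command line:

$hdfs dfs -ls -R /user/hdpuser/flumelogs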
7. The checkpoint directory and data directory files are created by the Flume file channel and hold the checkpoints and the buffered log data.

Note:
a) The channel's transaction capacity must be less than or equal to the channel capacity.
b) The sink's batch size must be less than or equal to the channel's transaction capacity.
In the same way, you can also use a memory channel instead of the file channel, as sketched below.
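A minimal memory channel sketch (the capacity figures are illustrative, not from the original setup) that satisfies both constraints for the batch size of 10000 used above:

a1.channels.chan1.type = memory
a1.channels.chan1.capacity = 100000
a1.channels.chan1.transactionCapacity = 10000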

Comments:

  1. I'm getting an error in prod Flume with a similar setup:

    2015-09-06 17:34:46,554 (pool-6-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:238)] Last read was never committed - resetting mark position.
    2015-09-06 17:34:49,563 (pool-6-thread-1) [WARN - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:239)] The channel is full, and cannot write data now. The source will try again after 4000

    curl localhost:5653/metrics
    {"SINK.s3-sink":{"BatchCompleteCount":"0","ConnectionFailedCount":"1","EventDrainAttemptCount":"0","ConnectionCreatedCount":"0","Type":"SINK","BatchEmptyCount":"0","ConnectionClosedCount":"0","EventDrainSuccessCount":"0","StopTime":"0","StartTime":"1441560792619","BatchUnderflowCount":"0"},"SOURCE.spooling-directory":{"OpenConnectionCount":"0","Type":"SOURCE","AppendBatchAcceptedCount":"0","AppendBatchReceivedCount":"20","EventAcceptedCount":"0","AppendReceivedCount":"0","StopTime":"0","StartTime":"1441560792648","EventReceivedCount":"2000","AppendAcceptedCount":"0"},"CHANNEL.fileChannel":{"EventPutSuccessCount":"0","ChannelFillPercentage":"99.9988","Type":"CHANNEL","StopTime":"0","EventPutAttemptCount":"260","ChannelSize":"999988","StartTime":"1441560792613","EventTakeSuccessCount":"0","ChannelCapacity":"1000000","EventTakeAttemptCount":"1"}}

    Any idea what I can do?

    Reply: I think your disk space is full. Could you please check the free space on the drive where your .log data files are being written? Which channel are you using in flume-conf?
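    A quick way to check the free space, assuming the file channel paths from the config above:

    $df -h /home/user/flume/.flume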
  2. If I want to make the HDFS output file a certain size, like 10 MB, what should I change?

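    A minimal sketch of an answer, based on the sink settings above: hdfs.rollSize is in bytes, so set it to 10 MB and zero out the other two roll triggers so that only size-based rolling applies:

    a1.sinks.sink1.hdfs.rollSize = 10485760
    a1.sinks.sink1.hdfs.rollCount = 0
    a1.sinks.sink1.hdfs.rollInterval = 0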