Tuesday, August 19, 2014

Flume implementation using SpoolDirectory Source, HDFS Sink, File Channel


Flume: Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Steps:
1. Create a directory on the local filesystem to receive the log files from the mount location.
For example: /home/user/flume/source
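This is a plain local directory on the Flume host; a minimal way to create it (using the example path above) is:

$mkdir -p /home/user/flume/source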

2. Create an HDFS directory into which Flume will copy the files from the spool directory.
For example: /user/hdpuser/flumelogs
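Assuming the hdfs client is on the PATH, the target directory can be created the same way:

$hdfs dfs -mkdir -p /user/hdpuser/flumelogs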

3. Create the flume-conf.conf file in the /home/user/flume/ folder:
#Single agent flume configuration

#Agent components
a1.sources = src1
a1.channels = chan1
a1.sinks = sink1

#configuring the source
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/user/flume/source

#configuring the channel
a1.channels.chan1.type = file
#Checkpoint directory
a1.channels.chan1.checkpointDir = /home/user/flume/.flume/file-channel/checkpoint
#log data files are created in the data directory while the flume-agent runs
a1.channels.chan1.dataDirs = /home/user/flume/.flume/file-channel/data
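#Optional explicit channel sizing (illustrative values, not part of the original setup).
#Per the notes at the end: sink batch size <= transactionCapacity <= capacity.
#a1.channels.chan1.capacity = 1000000
#a1.channels.chan1.transactionCapacity = 10000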

#configuring the Sink
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/hdpuser/flumelogs/year=20%y/month=%m
#a1.sinks.sink1.hdfs.round = true
#a1.sinks.sink1.hdfs.roundValue = 10
#a1.sinks.sink1.hdfs.roundUnit = minute
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.hdfs.batchSize = 10000
a1.sinks.sink1.hdfs.fileType = DataStream
#a1.sinks.sink1.hdfs.filePrefix = %d-START-
#roll the log file in HDFS based on the roll size
a1.sinks.sink1.hdfs.rollCount = 0
a1.sinks.sink1.hdfs.rollSize = 10737418240
a1.sinks.sink1.hdfs.rollInterval = 216000
#a1.sinks.sink1.serializer = text
#a1.sinks.sink1.serializer.appendNewline = false

a1.sources.src1.channels = chan1
a1.sinks.sink1.channel = chan1
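With useLocalTimeStamp = true, the %y and %m escapes in hdfs.path are resolved from the agent's local clock, so a file written in August 2014 would land under a path like:

/user/hdpuser/flumelogs/year=2014/month=08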

4. Run the flume-agent.
At the command prompt, run the commands below:
$cd /home/user/flume
$flume-ng agent --conf conf --conf-file flume-conf.conf --name a1
--- this creates a .flumespool directory inside /home/user/flume/source, which holds Flume's tracking metadata while the flume-agent streams the files.
Note: the log message "Spooling Directory Source runner has shutdown" means there is currently no file to stream.

5. Copy a log file into /home/user/flume/source.
The flume-agent picks up that file and streams it into HDFS (/user/hdpuser/flumelogs/year=20%y/month=%m). Once a file has been fully ingested, the source renames it with a .COMPLETED suffix.
6. Open your HDFS file browser and verify that the file is there.
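Alternatively, list the ingested files from the command line:

$hdfs dfs -ls -R /user/hdpuser/flumelogs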
7. The checkpoint directory and data directory files are created by the Flume file channel and hold the checkpoints and the buffered log data.

Note:
a) The channel's transaction capacity must be less than or equal to the channel capacity.
b) The sink's batch size must be less than or equal to the channel's transaction capacity.
In the same way, you can also use a memory channel instead of the file channel, as sketched below.
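A minimal memory channel sketch (the capacity figures are illustrative, not from the original setup) that satisfies both constraints for the batch size of 10000 used above:

a1.channels.chan1.type = memory
a1.channels.chan1.capacity = 100000
a1.channels.chan1.transactionCapacity = 10000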

Comments:

  1. I'm getting an error in prod Flume with a similar setup:

    2015-09-06 17:34:46,554 (pool-6-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:238)] Last read was never committed - resetting mark position.
    2015-09-06 17:34:49,563 (pool-6-thread-1) [WARN - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:239)] The channel is full, and cannot write data now. The source will try again after 4000

    curl localhost:5653/metrics
    {"SINK.s3-sink":{"BatchCompleteCount":"0","ConnectionFailedCount":"1","EventDrainAttemptCount":"0","ConnectionCreatedCount":"0","Type":"SINK","BatchEmptyCount":"0","ConnectionClosedCount":"0","EventDrainSuccessCount":"0","StopTime":"0","StartTime":"1441560792619","BatchUnderflowCount":"0"},"SOURCE.spooling-directory":{"OpenConnectionCount":"0","Type":"SOURCE","AppendBatchAcceptedCount":"0","AppendBatchReceivedCount":"20","EventAcceptedCount":"0","AppendReceivedCount":"0","StopTime":"0","StartTime":"1441560792648","EventReceivedCount":"2000","AppendAcceptedCount":"0"},"CHANNEL.fileChannel":{"EventPutSuccessCount":"0","ChannelFillPercentage":"99.9988","Type":"CHANNEL","StopTime":"0","EventPutAttemptCount":"260","ChannelSize":"999988","StartTime":"1441560792613","EventTakeSuccessCount":"0","ChannelCapacity":"1000000","EventTakeAttemptCount":"1"}}

    Any idea what I can do?

    Reply: I think your disk space is full. Could you please check the free space on the drive where your .log data files are being written? Which channel are you using in flume-conf?
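    A quick way to check the free space, assuming the file channel paths from the config above:

    $df -h /home/user/flume/.flume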
  2. If I want to make the HDFS output file a certain size, like 10 MB, what should I change?

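    A minimal sketch of an answer, based on the sink settings above: hdfs.rollSize is in bytes, so set it to 10 MB and zero out the other two roll triggers so that only size-based rolling applies:

    a1.sinks.sink1.hdfs.rollSize = 10485760
    a1.sinks.sink1.hdfs.rollCount = 0
    a1.sinks.sink1.hdfs.rollInterval = 0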