Building Low-Cost Mass Data Applications with Flume and EMR
Flume is a distributed log-collection system that efficiently gathers and processes log data from a variety of sources, including application servers, messaging middleware, and other data producers, and stores the collected data in a specified location for analysis. EMR (Elastic MapReduce) is a hosted Hadoop cloud service provided by Tencent Cloud that offers complete cluster management, service monitoring, security management, and storage.
Overview of the Solution
In this article, we will describe how to use Flume, EMR, and Object Storage (COS) to build low-cost data warehouse applications. The overall application architecture is as follows:
Flume Data Stream
The data to be analyzed can come from various sources, including:
- Application server logs and messages produced by message brokers such as Kafka.
- Other data sources, such as HTTP requests or web server logs.
Flume lets you choose the output destination according to your needs. Three types of storage are considered here:
• HDFS (Hadoop Distributed File System): the conventional storage for Hadoop clusters.
• COS (Object Storage): a cost-effective solution for storing large amounts of data.
• CFS (Cloud File Storage): a scalable and secure storage solution.
Setting Up Flume with COS Sink
If cost is a concern, we recommend setting the output destination to COS. This article will focus on how to set up Flume with a COS Sink.
Installing Flume
To install Flume, follow these steps:
2.1 Download Flume
You can use the following command to download Flume version 1.7.0:
git clone -b release-1.7.0 https://github.com/apache/flume.git
2.2 Compile and Install Flume
To compile and install Flume, use the following command:
mvn package -Dmaven.javadoc.skip=true -DskipTests -Dtar -Dhadoop2.version=2.7.3 -Phadoop-2
After the build succeeds, the packaged installation archive is located at:
flume-ng-dist/target/apache-flume-1.7.0-bin.tar.gz
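A minimal sketch of unpacking the archive; the target directory /usr/local is only an example and any path will do:
tar -zxvf flume-ng-dist/target/apache-flume-1.7.0-bin.tar.gz -C /usr/local
cd /usr/local/apache-flume-1.7.0-bin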
2.3 Resolve Compatibility Issues
Enter the Flume installation directory and execute the following commands to remove conflicting jar files:
rm -rf ./lib/httpclient-4.2.1.jar
rm -rf ./lib/httpcore-4.1.3.jar
Configuring Flume
To configure Flume, follow these steps:
3.1 Copy the Hadoop Directory
On the EMR cluster node, copy the Hadoop directory to the log path:
cp -r /usr/local/service/hadoop /data/emr/hdfs/log
3.2 Confirm COS Hadoop Configuration
Confirm that the Hadoop configuration file containing the COS settings is present:
ls -al /usr/local/service/hadoop/etc/hadoop/core-site.xml
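As an optional quick check, you can grep the file for the COS-related entries:
grep -A 2 "fs.cos" /usr/local/service/hadoop/etc/hadoop/core-site.xml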
The configuration file should contain properties similar to the following; replace the bracketed values with your own account information:
<property>
    <name>fs.cos.userinfo.secretId</name>
    <value>[SecretId]</value>
</property>
<property>
    <name>fs.cos.userinfo.appid</name>
    <value>[APPID]</value>
</property>
<property>
    <name>fs.cos.userinfo.secretKey</name>
    <value>[SecretKey]</value>
</property>
<property>
    <name>fs.cosn.block.size</name>
    <value>1048576</value>
</property>
<property>
    <name>fs.cosn.impl</name>
    <value>org.apache.hadoop.fs.cosnative.NativeCosFileSystem</value>
</property>
<property>
    <name>fs.cosn.userinfo.region</name>
    <value>ap-guangzhou</value>
</property>
<property>
    <name>fs.cosn.local_block_size</name>
    <value>2097152</value>
</property>
Note: If you selected COS when creating the EMR cluster, the configuration will be generated automatically. If you are unsure about the configuration values, you can submit a work order or contact customer service.
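Before configuring Flume, it can help to confirm that the node can reach COS through the Hadoop client (bucketname below is a placeholder for your own bucket):
hadoop fs -ls cosn://bucketname/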
3.3 Flume Configuration
First create the local directories that Flume will use:
mkdir -p /data/emr/hdfs/logs
mkdir -p /data/emr/hdfs/tmp
Then create a new configuration file (demo.conf, referenced when starting the agent later) in the Flume conf directory and configure the agent:
search.sources = so
search.sinks = si
search.channels = sc
# Configure a channel that buffers events in memory:
search.channels.sc.type = memory
search.channels.sc.capacity = 20000
search.channels.sc.transactionCapacity = 1000
# Configure the source:
search.sources.so.channels = sc
search.sources.so.type = exec
search.sources.so.command = tail -F /data/logs/demo.log
# Describe the sink:
search.sinks.si.channel = sc
search.sinks.si.type = hdfs
search.sinks.si.hdfs.path = cosn://bucketname/demo
search.sinks.si.hdfs.writeFormat = Text
search.sinks.si.hdfs.fileType = DataStream
search.sinks.si.hdfs.rollSize = 0
search.sinks.si.hdfs.rollCount = 10000
search.sinks.si.hdfs.batchSize = 1000
search.sinks.si.hdfs.rollInterval = 1
Starting the Service and Verifying
4.1 Start the Agent Service
Enter the Flume installation directory and execute the following command to start the agent service; the name passed with --name must match the prefix used in the configuration file (search in this example):
bin/flume-ng agent -c conf -f ./conf/demo.conf --name search
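If no application is writing to /data/logs/demo.log yet, you can append a few test lines by hand so the exec source has data to pick up; this is only an illustrative sketch using the path from the configuration above:
mkdir -p /data/logs
for i in $(seq 1 100); do
  echo "$(date '+%Y-%m-%d %H:%M:%S') demo log line $i" >> /data/logs/demo.log
done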
4.2 Verify Log Generation
Use the following command to verify that the log files are being written to COS:
hadoop fs -ls cosn://bucketname/demo
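The HDFS sink names its output files with the FlumeData prefix by default; assuming that default, you can spot-check the contents of the generated files like this:
hadoop fs -cat "cosn://bucketname/demo/FlumeData.*" | head -n 20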
Starting Analysis Tasks
After the logs have been pushed successfully, you can analyze the data in the following ways:
• Point a Hive table's storage location at the pushed logs and run ETL on them (see the sketch below).
• Produce the data directly in ORC format and query it with Presto.
• Write Spark or MapReduce programs to analyze the data.
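As an illustration of the first option, a Hive external table can be pointed directly at the COS path that Flume writes to; the table name demo_logs and the single-column layout are only assumptions for this sketch:
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS demo_logs (line STRING) LOCATION 'cosn://bucketname/demo'"
hive -e "SELECT COUNT(*) FROM demo_logs"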