Building Low-Cost Mass Data Applications with Flume and EMR
Flume is a distributed log-collection system that efficiently gathers and processes log data from a variety of sources, including application servers, messaging middleware, and other data producers, and stores the collected data in a specified location for analysis. EMR (Elastic MapReduce) is a hosted Hadoop cloud service provided by Tencent Cloud that offers complete cluster management, service monitoring, security management, and storage.
Overview of the Solution
In this article, we will describe how to use Flume, EMR, and Object Storage (COS) to build low-cost data warehouse applications. The overall application architecture is as follows:
Flume Data Stream
The data to be analyzed can come from various sources, including:
- Application server logs and messages produced by message brokers such as Kafka.
- Other data sources, such as HTTP requests or web server logs.
Flume lets you choose the output destination according to your needs. Three types of storage are considered here:
• HDFS (Hadoop Distributed File System): the conventional storage for Hadoop clusters.
• COS (Object Storage): a cost-effective solution for storing large amounts of data.
• CFS (Cloud File Storage): a scalable and secure storage solution.
Setting Up Flume with COS Sink
If cost is a concern, we recommend setting the output destination to COS. This article will focus on how to set up Flume with a COS Sink.
Installing Flume
To install Flume, follow these steps:
2.1 Download Flume
You can use the following command to download Flume version 1.7.0:
git clone -b release-1.7.0 https://github.com/apache/flume.git
2.2 Compile and Install Flume
To compile and install Flume, use the following command:
mvn package -Dmaven.javadoc.skip=true -DskipTests -Dtar -Dhadoop2.version=2.7.3 -Phadoop-2
After the build succeeds, the packaged installation archive is located at:
flume-ng-dist/target/apache-flume-1.7.0-bin.tar.gz
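A minimal sketch of unpacking the archive; the target directory /usr/local is only an example and any path will do:
tar -zxvf flume-ng-dist/target/apache-flume-1.7.0-bin.tar.gz -C /usr/local
cd /usr/local/apache-flume-1.7.0-bin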
2.3 Resolve Compatibility Issues
Enter the Flume installation directory and execute the following commands to remove conflicting jar files:
rm -rf ./lib/httpclient-4.2.1.jar
rm -rf ./lib/httpcore-4.1.3.jar
Configuring Flume
To configure Flume, follow these steps:
3.1 Copy the Hadoop Directory
On the EMR cluster node, copy the Hadoop directory to the log path:
cp -r /usr/local/service/hadoop /data/emr/hdfs/log
3.2 Confirm COS Hadoop Configuration
Confirm that the Hadoop configuration file containing the COS settings is present:
ls -al /usr/local/service/hadoop/etc/hadoop/core-site.xml
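As an optional quick check, you can grep the file for the COS-related entries:
grep -A 2 "fs.cos" /usr/local/service/hadoop/etc/hadoop/core-site.xml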
The configuration file should contain properties similar to the following; replace the bracketed values with your own account information:
<property>
    <name>fs.cos.userinfo.secretId</name>
    <value>[SecretId]</value>
</property>
<property>
    <name>fs.cos.userinfo.appid</name>
    <value>[APPID]</value>
</property>
<property>
    <name>fs.cos.userinfo.secretKey</name>
    <value>[SecretKey]</value>
</property>
<property>
    <name>fs.cosn.block.size</name>
    <value>1048576</value>
</property>
<property>
    <name>fs.cosn.impl</name>
    <value>org.apache.hadoop.fs.cosnative.NativeCosFileSystem</value>
</property>
<property>
    <name>fs.cosn.userinfo.region</name>
    <value>ap-guangzhou</value>
</property>
<property>
    <name>fs.cosn.local_block_size</name>
    <value>2097152</value>
</property>
Note: If you selected COS when creating the EMR cluster, the configuration will be generated automatically. If you are unsure about the configuration values, you can submit a work order or contact customer service.
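Before configuring Flume, it can help to confirm that the node can reach COS through the Hadoop client (bucketname below is a placeholder for your own bucket):
hadoop fs -ls cosn://bucketname/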
3.3 Flume Configuration
First create the local directories that Flume will use:
mkdir -p /data/emr/hdfs/logs
mkdir -p /data/emr/hdfs/tmp
Then create a new configuration file (demo.conf, referenced when starting the agent later) in the Flume conf directory and configure the agent:
search.sources = so
search.sinks = si
search.channels = sc
# Configure a channel that buffers events in memory:
search.channels.sc.type = memory
search.channels.sc.capacity = 20000
search.channels.sc.transactionCapacity = 1000
# Configure the source:
search.sources.so.channels = sc
search.sources.so.type = exec
search.sources.so.command = tail -F /data/logs/demo.log
# Describe the sink:
search.sinks.si.channel = sc
search.sinks.si.type = hdfs
search.sinks.si.hdfs.path = cosn://bucketname/demo
search.sinks.si.hdfs.writeFormat = Text
search.sinks.si.hdfs.fileType = DataStream
search.sinks.si.hdfs.rollSize = 0
search.sinks.si.hdfs.rollCount = 10000
search.sinks.si.hdfs.batchSize = 1000
search.sinks.si.hdfs.rollInterval = 1
Starting the Service and Verifying
4.1 Start the Agent Service
Enter the Flume installation directory and execute the following command to start the agent service; the name passed with --name must match the prefix used in the configuration file (search in this example):
bin/flume-ng agent -c conf -f ./conf/demo.conf --name search
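If no application is writing to /data/logs/demo.log yet, you can append a few test lines by hand so the exec source has data to pick up; this is only an illustrative sketch using the path from the configuration above:
mkdir -p /data/logs
for i in $(seq 1 100); do
  echo "$(date '+%Y-%m-%d %H:%M:%S') demo log line $i" >> /data/logs/demo.log
done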
4.2 Verify Log Generation
Use the following command to verify that the log files are being written to COS:
hadoop fs -ls cosn://bucketname/demo
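The HDFS sink names its output files with the FlumeData prefix by default; assuming that default, you can spot-check the contents of the generated files like this:
hadoop fs -cat "cosn://bucketname/demo/FlumeData.*" | head -n 20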
Starting Analysis Tasks
After the logs have been pushed successfully, you can analyze the data in the following ways:
• Point a Hive table's storage location at the pushed logs and run ETL on them (see the sketch below).
• Produce the data directly in ORC format and query it with Presto.
• Write Spark or MapReduce programs to analyze the data.
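As an illustration of the first option, a Hive external table can be pointed directly at the COS path that Flume writes to; the table name demo_logs and the single-column layout are only assumptions for this sketch:
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS demo_logs (line STRING) LOCATION 'cosn://bucketname/demo'"
hive -e "SELECT COUNT(*) FROM demo_logs"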