<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
This page describes in detail how to run end-to-end tests on a Hudi dataset, which helps improve our confidence
in a release as well as perform large-scale performance benchmarks.
# Objectives
1. Test with different versions of core libraries and components such as `hdfs`, `parquet`, `spark`,
`hive` and `avro`.
2. Generate different types of workloads across different dimensions such as `payload size`, `number of updates`,
`number of inserts`, and `number of partitions`.
3. Perform multiple types of operations such as `insert`, `bulk_insert`, `upsert`, `compact`, and `query`.
4. Support custom post-process actions and validations.
# High Level Design
The Hudi test suite runs as a long-running Spark job. The suite is divided into the following high-level components:
## Workload Generation
This component generates the workload: `inserts`, `upserts`, etc.
## Workload Scheduling
Depending on the type of workload generated, data is either ingested into the target Hudi
dataset or the corresponding operation is executed. For example, compaction does not necessarily need a workload
to be generated/ingested, but it does require an execution to be scheduled.
## Other actions/operations
Besides ingestion, the test suite supports other types of operations such as Hive query execution, the clean action, etc.
# Usage instructions
## Entry class to the test suite
```
org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.java - Entry point of the Hudi test suite job. This
class wraps all the functionality required to run a configurable integration suite.
```
## Configurations required to run the job
```
org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.HoodieTestSuiteConfig - Config class that drives the behavior of the
integration test suite. This class extends the HoodieDeltaStreamer config class. See the HoodieDeltaStreamer page to
learn about all the available configs applicable to your test suite.
```
## Generating a custom Workload Pattern
There are two ways to generate a workload pattern:
1. Programmatically
You can create a DAG of operations programmatically - take a look at the `WorkflowDagGenerator` class.
Once you're ready with the DAG you want to execute, simply pass the class name as follows:
```
spark-submit
...
...
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob
--workload-generator-classname org.apache.hudi.integ.testsuite.dag.scheduler.<your_workflowdaggenerator>
...
```
2. YAML file
You can also write up the entire DAG of operations in YAML - take a look at `simple-deltastreamer.yaml` or the other
sample DAG files bundled with the test suite; a minimal sketch of the yaml format is shown below, after the spark-submit example.
Once you're ready with the DAG you want to execute, simply pass the yaml file path as follows:
```
spark-submit
...
...
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob
--workload-yaml-path /path/to/your-workflow-dag.yaml
...
```
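For orientation, a minimal DAG in YAML might look like the sketch below. The overall shape (named nodes, each with a
`type`, a `config` block and a `deps` edge pointing at the node it depends on) follows the sample yamls bundled with
the `hudi-integ-test` module; treat the specific node types and config keys here as illustrative and cross-check them
against the bundled samples:
```
# Hypothetical DAG: one insert round followed by an upsert that depends on it
dag_name: simple-cow-example
dag_rounds: 1
dag_intermittent_delay_mins: 1
dag_content:
  first_insert:
    config:
      record_size: 1000
      num_partitions_insert: 1
      repeat_count: 2
      num_records_insert: 100
    type: InsertNode
    deps: none
  first_upsert:
    config:
      record_size: 1000
      num_partitions_upsert: 1
      repeat_count: 1
      num_records_upsert: 50
    type: UpsertNode
    deps: first_insert
```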
## Building the test suite
The test suite can be found in the `hudi-integ-test` module. Use the `prepare_integration_suite.sh` script to build
the test suite; you can provide different parameters to the script.
```
shell$ ./prepare_integration_suite.sh --help
Usage: prepare_integration_suite.sh
--spark-command, prints the spark command
-h, hdfs-version
-s, spark version
-p, parquet version
-a, avro version
-s, hive version
```
```
shell$ ./prepare_integration_suite.sh
....
....
Final command : mvn clean install -DskipTests
```
## Running on a cluster or on your local machine
Copy the necessary files and jars to your cluster, then run the following spark-submit command after filling in the
correct values for the parameters.
NOTE: The properties file should have all the necessary information required to ingest into a Hudi dataset. For more
information on what properties need to be set, take a look at the test suite section under the demo steps; a minimal
sketch is also shown after the command template below.
```
shell$ ./prepare_integration_suite.sh --spark-command
spark-submit --master prepare_integration_suite.sh --deploy-mode
--properties-file --class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob target/hudi-integ-test-0.6.0-SNAPSHOT.jar --source-class --source-ordering-field --input-base-path --target-base-path --target-table --props --storage-type --payload-class --workload-yaml-path --input-file-size --<deltastreamer-ingest>
```
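As a rough illustration of such a properties file, the sketch below assumes an Avro source on DFS; the key names are
standard Hudi/DeltaStreamer configs, while the values (record key field, partition path field, schema file and input
locations) are placeholders to adapt to your setup:
```
# Hypothetical minimal test-suite properties; adjust values for your environment
hoodie.insert.shuffle.parallelism=2
hoodie.upsert.shuffle.parallelism=2
hoodie.bulkinsert.shuffle.parallelism=2
hoodie.datasource.write.recordkey.field=_row_key
hoodie.datasource.write.partitionpath.field=timestamp
hoodie.deltastreamer.source.dfs.root=/path/to/test-suite/input
hoodie.deltastreamer.schemaprovider.source.schema.file=/path/to/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/path/to/target.avsc
```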
## Running through a test-case (local)
Take a look at `TestHoodieTestSuiteJob` to see how you can run the entire suite using JUnit.
## Running an end-to-end test suite in the local Docker environment
Start the Hudi Docker demo:
```
docker/setup_demo.sh
```
NOTE: We need to make a couple of environment changes for Hive 2.x support; this will be fixed once Hudi moves to Spark 3.x.
Execute the commands below only if you are using a Hudi query node in your DAG; otherwise, this section is not required.
Also, for longer-running tests, go to the next section.
```
docker exec -it adhoc-2 bash
cd /opt/spark/jars
rm /opt/spark/jars/hive*
rm spark-hive-thriftserver_2.11-2.4.4.jar
wget https://repo1.maven.org/maven2/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-preview2/spark-hive-thriftserver_2.12-3.0.0-preview2.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-common/2.3.1/hive-common-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.3.1/hive-exec-2.3.1-core.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.3.1/hive-jdbc-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-llap-common/2.3.1/hive-llap-common-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-metastore/2.3.1/hive-metastore-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-serde/2.3.1/hive-serde-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-service/2.3.1/hive-service-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-service-rpc/2.3.1/hive-service-rpc-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/shims/hive-shims-0.23/2.3.1/hive-shims-0.23-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/shims/hive-shims-common/2.3.1/hive-shims-common-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-storage-api/2.3.1/hive-storage-api-2.3.1.jar
wget https://repo1.maven.org/maven2/org/apache/hive/hive-shims/2.3.1/hive-shims-2.3.1.jar
wget https://repo1.maven.org/maven2/org/json/json/20090211/json-20090211.jar
cp /opt/hive/lib/log* /opt/spark/jars/
rm log4j-slf4j-impl-2.6.2.jar
cd /opt
```
Copy the integration tests jar into the docker container
```
docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar adhoc-2:/opt
```
```
docker exec -it adhoc-2 /bin/bash
```
Clean the working directories before starting a new test:
```
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/output/
hdfs dfs -rm -r /user/hive/warehouse/hudi-integ-test-suite/input/
```
Launch a Copy-on-Write job with a spark-submit command along the following lines. The Spark tuning configs and the
argument values (source class, ordering field, props file, workload yaml path, input file size) are illustrative;
substitute the values that match your environment:
```
# COPY_ON_WRITE tables
# =========================
# Run the following command to start the test suite
spark-submit \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=100 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=2000m \
--class org.apache.hudi.integ.testsuite.HoodieTestSuiteJob \
/opt/hudi-integ-test-bundle-0.11.0-SNAPSHOT.jar \
--source-class org.apache.hudi.utilities.sources.AvroDFSSource \
--source-ordering-field timestamp \
--input-base-path /user/hive/warehouse/hudi-integ-test-suite/input \
--target-base-path /user/hive/warehouse/hudi-integ-test-suite/output \
--target-table table1 \
--props file:/opt/test.properties \
--workload-yaml-path file:/opt/your-workflow-dag.yaml \
--input-file-size 125829120 \
--use-deltastreamer
```