<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Hadoop integration
**Todo:** Add a description
## Properties
It's possible to configure the [ParquetInputFormat](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java) / [ParquetOutputFormat](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java) with Hadoop config or programmatically with setters.
**Example:**
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;

// Set a property through the Hadoop configuration...
Configuration conf = new Configuration();
conf.set("parquet.page.size", "128");

// ...or programmatically with the setters
Job writeJob = Job.getInstance(conf);
ParquetOutputFormat.setBlockSize(writeJob, 1024);
```
## Class: ParquetOutputFormat
**Property:** `parquet.summary.metadata.level`
**Description:** Write summary files in the same directory as the Parquet files.
If this property is set to `all`, both a summary file with row group info (`_metadata`) and a summary file without row group info (`_common_metadata`) are written.
If it is `common_only`, only the summary file without row group info (`_common_metadata`) is written.
If it is `none`, no summary files are written.
**Default value:** `all`
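**Example:**
A minimal sketch, reusing `conf` from the first example:
```java
// Keep only the _common_metadata summary file (valid levels: all, common_only, none)
conf.set("parquet.summary.metadata.level", "common_only");
```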
---
**Property:** `parquet.enable.summary-metadata`
**Description:** This property is deprecated; use `parquet.summary.metadata.level` instead.
If it is `true`, it behaves like `parquet.summary.metadata.level` set to `all`; if it is `false`, like `none`.
**Default value:** `true`
---
**Property:** `parquet.block.size`
**Description:** The block size in bytes. The effect of this property depends on the file system:
- If the file system (FS) supports blocks, as HDFS does, the block size will be the maximum of the default FS block size and this property, and the row group size will be equal to this property:
  - `block_size = max(default_fs_block_size, parquet.block.size)`
  - `row_group_size = parquet.block.size`
- If the file system doesn't support blocks, this property defines the row group size.

Note that larger row groups improve I/O when reading but consume more memory when writing.
**Default value:** `134217728` (128 MB)
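**Example:**
A sketch reusing `writeJob` from the first example; on a block-based FS the effective FS block size still follows the `max` rule above:
```java
// Request 256 MB row groups
ParquetOutputFormat.setBlockSize(writeJob, 256 * 1024 * 1024);
```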
---
**Property:** `parquet.page.size`
**Description:** The page size in bytes. Pages are the unit of compression: when reading, each page can be decompressed independently.
A block is composed of pages, and a page is the smallest unit that must be read fully to access a single record.
If this value is too small, the compression will deteriorate.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.compression`
**Description:** The compression algorithm used to compress pages. This property supersedes `mapred.output.compress*`.
It can be `uncompressed`, `snappy`, `gzip`, `lzo`, `brotli`, `lz4`, `zstd` or `lz4_raw`.
If `parquet.compression` is not set, the following properties are checked:
- `mapred.output.compress=true`
- `mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec`

Note that custom codecs are explicitly disallowed; only one of Snappy, GZip, LZO, LZ4, Brotli or ZSTD is accepted.
**Default value:** `uncompressed`
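**Example:**
A sketch using the programmatic setter, reusing `writeJob` from the first example:
```java
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Equivalent to conf.set("parquet.compression", "snappy")
ParquetOutputFormat.setCompression(writeJob, CompressionCodecName.SNAPPY);
```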
---
**Property:** `parquet.write.support.class`
**Description:** The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer.
It is usually provided by a specific ParquetOutputFormat subclass and must be a descendant of `org.apache.parquet.hadoop.api.WriteSupport`.
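For illustration, here is a minimal sketch of a custom write support whose records are plain integers written to a single `int32` column. The class name, schema and field name are invented for the example and are not part of the library:
```java
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical write support: each record is a single Integer written as 'id'
public class IntIdWriteSupport extends WriteSupport<Integer> {
  private static final MessageType SCHEMA =
      MessageTypeParser.parseMessageType("message example { required int32 id; }");

  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Declare the file schema and (empty) extra key/value metadata
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(Integer record) {
    // Emit one record as events to the record consumer
    recordConsumer.startMessage();
    recordConsumer.startField("id", 0);
    recordConsumer.addInteger(record);
    recordConsumer.endField("id", 0);
    recordConsumer.endMessage();
  }
}
```
It would then be registered on the job with `ParquetOutputFormat.setWriteSupportClass(writeJob, IntIdWriteSupport.class);`.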
---
**Property:** `parquet.enable.dictionary`
**Description:** Whether to enable dictionary encoding. If it is `true`, dictionary encoding is enabled for all columns; if it is `false`, it is disabled for all columns.
It is also possible to enable or disable the encoding for individual columns by appending `#` and the column path to the property name.
Note that all configurations of this property are combined (see the following example).
**Default value:** `true`
**Example:**
```java
// Enable dictionary encoding for all columns
conf.setBoolean("parquet.enable.dictionary", true);
// Disable dictionary encoding for the column 'column.path'
conf.setBoolean("parquet.enable.dictionary#column.path", false);
// Resulting configuration: dictionary encoding enabled for all columns except 'column.path'
```
---
**Property:** `parquet.dictionary.page.size`
**Description:** The dictionary page size works like the page size, but for dictionaries.
There is one dictionary page per column per row group when dictionary encoding is used.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.validation`
**Description:** Whether to turn on validation using the schema.
**Default value:** `false`
---
**Property:** `parquet.writer.version`
**Description:** The writer version. It can be either `PARQUET_1_0` or `PARQUET_2_0`.
`PARQUET_1_0` and `PARQUET_2_0` refer to `DataPageHeaderV1` and `DataPageHeaderV2`, respectively.
V2 pages store the repetition and definition levels uncompressed, while v1 pages compress the levels together with the data.
For more details, see the [thrift definition](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
**Default value:** `PARQUET_1_0`
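**Example:**
A sketch reusing `conf` from the first example:
```java
// Write v2 data pages instead of the default v1
conf.set("parquet.writer.version", "PARQUET_2_0");
```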
---
**Property:** `parquet.memory.pool.ratio`
**Description:** The memory manager balances the allocation sizes of the Parquet writers by resizing them evenly.
If the sum of the writers' allocation sizes is less than the total memory pool, the memory manager keeps the original values.
If the sum exceeds the pool, it decreases each writer's allocation size by this ratio.
This property must be between 0 and 1.
**Default value:** `0.95`
---
**Property:** `parquet.memory.min.chunk.size`
**Description:** The minimum allocation size in bytes per Parquet writer. If the allocation size falls below this minimum, the memory manager fails with an exception.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.writer.max-padding`
**Description:** The maximum size in bytes allowed as padding to align row groups. This is also the minimum size of a row group.
**Default value:** `8388608` (8 MB)
---
**Property:** `parquet.page.size.row.check.min`
**Description:** The frequency of checks of the page size limit will be between
`parquet.page.size.row.check.min` and `parquet.page.size.row.check.max`.
If the frequency is high, the page size will be accurate.
If the frequency is low, the performance will be better.
**Default value:** `100`
---
**Property:** `parquet.page.size.row.check.max`
**Description:** The frequency of checks of the page size limit will be between
`parquet.page.size.row.check.min` and `parquet.page.size.row.check.max`.
If the frequency is high, the page size will be accurate.
If the frequency is low, the performance will be better.
**Default value:** `10000`
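**Example:**
A sketch tuning both bounds with illustrative values, reusing `conf` from the first example:
```java
// Never check more often than every 50 rows...
conf.setInt("parquet.page.size.row.check.min", 50);
// ...and never wait more than 1000 rows between checks
conf.setInt("parquet.page.size.row.check.max", 1000);
```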
---
**Property:** `parquet.page.value.count.threshold`
**Description:** The value count threshold within a Parquet page used on each page check.
**Default value:** `Integer.MAX_VALUE / 2`
---
**Property:** `parquet.page.size.check.estimate`
**Description:** If it is `true`, the column writer estimates how many values it can write before the next page size check, based on the sizes of the previously written pages; otherwise the size is checked at a fixed interval.
**Default value:** `true`