<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Hadoop integration
**Todo:** Add a description
## Properties
It's possible to configure the [ParquetInputFormat](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java) / [ParquetOutputFormat](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java) with Hadoop config or programmatically with setters.
**Example:**
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.hadoop.ParquetOutputFormat;

// Set a property through the Hadoop configuration...
Configuration conf = new Configuration();
conf.set("parquet.page.size", "128");

// ...or programmatically with the setters
Job writeJob = Job.getInstance(conf);
ParquetOutputFormat.setBlockSize(writeJob, 1024);
```
## Class: ParquetOutputFormat
**Property:** `parquet.summary.metadata.level`
**Description:** Write summary files in the same directory as the Parquet files.
If this property is set to `all`, both a summary file with row group info (`_metadata`) and a summary file without row group info (`_common_metadata`) are written.
If it is `common_only`, only the summary file without row group info (`_common_metadata`) is written.
If it is `none`, no summary files are written.
**Default value:** `all`
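**Example:**
A minimal sketch, reusing `conf` from the first example:
```java
// Keep only the _common_metadata summary file (valid levels: all, common_only, none)
conf.set("parquet.summary.metadata.level", "common_only");
```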
---
**Property:** `parquet.enable.summary-metadata`
**Description:** This property is deprecated; use `parquet.summary.metadata.level` instead.
If it is `true`, it behaves like `parquet.summary.metadata.level` set to `all`; if it is `false`, like `none`.
**Default value:** `true`
---
**Property:** `parquet.block.size`
**Description:** The block size in bytes. The effect of this property depends on the file system:
- If the file system (FS) supports blocks, as HDFS does, the block size will be the maximum of the default FS block size and this property, and the row group size will be equal to this property:
  - `block_size = max(default_fs_block_size, parquet.block.size)`
  - `row_group_size = parquet.block.size`
- If the file system doesn't support blocks, this property defines the row group size.

Note that larger row groups improve I/O when reading but consume more memory when writing.
**Default value:** `134217728` (128 MB)
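**Example:**
A sketch reusing `writeJob` from the first example; on a block-based FS the effective FS block size still follows the `max` rule above:
```java
// Request 256 MB row groups
ParquetOutputFormat.setBlockSize(writeJob, 256 * 1024 * 1024);
```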
---
**Property:** `parquet.page.size`
**Description:** The page size in bytes. Pages are the unit of compression: when reading, each page can be decompressed independently.
A block is composed of pages, and a page is the smallest unit that must be read fully to access a single record.
If this value is too small, the compression will deteriorate.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.compression`
**Description:** The compression algorithm used to compress pages. This property supersedes `mapred.output.compress*`.
It can be `uncompressed`, `snappy`, `gzip`, `lzo`, `brotli`, `lz4`, `zstd` or `lz4_raw`.
If `parquet.compression` is not set, the following properties are checked:
- `mapred.output.compress=true`
- `mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec`

Note that custom codecs are explicitly disallowed; only one of Snappy, GZip, LZO, LZ4, Brotli or ZSTD is accepted.
**Default value:** `uncompressed`
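**Example:**
A sketch using the programmatic setter, reusing `writeJob` from the first example:
```java
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Equivalent to conf.set("parquet.compression", "snappy")
ParquetOutputFormat.setCompression(writeJob, CompressionCodecName.SNAPPY);
```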
---
**Property:** `parquet.write.support.class`
**Description:** The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer.
It is usually provided by a specific ParquetOutputFormat subclass and must be a descendant of `org.apache.parquet.hadoop.api.WriteSupport`.
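For illustration, here is a minimal sketch of a custom write support whose records are plain integers written to a single `int32` column. The class name, schema and field name are invented for the example and are not part of the library:
```java
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Hypothetical write support: each record is a single Integer written as 'id'
public class IntIdWriteSupport extends WriteSupport<Integer> {
  private static final MessageType SCHEMA =
      MessageTypeParser.parseMessageType("message example { required int32 id; }");

  private RecordConsumer recordConsumer;

  @Override
  public WriteContext init(Configuration configuration) {
    // Declare the file schema and (empty) extra key/value metadata
    return new WriteContext(SCHEMA, new HashMap<String, String>());
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    this.recordConsumer = recordConsumer;
  }

  @Override
  public void write(Integer record) {
    // Emit one record as events to the record consumer
    recordConsumer.startMessage();
    recordConsumer.startField("id", 0);
    recordConsumer.addInteger(record);
    recordConsumer.endField("id", 0);
    recordConsumer.endMessage();
  }
}
```
It would then be registered on the job with `ParquetOutputFormat.setWriteSupportClass(writeJob, IntIdWriteSupport.class);`.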
---
**Property:** `parquet.enable.dictionary`
**Description:** Whether to enable dictionary encoding. If it is `true`, dictionary encoding is enabled for all columns; if it is `false`, it is disabled for all columns.
It is also possible to enable or disable the encoding for individual columns by appending `#` and the column path to the property name.
Note that all configurations of this property are combined (see the following example).
**Default value:** `true`
**Example:**
```java
// Enable dictionary encoding for all columns
conf.setBoolean("parquet.enable.dictionary", true);
// Disable dictionary encoding for the column 'column.path'
conf.setBoolean("parquet.enable.dictionary#column.path", false);
// Resulting configuration: dictionary encoding enabled for all columns except 'column.path'
```
---
**Property:** `parquet.dictionary.page.size`
**Description:** The dictionary page size works like the page size, but for dictionaries.
There is one dictionary page per column per row group when dictionary encoding is used.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.validation`
**Description:** Whether to turn on validation using the schema.
**Default value:** `false`
---
**Property:** `parquet.writer.version`
**Description:** The writer version. It can be either `PARQUET_1_0` or `PARQUET_2_0`.
`PARQUET_1_0` and `PARQUET_2_0` refer to `DataPageHeaderV1` and `DataPageHeaderV2`, respectively.
V2 pages store the repetition and definition levels uncompressed, while v1 pages compress the levels together with the data.
For more details, see the [thrift definition](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
**Default value:** `PARQUET_1_0`
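**Example:**
A sketch reusing `conf` from the first example:
```java
// Write v2 data pages instead of the default v1
conf.set("parquet.writer.version", "PARQUET_2_0");
```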
---
**Property:** `parquet.memory.pool.ratio`
**Description:** The memory manager balances the allocation sizes of the Parquet writers by resizing them evenly.
If the sum of the writers' allocation sizes is less than the total memory pool, the memory manager keeps the original values.
If the sum exceeds the pool, it decreases each writer's allocation size by this ratio.
This property must be between 0 and 1.
**Default value:** `0.95`
---
**Property:** `parquet.memory.min.chunk.size`
**Description:** The minimum allocation size in bytes per Parquet writer. If the allocation size falls below this minimum, the memory manager fails with an exception.
**Default value:** `1048576` (1 MB)
---
**Property:** `parquet.writer.max-padding`
**Description:** The maximum size in bytes allowed as padding to align row groups. This is also the minimum size of a row group.
**Default value:** `8388608` (8 MB)
---
**Property:** `parquet.page.size.row.check.min`
**Description:** The frequency of checks of the page size limit will be between
`parquet.page.size.row.check.min` and `parquet.page.size.row.check.max`.
If the frequency is high, the page size will be accurate.
If the frequency is low, the performance will be better.
**Default value:** `100`
---
**Property:** `parquet.page.size.row.check.max`
**Description:** The frequency of checks of the page size limit will be between
`parquet.page.size.row.check.min` and `parquet.page.size.row.check.max`.
If the frequency is high, the page size will be accurate.
If the frequency is low, the performance will be better.
**Default value:** `10000`
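**Example:**
A sketch tuning both bounds with illustrative values, reusing `conf` from the first example:
```java
// Never check more often than every 50 rows...
conf.setInt("parquet.page.size.row.check.min", 50);
// ...and never wait more than 1000 rows between checks
conf.setInt("parquet.page.size.row.check.max", 1000);
```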
---
**Property:** `parquet.page.value.count.threshold`
**Description:** The value count threshold within a Parquet page used on each page check.
**Default value:** `Integer.MAX_VALUE / 2`
---
**Property:** `parquet.page.size.check.estimate`
**Description:** If it is `true`, the column writer estimates how many values it can write before the next page size check, based on the sizes of the previously written pages; otherwise the size is checked at a fixed interval.
**Default value:** `true`