所需积分/C币:50 2018-11-19 12:03:07 3.14MB PDF
收藏 收藏

aTe M BDeME Person 剂am See the difference? Input/output 1. For an analytical query, only a small number of table columns need to be read. In a column-oriented database, you can read just the data you need. For example, if you need 5 columns out of 100, you can expect a 20-fold reduction in I/O 2. Since data is read in packets, it is easier to compress Data in columns is also easier to compress. This further reduces the yO volume 3. Due to the reduced l/o, more data fits in the system cache For example, the query"count the number of records for each advertising platform"requires reading one" advertising platform ID column, which takes up 1 byte uncompressed. If most of the traffic was not from advertising platforms, you can expect at least 10-fold compression of this column. When using a quick compression algorithm, data decompression is possible at a speed of at least several gigabytes of uncompressed data per second. In other words, this query can be processed at a speed of approximately several billion rows per second on a single server. This speed is actually achieved in practice. sc1⊥ chouse-c1ient clickHouse client version 0.0.52053 Connecting to localhost: 9000 Connected to clickHouse server version 0.0.52053 : SELECT CounterID, count()FROM hits GrOUP BY CounterID ORDER By count() DESC LIMIT 20 CounterS FroM hits GROUP BY C。 untold ORDER BY count( DESC L工MIT ounterID-T 114208|56057344 11508 3228|44658301 3823042045932 14526342042158 91244|38297270 154139|26647572 150748|24112755 242232|21302571 338158|13507087 62180|12229491 82264|12187441 232261|12148031 14627211438516 168777|11403636 4120072|11227824 1093880810519739 74088 9047015 115079|9837972 337234 8205961 20 rows in set. Elapsed: 0.153 sec. Processed 1.00 billion rows, 4.00 GB(6.53 billion rows/s, 26.10 GB/s) :) Since executing a query requires processing a large number of rows, it helps to dispatch all operations for entire vectors istead of for separate rows, or to implement the query engine so that there is almost no dispatching cost. If you dont do this, with any half-decent disk subsystem, the query interpreter inevitably stalls the CPU. It makes sense to both store data in columns and process it, when possible, by columns There are two ways to do this 1. A vector engine. All operations are written for vectors, instead of for separate values. This means you don 't need to call operations very often, and dispatching costs are negligible. Operation code contains an optimized internal cycle 2. Code generation. The code generated for the query has all the indirect calls in it This is not done in "normal"databases, because it doesn't make sense when running simple queries. However, there are exceptions. For example, MemSQL uses code generation to reduce latency when processing SQL queries. (For comparison, analytical DBMSS require optimization of throughput, not latency. Note that for CPU efficiency, the query language must be declarative sql or MDX), or at least a vector (, K). The query should only contain implicit loops, allowing for optimization Distinctive features of clickhouse True column-Oriented dbms In a true column-oriented DBMS, no extra data is stored with the values. Among other things, this means that constant length values must be supported, to avoid storing their length "number"next to the values. As an example, a billion UInt8- type values should actually consume around 1 GB uncompressed, or this will strongly affect the CPu use. It is very important to store data compactly(without any"garbage")even when uncompressed, since the speed of decompression( CPU usage depends mainly on the volume of uncompressed data This is worth noting because there are systems that can store values of different columns separately, but that can't effectively process analytical queries due to their optimization for other scenarios. EXamples are HBase, Big I able, Cassandra, and Hyper I able. In these systems, you will get throughput around a hundred thousand rows per second, but not hundreds of millions of rows per second It's also worth noting that ClickHouse is a database management system, not a single database. ClickHouse allows creating tables and databases in runtime, loading data, and running queries without reconfiguring and restarting the server Data Compression Some column-oriented DBMSs(InfiniDB CE and MonetdB)do not use data compression. However, data compression does play a key role in achieving excellent performance Disk Storage of data Mving a data physically sorted by primary key makes it possible to extract data for it's specific values or value ranges with low latency, less than few dozen milliseconds. any column-oriented DBMSs (such as SAP hANa and Google power drill) can only work in RAM. This approach encourages the allocation of a larger hardware budget than is actually necessary for real-time analysis ClickHouse is designed to work on regular hard drives, which means the cost per gB of data storage is low, but SSD and additiona Ram are also fully used if available Parallel Processing on Multiple Cores Large queries are parallelized in a natural way, taking all the necessary resources that available on the current server Distributed Processing on Multiple Servers Almost none of the columnar DBMSs mentioned above have support for distributed query processing. In ClickHouse, data can reside on different shards. Each shard can be a group of replicas that are used for fault tolerance. The query is processed on all the shards in parallel. This is transparent for the user SQL Support ClickHouse supports a declarative query language based on SQl that is identical to the sQl standard in many cases Supported queries include GROUP BY, ORDER BY, subqueries in FROM, IN, and joIN clauses, and scalar subqueries Dependent subqueries and window functions are not supported Vector Engine Data is not only stored by columns, but is processed by vectors(parts of columns). This allows us to achieve high CPU efticiency Real-time Data Updates ClickHouse supports tables with a primary key. In order to quickly perform queries on the range of the primary key, the data is sorted incrementally using the merge tree. Due to this, data can continually be added to the table. No locks are taken when new data is ingested Index Having a data physically sorted by primary key makes it pos sible to extract data for it's specific values or value ranges with low latency, less than few dozen milliseconds Suitable for Online Queries Low latency means that queries can be processed without delay and without trying to prepare answer in advance, right at the same moment while user interface page is loading. In other words, online Support for Approximated Calculations ClickHouse provides various ways to trade accuracy for performance 1. Aggregate functions for approximated calculation of the number of distinct values, medians, and quantiles 2. Running a query based on a part(sample)of data and getting an approximated result. In this case, proportionally less data is retrieved from the disk 3. Running an aggregation for a limited number of random keys, instead of for all keys. Under certain conditions for key distribution in the data, this provides a reasonably accurate result while using fewer resources Data replication and data integrity support Uses asynchronous multimaster replication. After being written to any available replica, data is distributed to all the remaining replicas in the background. The system maintains identical data on different replicas. Recovery after most failures is performed automatically, and in complex cases- semi-automatically For more information, see the section Data replication #table_engines-replication] ClickHouse Features that Can be Considered Disadvantages 1. No full-fledged transactions 2. Lack of ability to modify or delete already inserted data with high rate and low latency. There are batch deletes and updatesavailabletocleanupormodifydata,forexampletocomplywithGDPR[https://gdpr-infoeu 3. The sparse index makes Click House not really suitable for point queries retrieving sing le rows by their keys Performance According to internal testing results at Yandex, ClickHause shows the best performance (both the highest throughput for long queries and the lowest latency on short queries) for comparable operating scenarios among systems of its class that wereavailablefortestingYoucanviewthetestresultsonaseparatepage] This has also been confirmed by numerous independent benchmarks. They are not dif ficult to find using an internet search, oryoucanseeoursmalcollectionofrelatedlinks[ Throughput for a single Large Query Throughput can be measured in rows per second or in megabytes per second. If the data is placed in the page cache, a query that is not too complex is processed on modern hardware at a speed of approximately 2-10 GB/s of uncompressed data on a single server (for the simplest cases, the speed may reach 30 GB/s). If data is not placed in the page cache, the speed depends on the disk subsystem and the data compression rate. For example, if the disk subsystem allows reading data at 400 MB/, and the data compression rate is 3 the speed will be around 1.2 GB/s To get the speed in rows per second, divide the speed in bytes per second by the total size of the columns used in the query For example, if 10 bytes of columns are extracted, the speed will be around 100-200 million rows per second The processing speed increases almost linearly for distributed processing, but only if the number of rows resulting from aggregation or sorting is not too large Latency When Processing Short Queries If a query uses a primary key and does not select too many rows to process (hundreds of thousands), and does not use too many columns, we can expect less than 50 milliseconds of latency (single digits of milliseconds in the best case)if data is placed in the page cache. Otherwise, latency is calculated from the number of seeks. If you use rotating drives, for a system that is not overloaded, the latency is calculated by this formula: seek time(10 ms)*number of columns queried*number of data parts. Throughput When Processing a Large Quantity of Short Queries Under the same conditions, Click House can handle several hundred queries per second on a single server(up to several thousand in the best case). since this scenario is not typical for analytical DBMSs, we recommend expecting a maximum of 100 queries per second Performance When Inserting data We recommend inserting data in packets of at least 1000 rows, or no more than a single request per second. When inserting to a Merge Tree table from a tab-separated dump, the insertion speed will be from 50 to 200 MB/ S If the inserted rows are around 1 Kb in size, the speed will be from 50, 000 to 200, 000 rows per second. If the rows are small, the performance will be higher in rows per second (on Banner System data ->500, 000 rows per second; on Graphite data-> 1.000,000 rows per second). To improve performance, you can make multiple INSeRT queries in parallel, and performance will increase linearly Yandex Metrica Use case ClickhousewasoriginallydevelopedtopowerYandex.Metrica[,thesecondlargestweb analyticsplatformintheworld[,andcontinuestobethecore component of this system. With more than 13 trillion records in the database and more than 20 billion events daily. ClickHouse allows generating custom reports on the fly directly from non-aggregated data. This article briefly covers the goals of ClickHouse in the early stages of its development Yandex. Metrica builds customized reports on the fly based on hits and sessions, with arbitrary segments defined by the user This often requires building complex aggregates, such as the number of unique users. New data for building a report is received in real time As of April 2014, Yandex. Metrica was tracking about 12 billion events(page views and clicks) daily. All these events must be stored in order to build custom reports. A single query may require scanning millions of rows within a few hundred milliseconds, or hundreds of millions of rows in just a few seconds Usage in Yandex. Metrica and other Yandex Services ClickHouse is used for multiple purposes in Yandex. Metrica. Its main task is to build reports in online mode using non aggregated data. It uses a cluster of 374 servers, which store over 20. 3 trillion rows in the database. The volume of compressed data, without counting duplication and replication, is about 2 PB. The volume of uncompressed data(in tsv format) would be approximately 17 PB ClickHouse is a so used for Storing data for Session Replay from Yandex. Metrica Processing intermediate data Building global reports with Analytics. Running queries for debugging the Yandex. Metrica engine Analyzing logs from the aPI and the user interface ClickHouse has at least a dozen installations in other Yandex services: in search verticals, Market, Direct, business analytics mobile development, AdFox, personal services, and others Aggregated and Non-aggregated data There is a popular opinion that in order to effectively calculate statistics, you must aggregate data, since this reduces the volume of data But data aggregation is a very limited solution, for the following reasons You must have a pre-defined list of reports the user will need The user cant make custom reports When aggregating a large quantity of keys, the volume of data is not reduced, and aggregation is useless For a large number of reports, there are too many aggregation variations(combinatorial explosion When aggregating keys with high cardinality( such as URLs), the volume of data is not reduced by much(less than twofold) For this reason, the volume of data with aggregation might grow instead of shrink Users do not view all the reports we generate for them a large portion of calculations are useless The logical integrity of data may be violated for various aggregations If we do not aggregate anything and work with non-aggregated data, this might actually reduce the volume of calculations However, with aggregation, a significant part of the work is taken offline and completed relatively calmly. In contrast, online calculations require calculating as fast as possible, since the user is waiting for the result Yandex. Metrica has a specialized system for aggregating data called Metrage, which is used for the majority of reports Starting in 2009, Y andex Metrica also used a specialized olap database for non-aggregated data called oLAPServer, which was previously used for the report builder. OLAP Server worked well for non-aggregated data, but it had many restrictions that did not allow it to be used for all reports as desired. These included the lack of support for data ty pes ( only numbers), and the inability to incrementally update data in real-time (it could only be done by rewriting data daily oLAPServer is not a DBMS, but a specialized dB To remove the limitations of OLAPServer and solve the problem of working with non-aggregated data for all reports, we developed the ClickHouse DBMS Getting Started System requirements Installation from the official repository requires linux with x86_64 architecture and support for the sse 4.2 instruction set To check for sse 4.2 grep -q sse4 2 /proc/cpuinfo & echo"SSE 4.2 supported" echo"SSE 4.2 not supported We recommend using bunt or Debian The terminal must use UTF-8 encoding Forrpm-basedsystemsyoucanuse3rd-partypackages packages ClickHouse also works on FreeBSD and Mac Os X. It can be compiled for 86_64 processors without SsE 4.2 support, and for AArch64 cPus Installation For testing and development, the system can be installed on a single server or on a desktop computer stalling from Packages for Debian/Ubuntu In /etc/apt/ sources -ist (or in a separate /etc/apt/sources list. d/clickhouse list file), add the repository deb If you want to use the most recent test version, replace'stable' with'testing Then run sudo apt-key adv --keyserver keyserver ubuntu. com --recv E0C5 6BD4 optional sudo apt-get update sudo apt-get install clickhouse-client clickhouse-server Youcanalsodownloadandinstallpackagesmanuallyfromhere T ClickHouse contains access restriction settings. They are located in the 'users. xml file(next to 'config. xml ). By default, access is allowed from anywhere for the 'default'user, without a password. See user/default/networks For more information, see the section "Configuration files Installing from Sources To compile, follow the instructions: build. md You can compile packages and install them. You can also use programs without installing packages client: dbms/programs/clickhouse-client Server: dbms/programs/clickhouse-server For the server, create a catalog with data, such as /opt/clickhouse/data/default/ /opt/clickhouse/metadata/default/ (Configurable in the server config. Run 'chown' for the desired user Note the path to logs in the server config(src/dbms/programs/server/config. xml) Other Installation methods Dockerimagehttps://hub.dockercom/r/yandex/clickhouse-server/[https://hub.dockercom/r/yandex/clickhouse-server/] RpmpackagesforCentosorRhel: T] Gentoo Launch To start the server(as a daemon), run sudo service clickhouse-server start See the logs in the /var/loc/clickrouse-server/ directory If the server doesn't start, check the configurations in the file /etc/clicknouse-server/config.xml You can also launch the server from the console clickhouse-server --config-file=/etc/clickhouse-server/config. xml In this case, the log will be printed to the console, which is convenient during development. If the configuration file is in the current directory, you don' t need to specify the'--config-file' parameter. By default, it uses. / config. xml You can use the command line client to connect to the server clickhouse-client The default parameters indicate connecting with localhost: 9000 on behalf of the user 'default'without a password. The client can be used for connecting to a remote server. Example clickhouse-client --host=example. com For more information see the section " Command-ine client Checking the system milovidovehostname:-/work/metrica/src/dbms/src/clients ./clickhause-client ClickHlouse client version 0.0.18749 Connecting to localhost: 9000 Connected to clickhouse server version 0.0.18749 : SELECT I SELECT 1 1 1 rows in set. Elapsed: 0. 003 sec Congratulations, the system works To continue experimenting, you can try to download from the test data sets On Time Downloading data

试读 127P clickhouse
限时抽奖 低至0.43元/次
身份认证后 购VIP低至7折
  • 分享精英

关注 私信
clickhouse 50积分/C币 立即下载

试读结束, 可继续阅读

50积分/C币 立即下载