Added Docusaurus config and /docs folder

rfraposa committed Mar 28, 2022
1 parent 1fc1078 commit 56030b8
Showing 200 changed files with 30,656 additions and 0 deletions.
39 changes: 39 additions & 0 deletions .github/workflows/deploy-github.yml
@@ -0,0 +1,39 @@
name: deploy-docs-website

on:
  push:
    branches: [docs]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      # Checks out your repository under $GITHUB_WORKSPACE, so your job can access it
      - name: Check out repo
        uses: actions/checkout@v2
      # Node is required for npm
      - name: Set up Node
        uses: actions/setup-node@v2
        with:
          node-version: '14'
      # Download the reference docs from ClickHouse/ClickHouse
      - name: Download Reference Doc
        run: curl https://codeload.github.com/ClickHouse/ClickHouse/tar.gz/master | tar -xz -C docs/ --strip=2 "ClickHouse-master/docs/en"
      # Install and build Docusaurus website
      - name: Build Docusaurus website
        run: |
          npm install
          npm run build
      - name: Deploy to GitHub Pages
        if: success()
        uses: crazy-max/ghaction-github-pages@v2
        with:
          target_branch: gh-pages
          build_dir: build
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
node_modules
.docusaurus
build
docs/en/
**/.DS_Store
backup
3 changes: 3 additions & 0 deletions babel.config.js
@@ -0,0 +1,3 @@
module.exports = {
  presets: [require.resolve('@docusaurus/core/lib/babel/preset')],
};
4 changes: 4 additions & 0 deletions docs/about-us/_category_.yml
@@ -0,0 +1,4 @@
position: 100
label: 'About Us'
collapsible: true
collapsed: true
199 changes: 199 additions & 0 deletions docs/about-us/adopters.md


96 changes: 96 additions & 0 deletions docs/about-us/distinctive-features.md
@@ -0,0 +1,96 @@
---
sidebar_label: Distinctive Features
sidebar_position: 20
---

# Distinctive Features of ClickHouse {#distinctive-features-of-clickhouse}

## True Column-Oriented Database Management System {#true-column-oriented-dbms}

In a real column-oriented DBMS, no extra data is stored with the values. Among other things, this means that constant-length values must be supported, to avoid storing their length “number” next to the values. For example, a billion UInt8-type values should consume around 1 GB uncompressed; otherwise, this strongly affects CPU use. It is essential to store data compactly (without any “garbage”) even when uncompressed, since the speed of decompression (CPU usage) depends mainly on the volume of uncompressed data.
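
As a rough sanity check, here is a minimal sketch (the table name is illustrative, not from this commit) that materializes a billion UInt8 values and reads the on-disk sizes back from the `system.parts` table: 10⁹ rows × 1 byte should come out near 1 GB uncompressed.

```sql
-- A billion UInt8 values: 10^9 rows * 1 byte = ~1 GB uncompressed.
CREATE TABLE t_uint8 (x UInt8) ENGINE = MergeTree ORDER BY tuple();

INSERT INTO t_uint8 SELECT 1 FROM numbers(1000000000);

SELECT
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed
FROM system.parts
WHERE table = 't_uint8' AND active;
```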

This is worth noting because there are systems that can store values of different columns separately but can’t effectively process analytical queries due to their optimization for other scenarios. Examples are HBase, BigTable, Cassandra, and HyperTable. You would get throughput of around a hundred thousand rows per second in these systems, but not hundreds of millions of rows per second.

It’s also worth noting that ClickHouse is a database management system, not a single database. ClickHouse allows creating tables and databases at runtime, loading data, and running queries without reconfiguring or restarting the server.

## Data Compression {#data-compression}

Some column-oriented DBMSs do not use data compression. However, data compression does play a key role in achieving excellent performance.

In addition to efficient general-purpose compression codecs with different trade-offs between disk space and CPU consumption, ClickHouse provides [specialized codecs](../sql-reference/statements/create/table.md#create-query-specialized-codecs) for specific kinds of data, which allow ClickHouse to compete with and outperform more niche databases, like time-series ones.
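
For instance, a time-series table might pair delta-based codecs with a general-purpose one. A minimal sketch (table and column names are illustrative):

```sql
CREATE TABLE sensor_readings
(
    ts    DateTime CODEC(DoubleDelta, LZ4), -- delta-of-delta suits monotonic timestamps
    value Float64  CODEC(Gorilla)           -- Gorilla suits slowly changing gauges
)
ENGINE = MergeTree
ORDER BY ts;
```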

## Disk Storage of Data {#disk-storage-of-data}

Keeping data physically sorted by primary key makes it possible to extract data for specific values or value ranges with low latency, less than a few dozen milliseconds. Some column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in RAM. This approach encourages the allocation of a larger hardware budget than is necessary for real-time analysis.

ClickHouse is designed to work on regular hard drives, which means the cost per GB of data storage is low, but SSD and additional RAM are also fully used if available.

## Parallel Processing on Multiple Cores {#parallel-processing-on-multiple-cores}

Large queries are parallelized naturally, taking all the necessary resources available on the current server.

## Distributed Processing on Multiple Servers {#distributed-processing-on-multiple-servers}

Almost none of the columnar DBMSs mentioned above have support for distributed query processing.

In ClickHouse, data can reside on different shards. Each shard can be a group of replicas used for fault tolerance. All shards are used to run a query in parallel, transparently for the user.
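
A minimal sketch of this layout, assuming a cluster named `my_cluster` is defined in the server configuration (all names here are illustrative): a local table on each node, plus a `Distributed` umbrella table that fans queries out to every shard.

```sql
-- On each node: the shard-local storage table.
CREATE TABLE hits_local
(
    EventDate Date,
    UserID    UInt64
)
ENGINE = MergeTree
ORDER BY (EventDate, UserID);

-- The umbrella table; queries against it run on all shards in parallel.
CREATE TABLE hits_all AS hits_local
ENGINE = Distributed('my_cluster', 'default', 'hits_local', rand());
```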

## SQL Support {#sql-support}

ClickHouse supports a [declarative query language based on SQL](../sql-reference/index.md) that is identical to the ANSI SQL standard in [many cases](../sql-reference/ansi.md).

Supported queries include [GROUP BY](../sql-reference/statements/select/group-by.md), [ORDER BY](../sql-reference/statements/select/order-by.md), subqueries in [FROM](../sql-reference/statements/select/from.md), [JOIN](../sql-reference/statements/select/join.md) clause, [IN](../sql-reference/operators/in.md) operator, [window functions](../sql-reference/window-functions/index.md) and scalar subqueries.
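
A small query combining several of these features, using the illustrative `hits_local` table from the sketch above: a subquery in FROM, GROUP BY, a window function, and ORDER BY.

```sql
SELECT
    UserID,
    hits,
    rank() OVER (ORDER BY hits DESC) AS hit_rank  -- window function
FROM
(
    SELECT UserID, count() AS hits  -- subquery in FROM with GROUP BY
    FROM hits_local
    GROUP BY UserID
)
ORDER BY hit_rank
LIMIT 10;
```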

Correlated (dependent) subqueries are not supported at the time of writing but might become available in the future.

## Vector Computation Engine {#vector-engine}

Data is not only stored by columns but is processed by vectors (parts of columns), which allows achieving high CPU efficiency.

## Real-time Data Updates {#real-time-data-updates}

ClickHouse supports tables with a primary key. To quickly perform queries on the range of the primary key, the data is sorted incrementally using the merge tree. Due to this, data can continually be added to the table. No locks are taken when new data is ingested.
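
A minimal sketch (illustrative names): the ORDER BY key of a MergeTree table keeps data incrementally sorted, and INSERTs simply add new parts without locking readers.

```sql
CREATE TABLE events
(
    ts   DateTime,
    id   UInt64,
    body String
)
ENGINE = MergeTree
ORDER BY (id, ts);  -- data is kept sorted by this key as parts are merged

-- New data can be appended continuously; no table lock is taken.
INSERT INTO events VALUES (now(), 1, 'first'), (now(), 2, 'second');
```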

## Primary Index {#primary-index}

Having data physically sorted by primary key makes it possible to extract data for specific values or value ranges with low latency, less than a few dozen milliseconds.

## Secondary Indexes {#secondary-indexes}

Unlike in other database management systems, secondary indexes in ClickHouse do not point to specific rows or row ranges. Instead, they let the database know in advance that all rows in some data parts would not match the query's filtering conditions, so it does not read those parts at all; hence they are called [data skipping indexes](../engines/table-engines/mergetree-family/mergetree.md#table_engine-mergetree-data_skipping-indexes).
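
As a sketch, here is a simple `minmax` skipping index on the illustrative `events` table above; it records the value range per block of granules so non-matching parts can be skipped entirely.

```sql
ALTER TABLE events ADD INDEX id_minmax id TYPE minmax GRANULARITY 4;
ALTER TABLE events MATERIALIZE INDEX id_minmax;  -- build the index for existing parts
```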

## Suitable for Online Queries {#suitable-for-online-queries}

Most OLAP database management systems do not aim for online queries with sub-second latencies. In alternative systems, report-building time of tens of seconds or even minutes is often considered acceptable. Sometimes it takes even longer, which forces reports to be prepared offline (in advance, or by responding with “come back later”).

In ClickHouse, low latency means that queries can be processed without delay and without trying to prepare an answer in advance, right at the moment the user interface page is loading. In other words, online.

## Support for Approximated Calculations {#support-for-approximated-calculations}

ClickHouse provides various ways to trade accuracy for performance (a short example follows the list):

1. Aggregate functions for approximated calculation of the number of distinct values, medians, and quantiles.
2. Running a query based on a part (sample) of data and getting an approximated result. In this case, proportionally less data is retrieved from the disk.
3. Running an aggregation for a limited number of random keys, instead of for all keys. Under certain conditions for key distribution in the data, this provides a reasonably accurate result while using fewer resources.
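
A sketch of the first two techniques, using the illustrative `hits_local` table from above (note that sampling additionally requires a `SAMPLE BY` clause in the table definition):

```sql
-- 1. Approximate distinct count (the exact uniqExact would use more memory and CPU).
SELECT uniq(UserID) AS approx_users FROM hits_local;

-- 2. Query a 10% sample of the data and scale the result back up.
SELECT count() * 10 AS estimated_total
FROM hits_local
SAMPLE 1 / 10;
```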

## Adaptive Join Algorithm {#adaptive-join-algorithm}

ClickHouse adaptively chooses how to [JOIN](../sql-reference/statements/select/join.md) multiple tables, preferring the hash-join algorithm and falling back to the merge-join algorithm if there is more than one large table.
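
The fallback can also be steered explicitly via the `join_algorithm` setting. A hedged sketch (the `users` table is hypothetical):

```sql
-- 'auto' starts with an in-memory hash join and falls back to a
-- merge join if the right-hand table turns out to be too large.
SET join_algorithm = 'auto';

SELECT e.id, u.name
FROM events AS e
INNER JOIN users AS u ON u.id = e.id;
```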

## Data Replication and Data Integrity Support {#data-replication-and-data-integrity-support}

ClickHouse uses asynchronous multi-master replication. After being written to any available replica, all the remaining replicas retrieve their copy in the background. The system maintains identical data on different replicas. Recovery after most failures is performed automatically, or semi-automatically in complex cases.
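
A minimal sketch of a replicated table, assuming the usual `{shard}` and `{replica}` macros are defined in each server's configuration (names illustrative):

```sql
CREATE TABLE events_replicated
(
    ts DateTime,
    id UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (id, ts);
```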

For more information, see the section [Data replication](../engines/table-engines/mergetree-family/replication.md).

## Role-Based Access Control {#role-based-access-control}

ClickHouse implements user account management using SQL queries and allows for [role-based access control configuration](../operations/access-rights.md) similar to what can be found in the ANSI SQL standard and popular relational database management systems.
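
For example (illustrative names), a read-only analyst role can be created and granted entirely in SQL:

```sql
CREATE ROLE analyst;
GRANT SELECT ON default.* TO analyst;

CREATE USER alice IDENTIFIED WITH sha256_password BY 'change-me';
GRANT analyst TO alice;
SET DEFAULT ROLE analyst TO alice;
```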

## Features that Can Be Considered Disadvantages {#clickhouse-features-that-can-be-considered-disadvantages}

1. No full-fledged transactions.
2. Lack of ability to modify or delete already-inserted data at a high rate and with low latency. Batch deletes and updates are available to clean up or modify data, for example, to comply with [GDPR](https://gdpr-info.eu) (see the sketch after this list).
3. The sparse index makes ClickHouse not so efficient for point queries retrieving single rows by their keys.
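
Regarding item 2: the batch deletes and updates are issued as asynchronous `ALTER` mutations. A sketch against the illustrative `events` table from above:

```sql
-- Mutations rewrite the affected data parts in the background; they suit
-- occasional batch cleanup, not frequent low-latency point updates.
ALTER TABLE events DELETE WHERE id = 42;
ALTER TABLE events UPDATE body = '' WHERE id = 42;
```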

[Original article](https://clickhouse.com/docs/en/introduction/distinctive-features/) <!--hide-->
54 changes: 54 additions & 0 deletions docs/about-us/history.md
@@ -0,0 +1,54 @@
---
sidebar_label: ClickHouse History
sidebar_position: 10
---

# ClickHouse History {#clickhouse-history}

ClickHouse was initially developed to power [Yandex.Metrica](https://metrica.yandex.com/), [the second-largest web analytics platform in the world](http://w3techs.com/technologies/overview/traffic_analysis/all), and continues to be the core component of this system. With more than 13 trillion records in the database and more than 20 billion events daily, ClickHouse allows generating custom reports on the fly directly from non-aggregated data. This article briefly covers the goals of ClickHouse in the early stages of its development.

Yandex.Metrica builds customized reports on the fly based on hits and sessions, with arbitrary segments defined by the user. Doing so often requires building complex aggregates, such as the number of unique users. New data for building a report arrives in real-time.

As of April 2014, Yandex.Metrica was tracking about 12 billion events (page views and clicks) daily. All these events must be stored to build custom reports. A single query may require scanning millions of rows within a few hundred milliseconds, or hundreds of millions of rows in just a few seconds.

## Usage in Yandex.Metrica and Other Yandex Services {#usage-in-yandex-metrica-and-other-yandex-services}

ClickHouse serves multiple purposes in Yandex.Metrica.
Its main task is to build reports in online mode using non-aggregated data. It uses a cluster of 374 servers, which store over 20.3 trillion rows in the database. The volume of compressed data is about 2 PB, without accounting for duplicates and replicas. The volume of uncompressed data (in TSV format) would be approximately 17 PB.

ClickHouse also plays a key role in the following processes:

- Storing data for Session Replay from Yandex.Metrica.
- Processing intermediate data.
- Building global reports with Analytics.
- Running queries for debugging the Yandex.Metrica engine.
- Analyzing logs from the API and the user interface.

Nowadays, there are dozens of ClickHouse installations in other Yandex services and departments: search verticals, e-commerce, advertising, business analytics, mobile development, personal services, and others.

## Aggregated and Non-aggregated Data {#aggregated-and-non-aggregated-data}

There is a widespread opinion that to calculate statistics effectively, you must aggregate data since this reduces the volume of data.

But data aggregation comes with a lot of limitations:

- You must have a pre-defined list of required reports.
- The user can’t make custom reports.
- When aggregating over a large number of distinct keys, the data volume is barely reduced, so aggregation is useless.
- For a large number of reports, there are too many aggregation variations (combinatorial explosion).
- When aggregating keys with high cardinality (such as URLs), the volume of data is not reduced by much (less than twofold).
- For this reason, the volume of data with aggregation might grow instead of shrink.
- Users do not view all the reports we generate for them. A large portion of those calculations is useless.
- The logical integrity of data may be violated for various aggregations.

If we do not aggregate anything and work with non-aggregated data, this might reduce the volume of calculations.

However, with aggregation, a significant part of the work is taken offline and completed at a relatively relaxed pace. In contrast, online calculations must be performed as fast as possible, since the user is waiting for the result.

Yandex.Metrica has a specialized system for aggregating data called Metrage, which was used for the majority of reports.
Starting in 2009, Yandex.Metrica also used a specialized OLAP database for non-aggregated data called OLAPServer, which was previously used for the report builder.
OLAPServer worked well for non-aggregated data, but it had many restrictions that did not allow it to be used for all reports as desired. These included the lack of support for data types other than numbers, and the inability to update data incrementally in real time (it could only be done by rewriting data daily). OLAPServer is not a DBMS, but a specialized DB.

The initial goal for ClickHouse was to remove the limitations of OLAPServer and solve the problem of working with non-aggregated data for all reports, but over the years, it has grown into a general-purpose database management system suitable for a wide range of analytical tasks.

[Original article](https://clickhouse.com/docs/en/introduction/history/) <!--hide-->
30 changes: 30 additions & 0 deletions docs/about-us/performance.md
@@ -0,0 +1,30 @@
---
sidebar_label: Performance
sidebar_position: 40
---

# Performance {#performance}

According to internal testing results at Yandex, ClickHouse shows the best performance (both the highest throughput for long queries and the lowest latency on short queries) for comparable operating scenarios among systems of its class that were available for testing. You can view the test results on a [separate page](https://clickhouse.com/benchmark/dbms/).

Numerous independent benchmarks came to similar conclusions. They are not difficult to find using an internet search, or you can see [our small collection of related links](https://clickhouse.com/#independent-benchmarks).

## Throughput for a Single Large Query {#throughput-for-a-single-large-query}

Throughput can be measured in rows per second or megabytes per second. If the data is placed in the page cache, a query that is not too complex is processed on modern hardware at a speed of approximately 2-10 GB/s of uncompressed data on a single server (for the most straightforward cases, the speed may reach 30 GB/s). If data is not placed in the page cache, the speed depends on the disk subsystem and the data compression rate. For example, if the disk subsystem allows reading data at 400 MB/s, and the data compression rate is 3, the speed is expected to be around 1.2 GB/s. To get the speed in rows per second, divide the speed in bytes per second by the total size of the columns used in the query. For example, if 10 bytes of columns are extracted, the speed is expected to be around 100-200 million rows per second.

The processing speed increases almost linearly for distributed processing, but only if the number of rows resulting from aggregation or sorting is not too large.

## Latency When Processing Short Queries {#latency-when-processing-short-queries}

If a query uses a primary key and does not select too many columns and rows to process (hundreds of thousands), you can expect less than 50 milliseconds of latency (single digits of milliseconds in the best case) if data is placed in the page cache. Otherwise, latency is mostly dominated by the number of seeks. If you use rotating disk drives, for a system that is not overloaded, the latency can be estimated with this formula: `seek time (10 ms) * count of columns queried * count of data parts`.
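
For example, a query reading 5 columns from a table with 4 data parts on such a drive gives a rough estimate of 10 ms × 5 × 4 = 200 ms.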

## Throughput When Processing a Large Quantity of Short Queries {#throughput-when-processing-a-large-quantity-of-short-queries}

Under the same conditions, ClickHouse can handle several hundred queries per second on a single server (up to several thousand in the best case). Since this scenario is not typical for analytical DBMSs, we recommend expecting a maximum of 100 queries per second.

## Performance When Inserting Data {#performance-when-inserting-data}

We recommend inserting data in batches of at least 1,000 rows, or no more than a single request per second. When inserting into a MergeTree table from a tab-separated dump, the insertion speed can be from 50 to 200 MB/s. If the inserted rows are around 1 KB in size, the speed will be from 50,000 to 200,000 rows per second. If the rows are small, the performance can be higher in rows per second (on Banner System data, 500,000 rows per second; on Graphite data, 1,000,000 rows per second). To improve performance, you can make multiple INSERT queries in parallel, which scales linearly.
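
A sketch of batching (illustrative table; `numbers()` stands in for real data). Each INSERT creates a new data part, so large batches keep the part count, and thus merge overhead, manageable:

```sql
-- One INSERT of 100,000 rows (good) instead of 100,000 single-row INSERTs (bad).
INSERT INTO events (ts, id, body)
SELECT now(), number, '' FROM numbers(100000);
```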

{## [Original article](https://clickhouse.com/docs/en/introduction/performance/) ##}
8 changes: 8 additions & 0 deletions docs/connect-a-ui/_category_.yml
@@ -0,0 +1,8 @@
position: 3
label: 'Connect a UI'
collapsible: true
collapsed: true
link:
  type: generated-index
  title: Connect a UI
  slug: /connect-a-ui