
ClickHouse Standalone Installation and Basic Data Import Lab

The document provides a detailed guide for installing ClickHouse on CentOS 7.8, including system checks, repository setup, and installation commands. It also outlines steps for importing data, creating tables, and performing various SQL queries to validate and analyze the imported data. Additionally, it introduces a benchmarking tool for performance testing of ClickHouse queries.

Uploaded by

suiyuewuyu

Installing ClickHouse on CentOS 7.8

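1. Install the ClickHouse software on the host
1.1. Check whether the system is supported (the CPU must provide the SSE 4.2 instruction set)
[root@clickhouse01 ~]# grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
SSE 4.2 supported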
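
The check above can be wrapped in a small function so that an installation script can stop early on an unsupported CPU. This is only a sketch: the function name has_sse42 and its optional file argument are illustrative assumptions, not part of ClickHouse.

```shell
#!/bin/sh
# has_sse42 [FILE]: succeed if FILE lists the sse4_2 CPU flag.
# FILE defaults to /proc/cpuinfo; the argument exists only so the
# function can be tried against a saved copy (hypothetical helper).
has_sse42() {
    cpuinfo="${1:-/proc/cpuinfo}"
    grep -q 'sse4_2' "$cpuinfo"
}
```

A script could then run: has_sse42 || { echo "SSE 4.2 not supported"; exit 1; }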
1.2. Add the official repository
[root@clickhouse01 ~]# yum install yum-utils

[root@clickhouse01 ~]# rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG

[root@clickhouse01 ~]# yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo

1.3. Install the ClickHouse server and client

[root@clickhouse01 ~]# yum install clickhouse-server clickhouse-client

1.4. Start and stop clickhouse-server

[root@clickhouse01 clickhouse-server]# service clickhouse-server start

Start clickhouse-server service: Path to data directory in /etc/clickhouse-server/config.xml: /var/lib/clickhouse/

DONE

[root@clickhouse01 clickhouse-server]# service clickhouse-server stop

Stop clickhouse-server service: DONE


Importing data and simple tests
The documentation and the data needed for the lab are on Baidu Netdisk:
Link: https://pan.baidu.com/s/13bBMGj5aKEHRotbTk59BNw
Extraction code: 1234

Download and install mwget

# cd /usr/local/src/

# wget http://jaist.dl.sourceforge.net/project/kmphpfm/mwget/0.1/mwget_0.1.0.orig.tar.bz2

# tar -jxvf mwget_0.1.0.orig.tar.bz2

# cd mwget_0.1.0.orig

# ./configure

# make

# make install

Importing the data
Following the official documentation, download a copy of the test data and get started.

Run a shell script with the following content:

for s in `seq 1987 2021`
do
    for m in `seq 1 12`
    do
        mwget http://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${s}_${m}.zip
    done
done

Note: mwget downloads faster than wget, so mwget is used to fetch the data.
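
Before starting a download of this size, it can help to print the list of URLs the loop will request and check it. The sketch below does only that; the function name list_ontime_urls and its two year arguments are illustrative assumptions, while the URL pattern is the one used in the script above.

```shell
#!/bin/sh
# list_ontime_urls FIRST_YEAR LAST_YEAR: print one download URL per month.
# Nothing is downloaded; the output can be inspected or piped to a downloader.
list_ontime_urls() {
    first_year="$1"
    last_year="$2"
    base="http://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present"
    for s in $(seq "$first_year" "$last_year"); do
        for m in $(seq 1 12); do
            echo "${base}_${s}_${m}.zip"
        done
    done
}
```

For example, list_ontime_urls 1987 2021 | while read -r url; do mwget "$url"; done performs the same downloads as the loop above.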

Run clickhouse-client to enter the ClickHouse client.

The CREATE TABLE statement is:

CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`Reporting_Airline` FixedString(7),
`DOT_ID_Reporting_Airline` Int32,
`IATA_CODE_Reporting_Airline` FixedString(2),
`Tail_Number` String,
`Flight_Number_Reporting_Airline` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(FlightDate)
ORDER BY (Year, FlightDate)
SETTINGS index_granularity = 8192;

The SQL statement needs to be collapsed onto a single line before it is executed.
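
One way to collapse a statement is to save it to a file and join its lines with standard tools. The following is a minimal sketch; the function name collapse_sql is an illustrative assumption. It also assumes the statement has no "--" line comments and no string literals whose internal spacing matters, since both would be altered by joining and squeezing.

```shell
#!/bin/sh
# collapse_sql FILE: print the contents of FILE as one line.
# paste joins the lines with spaces, the first tr turns tabs into spaces,
# tr -s squeezes runs of spaces, and sed trims the two ends.
collapse_sql() {
    paste -s -d ' ' "$1" | tr '\t' ' ' | tr -s ' ' | sed -e 's/^ //' -e 's/ $//'
}
```

With the statement saved in a file (create_ontime.sql is a hypothetical name), it can be passed to the client as clickhouse-client --query "$(collapse_sql create_ontime.sql)".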

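The table is created successfully.

Type exit to leave the client.
At the shell command line, run the following command to import the data into the table.

for i in *.zip; do echo "$i"; unzip -cq "$i" '*.csv' | sed 's/\.00//g' | clickhouse-client --input_format_skip_unknown_fields=1 --query="INSERT INTO ontime FORMAT CSVWithNames"; done
The data is imported successfully.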
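
The sed step in the import command removes every ".00", presumably so that values such as 14.00 in the CSV files can be parsed into the integer columns of the table. The sketch below isolates that step so its effect can be checked on sample lines; the function name strip_dot00 is an illustrative assumption.

```shell
#!/bin/sh
# strip_dot00: copy standard input to standard output with every ".00" removed,
# exactly as the sed expression in the import pipeline does.
# Caveat: the pattern is not anchored to the end of a number, so a value
# such as 1.005 would become 15.
strip_dot00() {
    sed 's/\.00//g'
}
```

For example, the line 1987,10,14.00,"SFO",-5.00 becomes 1987,10,14,"SFO",-5.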
Validating the data:
1. Count the total number of rows:

SELECT count(*) FROM ontime;

More than 200 million rows; the query ran in 0.007 seconds.

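2. Average number of flights per month:
SELECT avg(c1)
FROM
(
    SELECT Year, Month, count(*) AS c1
    FROM ontime
    GROUP BY Year, Month
);
A query over data at the tens-of-millions scale returns in milliseconds, which shows how high the query performance is.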

3. Number of flights per day of the week from 2000 through 2008

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

4. Number of flights delayed by more than 10 minutes, per day of the week, from 2000 through 2008

SELECT DayOfWeek, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY DayOfWeek
ORDER BY c DESC;

5. Number of delays of more than 10 minutes per origin airport from 2000 through 2008 (top 10)

SELECT Origin, count(*) AS c
FROM ontime
WHERE DepDelay>10 AND Year>=2000 AND Year<=2008
GROUP BY Origin
ORDER BY c DESC
LIMIT 10;

6. Number of delays of more than 10 minutes per airline in 2007

SELECT Reporting_Airline, count(*)
FROM ontime
WHERE DepDelay>10 AND Year=2007
GROUP BY Reporting_Airline
ORDER BY count(*) DESC;

7. Percentage of flights delayed by more than 10 minutes per airline in 2007

SELECT Reporting_Airline, avg(DepDelay>10)*100 AS c3
FROM ontime
WHERE Year=2007
GROUP BY Reporting_Airline
ORDER BY c3 DESC;
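
The expression avg(DepDelay>10)*100 works because the comparison evaluates to 1 for a delayed flight and 0 otherwise, so the average of those values is the fraction of delayed flights. The same arithmetic can be sketched outside ClickHouse with awk on a plain list of delays; the function name percent_delayed is an illustrative assumption.

```shell
#!/bin/sh
# percent_delayed: read one departure delay (in minutes) per line and print
# the percentage of lines whose delay is strictly greater than 10.
# Counting the matches and dividing by the total is the same as averaging
# a 0/1 indicator, which is what avg(DepDelay>10) does.
percent_delayed() {
    awk '{ n++; if ($1 > 10) late++ }
         END { if (n > 0) printf "%.2f\n", 100 * late / n }'
}
```

For the four delays 5, 15, 30 and 0, two exceed 10 minutes, so the function prints 50.00.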

8. The same query as the previous one, with the range widened to 2000 through 2008

SELECT Reporting_Airline, avg(DepDelay>10)*100 AS c3
FROM ontime
WHERE Year>=2000 AND Year<=2008
GROUP BY Reporting_Airline
ORDER BY c3 DESC;

9. Percentage of flights delayed by more than 10 minutes, per year

SELECT Year, avg(DepDelay>10)*100
FROM ontime
GROUP BY Year
ORDER BY Year;

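10. The most popular destinations from 2000 through 2010, ranked by the number of distinct origin cities with flights to them
SELECT DestCityName, uniqExact(OriginCityName) AS u
FROM ontime
WHERE Year >= 2000 AND Year <= 2010
GROUP BY DestCityName
ORDER BY u DESC LIMIT 10;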

11. Number of flights per year

SELECT Year, count(*) AS c1
FROM ontime
GROUP BY Year;

12. The most popular destinations in 2010, by origin and destination city pair

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    bar(flights, 0, 20000, 40)
FROM ontime
WHERE Year = 2010
GROUP BY
    OriginCityName,
    DestCityName
ORDER BY flights DESC
LIMIT 20;

13. Longest flight times (average air time per origin and destination city pair)

SELECT
    OriginCityName,
    DestCityName,
    count(*) AS flights,
    avg(AirTime) AS duration
FROM ontime
GROUP BY
    OriginCityName,
    DestCityName
ORDER BY duration DESC;

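Performance testing with clickhouse-benchmark
ClickHouse officially provides a benchmark tool that connects to a ClickHouse server and repeatedly sends the specified queries.
Create a file named queries_file containing the following two lines:
SELECT * FROM system.numbers LIMIT 10000000
SELECT 1

Save the file and exit, then run:
clickhouse-benchmark < queries_file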
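
As an alternative to typing the file in an editor, it can be written from the shell with a here-document. This is a small sketch; the function name write_queries_file is an illustrative assumption, and the two queries are the ones listed above.

```shell
#!/bin/sh
# write_queries_file PATH: write the two benchmark queries to PATH, one per line.
# The quoted 'EOF' delimiter keeps the shell from expanding anything in the body.
write_queries_file() {
    cat > "$1" <<'EOF'
SELECT * FROM system.numbers LIMIT 10000000
SELECT 1
EOF
}
```

Running write_queries_file queries_file and then clickhouse-benchmark < queries_file reproduces the steps above.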

Output:

By default, clickhouse-benchmark prints a benchmark report for each time interval (the value set with --delay).

Example benchmark report:

Queries executed: 10.

localhost:9000, queries 10, QPS: 6.772, RPS: 67904487.440, MiB/s: 518.070, result RPS: 67721584.984, result MiB/s: 516.675.

0.000% 0.145 sec.

10.000% 0.146 sec.

20.000% 0.146 sec.

30.000% 0.146 sec.

40.000% 0.147 sec.

50.000% 0.148 sec.

60.000% 0.148 sec.

70.000% 0.148 sec.

80.000% 0.149 sec.

90.000% 0.150 sec.

95.000% 0.150 sec.

99.000% 0.150 sec.

99.900% 0.150 sec.

99.990% 0.150 sec.
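
When several runs are compared, it is convenient to pull a single figure such as QPS out of a saved report. The sketch below does this with sed for the status-line format shown in the example; the function name extract_qps is an illustrative assumption, and the pattern relies on that line format staying the same.

```shell
#!/bin/sh
# extract_qps: read benchmark report text on standard input and print the
# number that follows "QPS: " on each status line; other lines are ignored.
extract_qps() {
    sed -n 's/.*QPS: \([0-9.]*\),.*/\1/p'
}
```

A possible use is clickhouse-benchmark < queries_file 2>&1 | extract_qps, where 2>&1 is added on the assumption that the report may be written to standard error.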


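The benchmark report contains the following:

- Queries executed: the number of queries executed.
- The status string, containing (in order):
    - The address and port of the ClickHouse server.
    - The number of queries processed.
    - QPS: how many queries the server executed per second during the period specified by --delay.
    - RPS: how many rows the server read per second during the period specified by --delay.
    - MiB/s: how many MiB of data the server read per second during the period specified by --delay.
    - result RPS: how many rows of query results the server returned per second during the period specified by --delay.
    - result MiB/s: how many MiB of query results the server returned per second during the period specified by --delay.
- Percentiles of the query execution time.
    - In the current version, the value next to n% is not the average run time of n% of the SQL statements; if it were, the values would not necessarily increase down the list.
    - It is the execution time, in seconds, of the statement that sits at the n% position when all executed statements are ordered by execution time.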
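
To make the percentile reading concrete, the sketch below computes a nearest-rank percentile from a list of execution times: sort the times, then take the value at position ceil(n/100 * count). The function name percentile is an illustrative assumption, and nearest-rank is chosen here only for illustration; clickhouse-benchmark may use a different estimation method internally.

```shell
#!/bin/sh
# percentile P: read one execution time per line on standard input and print
# the nearest-rank P-th percentile (P between 0 and 100).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }                      # values arrive in ascending order
        END {
            if (NR == 0) exit
            x = p * NR / 100
            rank = int(x)
            if (rank < x) rank++            # round up to the next position
            if (rank < 1) rank = 1          # the 0th percentile is the minimum
            print v[rank]
        }'
}
```

Because the values are sorted first, a higher percentile can never be smaller than a lower one, which is why the column in the report only increases.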
