Oracle ACE

MySQL

MySQL 数据分析利器:聚合查询与汇总查询深度实战

最近在重新整理 MySQL 的基础理论,发现 MySQL 的聚合查询能力早已从简单的统计工具,而是进化为企业级数据分析引擎。今天咱们就聊聊如何在 MySQL 8.4 LTS 中玩转聚合与汇总。

01. 聚合函数全景解析

聚合函数是数据分析的基石,但有些边界情况往往让开发者翻车。MySQL 的聚合不仅限于简单统计,更支撑着现代AI管道的特征工程。

案例:非NULL值统计

表结构和测试数据。

1
2
3
4
5
6
7
8
9
10
11
12
CREATE TABLE login_history (
id INT PRIMARY KEY,
last_login DATE -- 允许为 NULL,表示从未登录
);

INSERT INTO login_history (id, last_login) VALUES
(1, '2023-10-01'),
(2, '2023-10-02'),
(3, NULL), -- 未登录用户
(4, '2023-10-03'),
(5, NULL); -- 未登录用户

需求:统计last_login列中非NULL的日期数量(排除从未登录的用户)

1
2
3
4
5
-- 错误写法(统计所有行,包括NULL)
SELECT COUNT(*) FROM login_history;

-- 正确答案(只统计非NULL值)
SELECT COUNT(last_login) FROM login_history;

结果:如果last_login有NULL值,COUNT(last_login) < COUNT(*)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> SELECT COUNT(*) FROM login_history;
+----------+
| COUNT(*) |
+----------+
| 5 |
+----------+
1 row in set (0.01 sec)

mysql> SELECT COUNT(last_login) FROM login_history;
+-------------------+
| COUNT(last_login) |
+-------------------+
| 3 |
+-------------------+
1 row in set (0.00 sec)

案例:聚合函数的边界行为

场景1:空表聚合

1
2
3
4
5
6
7
mysql> SELECT MAX(salary) FROM salaries WHERE 1=0;
+-------------+
| MAX(salary) |
+-------------+
| NULL |
+-------------+
1 row in set (0.000 sec)

结果:NULL(不是0),这是SQL标准行为

场景2:特征工程中的COALESCE应用

准备测试表:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- 创建表结构
CREATE TABLE employees (
eid INT PRIMARY KEY,
ename VARCHAR(50),
esalary DECIMAL(10, 4),
ebonus DECIMAL(10, 4)
);

-- 插入测试数据
INSERT INTO employees (eid, ename, esalary, ebonus) VALUES
(1, 'Alice', 8000.0000, 2000.0000), -- With bonus
(2, 'Bob', 9500.0000, NULL), -- NULL bonus
(3, 'Charlie', 12000.0000, 0.0000), -- Zero bonus
(4, 'David', 7000.0000, NULL); -- NULL bonus
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- 计算员工总薪酬,处理奖金可能为NULL的情况
SELECT ename,
(esalary + COALESCE(ebonus, 0.0000)) AS etotal_pay
FROM employees;

+---------+------------+
| ename | etotal_pay |
+---------+------------+
| Alice | 10000.0000 |
| Bob | 9500.0000 |
| Charlie | 12000.0000 |
| David | 7000.0000 |
+---------+------------+
4 rows in set (0.000 sec)

AI扩展:构建用户价值特征时,确保NULL值有默认值,避免模型训练偏差

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- 创建表结构
CREATE TABLE user_behavior (
id INT PRIMARY KEY,
user_id INT,
amount DECIMAL(10, 2),
session_duration INT
);

-- 插入测试数据
INSERT INTO user_behavior (id, user_id, amount, session_duration) VALUES
(1, 101, 150.50, 300), -- Normal activity for User 101
(2, 101, 200.00, 450), -- Multiple transactions for User 101
(3, 102, NULL, NULL), -- User 102 exists but has missing data
(4, 103, 50.00, 120), -- Normal activity for User 103
(5, 104, NULL, 60); -- User 104 has duration but no amount

-- COALESCE 函数
SELECT user_id,
COALESCE(SUM(amount), 0) as total_spend, -- 消费总额特征
COALESCE(AVG(session_duration), 0) as avg_duration -- 平均停留特征
FROM user_behavior
GROUP BY user_id;

输出:

1
2
3
4
5
6
7
8
9
+---------+-------------+--------------+
| user_id | total_spend | avg_duration |
+---------+-------------+--------------+
| 101 | 350.50 | 375.0000 |
| 102 | 0.00 | 0.0000 |
| 103 | 50.00 | 120.0000 |
| 104 | 0.00 | 60.0000 |
+---------+-------------+--------------+
4 rows in set (0.000 sec)

COALESCE 函数的主要作用是 “返回参数列表中的第一个非空(NOT NULL)值”。简单来说,它的逻辑是:从左到右检查每个参数,遇到第一个不是 NULL 的值就立刻返回;如果所有参数都是 NULL,则返回 NULL。在机器学习中,模型通常无法直接处理 NULL 值(会报错或导致偏差)。通过 COALESCE 将其统一设为 0,可以确保数据的连续性和完整性,避免模型训练失败。

02. 字符串聚合与可视化报表

在数据可视化领域,聚合不仅是数学计算,更是字符串的艺术。下面的示例将如何用聚合函数生成直观的报表。

案例:条形图生成(GPA可视化)

测试数据。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CREATE TABLE players (
id INT PRIMARY KEY AUTO_INCREMENT,
Name VARCHAR(50),
Gender VARCHAR(10),
Sport VARCHAR(50),
GPA DECIMAL(3, 2)
);

INSERT INTO players (Name, Gender, Sport, GPA) VALUES
('Alice', 'Female', 'Soccer', 3.92),
('Bob', 'Male', 'Basketball', 2.50),
('Charlie', 'Male', 'Baseball', 3.15),
('Diana', 'Female', 'Tennis', 4.00),
('Ethan', 'Male', 'Soccer', 1.85),
('Fiona', 'Female', 'Volleyball', 3.60);

使用REPEAT,或者 RPAD 生成ASCII条形图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
mysql> SELECT Name, Gender, Sport,
-> REPEAT('#', CAST(GPA*10 AS UNSIGNED)) AS GPA_Graph
-> FROM players
-> ORDER BY GPA DESC;
+---------+--------+------------+------------------------------------------+
| Name | Gender | Sport | GPA_Graph |
+---------+--------+------------+------------------------------------------+
| Diana | Female | Tennis | ######################################## |
| Alice | Female | Soccer | ####################################### |
| Fiona | Female | Volleyball | #################################### |
| Charlie | Male | Baseball | ################################ |
| Bob | Male | Basketball | ######################### |
| Ethan | Male | Soccer | ################### |
+---------+--------+------------+------------------------------------------+
6 rows in set (0.000 sec)

mysql> SELECT Name, Gender, Sport,
-> RPAD('#', CAST(GPA*10 AS UNSIGNED), '#') AS GPA_Graph
-> FROM players;
+---------+--------+------------+------------------------------------------+
| Name | Gender | Sport | GPA_Graph |
+---------+--------+------------+------------------------------------------+
| Alice | Female | Soccer | ####################################### |
| Bob | Male | Basketball | ######################### |
| Charlie | Male | Baseball | ################################ |
| Diana | Female | Tennis | ######################################## |
| Ethan | Male | Soccer | ################### |
| Fiona | Female | Volleyball | #################################### |
+---------+--------+------------+------------------------------------------+
6 rows in set (0.000 sec)

案例:字符串拼接与多字段汇总

基本用法:合并姓名

1
2
3
4
5
SELECT CONCAT_WS(', ', last_name, first_name) AS full_name 
FROM employees;

-- 或者使用CONCAT
SELECT CONCAT(last_name, ',', first_name) FROM employees;

准备测试数据:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- 国家表
CREATE TABLE country (
Code CHAR(3) PRIMARY KEY,
Region VARCHAR(50) NOT NULL
);

-- 国家语言表
CREATE TABLE countrylanguage (
CountryCode CHAR(3),
Language VARCHAR(30),
IsOfficial ENUM('T', 'F') DEFAULT 'F',
PRIMARY KEY (CountryCode, Language),
FOREIGN KEY (CountryCode) REFERENCES country(Code)
);

-- 插入国家数据
INSERT INTO country (Code, Region) VALUES ('CHN', 'East Asia'), ('USA', 'North America');

-- 插入语言数据
INSERT INTO countrylanguage (CountryCode, Language, IsOfficial) VALUES
('CHN', 'Chinese', 'T'),
('CHN', 'Cantonese', 'F'),
('USA', 'English', 'T'),
('USA', 'Spanish', 'F');

区域语言列表生成

1
2
3
4
5
6
7
SELECT c.Region, 
GROUP_CONCAT(DISTINCT cl.Language ORDER BY cl.Language) AS LanguagesSpoken,
COUNT(DISTINCT cl.Language) As NumLanguages
FROM country As c, countrylanguage as cl
WHERE c.Code = cl.Countrycode
GROUP BY c.Region
HAVING NumLanguages < 10; -- HAVING过滤聚合结果

输出:

1
2
3
4
5
6
7
+---------------+-------------------+--------------+
| Region | LanguagesSpoken | NumLanguages |
+---------------+-------------------+--------------+
| East Asia | Cantonese,Chinese | 2 |
| North America | English,Spanish | 2 |
+---------------+-------------------+--------------+
2 rows in set (0.001 sec)

03. MySQL 8.4 新特性:窗口函数与JSON聚合增强

MySQL 8.4 LTS 中聚合能力在窗口函数框架和JSON处理上有了显著提升。

1
2
3
4
5
6
7
8
-- 生成JSON格式的区域语言报告
SELECT c.Region,
JSON_ARRAYAGG(
JSON_OBJECT('lang', cl.Language, 'isOfficial', cl.IsOfficial)
) AS lang_details
FROM country c
JOIN countrylanguage cl ON c.Code = cl.CountryCode
GROUP BY c.Region;

输出:

1
2
3
4
5
6
7
+---------------+------------------------------------------------------------------------------------+
| Region | lang_details |
+---------------+------------------------------------------------------------------------------------+
| East Asia | [{"lang": "Cantonese", "isOfficial": "F"}, {"lang": "Chinese", "isOfficial": "T"}] |
| North America | [{"lang": "English", "isOfficial": "T"}, {"lang": "Spanish", "isOfficial": "F"}] |
+---------------+------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

MySQL 8.4优化:当窗口函数与GROUP BY结合时,优化器能更好地消除临时表

1
2
3
4
5
EXPLAIN FORMAT=TREE 
SELECT dept_id,
AVG(salary) OVER (PARTITION BY dept_id)
FROM employees
GROUP BY dept_id;

04. OCI与AI场景下的聚合优化策略

在Oracle Cloud Infrastructure (OCI) 上运行MySQL HeatWave时,聚合查询的优化策略与传统部署有所不同。

OCI MySQL HeatWave聚合加速:

1
2
3
4
5
6
7
8
9
10
-- 利用HeatWave列式存储加速聚合(相比InnoDB行式存储提升10-100倍)
SELECT /*+ SET_VAR(rapid_execution_strategy=auto) */
user_segment,
COUNT(*) as user_count,
AVG(lifetime_value) as avg_ltv,
MAX(last_purchase_date) as latest_purchase
FROM customer_analytics
GROUP BY user_segment;

-- HeatWave自动将聚合操作下推到列式存储层,减少数据传输

AI实时特征聚合:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
-- 特征工程:实时聚合用户最近30天行为(OCI微服务架构)
CREATE TABLE user_features AS
SELECT
user_id,
COUNT(DISTINCT session_id) as unique_sessions, -- 去重计数特征
SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) as total_clicks,
AVG(CASE WHEN event_type = 'purchase' THEN amount ELSE NULL END) as avg_purchase_amount,
-- 使用GROUP_CONCAT生成行为序列(用于NLP模型)
GROUP_CONCAT(DISTINCT event_type ORDER BY created_at DESC) as recent_behavior_sequence
FROM user_events
WHERE created_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY user_id;

-- 使用JSON_OBJECTAGG生成结构化特征向量(适合ML模型输入)
SELECT
user_id,
JSON_OBJECTAGG(feature_name, feature_value) as feature_vector
FROM (
SELECT user_id, 'click_rate' as feature_name, click_count/imp_count as feature_value
FROM metrics
UNION ALL
SELECT user_id, 'conversion_rate', conversion_count/click_count
FROM metrics
) sub
GROUP BY user_id;

索引优化聚合性能:

1
2
3
4
5
6
-- 为GROUP BY字段创建索引
CREATE INDEX idx_region ON country(Region);

-- 覆盖索引优化聚合查询(避免回表)
CREATE INDEX idx_covering ON employees(dept_id, salary, bonus);
-- 使得SELECT dept_id, SUM(salary), AVG(bonus) FROM employees GROUP BY dept_id 完全使用索引

总结

好了,本期内容先到这里,简单做个总结。

  1. NULL值精准处理:COUNT(列)只统计非NULL值,而SUM/AVG等自动忽略NULL,使用COALESCE确保NULL有默认值,避免偏差。
  2. 字符串聚合可视化:利用REPEAT/RPAD生成ASCII图表,结合GROUP_CONCAT构建行为序列特征,建议设置group_concat_max_len为1024以上以支持长序列。
  3. 多维汇总利器:WITH ROLLUP是生成小计总计的标准方案,MySQL 8.4优化了其与ORDER BY的执行顺序,在HeatWave列式存储中聚合性能提升显著。
  4. 窗口函数进阶:MySQL 8.4 LTS支持更高效的窗口函数FRAME计算。
  5. OCI云原生优化:在HeatWave中使用列式存储加速聚合分析,通过覆盖索引减少GROUP BY的回表开销,利用JSON聚合函数直接生成API响应数据,减少应用层处理逻辑。

Have a nice day ~ ☕

🌻 近期内容 ▼

👉 这里有得聊

如果对国产基础软件(操作系统、数据库、中间件)、AI、Vibe Coding、OpenClaw 、Hermes Agent 等感兴趣,可以加群一起聊聊。关注微信公众号:(少安事务所),后台回复[群],即可看到入口。如果这篇文章为你带来了灵感或启发,请帮忙『点赞、推荐、转发』吧,感谢!ღ( ´・ᴗ・` )~