[中英对照]TimescaleDB 时序数据库的数据模型

Web应用开发 William 203浏览 0评论

As a relational database supporting full SQL, TimescaleDB supports flexible data models that can be optimized for different use cases. This makes Timescale somewhat different from most other time-series databases, which typically use “narrow-table” models.

Specifically, TimescaleDB can support both wide-table and narrow-table models. Here, we discuss the different performance trade-offs and implications of these two models using an Internet of Things (IoT) example.

作为一个支持完整 SQL 的关系型数据库,TimescaleDB 支持灵活的数据类型,可根据不同用户的实际情况进行优化。因此,Timescale 与大多数其它使用“窄表”模型时序数据库有所不同。

具来说来,TimescaleDB 可以支持宽表模型和窄表模型。我们使用一个物联网(IoT)的示例来讨论这两个模型在性能平衡和意义上的不同之处。

Imagine a distributed group of 1,000 IoT devices designed to collect environmental data at various intervals. This data could include:

  • Identifiers: device_idtimestamp

  • Metadata: location_iddev_typefirmware_versioncustomer_id

  • Device metrics: cpu_1m_avgfree_memused_memnet_rssinet_lossbattery

  • Sensor metrics: temperaturehumiditypressureCONO2PM10

For example, your incoming data may look like this:

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 335 field
2017-01-01 01:02:23 def456 90 400MB 64 335 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 335 field
2017-01-01 01:03:35 def456 95 350MB 64 335 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77 roof

Now, let’s look at various ways to model this data.

想象有超过1000个物联网设备被分布在各地,以各种不同的时间间隔来收集环境数据。这些数据包含:

  • 标识信息(Identifiers): 设备id(device_id), 时间戳(timestamp)

  • 元数据(Metadata): 位置id(location_id)设备类型(dev_type), 固件版本(firmware_version)客户id(customer_id)

  • 设备指标(Device metrics): cpu平均占用(cpu_1m_avg)空闲内存(free_mem)已使用内存(used_mem)网络信号强度(net_rssi)网络丢包率(net_loss)电池电量(battery)

  • 传感器指标(Sensor metrics): 温度(temperature)湿度(humidity)压力(pressure), 一氧化碳(CO), 二氧化氮(NO2)PM10

举例来说,你接收到的数据可能是下面这样的:

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 335 field
2017-01-01 01:02:23 def456 90 400MB 64 335 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 335 field
2017-01-01 01:03:35 def456 95 350MB 64 335 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77 roof

现在,我们来看看模拟这些数据的各种方法。

Narrow-table Model

Most time-series databases would represent this data in the following way:

  • Represent each metric as a separate entity (e.g., represent cpu_1m_avgand free_mem as two different things)

  • Store a sequence of “time”, “value” pairs for that metric

  • Represent the metadata values as a “tag-set” associated with that metric/tag-set combination

In this model, each metric/tag-set combination is considered an individual “time series” containing a sequence of time/value pairs.

Using our example above, this approach would result in 9 different “time series”, each of which is defined by a unique set of tags.

1. {name:  cpu_1m_avg,  device_id: abc123,  location_id: 335,  dev_type: field}
2. {name:  cpu_1m_avg,  device_id: def456,  location_id: 335,  dev_type: roof}
3. {name:  cpu_1m_avg,  device_id: ghi789,  location_id:  77,  dev_type: roof}
4. {name:    free_mem,  device_id: abc123,  location_id: 335,  dev_type: field}
5. {name:    free_mem,  device_id: def456,  location_id: 335,  dev_type: roof}
6. {name:    free_mem,  device_id: ghi789,  location_id:  77,  dev_type: roof}
7. {name: temperature,  device_id: abc123,  location_id: 335,  dev_type: field}
8. {name: temperature,  device_id: def456,  location_id: 335,  dev_type: roof}
9. {name: temperature,  device_id: ghi789,  location_id:  77,  dev_type: roof}

The number of such time series scales with the cross-product of the cardinality of each tag, i.e., (# names) × (# device ids) × (# location ids) × (device types). Some time-series databases struggle as cardinality increases, ultimately limiting the number of device types and devices you can store in a single database.

窄表模型

大多数时序数据库表示数据无外乎以下几种方式:

  • 每个指标作为一个单独的实体类来表示(例:cpu_lm_avg和free_mem表示成两种不同的实体类)

  • 指标存储用”时间”,”值”的键值对来存储

  • 将元数据值表示为与该指标/标签集组合相关联的“标签集”

在这种模型中,每个指标/标签集的组合可以看做一个独立的包含一个时间/值的键值对的时间序列。

使用我们的上述的例子,这个方法将会产生九个不同的时间序列,每个时间序列都是由一个唯一的标签集合来定义的。

1. {name:  cpu_1m_avg,  device_id: abc123,  location_id: 335,  dev_type: field}
2. {name:  cpu_1m_avg,  device_id: def456,  location_id: 335,  dev_type: roof}
3. {name:  cpu_1m_avg,  device_id: ghi789,  location_id:  77,  dev_type: roof}
4. {name:    free_mem,  device_id: abc123,  location_id: 335,  dev_type: field}
5. {name:    free_mem,  device_id: def456,  location_id: 335,  dev_type: roof}
6. {name:    free_mem,  device_id: ghi789,  location_id:  77,  dev_type: roof}
7. {name: temperature,  device_id: abc123,  location_id: 335,  dev_type: field}
8. {name: temperature,  device_id: def456,  location_id: 335,  dev_type: roof}
9. {name: temperature,  device_id: ghi789,  location_id:  77,  dev_type: roof}

这些时间序列的范围大小是与每个标签的基数的交叉乘积,也就是(# names) × (# device ids) × (# location ids) × (device types)。随着基数的增加,一些时间序列数据库会遇到性能问题,最终限制了可以存储在单个数据库中的设备类型和设备的数量。

TimescaleDB supports narrow models and does not suffer from the same cardinality limitations as other time-series databases do. A narrow model makes sense if you collect each metric independently. It allows you to add new metrics as you go by adding a new tag without requiring a formal schema change.

However, a narrow model is not as performant if you are collecting many metrics with the same timestamp, since it requires writing a timestamp for each metric. This ultimately results in higher storage and ingest requirements. Further, queries that correlate different metrics are also more complex, since each additional metric you want to correlate requires another JOIN. If you typically query multiple metrics together, it is both faster and easier to store them in a wide table format, which we will cover in the following section.

TimescaleDB 支持窄表模型,并且不会像其他时间序列数据库那样受到相同的基数限制。如果您单独收集每个指标,那么窄表模型就有意义。它允许您通过添加新标记来添加新指标,而无需进行正式的架构更改。

但是,如果要收集具有相同时间戳的很多度量标准,则窄表模型就不具有高效性,因为它需要为每个度量标准写入时间戳。这最终会导致更高的存储和获取要求。此外,关联不同度量标准的查询也更复杂,因为要关联的每一个额外度量标准都需要一个 JOIN。如果您想同时查询多个指标,用宽表格式存储会更加便捷,这个我们将在下一节中进行介绍。

Wide-table Model

TimescaleDB easily supports wide-table models. Queries across multiple metrics are easier in this model, since they do not require JOINs. Also, ingest is faster since only one timestamp is written for multiple metrics.

A typical wide-table model would match a typical data stream in which multiple metrics are collected at a given timestamp:

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 42 field
2017-01-01 01:02:23 def456 90 400MB 64 42 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 42 field
2017-01-01 01:03:35 def456 95 350MB 64 42 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77 roof

Here, each row is a new reading, with a set of measurements and metadata at a given time. This allows us to preserve relationships within the data, and ask more interesting or exploratory questions than before.

Of course, this is not a new format: it’s what one would commonly find within a relational database.

宽表模型

TimescaleDB 能够轻松支持宽表模型。由于该模型不需要 JOIN,所以跨越多个度量标准的查询更为容易。此外,由于该模型只为多个度量标准编写了一个时间戳,因此提取速度也更快。

一个典型的宽表模型将匹配一个在给定时间戳中收集的多个度量指标的典型数据流:

timestamp device_id cpu_1m_avg free_mem temperature location_id dev_type
2017-01-01 01:02:00 abc123 80 500MB 72 42 field
2017-01-01 01:02:23 def456 90 400MB 64 42 roof
2017-01-01 01:02:30 ghi789 120 0MB 56 77 roof
2017-01-01 01:03:12 abc123 80 500MB 72 42 field
2017-01-01 01:03:35 def456 95 350MB 64 42 roof
2017-01-01 01:03:42 ghi789 100 100MB 56 77

roof

 

在这里,每一行都是一个新的读数,在给定的时间内有一组测量和元数据。这使我们能够保留数据中的关系,并提出比以前更多有趣或探索性的问题。

当然,这不是一种新格式:它是人们在关系数据库中所经常见到的内容。

JOINs with Relational Data

TimescaleDB’s data model also has another similarity with relational databases: it supports JOINs. Specifically, one can store additional metadata in a secondary table, and then utilize that data at query time.

In our example, one could have a separate locations table, mapping location_id to additional metadata for that location. For example:

location_id name latitude longitude zip_code region
42 Grand Central Terminal 40.7527° N 73.9772° W 10017 NYC
77 Lobby 7 42.3593° N 71.0935° W 02139 Massachusetts

Then at query time, by joining our two tables, one could ask questions like: what is the average free_mem of our devices in zip_code 10017?

Without joins, one would need to denormalize their data and store all metadata with each measurement row. This creates data bloat, and makes data management more difficult.

With joins, one can store metadata independently, and update mappings more easily.

For example, if we wanted to update our “region” for location_id 77 (e.g., from “Massachusetts” to “Boston”), we can make this change without having to go back and overwrite historical data.

关系型数据中的 JOINs

TimescaleDB 的数据模型与关系型数据库有着很多相似之处:它支持 JOINs。具体的来说,它可以在副表中存储其他元数据,然后在查询时使用该数据。

在我们的示例中,有一个单独的位置表,能够将 location_id 映射到该位置的其他元数据中。例如:

location_id name latitude longitude zip_code region
42 Grand Central Terminal 40.7527° N 73.9772° W 10017 NYC
77 Lobby 7 42.3593° N 71.0935° W 02139 Massachusetts

然后在查询时,通过关联这两个表,可以提出如下问题: 在 zip_code 为 10017 时,我们设备的 free_mem 平均值是多少?

如果没有 JOINs,我们需要对其数据进行反规范化,并将每个测量行存储在所有元数据中。这会造成数据膨胀,并使数据管理变得更加困难。

有了 JOINs,我们可以独立存储元数据,并且更快捷地更新映射。

例如,如果我们想更新 location_id 为 77 的”region”的数据(例如将”Massachusetts”修改为”Boston”),我们可以直接进行更改,而无需返回并覆盖历史数据。

div.column{width:49.5%;display:table-cell;border:1px solid #d4d4d5;}


via:oschina

转载请注明:AspxHtml学习分享网 » [中英对照]TimescaleDB 时序数据库的数据模型

发表我的评论
取消评论

表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址