(timestamp-) Consistent backups in HBase
The topic of consistent backups in HBase comes up every now and then. In this article I will outline a scheme that does provide timestamp-consistent backups.
Consistent backups are possible in HBase. With "consistent" I mean "consistent as of a specific timestamp" (this is limited by the HBase timestamp granularity.)
Over the past few months I have contributed various changes to HBase that now hopefully lead full circle to a more coherent story for "system of record" type data retention.
The basic setup is simple. You'll need
- HBASE-4071 (0.92 - for MIN_VERSIONS)
- HBASE-4536 (0.94 - keeping deleted cells and raw scans)
- HBASE-4682 (0.94 - deleted cells in exports)
HBase's built-in Export tool can then be used to generate consistent snapshots.
Normally the data collected by an Export job is "smeared" over the time interval it takes to execute the job; an Export-Scan sees a row (cells of a row to be precise) as of the time when it happens to get to it, and rows can change while the Export is running.
That is problematic, because it is not possible to recreate a consistent view of the database from an export generated this way.
The trick then is to keep all data in HBase long enough for the backup job to finish, and to only collect information before the start time of the export job.
Say the backup takes no more than T to complete (yes, it is hard to know ahead of time how long an Export is going to run). In that case the table's column families can be set up as follows:
- set TTL to T + some headroom (so maybe 2T to be safe)
- set VERSIONS to a very large number (max int = 2147483647 for example, i.e. nothing is evicted from HBase due to VERSIONS)
- set MIN_VERSIONS to how many versions you want to keep around, otherwise all versions could be removed once their TTL expires
- set KEEP_DELETED_CELLS to true (this makes sure that deleted cells and delete markers are kept until they expire by the TTL)
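The settings above can be derived from an estimate of the backup duration. The following is a minimal Python sketch, not part of HBase: the helper name `column_family_spec`, the family name `cf`, and the default headroom factor of 2 are assumptions made for illustration. It only formats a column family descriptor string for the HBase shell, with the TTL expressed in seconds.

```python
from datetime import timedelta

# max int, so that nothing is evicted because of VERSIONS
MAX_VERSIONS = 2**31 - 1


def column_family_spec(family, max_backup_duration, min_versions, headroom=2):
    """Format a column family descriptor for the HBase shell.

    TTL is the expected backup duration T times a headroom factor
    (2T by default), in seconds.
    """
    ttl_seconds = int(max_backup_duration.total_seconds() * headroom)
    return (
        "{NAME=>'%s', VERSIONS=>%d, TTL=>%d, MIN_VERSIONS=>%d, "
        "KEEP_DELETED_CELLS=>true}"
        % (family, MAX_VERSIONS, ttl_seconds, min_versions)
    )


# Hypothetical values: a backup that takes at most 6 hours, one version kept.
print(column_family_spec("cf", timedelta(hours=6), min_versions=1))
```

With T = 6 hours this yields a TTL of 43200 seconds (12 hours).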
For example, in the HBase shell:
create '<table>', {NAME=>'<columnfamily>', VERSIONS=>2147483647, TTL=><2T in seconds>, MIN_VERSIONS=><count>, KEEP_DELETED_CELLS=>true}
Export can now be used as follows:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.include.deleted.rows=true <tablename> <outputdir> 2147483647 -2147483648 <start time of the Export job>
As long as the Export finishes within 2T, a consistent snapshot as of the time the Export was started is created. Otherwise some data might be missing, as it could have been compacted away before the Export had a chance to see it.
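The important detail is that the end time is fixed once, when the job is launched, and not re-evaluated while it runs. Here is a small Python sketch of a launcher that assembles the argument list. The function name, the table name `mytable`, and the output path are hypothetical; the positional order (table, output directory, versions, start time, end time) follows the Export tool's usage, and the timestamp is in milliseconds, HBase's default timestamp unit.

```python
import time

EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"
MAX_VERSIONS = 2147483647
EARLIEST = -2147483648  # start time value taken from the article's command


def full_export_args(table, output_dir, end_ms):
    """Argument list for a full Export of everything written before end_ms."""
    return [
        "hbase", EXPORT_CLASS,
        "-D", "hbase.mapreduce.include.deleted.rows=true",
        table, output_dir,
        str(MAX_VERSIONS), str(EARLIEST), str(end_ms),
    ]


# Capture the job start time once and use it as the Export end time.
job_start_ms = int(time.time() * 1000)
print(" ".join(full_export_args("mytable", "/backups/mytable", job_start_ms)))
```

The resulting list could be handed to something like `subprocess.run`, but the sketch deliberately stops at printing the command.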
Since the backups also copied deleted rows and delete markers, a backup restored to an HBase instance can be queried using a time range (see Scan) to retrieve the state of the data at any arbitrary time.
Export is currently limited to a single table, but given enough storage in your live cluster this can be extended to multi-table Exports, simply by setting the endTime of all Export jobs to the start time of the first job.
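A sketch of that multi-table variant in Python, under the same assumptions as before: the helper name, table names, output root, and the example timestamp are all hypothetical. Every job receives the same end time, so the exports describe all tables as of the same instant.

```python
EXPORT = [
    "hbase", "org.apache.hadoop.hbase.mapreduce.Export",
    "-D", "hbase.mapreduce.include.deleted.rows=true",
]


def export_jobs(tables, output_root, start_ms, end_ms, versions=2147483647):
    """One Export argument list per table, all sharing the same time range."""
    return [
        EXPORT + [table, f"{output_root}/{table}",
                  str(versions), str(start_ms), str(end_ms)]
        for table in tables
    ]


# end_ms would be the start time of the first job; this value is made up.
jobs = export_jobs(["users", "orders"], "/backups/run1",
                   -2147483648, 1333238400000)
for job in jobs:
    print(" ".join(job))
```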
This same trick can also be used for incremental backups. In that case the TTL has to be large enough to cover the interval between incremental backups.
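If, for example, the incremental backup frequency is daily, the TTL above can be set to 2 days (TTL=>172800). Then use Export again:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.include.deleted.rows=true <tablename> <outputdir> 2147483647 <end time of the previous backup> <start time of this Export job>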
The longer TTL guarantees that there will be no gaps that are not covered by the incremental backups.
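To make the "no gaps" argument concrete, here is a short Python sketch with hypothetical helper names. Each backup covers the range from the previous backup's end time up to its own job start time. HBase time ranges include the start and exclude the end, so adjacent ranges neither overlap nor leave a gap. The second helper reflects my reading of the TTL requirement: a cell written right after the previous backup is already one interval old when the next job starts, so the job has roughly TTL minus the interval left to read it.

```python
def backup_windows(job_start_times):
    """Pair each backup job with the [start, end) time range it exports.

    The first entry is the full backup; its start is None, meaning
    "from the earliest possible timestamp".
    """
    windows = []
    previous_end = None
    for started in job_start_times:
        windows.append((previous_end, started))
        previous_end = started
    return windows


def runtime_budget(ttl_seconds, interval_seconds):
    """Seconds an incremental job may run before unexported cells can expire."""
    return ttl_seconds - interval_seconds


# Daily backups with a 2 day TTL, as in the example above.
print(backup_windows([100, 200, 300]))
print(runtime_budget(172800, 86400))
```

With daily backups and a 2 day TTL the budget is 86400 seconds, i.e. each incremental Export has about a day to complete.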
An example:
Note that in this scenario it does not matter when the backup jobs finish.
The full backup contains only p1. The incremental backup contains p2 and the Delete. p3 is not included in any backup, yet.
The state at T2 (p1) and T5 (p1, p2, delete) can be directly restored. Using time range Scans or Gets the state as of T4 and T3 can also be retrieved, once both backups have been restored into the same HBase instance (you need HBASE-4536 for this to work correctly with Deletes).
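The bookkeeping in this example can be checked with a toy model in Python. The timeline used here is an assumption chosen to be consistent with the description: p1 is written at T1, the full backup starts at T2, p2 is written at T3, the Delete is issued at T4, the incremental backup starts at T5, and p3 is written at T6. The model only tracks which events land in which backup; it does not model what the Delete actually removes.

```python
# Assumed timeline; timestamps are just the integers 1..6.
T1, T2, T3, T4, T5, T6 = range(1, 7)

events = [(T1, "p1"), (T3, "p2"), (T4, "delete"), (T6, "p3")]


def exported(all_events, start, end):
    """Events whose timestamp falls in the half-open range [start, end)."""
    return [(ts, name) for ts, name in all_events if start <= ts < end]


def names(selected):
    return [name for _, name in selected]


full = exported(events, 0, T2)          # full backup, end time T2
incremental = exported(events, T2, T5)  # incremental backup, end time T5
restored = full + incremental           # both restored into one instance


def state_as_of(restored_events, ts):
    """Events a time range query would see up to and including ts."""
    return names(e for e in restored_events if e[0] <= ts)


print(names(full), names(incremental))
print(state_as_of(restored, T3), state_as_of(restored, T4))
```

Under these assumptions the full backup holds p1, the incremental backup holds p2 and the delete, p3 is in neither, and the restored data can answer queries as of T3 (p1, p2) and T4 (p1, p2, delete).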
Finally, if keeping enough data to cover the time between two incremental backups in the live HBase cluster is problematic for your organization, it is also possible to archive HBase's Write Ahead Logs (WAL) and then replay with the built-in WALPlayer (HBASE-5604), but that is for another post.