欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

程序员文章站 2022-05-01 12:13:07
...

数据库 数据仓库 数据集市

Implementing best data warehouse designs and practices such as data lineage reduces the need to ever have to restore an entire relational data warehouse. However, sometimes there are instances whereby you have inherited poorly designed data warehouse environments that leaves you with no other options but to perform an entire database restore in an event of a sudden disaster. I recently found myself in a similar situation of having to recover one of my data mart following a data integrity issue wherein all data of a type 1 dimension was updated/overwritten using an incorrect source file. In this article I take a look at how different approaches can be utilised to restore the compromised SQL Server-based data mart back to its “good state”.

实施最佳的数据仓库设计和实践(例如数据沿袭)可以减少恢复整个关系数据仓库的需要。 但是,有时在某些情况下您继承了设计不佳的数据仓库环境,使您别无选择,只能在发生突发灾难时执行整个数据库还原。 最近,我发现自己处于类似的情况,因为数据完整性问题(其中使用错误的源文件更新/覆盖了所有类型1维的所有数据),不得不恢复我的一个数据集市。 在本文中,我将探讨如何利用不同的方法将已损坏的基于SQL Server的数据集市恢复到“良好状态”。

背景 (Background)

Likewise, such a topic is better discussed through practical examples. Thus, below are the details of a fictitious data warehouse that I will refer to throughout this article. Figure 1 indicates my fictitious data warehouse, selectSIFISOBlogs2014, as well as a recently restored backup of the same data warehouse labelled selectSIFISOBlogs2014_BAK.

同样,可以通过实际示例更好地讨论这一主题。 因此,下面是我将在本文中引用的虚拟数据仓库的详细信息。 图1指示了我的虚拟数据仓库selectSIFISOBlogs2014,以及最近恢复的相同数据仓库的备份,该备份标签为selectSIFISOBlogs2014_BAK

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

Both selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK contains dimensions and fact objects that relate to each other in a multidimensional star schema, as shown in Figure 2.

selectSIFISOBlogs2014selectSIFISOBlogs2014_BAK都包含多维星型架构中彼此相关的维和事实对象, 如图2所示。

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

The customer dimension is the compromised object that has recently been updated with an incorrect source. For instance, Figure 3-4 compares the two versions of customer dimension, as it can be seen that whilst the version in the selectSIFISOBlogs2014_BAK contains correct customer name values the version in selectSIFISOBlogs2014 has been updated using method of payment (MOP) source file in such a way that customer name values are now set to either CASH or ELECTRONIC.

客户维度是最近使用不正确来源更新的受害对象。 例如, 图3-4客户维度的两个版本进行比较,因为这可以看出,虽然在selectSIFISOBlogs2014_BAK版本包含正确的客户名称值selectSIFISOBlogs2014的版本已经使用支付(MOP)源文件的方法,在这样的更新现在可以将客户名称值设置为CASHELECTRONIC

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

Another significant change between the data in the two versions of the data warehouse is the amount of data they each possess. Script 1 makes use of an EXCEPT T-SQL clause to identify differences in the fruit dimension between the two data warehouse environments.

两个版本的数据仓库中的数据之间的另一个重要变化是它们各自拥有的数据量。 脚本1使用EXCEPT T-SQL子句来识别两个数据仓库环境之间的水果维度上的差异。

 
SELECT [FruitKey]
      ,[FruitName]
      ,[InsertDate]
FROM [selectSIFISOBlogs2014].[DIM].[Fruit]
EXCEPT
SELECT [FruitKey]
      ,[FruitName]
      ,[InsertDate]
FROM [selectSIFISOBlogs2014_BAK].[DIM].[Fruit]
 

The results of Script 1 execution are shown in Figure 5 and they indicate that the fruit dimension in the backup database is missing an entry that contains an apricot fruit.

执行脚本1的结果如图5所示,它们表明备份数据库中的水果维度缺少包含杏果实的条目。

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

Not surprisingly, as shown in Figure 6 the results of comparing the FruitSales fact table indicates that the selectSIFISOBlogs2014 database has one more transaction relating to the sale of apricots than its counterpart in the backup database.

毫不奇怪, 如图6所示, FruitSales事实表的比较结果表明, selectSIFISOBlogs2014数据库与杏销售相关的交易比备份数据库中的对应交易多。

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

方法1:完整数据仓库还原 (Approach 1: Full Data Warehouse Restore)

Given the discrepancies observed in the two versions of customer dimensions, the simplest way of restoring this dimension to a previous “good state” is by conducting a full database restore of. Figure 7 shows a restore of selectSIFISOBlogs2014 data warehouse environment using the Restore Database wizard in SQL Server Management Studio (SSMS).

鉴于在两个版本的客户维度中都观察到差异,将这个维度还原到先前的“良好状态”的最简单方法是对它进行完整的数据库还原。 图7显示了使用SQL Server Management Studio(SSMS)中的“还原数据库”向导还原selectSIFISOBlogs2014数据仓库环境。

数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

  1. Pros and Cons of a Full Data Warehouse Restore

    An obvious benefit of this approach is that it is so simple that you don’t need to depend on a data warehouse team to run with the restore operation as even your typical SQL Server DBA can easily perform this operation. Unfortunately, simplicity of this approach can be its very own disadvantage. For instance, with this approach you are likely to lose any changes (i.e. data changes, table structure changes, etc.) that you would have made after the backup was created. This means that in the case of our fictitious data warehouse we are going to lose the information relating to the apricot transaction.

    Therefore, although this approach is easy to implement, it may result into some undesirable outcomes (i.e. loss of data). It is therefore recommended that prior to its implementation that you consult with consumers of your data warehouse environment on the potential consequences of this approach.

    完整数据仓库还原的利与弊

    这种方法的一个明显好处是它是如此简单,以至于您无需依赖数据仓库团队来执行还原操作,因为即使您的典型SQL Server DBA都可以轻松地执行此操作。 不幸的是,这种方法的简单性可能是其自身的缺点。 例如,使用这种方法,您很可能会丢失创建备份后所做的任何更改(即,数据更改,表结构更改等)。 这意味着,对于我们的虚拟数据仓库,我们将丢失与杏交易有关的信息。

    因此,尽管此方法易于实施,但可能会导致某些不良后果(即数据丢失)。 因此,建议在实施此方法之前,先就此方法的潜在后果咨询数据仓库环境的使用者。

方法2:部分数据仓库还原 (Approach 2: Partial Data Warehouse Restore)

Instead of restoring the entire database, we can also just restore only the affected customer dimension which will ensure that we avoid data loss in the fruit dimension and fruit sales fact table. Unlike its counterpart, the partial data restore approach involves several steps:

除了还原整个数据库,我们还可以仅还原受影响的客户维度,这将确保我们避免水果维度和水果销售事实表中的数据丢失。 与相应的方法不同,部分数据恢复方法涉及以下步骤:

  1. Ensure that both the current version and backup version of databases are online

    In our case we need to ensure that selectSIFISOBlogs2014 and selectSIFISOBlogs2014_BAK databases are online.

    确保数据库的当前版本和备份版本都处于联机状态

    在我们的案例中,我们需要确保selectSIFISOBlogs2014selectSIFISOBlogs2014_BAK数据库处于联机状态。

  2. Copy data to replace compromised object(s)

    In this step you have to restore the selected table from the backup database. This can be done using SQL Server Integration Services (SSIS) packages, or T-SQL. Another method would be to drop the current version of the customer dimension and recreate it using the SSMS Import and Export wizard as shown in Figure 8-9.

    数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

    Figure 8: Import Data Wizard


    Figure 9: Choosing to copy Customer dimension

    复制数据以替换受损的对象

    在此步骤中,您必须从备份数据库中还原选定的表。 可以使用SQL Server Integration Services(SSIS)程序包或T-SQL来完成。 另一种方法是删除客户维度的当前版本,并使用SSMS导入和导出向导重新创建它, 如图8-9所示。

    数据库 数据仓库 数据集市_还原数据仓库/数据集市环境的两种方法

    图8:导入数据向导


    图9:选择复制客户维度
  1. Pros and Cons of a Partial Data Warehouse Restore

    The benefit of this approach is that the restore will only affect selected data warehouse objects and as a result it will not overwrite the entire environment which could be costly. Unfortunately, this approach has several drawbacks too:

    部分数据仓库还原的利与弊

    这种方法的好处是还原将只影响选定的数据仓库对象,因此不会覆盖可能会造成高昂成本的整个环境。 不幸的是,这种方法也有几个缺点:

  1. Import data wizard creates heaps

    To illustrate this point let’s look at the original definition of the customer dimension in Script 2, as it can be seen, it has several constraints that are crucial to the data consistency and integrity of this dimension.

     
    CREATE TABLE [DIM].[Customer](
    	[CustomerKey] [int] IDENTITY(1,1) NOT NULL,
    	[CustomerCode] [varchar](50) NOT NULL,
    	[CustomerName] [varchar](250) NOT NULL,
    	[InsertDate] [datetime] NOT NULL DEFAULT (getdate()),
     CONSTRAINT [PK_CustomerKey] PRIMARY KEY CLUSTERED 
    (
    	[CustomerKey] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
     
    

    Script 2: Customer dimension script with constraints

    However, the import wizard led to a creation of a dimension that is missing default constraints and primary key, as shown in Script 3.

    Script 3: Customer dimension script without constraints

    An obvious work around to this would be to modify the table after the restore and add those missing constraints back to the dimension or fact table. Another workaround, instead of using the import wizard, would be to restore the selectSIFISOBlogs2014_BAK database on the same/linked server as selectSIFISOBlogs2014; drop and recreate the customer dimension and then use T-SQL (which will include temporarily disabling identity insert) to populate the compromised dimension as shown in Script 4.

     
    SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] ON
     
    INSERT INTO [selectSIFISOBlogs2014].[DIM].[Customer]
    ([CustomerKey]
          ,[CustomerCode]
          ,[CustomerName]
          ,[InsertDate])
    SELECT ([CustomerKey]
          ,[CustomerCode]
          ,[CustomerName]
          ,[InsertDate]
    FROM [selectSIFISOBlogs2014_BAK].[DIM].[Customer]
     
    SET IDENTITY_INSERT [selectSIFISOBlogs2014].[DIM].[Customer] OFF
     
    

    Script 4: Insert statement with identity insert

    导入数据向导创建堆

    为了说明这一点,让我们看一下脚本2中客户维度的原始定义,可以看出,它具有几个约束,这些约束对于该维度的数据一致性和完整性至关重要。

    脚本2:具有约束的客户维度脚本

    但是,导入向导导致创建缺少默认约束和主键的维,如脚本3所示。

    CREATE TABLE [DIM].[Customer](
     	[CustomerKey] [int] NOT NULL,
     	[CustomerCode] [varchar](50) NOT NULL,
     	[CustomerName] [varchar](250) NOT NULL,
     	[InsertDate] [datetime] NOT NULL
     ) ON [PRIMARY]

    脚本3:无限制的客户维度脚本

    解决此问题的一个明显方法是在还原后修改表,并将缺少的约束添加回维表或事实表。 另一种解决方法,而不是使用导入向导,将还原相同/链接服务器作为selectSIFISOBlogs2014selectSIFISOBlogs2014_BAK数据库; 删除并重新创建客户维度 然后使用T-SQL(将包括暂时禁用身份插入)来填充受损的维度,如脚本4所示。

    脚本4:带有标识插入的Insert语句
  2. Different resources required

    As mentioned earlier this approach is complex and unlike the full restore approach you cannot just rely on a DBA – instead a data warehouse developer will have to be involved to write T-SQL statements or SSIS packages.

    需要不同的资源

    如前所述,这种方法很复杂,与完全还原方法不同,您不能仅依靠DBA,而是必须由数据仓库开发人员来编写T-SQL语句或SSIS包。

结论 (Conclusion)

In this article we have demonstrated, by using a fictitious data warehouse, the different approaches available to successfully restore a data warehouse environment. The benefits and disadvantages of such approaches were discussed with a recommendation that whenever possible you should inform consumers of your data warehouse environment about any planned changes to the environment.

在本文中,我们通过使用虚拟数据仓库演示了可用于成功还原数据仓库环境的不同方法。 讨论了这种方法的优点和缺点,并建议您尽可能将有关计划中的环境更改告知数据仓库环境的使用者。

参考 (Reference)

翻译自: https://www.sqlshack.com/two-methods-restoring-data-warehousedata-mart-environment/

数据库 数据仓库 数据集市