Understanding Data Warehousing

March 15, 2023

INTRODUCTION

Since ancient times, the search for useful information has been carried out manually. With the rapid growth in the volume of data, it is no longer possible to mine useful information by hand. Practitioners have therefore developed more efficient technologies, known as data mining techniques, to search for useful patterns and knowledge.

Data mining can be viewed as an outcome of the natural evolution of information technology. It takes many forms, including web, social, image, healthcare, financial, e-book, SQL, stock market, text and multimedia data mining, as well as mining for fraud detection and consumer segmentation. It also covers the capture, analysis and interpretation of data, the extraction of news stories, the tracking and analysis of competitors’ growth, and the extraction of metadata from websites.

Since data mining involves processing “sensitive personal information”, concerns about data privacy and data security are at their height. With the advent of the digital era and the development of new technologies, concerns about data security, data protection and privacy preservation have grown further.

Against this backdrop, data warehousing is a key data management technology for integrating various data sources and organising the data so that it can be mined effectively.

UNDERSTANDING DATA WAREHOUSE

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data that is 3 months, 6 months, 12 months old, or even older. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
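The integrated, time-variant and non-volatile properties can be illustrated with a minimal Python sketch. The names used here (PRODUCT_KEY_MAP, AddressHistory, the source labels and the sample keys) are hypothetical and chosen only for illustration; a real warehouse would enforce these properties in its database design rather than in application code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical mapping: each source identifies the product differently,
# but both resolve to a single warehouse key (integrated).
PRODUCT_KEY_MAP = {
    ("source_a", "SKU-001"): "P1",
    ("source_b", "1001"): "P1",
}


def warehouse_product_key(source: str, source_key: str) -> str:
    """Return the single warehouse key for a source-specific product id."""
    return PRODUCT_KEY_MAP[(source, source_key)]


@dataclass(frozen=True)  # frozen: a stored row cannot be modified (non-volatile)
class AddressRecord:
    customer_id: str
    address: str
    valid_from: date


class AddressHistory:
    """Append-only store: new addresses are added, old ones are kept."""

    def __init__(self) -> None:
        self._rows: List[AddressRecord] = []

    def record(self, customer_id: str, address: str, valid_from: date) -> None:
        self._rows.append(AddressRecord(customer_id, address, valid_from))

    def all_addresses(self, customer_id: str) -> List[str]:
        """Every address ever held by the customer, oldest first."""
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return [r.address for r in sorted(rows, key=lambda r: r.valid_from)]

    def address_on(self, customer_id: str, day: date) -> Optional[str]:
        """The address that was valid on a given day (time-variant)."""
        candidates = [
            r for r in self._rows
            if r.customer_id == customer_id and r.valid_from <= day
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r.valid_from).address
```

A transaction system would typically overwrite the old address; here the earlier record remains available for historical queries through address_on.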

Thus, data warehousing is fundamentally a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyse business data from heterogeneous sources.

The data warehouse is the core of a business intelligence system built for data analysis and reporting. It is a blend of technologies and components that supports the strategic use of data. It is the electronic storage of a large amount of information by a business, designed for query and analysis rather than transaction processing. Data warehousing transforms data into information and makes it available to users in time to inform their decisions.

The decision support database, i.e., the data warehouse, is maintained separately from the organization’s operational database. The data warehouse is not a product but an environment. It is an architectural construct of an information system that provides users with current and historical decision support information which is difficult to access or present in a traditional operational data store.

A data warehouse design can reduce response times and improve the performance of queries for reports and analytics.

A data warehouse system is also known by the following names:

Decision Support System (DSS)
Executive Information System
Management Information System
Business Intelligence Solution
Analytic Application
Data Warehouse

EVOLUTION OF DATA WAREHOUSE

A data warehouse helps users understand and enhance their organization’s performance. The need to warehouse data evolved as computer systems became more complex and had to handle increasing amounts of information. However, data warehousing is not a new idea.

Here are some key events in the evolution of the data warehouse:

1960s: Dartmouth and General Mills, in a joint research project, develop the terms dimensions and facts.
1970s: ACNielsen and IRI (Information Resources, Inc.) introduce dimensional data marts for retail sales.
1983: Teradata Corporation introduces a database management system specifically designed for decision support.

Data warehousing started in the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the Business Data Warehouse.

However, the concept as it is understood today was set out by Bill Inmon, who is widely regarded as the father of the data warehouse.

FUNCTIONING OF DATA WAREHOUSE

A data warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases.

Data may be:

Structured
Semi-structured
Unstructured

The data is processed, transformed, and ingested so that users can access the processed data in the data warehouse through business intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database.

By merging all of this information in one place, an organization can analyse its customers more holistically. This helps to ensure that it has considered all the information available, which ultimately makes data mining possible.
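The flow described above (data arrives from several sources, is transformed, and is loaded into one queryable database) can be sketched with Python's built-in sqlite3 module. The two sources, their layouts, and the table and column names are assumptions made for illustration; an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical extracts: two sources with different layouts and formats.
CRM_ROWS = [
    {"cust": "C1", "name": " Asha Rao "},
    {"cust": "C2", "name": "Vikram Shah"},
]
BILLING_ROWS = [
    ("C1", "2023-01-10", "120.50"),
    ("C1", "2023-02-11", "80.00"),
    ("C2", "2023-02-15", "42.25"),
]


def transform_customers(rows):
    """Clean names and convert dict records to (customer_id, name) tuples."""
    return [(r["cust"], r["name"].strip()) for r in rows]


def transform_payments(rows):
    """Convert amounts from text to numbers."""
    return [(cid, day, float(amount)) for cid, day, amount in rows]


def load(conn, customers, payments):
    """Load the transformed data into one database."""
    conn.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE payment (customer_id TEXT, paid_on TEXT, amount REAL)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", payments)
    conn.commit()


def customer_totals(conn):
    """A query that needs both sources: total payments per customer name."""
    return conn.execute(
        """
        SELECT c.name, SUM(p.amount)
        FROM customer AS c
        JOIN payment AS p ON p.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.customer_id
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform_customers(CRM_ROWS), transform_payments(BILLING_ROWS))
totals = customer_totals(conn)
```

The final query joins customer names from one source with payments from another, which is the kind of holistic view that merging the sources makes possible.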

DATA WAREHOUSE ARCHITECTURE

The exact architecture of a data warehouse will vary from one to another. Data warehouses can be one, two, or three-tier structures. Perhaps the most common, however, is the three-tier architectural structure.

Bottom tier: It is also called the data tier, in which the data is supplied to the warehouse. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
Middle tier: It is also called the application tier, in which an Online Analytical Processing (“OLAP”) server processes the data. The OLAP server is typically implemented using either a relational OLAP (“ROLAP”) model or a multidimensional OLAP (“MOLAP”) model. A ROLAP model maps operations on multidimensional data to standard relational operations. A MOLAP model is a special-purpose server that directly implements multidimensional data and operations.
Top tier: It is also called the presentation tier, which is designed for end-users with particular tools and application programming interfaces used for data extraction and analysis. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
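As an illustration of how a relational OLAP model maps a multidimensional operation to a standard relational one, the sketch below expresses a roll-up as a SQL GROUP BY. It uses Python's built-in sqlite3 module; the sales table, its dimensions (region, product, quarter) and the sample figures are assumptions, not part of any particular product.

```python
import sqlite3

# Hypothetical dimensions of a small sales fact table.
ALLOWED_DIMENSIONS = ("region", "product", "quarter")

SALES_ROWS = [
    ("North", "Widget", "Q1", 100.0),
    ("North", "Widget", "Q2", 150.0),
    ("North", "Gadget", "Q1", 70.0),
    ("South", "Widget", "Q1", 90.0),
    ("South", "Gadget", "Q2", 60.0),
]


def build_sales_table():
    """Bottom tier stand-in: a relational table holding the facts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SALES_ROWS)
    conn.commit()
    return conn


def roll_up(conn, dimensions):
    """Middle tier stand-in: aggregate sales over the chosen dimensions.

    The multidimensional roll-up becomes a plain GROUP BY query.
    """
    if not dimensions:
        raise ValueError("at least one dimension is required")
    for name in dimensions:
        # Only known column names are interpolated into the SQL text.
        if name not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {name}")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


conn = build_sales_table()
by_region = roll_up(conn, ["region"])
by_region_quarter = roll_up(conn, ["region", "quarter"])
```

A front-end reporting tool in the top tier would call something like roll_up and present the result; a multidimensional OLAP server would instead store and aggregate the cube directly.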

TYPES OF DATA WAREHOUSE

Three main types of Data Warehouses are:

1. Enterprise Data Warehouse (“EDW”):

EDW is a centralized warehouse. It provides decision support service across the enterprise and offers a unified approach for organizing and representing data. It also provides the ability to classify data according to the subject and give access according to those divisions.

An enterprise warehouse collects all of the information about subjects spanning the entire organization.

EDW provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An EDW may be implemented on traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive business modelling and may take years to design and build.

2. Operational Data Store (“ODS”):

An ODS is a data store used when neither the data warehouse nor the online transaction processing systems support an organization’s reporting needs.

An ODS is refreshed in real time. Hence, it is widely preferred for routine activities such as storing employee records.

3. Data Mart:

A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.

A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.

Data marts are usually implemented on low-cost departmental servers. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.

However, it may involve complex integration in the long run if its design and planning were not enterprise wide. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
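A dependent data mart can be sketched as a summarized subset derived from warehouse data. The example below uses Python's built-in sqlite3 module; the sales_detail warehouse table, the marketing_mart table and the sample rows are hypothetical and only illustrate the idea of confining a mart to customer, item and sales.

```python
import sqlite3

# Hypothetical detailed warehouse rows: customer, item, sale date, amount.
WAREHOUSE_ROWS = [
    ("C1", "Pen", "2023-01-05", 10.0),
    ("C1", "Pen", "2023-02-05", 15.0),
    ("C2", "Ink", "2023-01-20", 30.0),
]


def build_warehouse():
    """Stand-in for the enterprise data warehouse holding detailed data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_detail "
        "(customer TEXT, item TEXT, sale_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?, ?)", WAREHOUSE_ROWS)
    conn.commit()
    return conn


def build_marketing_mart(conn):
    """Derive a summarized mart limited to customer, item and sales."""
    conn.execute("DROP TABLE IF EXISTS marketing_mart")
    conn.execute(
        """
        CREATE TABLE marketing_mart AS
        SELECT customer, item, SUM(amount) AS total_sales
        FROM sales_detail
        GROUP BY customer, item
        """
    )
    conn.commit()


conn = build_warehouse()
build_marketing_mart(conn)
mart_rows = conn.execute(
    "SELECT customer, item, total_sales FROM marketing_mart ORDER BY customer, item"
).fetchall()
```

An independent data mart would follow the same pattern but read from operational systems or external providers instead of the warehouse table.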

METADATA REPOSITORY

Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.

A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the warehouse schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
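The two kinds of metadata listed above can be sketched as a small Python data structure. The class and field names (TableMetadata, LineageStep, currency) are hypothetical; they only show how structural information, data lineage and currency of data might be recorded together.

```python
from dataclasses import dataclass, field
from typing import List

# Currency states named in the text.
VALID_CURRENCY = ("active", "archived", "purged")


@dataclass
class LineageStep:
    """One entry in the history of migrated data."""
    source: str          # where the data was extracted from
    transformation: str  # what was done to it
    extracted_at: str    # time stamp of the extraction (ISO 8601 text)


@dataclass
class TableMetadata:
    """Structural and operational metadata for one warehouse table."""
    name: str
    columns: List[str]
    currency: str = "active"
    lineage: List[LineageStep] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.currency not in VALID_CURRENCY:
            raise ValueError(f"unknown currency state: {self.currency}")

    def add_step(self, source: str, transformation: str, extracted_at: str) -> None:
        """Append a lineage step, preserving the order of transformations."""
        self.lineage.append(LineageStep(source, transformation, extracted_at))

    def describe_lineage(self) -> List[str]:
        """Readable sequence of the transformations applied to the data."""
        return [
            f"{s.source}: {s.transformation} (extracted {s.extracted_at})"
            for s in self.lineage
        ]
```

Monitoring information such as usage statistics, error reports and audit trails could be added as further fields in the same way.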

AMLEGALS REMARKS

India is moving into a future in which nearly every industry handles increasingly complex data. Data warehousing, and the mining of warehoused data, will therefore become fundamental for all. Our laws have to keep pace with technological change so that both the parties involved and the persons whose data is being analysed are safeguarded from any inadvertent violation of their rights.

The Digital Personal Data Protection Bill, 2022 is a welcome step in this direction. However, it needs to be discussed within the industry in much more detail so that it is on par with industry standards and can actively protect against violations of privacy by the parties involved in data processing through data mining and warehousing.

– Team AMLEGALS assisted by Ms. Maneesha.S(Intern)

For any query or feedback, please feel free to get in touch with falak.sawlani@amlegals.com or mridusha.guha@amlegals.com.
