The search for useful information was long carried out manually. With the rapid growth in the volume of data, manual search is no longer feasible, so practitioners have developed more efficient and effective data mining techniques to search for useful patterns and knowledge.
Data mining can be regarded as an outcome of the natural evolution of information technology. It takes many forms: web data mining, social data mining, image data mining, healthcare data mining, financial data mining, e-book data mining, SQL data mining, fraud detection, stock market data mining, text or multimedia data mining, consumer segmentation, the capture, analysis and interpretation of data, news story extraction, tracking and analysing competitors’ growth, and metadata extraction from various websites.
Since data mining deals with the processing of “sensitive personal information”, data privacy and data security concerns are at their height. With the advent of the digital era and the development of various technologies, concern over data security, data protection and privacy preservation has risen to another level.
Against this backdrop, data warehousing is a key data management technology for integrating the various data sources and organising the data so that it can be effectively mined.
UNDERSTANDING DATA WAREHOUSE
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months or 12 months ago, or even older data, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with that customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
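The integrated, time-variant and non-volatile properties described above can be illustrated with a minimal Python sketch. All names here (PRODUCT_KEY_MAP, AddressRecord, AddressHistory, the sample identifiers and addresses) are hypothetical and chosen only for illustration; this is not how any particular warehouse product implements these properties.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

# Integrated: hypothetical mapping from each source system's own product
# identifier to one warehouse-wide product key.
PRODUCT_KEY_MAP = {
    ("source_a", "SKU-1001"): "P-1",
    ("source_b", "1001-WIDGET"): "P-1",
}


def warehouse_product_key(source: str, source_id: str) -> str:
    """Return the single warehouse key for a source-specific product id."""
    return PRODUCT_KEY_MAP[(source, source_id)]


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class AddressRecord:
    customer_id: int
    address: str
    valid_from: date


class AddressHistory:
    """Non-volatile store: records are appended, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: list[AddressRecord] = []

    def append(self, record: AddressRecord) -> None:
        self._rows.append(record)

    def all_addresses(self, customer_id: int) -> list[str]:
        # Time-variant: every address the customer has had, oldest first.
        rows = [r for r in self._rows if r.customer_id == customer_id]
        return [r.address for r in sorted(rows, key=lambda r: r.valid_from)]

    def current_address(self, customer_id: int) -> str:
        # What a transaction system would typically keep: the latest only.
        return self.all_addresses(customer_id)[-1]


history = AddressHistory()
history.append(AddressRecord(7, "12 Old Road", date(2021, 1, 5)))
history.append(AddressRecord(7, "48 New Street", date(2023, 6, 1)))
```

Here both source identifiers resolve to the same warehouse key, and the history keeps the earlier address alongside the current one instead of overwriting it.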
Thus, data warehousing is fundamentally a process for collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyse business data from heterogeneous sources.
The data warehouse is the core of the business intelligence system, which is built for data analysis and reporting. It is a blend of technologies and components that aids the strategic use of data. It is the electronic storage of a large amount of business information, designed for query and analysis rather than transaction processing. Data warehousing is also a process of transforming data into information and making it available to users in a timely manner.
The decision support database, i.e., the data warehouse, is maintained separately from the organization’s operational database. The data warehouse is not a product but an environment. It is an architectural construct of an information system that provides users with current and historical decision support information which is difficult to access or present in the traditional operational data store.
The design of a data warehouse helps to reduce response times and to improve the performance of queries for reports and analytics.
Data warehouse system is also known by the following names:
EVOLUTION OF DATA WAREHOUSE
The data warehouse helps users to understand and enhance their organization’s performance. The need to warehouse data evolved as computer systems became more complex and needed to handle increasing amounts of information. However, data warehousing is not a new thing.
Here are some key events in the evolution of the data warehouse:
Data warehousing started in the late 1980s, when IBM workers Paul Murphy and Barry Devlin developed the Business Data Warehouse.
However, the concept itself is credited to Bill Inmon, who is widely regarded as the father of the data warehouse.
FUNCTIONING OF DATA WAREHOUSE
A data warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases.
Data may be:
The data is processed, transformed, and ingested so that users can access the processed data in the data warehouse through business intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyse its customers more holistically. This helps to ensure that it has considered all the information available, which ultimately makes data mining possible.
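The flow described above (data arriving from several sources, being transformed, and then being queried in one place) can be sketched in Python using the standard sqlite3 module. The in-memory database stands in for the warehouse, and the source extracts, field names and table names are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical extracts from two source systems that use different field names.
crm_rows = [{"cust": "C1", "full_name": " Asha Rao "}]
billing_rows = [
    {"customer_id": "C1", "amount_inr": "1200.50"},
    {"customer_id": "C1", "amount_inr": "799.50"},
]


def transform_customer(row):
    # Standardise the record layout and trim stray whitespace from the name.
    return (row["cust"], row["full_name"].strip())


def transform_payment(row):
    # Convert the amount from text to a number.
    return (row["customer_id"], float(row["amount_inr"]))


# An in-memory SQLite database stands in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT)")
warehouse.execute("CREATE TABLE payment (customer_id TEXT, amount REAL)")

# Load the transformed rows from both sources into the one database.
warehouse.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [transform_customer(r) for r in crm_rows],
)
warehouse.executemany(
    "INSERT INTO payment VALUES (?, ?)",
    [transform_payment(r) for r in billing_rows],
)
warehouse.commit()

# A query of the kind a business intelligence tool or SQL client would run:
# total payments per customer, combining data from both sources.
customer_totals = warehouse.execute(
    "SELECT c.name, SUM(p.amount) "
    "FROM customer c JOIN payment p ON p.customer_id = c.customer_id "
    "GROUP BY c.customer_id, c.name"
).fetchall()
```

Because both sources now sit in one database, a single query can relate a customer's name from one system to payments recorded in another.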
DATA WAREHOUSE ARCHITECTURE
The exact architecture varies from one data warehouse to another. Data warehouses can be one-, two-, or three-tier structures, of which the three-tier structure is perhaps the most common.
TYPES OF DATA WAREHOUSE
Three main types of Data Warehouses are:
1. Enterprise Data Warehouse (“EDW”):
EDW is a centralized warehouse. It provides decision support service across the enterprise and offers a unified approach for organizing and representing data. It also provides the ability to classify data according to the subject and give access according to those divisions.
An enterprise warehouse collects all of the information about subjects spanning the entire organization.
EDW provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An EDW may be implemented on traditional mainframes, computer super servers, or parallel architecture platforms. It requires extensive business modelling and may take years to design and build.
2. Operational Data Store (“ODS”):
An ODS is a data store required when neither the data warehouse nor online transaction processing systems support an organization’s reporting needs.
In an ODS, the data is refreshed in real time. Hence, it is widely preferred for routine activities such as storing employee records.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from sources.
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tends to be summarized.
Data marts are usually implemented on low-cost departmental servers. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.
However, a data mart may involve complex integration in the long run if its design and planning were not enterprise-wide. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
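A dependent data mart of the kind described above can be sketched in Python with the standard sqlite3 module, assuming a hypothetical warehouse table named sales_fact. The mart is modelled here as a view over that table, restricted to marketing and summarized by customer and item. Table, column and view names are illustrative assumptions.

```python
import sqlite3

# An in-memory SQLite database stands in for the enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact (customer TEXT, item TEXT, department TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("C1", "pen", "marketing", 10.0),
        ("C1", "pen", "marketing", 15.0),
        ("C2", "ink", "marketing", 40.0),
        ("C3", "desk", "finance", 900.0),
    ],
)

# Dependent data mart: sourced directly from the warehouse, confined to the
# marketing subjects (customer, item, sales) and holding summarized data.
warehouse.execute(
    """
    CREATE VIEW marketing_mart AS
    SELECT customer, item, SUM(amount) AS total_sales
    FROM sales_fact
    WHERE department = 'marketing'
    GROUP BY customer, item
    """
)

marketing_rows = warehouse.execute(
    "SELECT customer, item, total_sales FROM marketing_mart ORDER BY customer, item"
).fetchall()
```

An independent data mart would instead load its own table directly from an operational system or a locally generated file, without going through the warehouse table.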
Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.
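As a rough illustration of the kinds of metadata just described, the Python sketch below records a warehouse object's name and definition, the source and time of an extract, and the fields that a cleaning step filled in. The class ExtractMetadata, the function clean_row and the sample values are hypothetical and do not represent any standard metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ExtractMetadata:
    object_name: str        # name of the warehouse object being described
    definition: str         # human-readable definition of that object
    source: str             # where the extracted data came from
    extracted_at: datetime  # time stamp of the extraction
    filled_fields: list = field(default_factory=list)  # fields added by cleaning


def clean_row(row, defaults, meta):
    """Fill missing fields with defaults and note each filled field in meta."""
    cleaned = dict(row)
    for name, default in defaults.items():
        if cleaned.get(name) is None:
            cleaned[name] = default
            if name not in meta.filled_fields:
                meta.filled_fields.append(name)
    return cleaned


meta = ExtractMetadata(
    object_name="customer",
    definition="One row per customer known to the business",
    source="crm_export",  # hypothetical source name
    extracted_at=datetime.now(timezone.utc),
)
cleaned = clean_row({"customer_id": "C1", "city": None}, {"city": "UNKNOWN"}, meta)
```

After the call, the metadata record shows that the city field was supplied by the cleaning step and was not present in the source.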
A metadata repository should contain the following:
India is moving towards a future of increasingly complex data in nearly every industry. Data warehousing, and the mining it enables, will therefore become fundamental across sectors. Our laws must keep pace with technological change so that the parties involved, as well as the persons whose data is being analysed, are safeguarded from any inadvertent violation of their rights.
The Digital Personal Data Protection Bill, 2022 is a welcome step in this direction, but it needs to be discussed within the industry in much more detail so that it is on par with industry standards and can actively protect against privacy violations by the parties involved in processing data through data mining and warehousing.
– Team AMLEGALS assisted by Ms. Maneesha.S(Intern)
For any query or feedback, please feel free to get in touch with firstname.lastname@example.org or email@example.com.