Case Study - All about Data
Notes taken for my own learning.
What is data?
Data are individual facts, statistics, or items of information, often numeric. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum is a single value of a single variable.
What is a data type?
A data type is a classification that specifies which kind of value a variable can hold and which operations can be performed on it — for example integer, floating-point number, boolean, character, and string.
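As a quick illustration, every value in a program carries a type that determines what can legally be done with it:

```python
# Each value carries a type that determines which operations are valid.
values = [42, 3.14, True, "hello", [1, 2, 3]]

for v in values:
    print(type(v).__name__, "-", repr(v))

# Types constrain operations: integers support arithmetic,
# strings support concatenation, and mixing them raises TypeError.
assert isinstance(values[0], int)
assert isinstance(values[3], str)
```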
What is a data structure?
In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification.
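A minimal sketch of why the choice of data structure matters for efficient access: a hash map answers a key lookup directly, while a flat list must be scanned.

```python
# A dict (hash table) gives average O(1) lookup by key;
# a list of pairs requires an O(n) scan for the same query.
pairs = [("alice", 30), ("bob", 25), ("carol", 41)]

# List: linear search through every element.
def age_from_list(name):
    for n, a in pairs:
        if n == name:
            return a
    return None

# Dict: constant-time hash lookup.
ages = dict(pairs)

assert age_from_list("bob") == 25
assert ages["bob"] == 25
```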
What are the types of big data?
Big data is commonly grouped into three types: structured data (organized in a fixed schema, e.g. relational tables), semi-structured data (partially organized, e.g. JSON or XML), and unstructured data (no predefined model, e.g. text, images, video).
What is data governance?
Data governance (DG) is the process of managing the availability, usability, integrity and security of the data in enterprise systems, based on internal data standards and policies that also control data usage. Effective data governance ensures that data is consistent and trustworthy and doesn't get misused.
Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. A data governance framework is the collection of rules, processes, and role delegations that ensure privacy and compliance in an organization's enterprise data management.
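A toy sketch of a governance policy controlling data usage — the field names and policy here are hypothetical, chosen only to show a rule being enforced before data leaves a system:

```python
# Hypothetical policy: the 'ssn' field is sensitive and must be masked
# before records are shared outside the owning system.
POLICY = {"sensitive_fields": {"ssn"}}

def apply_policy(record, policy=POLICY):
    """Return a copy of the record with sensitive fields masked."""
    return {
        k: ("***MASKED***" if k in policy["sensitive_fields"] else v)
        for k, v in record.items()
    }

record = {"name": "Alice", "ssn": "123-45-6789", "city": "Oslo"}
shared = apply_policy(record)

assert shared["ssn"] == "***MASKED***"
assert shared["name"] == "Alice"
```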
What is the data lifecycle?
The data lifecycle is the sequence of stages data passes through, from initial creation or capture, through storage, use, sharing, and archival, to eventual deletion.
Alignment of Operational and Analytic Data Domains
Central to our concept of a Data Mesh is the idea that the same technology can be used for data-driven use cases in Operational Data and Analytic Data domains. For example, use cases for Data Mesh should span both domains.
Data Warehouse (OLAP) and the Operational Database (OLTP) are both relational databases. However, the goals of both these databases are different.
OLTP - Online Transaction Processing Systems
OLTP systems handle operational data, that is, the data generated in the operation of a particular system, for example ATM and bank transactions.
What is an operational database?
The operational database contains the detailed information used to run the day-to-day operations of the business. The data changes frequently as updates are made and reflects the current values from the latest transactions.
Operational Database is the source of information for the data warehouse.
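The transactional pattern above can be sketched with Python's built-in sqlite3 module standing in for an operational database — small transactions that touch one row at a time, with the table always holding only the current value:

```python
import sqlite3

# In-memory stand-in for an operational (OLTP) database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("INSERT INTO accounts VALUES (1, 100.0)")

# A typical transaction: withdraw 30 from account 1.
with con:  # the 'with' block commits (or rolls back) the transaction
    con.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")

# The table reflects only the current value after the last transaction.
(balance,) = con.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
assert balance == 70.0
```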
OLAP - Online Analytical Processing Systems
OLAP systems handle historical or archival data, that is, data accumulated over a long period.
For example, if we collect the last 10 years of flight reservation data, analysis can reveal meaningful trends, such as the peak times of travel and which classes (Economy/Business) different kinds of passengers choose.
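The flight-reservation example can be sketched with a small, made-up dataset — an analytical query scans many historical rows to surface a trend:

```python
from collections import Counter

# Made-up historical reservations: (year, month, travel_class).
reservations = [
    (2022, "Jul", "Economy"), (2022, "Jul", "Business"),
    (2022, "Dec", "Economy"), (2023, "Jul", "Economy"),
    (2023, "Jul", "Economy"), (2023, "Dec", "Business"),
]

# Analytical query 1: which month is the peak travel time?
by_month = Counter(month for _, month, _ in reservations)
peak_month, bookings = by_month.most_common(1)[0]
assert peak_month == "Jul" and bookings == 4

# Analytical query 2: how do bookings split across classes?
by_class = Counter(cls for _, _, cls in reservations)
assert by_class["Economy"] == 4
```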
What is a data warehouse?
A data warehouse is a system that aggregates data from multiple sources into a single, central, consistent data store to support data mining, artificial intelligence (AI), and machine learning—which, ultimately, can enhance sophisticated analytics and business intelligence. Through this strategic collection process, data warehouse solutions consolidate data from the different sources to make it available in one unified form.
A data warehouse is a relational database designed for analytical rather than transactional work, capable of processing and transforming data sets from multiple sources. On the other hand, a data mart is typically limited to holding warehouse data for a single purpose, such as serving the needs of a single line of business or company department.
| Operational Database | Data Warehouse |
| --- | --- |
| Operational systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP). |
| Operational systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data. |
| Data within operational systems is updated regularly as needed. | Non-volatile: new data may be added regularly, but once added it is rarely changed. |
| Designed for real-time business dealings and processes. | Designed for analysis of business measures by subject area, categories, and attributes. |
| Optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. |
| Optimized for validation of incoming information during transactions; uses validation data tables. | Loaded with consistent, valid information; requires no real-time validation. |
| Supports thousands of concurrent clients. | Supports relatively few concurrent clients compared to OLTP. |
| Operational systems are widely process-oriented. | Data warehousing systems are widely subject-oriented. |
| Usually optimized to perform fast inserts and updates of relatively small volumes of data. | Usually optimized to perform fast retrievals of relatively large volumes of data. |
| Data in. | Data out. |
| Less data accessed per query. | Large amounts of data accessed per query. |
| Relational databases are created for online transaction processing (OLTP). | Data warehouses are designed for online analytical processing (OLAP). |
What is a data mart?
A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case.
A data mart (as noted above) is a focused version of a data warehouse that contains a smaller subset of data important to and needed by a single team or a select group of users within an organization.
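A data mart can be pictured as a filtered subset of warehouse data carved out for one team — the rows and column layout here are invented for illustration:

```python
# Warehouse rows: (order_id, department, revenue).
warehouse = [
    (1, "electronics", 900.0),
    (2, "clothing", 120.0),
    (3, "electronics", 450.0),
    (4, "grocery", 60.0),
]

# The electronics team's data mart: only the subset that team needs.
electronics_mart = [row for row in warehouse if row[1] == "electronics"]

assert len(electronics_mart) == 2
assert sum(r[2] for r in electronics_mart) == 1350.0
```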
What is a data lake?
A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits.
A data lake, too, is a repository for data. It provides massive storage of unstructured or raw data fed in from multiple sources, but the information has not yet been processed or prepared for analysis. Because they can store data in a raw format, data lakes are more accessible and cost-effective than data warehouses: there is no need to clean and process data before ingesting it.
For example, governments can use technology to track data on traffic behavior, power usage, and waterways, and store it in a data lake while they figure out how to use the data to create “smarter cities” with more efficient services.
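As a sketch of the "smarter cities" example, a data lake simply stores each object in its native format, with no schema imposed at write time — here a temporary directory stands in for the lake:

```python
import json
import tempfile
from pathlib import Path

# A directory standing in for a data lake: raw objects land as-is.
lake = Path(tempfile.mkdtemp())

# Structured, semi-structured, and unstructured data side by side.
(lake / "traffic.csv").write_text("sensor,count\nA1,120\nB2,85\n")
(lake / "power.json").write_text(json.dumps({"meter": "M7", "kwh": 42.5}))
(lake / "notes.txt").write_text("Free-form field notes about the waterway.")

# Nothing was cleaned or transformed on ingest; files keep their native format.
assert sorted(p.name for p in lake.iterdir()) == [
    "notes.txt", "power.json", "traffic.csv",
]
```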
What is data ingestion?
Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis.
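A minimal ingestion sketch — moving rows from a CSV source into a target store where processing and analysis will later happen:

```python
import csv
import io
import sqlite3

# Source: CSV data arriving from some upstream system.
source = io.StringIO("city,reading\nOslo,12\nBergen,9\n")

# Target: a database table for later analysis.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (city TEXT, reading INTEGER)")

# Ingest: transport each source record into the target site.
for row in csv.DictReader(source):
    con.execute("INSERT INTO readings VALUES (?, ?)",
                (row["city"], int(row["reading"])))

(count,) = con.execute("SELECT COUNT(*) FROM readings").fetchone()
assert count == 2
```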
What is data curation?
Curation is the work of organizing and managing a collection of things to meet the needs and interests of a specific group of people. Collecting things is only the beginning. Organizing and managing are the critical elements of curation—making things easy to find, understand, and access.
Data curation, then, is the work of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. Collecting datasets is only the beginning. That is what we do when we store data in data warehouses or data lakes. But organizing and managing are the essence of data curation. Making datasets easy to find, understand, and access is the purpose of data curation—a purpose that demands well-described datasets. Data curation is a metadata management activity, and data catalogs are essential data curation technology. Data catalogs are rapidly becoming the new “gold standard” for metadata management, making metadata accessible and informative for non-technical data consumers.
Data curation is a means of managing data that makes it more useful for users engaging in data discovery and analysis
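A toy data catalog illustrates curation as metadata management — the entries are hypothetical, but the point is that we search descriptions of datasets, not the data itself:

```python
# Hypothetical catalog entries: metadata about datasets, not the data itself.
catalog = [
    {"name": "sales_2023", "owner": "finance",
     "description": "Daily sales totals by region for 2023."},
    {"name": "traffic_raw", "owner": "city-ops",
     "description": "Raw traffic sensor readings, unprocessed."},
]

def find(keyword):
    """Discover datasets whose description mentions the keyword."""
    kw = keyword.lower()
    return [e["name"] for e in catalog if kw in e["description"].lower()]

assert find("sales") == ["sales_2023"]
assert find("sensor") == ["traffic_raw"]
```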
What is a lakehouse?
New systems are beginning to emerge that address the limitations of data lakes. A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing data structures and data management features similar to those in a data warehouse directly on top of low-cost cloud storage in open formats. They are what you would get if you redesigned data warehouses for the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.
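One way to picture that design — loosely inspired by table formats such as Delta Lake, though the file layout here is entirely invented: data files sit in cheap storage in an open format, while a small transaction log layered on top provides warehouse-style management of which files form the current table.

```python
import json
import tempfile
from pathlib import Path

# Cheap storage: plain files in an open format.
store = Path(tempfile.mkdtemp())
(store / "part-0.csv").write_text("id,amount\n1,10\n2,20\n")
(store / "part-1.csv").write_text("id,amount\n3,30\n")

# Warehouse-style management layered on top: a commit log records
# which files make up the current version of the table.
log = store / "_commit_log.json"
log.write_text(json.dumps(
    {"version": 1, "files": ["part-0.csv", "part-1.csv"]}))

# A reader consults the log, then scans only the committed files.
committed = json.loads(log.read_text())["files"]
rows = [line for f in committed
        for line in (store / f).read_text().splitlines()[1:]]
assert len(rows) == 3
```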
What is data mining?
Data mining, also known as knowledge discovery in data (KDD), is the process of uncovering patterns and other valuable information from large data sets.
Data mining has improved organizational decision-making through insightful data analyses. The techniques that underpin these analyses serve two main purposes: they can either describe the target dataset or predict outcomes through the use of machine learning algorithms.
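A minimal contrast between the two purposes, on made-up purchase data — a descriptive pattern count, followed by a deliberately naive predictive rule built on those counts:

```python
import itertools
from collections import Counter

# Made-up transactions: items bought together.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]

# Descriptive: uncover the most frequent item pair in the dataset.
pairs = Counter(
    pair
    for t in transactions
    for pair in itertools.combinations(sorted(t), 2)
)
top_pair, count = pairs.most_common(1)[0]
assert top_pair == ("bread", "butter") and count == 3

# Predictive (deliberately naive): given an item, predict the companion
# item it has most often appeared with in past data.
def predict_companion(item):
    best = Counter()
    for (a, b), n in pairs.items():
        if item == a:
            best[b] += n
        elif item == b:
            best[a] += n
    return best.most_common(1)[0][0]

assert predict_companion("jam") == "butter"
```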