A data platform is not a database. Databases are the foundation of a data platform, but on their own they do not provide the capabilities needed for analytics. A data platform acts as an additional layer on top of a database, optimized for exactly this purpose. Businesses need effective, long-lasting data architectures to keep up with today’s data-related challenges. Let’s take a closer look at the functionality required to maintain a basic data platform.
What is a data platform?
Primarily, a data platform is a service that allows you to collect, process, store, access, analyze, and present data. This can be broken down into the following components:
- Data warehouse: collection, processing, storage, access
- Intelligence: analysis and presentation
- Data science: statistics and artificial intelligence
The storage and processing of data are at the core of the data platform. However, this is just the beginning. To provide an overview of common data management tasks, here is a list of 18 criteria that illustrate what a data platform entails:
- Data Architecture (infrastructure, scaling, database)
- Performance
- Data Transformation (ETL)
- Import Interfaces
- Process Automation
- Monitoring
- Data History
- Data Versioning
- Surrogate Key Management
- Analysis / Reporting
- Data Science
- External Access / API
- User-Friendliness
- Multi-User Process
- Platform Documentation
- User Documentation
- Security
- Costs
Below is a brief introduction to each of these criteria without delving into technical details.
1. Data architecture (Infrastructure, Scaling, Database)
Data architecture is a central aspect of a data platform. To find a suitable storage and database solution that meets your needs, you can choose among three basic options:
Relational database
These classic database systems, with built-in intelligence for handling large data sets efficiently, offer the most expressive analysis tools, but they are more complex to maintain. Today, they are also available as distributed systems and can handle extremely large data sets. Examples include PostgreSQL with Citus or Greenplum clustering, MariaDB/MySQL with Galera Cluster, Amazon Redshift, Oracle, MSSQL, etc.
NoSQL Sharding Database
These systems are also designed to handle extremely large data sets and sacrifice some of the classic features of relational databases to gain power in other areas. They offer much less analytical power, but are easier to manage, offer high availability, and allow for easy backups. Additional tools such as Impala or Hive aim to mimic the analytical power of SQL that is available in the relational world. If you have huge amounts of data or specific streaming or real-time requirements, you should take a look at these specialized systems. Examples include Cassandra, the Hadoop ecosystem, Elasticsearch, Druid, MongoDB, Kudu, InfluxDB, Kafka, Neo4j, Dgraph, etc.
File-based systems
It is possible to design a strategy based entirely on files. File formats like Parquet allow you to use inexpensive storage to accommodate very large data sets distributed across a number of storage nodes or in a cloud object store like Amazon S3. The main advantage is that the data storage system alone is sufficient to answer data queries. With the two options described above, by contrast, you need to run services on additional compute nodes to answer data queries.
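To make the file-based option concrete, here is a minimal sketch that queries a Parquet data set directly from object storage with pandas and pyarrow. The bucket, prefix, and column names are made up for illustration, and the s3fs package is assumed for S3 access.

```python
# Minimal sketch: answer a query directly from Parquet files in object
# storage. Bucket, prefix, and column names are hypothetical.
# Requires: pip install pandas pyarrow s3fs
import pandas as pd

orders = pd.read_parquet(
    "s3://example-data-lake/orders/year=2023/",   # assumed partitioned layout
    columns=["order_id", "customer_id", "amount"],
)

# A simple aggregation answered from the files alone -- no database server involved.
revenue_per_customer = orders.groupby("customer_id")["amount"].sum()
print(revenue_per_customer.head())
```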
When looking for software to support your data architecture, you have a few basic options:
- You can build your platform and rely on services offered by large cloud providers. On cloud platforms like AWS, Azure, or Google Cloud, you can connect a selection of simple services to create a data platform that meets our list of criteria. This may seem simple and inexpensive at a small scale, but can turn out to be quite complex and expensive when you scale up and need to customize.
- In contrast, there are platforms based on self-managed hardware, including cloud virtual machines and individual software stacks. Here you have maximum flexibility, but you also have to address many of the criteria on our list by writing your own code and custom solutions.
2. Performance
This is a key criterion when choosing a platform. Performance is mostly affected by the database system you choose. Our rule of thumb is: The higher your performance requirements, the more specialized your database system choice should be.
3. Data Transformation (ETL)
Data imported into a data platform usually needs to undergo some transformation before it can be used for analysis. This process is traditionally called ETL (Extract, Transform, Load). ETL enables businesses to collect data from multiple sources and consolidate it into a single, centralized location.
Data transformation is about changing and organizing data from different sources in such a way that the data becomes easier to use for other purposes. This can be data from sources such as ERP and CRM systems, Excel sheets and so on. A typical purpose can be reporting and analysis, where after the data transformation we put data in a data warehouse in such a way and in such a format that it becomes easier to report and analyze the data, often across different business systems.
This is the most time-consuming task in any data environment. In most cases, this task takes up to 80% of the total time. Larger data warehouses can contain thousands of ETL processes with different stages, dependencies and processing sequences.
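As a minimal sketch of such a process, the following Python snippet extracts a hypothetical CRM export, applies a few transformations, and loads the result into a warehouse table. The file name, the columns, and the SQLite target are assumptions for illustration; in practice the load step would point at your warehouse database.

```python
# Minimal ETL sketch: extract a source export, transform it, load it into a
# warehouse table. File name, columns, and the SQLite target are hypothetical.
import sqlite3
import pandas as pd

# Extract: read a raw export from a source system (e.g. a CRM).
raw = pd.read_csv("crm_customers_export.csv")

# Transform: normalize column names, fix types, remove duplicates.
customers = (
    raw.rename(columns=str.lower)
       .drop_duplicates(subset=["customer_id"])
       .assign(signup_date=lambda df: pd.to_datetime(df["signup_date"]))
)

# Load: write the cleaned data into the warehouse (SQLite stands in here).
with sqlite3.connect("warehouse.db") as conn:
    customers.to_sql("dim_customer", conn, if_exists="replace", index=False)
```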
4. Import interfaces
We categorize this into four different sections:
- Files: Files are still the most common form of data today.
- Web services: Many web services with relevant data are available online.
- Databases: Although many organizations store their data in traditional databases, direct database access is in most cases not exposed to the internet and is therefore unavailable to cloud computing platforms. Web services can be placed between on-premises databases and cloud services to handle security aspects and access control.
- Real-time streams: Real-time streams delivered by message brokers (via protocols such as WAMP, MQTT, AMQP, etc.) are still underutilized today, but they are gaining importance with the rise of IoT.
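As a minimal sketch of the last channel, the snippet below subscribes to an MQTT topic with the paho-mqtt client and appends incoming messages to a landing file for later processing. The broker address, topic, and file path are assumptions.

```python
# Minimal sketch: ingest a real-time stream from an MQTT broker into a raw
# landing file. Broker host, topic, and file path are hypothetical.
import paho.mqtt.client as mqtt  # pip install paho-mqtt

def on_message(client, userdata, message):
    # Append every received payload to a landing file for later ETL.
    with open("landing/sensor_events.jsonl", "a") as f:
        f.write(message.payload.decode("utf-8") + "\n")

# Note: with paho-mqtt >= 2.0, pass mqtt.CallbackAPIVersion.VERSION1 as the
# first argument to Client().
client = mqtt.Client()
client.on_message = on_message
client.connect("broker.example.com", 1883)
client.subscribe("sensors/+/events")  # '+' matches a single topic level
client.loop_forever()
```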
5. Process automation
When you have many sources, targets, and multiple data transformation processes, you also have many dependencies. Process automation is part of any data platform and involves coordinating a number of highly complex processes. For process scheduling, dedicated tools such as Apache Airflow, Automate, Control-M, or Luigi are available.
Process automation requires you to manage the range of data to be processed. For example, in an incremental upload scenario, each process execution must incrementally pick specific pieces of source data to pass to the target. Data scope management is typically implemented with a metadata-driven approach: dedicated metadata tables keep track of the process state for each piece and can be queried to coordinate the processing of all pieces.
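Here is a minimal sketch of this metadata-driven approach, assuming a watermark table etl_watermarks (with process as its primary key) and hypothetical staging and fact tables; SQLite stands in for the platform database.

```python
# Minimal sketch of data scope management: a watermark table records how far
# each process has come, so each run only picks up new source data.
# Table and column names are hypothetical; etl_watermarks(process) is assumed
# to be the primary key so INSERT OR REPLACE can update it.
import sqlite3

def load_increment(conn: sqlite3.Connection, process_name: str) -> None:
    cur = conn.cursor()

    # Read the current watermark (last event time this process has handled).
    cur.execute(
        "SELECT last_loaded_at FROM etl_watermarks WHERE process = ?",
        (process_name,),
    )
    row = cur.fetchone()
    last_loaded_at = row[0] if row else "1970-01-01T00:00:00"

    # Process only the rows that arrived since the last run.
    cur.execute(
        "INSERT INTO fact_orders (order_id, amount, created_at) "
        "SELECT order_id, amount, created_at FROM staging_orders "
        "WHERE created_at > ?",
        (last_loaded_at,),
    )

    # Advance the watermark so the next run continues where this one stopped.
    cur.execute(
        "SELECT COALESCE(MAX(created_at), ?) FROM staging_orders WHERE created_at > ?",
        (last_loaded_at, last_loaded_at),
    )
    new_watermark = cur.fetchone()[0]
    cur.execute(
        "INSERT OR REPLACE INTO etl_watermarks (process, last_loaded_at) VALUES (?, ?)",
        (process_name, new_watermark),
    )
    conn.commit()
```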
6. Monitoring
Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing the data flow. Errors along the way are almost inevitable. Many of these errors must be handled manually. With this amount of complexity, you need a way to monitor the processes on the platform.
7. Data History
The need to manage long histories of data is part of the core of any data platform. The very task of a data warehouse can be summarized as merging separate pieces of data into a homogeneous data history.
As data is naturally generated over time, an existing data store must continually be supplemented with new data. Technically, the validity of each record is tracked in tables using dedicated time-interval columns. Data historization is the efficient management of these time intervals as new data arrives. It differs from data versioning in that historization is concerned with real-world timestamps, while versioning is usually concerned with technical insertion stamps.
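The sketch below illustrates interval-based historization with pandas, in the spirit of a slowly changing dimension: each key keeps a history of rows with valid_from/valid_to columns, and newly arriving data closes the open interval and opens a new one. The column names (customer_id, city) are hypothetical.

```python
# Minimal sketch of interval-based historization. Column names are hypothetical.
import pandas as pd

OPEN_END = pd.Timestamp.max  # sentinel meaning "still valid"

def historize(history: pd.DataFrame, incoming: pd.DataFrame, load_date: pd.Timestamp) -> pd.DataFrame:
    """history:  columns [customer_id, city, valid_from, valid_to]
       incoming: columns [customer_id, city] -- the state observed at load_date."""
    history = history.copy()
    current = history[history["valid_to"] == OPEN_END]

    # Compare the currently valid rows with the incoming state per key.
    merged = current.merge(incoming, on="customer_id", how="right", suffixes=("_old", ""))
    changed = merged[merged["city"] != merged["city_old"]]  # changed or brand-new keys

    # Close the open interval of changed keys ...
    history.loc[
        (history["valid_to"] == OPEN_END)
        & history["customer_id"].isin(changed["customer_id"]),
        "valid_to",
    ] = load_date

    # ... and append the new state with an open-ended interval.
    new_rows = changed[["customer_id", "city"]].assign(valid_from=load_date, valid_to=OPEN_END)
    return pd.concat([history, new_rows], ignore_index=True)
```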
8. Data versioning
By versioning data, you can track data corrections over time, so that you can later recover old analyses. Versioning allows you to apply non-destructive corrections to existing data. When comparing versioning features, you need to consider the ease of creating versions and the ease of restoring or querying versions. Versioning can be handled at different system levels:
- Taking snapshots of the storage subsystem (similar to backups)
- The underlying database system may support version tracking
- Versioning can be handled by the data warehouse system
- Versioning can be implemented as custom transformation logic in user space
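As a minimal sketch of the last option, the snippet below appends corrections together with a technical insertion stamp instead of overwriting rows, so any earlier state can be queried again. Table and column names are hypothetical.

```python
# Minimal sketch of versioning in user space: corrections are appended, never
# overwritten, so old analyses can be reproduced. Names are hypothetical.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales_versioned ("
    "  sale_id INTEGER, amount REAL, inserted_at TEXT)"
)

def correct_amount(sale_id, new_amount):
    # Non-destructive correction: insert a new version instead of updating in place.
    conn.execute(
        "INSERT INTO sales_versioned (sale_id, amount, inserted_at) VALUES (?, ?, ?)",
        (sale_id, new_amount, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

def amount_as_of(sale_id, as_of):
    # Reproduce the value an analysis would have seen at time `as_of`.
    row = conn.execute(
        "SELECT amount FROM sales_versioned "
        "WHERE sale_id = ? AND inserted_at <= ? "
        "ORDER BY inserted_at DESC LIMIT 1",
        (sale_id, as_of),
    ).fetchone()
    return row[0] if row else None
```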
9. Surrogate Key Management
Data platforms are used to consolidate data from many sources with different identifiers for the respective objects. This creates the need for new key ranges for the imported objects and the need to maintain them across subsequent imports. These new keys are called surrogate keys. Creating and maintaining these keys efficiently is no easy task.
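A minimal sketch of one common approach: a mapping table assigns each (source system, natural key) pair a stable surrogate key and re-uses it on every subsequent import. Table and column names are illustrative, and SQLite stands in for the platform database.

```python
# Minimal sketch of surrogate key management via a key-mapping table.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS key_map ("
    "  surrogate_key INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  source_system TEXT NOT NULL,"
    "  natural_key   TEXT NOT NULL,"
    "  UNIQUE (source_system, natural_key))"
)

def surrogate_key(source_system, natural_key):
    # Re-use the existing key if the object was imported before ...
    row = conn.execute(
        "SELECT surrogate_key FROM key_map WHERE source_system = ? AND natural_key = ?",
        (source_system, natural_key),
    ).fetchone()
    if row:
        return row[0]
    # ... otherwise issue a new key from the surrogate key range.
    cur = conn.execute(
        "INSERT INTO key_map (source_system, natural_key) VALUES (?, ?)",
        (source_system, natural_key),
    )
    conn.commit()
    return cur.lastrowid

# The same CRM customer always maps to the same warehouse key:
assert surrogate_key("crm", "C-1001") == surrogate_key("crm", "C-1001")
```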
10. Analysis / Reporting
The purpose of a data platform is to prepare raw data for analysis and store this data for historical review. Analysis in such a platform can be performed in a variety of ways.
A number of Business Intelligence tools are solely concerned with the task of creating analytical and human-readable data extracts. To prepare data for presentation, a data platform provides you with features to create data extracts and aggregates from larger data warehouses.
Answering specific business questions by querying the data warehouse directly requires skills in analytical query languages. BI tools aim to simplify these tasks by providing point-and-click interfaces to answer basic questions such as “number of visitors per month” or “sum of sales in region X”. These tools also allow users to visualize the information via rich graphics. In almost all cases, power users will still be able to bypass these tools and run their own queries. Examples of popular BI tools include Tableau, Qlik, INSIKT, Looker, Chartio, and Superset.
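For illustration, here is the kind of query a power user might run directly against the warehouse to answer “number of visitors per month”. The page_views table and its columns are hypothetical, and SQLite stands in for the warehouse.

```python
# Minimal sketch of a direct analytical query. Names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

visitors_per_month = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', visit_date) AS month,
           COUNT(DISTINCT visitor_id)    AS visitors
    FROM   page_views
    GROUP  BY month
    ORDER  BY month
    """,
    conn,
)
print(visitors_per_month)
```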
11. Data Science
Training machine learning models is a requirement that today's data platforms must support. The most sophisticated methods are implemented not in SQL but in Python or R, together with a wide range of specialized libraries such as NumPy, pandas, scikit-learn, TensorFlow, and PyTorch, or even more specialized libraries for NLP or image recognition. Since these tasks can be computationally demanding, additional hardware is often required alongside the existing analysis hardware.
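A minimal sketch of such a workload: pull a hypothetical feature table from the warehouse and train a scikit-learn model. The table, the columns, and the model choice are assumptions for illustration.

```python
# Minimal sketch of a data science workload on top of the platform.
# Table, column, and database names are hypothetical.
import sqlite3
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

conn = sqlite3.connect("warehouse.db")
features = pd.read_sql_query(
    "SELECT recency_days, frequency, monetary, churned FROM customer_features", conn
)

X = features[["recency_days", "frequency", "monetary"]]
y = features["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```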
12. External Access / API
All the collected data on the platform will be used for different purposes. Possible channels considered here are:
- SQL Access for direct analysis also by e.g. BI tools
- API access (REST requests) as a service for websites or apps (see the sketch after this list)
- Notifications via SMS/email for end users or administrators
- File export for further processing or data delivery to other parties
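To illustrate the API channel, here is a minimal sketch that exposes a warehouse aggregate as a REST endpoint. The choice of FastAPI, the table, and the column names are assumptions, not part of any particular platform.

```python
# Minimal sketch of API access: expose a warehouse aggregate via REST.
# Framework choice, table, and column names are hypothetical.
import sqlite3
from fastapi import FastAPI

app = FastAPI()

@app.get("/sales/{region}")
def sales_for_region(region: str):
    conn = sqlite3.connect("warehouse.db")
    try:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM fact_sales WHERE region = ?",
            (region,),
        ).fetchone()
    finally:
        conn.close()
    return {"region": region, "total_sales": row[0]}

# Run with: uvicorn api:app --reload   (assuming this file is saved as api.py)
```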
13. User-Friendliness
The usability of the platform depends on the target audience. The main issue here is how easy it is to create and manage objects (such as users, data stores, tables, transformations, reports, etc.) on the platform. Often, a trade-off must be made between the level of control a user gets and the level of simplicity. Here, we must distinguish between the functionality that the platform provides and the user-generated content on the platform. In most cases, user-generated content requires the use of code, since the entire subject of data engineering and analytics is inherently complex and requires a high level of expression.
14. Multi-user process
This category evaluates the support for user interactions and sharing of work and data. This aspect involves real-time updates of user actions, collaboration, sharing and role assignments, as well as a way to discuss and comment on the platform.
15. Platform documentation
A data platform is used to implement a high degree of custom complexity with a number of users over an extended period of time. This requires detailed documentation of the user-provided content. Here we consider how the platforms support this task. Documentation can always be prepared outside the platform, but this involves a risk of information discrepancies as external documentation quickly becomes outdated.
16. User documentation
Data platforms require a certain level of user skills. Proper documentation describing the platform's functionality is therefore necessary for professional use of the platform.
17. Security
Security for data platforms can be separated into security for storage, interaction (data in transport) and access control.
18. Costs
There are three major cost drivers for a data platform:
- Licenses
- Infrastructure (hardware)
- Staff
Today, most of the software stack can be implemented to a high standard using open-source software. However, licensed software usually requires less maintenance and less low-level system knowledge.
Hardware can be rented from cloud providers on a pay-per-use basis. The same applies to storage infrastructure. To estimate your hardware costs, you need to consider an infrastructure that covers the following components:
- Database
- Data transformations
- Analysis
- Data Science
- Automation
- Monitoring
- Content storage
While the database is usually the largest component of a data platform, it is far from the only one.
Conclusion
This list of 18 criteria provides a basic starting point for evaluating data platforms in terms of their suitability as long-term, manageable data platforms. The criteria are primarily relevant to organizations that aim to collect longer data histories for more comprehensive statistics and forecasts.