Data Warehousing Tools
If you're anywhere in your journey with data analytics, and haven't been living under a rock (or in a server room) for the last decade, then you've certainly heard of data warehouses like those from Oracle or IBM, or of more recently emerged industry giants like Google BigQuery, Amazon Redshift, Microsoft Azure, and Snowflake.
Since their introduction and early evangelism by Bill Inmon and Ralph Kimball at the beginning of the 1990s, data warehouses have provided a single, definitive source of truth that shows what's actually happening in the business, built from data once dispersed among an enterprise's many disparate data stores. When data warehouses started to appear, they were an evolutionary step away from multiple on-premise data stores, privately hosted database servers, and even the disparate spreadsheets that separate teams within an enterprise often used to capture their data for metrics and analytics.
To derive value from your data, you need not only to have it in one place, accessible through a single, canonical access language and interface, but also to have a means to manage metadata, handle governance issues, and scale as your data grows. These are among the many challenges that data warehouses solve.
In this article, we'll cover a few of the best-known data warehouses, ETL tools and business intelligence (BI) tools.
Strictly defined, data warehouses consist of a large store of data gathered from many (often separate) sources that an enterprise uses to guide its decisions.
Part of the broader evolutionary trend of the data ecosystem (if not of software generally) is that a do-it-yourself ethos tends to be gradually replaced by expert solutions managed by specialists. The first data warehouse solutions were on-premise, and, while those remain, a proliferation of cloud-native solutions has also entered the space.
The core functionality of data warehouses is the same for on-prem and cloud-native solutions, although one can expect on-prem solutions to contain specific functionality for installation, sharding, replication, scaling and configuration that is absent from the cloud versions, as the latter are generally managed. Data warehouses all:
- store a large repository of integrated data imported from one or many disparate sources, for a single source of truth across an organization;
- require (and in some cases, enable) data cleansing, deduplication, or schema adjustments to be done on imported data;
- enable machine learning or AI to be run on large datasets to identify trends, discover hidden relationships and/or predict future events;
- enable data to be queried into different formats for consumption by different stakeholders, or exported into different systems or visualization frameworks;
- allow the generation of custom reports or ad-hoc analysis;
- facilitate data mining;
- serve data scientists, analysts, and other data consumers.
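The first few capabilities above (integration, cleansing and deduplication, ad-hoc querying) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse; the two sources, the sample records, and the function names (cleanse, integrate, build_warehouse) are assumptions made up for illustration and do not come from any particular product.

```python
import sqlite3

# Hypothetical records from two separate sources (a CRM export and a billing system).
crm_rows = [
    {"email": "Ada@Example.com ", "name": "Ada", "country": "UK"},
    {"email": "bob@example.com", "name": "Bob", "country": "US"},
]
billing_rows = [
    {"email": "ada@example.com", "name": "Ada", "country": "UK"},
    {"email": "cy@example.com", "name": "Cy", "country": "US"},
]


def cleanse(row):
    """Normalize the email so the same customer matches across sources."""
    return {**row, "email": row["email"].strip().lower()}


def integrate(*sources):
    """Merge sources, keeping the first record seen per normalized email."""
    merged = {}
    for source in sources:
        for row in map(cleanse, source):
            merged.setdefault(row["email"], row)
    return list(merged.values())


def build_warehouse(rows):
    """Load the integrated rows into an in-memory table (the stand-in warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (email TEXT PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (:email, :name, :country)", rows)
    conn.commit()
    return conn


customers = integrate(crm_rows, billing_rows)
warehouse = build_warehouse(customers)

# An ad-hoc report: how many distinct customers per country?
per_country = warehouse.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
print(per_country)
```

Ada appears in both sources with differently formatted emails, so without the cleansing step she would be counted twice. A production warehouse would apply the same idea at far larger scale and with much richer matching rules.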
On-premise data warehouses
Using an on-prem solution naturally involves purchasing, installing, and maintaining your own hardware for storing the contents of your data warehouse, in addition to managing the data it stores.
For certain companies with large, established data warehousing infrastructure, or companies with major concerns over access latency (for example, a need for millisecond response times) or data security, on-prem solutions may still be the best option.
Here’s a list of common on-prem data warehouse solutions:
Cloud-native data warehouses
Cloud-native data warehouses involve purchasing a solution hosted in the cloud and funnelling data to it, usually through an API or some other means. Because of the advantages cloud-native solutions provide, nearly all providers of traditionally on-prem solutions now have a cloud offering. Cloud-based data warehouses are generally cost-effective and quick and easy to set up; they can scale with little extra effort, have security built in, and support multi-tenancy.
What's more, cloud-native users benefit from delegating maintenance and management of their data warehouses to third parties. In addition to the labor that's freed up (for analytics or other activities), users neither have to pay an up-front hardware cost nor worry about what to do with excess hardware when scaling down.
Here's a list of common cloud-native data warehouse solutions:
ETL stands for "Extract, Transform, and Load" and consists of the tools and processes used for pulling data from one store, transforming it for placement, and finally, loading it into another (often aggregate) store. Just as with data warehouses, ETL tools have progressed over time from self-administered to cloud-native offerings.
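As a minimal sketch of the three stages, the example below extracts rows from a CSV string, transforms them, and loads them into an in-memory SQLite table. Everything here is an illustrative assumption: the sample data, the orders table, and the function names are hypothetical, and a real ETL tool would read from and write to external systems.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file, API, or database.
SOURCE_CSV = """order_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, fix types, standardize currency codes."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["currency"].upper())
        for r in rows
        if r["amount"]
    ]


def load(conn, records):
    """Load: write the transformed records into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each stage as a separate function means the transform logic can be tested without touching either the source or the destination.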
Batch run/incumbent ETL tools
Remember when you used to see your bank account updated a day after your most recent financial transaction? That's because, historically, many organizations used compute and storage resources that sat idle overnight to run nightly batches of ETL jobs. Some organizations and processes still work this way.
Here's a list of common batch run/incumbent ETL tools:
Open source ETL tools
These solutions are the evolutionary middle step between incumbent batch-based tools and fully managed cloud-based solutions. They solve some of the problems that batch-run tools do not, such as handling real-time streaming data.
Here's a list of common open source ETL tools:
Open source ETL tools have some drawbacks, but are generally a good choice when a customer isn't seeking a commercial solution.
Cloud-native ETL tools
Many of today's ETL tools are cloud-based and run in real time. Cloud-based means your ETL solution is managed, so you need not worry about hardware costs, scaling, replication, or security, because these are usually built in.
Here's a list of common cloud-native ETL tools:
Real-time ETL tools
The demand for real-time support has moved the model from batch processing to one based on message queues and streams. Kafka has become a leading distributed message queue, and companies like Alooma have built SaaS or on-prem ETL solutions atop it.
Batch processing of ETL work makes little sense if your data (or insights from it) is needed instantly. And many applications work this way today: a tweet or social media update goes live immediately, not tomorrow!
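The difference between the two models can be sketched in a few lines. In this illustration, Python's standard queue.Queue stands in for a message queue such as Kafka (it is not Kafka's API), and the events, the None sentinel, and the function names are assumptions made up for the example.

```python
import queue


def transform(event):
    """A trivial transformation applied to every event."""
    return {"user": event["user"], "action": event["action"].lower()}


def run_batch(events, sink):
    """Batch: nothing reaches the sink until the accumulated batch is processed."""
    sink.extend(transform(e) for e in events)


def run_stream(source, sink):
    """Streaming: each event is transformed and loaded as soon as it is read."""
    while True:
        event = source.get()
        if event is None:  # sentinel marking the end of the stream in this sketch
            break
        sink.append(transform(event))


events = [
    {"user": "ada", "action": "LOGIN"},
    {"user": "bob", "action": "POST"},
]

# Batch model: events pile up (for example, all day) and are processed together.
batch_sink = []
run_batch(events, batch_sink)

# Streaming model: a producer puts events on the queue and a consumer drains it.
stream = queue.Queue()
for e in events:
    stream.put(e)
stream.put(None)
stream_sink = []
run_stream(stream, stream_sink)

print(batch_sink == stream_sink)
```

Both models end up with the same data in the sink; what changes is when each record becomes available. In a real deployment the streaming consumer would run continuously rather than stop at a sentinel.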
Here's a list of common real-time ETL tools:
BI and analytics tools cover everything you do with your data to get insights once you've captured it. These include tools for visualization, data science analysis, analytics and KPIs.
Here's a list of common BI and analytics tools:
- Jupyter Notebooks
- Mode Analytics
- Periscope Data
- SQL Workbench
The data warehousing solution you use may depend on a number of factors: where your source data resides, how you route it, with whom you share it, and how you secure it. Alooma is the data warehousing solution with modern ETL built right in. With setup in mere minutes and real-time ingestion supported, you can integrate, immediately, with hundreds of data sources and rest assured that your data is secure, whether in motion or at rest.
Contact us today to see how we can help.