Data Engineering: A Look at Modern Data Storage Solutions
The rapid evolution of big data and data science is pushing the boundaries of traditional engineering roles, leading to the emergence of new specializations like data engineering.
While data engineering initially focused on loading external data and designing databases (building pipelines for data collection, processing, storage, and analysis), its scope has expanded significantly. It now encompasses a broad spectrum of skills, from web scraping and data cleaning to distributed computing and efficient data storage and retrieval.
Data storage and retrieval, coupled with how data is utilized and analyzed, are at the heart of data engineering. The recent surge in diverse data storage technologies raises the question: which one offers the most suitable feature set for data engineering work?
While most engineers are well-versed in SQL databases like PostgreSQL, MSSQL, and MySQL, known for their structured, row-oriented storage in relational data tables, their ubiquity makes them a topic for another day. Instead, let’s delve into three alternative data storage types gaining traction for their innovative approaches to data management.
From a data engineering perspective, these technologies are classified as search engines, document stores, and columnar stores.
- Search engines excel at handling text-based queries, outperforming SQL databases in both query capabilities and performance for text-matching tasks such as `LIKE` searches.
- Document stores offer superior data schema adaptability compared to traditional databases. Storing data as individual document objects, often in JSON format, eliminates the need for predefined schemas.
- Columnar stores specialize in single-column queries and aggregation operations. By storing the values of a column in close proximity on the hard drive, they significantly outperform traditional databases at SQL operations like `SUM` and `AVG`, as the sketch below illustrates.
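To make these contrasts concrete, here is a minimal sketch of the flavor of query each category is optimized for. All table, index, and field names are hypothetical, and the search query uses Elasticsearch-style JSON:

```python
# SQL database: partial text matching via LIKE scans and compares strings,
# which becomes slow on large tables.
sql_like = "SELECT * FROM posts WHERE body LIKE '%quick fox%';"

# Search engine: text is analyzed into fragments at index time,
# so a match query is fast and returns ranked results.
search_query = {"query": {"match": {"body": "quick fox"}}}

# Columnar store: an aggregate over one column reads only that column
# from disk, instead of every full row.
columnar_avg = "SELECT AVG(duration_ms) FROM page_views;"
```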
This article explores these three technologies, featuring Elasticsearch as a representative search engine, MongoDB as a document store, and Amazon Redshift as a columnar store.
By understanding these alternative data storage options, we can make informed decisions to select the most appropriate solution for each unique scenario.

Evaluation Criteria: How They Index, Shard, and Aggregate Data
To effectively compare these technologies, we’ll examine how they handle data indexing, sharding, and aggregation. Each indexing strategy comes with its strengths and weaknesses, favoring certain queries over others. Understanding the most frequently used queries is crucial in guiding the choice of data store.
Sharding, the process of dividing database data into smaller chunks, dictates how the infrastructure will scale with increasing data volume. Selecting a sharding methodology that aligns with growth plans and budget constraints is critical for any data science organization, regardless of size.
Finally, these technologies employ distinct data aggregation techniques. When handling massive datasets, choosing the wrong aggregation strategy can severely limit the types and performance of reports that can be generated.
As data engineers, we must carefully evaluate all three aspects – indexing, sharding, and aggregation – when considering different data storage solutions.
Contenders: A Closer Look
Elasticsearch: The Search Engine Powerhouse
Elasticsearch has rapidly gained popularity for its scalability and seamless integration capabilities. Built upon Apache Lucene, it offers robust, ready-to-use text search and indexing functionality. Beyond traditional search engine tasks such as text searches and exact-value queries, Elasticsearch boasts layered aggregation capabilities.
MongoDB: The Go-To Document Store
MongoDB has established itself as the leading NoSQL database, renowned for its user-friendliness and flexibility. It supports rich, adaptable querying for navigating complex documents. Frequently accessed fields can be optimized with indexing, and MongoDB offers a multi-stage pipeline for efficient aggregation of large datasets.
Amazon Redshift: The Columnar Store Champion
Columnar databases like Amazon Redshift have gained significant traction, particularly in data analytics, alongside the rise of NoSQL databases. By storing data in columns instead of rows, Redshift can read only the columns a query touches directly from disk, dramatically accelerating aggregation operations.
Data Indexing Strategies
Elasticsearch: Mastering Text Indexing
Search engines are essentially data stores specializing in indexing textual data. Unlike other data stores that rely on exact field values for indexing, search engines allow retrieval based on partial (typically text) field matches.
This retrieval is achieved through analyzers that automatically process each field. Analyzers break down field values into smaller components, creating multiple index keys. For instance, a basic analyzer might dissect the phrase “the quick brown fox jumped over the lazy dog” into individual words like “the,” “quick,” “brown,” “fox,” and so on.
This approach enables users to find documents by searching for fragments of their content, with results ranked by the number of matching fragments within each document.
More sophisticated analyzers leverage techniques like edit distances, n-grams, and filtering by stopwords to construct comprehensive retrieval indices.
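As a minimal sketch of how this looks in practice, assuming a local cluster and a 7.x-style official Python client, the snippet below asks the standard analyzer to tokenize the phrase from the example above and then runs a partial-match query; the `articles` index and `body` field are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Ask the standard analyzer to break a phrase into index tokens.
resp = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "the quick brown fox jumped over the lazy dog",
})
print([t["token"] for t in resp["tokens"]])
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

# A match query retrieves documents containing any of the analyzed
# fragments, ranked by how well each document matches.
results = es.search(
    index="articles",  # hypothetical index
    body={"query": {"match": {"body": "quick fox"}}},
)
```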
MongoDB: Flexible Indexing for Diverse Needs
As a versatile data store, MongoDB offers significant flexibility in indexing data. Unlike Elasticsearch, which indexes every field by default, MongoDB only indexes the _id field automatically, requiring manual index creation for frequently queried fields.
While MongoDB’s text analyzer may not be as powerful as Elasticsearch’s, it compensates with its versatility in indexing methods. From compound and geospatial indices for optimized querying to TTL and sparse indices for storage reduction, MongoDB caters to a wide range of needs.
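A brief sketch of this range of index types, using PyMongo against an assumed local deployment; the database, collection, and field names are all hypothetical:

```python
from pymongo import MongoClient, ASCENDING, TEXT

db = MongoClient("mongodb://localhost:27017")["blog"]  # assumed deployment

# Compound index: speeds up queries that filter on author, then date.
db.posts.create_index([("author", ASCENDING), ("published_at", ASCENDING)])

# Text index: MongoDB's simpler counterpart to search-engine tokenization.
db.posts.create_index([("body", TEXT)])

# TTL index: session documents expire a day after `created_at`,
# automatically reclaiming storage.
db.sessions.create_index("created_at", expireAfterSeconds=86400)

# Sparse index: only documents that actually contain `nickname` are
# indexed, keeping the index small.
db.users.create_index("nickname", sparse=True)
```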
Redshift: Performance through Sorting
Unlike Elasticsearch, MongoDB, or even traditional databases like PostgreSQL, Amazon Redshift does not employ traditional indexing. Instead, it prioritizes consistent on-disk sorting to achieve faster query times.
Users can define a specific order for column values as the table’s sort key. By sorting data on the disk, Redshift can efficiently skip entire data blocks during retrieval if their values fall outside the queried range, significantly enhancing performance.
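A sketch of what defining a sort key looks like from Python via psycopg2; the cluster endpoint, credentials, and table are all hypothetical:

```python
import psycopg2

conn = psycopg2.connect(  # assumed cluster endpoint and credentials
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn.cursor() as cur:
    # Rows stay sorted by event_time on disk, so a query with a
    # time-range predicate can skip whole blocks outside the range.
    cur.execute("""
        CREATE TABLE events (
            event_time  TIMESTAMP,
            user_id     BIGINT,
            event_type  VARCHAR(64)
        )
        SORTKEY (event_time);
    """)
    conn.commit()
```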
Sharding: Scaling for Growth
Elasticsearch: Distributed Architecture for Scalability
Built on top of Lucene, Elasticsearch is designed for horizontal scalability and production readiness. It achieves scalability by creating multiple Lucene instances (shards) and distributing them across multiple nodes (servers) within a cluster.
By default, documents are routed to their corresponding shards based on their _id field. During retrieval, the node receiving the request broadcasts the query to the relevant shards, aggregates the results, and ranks them before delivering the final output.
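Shard counts are fixed when an index is created; a minimal sketch with the same assumed 7.x-style Python client and hypothetical index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Each primary shard is a separate Lucene instance that the cluster
# distributes (along with its replicas) across the available nodes.
es.indices.create(
    index="articles",  # hypothetical index
    body={"settings": {
        "number_of_shards": 5,    # primaries to spread across nodes
        "number_of_replicas": 1,  # one copy of each, for failover
    }},
)
```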
MongoDB: Router-Based Sharding for Efficiency
Within a MongoDB cluster, three types of servers collaborate: routers, configs, and shards. Scaling the router servers allows the system to handle a higher volume of requests, while the shard servers handle the bulk of the processing load.
Similar to Elasticsearch, MongoDB routes documents to their respective shards based on their _id field (by default). Upon query execution, the config server informs the router about the relevant shards, and the router server distributes the query and aggregates the results.
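A sketch of enabling sharding from PyMongo, assuming a connection to a mongos router; the database and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://router-host:27017")  # assumed mongos router

# Enable sharding for the database, then shard the collection on a
# hashed _id so documents spread evenly across the shard servers.
client.admin.command("enableSharding", "blog")
client.admin.command("shardCollection", "blog.posts",
                     key={"_id": "hashed"})
```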
Redshift: Leader-Node Architecture for Centralized Control
An Amazon Redshift cluster consists of a single leader node and multiple compute nodes. The leader node manages query compilation, distribution, and aggregation of intermediate results.
Unlike MongoDB’s scalable router servers, Redshift’s leader node is static and cannot be scaled horizontally. While this introduces a potential bottleneck, it also enables efficient caching of compiled execution plans for frequently executed queries.
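While the leader node itself cannot be scaled, how rows are spread across the compute nodes is controlled by the table's distribution style; a hypothetical sketch:

```python
import psycopg2

conn = psycopg2.connect(  # assumed cluster endpoint and credentials
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE page_views (
            user_id   BIGINT,
            url       VARCHAR(2048),
            viewed_at TIMESTAMP
        )
        DISTSTYLE KEY
        DISTKEY (user_id)    -- co-locate each user's rows on one node
        SORTKEY (viewed_at);
    """)
    conn.commit()
```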
Aggregating Data: Extracting Insights
Elasticsearch: Powerful Aggregations, Limited Intra-Document Comparisons
Elasticsearch enables document bucketing based on exact values, ranges, temporal data, and even geolocation. These buckets can be further nested for finer granularity. The platform allows the calculation of metrics like means and standard deviations at each aggregation level, enabling hierarchical analysis within a single query.
However, as a document-based store, Elasticsearch faces limitations in comparing fields within the same document. While it excels at filtering documents where a field like “followers” exceeds a specific value (e.g., greater than 10), it cannot easily compare “followers” to another field like “following” within the same document.
Workarounds involve injecting custom scripts as predicates, which can be suitable for ad-hoc analysis but may degrade performance in production environments.
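The snippet below contrasts the straightforward constant comparison with the script-based workaround, again assuming a 7.x-style Python client and a hypothetical `users` index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Comparing a field against a constant is fast and idiomatic...
easy = {"query": {"range": {"followers": {"gt": 10}}}}

# ...but comparing two fields of the same document needs a script
# predicate: fine for ad-hoc analysis, slow in production.
workaround = {"query": {"bool": {"filter": {"script": {
    "script": "doc['followers'].value > doc['following'].value"
}}}}}

for body in (easy, workaround):
    es.search(index="users", body=body)  # hypothetical index
```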
MongoDB: Flexible Aggregation Pipeline with Caveats
MongoDB’s Aggregation Pipeline offers a robust and efficient solution for data aggregation. True to its name, it processes returned data in stages, with each step potentially filtering, aggregating, transforming documents, introducing new metrics, or unwinding previously aggregated groups.
This stage-wise approach, combined with the ability to reduce documents and fields to only the filtered elements, helps minimize memory consumption. Compared to Elasticsearch and even Redshift, the Aggregation Pipeline provides exceptional flexibility in data exploration.
Despite its adaptability, MongoDB shares Elasticsearch’s limitation in intra-document field comparison. Moreover, certain operations, like $group, necessitate passing results to the master node, hindering the benefits of distributed computing.
Users unfamiliar with the stage-wise pipeline approach might find certain tasks counterintuitive. For instance, summing up elements within an array field requires two steps: first, $unwind to expand the array, followed by the $group operation.
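A minimal sketch of that two-step pattern with PyMongo, summing a hypothetical `scores` array per document:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["blog"]  # assumed deployment

pipeline = [
    # Step 1: expand the array into one document per element.
    {"$unwind": "$scores"},
    # Step 2: regroup by the original _id, summing the elements.
    {"$group": {"_id": "$_id", "total_score": {"$sum": "$scores"}}},
]
for doc in db.posts.aggregate(pipeline):  # hypothetical collection
    print(doc["_id"], doc["total_score"])
```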
Redshift: SQL-Powered Aggregation for Massive Datasets
The advantages of Amazon Redshift cannot be overstated, particularly when dealing with large datasets. Redshift effortlessly handles complex aggregations that might cripple MongoDB, making it a lifesaver for analyzing high-volume data like mobile traffic.
Its support for SQL ensures a smooth transition for traditional database engineers migrating their queries. Beyond onboarding ease, SQL remains a proven, scalable, and powerful query language, effortlessly supporting intra-document/row field comparisons. Redshift further enhances performance by compiling and caching frequently used queries on the compute nodes.
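A sketch of the kind of query this enables, combining an intra-row field comparison with a grouped aggregation over a hypothetical `users` table:

```python
import psycopg2

conn = psycopg2.connect(  # assumed cluster endpoint and credentials
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn.cursor() as cur:
    cur.execute("""
        SELECT country,
               COUNT(*)       AS n_users,
               AVG(followers) AS avg_followers
        FROM users
        WHERE followers > following  -- intra-row comparison, trivial in SQL
        GROUP BY country
        ORDER BY n_users DESC;
    """)
    for row in cur.fetchall():
        print(row)
```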
However, as a relational database, Redshift lacks the schema flexibility offered by MongoDB and Elasticsearch. Optimized for read operations, it experiences performance degradation during updates and deletions.
Maintaining optimal read times necessitates keeping rows sorted, adding another layer of operational complexity. Designed for petabyte-scale challenges, Redshift comes at a cost and might not be a worthwhile investment unless facing scalability issues with other databases.
Choosing the Right Tool for the Job
This article explored three distinct technologies – Elasticsearch, MongoDB, and Amazon Redshift – through the lens of data engineering. However, declaring a clear winner is impossible, as each technology excels in its storage type category.
The optimal choice for data engineering hinges on the specific use case.
- MongoDB shines as a starting point, offering flexibility when the data schema is still evolving. However, it might not outperform specialized databases in specific scenarios.
- Elasticsearch, while providing a fluid schema similar to MongoDB, prioritizes multiple indices and text queries at the expense of write performance and storage footprint. Consider migrating to Elasticsearch when managing numerous indices in MongoDB becomes cumbersome.
- Redshift demands a predefined schema and lacks MongoDB’s adaptability. In return, it reigns supreme in queries involving single or a limited number of columns. If budget allows, Amazon Redshift emerges as a potent tool when other databases struggle to handle massive data volumes.