Overview of data processing
Data processing is a crucial step in converting raw data into valuable insights.
It involves applying various operations such as filtering, aggregating, and joining data to derive meaningful information. In the upcoming section, we will delve into the techniques used in the process of transforming and querying data.
Declare data processing logic in SQL
RisingWave uses Postgres-compatible SQL as the interface for declaring data processing. We are committed to making it easy to use while being powerful in expression.
Easy to use: By aligning with PostgreSQL’s syntax, functions, and data types, RisingWave eases the learning curve and enhances accessibility for users.Even when creating stream tasks, there is no need to know many concepts related to streaming processing.
Powerful: RisingWave fully supports and optimizes a variety of SQL features, including advanced features such OVER window and different kinds of JOINs. At the same time, we are also committed to expanding the expressive power of SQL, such as by adding semi-structured data types and corresponding expressions.
Ad hoc (on read) vs. Streaming (on write)
There are 2 execution modes in our system serving different analytics purposes. The results of these two modes are the same and the difference lies in the timing of data processing, whether it occurs at the time of data ingestion(on write) or when the query is executed(on read).
Streaming: RisingWave allows users to predefine SQL queries with CREATE MATERIALIZED VIEW statement. RisingWave continuously listens changes in upstream tables (in the FROM
clause) and incrementally update the results automatically.
Ad-hoc: Also like traditional databases, RisingWave allows users to send SELECT statement to query the result. At this point, RisingWave reads the data from the current snapshot, processes it, and returns the results.
Both modes have their unique advantages. Here are some considerations:
Cost & Performance: Compared to traditional databases, streaming mode can pre-compute and store results. The heavy lifting is done upfront, eliminating duplicate computation over each ad hoc query, and therefore has better performance and a lower cost.
Flexibility: The streaming mode is less flexible to changes in query requirements. The ad-hoc query usually is created on-the-fly to fulfill immediate and specific information needs. Unlike predefined queries, ad-hoc queries are generated in real-time based on your current requirements. They are commonly used in data analysis, decision-making, and exploratory data tasks, where flexibility and quick access to information are crucial.
Examples
Create a table
To illustrate, let’s consider a hypothetical scenario where we have a table called sales_data
. This table stores information about product IDs (product_id
) and their corresponding sales amounts (sales_amount
).
sales_data
You can use the following statement to create this table.
Create a materialized view to build continuous streaming pipeline
Based on the sales_data
table, we can create a materialized view called mv_sales_summary
to calculate the total sales amount for each product.
By the SQL statement above, you have successfully transformed the data from the sales_data
table into a materialized view called mv_sales_summary
. This materialized view provides the total sales amount for each product. Utilizing materialized views allows for precomputing and storing aggregated data, which in turn improves query performance and simplifies data analysis tasks.
Ad-hoc query on materialized view’s result
Then we can directly query the materialized view to retrieve the transformed data:
Also, analysts or applications can send more flexible and complex ad-hoc queries, such as querying how many products have higher sales volumes than a specific product