Leveraging Cortex Analyst for Data Quality Insights in Snowflake

Sunny Malik
6 min read · Oct 14, 2024


Data quality is crucial for making informed business decisions. With Snowflake’s Data Quality features, Cortex Analyst, and built-in anomaly detection, you can now easily monitor and query the quality of your data using natural language while also identifying unusual patterns. Let’s explore how to set this up and use it effectively.

Enabling Data Quality Metrics

Before we can use Cortex Analyst for data quality questions, we need to enable and set up Data Quality metrics:

  1. Confirm your account supports data quality monitoring (Snowflake requires Enterprise Edition or higher for data metric functions).
  2. Create data metric functions (DMFs) or use Snowflake’s system DMFs to measure key metrics like freshness, duplicates, nulls, and unique values.
  3. Assign DMFs to your tables and views. For example, to run a daily row count check (the schedule must be set on the table before a DMF can be attached):
-- Run attached DMFs daily at midnight UTC.
ALTER TABLE my_table
  SET DATA_METRIC_SCHEDULE = 'USING CRON 0 0 * * * UTC';

-- Attach the system row count DMF to the table.
ALTER TABLE my_table
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.ROW_COUNT ON ();
  4. Configure alerts to notify you of changes in data quality (a sketch follows this list).
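One way to implement step 4 is a Snowflake alert over the built-in data quality monitoring results view. This is a minimal sketch, not a drop-in implementation: my_wh, my_email_int, and the recipient address are placeholders, and it assumes the ROW_COUNT DMF from step 3 is attached and running on schedule.

-- Email the data team if the scheduled ROW_COUNT metric reports an empty table.
-- my_wh and my_email_int are placeholders for your warehouse and an existing
-- email notification integration.
CREATE OR REPLACE ALERT my_table_row_count_alert
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
    WHERE table_name = 'MY_TABLE'
      AND metric_name = 'ROW_COUNT'
      AND value = 0
      AND measurement_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'my_email_int',
    'data-team@example.com',
    'Data quality alert: MY_TABLE is empty',
    'The scheduled ROW_COUNT metric returned 0 for MY_TABLE.'
  );

-- Alerts are created suspended; resume to start evaluation.
ALTER ALERT my_table_row_count_alert RESUME;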

Setting Up Cortex Analyst

Now that we have Data Quality metrics in place, let’s set up Cortex Analyst:

  1. Ensure you have access to Cortex Analyst (a sketch for granting access follows this list).
  2. Create a semantic model that includes your data quality metrics.
  3. Deploy the semantic model to your Snowflake account.
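For step 1, access to Snowflake Cortex features is typically governed by the SNOWFLAKE.CORTEX_USER database role (granted to PUBLIC by default, but often revoked in locked-down accounts). A minimal sketch, where data_quality_analyst is a placeholder for your own role:

-- Allow a specific role to use Cortex features such as Cortex Analyst.
-- data_quality_analyst is a placeholder role name.
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE data_quality_analyst;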

Creating the Semantic Model

The semantic model is crucial for Cortex Analyst to understand your data structure and relationships. Here’s how to create one:

  1. Define Entities: Start by identifying the main entities in your data model. For data quality, this might include tables, columns, and metrics.
  2. Specify Relationships: Define how these entities relate to each other. For example, tables have columns, and columns have associated metrics.
  3. Add Attributes: For each entity, specify relevant attributes. For a table, this might include the schema name, table name, and associated data quality metrics.
  4. Include Metrics: Add the data quality metrics you’ve set up as attributes of your entities.

Here’s an example of how your semantic model might look in YAML format:

# Name and description of the semantic model.
name: <name>
description: <string>

# A semantic model can contain one or more logical tables.
tables:

  # A logical table on top of a base table.
  - name: <name>
    description: <string>

    # The fully qualified name of the base table.
    base_table:
      database: <database>
      schema: <schema>
      table: <base table name>

    # Dimension columns in the logical table.
    dimensions:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>
        cortex_search_service_name: <string>

    # Time dimension columns in the logical table.
    time_dimensions:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        unique: <boolean>

    # Measure columns in the logical table.
    measures:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>
        data_type: <data type>
        default_aggregation: <aggregate function>

    # Commonly used filters over the logical table.
    filters:
      - name: <name>
        synonyms: <array of strings>
        description: <string>
        expr: <SQL expression>

# Example questions and the queries that answer them.
verified_queries:

  # Verified Query (1 of n)
  - name: # A descriptive name of the query.
    question: # The natural language question that this query answers.
    verified_at: # Optional: Time (in seconds since the UNIX epoch, January 1, 1970) when the query was verified.
    verified_by: # Optional: Name of the person who verified the query.
    use_as_onboarding_question: # Optional: Marks this question as an onboarding question for the end user.
    sql: # The SQL query for answering the question.

5. Deploy the Model: Save the model as a YAML file and upload it to a stage in your Snowflake account so Cortex Analyst can reference it (see the sketch below).
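A minimal sketch of that deployment, assuming you saved the model locally as data_quality_model.yaml (the stage name is a placeholder, and the PUT command must be run from a client such as SnowSQL):

-- Create a stage for semantic models and upload the YAML file uncompressed.
CREATE STAGE IF NOT EXISTS semantic_models;
PUT file://data_quality_model.yaml @semantic_models AUTO_COMPRESS = FALSE;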

Asking Data Quality Questions

With everything set up, you can now use Cortex Analyst to ask natural language questions about your data quality. Here are some examples:

  1. “What’s the current row count of the sales table?”
  2. “How many null values are in the customer_email column of the customers table?”
  3. “Show me the trend of duplicate orders over the last month.”
  4. “What’s the freshness of data in the inventory table?”

Cortex Analyst will interpret these questions, generate the appropriate SQL queries using the data quality metrics you’ve set up, and return the results.
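Cortex Analyst generates its own SQL, but you can sanity-check its answers by invoking the system DMFs ad hoc. For instance, question 2 maps naturally onto the NULL_COUNT DMF (the table and column names below simply mirror that example question):

-- Ad-hoc check: how many nulls are in customers.customer_email right now?
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT customer_email FROM customers);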

Anomaly Detection Using Snowflake’s Built-in Model

In addition to using Cortex Analyst for data quality questions, Snowflake provides a powerful built-in anomaly detection capability through the SNOWFLAKE.ML.ANOMALY_DETECTION class. This feature lets you identify outliers in your time-series data, which is crucial for detecting data quality issues or unusual patterns. Here's how to implement anomaly detection using Snowflake's built-in model:

  1. Prepare Your Data: First, create a view or table with your historical data. This will be used to train the model.
-- Aggregate the raw data to one value per timestamp for training.
CREATE OR REPLACE VIEW historical_data_view AS
SELECT timestamp, SUM(value) AS total_value
FROM your_data_table
WHERE timestamp BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY timestamp;
  2. Create the Anomaly Detection Model: Use the SNOWFLAKE.ML.ANOMALY_DETECTION class to create a model based on your historical data.
CREATE OR REPLACE SNOWFLAKE.ML.ANOMALY_DETECTION anomaly_model(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'historical_data_view'),
  TIMESTAMP_COLNAME => 'TIMESTAMP',
  TARGET_COLNAME => 'TOTAL_VALUE',
  LABEL_COLNAME => ''
);
  3. Detect Anomalies: Once your model is created, you can use it to detect anomalies in new data.
CALL anomaly_model!DETECT_ANOMALIES(
  INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'new_data_view'),
  TIMESTAMP_COLNAME => 'TIMESTAMP',
  TARGET_COLNAME => 'TOTAL_VALUE'
);
  4. Analyze Results: The DETECT_ANOMALIES method returns a result set whose columns include:
  • TS: the timestamp of the data point
  • Y: the actual value
  • FORECAST: the value predicted by the model
  • LOWER_BOUND: the lower bound of the prediction interval
  • UPPER_BOUND: the upper bound of the prediction interval
  • IS_ANOMALY: a boolean indicating whether the point is considered an anomaly

You can save these results to a table for further analysis:

CREATE OR REPLACE TABLE anomaly_results AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

5. Visualize Anomalies: You can use Snowflake’s Snowsight or connect to external visualization tools to create charts that highlight the detected anomalies.

Integrating Anomaly Detection with Cortex Analyst

To make the most of both Cortex Analyst and Snowflake’s anomaly detection capabilities, you can:

  1. Use Cortex Analyst to ask high-level questions about your data quality and trends.
  2. Employ the anomaly detection model to automatically flag unusual data points.
  3. Use Cortex Analyst again to dive deeper into the flagged anomalies, asking questions like “Why did we see an anomaly in sales data on July 15th?” (a query for pulling the flagged rows first is sketched below).
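As a starting point for step 3, you can list just the flagged points from the anomaly_results table created earlier. A minimal sketch, assuming the DETECT_ANOMALIES output columns described above:

-- List only the anomalous points, largest deviations from the forecast first.
SELECT ts, y, forecast, lower_bound, upper_bound
FROM anomaly_results
WHERE is_anomaly
ORDER BY ABS(y - forecast) DESC;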

This integrated approach allows you to combine the power of natural language processing with sophisticated machine learning models, providing a robust system for maintaining data quality and gaining insights from your data.

Building a Data Quality Dashboard

To make this process even more user-friendly, you can create a Streamlit application that interfaces with Cortex Analyst:

  1. Set up a Python environment and install the required libraries:
pip install streamlit snowflake-snowpark-python

2. Create a Streamlit app that connects to your Snowflake account and uses the Cortex Analyst API.

3. Add input fields for users to enter their data quality questions.

4. Display the results in an easy-to-read format, possibly with visualizations. Refer to my previous blog for details.

Benefits of Using Cortex Analyst for Data Quality

By leveraging Cortex Analyst for data quality questions, you gain several advantages:

  • Accessibility: Business users can get data quality insights without needing to write SQL queries.
  • Accuracy: Snowflake reports that Cortex Analyst achieves over 90% SQL accuracy on real-world use cases.
  • Efficiency: Quickly identify and address data quality issues without burdening data analysts.
  • Customization: Tailor the semantic model to your specific data quality needs.

Conclusion

Combining Snowflake’s Data Quality features with Cortex Analyst and anomaly detection provides a powerful toolkit for monitoring and querying data quality using natural language while also identifying unusual patterns. This approach democratizes access to data quality insights, enabling business users to make data-driven decisions with confidence in the underlying data’s quality.

By following the steps outlined in this blog, including setting up a comprehensive semantic model and implementing anomaly detection, you can create a system that allows anyone in your organization to ask data quality questions, get accurate, timely answers, and automatically identify potential issues. This not only improves data governance but also fosters a data-driven culture across your entire organization.

Remember, the key to success is a well-defined semantic model, properly set up Data Quality metrics, and effective use of anomaly detection. With these in place, Cortex Analyst becomes an invaluable tool for maintaining high data quality standards in your Snowflake environment, while anomaly detection helps you stay ahead of potential issues before they become significant problems.
