# OpenMetadata Ingestion: Filter Patterns Not Working
OpenMetadata helps you discover, understand, and manage your data assets, and its ingestion framework is a core part of that: it automatically collects metadata from a wide range of data sources. During ingestion, however, you can run into unexpected behavior when trying to fine-tune which data gets included or excluded. Recently, a user ran into an issue where their `containerFilterPattern` and `include`/`exclude` filters within the Ingestion Framework didn't seem to work as expected. This is more than a minor glitch: it can pull unnecessary metadata into your OpenMetadata instance, cluttering the catalog and making it harder to find what you're actually looking for. Let's dive into this specific problem, understand why it might be happening, and explore potential solutions.
## The Core Problem: Unexpected File Processing
The primary concern raised was that even though the `_SUCCESS` file is meant to be excluded by default, the ingestion logs clearly showed the agent attempting to parse it. This is a critical observation because `_SUCCESS` files are typically marker files indicating the completion of a job, and they often have zero bytes, making them unsuitable for metadata extraction. The log snippet provided is quite telling: `File masking/_SUCCESS was picked to infer data structure from.` followed by warnings like `Could not determine file size for ~~~/dt=20250523/_SUCCESS: Parquet file size is 0 bytes.` This indicates that the ingestion process, despite configurations intended to skip such files, is still trying to interact with them, leading to errors and wasted processing time. The expectation, naturally, is that these `_SUCCESS` files, along with anything explicitly excluded by the `containerFilterPattern`, should be completely ignored by the ingestion pipeline. When this doesn't happen, it suggests a disconnect between the configuration provided and the actual execution logic within the Ingestion Framework.
This behavior is particularly problematic in distributed data processing systems like Hadoop or Spark, where `_SUCCESS` files are a common pattern. These files are crucial for workflow management but are not actual data files. Including them in metadata ingestion can lead to malformed entries, errors in schema inference, and a generally less clean metadata catalog. The fact that the logs explicitly show the attempt to process a zero-byte file points to a potential flaw in how the filtering mechanisms are applied *before* the file content is even read for metadata extraction. It's as if the filter is being bypassed or applied too late in the process. The user's screenshots further corroborate this, showing the clear `WARNING` messages related to processing the `_SUCCESS` file.
### Understanding Filter Patterns in OpenMetadata Ingestion
OpenMetadata's Ingestion Framework is designed with flexibility in mind, allowing users to specify intricate rules for what metadata to collect. The `containerFilterPattern` is a powerful tool that uses regular expressions to include or exclude specific data containers (like databases, schemas, or tables) or even files within those containers. The goal is to give users granular control over their metadata scope. Similarly, `include` and `exclude` filters provide more direct ways to specify patterns for inclusion and exclusion. When these filters are set up, the expectation is that the ingestion process will strictly adhere to them, skipping any elements that don't match the defined criteria. For instance, if you configure an `exclude` pattern for `_SUCCESS` files, the ingestion agent should simply not attempt to open or process them. The observed behavior, where `_SUCCESS` files are still being picked up and attempted to be parsed, directly contradicts this expected functionality.
This discrepancy can arise from several potential causes. It might be that the filtering logic is applied at a stage where file identification has already occurred, but before the decision to *process* the file content is made. Alternatively, the regular expression syntax might be misinterpreted, or there could be an edge case in the library (like `pyarrow` in this scenario) that doesn't gracefully handle zero-byte files when trying to infer structure, even if the initial file selection was intended to be filtered out. The user's question about `openmetadata.json` and whether filter options are applied universally, including database ingestion, touches upon the scope and consistency of these filtering mechanisms across different ingestion types. It's vital that these filters are applied consistently, whether you're ingesting from object storage like S3 or from a structured database.
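The exact point at which these patterns are evaluated lives in the connector code, but the matching itself is ordinary regular-expression matching. As a rough mental model only (this is not OpenMetadata's implementation), an include/exclude filter usually behaves like the sketch below: an asset is kept if it matches at least one include pattern (or no includes are given) and matches no exclude pattern. Note also the anchoring pitfall: an unanchored pattern such as `sales` also matches `presales_archive`, while `^sales$` matches only the exact name.

```python
import re

def keep(name, includes=(), excludes=()):
    """Rough model of include/exclude filtering: keep a name if it matches
    at least one include pattern (or no includes are given) and no exclude."""
    if includes and not any(re.search(p, name) for p in includes):
        return False
    return not any(re.search(p, name) for p in excludes)

# Hypothetical object keys as an object-store listing might return them
keys = [
    "masking/dt=20250523/part-0000.parquet",
    "masking/dt=20250523/_SUCCESS",
]

kept = [k for k in keys if keep(k, excludes=[r".*_SUCCESS$"])]
print(kept)  # only the real data file survives the exclude rule
```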
## Debugging the Ingestion Logs
The provided ingestion logs offer a treasure trove of information for diagnosing the issue. We see entries like `[2025-12-10T00:38:11.187+0000] {parquet.py:216} INFO - Large parquet file detected (0 bytes > 52428800 bytes). Using batched reading for file: ~~~/dt=20250523/_SUCCESS`. Two things stand out: the connector has already selected the `_SUCCESS` file for Parquet reading, and the message labels a 0-byte file as "large", which suggests the size check behind that log line is not behaving as intended. The subsequent `WARNING` messages from `pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes` show that this is not merely a cosmetic warning; it is a failure to process the file, because an empty file is not a valid Parquet file. The traceback shows the error originating within the `pyarrow` library, specifically when it tries to open or infer the dataset structure from the `_SUCCESS` file.
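The failure itself is easy to reproduce outside of OpenMetadata, which helps separate the filtering question from the parsing one. The sketch below (the file name is just a local stand-in) writes an empty `_SUCCESS` marker and asks `pyarrow` to read it as Parquet; `pyarrow` raises `ArrowInvalid`, matching the "Parquet file size is 0 bytes" message in the logs.

```python
import pathlib

import pyarrow.parquet as pq
from pyarrow import ArrowInvalid

marker = pathlib.Path("_SUCCESS")  # stand-in for a job marker file
marker.write_bytes(b"")            # zero bytes, like a Spark/Hadoop _SUCCESS marker

try:
    pq.read_table(marker)          # attempt to read the empty file as Parquet
except ArrowInvalid as exc:
    # Expected: "Parquet file size is 0 bytes"
    print(f"pyarrow refused the file: {exc}")
```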
This deep dive into the logs suggests that the problem might lie in the interaction between OpenMetadata's ingestion logic and the underlying libraries used for file processing (like `pyarrow` for Parquet). While OpenMetadata's filtering might intend to skip `_SUCCESS` files, the step that identifies candidate files for metadata inference may still pass these files to `pyarrow`, which then fails when it encounters a zero-byte file. The `functools.py` frames in the traceback indicate a dispatched call, suggesting that the `_read_parquet_dispatch` method is invoked for the `_SUCCESS` file. Since `_SUCCESS` is supposed to be excluded by default in many contexts, the real problem starts earlier: the file should never have made it into the list of candidates to read at all.
One crucial aspect to consider is the timing of the filter application. Are the filters applied at the very beginning, preventing the `_SUCCESS` file from ever being considered for processing? Or are they applied later, after the file has been identified as a candidate for metadata extraction? If the latter, it's possible that the initial scan still picks up all files, and the filtering then tries to prune them. However, if the filtering is not robust enough to catch these specific marker files, they might slip through. The error originating from `pyarrow` also raises a question: should OpenMetadata's ingestion framework anticipate and explicitly handle zero-byte files identified as `_SUCCESS` *before* passing them to `pyarrow`, rather than relying solely on `pyarrow` to fail gracefully (which it doesn't, in this case)?
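If the framework (or a custom wrapper around it) wanted to be defensive here, one option is to screen out marker files and empty objects before any Parquet reader is invoked. The sketch below is a hypothetical guard, not OpenMetadata's code; the helper name and the listing format are made up for illustration. The point is the ordering: the cheap name-and-size check runs before anything tries to parse bytes.

```python
import posixpath

def is_parseable(key, size_bytes):
    """Hypothetical pre-check: drop marker files (e.g. _SUCCESS) and
    zero-byte objects before they reach a Parquet reader."""
    base = posixpath.basename(key)
    if base.startswith("_"):   # _SUCCESS and similar job markers
        return False
    return size_bytes > 0      # an empty object cannot be valid Parquet

# Hypothetical (key, size) pairs as an object-store listing might return them
listing = [
    ("masking/dt=20250523/part-0000.parquet", 734_112),
    ("masking/dt=20250523/_SUCCESS", 0),
]

candidates = [key for key, size in listing if is_parseable(key, size)]
print(candidates)  # only the real data file is left for pyarrow to read
```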
### Database Ingestion Filters: A Separate Concern?
The user also reported that the include filter `does not effect` (that is, has no effect) on the database ingestion process. This broadens the scope of the problem, suggesting that filtering might be a more general issue within the Ingestion Framework, not limited to object storage. When filters, whether for `containerFilterPattern` or direct `include`/`exclude` rules, fail to function correctly across different connectors (like S3 vs. databases), it points to a systemic problem. It's essential for these filters to behave predictably and consistently, regardless of the data source. The question about `openmetadata.json` is particularly relevant here. If the configuration file itself is not correctly parsing or applying these filter directives, then all subsequent ingestion tasks relying on it would be affected. This could stem from how the configuration schema is defined, how the ingestion task reads the configuration, or how the filter logic is implemented within the respective connector code.
For database ingestion, filters typically operate on database names, schema names, and table names. If an `include` filter for a specific table pattern isn't working, it means the ingestion agent is still scanning and attempting to collect metadata from tables that should have been excluded. This can lead to an overwhelming amount of metadata, especially in large data warehouses with many schemas and tables. The root cause might be similar to the file-based ingestion: a potential issue in how the filter patterns are matched against the database object names, or a bug in the connector's traversal logic that doesn't honor the exclusion rules. It's also possible that different ingestion types (e.g., `datalake` vs. `database`) have separate filter implementations, and one of them might be bugged.
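For database connectors, a frequent source of "my include filter does nothing" reports is a mismatch between what the pattern is written against and what the connector actually compares it to: a pattern written for a fully qualified name such as `mydb.public.orders` will never match if the filter is applied to the bare table name `orders`, and vice versa. The sketch below illustrates the mismatch in plain Python; it is not OpenMetadata code, and the names are made up. If your OpenMetadata version exposes an option to filter by fully qualified names, make sure it is consistent with how your patterns are written.

```python
import re

tables = ["orders", "orders_archive", "customers"]

# Pattern written as if it will be compared against a fully qualified name...
fqn_pattern = r"^mydb\.public\.orders$"
# ...but the filter compares patterns against the bare table name.
print([t for t in tables if re.search(fqn_pattern, t)])  # [] -- filter appears to "do nothing"

# Pattern written against the bare name behaves as expected.
name_pattern = r"^orders$"
print([t for t in tables if re.search(name_pattern, t)])  # ['orders']
```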
## Potential Solutions and Workarounds
Given the issue, here are a few approaches to consider:
1. ***Verify Filter Syntax and Scope***: Double-check the regular expressions used in your `containerFilterPattern` and `include`/`exclude` rules. Ensure they are correctly formatted and target the expected elements; subtle syntax errors or misunderstandings of regex behavior (for example, unanchored patterns matching more than intended) can cause filters not to match. Confirm the scope: is the pattern meant for files, directories, or database objects? Make sure it aligns with your ingestion type. A quick way to sanity-check patterns locally is shown in the sketch after this list.
2. ***Explicitly Exclude `_SUCCESS` Files***: While `_SUCCESS` is often excluded by default, explicitly add a rule to exclude it. For example, in your `containerFilterPattern` or an `exclude` list, you could add a pattern like `.*_SUCCESS` or similar, depending on the context and the exact structure of your paths. This might force the ingestion to recognize and skip these files.
3. ***Update OpenMetadata and Ingestion Packages***: Ensure you are running the latest stable versions of both OpenMetadata and its Ingestion packages. Bug fixes related to filtering mechanisms are often addressed in newer releases. The versions mentioned (1.10.7 for both) are relatively recent, but it's always good practice to check for any patch releases or newer minor versions that might contain a fix for this specific issue.
4. ***Review Ingestion Configuration***: Carefully examine your `openmetadata.json` or YAML configuration file. Ensure that the filter parameters are correctly placed within the relevant sections (e.g., under the `source` configuration for the specific connector) and that there are no syntax errors or misplaced keys. If you are using a UI-based creation for your ingestion, ensure all fields related to filtering are accurately filled.
5. ***Report the Bug***: If you've tried the above steps and the issue persists, it's highly recommended to report this bug to the OpenMetadata community. Provide detailed information, including your configuration, the exact log output, and the steps to reproduce. This will help the development team identify and fix the underlying problem.
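As mentioned in step 1, you can sanity-check your patterns against sample names before running an ingestion. The snippet below is a standalone testing aid, not part of OpenMetadata; the sample paths and the exclude pattern are placeholders for your own values.

```python
import re

exclude_patterns = [r".*_SUCCESS$"]   # your exclude rules go here
sample_paths = [                      # sample object keys or table names to test
    "masking/dt=20250523/part-0000.parquet",
    "masking/dt=20250523/_SUCCESS",
]

for path in sample_paths:
    excluded = any(re.search(p, path) for p in exclude_patterns)
    print(f"{path!r}: {'excluded' if excluded else 'kept'}")
```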
This problem highlights the importance of reliable filtering in metadata management. When filters fail, it undermines the efficiency and accuracy of your data catalog. By systematically debugging and applying potential solutions, users can work towards restoring the intended behavior of the OpenMetadata Ingestion Framework.
For more insights into configuring OpenMetadata Ingestion, you can refer to the official documentation:
* **OpenMetadata Documentation**: [**https://docs.open-metadata.org/**](https://docs.open-metadata.org/)
* **OpenMetadata Ingestion Framework**: [**https://docs.open-metadata.org/connectors/ingestion**](https://docs.open-metadata.org/connectors/ingestion)