DataHub
Migrate data from one DataHub instance to another.
Requires direct access to the database, kafka broker, and kafka schema registry
of the source DataHub instance.
Important Capabilities
| Capability | Status | Notes | 
|---|---|---|
| Detect Deleted Entities | ✅ | Optionally enabled via stateful_ingestion.remove_stale_metadata | 
Overview
This source pulls data from two locations:
- The DataHub database, containing a single table holding all versioned aspects
- The DataHub Kafka cluster, reading from the MCL Log topic for timeseries aspects.
All data is first read from the database, before timeseries data is ingested from kafka.
To prevent this source from potentially running forever, it will not ingest data produced after the
datahub_source ingestion job is started. This stop_time is reflected in the report.
Data from the database and kafka are read in chronological order, specifically by the
createdon timestamp in the database and by kafka offset per partition. In order to
properly read from the database, please ensure that the createdon column is indexed.
Newly created databases should have this index, named timeIndex, by default, but older
ones you may have to create yourself, with the statement:
CREATE INDEX timeIndex ON metadata_aspect_v2 (createdon);
If you do not have this index, the source may run incredibly slowly and produce significant database load.
Stateful Ingestion
On first run, the source will read from the earliest data in the database and the earliest
kafka offsets. Every commit_state_interval (default 1000) records, the source will store
a checkpoint to remember its place, i.e. the last createdon timestamp and kafka offsets.
This allows you to stop and restart the source without losing much progress, but note that
you will re-ingest some data at the start of the new run.
If any errors are encountered in the ingestion process, e.g. we are unable to emit an aspect
due to network errors, the source will keep running, but will stop committing checkpoints,
unless commit_with_parse_errors (default false) is set. Thus, if you re-run the ingestion,
you can re-ingest the data that was missed, but note it will all re-ingest all subsequent data.
If you want to re-ingest all data, you can set a different pipeline_name in your recipe,
or set stateful_ingestion.ignore_old_state:
source:
  config:
    # ... connection config, etc.
    stateful_ingestion:
      enabled: true
      ignore_old_state: true
    urn_pattern: # URN pattern to ignore/include in the ingestion
      deny:
        # Ignores all datahub metadata where the urn matches the regex
        - ^denied.urn.*
      allow:
        # Ingests all datahub metadata where the urn matches the regex.
        - ^allowed.urn.*
Limitations
- Can only pull timeseries aspects retained by Kafka, which by default lasts 90 days.
- Does not detect hard timeseries deletions, e.g. if via a datahub deletecommand using the CLI. Therefore, if you deleted data in this way, it will still exist in the destination instance.
- If you have a significant amount of aspects with the exact same createdontimestamp, stateful ingestion will not be able to save checkpoints partially through that timestamp. On a subsequent run, all aspects for that timestamp will be ingested.
Performance
On your destination DataHub instance, we suggest the following settings:
- Enable async ingestion
- Use standalone consumers
(mae-consumer
and mce-consumer)- If you are migrating large amounts of data, consider scaling consumer replicas.
 
- Increase the number of gms pods to add redundancy and increase resilience to node evictions- If you are migrating large amounts of data, consider increasing elasticsearch's
thread count via the ELASTICSEARCH_THREAD_COUNTenvironment variable.
 
- If you are migrating large amounts of data, consider increasing elasticsearch's
thread count via the 
CLI based Ingestion
Install the Plugin
pip install 'acryl-datahub[datahub]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
pipeline_name: datahub_source_1
datahub_api:
  server: "http://localhost:8080" # Migrate data from DataHub instance on localhost:8080
  token: "<token>"
source:
  type: datahub
  config:
    include_all_versions: false
    database_connection:
      scheme: "mysql+pymysql" # or "postgresql+psycopg2" for Postgres
      host_port: "<database_host>:<database_port>"
      username: "<username>"
      password: "<password>"
      database: "<database>"
    kafka_connection:
      bootstrap: "<boostrap_url>:9092"
      schema_registry_url: "<schema_registry_url>:8081"
    stateful_ingestion:
      enabled: true
      ignore_old_state: false
    urn_pattern:
    deny:
      # Ignores all datahub metadata where the urn matches the regex
      - ^denied.urn.*
    allow:
      # Ingests all datahub metadata where the urn matches the regex.
      - ^allowed.urn.*
  extractor_config:
    set_system_metadata: false # Replicate system metadata
# Here, we write to a DataHub instance
# You can also use a different sink, e.g. to write the data to a file instead
sink:
  type: datahub-rest
  config:
    server: "<destination_gms_url>"
    token: "<token>"
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description | 
|---|---|
| commit_state_interval integer | Number of records to process before committing state Default: 1000 | 
| commit_with_parse_errors boolean | Whether to update createdon timestamp and kafka offset despite parse errors. Enable if you want to ignore the errors. Default: False | 
| database_query_batch_size integer | Number of records to fetch from the database at a time Default: 10000 | 
| database_table_name string | Name of database table containing all versioned aspects Default: metadata_aspect_v2 | 
| include_all_versions boolean | If enabled, include all versions of each aspect. Otherwise, only include the latest version of each aspect. Default: False | 
| kafka_topic_name string | Name of kafka topic containing timeseries MCLs Default: MetadataChangeLog_Timeseries_v1 | 
| max_workers integer | Number of worker threads to use for datahub api ingestion. Default: 20 | 
| pull_from_datahub_api boolean | Use the DataHub API to fetch versioned aspects. Default: False | 
| database_connection SQLAlchemyConnectionConfig | Database connection config | 
| database_connection.host_port ❓ string | host URL | 
| database_connection.scheme ❓ string | scheme | 
| database_connection.database string | database (catalog) | 
| database_connection.options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. | 
| database_connection.password string(password) | password | 
| database_connection.sqlalchemy_uri string | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. | 
| database_connection.username string | username | 
| kafka_connection KafkaConsumerConnectionConfig | Kafka connection config | 
| kafka_connection.bootstrap string | Default: localhost:9092 | 
| kafka_connection.client_timeout_seconds integer | The request timeout used when interacting with the Kafka APIs. Default: 60 | 
| kafka_connection.consumer_config object | Extra consumer config serialized as JSON. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md . | 
| kafka_connection.schema_registry_config object | Extra schema registry config serialized as JSON. These options will be passed into Kafka's SchemaRegistryClient. https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient | 
| kafka_connection.schema_registry_url string | Default: http://localhost:8080/schema-registry/api/ | 
| urn_pattern AllowDenyPattern | Default: {'allow': ['.*'], 'deny': [], 'ignoreCase': True} | 
| urn_pattern.ignoreCase boolean | Whether to ignore case sensitivity during pattern matching. Default: True | 
| urn_pattern.allow array | List of regex patterns to include in ingestion Default: ['.*'] | 
| urn_pattern.allow.string string | |
| urn_pattern.deny array | List of regex patterns to exclude from ingestion. Default: [] | 
| urn_pattern.deny.string string | |
| stateful_ingestion StatefulIngestionConfig | Stateful Ingestion Config Default: {'enabled': True, 'max_checkpoint_state_size': 167... | 
| stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_apiis specified, otherwise FalseDefault: False | 
The JSONSchema for this configuration is inlined below.
{
  "title": "DataHubSourceConfig",
  "description": "Base configuration class for stateful ingestion for source configs to inherit from.",
  "type": "object",
  "properties": {
    "stateful_ingestion": {
      "title": "Stateful Ingestion",
      "description": "Stateful Ingestion Config",
      "default": {
        "enabled": true,
        "max_checkpoint_state_size": 16777216,
        "state_provider": {
          "type": "datahub",
          "config": {}
        },
        "ignore_old_state": false,
        "ignore_new_state": false
      },
      "allOf": [
        {
          "$ref": "#/definitions/StatefulIngestionConfig"
        }
      ]
    },
    "database_connection": {
      "title": "Database Connection",
      "description": "Database connection config",
      "allOf": [
        {
          "$ref": "#/definitions/SQLAlchemyConnectionConfig"
        }
      ]
    },
    "kafka_connection": {
      "title": "Kafka Connection",
      "description": "Kafka connection config",
      "allOf": [
        {
          "$ref": "#/definitions/KafkaConsumerConnectionConfig"
        }
      ]
    },
    "include_all_versions": {
      "title": "Include All Versions",
      "description": "If enabled, include all versions of each aspect. Otherwise, only include the latest version of each aspect. ",
      "default": false,
      "type": "boolean"
    },
    "database_query_batch_size": {
      "title": "Database Query Batch Size",
      "description": "Number of records to fetch from the database at a time",
      "default": 10000,
      "type": "integer"
    },
    "database_table_name": {
      "title": "Database Table Name",
      "description": "Name of database table containing all versioned aspects",
      "default": "metadata_aspect_v2",
      "type": "string"
    },
    "kafka_topic_name": {
      "title": "Kafka Topic Name",
      "description": "Name of kafka topic containing timeseries MCLs",
      "default": "MetadataChangeLog_Timeseries_v1",
      "type": "string"
    },
    "commit_state_interval": {
      "title": "Commit State Interval",
      "description": "Number of records to process before committing state",
      "default": 1000,
      "type": "integer"
    },
    "commit_with_parse_errors": {
      "title": "Commit With Parse Errors",
      "description": "Whether to update createdon timestamp and kafka offset despite parse errors. Enable if you want to ignore the errors.",
      "default": false,
      "type": "boolean"
    },
    "pull_from_datahub_api": {
      "title": "Pull From Datahub Api",
      "description": "Use the DataHub API to fetch versioned aspects.",
      "default": false,
      "hidden_from_docs": true,
      "type": "boolean"
    },
    "max_workers": {
      "title": "Max Workers",
      "description": "Number of worker threads to use for datahub api ingestion.",
      "default": 20,
      "hidden_from_docs": true,
      "type": "integer"
    },
    "urn_pattern": {
      "title": "Urn Pattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/AllowDenyPattern"
        }
      ]
    }
  },
  "definitions": {
    "DynamicTypedStateProviderConfig": {
      "title": "DynamicTypedStateProviderConfig",
      "type": "object",
      "properties": {
        "type": {
          "title": "Type",
          "description": "The type of the state provider to use. For DataHub use `datahub`",
          "type": "string"
        },
        "config": {
          "title": "Config",
          "description": "The configuration required for initializing the state provider. Default: The datahub_api config if set at pipeline level. Otherwise, the default DatahubClientConfig. See the defaults (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/graph/client.py#L19).",
          "default": {},
          "type": "object"
        }
      },
      "required": [
        "type"
      ],
      "additionalProperties": false
    },
    "StatefulIngestionConfig": {
      "title": "StatefulIngestionConfig",
      "description": "Basic Stateful Ingestion Specific Configuration for any source.",
      "type": "object",
      "properties": {
        "enabled": {
          "title": "Enabled",
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "default": false,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    },
    "SQLAlchemyConnectionConfig": {
      "title": "SQLAlchemyConnectionConfig",
      "type": "object",
      "properties": {
        "username": {
          "title": "Username",
          "description": "username",
          "type": "string"
        },
        "password": {
          "title": "Password",
          "description": "password",
          "type": "string",
          "writeOnly": true,
          "format": "password"
        },
        "host_port": {
          "title": "Host Port",
          "description": "host URL",
          "type": "string"
        },
        "database": {
          "title": "Database",
          "description": "database (catalog)",
          "type": "string"
        },
        "scheme": {
          "title": "Scheme",
          "description": "scheme",
          "type": "string"
        },
        "sqlalchemy_uri": {
          "title": "Sqlalchemy Uri",
          "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
          "type": "string"
        },
        "options": {
          "title": "Options",
          "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
          "type": "object"
        }
      },
      "required": [
        "host_port",
        "scheme"
      ],
      "additionalProperties": false
    },
    "KafkaConsumerConnectionConfig": {
      "title": "KafkaConsumerConnectionConfig",
      "description": "Configuration class for holding connectivity information for Kafka consumers",
      "type": "object",
      "properties": {
        "bootstrap": {
          "title": "Bootstrap",
          "default": "localhost:9092",
          "type": "string"
        },
        "schema_registry_url": {
          "title": "Schema Registry Url",
          "default": "http://localhost:8080/schema-registry/api/",
          "type": "string"
        },
        "schema_registry_config": {
          "title": "Schema Registry Config",
          "description": "Extra schema registry config serialized as JSON. These options will be passed into Kafka's SchemaRegistryClient. https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html?#schemaregistryclient",
          "type": "object"
        },
        "client_timeout_seconds": {
          "title": "Client Timeout Seconds",
          "description": "The request timeout used when interacting with the Kafka APIs.",
          "default": 60,
          "type": "integer"
        },
        "consumer_config": {
          "title": "Consumer Config",
          "description": "Extra consumer config serialized as JSON. These options will be passed into Kafka's DeserializingConsumer. See https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#deserializingconsumer and https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md .",
          "type": "object"
        }
      },
      "additionalProperties": false
    },
    "AllowDenyPattern": {
      "title": "AllowDenyPattern",
      "description": "A class to store allow deny regexes",
      "type": "object",
      "properties": {
        "allow": {
          "title": "Allow",
          "description": "List of regex patterns to include in ingestion",
          "default": [
            ".*"
          ],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "deny": {
          "title": "Deny",
          "description": "List of regex patterns to exclude from ingestion.",
          "default": [],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "ignoreCase": {
          "title": "Ignorecase",
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    }
  }
}
Code Coordinates
- Class Name: datahub.ingestion.source.datahub.datahub_source.DataHubSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for DataHub, feel free to ping us on our Slack.