Configuration #

This section is only for users with access to the server where ETL data_snake is running. Changing any of the settings below might break your software! Be sure you know what you are doing!

After installing the ETL data_snake software, you can further customize it using environment variables.


Basic Configuration #

The variables below are used to modify the basic settings of ETL data_snake.

ETL_AUDIT_IS_ACTIVE #

ETL data_snake has a built-in auditing mechanism that tracks changes in application data. This mechanism generates a large amount of additional data and may slow down bulk operations. It can be disabled by changing this variable to "False" (with quotation marks).

Default - "True"
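
Boolean settings such as ETL_AUDIT_IS_ACTIVE are passed as quoted strings, so the text has to be turned into a boolean somewhere. The sketch below shows one way this might be done. The helper name env_bool and the set of accepted spellings are assumptions made for illustration, not the actual ETL data_snake implementation.

```python
import os


def env_bool(name, default="True"):
    """Interpret an environment variable as a boolean.

    Hypothetical helper: the accepted spellings are an assumption.
    """
    raw = os.environ.get(name, default)
    # Tolerate surrounding whitespace and literal quotation marks.
    return raw.strip().strip('"').lower() in {"true", "1", "yes"}


# Disable auditing for this process, then read the flag back.
os.environ["ETL_AUDIT_IS_ACTIVE"] = "False"
audit_is_active = env_bool("ETL_AUDIT_IS_ACTIVE")
print(audit_is_active)
```

Any value outside the accepted spellings is treated as false in this sketch, which is a design choice rather than documented behaviour.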

ETL_COMPONENT_CACHE_TIMEOUT #

Time after which Component preview cache data in the Modeler will be deleted. Measured in seconds.

Default - 1200

ETL_DEFAULT_DATAFRAME_DATETIME_PYTHON_FORMAT #

Default format of datetime values in a DataFrame, written with Python format codes. It must produce the same output as ETL_DEFAULT_DATAFRAME_DATETIME_JS_FORMAT.

Default - %Y-%m-%d %H:%M:%S.%f

ETL_DEFAULT_DATAFRAME_DATETIME_JS_FORMAT #

Default format of datetime values in a DataFrame, written with JavaScript format tokens. It must produce the same output as ETL_DEFAULT_DATAFRAME_DATETIME_PYTHON_FORMAT.

Default - YYYY-MM-DD HH:mm:ss.SSSSSS
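
To illustrate why the two defaults match, the sketch below formats a sample datetime with the default Python format and compares its shape with the default JS pattern. It assumes the JS pattern uses Moment.js-style tokens, where SSSSSS stands for six fractional-second digits. The sample date is arbitrary.

```python
from datetime import datetime

PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
JS_FORMAT = "YYYY-MM-DD HH:mm:ss.SSSSSS"

# %f always emits six zero-padded microsecond digits.
sample = datetime(2024, 3, 7, 9, 5, 3, 120)
text = sample.strftime(PYTHON_FORMAT)
print(text)

# The text parses back to the same value with the same format.
round_trip = datetime.strptime(text, PYTHON_FORMAT)
print(round_trip == sample)

# Both patterns describe strings of the same length.
print(len(text) == len(JS_FORMAT))
```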

ETL_MEDIA_ROOT #

Path to folder where media files are stored.

Default - /opt/etl/media

ETL_STATIC_ROOT #

Path to folder where static files are stored (e.g. stylesheets).

Default - /opt/etl/static

ETL_RESOURCES_PATH #

Path to the Resources folder. This is where CSV and Excel Target files are created and where Local File Sources are located.

Default - /opt/etl/files

ETL_EXTERNAL_URL #

The web address used for sharing and for the process API token. When a proxy is used, this is the public address (e.g. mga.com.pl). You do not need to add the request scheme.

Default - localhost

ETL_COMPONENT_ICONS_URL #

Path to the folder with all Component icons.

Default - /static/modeler/icon/components/

ETL_FUSION_REGISTRY_IMPORT_MAX_WAIT_TIME #

How long to wait for Fusion Registry before stopping the process. Loading large amounts of data might take a long time. If the process does not finish within the time set by this variable, it ends with an error. This does not necessarily mean that the data was not successfully uploaded to the Fusion Registry, only that Fusion Registry did not finish the upload in time. The value is set in seconds.

Default - 86400

ETL_PAGINATION_SIZE #

How many objects will be visible on each section containing a list of elements (except Modifier Steps).

Default - 10

ETL_RESOURCES_PAGINATION_SIZE #

The number of files you can see on one page in the Resources view.

Default - 250

ETL_DASHBOARD_TABLE_ROWS_NUMBER #

The number of rows to display in the Recently run processes table on the Home Page.

Default - 10

ETL_DASHBOARD_PLOT_ROWS_NUMBER #

The number of process executions to display in the plot of the Recently run processes table on the Home Page.

Default - 20

ETL_SCRIPT_EXECUTOR_CLASS #

This should not be changed unless a custom code executor class is written by the developers!

Which Script Executor to use for all the Script Components.

Default - "etl.components.executors.EvalCodeExecutor"

ETL_SECRET_KEY #

A secret key for a particular ETL data_snake software installation. This is used to provide cryptographic signing of data, and should be set to a unique, unpredictable value. Refer to external documentation for more information.

Default - "__this__must__be__change__on__production__server__"

ETL_TIME_ZONE #

The time zone used by the ETL data_snake software. This is used for process scheduling. Refer to external documentation for a list of all available time zones (scroll down to all_timezones).

Default - "Europe/Warsaw"

ETL_DATETIME_FORMAT #

The default date format used in the Recently run processes table on the Home Page.

Default - "%Y-%m-%d %H:%M:%S"

ETL_TEMPLATE_DATETIME_FORMAT #

The default date format used in the Process and Modifier Run Logs lists.

Default - "Y-m-d H:i:s"
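
ETL_DATETIME_FORMAT and ETL_TEMPLATE_DATETIME_FORMAT use different notations for what is, by default, the same layout. The sketch below assumes that "Y-m-d H:i:s" follows Django's template date format letters and translates it to strftime codes. The mapping is deliberately partial (it covers only the letters in the default value), and the function name is hypothetical.

```python
from datetime import datetime

# Partial mapping: only the letters used by the default value.
TEMPLATE_TO_STRFTIME = {
    "Y": "%Y",  # four-digit year
    "m": "%m",  # zero-padded month
    "d": "%d",  # zero-padded day
    "H": "%H",  # 24-hour clock, zero-padded
    "i": "%M",  # minutes
    "s": "%S",  # seconds
}


def template_to_strftime(template_format):
    """Translate template-style date letters to strftime codes."""
    return "".join(TEMPLATE_TO_STRFTIME.get(ch, ch) for ch in template_format)


converted = template_to_strftime("Y-m-d H:i:s")
print(converted)
print(datetime(2024, 3, 7, 9, 5, 3).strftime(converted))
```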

How long session cookies are stored. Value is in seconds.

Default - 60 * 60 * 4

The name of the session cookie.

Default - "etl-session-id"

Whether to use a secure cookie for the session cookie. If this is set to "True", the cookie will be marked as secure, which means browsers may ensure that the cookie is only sent under an HTTPS connection. This has to be written with quotation marks.

Default - "False"

ETL_SESSION_KEY_NAME #

The name of the session key used to decrypt the session cookie.

Default - "true_last_activity"

ETL_SESSION_SAVE_EVERY_REQUEST #

Whether to save the session data on every request. If this is "False", then the session data will only be saved if it has been modified. This has to be written with quotation marks.

Default - "True"

The ETL data_snake software utilizes Gunicorn WSGI HTTP Server.

ETL_PROCESS_PARAMETER_EXEC_TIMEOUT_SECONDS #

Maximum timeout for single script computation. Value in seconds.

Default - 10

ETL_PROCESS_PARAMETER_POOL_SIZE #

Pool size used for computation. Defaults to the number of CPUs.

Default - <number of CPUs>

GUNICORN_TIMEOUT #

The amount of time after which worker processes that are not responding will be restarted. Worker processes are responsible for handling requests and returning a response to the client. Value is in seconds.

Default - 300

POSTGRES_HOST #

Name of the host for the PostgreSQL database used by the ETL data_snake software to store application data. Value is an internal docker address of the PostgreSQL container.

Default - "postgres"

POSTGRES_PORT #

The TCP port for the PostgreSQL database used by the ETL data_snake software to store application data.

Default - 5432

REDIS_DB #

Which of the Redis databases to use for connecting to the Redis cache.

Default - 0

REDIS_HOST #

Name of the host for the Redis cache used by the ETL data_snake software. Value is an internal docker address of the Redis container.

Default - "redis"

REDIS_PORT #

The TCP port for the Redis cache used by the ETL data_snake software.

Default - 6379
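
As an illustration of how the three Redis variables fit together, the sketch below reads them from a mapping of environment variables and falls back to the documented defaults. The function name redis_settings is hypothetical, and the actual application may read these values differently.

```python
import os


def redis_settings(environ=os.environ):
    """Return (host, port, db) using the documented defaults."""
    host = environ.get("REDIS_HOST", "redis")
    port = int(environ.get("REDIS_PORT", "6379"))
    db = int(environ.get("REDIS_DB", "0"))
    return host, port, db


# With nothing set, the documented defaults apply.
print(redis_settings({}))

# Overriding only the port keeps the other defaults.
print(redis_settings({"REDIS_PORT": "6380"}))
```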

DJANGO_VERIFY_REQUESTS_SSL #

Whether to enable SSL verification; setting to "False" will accept any TLS certificate presented by the server, and will ignore hostname mismatches and/or expired certificates. This has to be written with quotation marks.

Default - "True"

ETL_CONSTANCE_BACKEND #

The backend to use for customizable settings.

Default - constance.backends.redisd.RedisBackend

ETL_CONSTANCE_REDIS_CONNECTION #

The connection to use for the Redis cache used for customising settings for users.

Default - redis://redis:6379:0


Sentry Monitoring Platform #

The variables containing _SENTRY_ should only be set if the Sentry application monitoring platform is used.

ETL_SENTRY_DSN #

The DSN to use to connect to Sentry.

Default - unset

ETL_SENTRY_ENVIRONMENT #

Optionally set the DataSnake ETL software environment name for the Sentry application monitoring platform. Can be any string. Set "" for no environment.

Default - "production"

ETL_SENTRY_SEND_DEFAULT_PII #

Whether to include information about the user who caused the issue in the Sentry event.

Default - "True"

ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL #

This sets which logs will be changed into Breadcrumbs. The available values are:

  • "CRITICAL" - only logs describing critical problems that have occurred,
  • "ERROR" - logs describing major problems that have occurred; includes information described in "CRITICAL",
  • "WARNING" - logs describing minor problems that have occurred; includes information described in "ERROR" and "CRITICAL",
  • "INFO" - general system information logs; includes information described in previous options,
  • "DEBUG" - log low level system information for debugging purposes; includes information described in previous options.

Default - "INFO"
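
The level names above match Python's standard logging levels, and each level includes the more severe ones. The sketch below assumes the values map directly onto the standard logging module and lists which levels pass a given threshold. The function name is hypothetical.

```python
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def levels_included(threshold):
    """Return the level names at or above the given threshold."""
    minimum = getattr(logging, threshold)
    return [name for name in LEVELS if getattr(logging, name) >= minimum]


print(levels_included("INFO"))
print(levels_included("ERROR"))
```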

ETL_SENTRY_LOGGING_EVENT_LEVEL #

This sets the minimum level of logs that will be sent to Sentry as an event. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "ERROR"

ETL_SENTRY_MAX_BREADCRUMBS #

The maximum amount of Breadcrumbs that should be captured.

Default - 50


LDAP #

The following variables are used to configure Lightweight Directory Access Protocol (LDAP) service integration. For more information on how to configure these variables, visit the external documentation.

ETL_AUTH_LDAP_SERVER_URI #

This variable points to the LDAP server. If you do not use LDAP services, unset this variable.

Default - unset

ETL_AUTH_LDAP_BIND_DN #

The unique name to use when binding to the LDAP server. This is used for operations that do not authenticate specific users. Connected with the ETL_AUTH_LDAP_BIND_PASSWORD variable.

Default - cn=admin_ro,cn=Users,dc=example,dc=com

ETL_AUTH_LDAP_BIND_PASSWORD #

This variable is the password that authorises the LDAP Service and allows users to log in. Set to your LDAP Service password. Connected with the ETL_AUTH_LDAP_BIND_DN variable.

Default - unset

ETL_AUTH_LDAP_START_TLS #

A flag that indicates whether to enable TLS encryption over the standard LDAP port. This has to be written with quotation marks.

Default - "True"

ETL_AUTH_LDAP_MIRROR_GROUPS #

A flag that indicates whether to mirror LDAP group memberships in the ETL data_snake user database. This has to be written with quotation marks.

Default - "True"

ETL_AUTH_LDAP_CACHE_GROUPS #

Whether to cache User’s group membership with their unique name.

Default - "True"

ETL_AUTH_LDAP_CACHE_TIMEOUT #

This variable determines the amount of time, in seconds, that a User’s unique name is cached. The value 0 disables caching entirely.

Default - 3600

ETL_AUTH_LDAP_GROUP_SEARCH_BASE #

This variable references all LDAP groups that users might belong to. If set, the ETL_AUTH_LDAP_GROUP_TYPE must be set. This is optional.

Default - cn=Users,dc=example,dc=com

ETL_AUTH_LDAP_GROUP_SEARCH_SCOPE #

The extent of the search to make when performing a group search. This is optional. The available options are:

  • 'scope_subtree' - search against the search base and all entries below it,
  • 'scope_onelevel' - search against the search base and its immediate subordinates,
  • 'scope_base' - search only against the search base,
  • 'scope_subordinate' - constrains the search scope to all subordinates of the named base object and does not include the base object.

Default - "scope_subtree"

ETL_AUTH_LDAP_GROUP_SEARCH_FILTER #

A search filter for group searching. This is optional.

Default - "(objectClass=group)"

ETL_AUTH_LDAP_GROUP_TYPE #

The type of groups that should be referenced by the ETL_AUTH_LDAP_GROUP_SEARCH_BASE variable.

Default - cn

ETL_AUTH_LDAP_REQUIRE_GROUP #

The distinguished name of a group. Authentication will fail for any User that does not belong to this group. This is optional.

Default - not set

ETL_AUTH_LDAP_GROUP_TYPE_NAME_ATTR #

Optional name attribute for ETL_AUTH_LDAP_GROUP_TYPE.

Default - not set

ETL_AUTH_LDAP_GROUP_CACHE_TIMEOUT #

This variable determines the amount of time, in seconds, that a user’s group memberships and unique name are cached. The value 0 disables caching entirely.

Default - 3600

ETL_AUTH_LDAP_ALWAYS_UPDATE_USER #

If set, determines if the ETL data_snake users should have their details updated with the LDAP directory every time they log in or only on creation. This has to be written with quotation marks.

Default - "True"

ETL_AUTH_LDAP_PERMIT_EMPTY_PASSWORD #

If set, determines whether authentication with an empty password will fail immediately, without any LDAP communication. Some LDAP servers are configured to allow binds to succeed with no password, perhaps at a reduced level of access. This has to be written with quotation marks.

Default - "False"

ETL_AUTH_LDAP_USER_ATTR_MAP #

A mapping from User field names to LDAP attribute names. A user’s User object will be populated from their LDAP attributes at login. Value in JSON format.

Default:

{
    "username": "sAMAccountName",
    "first_name": "givenName",
    "last_name": "sn",
    "email": "mail"
}
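
To make the direction of the mapping concrete, the sketch below parses the default JSON value and applies it to a sample LDAP entry. The entry and its values are invented for illustration, and the code shows only the idea of the mapping, not how ETL data_snake applies it internally.

```python
import json

raw_value = """
{
    "username": "sAMAccountName",
    "first_name": "givenName",
    "last_name": "sn",
    "email": "mail"
}
"""
attr_map = json.loads(raw_value)

# Hypothetical LDAP entry; LDAP attributes are multi-valued, hence the lists.
ldap_entry = {
    "sAMAccountName": ["jdoe"],
    "givenName": ["Jane"],
    "sn": ["Doe"],
    "mail": ["jane.doe@example.com"],
}

# Keys are User fields, values are the LDAP attributes that fill them.
user_fields = {
    field: ldap_entry[attr][0]
    for field, attr in attr_map.items()
    if attr in ldap_entry
}
print(user_fields)
```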

The variable locates a user in the LDAP directory. It must return exactly one result for authentication to succeed.

Default - cn=Users,dc=example,dc=com

ETL_AUTH_LDAP_USER_SEARCH_SCOPE #

The extent of the search to make when performing a User search. The available options are:

  • 'scope_subtree' - search against the search base and all entries below it,
  • 'scope_onelevel' - search against the search base and its immediate subordinates,
  • 'scope_base' - search only against the search base,
  • 'scope_subordinate' - constrains the search scope to all subordinates of the named base object and does not include the base object.

Default - "scope_subtree"

ETL_AUTH_LDAP_USER_SEARCH_FILTER #

A search filter for User searching.

Default - "(sAMAccountName=%(user)s)"
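
The %(user)s part of the default filter is a Python-style named placeholder. The sketch below assumes it is replaced with the login name entered by the user and shows the substitution, including escaping of characters that are special in LDAP filters (per RFC 4515). Both function names are hypothetical.

```python
USER_SEARCH_FILTER = "(sAMAccountName=%(user)s)"

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value):
    """Escape LDAP filter metacharacters in a user-supplied value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def build_filter(login):
    """Fill the %(user)s placeholder with the escaped login name."""
    return USER_SEARCH_FILTER % {"user": escape_filter_value(login)}


print(build_filter("jdoe"))
print(build_filter("a*b"))
```

Escaping matters because an unescaped * would turn the filter into a wildcard search and could match more than one entry.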


Mailing #

The variables below are used to customize the mailing system in the DataSnake ETL software.

ETL_DEFAULT_FROM_EMAIL #

The email address that will be used as the sender of messages generated by the DataSnake ETL software.

Default - etl@example.com

ETL_EMAIL_BACKEND #

The backend used to send emails through ETL data_snake.

Default - djcelery_email.backends.CeleryEmailBackend

ETL_EMAIL_HOST #

The host configuration for sending emails through ETL data_snake.

Default - not set

ETL_EMAIL_HOST_PASSWORD #

The password for the email host for sending emails through ETL data_snake.

Default - not set

ETL_EMAIL_HOST_USER #

The username for the email host for sending emails through ETL data_snake.

Default - not set

ETL_EMAIL_PORT #

The port for the email host for sending emails through ETL data_snake.

Default - not set

ETL_EMAIL_SSL_CERTFILE #

The path to a PEM-formatted certificate chain file to use for the SSL connection (optional).

Default - not set

ETL_EMAIL_SSL_KEYFILE #

The path to a PEM-formatted private key file to use for the SSL connection (optional).

Default - not set

ETL_EMAIL_TITLE_PREFIX #

The prefix added to the subject of emails sent by the DataSnake ETL software.

Default - "[ETL]"

ETL_EMAIL_USE_SSL #

Whether to use the now-deprecated SSL cryptographic protocol for emails sent by the DataSnake ETL software. If set to "True", then the ETL_EMAIL_USE_TLS variable must be set to "False" (with quotation marks).

Default - "False"

ETL_EMAIL_USE_TLS #

Whether to use the TLS cryptographic protocol for emails sent by the DataSnake ETL software. If set to "True", then the ETL_EMAIL_USE_SSL variable must be set to "False" (with quotation marks).

Default - "True"
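
ETL_EMAIL_USE_SSL and ETL_EMAIL_USE_TLS are mutually exclusive, so a deployment script might check them before starting the application. The following is a minimal sketch of such a check. The function name and return values are hypothetical.

```python
def check_email_security(use_ssl="False", use_tls="True"):
    """Validate the two flags and report which protocol is selected."""
    ssl_on = use_ssl == "True"
    tls_on = use_tls == "True"
    if ssl_on and tls_on:
        raise ValueError(
            "ETL_EMAIL_USE_SSL and ETL_EMAIL_USE_TLS cannot both be True"
        )
    if ssl_on:
        return "ssl"
    if tls_on:
        return "tls"
    return "none"


# The documented defaults select TLS.
print(check_email_security())
```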

CELERY_EMAIL_RETRY_DELAY #

How many seconds to wait before trying to send the email again.

Default - 60

CELERY_EMAIL_MAX_RETRIES #

How many times to retry sending an email.

Default - 10


Celery Task Queues #

The DataSnake ETL software utilises Celery to help with running Processes and Modifiers as well as monitoring the system by checking the current CPU and RAM usage.

Celery lets you name its workers so that separate tasks can be delegated to separate workers. DataSnake ETL has three basic workers: the main worker, which is responsible for running Modifiers; the Processes worker, which is responsible for running Processes; and the System Monitor worker, which is responsible for gathering CPU/RAM usage of the machine running the DataSnake ETL software.

CELERY_MAIN_WORKER_NAME #

The name of the primary Celery worker (responsible for running Modifiers, scheduling and sending emails, etc).

Default - "default@%h"

Worker names should have the following structure: <name>@<hostname>.

The following special symbols can be used in the worker name and are expanded automatically by Celery:

  • %h - Hostname, including domain name,
  • %n - Hostname only,
  • %d - Domain name only.
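
The expansion of these symbols can be sketched in plain Python. The function below is an illustration only and does not call Celery. The function name and the sample hostname etl01.example.com are assumptions.

```python
def expand_worker_name(pattern, fqdn):
    """Expand %h, %n and %d in a worker name for the given host."""
    short_name, _, domain = fqdn.partition(".")
    return (
        pattern.replace("%h", fqdn)         # hostname including domain
               .replace("%n", short_name)   # hostname only
               .replace("%d", domain)       # domain name only
    )


print(expand_worker_name("default@%h", "etl01.example.com"))
print(expand_worker_name("processes@%n", "etl01.example.com"))
```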

CELERY_MAIN_WORKER_CONCURRENCY #

The number of worker processes/threads that can be executed simultaneously by the primary Celery worker.

Default - 4

CELERY_PROCESSES_WORKER_NAME #

For DataSnake ETL to work properly, each Celery worker must have a unique name.

The name of the Celery worker that is responsible for running Processes. Refer to the CELERY_MAIN_WORKER_NAME variable for details about how to name Celery workers.

Default - "processes@%h"

CELERY_PROCESSES_WORKER_CONCURRENCY #

The number of worker processes/threads that can be executed simultaneously by the Celery worker responsible for running Processes.

Default - 2

CELERY_SYSTEM_MONITOR_WORKER_NAME #

The name of the Celery worker responsible for gathering CPU/RAM usage data of the machine running the DataSnake ETL software. Refer to the CELERY_MAIN_WORKER_NAME variable for details about how to name Celery workers.

Default - "monitor@%h"

CELERY_SYSTEM_MONITOR_WORKER_CONCURRENCY #

The number of worker processes/threads that can be executed simultaneously by the Celery worker responsible for gathering CPU/RAM usage data of the machine running the DataSnake ETL software.

Default - 2

CELERY_BROKER_TRANSPORT_MAX_RETRIES #

How many times to retry Celery Tasks. This should be set to a low value to reduce loading times of various sections of the DataSnake ETL software if there are connection issues with Celery.

Default - 1

CELERY_BROKER_TRANSPORT_INTERVAL_START #

How many seconds to wait between each retry of Celery Tasks. This should be set to a low value to reduce loading times of various sections of the DataSnake ETL software if there are connection issues with Celery.

Default - 1

CELERY_BROKER_TRANSPORT_INTERVAL_STEP #

How many seconds to add to CELERY_BROKER_TRANSPORT_INTERVAL_START after each retry. By default, because there is only one retry, we do not need to add any seconds to CELERY_BROKER_TRANSPORT_INTERVAL_START.

Default - 0

CELERY_BROKER_TRANSPORT_INTERVAL_MAX #

The maximum possible time to wait between retries.

Default - 1
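
The four retry variables above combine into a wait schedule. The sketch below assumes a linear backoff, as the descriptions suggest: each retry waits the start interval plus one step per previous retry, capped at the maximum. The function name is hypothetical and the code does not call Celery.

```python
def retry_waits(max_retries, interval_start, interval_step, interval_max):
    """Return the wait (in seconds) before each retry."""
    return [
        min(interval_start + interval_step * attempt, interval_max)
        for attempt in range(max_retries)
    ]


# Documented defaults: a single retry after one second.
print(retry_waits(1, 1, 0, 1))

# A longer schedule with a step of 2 and a cap of 5 seconds.
print(retry_waits(4, 1, 2, 5))
```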

CELERY_BROKER_TRANSPORT_VISIBILITY_TIMEOUT #

The time after which unacknowledged messages are returned to the Celery task queue.

Default - 300

CELERY_BEAT_MAX_LOOP_INTERVAL #

How often Celery beat checks for schedule changes. For example, when a process schedule is updated, Celery beat loads the new schedule after at most 60 seconds. Value in seconds.

Default - 60

ETL_CELERY_BEAT_MAX_LOOP_INTERVAL #

How often Celery beat checks for schedule changes. When a process schedule is updated, Celery beat loads the new schedule after at most the number of seconds specified by this variable.

Default - 60

CELERY_MAIN_WORKERS_LOCK_KEY #

Cache key to use while locking ModifierRunLog cleaning.

Default - "main_worker_lock"

CELERY_MAIN_WORKERS_LOCK_TIMEOUT #

Expire time (in seconds) of a key locking the ModifierRunLog cleaning (0 for infinite timeout).

Default - 7200

CELERY_PROCESS_WORKERS_LOCK_KEY #

Cache key to use while locking ProcessRunLog cleaning.

Default - "process_worker_lock"

CELERY_PROCESS_WORKERS_LOCK_TIMEOUT #

Expire time (in seconds) of a key locking the ProcessRunLog cleaning (0 for infinite timeout).

Default - 7200

CELERY_SHUTDOWN_HEART_BEAT_RATE #

How often a Celery worker sends the heartbeat while shutting down. Value in seconds.

Default - 2

CELERY_SHUTDOWN_HEART_TIMEOUT #

The number of seconds after which a worker's shutdown heartbeat is no longer valid if it has not been refreshed.

Default - 10

CELERY_SHUTDOWN_HEART_CHECK_RATE #

How often workers check for the shutdown heartbeat of other workers while shutting down.

Default - 2

ETL_SYSTEM_MONITOR_EXPIRE_DAYS #

How long the CPU/RAM usage data is stored in the DataSnake ETL software. Value is in days.

Default - 31

ETL_SYSTEM_MONITOR_INTERVAL #

How often the data about CPU/RAM usage will be retrieved from the system for metric purposes. Value in seconds, float. The minimum value is 1 as data will be aggregated to 1 second while showing the data on the graph.

Default - 1

ETL_SYSTEM_MONITOR_QUEUE_NAME #

Name of the Celery queue responsible for gathering CPU/RAM usage data of the machine running the DataSnake ETL software. This must be different than ETL_PROCESSES_QUEUE_NAME so that checking CPU/RAM usage will never block running Processes or Modifiers.

Default - "system_monitor"

ETL_PROCESSES_QUEUE_NAME #

Name of the Celery queue responsible for running Processes. This must be different than ETL_SYSTEM_MONITOR_QUEUE_NAME so that checking CPU/RAM usage will never block running Processes or Modifiers.

Default - "processes_queue"
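
Because the two queue names must differ, a start-up check along the following lines could catch a misconfiguration early. This is a sketch under the assumption that both values are read from environment variables with the documented defaults. The function name is hypothetical.

```python
import os


def queue_names(environ=os.environ):
    """Return (monitor_queue, processes_queue), rejecting equal names."""
    monitor = environ.get("ETL_SYSTEM_MONITOR_QUEUE_NAME", "system_monitor")
    processes = environ.get("ETL_PROCESSES_QUEUE_NAME", "processes_queue")
    if monitor == processes:
        raise ValueError(
            "ETL_SYSTEM_MONITOR_QUEUE_NAME and ETL_PROCESSES_QUEUE_NAME "
            "must be different"
        )
    return monitor, processes


# With nothing set, the documented defaults are distinct.
print(queue_names({}))
```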


Logging #

The variables below are used to customize how logs are stored and generated.

ETL_LOGGING_TIME_ZONE #

The time zone used in all logged messages.

Default - "Europe/Warsaw"

CELERY_LOG_LEVEL #

What information will be logged when processes and modifiers are run. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"

CELERY_SYSTEM_MONITOR_LOG_LEVEL #

What information will be logged by the System Monitor Celery worker. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"

CELERY_PROCESSES_LOG_LEVEL #

What information will be logged by the Celery worker responsible for running Processes. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"

ETL_DB_BACKENDS_LOG_LEVEL #

What information will be logged by the Database used by the application. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "WARNING"

ETL_DEFAULT_LOG_LEVEL #

What information will be logged by the web application backend of the ETL data_snake software. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"

ETL_REQUEST_LOG_LEVEL #

What information will be logged during communication with Gunicorn. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"

ETL_ROOT_LOG_LEVEL #

What other information will be logged by the ETL data_snake software. Includes logs generated by other parts of the system not described above. Available values are the same as for the ETL_SENTRY_LOGGING_BREADCRUMBS_LEVEL variable.

Default - "INFO"