Building Scalable Data Pipelines with Modern ETL Frameworks
In today’s data-driven world, organizations generate and consume massive amounts of data from various sources. The ability to efficiently extract, transform, and load (ETL) this data into usable formats is crucial for business intelligence, analytics, and machine learning initiatives. This comprehensive guide explores modern approaches to building scalable data pipelines that can handle enterprise-level data processing requirements.
Understanding Modern Data Pipeline Architecture
Data pipelines have evolved significantly from traditional batch processing systems to sophisticated, real-time streaming architectures. Modern data pipelines must handle diverse data sources, support multiple data formats, and provide reliable, scalable processing capabilities.
Core Components of Data Pipelines
Data Sources
Modern enterprises deal with various data sources; a short file-reading sketch follows the list:
- Transactional Databases: PostgreSQL, MySQL, Oracle
- NoSQL Databases: MongoDB, Cassandra, DynamoDB
- Cloud Storage: Amazon S3, Google Cloud Storage, Azure Blob
- Streaming Sources: Apache Kafka, Amazon Kinesis, Google Pub/Sub
- APIs and Web Services: REST APIs, GraphQL endpoints
- File Systems: CSV, JSON, Parquet, Avro files
- Real-time Event Streams: IoT devices, user interactions, system logs
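To make the file-based sources concrete, here is the minimal sketch promised above: it loads a few of the listed formats into pandas DataFrames. The paths are hypothetical placeholders, and reading Parquet assumes pyarrow (or fastparquet) is installed.

import pandas as pd

# Placeholder paths; in practice these often point at cloud storage (s3://, gs://, abfs://)
csv_df = pd.read_csv("data/transactions.csv")
json_df = pd.read_json("data/events.json", lines=True)   # newline-delimited JSON
parquet_df = pd.read_parquet("data/customers.parquet")   # requires pyarrow or fastparquet

print(csv_df.shape, json_df.shape, parquet_df.shape)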
Processing Engines
Different processing engines serve different use cases:
Apache Spark: Unified analytics engine for large-scale data processing
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, split, regexp_replace

# Initialize Spark session
spark = SparkSession.builder \
    .appName("DataPipelineProcessing") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Read data from multiple sources
def read_data_sources(spark):
    # Read from S3
    s3_data = spark.read.parquet("s3a://bucket/data/transactions/")

    # Read from database
    db_data = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://host:5432/database") \
        .option("dbtable", "customer_data") \
        .option("user", "username") \
        .option("password", "password") \
        .load()

    # Read from Kafka stream
    kafka_data = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "events") \
        .load()

    return s3_data, db_data, kafka_data

# Data transformation pipeline
def transform_data(df):
    return df \
        .filter(col("amount") > 0) \
        .withColumn("amount_category",
                    when(col("amount") < 100, "low")
                    .when(col("amount") < 1000, "medium")
                    .otherwise("high")) \
        .withColumn("email_domain", split(col("email"), "@").getItem(1)) \
        .withColumn("clean_description",
                    regexp_replace(col("description"), "[^a-zA-Z0-9\\s]", ""))

# Apply transformations
s3_data, db_data, kafka_data = read_data_sources(spark)
transformed_data = transform_data(s3_data)

# Write results
transformed_data.write \
    .mode("append") \
    .partitionBy("date", "amount_category") \
    .parquet("s3a://output-bucket/processed-data/")

Apache Flink: Stream processing framework for real-time data
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.common.serialization import SimpleStringSchema
import json
import time

# Set up Flink environment
env = StreamExecutionEnvironment.get_execution_environment()
table_env = StreamTableEnvironment.create(env)

# Configure Kafka source
kafka_props = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'data-pipeline-consumer'
}

kafka_consumer = FlinkKafkaConsumer(
    topics='user-events',
    deserialization_schema=SimpleStringSchema(),
    properties=kafka_props
)

# Create data stream
event_stream = env.add_source(kafka_consumer)

# Process stream data
def process_events(event):
    # Parse the JSON event and enrich it with metadata
    try:
        event_data = json.loads(event)
        event_data['processed_timestamp'] = int(time.time())
        event_data['pipeline_version'] = '1.0'
        return json.dumps(event_data)
    except (json.JSONDecodeError, TypeError):
        # Drop malformed events
        return None

processed_stream = event_stream.map(process_events).filter(lambda x: x is not None)

# Execute pipeline
env.execute("Real-time Event Processing Pipeline")

Data Storage Solutions
Modern data architectures employ various storage solutions:
Data Lakes: Store raw data in its native format
import boto3
import pandas as pd
import numpy as np
from datetime import datetime

class DataLakeManager:
    def __init__(self, bucket_name, aws_access_key=None, aws_secret_key=None):
        self.bucket_name = bucket_name
        self.s3_client = boto3.client(
            's3',
            aws_access_key_id=aws_access_key,
            aws_secret_access_key=aws_secret_key
        )

    def write_partitioned_data(self, df, table_name, partition_cols):
        """Write data with partitioning for optimized queries"""
        for partition_values in df[partition_cols].drop_duplicates().itertuples(index=False):
            # Create partition filter
            partition_filter = dict(zip(partition_cols, partition_values))
            partition_df = df
            for col, val in partition_filter.items():
                partition_df = partition_df[partition_df[col] == val]

            # Create S3 path with partitions
            partition_path = "/".join([f"{col}={val}" for col, val in partition_filter.items()])
            s3_key = f"data/{table_name}/{partition_path}/data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"

            # Write to S3 (pandas handles s3:// paths via s3fs)
            partition_df.to_parquet(f"s3://{self.bucket_name}/{s3_key}")
            print(f"Written partition to {s3_key}")

    def read_partitioned_data(self, table_name, partition_filter=None):
        """Read data with optional partition filtering"""
        prefix = f"data/{table_name}/"

        if partition_filter:
            for col, val in partition_filter.items():
                prefix += f"{col}={val}/"

        # List objects with prefix
        response = self.s3_client.list_objects_v2(Bucket=self.bucket_name, Prefix=prefix)

        dataframes = []
        for obj in response.get('Contents', []):
            if obj['Key'].endswith('.parquet'):
                df = pd.read_parquet(f"s3://{self.bucket_name}/{obj['Key']}")
                dataframes.append(df)

        return pd.concat(dataframes, ignore_index=True) if dataframes else pd.DataFrame()

# Usage
lake_manager = DataLakeManager("my-data-lake-bucket")

# Write customer transaction data
transactions_df = pd.DataFrame({
    'customer_id': range(1000),
    'amount': np.random.randint(1, 1000, 1000),
    'date': pd.date_range('2024-01-01', periods=1000, freq='D'),
    'region': np.random.choice(['US', 'EU', 'APAC'], 1000)
})

lake_manager.write_partitioned_data(transactions_df, 'transactions', ['date', 'region'])

Data Warehouses: Structured storage for analytics
-- Snowflake data warehouse setup
CREATE WAREHOUSE compute_wh WITH
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 5
    SCALING_POLICY = 'STANDARD';

-- Create database and schema
CREATE DATABASE analytics_db;
CREATE SCHEMA analytics_db.sales;

-- Create optimized tables
CREATE TABLE analytics_db.sales.fact_transactions (
    transaction_id STRING,
    customer_id STRING,
    product_id STRING,
    amount DECIMAL(10,2),
    transaction_date DATE,
    region STRING
) CLUSTER BY (transaction_date, region);

-- Create materialized view for common aggregations
CREATE MATERIALIZED VIEW analytics_db.sales.daily_sales AS
SELECT
    transaction_date,
    region,
    COUNT(*) AS transaction_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount
FROM analytics_db.sales.fact_transactions
GROUP BY transaction_date, region;

Orchestration with Apache Airflow
Apache Airflow has become the de facto standard for orchestrating complex data pipelines. It provides a rich set of operators, robust scheduling, and excellent monitoring capabilities.
Setting Up Airflow DAGs
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.amazon.aws.sensors.s3_key import S3KeySensor
from airflow.models import Variable
from datetime import datetime, timedelta
import pandas as pd
import boto3

# Default arguments for the DAG
default_args = {
    'owner': 'data-engineering-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

# Create DAG (catchup belongs on the DAG itself, not in default_args)
dag = DAG(
    'customer_analytics_pipeline',
    default_args=default_args,
    description='Process customer data for analytics',
    schedule_interval='@daily',
    max_active_runs=1,
    catchup=False,
    tags=['analytics', 'customer-data']
)
# Python functions for data processing
def extract_customer_data(**context):
    """Extract customer data from various sources"""
    execution_date = context['execution_date']

    # Extract from database
    import psycopg2
    import pandas as pd

    conn = psycopg2.connect(
        host='prod-db.company.com',
        database='customer_db',
        user='etl_user',
        password=Variable.get('db_password')  # Jinja templates are not rendered inside Python callables
    )

    query = """
        SELECT customer_id, email, registration_date, last_login,
               total_purchases, customer_segment
        FROM customers
        WHERE updated_at >= %s - INTERVAL '1 day'
          AND updated_at < %s
    """

    customer_df = pd.read_sql(query, conn, params=[execution_date, execution_date])

    # Save to S3 staging area
    s3_key = f"staging/customers/date={execution_date.strftime('%Y-%m-%d')}/customers.parquet"
    customer_df.to_parquet(f"s3://data-pipeline-staging/{s3_key}")

    return s3_key
def transform_customer_data(**context):
    """Transform and enrich customer data"""
    execution_date = context['execution_date']

    # Read from staging
    s3_key = f"staging/customers/date={execution_date.strftime('%Y-%m-%d')}/customers.parquet"
    df = pd.read_parquet(f"s3://data-pipeline-staging/{s3_key}")

    # Transformations
    df['days_since_registration'] = (execution_date - pd.to_datetime(df['registration_date'])).dt.days
    df['days_since_last_login'] = (execution_date - pd.to_datetime(df['last_login'])).dt.days

    # Customer lifecycle classification
    def classify_lifecycle(row):
        if row['days_since_registration'] <= 30:
            return 'new'
        elif row['days_since_last_login'] <= 7:
            return 'active'
        elif row['days_since_last_login'] <= 30:
            return 'at_risk'
        else:
            return 'churned'

    df['lifecycle_stage'] = df.apply(classify_lifecycle, axis=1)

    # Value segmentation
    df['value_segment'] = pd.cut(df['total_purchases'],
                                 bins=[0, 100, 500, 1000, float('inf')],
                                 labels=['low', 'medium', 'high', 'vip'])

    # Save transformed data
    output_key = f"processed/customers/date={execution_date.strftime('%Y-%m-%d')}/customers_enriched.parquet"
    df.to_parquet(f"s3://data-pipeline-processed/{output_key}")

    return output_key
def validate_data_quality(**context):
    """Validate data quality and generate quality report"""
    execution_date = context['execution_date']

    # Read processed data
    output_key = f"processed/customers/date={execution_date.strftime('%Y-%m-%d')}/customers_enriched.parquet"
    df = pd.read_parquet(f"s3://data-pipeline-processed/{output_key}")

    # Quality checks
    quality_report = {
        'total_records': len(df),
        'null_email_count': df['email'].isnull().sum(),
        'duplicate_customers': df['customer_id'].duplicated().sum(),
        'negative_purchases': (df['total_purchases'] < 0).sum(),
        'future_registration_dates': (pd.to_datetime(df['registration_date']) > execution_date).sum()
    }

    # Set quality thresholds
    thresholds = {
        'null_email_rate': 0.05,    # Max 5% null emails
        'duplicate_rate': 0.01,     # Max 1% duplicates
        'data_anomaly_rate': 0.02   # Max 2% anomalies
    }

    # Check thresholds
    null_email_rate = quality_report['null_email_count'] / quality_report['total_records']
    duplicate_rate = quality_report['duplicate_customers'] / quality_report['total_records']
    anomaly_rate = (quality_report['negative_purchases'] + quality_report['future_registration_dates']) / quality_report['total_records']

    quality_passed = (
        null_email_rate <= thresholds['null_email_rate'] and
        duplicate_rate <= thresholds['duplicate_rate'] and
        anomaly_rate <= thresholds['data_anomaly_rate']
    )

    if not quality_passed:
        raise ValueError(f"Data quality check failed: {quality_report}")

    print(f"Data quality check passed: {quality_report}")
    return quality_report
# Task definitions
wait_for_source_data = S3KeySensor(
    task_id='wait_for_source_data',
    bucket_key='raw-data/customers/{{ ds }}/marker.txt',
    bucket_name='source-data-bucket',
    timeout=300,
    poke_interval=30,
    dag=dag
)

extract_data = PythonOperator(
    task_id='extract_customer_data',
    python_callable=extract_customer_data,
    dag=dag
)

transform_data = PythonOperator(
    task_id='transform_customer_data',
    python_callable=transform_customer_data,
    dag=dag
)

validate_quality = PythonOperator(
    task_id='validate_data_quality',
    python_callable=validate_data_quality,
    dag=dag
)

load_to_warehouse = S3ToRedshiftOperator(
    task_id='load_to_redshift',
    schema='analytics',
    table='customer_daily_snapshot',
    s3_bucket='data-pipeline-processed',
    s3_key='processed/customers/date={{ ds }}/customers_enriched.parquet',
    redshift_conn_id='redshift_default',
    copy_options=['FORMAT PARQUET'],
    dag=dag
)

update_analytics_tables = PostgresOperator(
    task_id='update_analytics_tables',
    postgres_conn_id='analytics_db',
    sql="""
        -- Update customer lifecycle metrics
        INSERT INTO customer_lifecycle_history (
            customer_id, date, lifecycle_stage, value_segment,
            days_since_registration, days_since_last_login
        )
        SELECT
            customer_id,
            '{{ ds }}'::date,
            lifecycle_stage,
            value_segment,
            days_since_registration,
            days_since_last_login
        FROM customer_daily_snapshot
        WHERE date = '{{ ds }}'::date;

        -- Update aggregated metrics
        INSERT INTO daily_customer_metrics (
            date, total_customers, new_customers, churned_customers,
            active_customers, avg_customer_value
        )
        SELECT
            '{{ ds }}'::date,
            COUNT(*),
            COUNT(*) FILTER (WHERE lifecycle_stage = 'new'),
            COUNT(*) FILTER (WHERE lifecycle_stage = 'churned'),
            COUNT(*) FILTER (WHERE lifecycle_stage = 'active'),
            AVG(total_purchases)
        FROM customer_daily_snapshot
        WHERE date = '{{ ds }}'::date;
    """,
    dag=dag
)

send_completion_notification = BashOperator(
    task_id='send_completion_notification',
    bash_command="""
        curl -X POST -H 'Content-type: application/json' \
        --data '{"text":"Customer analytics pipeline completed successfully for {{ ds }}"}' \
        {{ var.value.slack_webhook_url }}
    """,
    dag=dag
)

# Define task dependencies
wait_for_source_data >> extract_data >> transform_data >> validate_quality
validate_quality >> load_to_warehouse >> update_analytics_tables >> send_completion_notification

Advanced Airflow Patterns
Dynamic DAG Generation
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import yaml

# Configuration for multiple data sources
data_sources_config = [
    {
        'name': 'sales_data',
        'source_table': 'sales.transactions',
        'destination_table': 'analytics.fact_sales',
        'partition_column': 'transaction_date',
        'schedule': '@hourly'
    },
    {
        'name': 'customer_data',
        'source_table': 'crm.customers',
        'destination_table': 'analytics.dim_customers',
        'partition_column': 'updated_at',
        'schedule': '@daily'
    },
    {
        'name': 'product_data',
        'source_table': 'inventory.products',
        'destination_table': 'analytics.dim_products',
        'partition_column': 'modified_date',
        'schedule': '@daily'
    }
]

def create_etl_dag(source_config):
    """Create a DAG for each data source"""

    dag_id = f"etl_{source_config['name']}"

    default_args = {
        'owner': 'data-team',
        'depends_on_past': False,
        'start_date': datetime(2024, 1, 1),
        'retries': 2,
        'retry_delay': timedelta(minutes=5)
    }

    dag = DAG(
        dag_id,
        default_args=default_args,
        description=f'ETL pipeline for {source_config["name"]}',
        schedule_interval=source_config['schedule'],
        catchup=False,
        tags=['etl', 'auto-generated']
    )

    def extract_data(**context):
        # Extract logic specific to this source
        print(f"Extracting from {source_config['source_table']}")
        # Implementation here

    def transform_data(**context):
        # Transform logic specific to this source
        print(f"Transforming {source_config['name']} data")
        # Implementation here

    def load_data(**context):
        # Load logic specific to this destination
        print(f"Loading to {source_config['destination_table']}")
        # Implementation here

    # Create tasks
    extract_task = PythonOperator(
        task_id=f'extract_{source_config["name"]}',
        python_callable=extract_data,
        dag=dag
    )

    transform_task = PythonOperator(
        task_id=f'transform_{source_config["name"]}',
        python_callable=transform_data,
        dag=dag
    )

    load_task = PythonOperator(
        task_id=f'load_{source_config["name"]}',
        python_callable=load_data,
        dag=dag
    )

    # Set dependencies
    extract_task >> transform_task >> load_task

    return dag
# Generate DAGs dynamically
for config in data_sources_config:
    dag_id = f"etl_{config['name']}"
    globals()[dag_id] = create_etl_dag(config)

This configuration-driven approach to pipeline generation underscores that effective data processing requires understanding both the technical details and the business context. As we continue exploring modern data engineering practices, keep in mind that processing systems need to be designed for scalability and reliability from the beginning.
Real-time Stream Processing
Modern applications increasingly require real-time data processing capabilities. Stream processing frameworks enable organizations to process data as it arrives, providing immediate insights and enabling real-time decision making.
Apache Kafka Integration
from kafka import KafkaProducer, KafkaConsumer
from kafka.errors import KafkaError
import json
import threading
import time
from datetime import datetime

class StreamDataProcessor:
    def __init__(self, bootstrap_servers=['localhost:9092']):
        self.bootstrap_servers = bootstrap_servers
        self.producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            key_serializer=lambda k: k.encode('utf-8') if k else None
        )

    def produce_events(self, topic, events):
        """Produce events to Kafka topic"""
        for event in events:
            try:
                # Add metadata
                event['timestamp'] = datetime.utcnow().isoformat()
                event['producer_id'] = 'data-pipeline-v1'

                # Send to Kafka
                future = self.producer.send(
                    topic,
                    value=event,
                    key=str(event.get('id', ''))
                )

                # Wait for confirmation
                record_metadata = future.get(timeout=10)
                print(f"Sent event to {record_metadata.topic} partition {record_metadata.partition}")

            except KafkaError as e:
                print(f"Failed to send event: {e}")

        self.producer.flush()

    def consume_and_process(self, topic, processing_function):
        """Consume events and apply processing function"""
        consumer = KafkaConsumer(
            topic,
            bootstrap_servers=self.bootstrap_servers,
            value_deserializer=lambda m: json.loads(m.decode('utf-8')),
            key_deserializer=lambda k: k.decode('utf-8') if k else None,
            group_id='data-processing-group',
            auto_offset_reset='latest'
        )

        print(f"Started consuming from topic: {topic}")

        for message in consumer:
            try:
                # Process the event
                processed_event = processing_function(message.value)

                if processed_event:
                    # Send processed event to output topic
                    self.produce_events(f"{topic}_processed", [processed_event])

            except Exception as e:
                print(f"Error processing event: {e}")
                # Could send to dead letter queue here
# Example processing functions
def enrich_user_event(event):
    """Enrich user events with additional context"""

    # Simulate external API call for user profile
    # (get_user_profile is a placeholder for a real lookup service)
    user_profile = get_user_profile(event.get('user_id'))

    # Add enriched data
    enriched_event = event.copy()
    enriched_event.update({
        'user_segment': user_profile.get('segment', 'unknown'),
        'user_lifetime_value': user_profile.get('ltv', 0),
        'processing_timestamp': datetime.utcnow().isoformat(),
        'enrichment_version': '1.2'
    })

    return enriched_event

def detect_anomalies(event):
    """Detect anomalous patterns in events"""

    # Simple anomaly detection logic
    amount = event.get('amount', 0)
    user_id = event.get('user_id')

    # Get user's average transaction amount
    # (get_user_average_amount is a placeholder for a cache/database lookup)
    user_avg = get_user_average_amount(user_id)

    if amount > user_avg * 5:  # 5x above average
        anomaly_event = event.copy()
        anomaly_event.update({
            'anomaly_type': 'high_amount',
            'anomaly_score': amount / user_avg,
            'detected_at': datetime.utcnow().isoformat()
        })
        return anomaly_event

    return None

# Usage
processor = StreamDataProcessor()

# Start processing in separate threads
def start_enrichment_processing():
    processor.consume_and_process('user_events', enrich_user_event)

def start_anomaly_detection():
    processor.consume_and_process('transaction_events', detect_anomalies)

# Start processors
enrichment_thread = threading.Thread(target=start_enrichment_processing)
anomaly_thread = threading.Thread(target=start_anomaly_detection)

enrichment_thread.start()
anomaly_thread.start()

Real-time Analytics with Apache Flink
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

def setup_flink_streaming_job():
    # Create execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(4)

    # Create table environment
    table_env = StreamTableEnvironment.create(env)

    # Define Kafka source table
    table_env.execute_sql("""
        CREATE TABLE user_events (
            user_id STRING,
            event_type STRING,
            amount DECIMAL(10,2),
            timestamp_col TIMESTAMP(3),
            WATERMARK FOR timestamp_col AS timestamp_col - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_events',
            'properties.bootstrap.servers' = 'localhost:9092',
            'properties.group.id' = 'flink_analytics',
            'format' = 'json',
            'scan.startup.mode' = 'latest-offset'
        )
    """)

    # Define output sink table
    table_env.execute_sql("""
        CREATE TABLE user_metrics (
            user_id STRING,
            window_start TIMESTAMP(3),
            window_end TIMESTAMP(3),
            event_count BIGINT,
            total_amount DECIMAL(10,2),
            avg_amount DECIMAL(10,2)
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'user_metrics',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json'
        )
    """)

    # Define real-time aggregation query
    table_env.execute_sql("""
        INSERT INTO user_metrics
        SELECT
            user_id,
            TUMBLE_START(timestamp_col, INTERVAL '1' MINUTE) AS window_start,
            TUMBLE_END(timestamp_col, INTERVAL '1' MINUTE) AS window_end,
            COUNT(*) AS event_count,
            SUM(amount) AS total_amount,
            AVG(amount) AS avg_amount
        FROM user_events
        WHERE event_type = 'purchase'
        GROUP BY user_id, TUMBLE(timestamp_col, INTERVAL '1' MINUTE)
    """)

    # Note: execute_sql("INSERT ...") already submits the streaming job,
    # so no separate env.execute() call is needed for this SQL-only pipeline.

# Complex event processing
def setup_complex_event_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    table_env = StreamTableEnvironment.create(env)

    # Define pattern detection for fraud
    table_env.execute_sql("""
        CREATE TABLE transactions (
            transaction_id STRING,
            user_id STRING,
            amount DECIMAL(10,2),
            merchant_id STRING,
            location STRING,
            transaction_time TIMESTAMP(3),
            WATERMARK FOR transaction_time AS transaction_time - INTERVAL '10' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'transactions',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json'
        )
    """)

    # Detect suspicious patterns
    table_env.execute_sql("""
        CREATE TABLE fraud_alerts (
            user_id STRING,
            alert_type STRING,
            transaction_count BIGINT,
            total_amount DECIMAL(10,2),
            locations STRING,
            alert_timestamp TIMESTAMP(3)
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'fraud_alerts',
            'properties.bootstrap.servers' = 'localhost:9092',
            'format' = 'json'
        )
    """)

    # Identify rapid transactions from different locations
    table_env.execute_sql("""
        INSERT INTO fraud_alerts
        SELECT
            user_id,
            'rapid_location_change' AS alert_type,
            COUNT(*) AS transaction_count,
            SUM(amount) AS total_amount,
            LISTAGG(location) AS locations,
            MAX(transaction_time) AS alert_timestamp
        FROM transactions
        WHERE transaction_time > CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
        GROUP BY user_id
        HAVING COUNT(DISTINCT location) > 2 AND COUNT(*) > 5
    """)

# Start the streaming jobs
setup_flink_streaming_job()
setup_complex_event_processing()

Data Quality and Monitoring
Ensuring data quality is crucial for any data pipeline. Implementing comprehensive monitoring and alerting systems helps maintain data integrity and pipeline reliability.
Data Quality Framework
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import logging
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
from enum import Enum

class QualityCheckType(Enum):
    NULL_CHECK = "null_check"
    RANGE_CHECK = "range_check"
    PATTERN_CHECK = "pattern_check"
    UNIQUENESS_CHECK = "uniqueness_check"
    REFERENTIAL_CHECK = "referential_check"
    FRESHNESS_CHECK = "freshness_check"
    VOLUME_CHECK = "volume_check"

@dataclass
class QualityCheck:
    name: str
    check_type: QualityCheckType
    column: str
    parameters: Dict[str, Any]
    severity: str = "ERROR"  # ERROR, WARNING, INFO
    description: str = ""

@dataclass
class QualityResult:
    check_name: str
    passed: bool
    value: Any
    threshold: Any
    severity: str
    message: str
    timestamp: datetime

class DataQualityEngine:
    def __init__(self):
        self.checks = []
        self.results = []
        self.logger = logging.getLogger(__name__)

    def add_check(self, check: QualityCheck):
        """Add a quality check to the engine"""
        self.checks.append(check)

    def run_null_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check for null values in specified column"""
        null_count = df[check.column].isnull().sum()
        null_percentage = (null_count / len(df)) * 100
        threshold = check.parameters.get('max_null_percentage', 0)

        passed = null_percentage <= threshold

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=null_percentage,
            threshold=threshold,
            severity=check.severity,
            message=f"Null percentage: {null_percentage:.2f}% (threshold: {threshold}%)",
            timestamp=datetime.utcnow()
        )
    def run_range_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check if values are within specified range"""
        column_data = df[check.column]
        min_val = check.parameters.get('min_value')
        max_val = check.parameters.get('max_value')

        violations = 0
        if min_val is not None:
            violations += (column_data < min_val).sum()
        if max_val is not None:
            violations += (column_data > max_val).sum()

        violation_percentage = (violations / len(df)) * 100
        threshold = check.parameters.get('max_violation_percentage', 0)

        passed = violation_percentage <= threshold

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=violation_percentage,
            threshold=threshold,
            severity=check.severity,
            message=f"Range violations: {violation_percentage:.2f}% (threshold: {threshold}%)",
            timestamp=datetime.utcnow()
        )

    def run_pattern_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check if values match specified pattern"""
        pattern = check.parameters.get('pattern')
        column_data = df[check.column].astype(str)

        matches = column_data.str.match(pattern)
        match_percentage = (matches.sum() / len(df)) * 100
        threshold = check.parameters.get('min_match_percentage', 100)

        passed = match_percentage >= threshold

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=match_percentage,
            threshold=threshold,
            severity=check.severity,
            message=f"Pattern matches: {match_percentage:.2f}% (threshold: {threshold}%)",
            timestamp=datetime.utcnow()
        )

    def run_uniqueness_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check for duplicate values"""
        duplicates = df[check.column].duplicated().sum()
        duplicate_percentage = (duplicates / len(df)) * 100
        threshold = check.parameters.get('max_duplicate_percentage', 0)

        passed = duplicate_percentage <= threshold

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=duplicate_percentage,
            threshold=threshold,
            severity=check.severity,
            message=f"Duplicate percentage: {duplicate_percentage:.2f}% (threshold: {threshold}%)",
            timestamp=datetime.utcnow()
        )

    def run_volume_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check data volume against expected ranges"""
        row_count = len(df)
        min_rows = check.parameters.get('min_rows', 0)
        max_rows = check.parameters.get('max_rows', float('inf'))

        passed = min_rows <= row_count <= max_rows

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=row_count,
            threshold=f"{min_rows}-{max_rows}",
            severity=check.severity,
            message=f"Row count: {row_count} (expected: {min_rows}-{max_rows})",
            timestamp=datetime.utcnow()
        )
    def run_freshness_check(self, df: pd.DataFrame, check: QualityCheck) -> QualityResult:
        """Check data freshness"""
        timestamp_column = check.column
        max_age_hours = check.parameters.get('max_age_hours', 24)

        latest_timestamp = pd.to_datetime(df[timestamp_column]).max()
        age_hours = (datetime.utcnow() - latest_timestamp).total_seconds() / 3600

        passed = age_hours <= max_age_hours

        return QualityResult(
            check_name=check.name,
            passed=passed,
            value=age_hours,
            threshold=max_age_hours,
            severity=check.severity,
            message=f"Data age: {age_hours:.2f} hours (threshold: {max_age_hours} hours)",
            timestamp=datetime.utcnow()
        )

    def run_all_checks(self, df: pd.DataFrame) -> List[QualityResult]:
        """Run all configured quality checks"""
        self.results = []

        for check in self.checks:
            try:
                if check.check_type == QualityCheckType.NULL_CHECK:
                    result = self.run_null_check(df, check)
                elif check.check_type == QualityCheckType.RANGE_CHECK:
                    result = self.run_range_check(df, check)
                elif check.check_type == QualityCheckType.PATTERN_CHECK:
                    result = self.run_pattern_check(df, check)
                elif check.check_type == QualityCheckType.UNIQUENESS_CHECK:
                    result = self.run_uniqueness_check(df, check)
                elif check.check_type == QualityCheckType.VOLUME_CHECK:
                    result = self.run_volume_check(df, check)
                elif check.check_type == QualityCheckType.FRESHNESS_CHECK:
                    result = self.run_freshness_check(df, check)
                else:
                    continue

                self.results.append(result)

                # Log result
                log_level = logging.ERROR if not result.passed and result.severity == "ERROR" else logging.WARNING
                self.logger.log(log_level, f"Quality check '{result.check_name}': {result.message}")

            except Exception as e:
                self.logger.error(f"Error running check '{check.name}': {str(e)}")

        return self.results

    def get_quality_report(self) -> Dict[str, Any]:
        """Generate a comprehensive quality report"""
        if not self.results:
            return {"error": "No quality checks have been run"}

        total_checks = len(self.results)
        passed_checks = sum(1 for r in self.results if r.passed)
        failed_checks = total_checks - passed_checks

        failed_by_severity = {
            "ERROR": sum(1 for r in self.results if not r.passed and r.severity == "ERROR"),
            "WARNING": sum(1 for r in self.results if not r.passed and r.severity == "WARNING"),
            "INFO": sum(1 for r in self.results if not r.passed and r.severity == "INFO")
        }

        return {
            "timestamp": datetime.utcnow().isoformat(),
            "total_checks": total_checks,
            "passed_checks": passed_checks,
            "failed_checks": failed_checks,
            "pass_rate": (passed_checks / total_checks) * 100,
            "failed_by_severity": failed_by_severity,
            "check_results": [
                {
                    "name": r.check_name,
                    "passed": r.passed,
                    "value": r.value,
                    "threshold": r.threshold,
                    "severity": r.severity,
                    "message": r.message
                }
                for r in self.results
            ]
        }
# Usage example
def setup_quality_checks_for_customer_data():
    """Setup quality checks for customer data pipeline"""

    quality_engine = DataQualityEngine()

    # Add various quality checks
    quality_engine.add_check(QualityCheck(
        name="customer_id_not_null",
        check_type=QualityCheckType.NULL_CHECK,
        column="customer_id",
        parameters={"max_null_percentage": 0},
        severity="ERROR",
        description="Customer ID should never be null"
    ))

    quality_engine.add_check(QualityCheck(
        name="email_pattern_check",
        check_type=QualityCheckType.PATTERN_CHECK,
        column="email",
        parameters={
            "pattern": r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
            "min_match_percentage": 95
        },
        severity="WARNING",
        description="Email should follow valid email pattern"
    ))

    quality_engine.add_check(QualityCheck(
        name="age_range_check",
        check_type=QualityCheckType.RANGE_CHECK,
        column="age",
        parameters={
            "min_value": 13,
            "max_value": 120,
            "max_violation_percentage": 1
        },
        severity="ERROR",
        description="Age should be between 13 and 120"
    ))

    quality_engine.add_check(QualityCheck(
        name="customer_id_uniqueness",
        check_type=QualityCheckType.UNIQUENESS_CHECK,
        column="customer_id",
        parameters={"max_duplicate_percentage": 0},
        severity="ERROR",
        description="Customer ID should be unique"
    ))

    quality_engine.add_check(QualityCheck(
        name="data_volume_check",
        check_type=QualityCheckType.VOLUME_CHECK,
        column="",
        parameters={
            "min_rows": 1000,
            "max_rows": 100000
        },
        severity="WARNING",
        description="Expected data volume between 1K and 100K records"
    ))

    quality_engine.add_check(QualityCheck(
        name="data_freshness_check",
        check_type=QualityCheckType.FRESHNESS_CHECK,
        column="created_at",
        parameters={"max_age_hours": 2},
        severity="ERROR",
        description="Data should not be older than 2 hours"
    ))

    return quality_engine

# Integration with data pipeline
def validate_pipeline_data(df: pd.DataFrame) -> bool:
    """Validate data quality in pipeline"""

    quality_engine = setup_quality_checks_for_customer_data()
    results = quality_engine.run_all_checks(df)
    report = quality_engine.get_quality_report()

    # Check if any critical errors occurred
    critical_failures = [r for r in results if not r.passed and r.severity == "ERROR"]

    if critical_failures:
        print(f"Pipeline validation failed with {len(critical_failures)} critical errors")
        return False

    print(f"Pipeline validation passed with {report['pass_rate']:.1f}% success rate")
    return True

Pipeline Monitoring and Alerting
import psutil
import time
import requests
from dataclasses import dataclass
from datetime import datetime
import threading
import json
from typing import Dict, List, Callable

@dataclass
class MetricThreshold:
    metric_name: str
    warning_threshold: float
    critical_threshold: float
    comparison: str = "greater_than"  # greater_than, less_than, equals

@dataclass
class PipelineMetrics:
    timestamp: datetime
    cpu_usage: float
    memory_usage: float
    disk_usage: float
    network_io: Dict[str, int]
    processing_rate: float
    error_rate: float
    queue_depth: int
    latency_p95: float

class PipelineMonitor:
    def __init__(self, alert_handlers: List[Callable] = None):
        self.thresholds = []
        self.metrics_history = []
        self.alert_handlers = alert_handlers or []
        self.monitoring = False

    def add_threshold(self, threshold: MetricThreshold):
        """Add monitoring threshold"""
        self.thresholds.append(threshold)

    def add_alert_handler(self, handler: Callable):
        """Add alert handler function"""
        self.alert_handlers.append(handler)

    def collect_system_metrics(self) -> PipelineMetrics:
        """Collect system and pipeline metrics"""

        # System metrics
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage('/')
        network = psutil.net_io_counters()

        # Pipeline-specific metrics (these would come from your pipeline)
        processing_rate = self.get_processing_rate()  # records/second
        error_rate = self.get_error_rate()            # errors/minute
        queue_depth = self.get_queue_depth()          # pending items
        latency_p95 = self.get_latency_p95()          # milliseconds

        return PipelineMetrics(
            timestamp=datetime.utcnow(),
            cpu_usage=cpu_percent,
            memory_usage=memory.percent,
            disk_usage=(disk.used / disk.total) * 100,
            network_io={"bytes_sent": network.bytes_sent, "bytes_recv": network.bytes_recv},
            processing_rate=processing_rate,
            error_rate=error_rate,
            queue_depth=queue_depth,
            latency_p95=latency_p95
        )
    def check_thresholds(self, metrics: PipelineMetrics):
        """Check metrics against thresholds and trigger alerts"""

        metric_values = {
            "cpu_usage": metrics.cpu_usage,
            "memory_usage": metrics.memory_usage,
            "disk_usage": metrics.disk_usage,
            "processing_rate": metrics.processing_rate,
            "error_rate": metrics.error_rate,
            "queue_depth": metrics.queue_depth,
            "latency_p95": metrics.latency_p95
        }

        for threshold in self.thresholds:
            metric_value = metric_values.get(threshold.metric_name)
            if metric_value is None:
                continue

            alert_level = None

            if threshold.comparison == "greater_than":
                if metric_value >= threshold.critical_threshold:
                    alert_level = "CRITICAL"
                elif metric_value >= threshold.warning_threshold:
                    alert_level = "WARNING"
            elif threshold.comparison == "less_than":
                if metric_value <= threshold.critical_threshold:
                    alert_level = "CRITICAL"
                elif metric_value <= threshold.warning_threshold:
                    alert_level = "WARNING"

            if alert_level:
                self.send_alert(alert_level, threshold.metric_name, metric_value, threshold)

    def send_alert(self, level: str, metric_name: str, value: float, threshold: MetricThreshold):
        """Send alert through configured handlers"""

        alert_data = {
            "level": level,
            "metric": metric_name,
            "value": value,
            "threshold": threshold.warning_threshold if level == "WARNING" else threshold.critical_threshold,
            "timestamp": datetime.utcnow().isoformat(),
            "pipeline": "data_processing_pipeline"
        }

        for handler in self.alert_handlers:
            try:
                handler(alert_data)
            except Exception as e:
                print(f"Error sending alert: {e}")

    def start_monitoring(self, interval: int = 60):
        """Start continuous monitoring"""
        self.monitoring = True

        def monitor_loop():
            while self.monitoring:
                try:
                    metrics = self.collect_system_metrics()
                    self.metrics_history.append(metrics)

                    # Keep only last 1000 metrics
                    if len(self.metrics_history) > 1000:
                        self.metrics_history = self.metrics_history[-1000:]

                    self.check_thresholds(metrics)

                    time.sleep(interval)
                except Exception as e:
                    print(f"Error in monitoring loop: {e}")
                    time.sleep(interval)

        monitor_thread = threading.Thread(target=monitor_loop)
        monitor_thread.daemon = True
        monitor_thread.start()

    def stop_monitoring(self):
        """Stop monitoring"""
        self.monitoring = False

    def get_processing_rate(self) -> float:
        """Get current processing rate - implement based on your pipeline"""
        # This would integrate with your actual pipeline metrics
        return 100.0  # records per second

    def get_error_rate(self) -> float:
        """Get current error rate - implement based on your pipeline"""
        # This would integrate with your actual pipeline metrics
        return 0.5  # errors per minute

    def get_queue_depth(self) -> int:
        """Get current queue depth - implement based on your pipeline"""
        # This would integrate with your actual pipeline metrics
        return 50  # pending items

    def get_latency_p95(self) -> float:
        """Get 95th percentile latency - implement based on your pipeline"""
        # This would integrate with your actual pipeline metrics
        return 250.0  # milliseconds
# Alert handlers
def slack_alert_handler(alert_data: Dict):
    """Send alert to Slack"""
    webhook_url = "YOUR_SLACK_WEBHOOK_URL"

    color = "#ff0000" if alert_data["level"] == "CRITICAL" else "#ffaa00"

    message = {
        "attachments": [{
            "color": color,
            "title": f"{alert_data['level']} Alert: {alert_data['metric']}",
            "text": f"Value: {alert_data['value']:.2f} (Threshold: {alert_data['threshold']:.2f})",
            "fields": [
                {"title": "Pipeline", "value": alert_data['pipeline'], "short": True},
                {"title": "Timestamp", "value": alert_data['timestamp'], "short": True}
            ]
        }]
    }

    requests.post(webhook_url, json=message)

def email_alert_handler(alert_data: Dict):
    """Send alert via email"""
    import smtplib
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart

    # Email configuration
    smtp_server = "smtp.gmail.com"
    smtp_port = 587
    sender_email = "alerts@company.com"
    sender_password = "app_password"
    recipient_emails = ["team@company.com"]

    # Create message
    msg = MIMEMultipart()
    msg["From"] = sender_email
    msg["To"] = ", ".join(recipient_emails)
    msg["Subject"] = f"{alert_data['level']} Alert: {alert_data['metric']}"

    body = f"""
    Alert Details:
    - Level: {alert_data['level']}
    - Metric: {alert_data['metric']}
    - Current Value: {alert_data['value']:.2f}
    - Threshold: {alert_data['threshold']:.2f}
    - Pipeline: {alert_data['pipeline']}
    - Timestamp: {alert_data['timestamp']}

    Please investigate immediately.
    """

    msg.attach(MIMEText(body, "plain"))

    # Send email
    try:
        server = smtplib.SMTP(smtp_server, smtp_port)
        server.starttls()
        server.login(sender_email, sender_password)
        server.send_message(msg)
        server.quit()
    except Exception as e:
        print(f"Failed to send email alert: {e}")

def pagerduty_alert_handler(alert_data: Dict):
    """Send alert to PagerDuty"""
    integration_key = "YOUR_PAGERDUTY_INTEGRATION_KEY"

    payload = {
        "routing_key": integration_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{alert_data['level']} Alert: {alert_data['metric']}",
            "severity": "critical" if alert_data["level"] == "CRITICAL" else "warning",
            "source": alert_data['pipeline'],
            "component": alert_data['metric'],
            "custom_details": alert_data
        }
    }

    response = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json=payload,
        headers={"Content-Type": "application/json"}
    )

    if response.status_code != 202:
        print(f"Failed to send PagerDuty alert: {response.text}")
# Setup monitoring for data pipeline
def setup_pipeline_monitoring():
    """Setup comprehensive pipeline monitoring"""

    monitor = PipelineMonitor([
        slack_alert_handler,
        email_alert_handler,
        pagerduty_alert_handler
    ])

    # Add monitoring thresholds
    monitor.add_threshold(MetricThreshold("cpu_usage", 70, 90))
    monitor.add_threshold(MetricThreshold("memory_usage", 80, 95))
    monitor.add_threshold(MetricThreshold("disk_usage", 85, 95))
    monitor.add_threshold(MetricThreshold("error_rate", 5, 10))
    monitor.add_threshold(MetricThreshold("queue_depth", 1000, 5000))
    monitor.add_threshold(MetricThreshold("latency_p95", 1000, 5000))
    monitor.add_threshold(MetricThreshold("processing_rate", 50, 10, "less_than"))

    # Start monitoring
    monitor.start_monitoring(interval=60)  # Check every minute

    return monitor

# Start monitoring
pipeline_monitor = setup_pipeline_monitoring()

Conclusion
Building scalable data pipelines requires careful consideration of architecture, technology choices, and operational practices. Modern ETL frameworks provide powerful capabilities for handling diverse data sources and processing requirements, but success depends on implementing robust monitoring, quality controls, and error handling.
Key takeaways for successful data pipeline implementation:
- Start with clear requirements - Understand your data sources, processing needs, and SLA requirements
- Design for scalability - Use distributed processing frameworks and cloud-native services
- Implement comprehensive monitoring - Monitor both technical metrics and business KPIs
- Ensure data quality - Build quality checks into every stage of your pipeline
- Plan for failure - Implement robust error handling and recovery mechanisms (a minimal retry sketch follows this list)
- Document everything - Maintain clear documentation for data lineage and pipeline logic
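To make the "plan for failure" point concrete, here is a minimal retry sketch with exponential backoff and jitter. It is illustrative only: fetch_batch is a hypothetical stand-in for any pipeline step that can fail transiently, and in a real deployment the final failure would typically be routed to a dead-letter queue or surfaced to the orchestrator for alerting and rescheduling.

import random
import time

def with_retries(func, max_attempts=3, base_delay=1.0):
    """Call func(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the orchestrator can alert and reschedule
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

def fetch_batch():
    # Hypothetical extraction step that fails transiently about half the time
    if random.random() < 0.5:
        raise ConnectionError("source temporarily unavailable")
    return [{"id": 1, "amount": 42.0}]

records = with_retries(fetch_batch, max_attempts=3, base_delay=2.0)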
Remember that while the technical implementation is important, understanding the business context and requirements that drive these decisions is equally crucial for building effective, enterprise-grade data pipelines.
By following these practices and continuously iterating on your pipeline design, you’ll be able to build robust, scalable data processing systems that can evolve with your organization’s growing data needs.