Welcome to the Data Engineering Workshop - Data Days 2025¶
Author: Mariama Jaiteh, Helmholtz Centre for Infection Research, NFDI4Immuno
An Introduction to Data Engineering for Research Workflows¶
The workshop duration is around 90-120 minutes
Outline - 5 Modules¶
- Introduction to Data Engineering
- Pandas warmup
- Simple Data Engineering: Data Ingestion to Transformation
- Pipeline Design with Prefect
- Data Quality and Monitoring, Pipeline Management Tools
- Closing Remarks
Module 1 – Introduction to Data Engineering¶
Data engineering (DE) focuses on the design, implementation, evaluation, and monitoring of data pipelines to ensure reliability and reproducibility. In short, DE ensures that reliable and well-structured data is ready and easily accessible for downstream analysis and modelling.
🧩 The standard DE process¶
Raw data is increasingly available and accessible. However, it is often not ready for consumption. There are a few major steps in Data Engineering that one must follow when facing a new data source.
The typical stages in the data lifecycle are the following (a short Python sketch follows the list):
- Data Ingestion. Extracting the data from its source and saving it into storage that is under your control.
- Data Transformation. Collected data must be cleaned and validated, and new features can be created.
- Data Storage. A good practice is to store both the raw and the transformed data in your storage account. Ideally, the data should be stored efficiently to ease retrieval and the use of analytics tools.
- Data Orchestration. This refers to the set of workflows used to automate all data-related tasks.
- Data Monitoring & Quality Assurance. An important and often omitted step is to assess data integrity and compliance, as well as data drift.
🔑 Roles and Responsibilities of a DE¶
- Data Pipeline Development. Design and build ETL/ELT workflows to move and transform data. In addition, the DE ensures that pipelines are efficient and scalable.
- Data Modeling. Giving structure to data by designing schemas or data models to optimize storage, querying, and analysis.
- Data Integration. Combining data from multiple, heterogeneous sources into consistent, unified datasets.
- Data Quality Assurance. Implement strategies and tools to clean, validate, and monitor data quality and drift.
- The DE usually works alongside developers, scientists, and analysts.
- The DE skill set:
- Programming
- Data Storage Management
- Data Processing Framework
- Pipeline Orchestration
- Data Quality Assurance
The different types of datasets¶
- Structured, e.g. relational tables or CSV files
- Semi-structured, e.g. JSON or XML
- Unstructured, e.g. free text, images, or raw instrument output (see the short loading sketch below)
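As a quick illustration, each type is typically handled with different tools in Python. The file names below are hypothetical.
import json
import pandas as pd

# Structured: tabular data with a fixed schema
samples = pd.read_csv("samples.csv")

# Semi-structured: nested key-value data such as JSON
with open("experiment_metadata.json") as fh:
    metadata = json.load(fh)

# Unstructured: free text, images, raw instrument output
with open("lab_notes.txt") as fh:
    notes = fh.read()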
ℹ️ The pipelines - ETL vs ELT¶
A data pipeline is a series of automated steps that move, transform, and store data from one system to another. It ensures that raw data collected from different sources becomes clean, consistent, and ready for analysis or modeling.
Typical stages include:
- Extract – collect raw data from files, APIs, or databases.
- Transform – clean, filter, and enrich the data.
- Load – store the processed data in a database or data lake.
There are usually two types of pipelines: ETL and ELT; a short sketch comparing the two follows below.
ETL stands for: Extract - Transform - Load
- Used for standard Data warehouses.
- Data flow: Ingestion → Transformation → Load: storage in warehouse
ELT stands for: Extract - Load - Transform
- Used for modern Data lakes.
- Data flow: Ingestion → Load: storage in data lake → Transformation
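As a rough illustration of the difference in ordering, here is a sketch with hypothetical file paths (writing Parquet assumes a Parquet engine such as pyarrow is installed):
import pandas as pd

def etl_pipeline(source_csv: str, warehouse_path: str) -> None:
    # ETL: transform first, then load the curated result into the warehouse
    raw = pd.read_csv(source_csv)                  # Extract
    clean = raw.drop_duplicates().dropna()         # Transform
    clean.to_parquet(warehouse_path, index=False)  # Load

def elt_pipeline(source_csv: str, lake_path: str) -> pd.DataFrame:
    # ELT: dump the raw extract into the data lake first, transform later on demand
    raw = pd.read_csv(source_csv)                  # Extract
    raw.to_parquet(lake_path, index=False)         # Load (raw, as-is)
    return pd.read_parquet(lake_path).drop_duplicates().dropna()  # Transform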
🏬 Storage options - Data warehouse vs Data Lakes¶
Data Warehouses are systems for storing processed and often well-structured datasets that are ready for analysis, e.g. a SQL database. The target users of data warehouses are teams performing analysis.
- Data is curated and structured
- Data is ready for efficient retrieval
Data Lakes are systems for storing raw data in all its forms, e.g. structured, semi-structured, or unstructured. They offer flexible storage options, e.g. S3-compatible object storage, for the different data sources.
- Data is dumped as is
- Data is not ready to be used and needs to be structured
- Can be retrieved slowly or quickly, depending on its size and the tools used
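As a small sketch (file and table names are hypothetical), a local SQLite database can stand in for a warehouse, while a plain folder of raw files can stand in for a lake:
import shutil
import sqlite3
from pathlib import Path
import pandas as pd

# Warehouse-style storage: curated, structured, query-ready
processed = pd.read_csv("cleaned_measurements.csv")
with sqlite3.connect("warehouse.db") as conn:
    processed.to_sql("measurements", conn, if_exists="replace", index=False)

# Lake-style storage: raw files dumped as-is, structured later when needed
lake_dir = Path("data_lake/raw")
lake_dir.mkdir(parents=True, exist_ok=True)
shutil.copy("instrument_output.json", lake_dir)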
🧱 Medallion Architecture (Bronze → Silver → Gold)¶
The Medallion Architecture organizes data into quality layers, making it easier to manage data as it evolves from raw to analysis-ready.
- 🥉 Bronze: Raw, unprocessed data — exactly as ingested.
- 🥈 Silver: Cleaned and standardized data — duplicates removed, types corrected.
- 🥇 Gold: Curated, aggregated, and enriched datasets — business-ready.
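A minimal pandas sketch of the three layers (the file and column names are hypothetical):
import pandas as pd

# Bronze: raw data exactly as ingested
bronze = pd.read_csv("raw_samples.csv")
bronze.to_csv("bronze_samples.csv", index=False)

# Silver: cleaned and standardized
silver = bronze.drop_duplicates().dropna().rename(columns=str.lower)
silver.to_csv("silver_samples.csv", index=False)

# Gold: aggregated, analysis-ready view (assumes 'category' and 'value' columns)
gold = silver.groupby("category", as_index=False)["value"].mean()
gold.to_csv("gold_samples.csv", index=False)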
💪 Module 2 – Let's do some Pandas warmup¶
⚙️ Let's install the required python packages¶
For the majority of data manipulation and visualization, the Python packages pandas, numpy and matplotlib can be used.
!pip install pandas
!pip install numpy
!pip install matplotlib
🧠 Small intro to pandas - DataFrames¶
pandas is a Python package that enables you to work with structured datasets, e.g. tabular data stored in databases or spreadsheets. Today, we will use pandas to take a quick look at our datasets and do some data cleaning and transformation.
In general, these steps are done alongside data scientists, as they will have specific needs regarding the data structure and the final variables for their downstream analysis and modeling.
The main data structures we will use in pandas are DataFrames, which represent data tables, and Series, which are vectorial representations of data columns.
💡 To read more about both, refer to the [Intro to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html) section of the User Guide.
import numpy as np
import pandas as pd
# Here we're generating random datapoints
n_pts = 50
np.random.seed(42)
category_1 = np.random.choice(['A', 'B', 'C'], size=n_pts)
category_2 = np.random.choice(['Low', 'Medium', 'High'], size=n_pts)
num_1 = np.random.normal(loc=100, scale=15, size=n_pts)
num_2 = np.random.uniform(low=0, high=1, size=n_pts)
dates = pd.date_range(start='2024-01-01', periods=n_pts, freq='D')
# Creating the pandas dataframe
df_test = pd.DataFrame({
    'Category_1': category_1,
    'Category_2': category_2,
    'Date': dates,
    'Value_1': num_1,
    'Value_2': num_2
})
df_test.head() # we can see the first 5 data rows
df_test.tail() # we can see the last 5 data rows
df_test.describe() # quick statistics on the dataframe
Labels, Indexing, and Slicing¶
In pandas, you can use both the [] and . operators to navigate and apply methods on your DataFrame or Series. See the examples below.
# you can look at a specific column as well
df_test.Category_1.head(5)
# you can do it this way too
df_test['Category_1'].head(5)
# You can apply a conditional to build a boolean mask for filtering
df_test['Category_1'] == 'A'
# Let's just have a look at that section of the dataframe
df_test[df_test.Category_1 == 'A']
# You can also use location based indexing
df_test.loc[:, 'Category_1'] # select a column by its label
df_test.loc[0:5, 'Category_1']
# note: .loc slicing is label-based and inclusive, so this returns rows 0 to 5 (6 rows)
# Using slicing and integer indexes
df_test[:5]
# equivalent to df_test.head(5)
df_test.iloc[:5] # select the first 5 rows
df_test.iloc[4:5] # select the row at position 4 (the 5th row)
df_test.iloc[4:5, 4] # select the element at row position 4, column position 4
df_test.iloc[-5:] # select the last 5 rows, equivalent to df_test.tail(5)
🚀 Your turn¶
Now, spend some more time getting familiar with pandas indexing and slicing. You can refer to the indexing section of the [user guide](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing).
You can edit the examples above and create your own.
🤯 Let's add another set to our warm up - aggregations and grouping¶
pandas allows you to run various aggregations, e.g. min, max, and mean, on your DataFrame or on subsets of it.
You can read more about aggregations and grouping in the [User Guide](https://pandas.pydata.org/docs/user_guide/groupby.html).
df_test.groupby('Category_1')['Value_1'].groups
df_test.groupby('Category_1')['Value_1'].median()
df_test.groupby(['Category_1', 'Category_2']).agg({
    'Value_1': ['mean', 'std', 'max'],
    'Value_2': ['median', 'mean']
})
# you can also apply functions. Let's see this example
def range_func(x):
    return x.max() - x.min()
df_test.groupby('Category_2')['Value_1'].apply(range_func)
# You can also apply aggregations to dates
df_test.groupby(df_test['Date'].dt.month)['Value_1'].mean()
df_test.groupby(df_test['Date'].dt.strftime('%B'))['Value_1'].mean()
💪 We now have some basics set. Let's go build our data pipeline 🚀.¶
Let's go to the Module_3 notebook.