About the Role
Summary
The Data Engineer in this role builds and maintains our data infrastructure, ensuring that data from various sources is efficiently collected, processed, and stored. They take ownership of smaller projects and collaborate with senior engineers on larger initiatives.
Responsibilities
Design, build, and maintain efficient data pipelines (ETL/ELT processes) using Apache Spark on AWS EMR to integrate data from various source systems into our data warehouse and data lake.
Build and optimize batch and near real-time data processing jobs on Spark (including performance tuning, partitioning, and cost control on EMR) to support analytics and reporting needs.
Write and refine complex SQL queries and use scripting (e.g., Python) to transform and aggregate large datasets.
Implement data quality measures (such as validation checks and cleansing routines) to ensure data integrity and reliability.
Develop and optimize data warehouse schemas and tables, and define contracts between Spark/EMR pipelines, Iceberg tables, Snowflake models, and dbt transformations.
Collaborate with data analysts, data scientists, and other engineers to understand data requirements and deliver appropriate solutions.
Document pipeline designs, data flows, and data definitions for transparency and future reference, adhering to team standards.
Handle multiple tasks or projects simultaneously, prioritizing work and communicating progress to stakeholders to meet deadlines.
Qualifications
Bachelor’s or Master’s degree in a relevant field (e.g., Computer Science, Mathematics, Physics).
At least 3 years of experience in a data engineering or similar backend data development role.
Strong SQL skills and experience with data modeling and building data warehouse solutions.
Proficiency in at least one programming language (e.g., Python) for data processing and pipeline automation.
Familiarity with ETL tools and workflow orchestration frameworks (e.g., Apache Airflow or similar).
Experience implementing data quality checks and working with large-scale datasets.
Strong problem-solving abilities, along with the communication and teamwork skills needed to work with cross-functional stakeholders.
Tech Stack
data engineering, ETL, Apache Spark, AWS EMR, Python, SQL, data pipelines, Snowflake, dbt, Airflow