Responsibilities:
Design, develop, and maintain scalable, reliable data pipelines for extracting, transforming, and loading (ETL) feature data.
Troubleshoot and resolve feature-related issues impacting ML models or downstream applications.
Work closely with data scientists and analysts to understand their data requirements and build or support features that power machine learning models.
Debug and optimize data pipelines to ensure high data quality and availability.
Contribute to data modeling and architecture to improve the performance of our data lakehouse and data warehouse environments.
Provide technical guidance and support to stakeholders regarding data querying, access, and usage.
Debug and optimize Spark applications running on Amazon EMR.
Qualifications:
In-depth knowledge of data warehouse and data lake architectures, including storage and processing engines.
Proficiency in data modeling and experience with large-scale data processing systems such as Apache Spark and Flink.
Strong experience with AWS cloud services, particularly in data-related technologies (e.g., EMR, S3, Redshift).
Familiarity with feature store concepts and experience in building and maintaining feature stores.
A good understanding of real-time data processing is a plus.
Excellent problem-solving skills and the ability to debug complex data issues.
Strong communication skills to effectively collaborate with cross-functional teams and support stakeholders' data needs.