Job Description
Location: Remote
Position Type: Contract to Hire
POSITION SUMMARY:
• We are seeking an experienced and motivated Data Engineer to join a Lean/Agile team building and supporting our data science and analytics operational platform.
• As an engineer experienced in data extraction, transformation, and persistence, you will design and implement various components of our data science collaboration and deployment platform.
• Working closely with Data Science and Analytics professionals, you will develop automated, streaming data pipelines for event capture, transformation, and feature extraction to support the machine learning process.
• The industry changes rapidly, so we are looking for candidates who can respond to change, pick up new technologies quickly, and adapt to shifting requirements.
• We also want candidates who are production-oriented and committed to quality.
PRINCIPAL DUTIES AND RESPONSIBILITIES:
• Build and maintain event capture/transformation flows, feature repositories, data caches for real-time analytics, and related components.
• Develop data pipelines that can be leveraged in both model training and production execution.
• Collaborate with Data Architecture and other Data Engineering groups, focusing on operationalizing data flows in service of the data science and analytics teams.
• Develop code to extract value from various structured, semi-structured, and unstructured data sources, creating refined data repositories for ease of analysis.
MINIMUM JOB REQUIREMENTS:
• 5+ years of experience in a data-related field
• Strong Python data skills, including Pandas and XML/JSON parsing
• Experience with AWS cloud technologies, including S3 and EC2
• Strong SQL skills and the ability to adapt them across multiple relational technologies and some NoSQL technologies (e.g., SAS PROC SQL, Microsoft SQL Server, Snowflake, Dynamics)
• Experience with the following technologies is a plus:
  • Redshift, Hive, Spark SQL, etc.
  • Additional languages such as Java or Scala
  • ETL tools such as Informatica, Pentaho, SAP, etc.
  • Messaging systems such as Amazon Kinesis or Apache Kafka
  • AWS technologies such as Glue and DynamoDB
  • Apache Spark or PySpark
  • Workflow scheduling tools such as Apache Airflow, Windows Task Scheduler, or Luigi
• Experience calling third-party REST APIs and working with JSON data