myGwork - LGBTQ+ Business Community · 2 days ago
Data Engineer
Responsibilities
Design, develop, and maintain scalable data pipelines using PySpark on Databricks, adhering to best practices and emphasizing software engineering principles.
Implement and optimize stream processing workflows using Kafka for real-time data ingestion and processing.
Utilize Parquet and Avro-formatted data files for efficient storage and retrieval, ensuring data schema compatibility and evolution.
Leverage the Databricks platform on AWS to build and manage data processing workflows and analytics, while adhering to development lifecycle standards.
Harness the power of Databricks Delta Lake and Parquet files for data warehousing, query optimization, and data versioning.
Collaborate closely with data analysts and scientists to understand their requirements and provide reliable and timely data solutions.
Implement robust testing methodologies, including unit testing, integration testing, and end-to-end testing, utilizing Python packages such as pytest.
Contribute to the PySpark/Python ecosystem by creating reusable components, maintaining internal PyPI packages, and evaluating other widely used Python packages.
Monitor data pipelines, identify and resolve issues, and ensure data integrity and quality.
Stay up-to-date with the latest trends and technologies in data engineering, software development, and testing practices, and actively share knowledge with the team.
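The testing responsibility above can be illustrated with a minimal, hypothetical sketch: a pure transformation function factored out of a PySpark job so it can be unit tested with pytest. The function name, event schema, and field names are all invented for illustration and are not from this posting.

```python
# Illustrative sketch only -- not code from the posting. The function
# name, event schema, and field names are all hypothetical.
from datetime import datetime, timezone


def normalize_event(event: dict) -> dict:
    """Normalize one raw Kafka event before it is written to Delta/Parquet.

    Hypothetical contract: 'id' -> string key, 'ts' epoch milliseconds ->
    ISO-8601 UTC timestamp, 'amount' string -> float.
    """
    return {
        "event_id": str(event["id"]),
        "ts": datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc).isoformat(),
        "amount": float(event["amount"]),
    }


def test_normalize_event():
    # The kind of pytest unit test the responsibilities call for:
    # exercise the pure logic without a Spark session or Kafka broker.
    raw = {"id": 42, "ts": 1_700_000_000_000, "amount": "19.99"}
    out = normalize_event(raw)
    assert out["event_id"] == "42"
    assert out["amount"] == 19.99
    assert out["ts"].endswith("+00:00")
```

Keeping row-level logic in plain functions like this, and calling them from the PySpark job (e.g. inside a UDF or a mapped partition), is one common way to make pipelines unit-testable with pytest while reserving integration tests for the Spark/Kafka layers.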
Qualifications
Required
Bachelor's or Master's degree in Computer Science or a related field.
Minimum 5 years of real-world Data Engineering experience working on large-scale data projects.
Strong proficiency in PySpark, Python, and shell scripting, with a focus on software engineering best practices and a deep understanding of the development lifecycle.
Experience working with workflow management tools such as Airflow.
Experience with stream processing technologies, preferably Kafka.
Familiarity with Avro data serialization format and its usage in data engineering workflows.
Expertise in using Databricks platform on AWS for data processing and analytics.
Solid understanding of data warehousing concepts and experience with Delta Lake and Parquet files.
Proficiency in SQL and experience with relational databases.
Strong testing skills, with experience in implementing and executing unit tests, integration tests, and end-to-end tests using Python packages such as pytest.
Familiarity with the Python ecosystem, including PyPI packages and their integration into data engineering workflows.
Excellent problem-solving skills and ability to work in a fast-paced, collaborative environment.
Strong communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.
Working experience with Databricks and PySpark.
Proficiency in writing complex SQL queries.
Working experience with cloud platforms such as AWS or Azure (preferably AWS).
Working experience with Airflow.
Experience working with very large datasets.
Preferred
Experience working with reporting tools such as Tableau.
Past experience working on Machine Learning projects.
Past experience working in finance.
Benefits
Medical care
Insurance
Savings plans
Flexible Work Programs
Development programs
Educational support
Paid volunteer days
Matching gift programs
Employee networks
Company
myGwork - LGBTQ+ Business Community
myGwork is the largest global platform for the LGBTQ+ business community.
Funding
Current Stage: Early Stage
Total Funding: $4.77M
Key Investors: 24 Haymarket, Innovate UK
2023-08-17: Series Unknown · $1.66M
2023-08-17: Grant · Undisclosed
2021-12-07: Series A · $2.12M
Company data provided by Crunchbase.