Senior Java Big Data Engineer Intern (November 2024 Start) @ Flow | Jobright.ai
Flow · 11 hours ago · Fewer than 25 applicants

Senior Java Big Data Engineer Intern (November 2024 Start)

Software Development

Responsibilities

Developing Distributed Data Collection Frameworks: Design and implement scalable web crawling systems using tools such as Apache Nutch, Scrapy, StormCrawler, and Puppeteer to crawl hundreds of millions of web pages across various domains in real time. These systems must be fault-tolerant, highly available, and capable of continuous operation without downtime. Optimize crawlers for speed, data quality, and efficient extraction of relevant web data, including numerical, text, metadata, and structured information (a minimal fetch-and-extract sketch follows this list).
Building Petabyte-Scale Data Storage Systems: Work with distributed storage systems like Hadoop HDFS, Google Cloud Storage, and Cassandra to store and manage massive datasets. Implement data partitioning, replication, and sharding strategies to ensure the scalability and reliability of the storage infrastructure. Develop solutions for efficient data retrieval and handling, using columnar data formats like Apache Parquet or Apache ORC (a partitioned-write sketch follows this list).
Implementing High-Throughput Data Ingestion Pipelines: Build robust data ingestion frameworks using Apache Kafka, Apache Pulsar, and Apache NiFi to process and stream real-time data from web crawlers and other data sources. Enhance ETL pipelines for data processing and transformation, ensuring high-throughput, low-latency ingestion into the data lake (a Kafka producer sketch follows this list).
Scaling Data Processing Frameworks: Implement and optimize distributed data processing tasks using frameworks like Apache Spark, Apache Flink, and Apache Beam. Ensure that data processing jobs are fault-tolerant, scalable, and efficient for both batch and stream workloads. Focus on optimizing in-memory operations and distributed computations for high-speed data analytics at scale (a Spark aggregation sketch follows this list).
Advanced Data Mining and Artificial Intelligence: Develop data mining and AI models to analyze the large datasets collected from web crawls, focusing on extracting valuable insights such as trends, behaviors, and predictions. Utilize tools like TensorFlow, PyTorch, Apache Mahout, and Scikit-learn to create scalable models for lead generation and predictive analytics (a model-training sketch follows this list).
Data Transformation and Warehousing: Work on data modeling and transformation within large-scale data warehouses using tools like Apache Hive, Google BigQuery, and Presto. Design and implement efficient querying solutions on petabyte-scale datasets, ensuring low-latency access and seamless integration with the data lake (a warehouse-query sketch follows this list).
Building and Managing Scalable Infrastructure: Collaborate on managing distributed-systems infrastructure using container orchestration tools like Kubernetes, Apache Mesos, and Docker. Ensure that systems can scale efficiently, both vertically and horizontally, to handle increasing volumes of data and workload. Use infrastructure-as-code (IaC) tools such as Terraform and Ansible to automate deployment, monitoring, and management of cloud-based infrastructure (a Kubernetes scaling sketch follows this list).
Performance Monitoring and Optimization: Set up monitoring and alerting solutions using tools like Prometheus, Grafana, the ELK Stack, and OpenTelemetry to track system performance and ensure optimal uptime. Fine-tune and optimize system components to achieve high availability, reduce latency, and improve the overall throughput of data processing workflows (a metrics-instrumentation sketch follows this list).
Collaborative Team Development: Collaborate with senior engineers and cross-functional teams to design, implement, and deploy solutions. Participate in code reviews, architectural discussions, and product roadmap planning. Document workflows, processes, and best practices for scalable, maintainable code development.
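
The sketches below illustrate, in Java, the kind of building blocks these responsibilities describe. They are minimal examples under stated assumptions, not Flow's actual stack. First, the crawling loop: frameworks like StormCrawler and Nutch distribute a fetch-parse-extract cycle across many workers. A single-worker sketch using the jsoup library (the seed URL, user agent, and page cap are illustrative assumptions):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlWorker {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");           // hypothetical seed URL

        while (!frontier.isEmpty() && visited.size() < 100) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;            // skip already-fetched pages

            // Fetch and parse one page; a distributed crawler runs many of
            // these loops in parallel, with politeness and retry policies.
            Document doc = Jsoup.connect(url)
                                .userAgent("demo-crawler/0.1")
                                .timeout(10_000)
                                .get();

            System.out.println(doc.title() + " <- " + url);

            // Extract outlinks and feed them back into the frontier.
            for (Element link : doc.select("a[href]")) {
                String next = link.absUrl("href");
                if (next.startsWith("http")) frontier.add(next);
            }
        }
    }
}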
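
For the storage responsibility, date-based directory partitioning is one common strategy, since it lets downstream queries prune whole partitions; a production pipeline would typically write columnar Parquet or ORC rather than the raw JSON shown here. A minimal sketch using the Hadoop FileSystem API (the namenode address and paths are hypothetical):

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; HDFS handles block replication itself.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Partition by crawl date so queries can skip irrelevant directories.
        Path partition = new Path("/data/crawl/dt=2024-11-01");
        try (FSDataOutputStream out = fs.create(new Path(partition, "part-00000.json"))) {
            out.write("{\"url\":\"https://example.com/\",\"status\":200}\n"
                      .getBytes(StandardCharsets.UTF_8));
        }
    }
}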
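
For ingestion, a Kafka producer that keys records by URL illustrates the streaming handoff from crawlers into the data lake; the broker address, topic name, and payload are hypothetical:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
        props.put("acks", "all");                         // wait for full replication
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by URL routes all versions of a page to one partition,
            // preserving per-page ordering.
            producer.send(
                new ProducerRecord<>("crawled-pages", "https://example.com/",
                                     "{\"status\":200}"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        }
    }
}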
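
For distributed processing, a batch aggregation over the crawled data using Spark's Java Dataset API; the input path and column names are assumptions, and the same API handles streaming sources via spark.readStream() with few changes:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class PageStats {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                                         .appName("page-stats")
                                         .getOrCreate();

        // Read the (hypothetical) partitioned Parquet table written upstream.
        Dataset<Row> pages = spark.read().parquet("hdfs://namenode:8020/data/crawl");

        // Count pages per domain and show the 20 largest domains.
        pages.groupBy(col("domain"))
             .count()
             .orderBy(col("count").desc())
             .show(20);

        spark.stop();
    }
}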
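
For the data mining responsibility, the posting names TensorFlow, PyTorch, Mahout, and Scikit-learn; to stay in Java, this sketch substitutes Spark MLlib (a deliberate swap, not the posting's stated toolchain) to train a simple lead-scoring classifier on a hypothetical labeled dataset:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeadScorer {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                                         .appName("lead-scoring")
                                         .getOrCreate();

        // Hypothetical labeled data: a "label" column plus a "features" vector.
        Dataset<Row> training = spark.read().format("libsvm")
                                     .load("hdfs://namenode:8020/data/leads.libsvm");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.01);

        // fit() distributes training across the cluster; transform() scores rows.
        LogisticRegressionModel model = lr.fit(training);
        model.transform(training).select("features", "prediction").show(5);
    }
}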
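
For warehousing, a plain JDBC query against HiveServer2 shows partition-pruned access to a large table; the endpoint, credentials, and table name are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WarehouseQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; requires the Hive JDBC driver.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-server:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // Filtering on the partition column (dt) lets the engine skip
             // every directory except the one requested.
             ResultSet rs = stmt.executeQuery(
                     "SELECT domain, COUNT(*) AS pages "
                     + "FROM crawled_pages WHERE dt = '2024-11-01' "
                     + "GROUP BY domain ORDER BY pages DESC LIMIT 20")) {
            while (rs.next()) {
                System.out.printf("%s\t%d%n", rs.getString("domain"), rs.getLong("pages"));
            }
        }
    }
}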
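
For infrastructure, horizontal scaling of a crawler Deployment through the Fabric8 Kubernetes client, one possible Java client library; the namespace and deployment names are hypothetical, and declarative IaC tools like Terraform would normally manage the cluster itself:

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ScaleCrawlers {
    public static void main(String[] args) {
        // Connects using the local kubeconfig.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.apps().deployments()
                  .inNamespace("data-pipeline")
                  .withName("crawler-worker")
                  .scale(10);   // horizontal scaling: 10 crawler replicas
        }
    }
}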
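
Finally, for monitoring, instrumenting a worker with the Prometheus Java simpleclient exposes counters and latency histograms for scraping; the port and metric names are illustrative:

import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class CrawlerMetrics {
    static final Counter PAGES = Counter.build()
            .name("pages_crawled_total").help("Pages fetched, by HTTP status.")
            .labelNames("status").register();

    static final Histogram LATENCY = Histogram.build()
            .name("fetch_latency_seconds").help("Page fetch latency.")
            .register();

    public static void main(String[] args) throws Exception {
        // Exposes /metrics on a hypothetical port for Prometheus to scrape.
        HTTPServer server = new HTTPServer(9091);

        Histogram.Timer timer = LATENCY.startTimer();
        // ... fetch a page here ...
        timer.observeDuration();
        PAGES.labels("200").inc();
    }
}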

Qualifications

Java, Distributed Systems, Big Data Frameworks, Data Engineering, Cloud Infrastructure, Hadoop, Spark, Kafka, Kubernetes, Google Cloud, NoSQL Databases, Web Crawlers, ETL Pipelines, Machine Learning, Data Warehousing, Infrastructure as Code, Monitoring Tools, Cassandra, HBase, DynamoDB, Scrapy, Apache Nutch, StormCrawler, Selenium, Data Transformation, TensorFlow, PyTorch, Apache Mahout, Scikit-learn, Apache Hive

Required

Recently graduated with a Master's degree or PhD in Computer Science, specializing in distributed systems or data engineering.
3+ years of experience in data engineering, distributed systems engineering, or big data systems.
Experience with distributed systems, big data frameworks, and cloud infrastructure (Hadoop, Spark, Kafka, Kubernetes, etc.).
Expert in Java development, with extensive experience in data processing and back-end engineering.
Expert in cloud platforms such as Google Cloud or Azure, with experience working with distributed storage solutions.
Experience in building and scaling distributed web crawlers or data extraction frameworks using tools like Scrapy, Apache Nutch, StormCrawler, and Selenium.
Expert in database systems, with extensive experience in NoSQL databases such as Cassandra, HBase, or DynamoDB.
Strong problem-solving skills and the ability to handle complex distributed architectures.
Ability to work independently and in a collaborative team environment.

Benefits

Remote-native; location freedom
Professional experience in the SaaS and AI industries
Creative freedom
Potential to convert into a full-time position

Company

Flow

Flow Global Software Technologies, LLC, operating in the Information Technology (IT) sector, is a high-tech enterprise AI company engaged in the design, engineering, marketing, sales, and support of a cloud-based enterprise AI platform built on patent-pending artificial intelligence, machine learning, and other core proprietary technologies.

Funding

Current Stage
Early Stage
Company data provided by Crunchbase