Databricks Specialist — Spark & Python/Java/Scala Expert

Work remotely with OpenTrain as a Databricks Specialist optimizing large-scale Spark ETL and data pipelines; contract, part-time role at $12/hr, 20+ hours/week. Candidates must have hands-on Databricks experience and deep Apache Spark expertise.

Coding & Software

100% Remote · Hourly · $12/hr

Compensation: $12/hr
Eligibility: Worldwide
Experience: Entry
Posted: Nov 12, 2024

Interested in this role?

Create a free OpenTrain account and apply in minutes.

About OpenTrain

OpenTrain is the #1 platform for finding and building careers in AI training and data labeling. We connect people with projects that teach and improve AI systems, letting contributors start and grow careers in a fast-moving industry.

We focus on flexible, remote opportunities that let contributors work part-time or full-time on impactful tasks — from annotating data to reviewing and improving code that trains models.

About AI training work in data and code

AI training (also called data labeling or human feedback work) is the human layer behind modern AI systems. For code and data engineering projects, contributors design, validate, and optimize the pipelines and examples models learn from.

This role sits at the intersection of data engineering and model-ready dataset production: your work helps ensure large-scale processing is correct, efficient, and reproducible.

The role

We are hiring Databricks Specialists to develop, optimize, and troubleshoot large-scale Spark-based data processing systems. This is a contract, part-time position expected to run 20+ hours per week and is open worldwide (remote).

Pay and employment: hourly pay at USD 12/hour; employment types: Contractor, Part-time. Projects involve code-focused labeling and programming work on Databricks and related data pipelines.

What you'll do

Design and optimize ETL pipelines and data workflows in Databricks using Apache Spark.
Develop efficient data-processing jobs and apply performance tuning to handle large datasets.
Analyze, debug, and test large code bases and Databricks notebooks for correctness and scalability.
Collaborate remotely with a distributed team and document fixes and optimizations clearly.
Navigate and interpret complex technical documentation to implement solutions.

Requirements

Candidates must meet every substantive requirement below when applying.

Minimum of 5 years of hands-on experience working with Databricks.
Deep expertise in Apache Spark, including building, optimizing, and troubleshooting Spark-based data processing systems.
At least 5 years of experience in one or more of: Python, Java, SQL, Scala, or Spark (you must state which language(s) you know and how many years of experience you have with each).
Strong experience building and optimizing data pipelines and ETL processes.
Hands-on experience with data processing, analysis, and performance optimization in Databricks.
Experience analyzing, debugging, and testing large code bases and navigating complex documentation.
Familiarity with cloud platforms like Azure or AWS is preferred.
Ability to work independently, solve complex technical problems, and communicate clearly in a remote team.
English level: B1 or B2.

Interview, test questions, and application instructions

When you apply, include: (1) the programming language(s) you know and the exact years of experience for each language, and (2) how many hours per week you are available for this project. We will include your stated weekly availability in the interview summary.

The final evaluation includes a live chat interview and the two technical test questions below. The live chat interview does not end until the candidate has answered both test questions; scoring evaluates correctness and completeness.

TEST QUESTION 1 (Databricks Debugging and Optimization): You are given a PySpark job in Databricks that processes a large dataset but keeps failing with an OutOfMemoryError. What steps would you take to debug and resolve this issue? Provide specific adjustments or optimizations you would apply in Da

TEST QUESTION 2 (Code Review and Documentation): Review this Databricks Python snippet:

    data = spark.read.csv("/path/to/file.csv", header=True)
    filtered_data = data.filter(data["column"] > 100)
    result = filtered_data.groupBy("category").count()
    result.show()

Identify issues or improvement areas, des

Who should apply

Apply if you are an experienced data engineer or developer with deep, hands-on Databricks and Apache Spark experience and can show practical examples of optimizing large-scale ETL or Spark jobs.

This role also suits candidates who prefer flexible, remote contract work and can commit to at least 20 hours per week while providing clear documentation and communication.