spark-engineer
Use when building Apache Spark applications, distributed data processing pipelines, or optimizing big data workloads. Invoke for the DataFrame API, Spark SQL, RDD operations, performance tuning, and streaming analytics.
Install via ClawdBot CLI:
clawdbot install Veeramanikandanr48/spark-engineer
Grade: Fair — based on market validation, documentation quality, package completeness, maintenance status, and authenticity signals.
Generated Mar 22, 2026
Build a Spark Structured Streaming pipeline to process high-volume transaction data, detect anomalies using machine learning models, and flag fraudulent activity in real time. Optimize for low latency and handle data skew from high-frequency transactions.
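As a minimal sketch of the scoring step such a pipeline might apply — inside a pandas UDF or a `foreachBatch` handler over a sliding window — here is a plain-Python z-score flagger. The threshold, field semantics, and function names are illustrative assumptions, not part of the skill:

```python
# Anomaly-scoring sketch: flag a transaction whose amount lies more than
# `threshold` standard deviations from the batch mean. In a real
# Structured Streaming job this logic would run inside a pandas UDF or
# foreachBatch over a windowed micro-batch; it is plain Python here so
# the statistics are easy to follow. All names are illustrative.
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.5):
    """Return a list of booleans: True where the amount is anomalous."""
    if len(amounts) < 2:
        return [False] * len(amounts)
    mu, sigma = mean(amounts), stdev(amounts)
    if sigma == 0:
        return [False] * len(amounts)
    return [abs(a - mu) / sigma > threshold for a in amounts]

# One large outlier among routine transactions gets flagged.
flags = flag_anomalies([20, 25, 22, 18, 21, 23, 19, 24, 20, 5000])
```

Note the threshold matters: a single huge outlier inflates the standard deviation ("masking"), so a stricter cutoff like 3.0 can miss it in small batches; robust statistics (median/MAD) are a common refinement.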
Design a scalable ETL pipeline using Spark DataFrame API to ingest and transform petabytes of customer behavior data from web logs and databases. Implement partitioning and caching strategies to accelerate daily batch processing for personalized recommendations.
Develop a Spark application to aggregate and analyze streaming sensor data from industrial equipment, performing real-time aggregations and predictive maintenance. Use performance tuning to manage large shuffle operations and optimize cluster resource usage.
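One standard fix for shuffle skew from a hot sensor is key salting: append a random suffix so one key's records spread over several shuffle partitions, then aggregate twice. In Spark this is typically built with `F.concat(col("id"), lit("_"), (rand() * N).cast("int"))`; the two-stage rollup is sketched here in plain Python, with all names illustrative:

```python
# Key-salting sketch for skewed aggregations: stage 1 sums per salted
# key (spreading a hot key across partitions), stage 2 strips the salt
# and sums the partials per original key. Plain Python stand-in for the
# equivalent two groupBy passes in Spark; names are illustrative.
import random

def skew_safe_sum(records, n_salts=4, seed=0):
    """records: iterable of (key, value) pairs. Returns {key: total}."""
    rng = random.Random(seed)
    stage1 = {}                       # partial sums per salted key
    for key, value in records:
        salted = f"{key}_{rng.randrange(n_salts)}"
        stage1[salted] = stage1.get(salted, 0) + value
    stage2 = {}                       # final sums per original key
    for salted, partial in stage1.items():
        key = salted.rsplit("_", 1)[0]
        stage2[key] = stage2.get(key, 0) + partial
    return stage2

# A "hot" sensor with 1000 readings no longer lands on one reducer.
totals = skew_safe_sum([("hot", 1)] * 1000 + [("cold", 5)])
```

The trade-off is an extra aggregation pass, which is almost always cheaper than one straggler task holding the whole stage.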
Create a Spark SQL-based pipeline to process user viewing history and content metadata, generating collaborative filtering models for recommendations. Handle data skew in popular content and use broadcast joins for small dimension tables.
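Conceptually, a broadcast join ships the small dimension table to every executor so the large fact table is never shuffled; each executor's work reduces to a local hash lookup. In Spark SQL you request this with `broadcast(dim)` in the join (or via `spark.sql.autoBroadcastJoinThreshold`); the map-side mechanics look like this in plain Python, with illustrative table contents:

```python
# Broadcast-style (map-side) join sketch: the small dimension table is
# materialized as a dict — the thing Spark broadcasts to every executor —
# and each row of the large fact table is enriched by a local hash
# lookup, so the large side never shuffles. Contents are illustrative.
content_dim = {            # small dimension table: content_id -> title
    "c1": "Documentary A",
    "c2": "Series B",
}

views = [                  # large fact table: (user_id, content_id)
    ("u1", "c1"),
    ("u2", "c2"),
    ("u3", "c1"),
]

enriched = [(user, cid, content_dim.get(cid, "unknown"))
            for user, cid in views]
```

This is why broadcast joins pair well with skewed fact tables: a hot `content_id` costs the same lookup as a cold one, whereas a shuffle join would pile the hot key onto one partition.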
Implement a Spark pipeline to merge and clean diverse healthcare datasets (e.g., EHR, genomic data) using optimized joins and schema enforcement. Ensure compliance with data privacy by minimizing data movement and using efficient transformations.
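Schema enforcement here means rejecting malformed rows before they can silently corrupt a join. In Spark you would declare a `StructType` and read with `mode="FAILFAST"`; the same idea is sketched below as a plain-Python gate, with field names and types as illustrative assumptions for an EHR extract:

```python
# Schema-enforcement sketch: validate each record against a declared
# schema before merging sources, so a row missing a join key is dropped
# (or quarantined) instead of propagating nulls downstream. Plain-Python
# stand-in for a Spark StructType + FAILFAST read; names illustrative.
EHR_SCHEMA = {"patient_id": str, "visit_date": str, "diagnosis_code": str}

def conforms(record, schema=EHR_SCHEMA):
    """True iff the record has exactly the declared fields and types."""
    return (set(record) == set(schema)
            and all(isinstance(record[f], t) for f, t in schema.items()))

rows = [
    {"patient_id": "p1", "visit_date": "2026-01-02", "diagnosis_code": "E11"},
    {"patient_id": "p2", "visit_date": "2026-01-03"},   # missing field
]
valid = [r for r in rows if conforms(r)]
```

Quarantining the rejects to a separate table, rather than discarding them, keeps the pipeline auditable, which matters in regulated healthcare settings.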
Offer processed and enriched datasets to clients using Spark pipelines, enabling them to access clean, aggregated data for analytics without managing infrastructure. Revenue comes from subscription fees based on data volume and freshness.
Provide expert services to companies struggling with slow or costly Spark applications, offering performance tuning, code reviews, and cluster optimization. Revenue is generated through project-based contracts or hourly rates.
Develop and sell a SaaS platform that leverages Spark for real-time data processing and dashboards, targeting industries like finance or retail. Revenue streams include tiered licensing and premium support services.
💬 Integration Tip
Integrate with cloud storage like AWS S3 or Azure Data Lake for scalable data ingestion, and use orchestration tools such as Apache Airflow to schedule and monitor Spark jobs efficiently.
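A hypothetical Airflow DAG for that setup might look like the sketch below. The S3 paths, connection id, and schedule are illustrative assumptions, and it requires the `apache-airflow` and `apache-airflow-providers-apache-spark` packages; treat it as a configuration sketch, not a drop-in file:

```python
# Hypothetical Airflow DAG: one nightly Spark batch over data in S3.
# Paths, connection ids, and the application script are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="nightly_behavior_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",            # one Spark batch per day
    catchup=False,
) as dag:
    etl = SparkSubmitOperator(
        task_id="spark_etl",
        application="s3://my-bucket/jobs/behavior_etl.py",  # assumed path
        conn_id="spark_default",
        application_args=["--date", "{{ ds }}"],  # Airflow's logical run date
    )
```

Passing the run date via `{{ ds }}` keeps the Spark job idempotent per day, which is what makes Airflow retries and backfills safe.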
Scored Apr 15, 2026
Use the @steipete/oracle CLI to bundle a prompt plus the right files and get a second-model review (API or browser) for debugging, refactors, design checks, or cross-validation.
Local search/indexing CLI (BM25 + vectors + rerank) with MCP mode.
Design data models for construction projects. Create entity-relationship diagrams, define schemas, and generate database structures.
MarkItDown is a Python utility from Microsoft for converting various files (PDF, Word, Excel, PPTX, Images, Audio) to Markdown. Useful for extracting structu...
Connect to Supabase for database operations, vector search, and storage. Use for storing data, running SQL queries, similarity search with pgvector, and managing tables. Triggers on requests involving databases, vector stores, embeddings, or Supabase specifically.
Use when designing database schemas, writing migrations, optimizing SQL queries, fixing N+1 problems, creating indexes, setting up PostgreSQL, configuring EF Core, implementing caching, partitioning tables, or any database performance question.