Mastering Spark on K8s 🔥 and Why I Dumped 💔 Kubeflow Spark Operator (Formerly Google’s Spark Operator)!

🌟 FREE full access on: LovinData — Simplified Full Stack Data Engineering

James JIANG
13 min read · Jun 22, 2024


Heyoooo Spark ⚡ developers! Several months ago, my product manager asked me one question: “Is it possible to run Spark applications without K8s 🐳 cluster-level access?” At the time, the Kubeflow 🔧 Spark Operator was the only option I knew well, and I was using it to deploy all my Spark applications. For those who know it, you must have K8s cluster-level access to use the Kubeflow Spark Operator, because it installs CRDs and a ClusterRole, both of which are cluster-scoped resources. So I told him “no” and gave him these reasons, and on his side, he tried his best to convince the prospect with that constraint in mind.

Enterprises usually run a multi-tenant K8s cluster segregated by company/department, project, and environment (dev, uat, pre-prod, or prod) using Namespaces, and a tenant typically only gets permissions inside its own Namespaces. This way, they make the most of the computing resources allocated. Plus, if a project does not meet expectations or the contract ends, hop hop kubectl delete namespace <compordept>-<project>-<env> and it's as if the project never existed. I am writing this to tell my product manager: “Yes, it's possible to run Spark applications without K8s cluster-level access!” Here is how! 🚀


James JIANG

Dedicated 👨🏻‍💻 Full-Stack Data Engineer | Lead Data Engineer @ Datategy (🇫🇷) | 🎓 ENSEEIHT Alumni in Computer Science | Committed to data-driven solutions.