Getting back into parallel computing with Apache Spark has been great, and it has been interesting to see the BSP (Bulk Synchronous Parallel) model of McColl and Valiant finally becoming mainstream beyond GPUs. While Spark takes some effort to set up on actual clusters and carries overhead, I expect these costs to shrink as Spark becomes more and more efficient over time. I have started a GitHub repo of Spark snippets in case any are of interest as Apache Spark moves forward 'in parallel' with HDFS (the Hadoop Distributed File System).
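As a flavor of the kind of small snippet collected there, here is a minimal PySpark sketch (an illustrative example, not taken from the repo): a classic map/reduce word count over an in-memory collection.

```python
# Illustrative PySpark snippet: a simple word count, the sort of small
# self-contained example the snippets repo is meant to collect.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
sc = spark.sparkContext

# Distribute a small in-memory collection as an RDD
lines = sc.parallelize([
    "bulk synchronous parallel",
    "apache spark on hadoop hdfs",
    "parallel computing with spark",
])

# Classic map/reduce word count
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print(word, count)

spark.stop()
```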
Technical notes on past publications and work by Darrell Ulm, including Apache Spark, software development, computer programming, parallel computing, algorithms, Koha, and Drupal, along with source code snippets (such as Python for Spark) and retrospectives of past projects.