Getting back into parallel computing with Apache Spark

Getting back into parallel computing with Apache Spark has been great, and it has been interesting to see the McColl and Valiant BSP (Bulk Synchronous Parallel) model finally start becoming mainstream beyond GPUs.

While Spark can take some effort to set up on actual clusters and does have some overhead, I think these costs will be optimized over time and Spark will become more and more efficient.

I have started a GitHub repo for Spark snippets, in case any are of interest as Apache Spark moves forward 'in parallel' with HDFS (the Hadoop Distributed File System).

