Darrell Ulm | Computer Science

Posts

Showing posts from 2017

Catch up on Drupal and Ubuntu Linux Posts

Catching up here on Ubuntu 16.04 Linux setup and also Drupal posts, below are some links of compilations of documentation, tutorials and threads about web-development topics. Drupal 8 Development in PHP Migration Tutorials for Drupal 8 (from Drupal 7 primarily or other systems) Technical Notes for Config of Drupal 7 For Ubuntu 16 Setup Notes for Web Development System There are some I missed on the Tumblr site, but I can add more later when time.

Getting back into parallel computing with Apache Spark

Returning to parallel computing with Apache Spark has been insightful, especially observing the increasing mainstream adoption of the McColl and Valiant BSP (Bulk Synchronous Parallel) model beyond GPUs. This structured approach to parallel computation, with its emphasis on synchronized supersteps, offers a practical framework for diverse parallel architectures.While setting up Spark on clusters can involve effort and introduce overhead, ongoing optimizations are expected to enhance its efficiency over time. Improvements in data handling, memory management, and query execution aim to streamline parallel processing.A GitHub repository for Spark snippets has been created as a resource for practical examples. As Apache Spark continues to evolve in parallel with the HDFS (Hadoop Distributed File System), this repository intends to showcase solutions leveraging their combined strengths for scalable data processing.

Scala Version of Approximation Algorithm for Knapsack Problem for Apache Spark

This is the Scala version of the approximation algorithm for the knapsack problem using Apache Spark. I ran this on a local setup, so it may require modification if you are using something like a Databricks environment. Also you will likely need to setup your Scala environment. All the code for this is at GitHub First, let's import all the libraries we need. import org.apache.spark._ import org.apache.spark.rdd.RDD import org.apache.spark.SparkConf import org.apache.spark.SparkContext._ import org.apache.spark.sql.DataFrame import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.sum We'll define this object knapsack, although it could be more specific for what this is doing, it's good enough for this simple test. object knapsack { Again, we'll define the knapsack approximation algorithm, expecting a dataframe with the profits and weights, as well as W, a total weight. def knapsackApprox(knapsackDF: DataFrame, W: Double): Da...

Apache Spark Knapsack Approximation Algorithm in Python

The code shown below computes an approximation algorithm, greedy heuristic, for the 0-1 knapsack problem in Apache Spark. Having worked with parallel dynamic programming algorithms a good amount, wanted to see what this would look like in Spark. The Github code repo. for the Knapsack approximation algorithms is here , and it includes a Scala solution. The work on a Java version is in progress at time of this writing. Below we have the code that computes the solution that fits within the knapsack W for a set of items each with it's own weight and profit value. We look to maximize the final sum of selected items profits while not exceeding the total possible weight, W. First we import some spark libraries into Python. # Knapsack 0-1 function weights, values and size-capacity. from pyspark.sql import SparkSession from pyspark.sql.functions import lit from pyspark.sql.functions import col from pyspark.sql.functions import sum Now define the function, which will take a Spark ...

A way to Merge Columns of DataFrames in Spark with no Common Column Key

Made post at Databricks forum, thinking about how to take two DataFrames of the same number of rows and combine, merge, all columns into one DataFrame. This is straightforward, as we can use the monotonically_increasing_id() function to assign unique IDs to each of the rows, the same for each Dataframe. It would be ideal to add extra rows which are null to the Dataframe with fewer rows so they match, although the code below does not do this. Once the IDs are added, a DataFrame join will merge all the columns into one Dataframe. # For two Dataframes that have the same number of rows, merge all columns, row by row. # Get the function monotonically_increasing_id so we can assign ids to each row, when the # Dataframes have the same number of rows. from pyspark.sql.functions import monotonically_increasing_id #Create some test data with 3 and 4 columns. df1 = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar...

Darrell Ulm | Computer Science | Notes and Blog

Search the Posts