

ResearcherID for Darrell Ulm Site

Another good site for listing research work is ResearcherID, this one for Darrell Ulm. It is very similar to ORCID, and I am still figuring out the differences between the two. It looks like ResearcherID links into other resources, such as reviewing efforts.
Recent posts

Python for Data Science

Looking at more resources online for Python for Data Science.

There are many good resources available.

Of course the main tools are NumPy, Pandas, Matplotlib, and scikit-learn, which has some amazing tools.

Kaggle, for instance, has data science contests, but it is good to install a local system like Jupyter Notebook to speed things up, as the Kaggle editor can lag and take some time to run even on small data sets.

The newer DataCamp has some neat tutorials and a simple app for doing daily exercises on your mobile device.

Here is the Python Data Science Handbook, which is really useful.

A short tutorial: Learn Python for Data Science, a fun read.

A list of good data science tutorials is here, and another covers how to get started with Python for data science.

Will add more later.

Linux, Drupal, PHP, Technical Notes from Tumblr

Catch up on Drupal and Ubuntu Linux Posts

Catching up here on Ubuntu 16.04 Linux setup and also on Drupal posts. Below are some links to compilations of documentation, tutorials, and threads about web development topics.

Drupal 8 Development in PHP
Migration Tutorials for Drupal 8 (primarily from Drupal 7, or from other systems)
Technical Notes for Configuring Drupal 7
Setup Notes for an Ubuntu 16 Web Development System
There are some I missed on the Tumblr site, but I can add more later when time allows.

Getting back into parallel computing with Apache Spark

Getting back into parallel computing with Apache Spark has been great, and it has been interesting to see the McColl and Valiant BSP (Bulk Synchronous Parallel) model finally start becoming mainstream beyond GPUs.

While Spark can take some effort to set up on actual clusters and does have some overhead, I think these costs will be optimized over time, and Spark will become more and more efficient.
I have started a GitHub repo for Spark snippets if any are of interest as Apache Spark moves forward 'in parallel' to the HDFS (Hadoop Distributed File System).
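As a minimal illustration of the BSP-style pattern in Spark (a parallel compute phase followed by a global combine acting as the synchronization step), here is a small sketch that runs in local mode. The object and method names are just for this example, and it assumes a standard spark-sql dependency on the classpath; it is not code from the repo mentioned above.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object, not part of the snippets repo.
object SparkSketch {
  def sumOfSquares(n: Int): Long = {
    // Local-mode session for experimentation; a real cluster would use a
    // different master URL.
    val spark = SparkSession.builder()
      .appName("snippet")
      .master("local[*]")
      .getOrCreate()
    try {
      // One BSP-style superstep: a parallel computation (map) followed by a
      // global combine (reduce) that acts as the synchronization point.
      val rdd = spark.sparkContext.parallelize(1L to n.toLong)
      rdd.map(x => x * x).reduce(_ + _)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(sumOfSquares(100)) // prints 338350
}
```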

Scala Version of Approximation Algorithm for Knapsack Problem for Apache Spark

This is the Scala version of the approximation algorithm for the knapsack problem using Apache Spark.

I ran this on a local setup, so it may require modification if you are using something like a Databricks environment. You will also likely need to set up your Scala environment.

All the code for this is on GitHub.

First, let's import all the libraries we need.

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

We'll define the object knapsack. Although the name could be more specific to what this is doing, it's good enough for this simple test.

object knapsack {
Next, we'll define the knapsack approximation function, which expects a DataFrame with the profits and weights, as well as W, the total weight capacity.

def knapsackApprox(knapsackDF: DataFrame, W: Double): DataFrame = {
Calculate t…
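To show the underlying idea, here is a plain-Scala sketch of the greedy knapsack approximation: sort items by profit/weight ratio, then take items in that order while they still fit in the remaining capacity. The names here are illustrative assumptions for this sketch; the repo's actual version operates on a Spark DataFrame instead of a local collection.

```scala
// Hypothetical plain-Scala sketch of the greedy knapsack approximation.
case class Item(profit: Double, weight: Double)

def knapsackGreedy(items: Seq[Item], W: Double): Seq[Item] = {
  // Sort by profit/weight ratio, highest first, then take items greedily
  // while they still fit in the remaining capacity W.
  var remaining = W
  items.sortBy(i => -(i.profit / i.weight)).filter { i =>
    if (i.weight <= remaining) { remaining -= i.weight; true }
    else false
  }
}

val items = Seq(Item(10, 5), Item(40, 4), Item(30, 6), Item(50, 3))
val chosen = knapsackGreedy(items, 10.0)
println(chosen.map(_.profit).sum) // prints 90.0
```

With capacity 10, the two best-ratio items (profit 50 at weight 3, profit 40 at weight 4) fit, and the rest do not, giving a total profit of 90.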