
Apache Spark Knapsack Approximation Algorithm in Python

The code shown below implements an approximation algorithm, a greedy heuristic, for the 0-1 knapsack problem in Apache Spark. Having worked a good amount with parallel dynamic programming algorithms, I wanted to see what this would look like in Spark.

The GitHub code repository for the knapsack approximation algorithms is here, and it includes a Scala solution. Work on a Java version is in progress at the time of this writing.

Below is the code that computes a solution fitting within the knapsack capacity W for a set of items, each with its own weight and profit value. The goal is to maximize the sum of the selected items' profits while not exceeding the total allowed weight, W.
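
To illustrate the greedy idea before the Spark version, here is a minimal plain-Python sketch; the function name greedy_knapsack and the item numbers are made up for illustration. It sorts items by profit/weight ratio, then takes them in that order while the running weight still fits within W.

# Minimal, non-Spark sketch of the greedy ratio heuristic (illustrative only).
def greedy_knapsack(items, W):
    # items: list of (name, weight, value) tuples.
    # Ignore items heavier than W, then sort by value/weight ratio, best first.
    candidates = sorted(
        (it for it in items if it[1] <= W),
        key=lambda it: it[2] / it[1],
        reverse=True,
    )
    chosen, total_weight = [], 0.0
    for name, weight, value in candidates:
        if total_weight + weight > W:
            break  # the running (partial) sum of weights would exceed W
        chosen.append(name)
        total_weight += weight
    return chosen

# Example: items 'a' and 'b' fit, 'c' does not.
print(greedy_knapsack([('a', 2.0, 10.0), ('b', 3.0, 9.0), ('c', 4.0, 4.0)], 5.0))
# ['a', 'b']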

First we import some Spark libraries into Python.

# Imports for the 0-1 knapsack function: weights, values and size-capacity.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import col
from pyspark.sql.functions import sum


Now define the function, which takes a Spark DataFrame with all the items, each with a name, weight and profit value, along with the total knapsack weight W.

def knapsackApprox(knapsackDF, W):
    '''
    A simple greedy parallel implementation of the 0-1 knapsack algorithm.

    Parameters
    ----------
    knapsackDF : Spark Dataframe with knapsack data
        sqlContext.createDataFrame(knapsackData, ['item', 'weights', 'values'])

    W : float
        Total weight allowed for knapsack.

    Returns
    -------
    Dataframe
        Dataframe with results.
    '''


Create a new DataFrame that adds, for each item, a column with the ratio of profit to weight. Discard any item whose weight is larger than the knapsack capacity W, since it cannot fit regardless. Then sort, in Spark, all the item rows by the ratio value, high to low.

    # Add ratio of values / weights column.
    ratioDF = (knapsackDF.withColumn("ratio", lit(knapsackDF.values / knapsackDF.weights))
               .filter(col("weights") <= W)
               .sort(col("ratio").desc())
               )




Get the current Spark session to feed to the SQL command.

    # Get the current Spark session (note: sc here is a SparkSession, not a SparkContext).
    sc = SparkSession.builder.getOrCreate()




Now we take the sorted list of filtered items, with their profit ratios, and compute the partial (running) sums of the weights. Each element's weight is added to the sum of the weights before it, which is how we will select the set of items that fits within W. There are other ways to do this; SQL with a windowed sum() is used below.

    # A SQL window query to compute the partial (running) sums of the weights,
    # ordered by the value / weight ratio.
    ratioDF.createOrReplaceTempView("tempTable")
    partialSumWeightsDF = sc.sql("""
        SELECT
            item,
            weights,
            values,
            ratio,
            sum(weights) OVER (ORDER BY ratio desc) as partSumWeights
        FROM
            tempTable
        """)




Now that we have the partial sums of the weights with the DataFrame sorted by profit/weight ratio, we keep only the items whose partial sum of weights is at most the knapsack size W. Because the partial sums are non-decreasing in ratio order, this selects a prefix of the sorted list, and that prefix is the final solution set.

    # Keep the items whose partial sum of weights is less than or equal to W.
    partialSumWeightsFilteredDF = (
                                    partialSumWeightsDF.sort(col("ratio").desc())
                                    .filter(col("partSumWeights") <= W)
                                   )



Return the partialSumWeightsFilteredDF as the solution set.

    # Return the solution elements with total values, weights and count.
    return partialSumWeightsFilteredDF


For a non-approximate, optimal solution, the algorithm would need to be structured quite differently, and Bulk Synchronous Parallel (BSP) approaches would be a good place to start looking.

Testing:

Now to use this function, below is some code to plug in values and test the knapsack approximation solution.

# --------------------------------------------
# Test of the Approximate Knapsack function
# --------------------------------------------

# Pull in the knapsack library.

import random
from pyspark.sql import SparkSession

from knapsack import knapsack

# Create the SparkSession.
sc = SparkSession \
    .builder \
    .appName("Knapsack Approximation Algorithm Test") \
    .getOrCreate()


Set the problem size to 10.

# Knapsack problem size.
N = 10


Set up a Python list with some uniform random data for the N items.

# Setup sample data for knapsack.
knapsackData = [('item_' + str(k), random.uniform(1.0, 10.0), random.uniform(1.0, 10.0)) for k in range(N)]


Now create the knapsack DataFrame, with the column names item, weights and values, using the list knapsackData.

# Make a Dataframe with item(s), weight(s), and value(s) for the knapsack.
knapsackData = sc.createDataFrame(knapsackData, ['item', 'weights', 'values'])

# Display the original data
print "Original Data:"
print knapsackData.show()
print "\n"


The maximum weight W is random, drawn from a range based on the number of items N, and we could set it in a number of ways.

# Create a random maximum weight
W = random.uniform(N * 1.3, N * 1.6)

# Show the weight.
print "W: "
print W
print "\n"


Now we just call the knapsack approximation method passing the knapsackData and total knapsack size W.

# Call the knapsack greedy approximation function, with the data and knapsack size W.
k = knapsack.knapsackApprox(knapsackData, W)


That's it; now we show the selected elements:

# Show the results DataFrame.
print("Selected Elements:")
k.show()
print("\n")


And add up the totals via the RDD map() and reduce() methods with lambda functions to get the sums of the values and weights. The sum of the values is the total knapsack profit.

# Show totals for selected elements of knapsack.
sumValues = k.rdd.map(lambda x: x["values"]).reduce(lambda x, y: x+y)
sumWeights = k.rdd.map(lambda x: x["weights"]).reduce(lambda x, y: x+y)
numResults = k.count()
print "Totals:"
print "Sum Values: ",  sumValues
print "Sum Weights: ",  sumWeights
print numResults
print "\n"

# ------------------------------------------
# End of Approximate Knapsack function test
# ------------------------------------------


That is all. The Scala version is very similar to this, and I have not yet tested it for very large N to see how long it takes. The sorts may have performance implications, as may the method used to take the partial sums. Tests showed it to work correctly.

The code here can also be found at: Spark.Packages.Org for the knapsack approximation.

