
A way to Merge Columns of DataFrames in Spark with no Common Column Key

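I posted this on the Databricks forum while thinking about how to take two DataFrames with the same number of rows and merge all of their columns into one DataFrame. One approach is to use the monotonically_increasing_id() function to assign a unique ID to each row of each DataFrame. The generated IDs are guaranteed to be unique and increasing, but not consecutive, and they depend on how each DataFrame is partitioned, so the two ID columns line up only when both DataFrames have the same partition layout, as small test DataFrames like the ones below typically do. It would be ideal to pad the DataFrame with fewer rows with null rows so the row counts match, although the code below does not do this.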

Once the IDs are added, a DataFrame join on the ID column merges all the columns into one DataFrame.
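To see why the IDs depend on partitioning, here is a minimal pure-Python sketch of the layout described in the Spark documentation for monotonically_increasing_id(): the partition index goes in the upper bits and the row's position within its partition goes in the lower 33 bits. The helper name simulated_ids is hypothetical and exists only for this illustration; it is a model of the documented behavior, not Spark's own code.

```python
# Pure-Python model of the documented ID layout (an illustration, not Spark code):
# partition index in the upper bits, row position in the lower 33 bits.
def simulated_ids(rows_per_partition):
    """Return the IDs this model assigns for the given partition sizes."""
    return [
        (partition << 33) + offset
        for partition, size in enumerate(rows_per_partition)
        for offset in range(size)
    ]

# Three rows held in a single partition.
single = simulated_ids([3])
# The same three rows split over two partitions (2 rows + 1 row).
split = simulated_ids([2, 1])

print(single)  # [0, 1, 2]
print(split)   # [0, 1, 8589934592]
```

The third row receives a different ID in the two layouts, so a join on these IDs would fail to pair it up if the two DataFrames were partitioned differently.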


# For two Dataframes that have the same number of rows, merge all columns, row by row.

# Get the function monotonically_increasing_id so we can assign ids to each row, when the
# Dataframes have the same number of rows.
from pyspark.sql.functions import monotonically_increasing_id

# Create some test data with 4 and 3 columns.
df1 = sqlContext.createDataFrame([("foo", "bar","too","aaa"), ("bar", "bar","aaa","foo"), ("aaa", "bbb","ccc","ddd")], ("k", "K" ,"v" ,"V"))
df2 = sqlContext.createDataFrame([("aaa", "bbb","ddd"), ("www", "eee","rrr"), ("jjj", "rrr","www")], ("m", "M" ,"n"))

# Add increasing IDs. They match across the two DataFrames only if both are partitioned the same way.
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())

# Perform a join on the ids.
df3 = df2.join(df1, "id", "outer").drop("id")
df3.show()

I started a GitHub repository to collect code snippets for Apache Spark.


