
A way to Merge Columns of DataFrames in Spark with no Common Column Key

I made a post on the Databricks forum about how to take two DataFrames with the same number of rows and merge all of their columns into one DataFrame. One way to do this is to use the monotonically_increasing_id() function to assign an ID to each row of both DataFrames. The generated IDs are unique and increasing, but they depend on how the rows are partitioned, so they line up across the two DataFrames only when both have the same partition layout. It would be ideal to add extra null rows to the DataFrame with fewer rows so that the row counts match, although the code below does not do this.

Once the IDs are added, a DataFrame join on the ID column merges all the columns into one DataFrame.


# For two DataFrames that have the same number of rows, merge all columns, row by row.

# Import monotonically_increasing_id so we can assign an ID to each row.
from pyspark.sql.functions import monotonically_increasing_id

# Create some test data with 4 and 3 columns.
# (sqlContext is assumed to be already defined, as in a Databricks notebook.)
df1 = sqlContext.createDataFrame([("foo", "bar", "too", "aaa"), ("bar", "bar", "aaa", "foo"), ("aaa", "bbb", "ccc", "ddd")], ("k", "K", "v", "V"))
df2 = sqlContext.createDataFrame([("aaa", "bbb", "ddd"), ("www", "eee", "rrr"), ("jjj", "rrr", "www")], ("m", "M", "n"))

# Add increasing IDs. They match across the two DataFrames only if both have
# the same number of rows in each partition, which should hold for this small test data.
df1 = df1.withColumn("id", monotonically_increasing_id())
df2 = df2.withColumn("id", monotonically_increasing_id())

# Perform an outer join on the IDs, then drop the helper column.
df3 = df2.join(df1, "id", "outer").drop("id")
df3.show()
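To see why the IDs line up only when the partitioning matches, here is a minimal pure-Python sketch of the ID layout described in the Spark documentation for monotonically_increasing_id(): the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The helper simulate_ids is hypothetical and written only for illustration; it is not part of Spark, and the partition sizes are made-up inputs.

```python
# Sketch of the documented ID layout of monotonically_increasing_id():
# partition index in the upper 31 bits, record number in the lower 33 bits.
# simulate_ids is a hypothetical helper for illustration, not a Spark API.

def simulate_ids(partition_sizes):
    """Return the IDs that layout yields, given the row count of each partition."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for record_number in range(size):
            ids.append((partition_index << 33) + record_number)
    return ids

# Three rows held in a single partition: the IDs are 0, 1, 2.
one_partition = simulate_ids([3])

# The same three rows split over two partitions (2 rows + 1 row):
# the third row starts a new partition, so its ID jumps to 1 << 33.
two_partitions = simulate_ids([2, 1])

print(one_partition)   # [0, 1, 2]
print(two_partitions)  # [0, 1, 8589934592]

# Only IDs present on both sides would pair up in a join on "id".
matching = set(one_partition) & set(two_partitions)
print(sorted(matching))  # [0, 1]
```

In this sketch the third row of each DataFrame would not be paired by the join, because its ID differs between the two layouts. When the partition layouts are identical, every ID matches and the join merges the rows one to one.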

I started a GitHub repository to look at code snippets for Apache Spark.


