@yanakad
Created March 27, 2015 15:48
Elasticsearch schemaRDD repro
wget http://www.eng.lsu.edu/mirrors/apache/spark/spark-1.2.1/spark-1.2.1-bin-hadoop2.3.tgz
tar -xf spark-1.2.1-bin-hadoop2.3.tgz
cd spark-1.2.1-bin-hadoop2.3/bin/
wget https://oss.sonatype.org/content/repositories/snapshots/org/elasticsearch/elasticsearch-hadoop/2.1.0.BUILD-SNAPSHOT/elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar
./spark-shell --jars elasticsearch-hadoop-2.1.0.BUILD-20150324.023417-341.jar
import org.apache.spark.sql.SQLContext
case class KeyValue(key: Int, value: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).saveAsParquetFile("large.parquet")
parquetFile("large.parquet").registerTempTable("large")
val schemaRDD = sql("SELECT * FROM large")
import org.elasticsearch.spark._
schemaRDD.saveToEs("test/spark")

yanakad commented Mar 27, 2015

Elasticsearch was started as follows:

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.zip
unzip elasticsearch-1.4.4.zip && cd elasticsearch-1.4.4
bin/elasticsearch

> curl localhost:9200
{
  "status" : 200,
  "name" : "Wizard",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
