# Hyperspark
Hyperspark is a decentralized data processing tool for Dat, inspired by Spark. Essentially, it is a convenient wrapper around a Dat archive.

This is a work in progress; any ideas or suggestions are welcome.
## Goals
- Reuse intermediate data.
- Minimize bandwidth usage.
- Share computation power.
## How to use
### Data owner

It's simple: just share your data with Dat by running `dat .` in your data directory.
### Data scientist

Express your ideas as transforms and actions, without worrying about fetching or storing data.
### Computation provider

Run the transformations defined by researchers, then cache and share the intermediate results so that everyone can reuse them without running their own computation cluster.
## Synopsis

Define an RDD over a Dat archive with dat-transform. Word counting:
```js
const hs = require('hyperspark')

// Assumed helper: build a single-entry {key: value} object.
// (kv is used but not defined in the original example.)
const kv = (key, value) => ({ [key]: value })

var rdd = hs(<DAT-ARCHIVE-KEY>)

// define transforms
var result = rdd
  .splitBy(/[\n\s]/)
  .filter(x => x !== '')
  .map(word => kv(word, 1))

// actually run (action)
result.reduceByKey((x, y) => x + y)
  .toArray(res => {
    console.log(res) // [{bar: 2, baz: 1, foo: 1}]
  })
```
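For reference, the same word count can be written in plain JavaScript with no Dat archive involved; this is only meant to illustrate what the transform chain computes.

```js
// splitBy, filter, and map + reduceByKey from the example above,
// collapsed into ordinary array operations.
function wordCount (text) {
  return text
    .split(/[\n\s]/)              // splitBy: tokenize on whitespace
    .filter(x => x !== '')        // filter: drop empty tokens
    .reduce((counts, word) => {   // map + reduceByKey: tally words
      counts[word] = (counts[word] || 0) + 1
      return counts
    }, {})
}

wordCount('foo bar\nbar baz') // → {foo: 1, bar: 2, baz: 1}
```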
## Related modules

- dat-transform: RDD-style data transformation in JavaScript.
- dat-ipynb: analyze data inside a Dat archive with an RDD-style API, using nel.
- ipynb2md: convert an IPython (Jupyter) notebook to Markdown.
- markdown-attachment-p2p: attach files to Markdown with Dat.