not so quiet …
This site has been quiet, but there’s a lot going on behind the scenes. Items of note:
- Parallel R has landed! Many thanks to all who made it possible. Happy reading.
- I’m planning some software updates, and forqlift is in the top slot.
- There’s another fun project brewing … more details soon.
new book on the way: Parallel R
As promised, I have an announcement:
It’s a book!
Well, more like a book-to-be. I’ve signed on with the fine folks at O’Reilly to publish Parallel R. It’s all about giving R, everyone’s preferred open-source data analysis tool, a parallel boost. If you’re doing large-scale work with R, you’ll likely want to read this book, especially if you’d like to blend R and Hadoop.
This will not be a solo venture: my partner in crime will be none other than Stephen Weston. Even if you don’t know him by name (and really, you should), there’s a good chance you know his work: he wrote the R packages nws, foreach, doSNOW, and doMC.
Look forward to more announcements over time.
news next week
I have some pretty cool news to announce. Drop by early next week for the full story.
a couple of new R packages: factualR and Segue
Two new R packages of note:
First, I have recently released factualR, which makes it easier for R researchers to work with data from Factual.com. If you want to pull Factual data sets into an ever-familiar data.frame, then you’re probably interested in factualR.
Second, I have joined the Segue project as a contributor. Segue is described as, “Parallel R In the Cloud, Two lines of code!” That means you get to use Amazon’s Elastic MapReduce as a parallel backend for lapply()-like operations.
(Note that Segue is best for jobs that are computation-bound, not necessarily data-bound. Running a scary Monte Carlo simulation? Testing fifty parameter variations on your wicked timeseries analysis? Give Segue a try!)
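To make the "lapply()-like" idea concrete, here is a rough sketch of the pattern in Python rather than R. It is not Segue's API and does not touch Elastic MapReduce: the names `estimate_pi`, `parallel_lapply`, and `run_demo` are made up for illustration, and a local thread pool stands in for the remote workers. The point is only the shape of a computation-bound job: many independent runs, each needing a tiny input (here, a seed) and returning a tiny result.

```python
import random
from concurrent.futures import Executor, ThreadPoolExecutor


def estimate_pi(seed: int, draws: int = 20_000) -> float:
    """One independent Monte Carlo run: estimate pi from points in the unit square."""
    rng = random.Random(seed)  # per-task generator, so runs share no state
    hits = 0
    for _ in range(draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / draws


def parallel_lapply(fn, items, executor: Executor) -> list:
    """Apply fn to every item on the executor's workers; results keep input order."""
    return list(executor.map(fn, items))


def run_demo(seeds=range(8)) -> list:
    # Assumption: threads stand in for remote workers. For CPU-bound
    # pure-Python work, a ProcessPoolExecutor is the usual swap.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return parallel_lapply(estimate_pi, list(seeds), pool)


if __name__ == "__main__":
    estimates = run_demo()
    print(sum(estimates) / len(estimates))
```

Because each task carries almost no data in or out, the cost of shipping work to a worker stays small compared with the time spent computing, which is the situation the note above describes.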
Enjoy.
New year, new toy: forqlift
I have a new project.
It’s called forqlift.
This one should be of use to people who crunch data with Hadoop or Mahout.
Here’s a bit of a blurb from forqlift’s page to get you started:
SequenceFiles are nice, but they can be unwieldy at times. I wrote forqlift to make it easier to manage SequenceFiles.
forqlift is a command-line tool that lets you:
- create SequenceFiles from files on your local filesystem (just like creating an archive with tar or zip)
- set compression (none, bzip2, gzip) and value types (text or binary)
- extract the contents of a SequenceFile back to the filesystem
- convert popular archive formats — tar (including tar.bz2 and tar.gz) and zip — to and from SequenceFile format
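As a rough illustration of the archive-conversion idea (not forqlift's actual code), the Python sketch below walks a tar archive and yields one key/value record per regular file. A SequenceFile is a Hadoop container of key/value pairs; using the file's path as the key and its bytes as the value is an assumption made here for illustration, and the helper names `tar_to_records` and `build_sample_tar` are hypothetical. Writing the records out in real SequenceFile format is deliberately left out.

```python
import io
import tarfile
from typing import Iterator, Tuple


def tar_to_records(tar_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member path, file contents) for each regular file in a tar archive."""
    # "r:*" lets tarfile detect plain, gzip, or bzip2 compression on its own.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            handle = archive.extractfile(member)
            yield member.name, handle.read()


def build_sample_tar() -> bytes:
    """Build a small in-memory tar.gz so the sketch runs without touching disk."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [("docs/a.txt", b"alpha"), ("docs/b.txt", b"beta")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buffer.getvalue()


if __name__ == "__main__":
    for key, value in tar_to_records(build_sample_tar()):
        print(key, len(value))
```

Going the other direction (records back to files) is the same loop in reverse: read each key/value pair and write the value to the path named by the key.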
Head over to the forqlift page for more info!