As promised, the slides for our tutorial at ICDE 2013 are below the fold. You can also download them as .PPTX and .PDF.
G’day mates: Australia April 2013
I am travelling in Australia for a few days. I can be found in these locations:
- I’m visiting the Kinghorn Cancer Centre in Sydney today, April 4th, to learn more about the Big Data challenges faced in cancer research. Who knows, maybe we can help.
- I’m visiting NICTA in Sydney on April 5th where I will give a talk on some recent work on systems for machine learning on Big Data.
- From April 8th to 12th, I can be found in Brisbane, attending and presenting at ICDE 2013:
- I’m co-presenting the tutorial on Machine Learning on Big Data. Slides will be posted here soon (after the tutorial, let’s be realistic…)
- I’m a panelist at the Data Management in the Cloud workshop. Which means that I get to argue in public.
Drop me an email if our locations intersect during these two weeks and you want to meet.
CFP: Big Learning 2012: Algorithms, Systems and Tools
NIPS 2012 Workshop http://www.biglearn.org
Organizers:
- Sameer Singh <sameer@cs.umass.edu> (UMass Amherst)
- John Duchi <jduchi@eecs.berkeley.edu> (UC Berkeley)
- Yucheng Low <ylow@cs.cmu.edu> (Carnegie Mellon University)
- Joseph Gonzalez <jegonzal@eecs.berkeley.edu> (UC Berkeley)
Submissions are solicited for a one-day workshop on December 7-8 in Lake Tahoe, Nevada.
This workshop will address algorithms, systems, and real-world problem domains related to large-scale machine learning (“Big Learning”). With active research spanning machine learning, databases, parallel and distributed systems, parallel architectures, programming languages and abstractions, and even the sciences, Big Learning has attracted intense interest. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and exchange new ideas. Topics of interest include (but are not limited to):
Big Data: Methods for managing large, unstructured, and/or streaming data; cleaning, visualization, interactive platforms for data understanding and interpretation; sketching and summarization techniques; sources of large datasets.
Models & Algorithms: Machine learning algorithms for parallel, distributed, GPGPUs, or other novel architectures; theoretical analysis; distributed online algorithms; implementation and experimental evaluation; methods for distributed fault tolerance.
Applications of Big Learning: Practical application studies and challenges of real-world system building; insights on end-users, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced).
Tools, Software & Systems: Languages and libraries for large-scale parallel or distributed learning which leverage cloud computing, scalable storage (e.g., RDBMSs, NoSQL, graph databases), and/or specialized hardware.
Submissions should be written as extended abstracts, no longer than 4 pages (excluding references), in the NIPS LaTeX style. Relevant work previously presented at non-machine-learning conferences is strongly encouraged, though submitters should note this in their submission.
Submission Deadline: October 17th, 2012. Please refer to the website for detailed submission instructions: http://biglearn.org
Declarative Systems for Large-Scale Machine Learning
Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer and Raghu Ramakrishnan
Abstract
In this article, we make the case for a declarative foundation for data-intensive machine learning systems. Instead of creating a new system for each specific flavor of machine learning task, or hardcoding new optimizations, we argue for the use of recursive queries to program a variety of machine learning algorithms. By taking this approach, database query optimization techniques can be utilized to identify effective execution plans, and the resulting runtime plans can be executed on a single unified data-parallel query processing engine.
BibTeX
@article{Vinayak-Borkar:2012fk,
Author = {Vinayak Borkar and Yingyi Bu and Michael J. Carey and Joshua Rosen and Neoklis Polyzotis and Tyson Condie and Markus Weimer and Raghu Ramakrishnan},
Journal = {Bulletin of the Technical Committee on Data Engineering},
Month = {June},
Number = {2},
Pages = {24},
Title = {Declarative Systems for Large-Scale Machine Learning},
Volume = {35},
Year = {2012}}
KDD Cup 2011 proceedings online
The proceedings of KDD Cup 2011 are now online at JMLR.
Slides for my Berlin Buzzwords talk are available
A last yodel
Friday was my last day at Yahoo!. It has been three amazing years, with lots of fun (bringing spammers to tears), challenging problems (damn! spammers aren’t stupid!) and the occasional buh-buh (Really, I did not mean to kill that several-thousand-node Hadoop cluster. Several times. On a weekend. Really.) Most of all, it has been three years of learning, growing and being challenged in the best of ways. For that, I’m thankful to the people who created the place and those who shared the time with me. It was the best job I ever had.
Talk at Berlin Buzzwords
I’ve been invited to talk about our work on ScalOps at Berlin Buzzwords! According to the Schedule, I’ll be speaking June 4th, 17:04 – 17:45 in the room “Kleistsaal”.
Abstract
Functional programming models like MapReduce raise the level of abstraction. They allow the programmer to focus on implementing her algorithm in a clean abstraction without the concern of parallelism, data distribution or fault-tolerance. Unfortunately, MapReduce runtimes like Hadoop do not provide a high-performance solution for machine learning, nor do they provide an attractive API. The ScalOps project seeks to address this shortcoming on multiple levels. Firstly, we aim to provide an efficient runtime that directly supports not only a rich Pig-like set of operators, but also iterations to facilitate many computations from the machine learning domain. Secondly, we provide a programming language that targets machine learning algorithms. This domain-specific language (DSL) is written in the Scala programming language, a JVM-based language that is byte-code compatible with Java. In this talk, I will report on the current status of the ScalOps project and our runtime layer called Hyracks.
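To make the abstraction mentioned in the abstract a bit more concrete, here is a minimal single-process sketch of the MapReduce pattern in Python. It is neither Hadoop's nor ScalOps' API: the names `map_reduce`, `count_tokens` and `total` are hypothetical and chosen only for illustration, and the token-counting task is an assumed toy example. The point is that the programmer writes only the two small functions, while grouping (and, in a real system, distribution and fault tolerance) is left to the runtime.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy, in-memory MapReduce over `records` (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each record independently emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: the values of each key are combined independently of other keys.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_tokens(message):
    """Mapper: emit (token, 1) for every whitespace-separated token."""
    return [(token.lower(), 1) for token in message.split()]


def total(key, counts):
    """Reducer: sum the counts collected for one token."""
    return sum(counts)


if __name__ == "__main__":
    messages = ["Buy cheap pills", "cheap cheap offers", "Meeting at noon"]
    print(map_reduce(messages, count_tokens, total))
```

Note that nothing in this pattern expresses iteration: an algorithm that needs many passes over the data has to call `map_reduce` repeatedly from outside, which is the gap the abstract says ScalOps aims to close.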
Slides
WWW 2012 Tutorial: New Templates for Scalable Data Analysis
Together with Alex Smola and Amr Ahmed, I’ll give a tutorial at the World Wide Web Conference on New Templates for Scalable Data Analysis. The WWW program lists the current time and location.
Abstract
Scalable data analysis has come a long way since the introduction of the MapReduce paradigm a decade ago. In this tutorial we present algorithms for synchronous and asynchronous data processing. They are capable of dealing with the amounts of data typically available on the internet. We give a brief description of the problems one faces when performing scalable machine learning on the internet. To motivate matters we provide a number of scenarios from spam filtering, advertising and collaborative filtering. This is followed by an extensive discussion of current and novel synchronous data processing techniques. In particular we emphasize how insights from systems research and databases can be used to achieve significant improvements both in terms of expressiveness and in terms of efficiency of the deployed algorithms. This is followed by a description of asynchronous data analysis and inference methods. The latter are particularly necessary whenever the estimation problem requires the use of a significant number of latent variables. This includes cases such as clustering, topic models, or graph factorization. We provide an ample number of motivating examples and applications, ranging from user profiling to the analysis of communication networks. Special emphasis is placed on approximations needed to scale algorithms to hundreds of millions of users and billions of documents.
Downloads
- Abstract in the WWW proceedings (PDF)
- Part I: Machine Learning and Systems: PDF
- Part II: Synchronized Patterns: PDF
- Part III: Distributed Latent Variable Models: PDF
- Part IV: User Modeling and Graph Factorization: PDF
Machine learning in ScalOps, a higher order cloud computing language
Markus Weimer, Tyson Condie and Raghu Ramakrishnan
Abstract
ScalOps is a new internal domain-specific language (DSL) for Big Data analytics that targets machine learning and graph-based algorithms. It unifies the so-far distinct DAG processing, as found in, e.g., Pig, and the iterative computation needs of machine learning in a single language and runtime. It exposes a declarative language that is reminiscent of Pig with iterative extensions: The scaloop block captures iteration and packages it in the execution plan so that it can be optimized for caching opportunities and handed off to the runtime. The Hyracks runtime directly supports these iterations as recursive queries, thereby avoiding the pitfalls of an outer driver loop. We highlight the expressiveness of ScalOps by presenting two example implementations: Batch Gradient Descent (a trivially parallel algorithm) and Pregel, a computational framework of its own. The resulting code is nearly a 1:1 translation of the target mathematical description.
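For readers unfamiliar with why Batch Gradient Descent counts as trivially parallel, the following is a rough sketch in plain Python. It is not ScalOps syntax and not the code from the paper. It assumes a least-squares linear model, and all names (`partition_gradient`, `add_vectors`, `batch_gradient_descent`) are hypothetical. Each partition computes its gradient independently (a map), the partial gradients are summed (a reduce), and the model is updated once per pass.

```python
from functools import reduce


def partition_gradient(weights, partition):
    """Map step: sum of squared-error gradients over one data partition."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad


def add_vectors(a, b):
    """Reduce step: element-wise sum of two partial gradients."""
    return [x + y for x, y in zip(a, b)]


def batch_gradient_descent(partitions, dim, step_size=0.2, iterations=500):
    """Fit a linear model; this Python loop is the 'outer driver loop'."""
    weights = [0.0] * dim
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Partitions are independent, so this map could run in parallel.
        partials = [partition_gradient(weights, p) for p in partitions]
        gradient = reduce(add_vectors, partials)
        weights = [w - step_size * g / n for w, g in zip(weights, gradient)]
    return weights


if __name__ == "__main__":
    # Assumed toy data: label = 1 + 2 * x, with a constant bias feature of 1.0.
    partitions = [
        [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)],
        [((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)],
    ]
    print(batch_gradient_descent(partitions, dim=2))
```

The `for` loop in `batch_gradient_descent` plays the role of the outer driver loop that the abstract contrasts with running the iteration inside the runtime as a recursive query; it is written this way here only to keep the computation pattern visible.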
BibTeX
@inproceedings{Weimer:2011fk,
Author = {Markus Weimer and Tyson Condie and Raghu Ramakrishnan},
Booktitle = {NIPS 2011 Workshop on parallel and large-scale machine learning (BigLearn)},
Month = {December},
Title = {Machine learning in ScalOps, a higher order cloud computing language},
Url = {http://cs.markusweimer.com/2011/11/21/machine-learning-in-scalops-a-higher-order-cloud-computing-language/},
Year = {2011}
}