Towards Resource-Elastic Machine Learning

Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam, Keerthi Selvaraj

Abstract

In this article, we argue that resource elasticity is a key requirement for distributed machine learning. Not only do computational resources disappear without warning (e.g. due to machine failure), but modern resource managers also re-negotiate the available resources while a job is running: additional machines may become available, or already reserved ones may be re-assigned to other jobs. We show how to formalize this problem and present an initial approach for linear learners.

Download PDF

Distributed and Scalable PCA in the Cloud

Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer and Vijay Narayanan

Abstract

Principal Component Analysis (PCA) is a popular technique with many applications. Recent randomized PCA algorithms scale to large datasets but face a bottleneck when the number of features is also large. We propose to mitigate this issue using a composition of structured and unstructured randomness within a randomized PCA algorithm. Initial experiments using a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it on both Hadoop and a more flexible platform named REEF.

Download PDF

How to set up PowerShell for GitHub, Maven and Java development

Quite a few people asked, so here is the setup that I have found quite useful:

  1. Download and install GitHub for Windows. Launch it at least once and use it to clone a repository. This triggers the download of the actual git command line tools.
  2. Download and install Java and set up the environment variable JAVA_HOME. For example, set it to C:\Program Files\Java\jdk1.7.0_15
  3. Download and install Maven. Set M2_HOME to point to your Maven installation folder, e.g. C:\maven
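
As a minimal sketch of steps 2 and 3, the two variables can also be set from PowerShell itself instead of the Windows system dialog. The paths are the example locations from the list above and are assumptions about your machine; the variable names `$javaHome` and `$mavenHome` are only illustrative.

```powershell
# Example install locations taken from the steps above; adjust to your machine.
$javaHome  = 'C:\Program Files\Java\jdk1.7.0_15'
$mavenHome = 'C:\maven'

# Persist the variables for the current user so that new shells pick them up.
[Environment]::SetEnvironmentVariable('JAVA_HOME', $javaHome, 'User')
[Environment]::SetEnvironmentVariable('M2_HOME', $mavenHome, 'User')

# Also set them in the current session so that no restart is needed.
$env:JAVA_HOME = $javaHome
$env:M2_HOME   = $mavenHome
```

Setting the variables at the user level avoids the need for an elevated prompt; a machine-wide setting would work just as well for the profile shown further down.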

Now that the basics are in place, the following PowerShell profile will wire things up such that mvn, git and javac work as expected:

# Setup java (skipped if JAVA_HOME is unset or does not exist)
if($env:JAVA_HOME -and (Test-Path $env:JAVA_HOME)){
    Set-Alias javac "$env:JAVA_HOME\bin\javac.exe"
    Set-Alias java "$env:JAVA_HOME\bin\java.exe"
    Set-Alias jar "$env:JAVA_HOME\bin\jar.exe"
}
# Setup maven
if($env:M2_HOME -and (Test-Path $env:M2_HOME)){
    # Multi-threaded build: one thread per CPU core
    function mvn-mt{
        & "$env:M2_HOME\bin\mvn.bat" -T 1C @args
    }
    # Plain maven; @args passes all arguments through unchanged
    function mvn{
        & "$env:M2_HOME\bin\mvn.bat" @args
    }
}else{
    function mvn{
        Write-Warning "Could not find a Maven install. Is M2_HOME set?"
    }
}
# Setup git
if(Test-Path "$env:LOCALAPPDATA\GitHub"){
    . (Resolve-Path "$env:LOCALAPPDATA\GitHub\shell.ps1")
    . (Resolve-Path "$env:github_posh_git\profile.example.ps1")
}

The Java setup is rather simple: just some aliases. For Maven, I used functions, as I really don’t like to type mvn.bat on the command line. Lastly, the git section imports the very awesome GitHub command line integration, which enables branch names in the prompt as well as tab completion for git commands.
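
To try the profile above, it has to live in the file that PowerShell loads on startup. The following is a rough sketch of how that could look, assuming the standard `$PROFILE` location; it is a suggestion for checking the result, not part of the original setup.

```powershell
# Create the profile file (and its folder) if it does not exist yet.
if (-not (Test-Path $PROFILE)) {
    New-Item -ItemType File -Path $PROFILE -Force | Out-Null
}

# After pasting the profile contents into $PROFILE, reload it in the current session.
. $PROFILE

# Show how each command now resolves; commands that are missing are skipped silently.
Get-Command javac, java, jar, mvn, mvn-mt -ErrorAction SilentlyContinue |
    Format-Table Name, CommandType -AutoSize
```

If everything is wired up, javac, java and jar should show up as aliases and mvn and mvn-mt as functions. If PowerShell refuses to load the profile, the execution policy may be blocking scripts; `Set-ExecutionPolicy RemoteSigned -Scope CurrentUser` is one common way to allow local scripts, though whether that is appropriate depends on your environment.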

Tutorial: Machine Learning on Big Data (SIGMOD 2013) (Updated)

Below the fold, there is an embedded version of our slides. You can also download them as a PowerPoint or PDF file. If you have any questions or comments, please feel free to leave a comment below.

Update (2013-07-16): We updated some of the references and their descriptions, most notably the one to MADlib. We are happy to update the slides with more references. Feel free to leave a note in the comments below with systems / papers you’d like to see referenced here.

Update (2013-08-25): Fixed the link above to the PDF file.

G’day mates: Australia April 2013

I am travelling in Australia for a few days. I can be found in these locations:

  • I’m visiting the Kinghorn Cancer Centre in Sydney today, April 4th, to learn more about the Big Data challenges faced in cancer research. Who knows, maybe we can help.
  • I’m visiting NICTA in Sydney on April 5th where I will give a talk on some recent work on systems for machine learning on Big Data.
  • From April 8th to 12th, I can be found in Brisbane, attending and presenting at ICDE 2013:
    • I’m co-presenting the tutorial on Machine Learning on Big Data. Slides will be posted here soon (after the tutorial, let’s be realistic…)
    • I’m a panelist at the Data Management in the Cloud workshop. Which means that I get to argue in public.

Drop me an email if our locations intersect these two weeks and you want to meet.

CFP: Big Learning 2012: Algorithms, Systems and Tools

NIPS 2012 Workshop http://www.biglearn.org

Organizers:

  • Sameer Singh <sameer@cs.umass.edu> (UMass Amherst)
  • John Duchi <jduchi@eecs.berkeley.edu> (UC Berkeley)
  • Yucheng Low <ylow@cs.cmu.edu> (Carnegie Mellon University)
  • Joseph Gonzalez <jegonzal@eecs.berkeley.edu> (UC Berkeley)

Submissions are solicited for a one-day workshop on December 7-8 in Lake Tahoe, Nevada.

This workshop will address algorithms, systems, and real-world problem domains related to large-scale machine learning (“Big Learning”). With active research spanning machine learning, databases, parallel and distributed systems, parallel architectures, programming languages and abstractions, and even the sciences, Big Learning has attracted intense interest. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and exchange new ideas. Topics of interest include (but are not limited to):

Big Data: Methods for managing large, unstructured, and/or streaming data; cleaning, visualization, interactive platforms for data understanding and interpretation; sketching and summarization techniques; sources of large datasets.

Models & Algorithms: Machine learning algorithms for parallel, distributed, GPGPU, or other novel architectures; theoretical analysis; distributed online algorithms; implementation and experimental evaluation; methods for distributed fault tolerance.

Applications of Big Learning: Practical application studies and challenges of real-world system building; insights on end-users, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced).

Tools, Software & Systems: Languages and libraries for large-scale parallel or distributed learning that leverage cloud computing, scalable storage (e.g. RDBMSs, NoSQL, graph databases), and/or specialized hardware.

Submissions should be written as extended abstracts, no longer than 4 pages (excluding references) in the NIPS LaTeX style. Relevant work previously presented in non-machine-learning conferences is strongly encouraged, though submitters should note this in their submission.

Submission Deadline: October 17th, 2012. Please refer to the website for detailed submission instructions: http://biglearn.org

Declarative Systems for Large-Scale Machine Learning

Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer and Raghu Ramakrishnan

Abstract

In this article, we make the case for a declarative foundation for data-intensive machine learning systems. Instead of creating a new system for each specific flavor of machine learning task, or hardcoding new optimizations, we argue for the use of recursive queries to program a variety of machine learning algorithms. By taking this approach, database query optimization techniques can be utilized to identify effective execution plans, and the resulting runtime plans can be executed on a single unified data-parallel query processing engine.

Download PDF

BibTeX

@article{Vinayak-Borkar:2012fk,
 Author = {Vinayak Borkar and Yingyi Bu and Michael J. Carey and Joshua Rosen and Neoklis Polyzotis and Tyson Condie and Markus Weimer and Raghu Ramakrishnan},
 Journal = {Bulletin of the Technical Committee on Data Engineering},
 Month = {June},
 Number = {2},
 Pages = {24},
 Title = {Declarative Systems for Large-Scale Machine Learning},
 Volume = {35},
 Year = {2012}}