V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
Ahmed Metwally, Christos Faloutsos
Computer Science, 2012
Abstract: This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a real dataset of a small size. We also established the scalability of the proposed algorithms by running them on a dataset of a realistic size, on which VCL never finished. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.
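To make the two-stage idea in this abstract concrete, here is a minimal single-machine sketch in Python. It is not the V-SMART-Join algorithm itself: the abstract does not name a similarity measure, so the Ruzicka similarity (sum of minimum multiplicities over sum of maximum multiplicities) is an assumption, and the function name `similar_pairs`, the threshold, and the sample IPs and cookies are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Tuple


def similar_pairs(entities: Dict[str, Counter],
                  threshold: float) -> Dict[Tuple[str, str], float]:
    """Return all entity pairs whose Ruzicka similarity is >= threshold."""
    # Stage 1 (partial results): the cardinality of each multiset.
    sizes = {name: sum(ms.values()) for name, ms in entities.items()}

    # Inverted index: element -> [(entity, multiplicity), ...].
    index = defaultdict(list)
    for name, ms in entities.items():
        for element, multiplicity in ms.items():
            index[element].append((name, multiplicity))

    # Stage 2 (candidate pairs): only entities sharing at least one element
    # are ever compared; accumulate the sum of minimum multiplicities.
    min_sums = defaultdict(int)
    for postings in index.values():
        for (a, mult_a), (b, mult_b) in combinations(sorted(postings), 2):
            min_sums[(a, b)] += min(mult_a, mult_b)

    result = {}
    for (a, b), shared in min_sums.items():
        # Per element, min + max = a_i + b_i, so sum(max) = |A| + |B| - sum(min).
        similarity = shared / (sizes[a] + sizes[b] - shared)
        if similarity >= threshold:
            result[(a, b)] = similarity
    return result


# Hypothetical data: each IP is a multiset of cookies.
ips = {
    "ip1": Counter({"c1": 2, "c2": 1}),
    "ip2": Counter({"c1": 1, "c2": 1}),
    "ip3": Counter({"c3": 4}),
}
pairs = similar_pairs(ips, threshold=0.5)
```

In a real MapReduce job, the cardinality computation and the inverted-index grouping would each be distributed steps; the sketch only shows why joining per-entity partial results with per-element postings is enough to compute the similarity exactly.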
Nephele Streaming: Stream Processing Under QoS Constraints At Scale
Björn Lohrmann, Daniel Warneke, Odej Kao
Computer Science, 2013, DOI: 10.1007/s10586-013-0281-8
Abstract: The ability to process large numbers of continuous data streams in a near-real-time fashion has become a crucial prerequisite for many scientific and industrial use cases in recent years. While the individual data streams are usually trivial to process, their aggregated data volumes easily exceed the scalability of traditional stream processing systems. At the same time, massively-parallel data processing systems like MapReduce or Dryad currently enjoy a tremendous popularity for data-intensive applications and have proven to scale to large numbers of nodes. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have disregarded QoS requirements of prospective stream processing applications so far. In this paper we address this gap. First, we analyze common design principles of today's parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a highly distributed scheme which allows these frameworks to detect violations of user-defined QoS constraints and optimize the job execution without manual interaction. As a proof of concept, we implemented our approach for our massively-parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online. For an example streaming application from the multimedia domain running on a cluster of 200 nodes, our approach improves the processing latency by a factor of at least 13 while preserving high data throughput when needed.
International Journal of Engineering Science and Technology, 2012
Abstract: MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google in 2004. In the programming model, a user specifies the computation by two functions, Map and Reduce. MapReduce, as well as its open-source implementation Hadoop, is aimed at parallelizing computation in large clusters of commodity machines. Other implementations for different environments have been introduced as well, such as Mars, which implements MapReduce for graphics processors, and Phoenix, the MapReduce implementation for shared-memory systems. This paper provides an overview of the MapReduce programming model, its various applications, and different implementations of MapReduce. GridGain is another open-source Java implementation of MapReduce. We also discuss comparisons of Hadoop and GridGain.
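The Map and Reduce functions mentioned in this abstract can be illustrated with the usual word-count example. The sketch below is a single-process Python simulation, not Hadoop or GridGain code; `run_mapreduce`, `map_fn`, `reduce_fn`, and the sample documents are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map: emit (word, 1) for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, group intermediate pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "map reduce map"), ("d2", "reduce shuffle")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

A real framework runs many mapper and reducer instances on different machines and moves the grouped pairs over the network; the user still supplies only the two functions.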
Coded MapReduce
Songze Li, Mohammad Ali Maddah-Ali, A. Salman Avestimehr
Computer Science, 2015
Abstract: MapReduce is a commonly used framework for executing data-intensive jobs on distributed server clusters. We introduce a variant implementation of MapReduce, namely "Coded MapReduce", to substantially reduce the inter-server communication load for the shuffling phase of MapReduce, thus accelerating its execution. The proposed Coded MapReduce exploits the repetitive mapping of data blocks at different servers to create coding opportunities in the shuffling phase to exchange (key,value) pairs among servers much more efficiently. We demonstrate that Coded MapReduce can cut down the total inter-server communication load by a multiplicative factor that grows linearly with the number of servers in the system and that it achieves the minimum communication load within a constant multiplicative factor. We also analyze the tradeoff between the "computation load" and the "communication load" of Coded MapReduce.
BSP vs MapReduce
Matthew Felice Pace
Computer Science, 2012, DOI: 10.1016/j.procs.2012.04.026
Abstract: The MapReduce framework has been generating a lot of interest in a wide range of areas. It has been widely adopted in industry and has been used to solve a number of non-trivial problems in academia. Putting MapReduce on strong theoretical foundations is crucial in understanding its capabilities. This work links MapReduce to the BSP model of computation, underlining the relevance of BSP to modern parallel algorithm design and defining a subclass of BSP algorithms that can be efficiently implemented in MapReduce.
MapReduce for Integer Factorization
Javier Tordable
Computer Science, 2010
Abstract: Integer factorization is a very hard computational problem. Currently no efficient algorithm for integer factorization is publicly known. However, it is an important problem on which the security of many real-world cryptographic systems relies. I present an implementation of a fast factorization algorithm on MapReduce. MapReduce is a programming model for high-performance applications developed originally at Google. The quadratic sieve algorithm is split into the different MapReduce phases and compared against a standard implementation.
Automatic Optimization for MapReduce Programs
Eaman Jahani, Michael J. Cafarella, Christopher Ré
Computer Science, 2011
Abstract: The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, substantially because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code along with lots of other program logic. We could ask the programmer to provide explicit hints about the program's data semantics, but one of MapReduce's attractions is precisely that it does not ask the user for such information. This paper covers Manimal, which automatically analyzes MapReduce programs and applies appropriate data-aware optimizations, thereby requiring no additional help at all from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously-written MapReduce programs.
On the Computational Complexity of MapReduce
Benjamin Fish, Jeremy Kun, Ádám Dániel Lelkes, Lev Reyzin, György Turán
Computer Science, 2014
Abstract: In this paper we study MapReduce computations from a complexity-theoretic perspective. First, we formulate a uniform version of the MRC model of Karloff et al. (2010). We then show that the class of regular languages, and moreover all of sublogarithmic space, lies in constant round MRC. This result also applies to the MPC model of Andoni et al. (2014). In addition, we prove that, conditioned on a variant of the Exponential Time Hypothesis, there are strict hierarchies within MRC so that increasing the number of rounds or the amount of time per processor increases the power of MRC. To the best of our knowledge we are the first to approach the MapReduce model with complexity-theoretic techniques, and our work lays the foundation for further analysis relating MapReduce to established complexity classes.
E-Learning Using Mapreduce
International Journal on Computer Science and Engineering, 2011
Abstract: E-Learning is the learning process created by interaction with digitally delivered content, services and support. The learner's profile plays a crucial role in the evaluation process and in improving the e-learning process. The customization of content is necessary to provide better services. MapReduce is a distributed programming model developed by Google. The aim of this paper is to increase the speed and decrease the processing time by using the K-MR algorithm instead of the K-Means clustering algorithm. The K-Means algorithm can be applied to the MapReduce model so that it efficiently processes large datasets; this variant is called the K-MR algorithm. The system customizes the content based on the learner's performance and is effective for both learner and instructor.
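The abstract does not spell out how K-MR works, so the following Python sketch shows only the commonly used way of expressing one K-Means iteration as a map step and a reduce step. It is an assumption, not the paper's algorithm; the function names, the 2-D points, and the starting centroids are hypothetical.

```python
from collections import defaultdict


def assign_map(point, centroids):
    """Map: emit (index of the nearest centroid, point)."""
    distances = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
                 for c in centroids]
    return distances.index(min(distances)), point


def mean_reduce(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans_iteration(points, centroids):
    """One K-Means iteration: map every point, group by centroid, reduce."""
    groups = defaultdict(list)
    for point in points:
        index, assigned = assign_map(point, centroids)
        groups[index].append(assigned)
    # A centroid that attracted no points keeps its previous position.
    return [mean_reduce(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids = kmeans_iteration(points, [(1.0, 1.0), (9.0, 9.0)])
```

Because each point is assigned independently, the map step parallelizes across data partitions, and the iteration is repeated until the centroids stop moving.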
Meta-MapReduce: A Technique for Reducing Communication in MapReduce Computations
Foto Afrati, Shlomi Dolev, Shantanu Sharma, Jeffrey D. Ullman
Computer Science, 2015
Abstract: The federation of cloud and big data activities is the next challenge where MapReduce should be modified to avoid (big) data migration across remote (cloud) sites. This is exactly our scope of research, where only the very essential data for obtaining the result is transmitted, reducing communication, processing and preserving data privacy as much as possible. We propose an algorithmic technique for MapReduce algorithms, called Meta-MapReduce, that decreases the communication cost by allowing us to process and move metadata to clouds and from the map to reduce phases.