bigdata Jan 01, 0001

Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

Frameworks

  • Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Distributed Programming

  • AddThis Hydra - distributed data processing and storage system originally developed at AddThis.
  • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
  • Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Flink - high-performance runtime, and automatic program optimization.
  • Apache Gora - framework for in-memory data model and persistence.
  • Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig - high level language to express data analysis programs for Hadoop.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Apache Spark - framework for in-memory cluster computing.
  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Apache Storm - framework for stream processing by Twitter also on YARN.
  • Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog - data processing and querying library.
  • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
  • Concurrent Cascading - framework for data management/analytics on Hadoop.
  • Damballa Parkour - MapReduce library for Clojure.
  • Datasalt Pangool - alternative MapReduce paradigm.
  • DataTorrent StrAM - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  • Facebook Corona - Hadoop enhancement which removes single point of failure.
  • Facebook Peregrine - Map Reduce framework.
  • Facebook Scuba - distributed in-memory datastore.
  • Google Dataflow - create data pipelines to help themæingest, transform and analyze data.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Metamarkers Druid - framework for real-time analysis of large datasets.
  • Netflix PigPen - map-reduce for Clojure whiche compiles to Apache Pig.
  • Nokia Disco - MapReduce framework developed by Nokia.
  • Pinterest Pinlater - asynchronous job execution system.
  • Pydoop - Python MapReduce and HDFS API for Hadoop.
  • Stratosphere - general purpose cluster computing framework.
  • Streamdrill - usefull for counting activities of event streams over different time windows and finding the most active one.
  • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
  • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
  • Twitter TSAR - TimeSeries AggregatoR by Twitter.

Distributed Filesystem

Document Data Model

  • Actian Versant - commercial object-oriented database management systems .
  • Crate Data - is an open source massively scalable data store. It requires zero administration.
  • Facebook Apollo - Facebook’s Paxos-like NoSQL database.
  • jumboDB - document oriented datastore over Hadoop.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
  • MongoDB - Document-oriented database system.
  • RavenDB - A transactional, open-source Document Database.
  • RethinkDB - document database that supports queries like table joins and group by.

Key Map Data Model

Note: There is some term confusion in the industry, and two different things are called “Columnar Databases”. Some, listed here, are distributed, persistent databases built around the “key-map” data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as “column families” (with value map keys being referred to as “columns”).

...
cannot change locale Jan 01, 0001

运行locale命令
LANG=
LANGUAGE=
LC_CTYPE=“POSIX”
LC_NUMERIC=“POSIX”
LC_TIME=“POSIX”
LC_COLLATE=“POSIX”
LC_MONETARY=“POSIX”
LC_MESSAGES=“POSIX”
LC_PAPER=“POSIX”
LC_NAME=“POSIX”
LC_ADDRESS=“POSIX”
LC_TELEPHONE=“POSIX”
LC_MEASUREMENT=“POSIX”
LC_IDENTIFICATION=“POSIX”
LC_ALL=

修改profile

vi /etc/profile

添加如下内容

export LC_ALL=en_US.UTF-8

source /etc/profile

得到错误 setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
 运行 dpkg-reconfigure locales

得到错误

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = “en_US.UTF-8”,
        LANG = “en_US.UTF-8”
    are supported and installed on your system.
perl: warning: Falling back to the standard locale (“C”).
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

...
Deploy a Mesos Cluster Using Docker Jan 01, 0001

his tutorial will show you how to bring up a single node Mesos cluster all provisioned out using Docker containers (a future post will show how to easily scale this out to multi nodes or see the update on the bottom). This means that you can startup an entire cluster with 7 commands! Nothing to install except for starting out with a working Docker server.

This will startup 4 containers:

  1. ZooKeeper
  2. Meso Master
  3. Marathon
  4. Mesos Slave Container

As mentioned the only prerequisite is to have a working Docker server. This means you can bring up a local Vagrant box with Docker installed, use Boot2Docker, use CoreOS, instance on AWS, or however you like to get a Docker server.

...
Dive in Linux capabilites Jan 01, 0001

Introduction

Capabilities in Linux are flags that tell the kernel what the application is allowed to do, If you have no additional security mechanism in place, the Linux root user has all capabilities assigned to it. As capabilities are a way for running processes with some privileges, without having the need to grant them root privileges, it is important to understand that they exist.

Consider the ping utility. It is marked setuid root on some distributions, because the utility requires the (cap)ability to send raw packets. This capability is known as CAP_NET_RAW. However, thanks to capabilities, you can now mark the ping application with this capability and drop the setuid from the file. As a result, the application does not run with full root privileges anymore, but with the restricted privileges of the user plus one capability, namely the CAP_NET_RAW.

...
Docker Jan 01, 0001

简介

Docker 是 dotCloud 最近几个月刚宣布的开源引擎,旨在提供一种应用程序的自动化部署解决方案,简单的说就是,在 Linux 系统上迅速创建一个容器(类似虚拟机)并在容器上部署和运行应用程序,并通过配置文件可以轻松实现应用程序的自动化安装、部署和升级,非常方便。因为使用了容器,所以可以很方便的把生产环境和开发环境分开,互不影响,这是 docker 最普遍的一个玩法。更多的玩法还有大规模 web 应用、数据库部署、持续部署、集群、测试环境、面向服务的云计算、虚拟桌面 VDI 等等。

...
Docker acquires SDN startup SocketPlane Jan 01, 0001

At Socketplane we started out as four guys with a collectively strong belief in open source and open communities.  We aligned around a shared vision that we wanted to be a critical part of Docker’s once in a decade disruption. Now that we are part of the Docker team, we couldn’t be happier.

We never looked to hedge our bets, our success was and obviously still is tied to the success of Docker. While there are many reasons that we decided to join the team, first and foremost Docker is unlike any other projects we have worked on in the past; the focus on user experience and simplicity is unmatched. Our early work with Docker during the open network design sprints gave us clear indications that the Docker maintainers were interested in being good open source stewards for the networking community in a project with an already staggering community of users and contributors. We also saw a genuine desire from Docker leadership to do right by both, individual contributors and the ecosystem. That made it all the more easy to jump in head first.

...
docker in tencent Jan 01, 0001

腾讯内部对Docker有着广泛的使用,其基于Yarn的代号为Gaia的调度平台可以同时兼容Docker和非Docker类型的应用,并提供高并发任务调度和资源管理,它具有高度可伸缩性和可靠性,能够支持MR等离线业务。

...
docker internal Jan 01, 0001

docker-base

Abstract

本文在现有文档的基础上总结了以下几点内容

  1. docker的介绍,包括由来、适用场景等

  2. docker背后的一系列技术 - namespace, cgroup, lxc, aufs等

  3. docker在利用LXC的同时提供了哪些创新

    ...
git commit修改前一次提交的方法 Jan 01, 0001

方法一:用–amend选项

#修改需要修改的地方。
git add .
git commit –amend

注:这种方式可以比较方便的保持原有的Change-Id,推荐使用。

方法二:先reset,再修改

这是可以完全控制上一次提交内容的方法。但在与Gerrit配合使用时,需特别注意保持同一个commit的多次提交的Change-Id是不变的。为了保持提交到Gerrit的Change不变,需要复制对应的Change-Id到commit msg的最后,可以到Gerrit上对应的Change去复制.

...
Going Native with OpenStack Centric Applications: Murano Jan 01, 0001

Following on our previous discussion surveying the projects supporting applications within OpenStack, let’s continue our review with an in-depth look at the OpenStack-native Application Catalog: Murano, currently an incubation status project, having seen its functionality and core services integration advanced over the past few OpenStack releases.

What is it?

An application catalog developed by Mirantis, HP and others (now including Cisco), that allows application developers and cloud administrators to publish applications in a categorized catalog to be perused and deployed by application consumers. The selection of applications available within the catalog is intended to be that of released versions (ready-state) of applications (cloud-native or enterprise-architected), not application versions that are mid-development. Ideally, these are applications ready to be consumed and run by application users.

...