Dragonfly: [Feature Request] supporting sendfile in dfget #1403

Posted on 2020-07-14 | Post modified: 2020-07-14

Words count in article: 418

This post is to record the process of solving the Dragonfly Issue#1403, which is related with the optimization to the p2p uploader. To be specific, support the sendfile system call in upload piece process.

The conclusion is that, for the reader with computation task, it is impossible to use the send file to provide the fast read & write. Only the regular io.Reader type could have this feature, which are *io.File and *io.LimitedReader.

Here are the reference resources:

Dragonfly - Source Code Note

Posted on 2020-07-08 | Post modified: 2020-08-23

Words count in article: 5.4k

This note is to reocrd the source code analysis of Dragonfly, which is the program that I participate during the 2020 summer as the remote internship. Thanks Alibaba for this great opportunity.

This blog (how to read source code?) helps me a lot from the perpective of theory of source code reading. While for the methodology, I will first locate the interested part of the source code, and then using githistory.xyz tool to keep track of the code from the very beginning of the development. With the help of the githistory.xyz, I could find the associated code together with the part that I’m focused on by looking at every commit of the file.

Linus说： “烂程序员关心的是代码。好程序员关心的是数据结构和它们之间的关系。”

Research for Dragonfly Bandwidth Allocation Strategy

Posted on 2020-06-26 | Post modified: 2020-07-21

Words count in article: 921

可以看到，DragonFly与以下三者均无可比性。因此我目前的想法是将取一个目标的占用带宽比，之后向该值移动，具体机制可以参考TCP的拥塞协议。

Google File System

数据流选择物理距离上最近的机器进行传输，自定义的网络拓扑保证了IP地址可以用于计算机器之间的距离。

机器实际上无需抉择带宽的分配：毕竟只有一个传输目标。
MapReduce

底层使用了GFS作为存储结构，由于master只需要向worker节点传输input data，而数据在GFS中是作为三副本来传输的，因此存储在master的数据在其他节点也有存储，所以可以直接就近传输。在大数据量的情形下，这种策略的并发性尤其好。
BitTorrent

download: 经典的实现中采用了 rarest-first block shceduling ，即默认网络中各个节点的带宽均匀。

upload: 优先对曾经给自己传输过数据的节点发送数据。

因为DragonFly根本没有考虑到peer节点之间的公平性问题，关于带宽的优化目标只是最大化其使用，所以不可直接使用BitTorrent的结果

How to interview

Posted on 2020-06-19 | Post modified: 2020-06-22

Words count in article: 1.3k

This note is used to record the useful points in the book “How to Interview at Amazon for International Professionals: Learn the American Interview Style and the Amazon Leadership Principles“. I learned the book from this link, which is the tutorial to the behavioral questions in Amazon Interview.

Kubernetes In Action

Posted on 2020-06-17 | Post modified: 2020-06-17

Words count in article: 366

This is the entry level book for Kubernetes(“koo-burr-NET-eez”) beginners. The book is written by Marko Lukša, who is a fullstack programmer with 20+ years experience.

The Go Programming Language

Posted on 2020-06-12 | Post modified: 2020-07-07

Words count in article: 2.2k

This post is meant to record the fundamentals of Golang. The main material I used for this post is “The Go Programming Language”, which is the well-known tutorial book for the Golang beginner. Apart from that, I’ll lookup the topic that I’m confused on Internet.

大四学年总结

Posted on 2020-06-10 | Post modified: 2020-06-10

Words count in article: 1.6k

现在是2020年六月十日，昨天刚刚结束本科学业中的最后一门考试，只剩下一些即将到来的仪式，基本上可以宣告我本科的学业已经结束了。就用这个笔记来记录我大四这一年的感悟吧。

Critique of MapReduce Program Synthesis

Posted on 2020-05-20 | Post modified: 2020-06-10

Words count in article: 567

[Abstract][1]

By abstracting away the complexity of distributed systems, largescale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we can automatically synthesize MapReduce-style distributed programs from input–output examples. Our ultimate goal is to enable end users to specify large-scale data analyses through the simple interface of examples. We thus present a new algorithm and tool for synthesizing programs composed of efﬁcient data-parallel operations that can execute on cloud computing infrastructure. We evaluate our tool on a range of real-world big-data analysis tasks and general computations. Our results demonstrate the efﬁciency of our approach and the small number of examples it requires to synthesize correct, scalable programs.

Critique of MapReduce

Posted on 2020-05-19 | Post modified: 2020-05-20

Words count in article: 529

Map and Reduce

They are the core functions defined in the functional programming language. The map function takes two parameters, func and para, where the para will be passed to the given func and return the results. Here the para could be a list, in which case every item will be passed once to the given function, and form a list from the return value. While the reduce function is kind of like the filter function.

But the actual action of the map and reduce in the paper is different. MapReduce applies map function to each logical record in input to compute a set of intermediate key/value pairs, and then applies reduce operation to all the values shared by the same key. From the perpective of user, the MapReduce interfaces decouple the operation of the data and the mechanism of distributed computing, and provide users a set of easy-coding interface to interact with data.

Implementation

MapReduce Workflow

MapReduce will first split the user input into several partitions, and pass them to different machines to execute map task. The number of partition (R) and the method of partitioning (hash function) could be specified by the user.

And the map function code is copied to all machines. The partition is formed by the chunks. In general, the size of each chunk is 16MB or 64MB. Denote the number of the chunks as M.
One of the copies of the program is the master node, which will search the idle worker nodes and assign them tasks. There are M map tasks and R reduce tasks to assign.
For one map worker, it will first apply the map function to the given input, and buffer the result in memory. Periodically, the buffered pairs are written to local disks, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
When the reduce worker is notified by the master about the locations, it will fetch the buffered data sfrom the local disks of the map workers by remote procedure calls. When it has read all the intermediate data, it will sort it by the intermediate keys.
The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key the the corresponding set of intermediate values to the reduce function. The output of the reduce function is appended to a final output file for this reduce function.
When all map tasks and reduce tasks have been completed, the master wakres up the user program.

Feedback

Feedback on environment setup - 05/19

I have created one basic ubuntu image in Docker and pulled the source code from the Github. But there’s no example to run to test the program. So it seems that I need to create a data set for testing.

Today I finished reading the Paper MapReduce. And I have uploaded my critique.

plan for 05/20: read the paper MapReduce Program Synthesis and try to create the data set

The setup part is expected to be finished in 05/21.

Reference

RemyCC Algorithm

Posted on 2020-05-17 | Post modified: 2020-05-17

Words count in article: 1.2k

该算法为UCR CS204 Advanced Computer Networks 项目，目的在于深入研究由MIT提出的RemyCC算法，并且在不同环境下进行测试，来探究其性能，尝试找出其限制。

秦baibai

So we beat on, boats against the current, borne back ceaselessly into the past.

RSS

GitHub E-Mail Instagram