Concurrent Data Structures in C++: Web crawler lab

CSinParallel Project

The goals for this lab are:

Note: The work on this lab requires a "tarball" named cds.tar.gz , which is located on the Manycore Testing Lab (MTL).

Note to Workshop 34 participants: The tarball cds.tar.gz is an incomplete version for students, to be completed as part of the lab exercise. For the workshop, we have created a second tarball for professors named cds_all.tar.gz which is complete, which we would not provide to students.

Also, the steps below may be carried out on MTL or on your own computer. Packages needed to run both of these Concurrent Data Structures labs on your own computer:

    Part 1: Sequential web crawler

    The first version of the web crawler program is located in a directory serial . This is a sequential program, which is uses only one processing element as it carries out its work. (The alternative is a parallel program, which can use multiple cores and/or computers to do multiple computations physically at the same time.)

  1. Copy the tarball into your directory as follows.

    % scp sigc-sNN@192.55.51.81:../cds.tar.gz .
    
    Note: This command is for a Linux or Macintosh terminal window. Do not enter the character % -- it represents the command prompt.

    Here, NN is the number assigned to you for accessing MTL. This command will prompt you for the password for your MTL account sigc-sNN. On Windows, use your SCP client (such as WinSCP) to perform the copy.

    Then, unpack the tarball as follows:

    % tar xfz cds.tar.gz 
    
    On Windows, use a utility such as 7-Zip to unpack the tarball.

    This will create a directory cds that contains the code. Change to that directory.

    % cd cds
    

  2. The directory serial contains several subdirectories, and is organized in a structure suitable for a software project that is capable of growing very large.

    Examine the code for this program. Observe the following:

    DO THIS: Write down other observations or comments about the serial code. Feel free to include opinions you may have about the code or its organization.

  3. Comments in spider.cpp indicate that two parts of the code need to be filled in for the crawler program to work. Before actually fillin that code in, we will see if we can compile and run the code successfully.

    First, insert output statements in those two locations, to indicate whether those sections of the code are reached. One message might state "processing the page x" where x is the URL of the page being processed, and the other might state "crawling the page x".

    Then cd to the serial directory, and issue a make command in that directory serial. This should compile the code and create an executable bin/spider

  4. Now fill in the missing code for the spider::crawl() method. Notes:

  5. Finally, complete the implementation of the spider::process() method, compile, and test. Note:

  6. Part 2: Multi-threaded programs

    A thread is a separate sequence of instruction execution within a program. Different threads may run on different cores of a multi-core computer; if there are more threads than cores, those threads are scheduled in turns while that program runs.

    This section of the lab introduces thread programming with boost threads, a version of which are part of the new C++11 standard.

  7. Change directory to the cds/threads directory within your lab1 directory.

       % cd ~/lab1/cds/threads
    
  8. First, look at the sequential program primes.cpp in that directory. It computes a table of primes up to a maximum number, then counts the number of primes produced.

    primes.cpp does not use threads, but is an ordinary program with multiple loops.

    Then, compile and run the program, using the Makefile provided (make primes). Feel free to experiment. Note: Be prepared to wait...

  9. Now, examine the multi-threaded program threads/primes2.cpp. This program uses the same algorithm for determining primes, but it divides the range 2..MAX up into threadct non-overlapping subranges or "chunks", and has a separate boost thread determine the primes within each chunk. Observe the following:

  10. Now, compile and run the program threads/primes2.cpp, using the Makefile provided (make primes2). This program takes an optional positive integer argument, for the thread count (default is 1 thread).

  11. You can time how long a program runs at the Linux or Macintosh command line by preceding the command with the word time. For example, the two commands

       % time primes
       % time primes2 4
    

    will report the timing for running both the sequential primes program, and the multi-threaded primes2 program with four threads.

    Perform time tests of these two programs, for at least one run of primes, one run of primes2 with one thread, and one run of primes2 with 4 threads. Feel free to experiment further.

  12. Part 3: Threaded web crawler with STL containers

  13. Examine the code for the program parallel. Observe the following:

    DO THIS: Use the diff command to compare raw_page.hpp and spider.hpp for these two versions serial and parallel. Note: The diff program will report differences by printing lines that appear differently in those files. Lines that appear only in the first file argument to diff will be prefixed by <, and lines that appear only in the second file will be prefixed by >.

    Here are differences between serial/include/spider.hpp and parallel/include/spider.hpp.

    Note: TBB containers are used instead of STL containers because TBB containers are thread safe. This means that multiple threads can safely interact with a TBB container at the same time without causing errors. STL containers are not thread-safe: with STL containers, it is possible for two threads to interact with a container in a way that produces incorrect results.

    When the correct behavior of a program depends on timing, we say that program has a race condition. The parallel version of the program uses TBB containers in order to avoid race conditions. (Race conditions are discussed in other CSinParallel modules.)

    Likewise, the state variable m_processed has the type atomic_counter instead of int or long because the atomic_counter type is thread-safe, enabling us to avoid a race condition that may arise if multiple threads interact with an integer variable at about the same time.

  14. The files serial/src/spider.cpp and parallel/src/spider.cpp contain the main differences between these programs -- the parallel version uses multiple threads. Running diff shows these differences:

  15. Fill in the code indicated in two locations for the parallel version of spider.cpp, working from the code you wrote for the serial version, as indicated by comments in the parallel code. Then compile and run your program.

    Note: This version of the program requires three command-line arguments: the maximum number of URLs; the number of threads to use (this arg is new); and the starting URL to crawl.

    Run the program with multiple threads (say, 4 threads). What do you observe about the run?

  16. To spread the computational work out better among the threads, observe that the method spider::crawl() includes a call to a method work_once() that has been commented out.

       /* work_once(); */
    
    Remove those comments, in order to enable that call to work_once(); . This will cause the program to process one web page before beginning multi-thread processing. If that first processed page includes several links, they will be added to the queue m_work, so that several threads can retrieve web pages to process when they first begin.





Disclaimer