Introduction

In Exercise set 3, we improved the scalability of our BLAST runs by first staging the data to an OSG SE. This achieved our goals, but required substantial manual work to move the data across the grid.

In this set of exercises, we will try a different approach: caching. We will stage the data to a single HTTP server and rely on each site's HTTP caches to bring the data closer to the running job and to provide scalability.


Exercises

Prerequisite

  • Login on submission node
    ssh user@osg-ss-glidein.chtc.wisc.edu
    
  • Obtain proxy certificate, if you have not done so already
    voms-proxy-init -voms osgedu:/osgedu
    
  • Make a working directory for this exercise
    mkdir -p /share/users/user/dm_part_3
    cd /share/users/user/dm_part_3
    

Using an HTTP Proxy

To use an HTTP cache successfully, three things need to be done:

  1. A file must be uploaded to a source HTTP site (this is known as "staging").
  2. The location of the cache must be known.
  3. The file must be downloaded by an HTTP client using cache-friendly headers.

We will do these in this exercise, then we will return to the previous BLAST example.

Staging Files

In the simplest setup, a web server can export a directory on the file system. To make a file accessible, you simply need to know the directory name and copy the file into it.

For this tutorial, we run the Apache web server and have the directory ~/public_html exported to the URL http://osg-ss-glidein.chtc.wisc.edu/~user. The HTTP proxy software is called Squid and is running on osg-ss-se.chtc.wisc.edu:3128. First, create your user directory in ~/public_html:

mkdir -p ~/public_html

Then, write a simple file into this directory:

echo 'Hello world!' > ~/public_html/hello_world.txt

Now, redirect your browser to http://osg-ss-glidein.chtc.wisc.edu/~user/hello_world.txt; you should see the text Hello world!.

Finding an HTTP Proxy

There are three ways to determine the location of an HTTP proxy:

  1. (Linux standard) Examine the value of $http_proxy in the running environment.
  2. (OSG-specific) On a worker node, the HTTP proxy will be set in $OSG_SQUID_LOCATION. If no proxy is available, $OSG_SQUID_LOCATION will be unset or set to UNAVAILABLE.
  3. Manually hard-code the location of a proxy in application code.

All HTTP clients packaged with the OSG will look at the value of $http_proxy.

From the login node osg-ss-glidein.chtc.wisc.edu, the correct value of http_proxy is:

export http_proxy=osg-ss-se.chtc.wisc.edu:3128

Set this in your environment now.

To initialize the $http_proxy variable on an OSG worker node, source $OSG_GRID/setup.sh (this is done for you if you use GlideinWMS). If $OSG_SQUID_LOCATION is defined and not equal to UNAVAILABLE, then it points at a proxy you may use. You will then want to set $http_proxy to the value of $OSG_SQUID_LOCATION. The following snippet implements this logic for GlideinWMS jobs:

export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

Using an HTTP Proxy

Copy the yeast.aa.psq database file into ~/public_html/user (creating that directory first if it does not exist). Then, verify $http_proxy is set:

[user@osg-ss-glidein dm_part_3]$ echo $http_proxy
osg-ss-se.chtc.wisc.edu:3128

Now, download the file using wget. We'll add the -d flag to get extra debug information:

[user@osg-ss-glidein dm_part_3]$ wget "http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq" -d
DEBUG output created by Wget 1.11.4 Red Hat modified on linux-gnu.

--2011-06-24 17:43:53--  http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq
Resolving osg-ss-se.chtc.wisc.edu... 198.51.254.111
Caching osg-ss-se.chtc.wisc.edu => 198.51.254.111
Connecting to osg-ss-se.chtc.wisc.edu|198.51.254.111|:3128... connected.
Created socket 3.
Releasing 0x000000001efd1d10 (new refcount 1).

---request begin---
GET http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq HTTP/1.0
User-Agent: Wget/1.11.4 Red Hat modified
Accept: */*
Host: osg-ss-glidein.chtc.wisc.edu

---request end---
Proxy request sent, awaiting response... 
---response begin---
HTTP/1.0 200 OK
Date: Fri, 24 Jun 2011 22:43:53 GMT
Server: Apache/2.2.3 (Scientific Linux)
Last-Modified: Fri, 24 Jun 2011 22:43:10 GMT
ETag: "2f89a4-2d79f1-4a67ced0c3b80"
Accept-Ranges: bytes
Content-Length: 2980337
Content-Type: text/plain; charset=UTF-8
X-Cache: MISS from osg-ss-se.chtc.wisc.edu
Via: 1.0 osg-ss-se.chtc.wisc.edu:3128 (squid/2.6.STABLE23)
Proxy-Connection: close

---response end---
200 OK
Length: 2980337 (2.8M) [text/plain]
Saving to: `yeast.aa.psq'

100%[=====================================================================>] 2,980,337   8.92M/s   in 0.3s    

Closed fd 3
2011-06-24 17:43:54 (8.92 MB/s) - `yeast.aa.psq' saved [2980337/2980337]

In the output, note the response headers from the proxy server:

---response begin---
HTTP/1.0 200 OK
Date: Fri, 24 Jun 2011 22:43:53 GMT
Server: Apache/2.2.3 (Scientific Linux)
Last-Modified: Fri, 24 Jun 2011 22:43:10 GMT
ETag: "2f89a4-2d79f1-4a67ced0c3b80"
Accept-Ranges: bytes
Content-Length: 2980337
Content-Type: text/plain; charset=UTF-8
X-Cache: MISS from osg-ss-se.chtc.wisc.edu
Via: 1.0 osg-ss-se.chtc.wisc.edu:3128 (squid/2.6.STABLE23)
Proxy-Connection: close

---response end---

Specifically, the line X-Cache: MISS from osg-ss-se.chtc.wisc.edu indicates that this was a cache miss: the file was not in the cache, so the proxy had to contact the upstream server osg-ss-glidein.chtc.wisc.edu to retrieve it. On the next attempt, however, the file should be served directly from the cache:

[user@osg-ss-glidein dm_part_3]$ wget "http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq" -d
DEBUG output created by Wget 1.11.4 Red Hat modified on linux-gnu.

--2011-06-24 17:43:59--  http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq
Resolving osg-ss-se.chtc.wisc.edu... 198.51.254.111
Caching osg-ss-se.chtc.wisc.edu => 198.51.254.111
Connecting to osg-ss-se.chtc.wisc.edu|198.51.254.111|:3128... connected.
Created socket 3.
Releasing 0x00000000094e1d10 (new refcount 1).

---request begin---
GET http://osg-ss-glidein.chtc.wisc.edu/~user/user/yeast.aa.psq HTTP/1.0
User-Agent: Wget/1.11.4 Red Hat modified
Accept: */*
Host: osg-ss-glidein.chtc.wisc.edu

---request end---
Proxy request sent, awaiting response... 
---response begin---
HTTP/1.0 200 OK
Date: Fri, 24 Jun 2011 22:43:53 GMT
Server: Apache/2.2.3 (Scientific Linux)
Last-Modified: Fri, 24 Jun 2011 22:43:10 GMT
ETag: "2f89a4-2d79f1-4a67ced0c3b80"
Accept-Ranges: bytes
Content-Length: 2980337
Content-Type: text/plain; charset=UTF-8
Age: 7
X-Cache: HIT from osg-ss-se.chtc.wisc.edu
Via: 1.0 osg-ss-se.chtc.wisc.edu:3128 (squid/2.6.STABLE23)
Proxy-Connection: close

---response end---
200 OK
Length: 2980337 (2.8M) [text/plain]
Saving to: `yeast.aa.psq.1'

100%[=====================================================================>] 2,980,337   11.2M/s   in 0.3s    

Closed fd 3
2011-06-24 17:43:59 (11.2 MB/s) - `yeast.aa.psq.1' saved [2980337/2980337]

The pertinent header this time was X-Cache: HIT from osg-ss-se.chtc.wisc.edu. This indicates the file was served from the osg-ss-se.chtc.wisc.edu proxy, not the source server.

Food for thought: why is it important to verify that X-Cache: HIT appears in most of your log files?
If the proxy at a remote site stops working (or if, due to misconfiguration, none of your files are ever cached!), the transparency of caching can cause many problems. The full transfer load of your running jobs then:
  1. Goes over the wide-area network. WAN bandwidth is scarce, expensive, and must be shared among all VOs.
  2. Goes to your source server. It is unlikely your source server can support the load of all your running jobs.
If you run in this manner, at best your jobs will run inefficiently. At worst, your VO may be temporarily banned from a site if it consumes too many scarce resources!
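One way to audit this is to scan your jobs' saved wget -d output for the X-Cache header. The helper below is a sketch (the function name count_cache and the sample log are illustrative; in practice you would point it at each job's log file):

```shell
# count_cache LOGFILE: print hit/miss counts from a saved wget -d log.
count_cache() {
  hits=$(grep -c 'X-Cache: HIT' "$1")
  misses=$(grep -c 'X-Cache: MISS' "$1")
  echo "hits=$hits misses=$misses"
}

# Demo on a sample log fragment (a stand-in for real wget -d output):
cat > sample.log <<'EOF'
X-Cache: MISS from osg-ss-se.chtc.wisc.edu
X-Cache: HIT from osg-ss-se.chtc.wisc.edu
X-Cache: HIT from osg-ss-se.chtc.wisc.edu
EOF
count_cache sample.log    # prints: hits=2 misses=1
```

With N jobs fetching the same file through one proxy, you would expect roughly one MISS and N-1 HITs; a large number of misses suggests the proxy is down or not caching your files.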

In this tutorial, we are accessing files from a single proxy (osg-ss-se.chtc.wisc.edu) whose hardware is similar to that of the source server (osg-ss-glidein.chtc.wisc.edu). Because the two machines are similar and sit on the same local network as the submit server, it is unlikely there is a significant difference in speed between them (unless one is overloaded by the activities of your classmates!).

However, you can expect a more significant difference at large grid sites for two reasons:

  1. The hardware is either more powerful or the proxy service is load-balanced between multiple hosts.
  2. The source is "far away" in terms of networking (and hence slow to access) while the proxy is "close to" the worker node (and faster to access).

How much easier was it to stage files with HTTP versus SRM?

Using HTTP Securely

Unlike SRM, HTTP has no built-in safeguard to detect whether a file has been altered in transit. It is therefore unsafe to transfer executables via HTTP without further protection: an attacker might change your code to do something nasty! To guard against this, checksum the files prior to transfer and verify the checksum after download. If the two values match, the file has not been tampered with.

The blastp executable is actually quite large, larger than the yeast database we've been using! Instead of transferring it with Condor, let's transfer it over HTTP. Because it is an executable file, we'll want to perform the checksumming procedure.

Copy the blastp executable to your working directory, /share/users/user/dm_part_3. Calculate the checksum, and then copy the file to your export directory:

[user@osg-ss-glidein dm_part_3]$ sha1sum blastp
15bcc93b0fc6f78e604a1ee95f78ab05f1b6c17e  blastp
[user@osg-ss-glidein dm_part_3]$ sha1sum blastp > my_checksum.sha
[user@osg-ss-glidein dm_part_3]$ cp blastp ~/public_html/user

We'll transfer only the small checksum file, my_checksum.sha, with Condor, then download and verify the actual executable on the worker node. The verification requires a shell snippet like the following (assume the base URL is given as the first argument and stored in $1):

export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi
BASE_URL=$1
wget -d --retry-connrefused --waitretry=10 "$BASE_URL/blastp" || exit $?
sha1sum -c my_checksum.sha || exit 1
chmod +x blastp
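You can try the checksum-verification step locally before wiring it into the wrapper (the file names demo_file and demo.sha are stand-ins):

```shell
# Record a file's SHA-1, then verify it -- the same check the wrapper
# performs on the worker node after downloading blastp.
echo 'payload' > demo_file
sha1sum demo_file > demo.sha
sha1sum -c demo.sha               # prints "demo_file: OK", exit status 0

# Any modification makes verification fail with a nonzero exit status:
echo 'tampered' > demo_file
sha1sum -c demo.sha || echo 'tamper detected'
```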

Previously, we had the following lines in the Condor submit file test.submit:

transfer_input_files = blastp,$(query_input_name)
Arguments =  srm://red-srm1.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop/user/osgedu/bbockelm yeast $(query_input_name) blast_results.$(Cluster).$(Process)

This would be changed to:

transfer_input_files = my_checksum.sha,$(query_input_name)
Arguments =  http://osg-ss-glidein.chtc.wisc.edu/~user/user yeast $(query_input_name) blast_results.$(Cluster).$(Process)

Caching and BLAST

Return to the blast_wrapper.sh script from Exercise 2. Make the following changes:

  1. Instead of downloading files via SRM, rewrite the script to take advantage of HTTP caching. Run it on the submit node.
  2. Re-run the single-query exercise. Instead of transferring the blastp executable using Condor, transfer it securely using the HTTP proxy. Only the query file should be transferred by Condor.
  3. Re-run the many-jobs exercise.
  4. EXTRA CREDIT: Write a DAG to completely automate the previous exercise, including staging the input and creating a tarball out of the output.

Note that one difference between HTTP caching and SRM is that SRM provides a mechanism for listing the contents of a directory, which we used to download all the database files in the original blast_wrapper.sh. A successful solution to the above problem will have to use an alternate method for downloading all the necessary input.
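One such method, sketched below, is to stage a plain-text manifest alongside the database files and have each job fetch the manifest first, then every file it names (the manifest name db_files.txt and the yeast.aa.* pattern are illustrative):

```shell
# stage_manifest DIR: run on the submit node after copying the database
# files into the export directory; records their names in a manifest.
stage_manifest() {
  ( cd "$1" && ls yeast.aa.* > db_files.txt )
}

# fetch_all BASE_URL: run on the worker node; downloads the manifest,
# then every file it lists (through the proxy if $http_proxy is set).
fetch_all() {
  wget -q "$1/db_files.txt" || return $?
  while read -r f; do
    wget -q "$1/$f" || return $?
  done < db_files.txt
}
```

On the submit node you would run stage_manifest ~/public_html/user; in the job wrapper, fetch_all http://osg-ss-glidein.chtc.wisc.edu/~user/user.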

Remote I/O

Remote I/O is left as an exercise to complete on your own. Here are a few pages that will be very useful:

Return to the run-blast.sh script from Monday. Make the following changes:

  1. Change appdir to /chirp/CONDOR/share/blast/bin
  2. Change datadir to /chirp/CONDOR/share/blast/data

In the submit file, you will have to add: +WantIOProxy = True. Also, add a new wrapper script wrapper.sh:

#!/bin/sh
######## wrapper.sh ################
# This script directly passes all
# its arguments into 'parrot_run'
####################################

./parrot_run "$@"

wrapper.sh will be the new executable. You will need parrot_run from the cctools distribution; I have already downloaded it for you at /opt/cctools/bin/parrot_run. It will need to be listed in transfer_input_files in the submit file.
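Putting these pieces together, the relevant submit-file additions might look like the following (a sketch; the arguments line is an assumption based on wrapper.sh passing its arguments to parrot_run):

```
executable = wrapper.sh
transfer_input_files = parrot_run, run-blast.sh
arguments = ./run-blast.sh
+WantIOProxy = True
```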

This is largely an on-your-own exercise. If you need help, please ask; the error output of the Condor job is always the first place to look.

Topic revision: r9 - 18 Jun 2013 - 21:06:26 - DerekWeitzel