Writing a Small Proxy Yourself, Part I (1)


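Background: I often watch videos on bbsuubird, but the ads there are far too long,
and on top of that there is a pile of very long pinned topics, so reaching the threads I actually care about takes a lot of scrolling.
So I wanted to write a server-side program that post-processes the HTML pages it fetches.
The architecture: a server written in rails proxies the client's requests and returns the responses to the client.
Where processing is needed, handlers are applied to the page before it goes back to the client.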
 OK, let's start. First install rails: gem install rails (I am using rails 2.2.2)
Next, generate an application with rails:
rails proxying
Create a new controller: ruby script\generate controller proxying
Then, in ProxyingController:
def index
 # ... the main processing logic goes here
end
Change the route, since we do not need an action segment in the URL:
map.connect 'proxying/:id', :controller=>"proxying", :action=>"index"
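require 'open-uri'
require 'hpricot'
def index
 # the target URL comes from the route (:id) or from a query parameter (q)
 url = params[:id] || params[:q]
 file = open(url) # open-uri fetches the remote page
 doc = Hpricot(file.read)
end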
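Because the target URL travels inside the :id path segment, it has to be escaped before it is put into a proxied link. The sketch below shows one way to do that round trip. It mirrors the escaping used in rewrite_link further down (ERB::Util.url_encode plus replacing "." with "%2E", presumably because Rails routes treat "." as a separator). The helper names to_proxy_url and from_proxy_segment are hypothetical and are not part of the original code.

```ruby
require 'erb'
require 'cgi'

# Hypothetical helper: build a proxied URL whose last path segment
# contains neither "/" nor ".".
def to_proxy_url(target, prefix = "http://localhost:3000/proxying/")
  prefix + ERB::Util.url_encode(target).gsub(".", "%2E")
end

# Hypothetical helper: recover the target URL from the escaped segment.
def from_proxy_segment(segment)
  CGI.unescape(segment)
end
```

Rails normally unescapes route parameters itself, so the decoding helper is mainly useful for checking the encoding outside a request.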
That gives us the returned HTML page.
To process that HTML, the most natural first step
is to rewrite its links:
def rewrite_link_for_doc(doc)

   rewrite_link(doc, "//img[@src]", "src", false)
   #css link
   rewrite_link(doc, "//link[@href]", "href", false)
   rewrite_link(doc, "//script[@src]", "src", false)

   rewrite_link(doc, '//*[@background]', "background", false)

   #or background:url?
   rewrite_link(doc, '*[@style*=background]', "style.background-url", false)

   #replace every link with relative link to base_url
   rewrite_link(doc, '//a[@href]', "href", true)

   rewrite_link(doc, '//form[@action]', "action", true)

  end
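The "style.background-url" entry above asks for URLs inside an inline style attribute to be rewritten as well. As a rough illustration of that idea in isolation, here is a minimal sketch that rewrites every url(...) in a style string through a block. The helper name rewrite_style_urls is hypothetical, and the regular expression is an assumption that also tolerates optional quotes and spaces; it is not the pattern the original code uses.

```ruby
# Hypothetical helper: pass each url(...) target found in an inline style
# value to the given block and substitute the block's result.
def rewrite_style_urls(style)
  style.gsub(/url\(\s*(['"]?)(.*?)\1\s*\)/) do
    quote, target = $1, $2 # capture before yielding to the caller's block
    "url(#{quote}#{yield(target)}#{quote})"
  end
end
```

For example, rewrite_style_urls("background:url(bg.gif)") { |u| "http://example.com/" + u } would yield a style value pointing at the absolute image URL.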

def rewrite_link(doc, selector, attribute, prefixing_proxy)
   doc.search(selector).to_a.each do |link|
    if attribute.index "."
      attr, attr2 = attribute.split(".")
      attr2.gsub!("-", ":")
      url = link.attributes[attr].scan(/#{attr2}\((.*)\)/)[0]
      #puts "wa:#{url.inspect},#{link}"
      next if url.nil?
      url = url[0]
    else
      url = link.attributes[attribute]
    end

    href = URI(url) rescue URI("#") #we met URI("###"),weird

    if !href.host
      #relative url
      doc_url = URI(@page.uri.to_s) #already URI::###
      if url[0] == ?/
       to_url = doc_url.scheme + "://" + doc_url.host + url #todo
      else

       to_url = doc_url.scheme + "://" + doc_url.host
       to_url += "/" if doc_url.path == ""

       str = "doc_url.path:#{doc_url.path},url:#{url}"

       if doc_url.path == ""
        to_url += url
       else
        to_url += doc_url.path.gsub!(/\/[^\/]*$/, "/#{url}")
       end

       logger.info "#{str}, to_url:#{to_url}"


      end

    else
      to_url = url # already absolute; also correct for the "style.background-url" case
    end




    if prefixing_proxy
      to_url = ERB::Util.url_encode(to_url).gsub(".", "%2E") # gsub! would return nil if there were no "."
    end

    if attribute.index "."
      attr, attr2 = attribute.split(".")
      attr2.gsub!("-", ":")


      if prefixing_proxy
       to_url = "http://localhost:3000/proxying/"+to_url
      end
      # gsub (not gsub!) so that a non-matching value can never turn into nil
      link.set_attribute(attr, link.attributes[attr].gsub(/(#{attr2})\((.*)\)/, "\\1(#{to_url})"))
    else
      if prefixing_proxy
       link.set_attribute(attribute, "http://localhost:3000/proxying/"+to_url)
      else
       link.set_attribute(attribute, to_url)
      end
    end

   end
  end
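The relative-URL branch of rewrite_link builds the absolute URL by hand from scheme, host and path. As a cross-check, Ruby's standard library can do the same resolution with URI.join. The sketch below is only an alternative illustration: the helper name absolutize is hypothetical, and falling back to the page URL for unparsable links (such as the "###" case mentioned in the code) is an assumption.

```ruby
require 'uri'

# Hypothetical helper: resolve a link found in a page against the page's URL.
# Falls back to the page URL itself when the link cannot be parsed.
def absolutize(page_url, link)
  URI.join(page_url, link).to_s
rescue URI::Error
  page_url
end
```

A link such as "thread.php?id=1" on the page "http://example.com/forum/index.php" resolves into the /forum/ directory, while "/img/a.gif" resolves against the host root.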

Finally, render the result back to the client.
Add this at the end of the index action:
render :text=>doc.to_html,
      :content_type => file.content_type
In practice, the content_type needs a little extra handling here:

render :text=>fix_result,
  :content_type => fix_contentType

def fix_result
   @result = @doc.to_html + "original link:<a href=#{url_decode(@url)}>#{url_decode(@url)}</a>"
   @result += "<form action='/proxying' method='get'>navigate to: <input type='text' name='q'></form>"
   @result += "<form action='/proxying/store' method='post'>Store it!: <input type=submit value=submit /></form>"
  end

def fix_contentType

   content_type = @page.content_type
   if content_type == "text/html"
    #if file.charset == "iso-8859-1"
    content_type += ";charset=GB2312" #&& !content_type.grep(/charset/)
    #else
    #content_type += ";charset=#{file.charset}" #???&& !content_type.grep(/charset/)
    #end
   end
   content_type # return it explicitly; otherwise non-HTML responses would get nil
  end
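The commented-out conditions in fix_contentType hint at a further refinement: only add a charset when the header does not already carry one. A minimal sketch of that check follows. The helper name with_charset is hypothetical, and GB2312 as the fallback is taken from the code above rather than detected from the page.

```ruby
# Hypothetical helper: append a fallback charset to HTML content types that
# do not declare one; everything else is passed through unchanged.
def with_charset(content_type, fallback_charset = "GB2312")
  return content_type unless content_type.to_s =~ /\Atext\/html/i
  return content_type if content_type =~ /charset=/i
  "#{content_type}; charset=#{fallback_charset}"
end
```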
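I did some refactoring along the way.
I am tired of writing, so I will take a short break.
The effect I ultimately want: the URL handler logic should be pulled out
so that handlers can be defined freely: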
define_url_handler /bbs\.uubird/, :remove_location_replace
  define_url_handler /btchina/, :remove_location_replace
  define_url_handler(/btchina/) {|url, doc| puts "nnn:#{url}"}
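
How define_url_handler might work is not shown in this post, so the following is only a sketch under assumptions: a small registry that stores a pattern together with either a method name or a block, and runs every matching handler for a URL. The module name UrlHandlers, the run method, and the (url, doc) arguments passed to named handlers are all hypothetical.

```ruby
# Hypothetical registry for URL handlers; not the project's actual code.
module UrlHandlers
  def self.registry
    @registry ||= []
  end

  # Register a method name (Symbol) or a block for URLs matching pattern.
  def self.define_url_handler(pattern, name = nil, &block)
    raise ArgumentError, "need a handler name or a block" unless name || block
    registry << [pattern, name || block]
  end

  # Run every matching handler in registration order.
  # Named handlers are sent to target; blocks are called directly.
  def self.run(url, doc, target)
    registry.each do |pattern, handler|
      next unless url =~ pattern
      if handler.is_a?(Symbol)
        target.send(handler, url, doc)
      else
        handler.call(url, doc)
      end
    end
    doc
  end
end
```

With this shape, the controller would call UrlHandlers.run(url, doc, self) after parsing the page, and each registered handler would get a chance to modify doc.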

The final code is hosted on Google's project hosting and can be downloaded from:
http://code.google.com/p/proxying/