Friday, September 27, 2013

Broker-Node interaction and visibility - Debugging "missing" cartridges on a node.

In the previous post I set up the end-point messaging for OpenShift. (Broker -> Messaging -> Node). I showed a simple use of the MCollective mco command and where the MCollective log files are. The last step was to send an echo message to the OpenShift agent on an OpenShift node and get the response back.

Now I have my OpenShift broker and node set up (I think) but something's not right and I have to figure out what.

DISCLAIMER: this post isn't a "how to" it's a mostly-stream-of-consciousness log of my attempt to answer a question and understand what's going on underneath.  It's messy.  It may cast light on some of the moving parts.  It may also lead me to a confrontation with The Old Man From Scene 24 and we all know how that ends. You have been warned.

In the paragraphs below I include a number of CLI invocations and their responses.  I include a prompt at the beginning of each one to indicate where (on which host) the CLI command is running.

  • broker$ - the command is running on my OpenShift broker host
  • node$ - the command is running on my OpenShift node host
  • dev$ - the command is running on my laptop

I've also got a copy of the origin-server source code checked out from the repository on Github.

I've got my rhc client already configured for my test user (cleverly named 'testuser') and my broker (using the libra_server variable). See ~/.openshift/express.conf if needed.

What's going on here?

I started trying to access the Broker with the rhc CLI command to create a user, register a namespace and then create an application. I'd like to create a python app and I've installed the openshift-origin-cartridge-python package to provide that app framework. But when I try to create my app I'm told that Python is not available:


client$ rhc create-app testapp1 python
Short Name Full name
========== =========

There are no cartridges that match 'python'.

So I figure I'll ask what cartridges ARE available:

client$ rhc cartridges

Note: Web cartridges can only be added to new applications.

Now, when I look for cartridge packages on the node I get a different answer:

node$ rpm -qa | grep cartridge
openshift-origin-cartridge-abstract-1.5.9-1.fc19.noarch
openshift-origin-cartridge-cron-1.15.2-1.git.0.aa68436.fc19.noarch
openshift-origin-cartridge-php-1.15.2-1.git.0.090a445.fc19.noarch
openshift-origin-cartridge-python-1.15.1-1.git.0.0eb3e95.fc19.noarch

Somehow, when the broker is asking the node to list its cartridges, the node isn't answering correctly. Why?

I'm going to see if I can observe the broker making the query to list the nodes and then see if I can determine where the node is (or isn't) getting its answer.

Refresher: MCollective RPC and OpenShift


MCollective is really an RPC (Remote Procedure Call) mechanism. It defines the interface for a set of functions to be called on the remote machine. The client submits a function call which is sent to the server. The server executes the function on behalf of the client and then returns the result.

The OpenShift client adds one more level of indirection and I want to get that out of the way. I can look at the logs on the broker and node to see what activity was caused when the rhc command issued the cartridge list query.

The broker writes its logs into several files in /var/log/openshift/broker. You can see the REST queries arrive and resolve in the Rails log file /var/log/openshift/broker/production.log.

broker$ sudo tail /var/log/openshift/broker/production.log
...
2013-09-26 17:54:06.445 [INFO ] Started GET "/broker/rest/api" for 127.0.0.1 at 2013-09-26 17:54:06 +0000 (pid:16730)
2013-09-26 17:54:06.447 [INFO ] Processing by ApiController#show as JSON (pid:16730)
2013-09-26 17:54:06.453 [INFO ] Completed 200 OK in 6ms (Views: 3.6ms) (pid:16730)
2013-09-26 17:54:06.469 [INFO ] Started GET "/broker/rest/api" for 127.0.0.1 at 2013-09-26 17:54:06 +0000 (pid:16730)
2013-09-26 17:54:06.470 [INFO ] Processing by ApiController#show as JSON (pid:16730)
2013-09-26 17:54:06.476 [INFO ] Completed 200 OK in 6ms (Views: 3.8ms) (pid:16730)
2013-09-26 17:54:06.504 [INFO ] Started GET "/broker/rest/cartridges" for 127.0.0.1 at 2013-09-26 17:54:06 +0000 (pid:16730)
2013-09-26 17:54:06.507 [INFO ] Processing by CartridgesController#index as JSON (pid:16730)
2013-09-26 17:54:06.509 [INFO ] Completed 200 OK in 1ms (Views: 0.4ms) (pid:16730)

From that I can see that my rhc calls are arriving and apparently the response is being returned OK.

The default settings for the MCollective client (on the OpenShift broker) don't go to a log file. I can check the OpenShift node though to see what's happened there and if it has received a query for the list of installed cartridges.

node$ sudo grep cartridge /var/log/mcollective.log | tail -3
I, [2013-09-26T17:10:23.696825 #9827]  INFO -- : openshift.rb:1217:in `cartridge_repository_action' action: cartridge_repository_action, agent=openshift, data={:action=>"list", :process_results=>true}
I, [2013-09-26T17:29:10.768487 #9827]  INFO -- : openshift.rb:1217:in `cartridge_repository_action' action: cartridge_repository_action, agent=openshift, data={:action=>"list", :process_results=>true}
I, [2013-09-26T17:29:24.957806 #9827]  INFO -- : openshift.rb:1217:in `cartridge_repository_action' action: cartridge_repository_action, agent=openshift, data={:action=>"list", :process_results=>true}

This too looks like the message has been received and processed properly and returned.

Hand Crafting An mco Message


Here's where that MCollective RPC interface definition comes in. I can look at that to see how to generate the cartridge list query using mco so that I can observe both ends and track down what's happening.

There are really two things to look for here:

  1. What message is sent (and how do I duplicate it)?
  2. What action does the agent take when it receives the message?

For part one, MCollective defines the RPC interfaces in a file with a .ddl extension.  Looking for one of those in the openshift-server Github repository finds me this: origin-server/msg-common/agent/openshift.ddl 

Of particular interest are lines 390-397. These define the cartridge_repository action and the set of operations it can perform: install, list, erase


Taking that, I can craft an mco rpc message to duplicate what the broker is doing when it queries the nodes:

mco rpc openshift cartridge_repository action=list
Discovering hosts using the mc method for 2 second(s) .... 1

 * [ ==========================================================> ] 1 / 1


ec2-54-211-74-85.compute-1.amazonaws.com 
   output:


Finished processing 1 / 1 hosts in 32.15 ms

Yep, it still says "none".  When I go back and look at the logs, it shows the same query I was looking at, so I think I got that right.

But What Does It DO?


Now that I can send the query message, I need to find out what happens on the other end to generate the response.  My search begins in the node messaging plugin for MCollective, particularly in the agent module code (in plugins/msg-node/mcollective/src/openshift.rb).  This defines a function cartridge_repository_action which... doesn't actually do the work, but points me to the next piece of code which actually does implent the function.

It appears that the OpenShift node implements a class ::OpenShift::Runtime::CartridgeRepository which is a factory for an object that actually produces the answer. A quick look in the source repository shows me the file that defines the CartridgeRepository class.

dev$ cd ~/origin-server
dev$ find . -name \*.rb | xargs grep 'class CartridgeRepository' 
./node/test/functional/cartridge_repository_func_test.rb:  class CartridgeRepositoryFunctionalTest < NodeTestCase
./node/test/functional/cartridge_repository_web_func_test.rb:class CartridgeRepositoryWebFunctionalTest < OpenShift::NodeTestCase
./node/test/unit/cartridge_repository_test.rb:class CartridgeRepositoryTest < OpenShift::NodeTestCase
./node/lib/openshift-origin-node/model/cartridge_repository.rb:    class CartridgeRepository


So, on the node, when a query is received for the list of cartridges that is present, the MCollective agent for OpenShift creates one of the CartridgeRepository objects and then asks it for the list.

A quick look at the cartridge_repository.rb file on Github is enlightening. First, the file has 60 lines of excellent commentary before the code starts. Line 86 indicates that the CartridgeRepository object will look for cartridges in /var/lib/openshift/.cartridge_repository (while noting that this location should be configurable in the /etc/openshift/node.conf someday). And lines 170-189 define the install method which seems to populate the cartridge_repository from some directory which is provided as an argument.

But when does CartridgeRepository.install get invoked? Well, since CartridgeRespository is a factory and a Singleton (which provides the instance() method for initialization) I can look for where it's instantiated:

dev$ find . -type f | xargs grep -l OpenShift::Runtime::CartridgeRepository.instance | grep -v /test/
./plugins/msg-node/mcollective/src/openshift.rb
./node-util/bin/oo-admin-cartridge
./node/lib/openshift-origin-node/model/upgrade.rb

Note that I remove all of the files in the test directories using grep -v /test/. What remains are the working files which actually instantiate a CartridgeRepository object.  If I also check for a call to the install() method, the list is reduced to one file:

find . -type f | xargs grep  OpenShift::Runtime::CartridgeRepository.instance | grep -v /test/  | grep install
./plugins/msg-node/mcollective/src/openshift.rb:              ::OpenShift::Runtime::CartridgeRepository.instance.install(path)

So, it looks like the node messaging module is what populates the OpenShift cartridge repository.  When I looked earlier though, it didn't seem to have done that.  Messaging is running and I've installed cartridge RPMs and I can successfully query for (what turns out to be) an empty database of cartridge information.

Finally! When I look at plugins/msg-node/mcollective/src/openshift.rb lines 26-45 I find what I'm looking for. CartridgeRepository.install is called when the MCollective openshift agent is loaded. That is: when the MCollective service starts.

It turns out that I'd started and began testing the MCollective service before installing any of the OpenShift cartridge packages. Restarting MCollective populates the .cartridge_repository directory and now my mco rpc queries indicate the cartridges I've installed.

Verifying the Change

So, I think, based on the code I've found, that when I restart the mcollective daemon on my OpenShift node, it will look in /usr/libexec/openshift/cartridges and it will use the contents to populate /var/lib/openshift/.cartridge_repository (not sure why that's hidden, but..).

node$ ls /var/lib/openshift/.cartridge_repository
redhat-cron  redhat-php  redhat-python

DING!

Now when I query with mco, I should see those.  And I do:

broker$ mco rpc openshift cartridge_repository action=list
Discovering hosts using the mc method for 2 second(s) .... 1

 * [ ==========================================================> ] 1 / 1


ec2-54-211-74-85.compute-1.amazonaws.com 
   output: (redhat, php, 5.5, 0.0.5)
           (redhat, python, 2.7, 0.0.5)
           (redhat, python, 3.3, 0.0.5)
           (redhat, cron, 1.4, 0.0.6)



Finished processing 1 / 1 hosts in 29.41 ms

I suspect that the OpenShift broker also caches these values, so I might have to restart the openshift-broker service on the broker host as well.  Then I can use rhc in my development environment to see what cartridges I can use to create an application.

dev$ rhc cartridges
php-5.5    PHP 5.5    web
python-2.7 Python 2.7 web
python-3.3 Python 3.3 web
cron-1.4   Cron 1.4   addon

Note: Web cartridges can only be added to new applications.

And when I try to create a new application:

$ rhc app-create testapp1 python-2.7

Application Options
-------------------
  Namespace:  testns1
  Cartridges: python-2.7
  Gear Size:  default
  Scaling:    no

Creating application 'testapp1' ... done


Waiting for your DNS name to be available ...


Well, that's better than before!

What I learned:

  • rhc is configured using ~/.openshift/express.conf
  • look at the logs
    • OpenShift broker: /var/log/openshift/broker/production.log
    • mcollective server: /var/log/mcollective.log
  • the mcollective client mco can be used to simulate broker activity
    • mco plugin doc - list all plugins available
    • mco plugin doc openshift - list OpenShift RPC actions and parameters
    • mco rpc openshift cartridge_repository action=list
      query all nodes for their cartridge repository contents
  • source code is useful
    • OpenShift source repository: https://github.com/openshift/origin-server
    • judicious use of find and grep can narrow problem searches
  • cartridge RPMs are installed in /usr/libexec/openshift/cartridges
  • cartridges are "installed" in /var/lib/openshift/.cartridge_repository
  • adding cartridges to a node requires a restart for the mcollective service

No comments:

Post a Comment