by Mariva H. Aviram

Code-centric search tool strives to reduce Java development time

Jun 1, 1998

IBM's jCentral among the first of a new wave of targeted search engines

If you’ve spent hours, days, or even weeks or months searching for an obscure piece of Java-related information or a code example, you probably understand the frustration that such a quest involves. To find what you’re looking for, you might try a number of Java informational sites, perhaps browsing manually through articles and archives, topic by topic. You scan the subject lines of scores of Usenet newsgroup articles. You peruse Java code directories, which contain dozens of code examples for just about everything but what you actually need. After plugging in as many keywords (and NOT-keywords) as you can think of, you slog through myriad pages of general-purpose search engine results. You even resort to printed materials: books, magazines, old notes, anything that might offer solutions to your Java development problems. Sometimes, if you’re lucky, you eventually find what you’re looking for. But often you don’t. Perhaps the most frustrating aspect of development is knowing that some piece of necessary information is out there, but not knowing how to find it.

Your efforts to find Java resources may now reap more rewards — and require less time. IBM jCentral, announced and showcased at the recent JavaOne Java developer conference in San Francisco, is an information-specific search engine for Java resources. In other words, jCentral is a search tool that finds only Java resources. And it finds all types of Java resources, including source code, JavaBeans, applets, and Java-related newsgroup articles and Web sites.

Once the jCentral technology finds code in applets, beans, source code files, and newsgroup articles, it extracts the salient features of the code for indexing purposes. For example, when crawling a Java applet, jCentral analyzes the embedding HTML page and the applet class file to obtain information about the applet, such as all of its invoked methods. The information is subsequently indexed so that users can issue queries to find, say, all the Java applets that make a network connection by invoking methods from the java.net.Socket class, or all the applets that contain a particular button or a slider bar. Developers can use this specific code-searching technique on class methods, strings, and other snippets of useful Java code.
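To make the idea of extracting features from a class file concrete: a compiled class lists every class it refers to in its constant pool, so a crawler can read that table without ever running the applet. The following is a minimal sketch under that assumption, not IBM's implementation; the ClassRefScanner name and its referencedClasses method are hypothetical, and the code reads only the constant pool and ignores the rest of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: lists the classes a compiled class file refers to. */
class ClassRefScanner {

    static Set<String> referencedClasses(byte[] classFile) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFile))) {
            if (in.readInt() != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file");
            }
            in.readUnsignedShort(); // minor version
            in.readUnsignedShort(); // major version
            int count = in.readUnsignedShort(); // constant pool entries are numbered 1..count-1
            String[] utf8 = new String[count];
            int[] classNameIndex = new int[count]; // 0 means "slot is not a Class entry"
            for (int i = 1; i < count; i++) {
                int tag = in.readUnsignedByte();
                switch (tag) {
                    case 1: utf8[i] = in.readUTF(); break;                     // Utf8 text
                    case 7: classNameIndex[i] = in.readUnsignedShort(); break; // Class
                    case 3: case 4: in.skipBytes(4); break;                    // Integer, Float
                    case 5: case 6: in.skipBytes(8); i++; break;               // Long, Double use two slots
                    case 8: case 16: case 19: case 20: in.skipBytes(2); break; // String and newer 2-byte entries
                    case 9: case 10: case 11: case 12: case 17: case 18:
                        in.skipBytes(4); break;                                // member refs, NameAndType, dynamic
                    case 15: in.skipBytes(3); break;                           // MethodHandle
                    default: throw new IllegalArgumentException("unknown constant tag " + tag);
                }
            }
            Set<String> names = new TreeSet<>();
            for (int idx : classNameIndex) {
                if (idx > 0 && idx < count && utf8[idx] != null) {
                    names.add(utf8[idx].replace('/', '.')); // internal form java/net/Socket
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ClassRefScanner <file.class>");
            return;
        }
        for (String name : referencedClasses(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(name);
        }
    }
}
```

An index built from such sets could answer a query like "all applets that refer to java.net.Socket." The result also includes the class's own name and its superclass, and array types appear in descriptor form, so a real indexer would need further filtering.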

Because it is optimized for running Java-specific searches, jCentral represents an important new tool for the Java community. Internet development community leaders, such as the attendees of the Seventh International World Wide Web Conference in Australia, are voicing concerns about the growing ineffectiveness of monolithic search engines when used for specific purposes. As the Internet grows, the available information associated with any given keyword grows accordingly, which leads to general-purpose search engines becoming clogged with massive amounts of data that is often irrelevant and useless to users. For instance, an English-only search for “Java” through Alta Vista (Digital Equipment Corp.’s popular general-purpose Internetwide search tool) uncovers more than 800,000 documents.

A predictable trend in response to this problem is to create vertical, information-specific search engines and data repositories. jCentral is one of the first vertical searching services. It acts like a code-specific Alta Vista — or, as IBM Network Computing Software Division Webmaster and www.ibm.com/java Product Manager Dirk Nicol describes it, jCentral is a “crawler with attitude.” Since jCentral is such new technology, Nicol encourages JavaWorld readers to try out jCentral and send him feedback at nicold@us.ibm.com. “We created jCentral for one reason: to help Java developers,” Nicol notes. “We are very anxious to continue to help Java developers write code.”

Distinct approach focuses on code

If the data repositories of general-purpose search engines can become clogged with massive amounts of irrelevant data, can’t this also happen with vertical, specific-purpose search engines, and if so, how can this be prevented? Nicol says that this problem is prevented by the very nature of how jCentral (and other specific-purpose search engines) works. General-purpose search engines find and categorize information based on text keywords and the metatags of Web pages, which are controlled (and often abused) by Web page authors. In contrast, jCentral analyzes Java code as well as text comments; thus it searches, weighs, and indexes the inherent attributes of code rather than just the textual descriptions of code.

jCentral is both a global resource and an internal organization/intranet resource. Whereas the jCentral search tool at ibm.com/java is global, IBM itself also uses a private version that allows in-house developers to search for Java materials within IBM’s intranet. jCentral as a global resource is currently free and available to the public. (To date, IBM has not announced plans to license source code as a separate product to internal enterprise developers or to bundle it with another product.)

jCentral employs a combination of automated and manual approaches to growing the repository. This combination offers a distinct advantage over other Java-development services, which take only a manual approach to compiling resources.

The automated component operates by using a bank of IBM-patented 100 percent Java-based crawlers to search the Internet for Java materials and then adding the materials to the jCentral repository, analyzing and classifying the materials in the process. The repository currently houses approximately 150,000 items, including 40,000 applets and 60,000 pieces of Java code. jCentral also has added JavaWorld magazine’s entire catalogue of articles and source code to the repository. The repository’s growth rate is erratic and unpredictable, IBM notes, but averages several thousand additions per week.

The manual resource-compilation component of jCentral involves a more typical approach to growing a resource catalogue. Essentially, Java developers submit code to be included in the jCentral repository, and a small staff within the jCentral development team manually checks the code and approves it for inclusion. So far, the level of code-submission traffic isn’t high or overwhelming. The number of Java developers is limited, so jCentral staff members don’t expect to be as swamped with submissions as a general-public search engine like Yahoo! is sure to be.

The jCentral code-approval staff doesn’t judge the quality of the code submitted or provide editorial content. This encourages developers to write abstracts and descriptions about their own code, so that users will be inclined to learn more about it and put it to use. Once the code is approved, metadata, which is the descriptive information about the output data (rather than the output data itself), is added to the jCentral repository.

Visual maps, e-mail notification

An impressive feature of jCentral is its ability to provide a class hierarchy diagram of a Java bean or component, which is a visual map of the code. (See the example Class Navigator image below.) When clicked, each node in the diagram provides relevant code and descriptive information in the box at the bottom. (To view a diagram, click the Map button next to the search result abstract. A good example to try is the bean keyword frame, because it uses a lot of the JDK.)

jCentral also offers an automatic e-mail notification service, which initiates a persistent query and periodically sends subscribers new results of a single search. Developers can use this feature to search for resources they need for development or find new instances of their own source code posted by others on the Internet. Combining both immediate searching and notification for future search results eliminates the burden of running constant searches for the same thing. This comes in handy if, for example, you’re always looking for a new and improved animation bean. And jCentral users needn’t worry about receiving unsolicited advertisements as a result of submitting an e-mail address to this service — IBM notes that jCentral is not a marketing tool, so information about subscribers will not be used for any purpose beyond notifying subscribers of search results.
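The notification service can be pictured as a stored query that remembers which results a subscriber has already been sent. The sketch below is an assumption about how such a persistent query might work, not a description of jCentral's internals; the PersistentQuery class and the example URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical persistent query: reports only results not reported before. */
class PersistentQuery {
    private final Set<String> alreadySent = new HashSet<>();

    /** Takes the current result URLs of one search and returns only the unseen ones. */
    List<String> newResults(List<String> currentResults) {
        List<String> fresh = new ArrayList<>();
        for (String url : currentResults) {
            if (alreadySent.add(url)) { // add() returns false when the URL was already recorded
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        PersistentQuery animationBeans = new PersistentQuery();
        // First run: everything is new to the subscriber.
        System.out.println(animationBeans.newResults(
                Arrays.asList("http://a.example/Anim.jar", "http://b.example/Spin.jar")));
        // Later run: only the newly discovered bean would go into the e-mail.
        System.out.println(animationBeans.newResults(
                Arrays.asList("http://b.example/Spin.jar", "http://c.example/Fade.jar")));
    }
}
```

A scheduler would rerun the search periodically and mail the returned list whenever it is not empty.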

Avoid duplicate efforts; learn from others

jCentral lends itself to a plethora of purposes and potential uses. The most obvious use is to avoid duplicate efforts in code development. If someone has already created a code example or devised a tutorial that will save you time and energy, you might as well use it — that is, if you can find it.

Finding appropriate examples and tutorials comes in handy when you’re stuck on a line or section of code. For instance, if you need a good MD5 security algorithm or a specific invoked method (which typically is difficult to find), you can search for and evaluate such code using jCentral. Or, if you’re developing Java for use with a database, you can run searches for examples that involve tier 3 (the back-end database), tier 2 (middleware), or tier 1 (the client). (For instance, a search on the class name java.net.Socket would yield results related to connections to back-end databases.) Once you find code that will work for you, you can ask the author for permission to use it. You can also analyze how someone else developed a certain aspect of code to give you ideas on how to design your own. Just one successful use of jCentral can save you at least a couple of hours of development time.

Developers also can use jCentral to figure out if someone else is using their code. For example, Java developers at IBM periodically look up instances of code that use IBM-named classes by searching for the reversed IBM domain name (the com.ibm package prefix). Prior to the release of jCentral, tracing authorized or unauthorized uses of your code proved difficult.

jCentral’s impact extends well beyond the programming world; it has great potential for marketing purposes. Marketers and business developers can use jCentral to advertise their beans, applets, and other Java resources to the global Java community. In this way, jCentral can be used as a tool to build communities and facilitate commerce.

Because jCentral can find data beyond text, and it offers multitiered searching, you can search for a piece of code with specific attributes and properties. For example, you can find all applets with GridBagLayout, and then narrow those results using other specific parameters. jCentral also allows you to query specific parts of source code so that you can differentiate between source code and comments, such as when searching for author information embedded within code. You can also find an applet from a particular source or a specific domain; for example, if you are looking for a student who wrote a calculator applet, you can search for all calculator applets from all edu domains.
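Multitiered searching of this kind amounts to combining several conditions over the indexed attributes of each item. Here is a rough sketch, assuming a repository that stores a type, a URL, and the set of classes each item uses; RepoItem, TieredSearch, and the example URLs are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical repository entry: metadata plus a pointer to the resource. */
class RepoItem {
    final String type;             // for example "applet" or "source"
    final String url;
    final Set<String> classesUsed; // class names extracted when the item was crawled

    RepoItem(String type, String url, String... classesUsed) {
        this.type = type;
        this.url = url;
        this.classesUsed = new HashSet<>(Arrays.asList(classesUsed));
    }
}

class TieredSearch {
    static Predicate<RepoItem> ofType(String type) {
        return item -> item.type.equals(type);
    }

    static Predicate<RepoItem> usesClass(String className) {
        return item -> item.classesUsed.contains(className);
    }

    /** Matches hosts that end in the given top-level domain, such as "edu". */
    static Predicate<RepoItem> fromDomain(String suffix) {
        return item -> {
            String host = URI.create(item.url).getHost();
            return host != null && host.toLowerCase().endsWith("." + suffix);
        };
    }

    static List<String> search(List<RepoItem> repo, Predicate<RepoItem> query) {
        List<String> hits = new ArrayList<>();
        for (RepoItem item : repo) {
            if (query.test(item)) {
                hits.add(item.url);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<RepoItem> repo = Arrays.asList(
                new RepoItem("applet", "http://cs.example.edu/Calc.class", "java.awt.GridBagLayout"),
                new RepoItem("applet", "http://www.example.com/Calc.class", "java.awt.GridBagLayout"),
                new RepoItem("source", "http://cs.example.edu/Calc.java", "java.awt.GridBagLayout"));
        // Start with all applets that use GridBagLayout, then narrow to edu hosts.
        Predicate<RepoItem> query = ofType("applet").and(usesClass("java.awt.GridBagLayout"));
        System.out.println(search(repo, query));
        System.out.println(search(repo, query.and(fromDomain("edu"))));
    }
}
```

Each narrowing step is just another predicate joined with and(), which mirrors the way a user refines a first set of results with further parameters.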

jCentral offers flexible and robust search features. A single query can run across all the types of information. With a single search, you can find the variety of ways that, for example, a GridBagLayout can be used — in applets, source code, FAQs, newsgroup articles, and Web sites. After conducting the search, jCentral passes the code through a profile engine that was itself written in Java. And there is at least one additional (if unusual) use of jCentral: Manager of the IBM Almaden Research Center Web Technologies Department Dan Ford uses the tool to periodically find information about jCentral itself on external Web sites; in this way, he can monitor the public use and perception of his company’s product.

It started as an internal IBM tool…

How did jCentral come into existence? jCentral is a child product of Grand Central Station (GCS), an IBM research project involving general-purpose search technology designed to search the Web for data formats beyond HTML. Impressively, the variety of formats that GCS can search and analyze includes relational databases reached through SQL or ODBC, image and video formats like GIF or MPEG, spreadsheets, archive files in TAR and ZIP formats, and programming languages.

jCentral is the jewel in the GCS crown — “one of the best examples showcasing the possibilities of GCS,” says Qi Lu, Ph.D., research staff member of the IBM Almaden Research Center Web Technologies Department. Since jCentral is based on GCS, it’s portable to other languages, such as C/C++ or Perl, but to date GCS/jCentral developers have not announced any plans for other code-specific search tools.

jCentral originally was developed to help the 2,500 Java developers within IBM avoid duplicate searching and development efforts among themselves. After bridging search crawler technology developed by the Computer Science Department of the IBM Almaden Research Center with source code analysis and visual mapping technology from the Haifa Research Lab in Haifa, Israel, Dirk Nicol first put jCentral to use by crawling the IBM intranet for Java code and information resources. Thus IBM’s developers, located in various countries, could easily find and reuse each other’s code. From this intranet prototype, Nicol and his associates performed testing and bug-fixing and continuously evaluated feedback from internal beta testers. After this internal development period, Nicol, with IBM’s blessing, expanded the goal of the jCentral project to include serving the public.

The jCentral project developers teamed up with the IBM Java Marketing Department to organize several focus groups consisting of experienced Java developers. These focus groups helped improve the look and feel and functionality of jCentral, prompting changes such as adding newsgroup threading, expanding the power-search capabilities, polishing the interface, and increasing JavaBean support.

Learning more about Java

Some aspects of developing jCentral required extra thought and investigation, such as the process of crawling Java resources, which is significantly different from crawling regular HTML pages. An applet embedded in an HTML page, for example, is marked by the <applet> tag, which contains the main classes, such as image or audio classes and other classes. The descriptive information within the <applet> tag provides an overview of the Java applet itself. In order to search for a Java applet and crawl for Java applet code, one must understand the structure of Java. In fact, the process of developing the jCentral crawlers enhanced Lu’s knowledge of Java: As he developed jCentral, he learned more about Java from a development basis as well as from a detailed component-searching basis. Specifically, Lu and the other jCentral developers had to analyze the possibilities of applets, beans, and class files in order to create effective crawlers. Because jCentral is 100 percent Java, Lu actually used it to assist him with the process of developing jCentral — an interesting example of true bootstrapping.
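As an illustration of the first step in crawling an applet, the sketch below pulls the attributes out of each <applet> tag in a page so that a crawler would know which class file to fetch next. It is a simplified, regex-based assumption about how this could be done, not jCentral's crawler; the AppletTagScanner name is hypothetical, and the pattern does not handle a ">" character inside a quoted attribute value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds <applet> tags in HTML and returns their attributes. */
class AppletTagScanner {
    // Captures everything between "<applet" and the closing ">" of the opening tag.
    private static final Pattern APPLET =
            Pattern.compile("<applet\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // name="value", name='value', or name=value
    private static final Pattern ATTR = Pattern.compile(
            "([A-Za-z_][\\w-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    static List<Map<String, String>> scan(String html) {
        List<Map<String, String>> applets = new ArrayList<>();
        Matcher tag = APPLET.matcher(html);
        while (tag.find()) {
            Map<String, String> attrs = new LinkedHashMap<>();
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3)
                        : attr.group(4);
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), value);
            }
            applets.add(attrs);
        }
        return applets;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<APPLET CODE=\"Lake.class\" CODEBASE=\"applets/\" WIDTH=200 HEIGHT=150>"
                + "<PARAM NAME=\"image\" VALUE=\"pond.gif\"></APPLET>"
                + "</body></html>";
        for (Map<String, String> applet : scan(page)) {
            // code and codebase tell the crawler where the main class file lives.
            System.out.println(applet);
        }
    }
}
```

The code and codebase attributes point at the applet's main class, which a crawler could then download and analyze as a class file.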

The most important technical step of building jCentral was to build its “engine,” that is, to develop a crawler that can traverse and recognize Java resources from the Internet. Since crawling is a continually running task that can run for weeks, it was critical for the jCentral developers to limit the growth of memory used by the Java virtual machine; otherwise, many of the CPU cycles and much of the I/O bandwidth would be consumed by garbage collection instead of crawling new Java resources.

Lu and the other jCentral developers discovered two methods that make a Java-based crawler efficient. The first method was to externalize key data structures, such as the hashtables that store crawled URLs, keeping only a small portion in memory and most of the structures in persistent storage, such as data files. The second method was to create a controller running in a separate address space to monitor and periodically shut down and restart the crawler. The latter method requires building mechanisms for pausing the crawler, dumping internal state into files, loading saved state from files after the restart, and resuming crawling from the saved state. Using the combination of both methods eliminated the memory growth problem and enabled efficient crawling.
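The pause, dump, reload, and resume cycle can be sketched in a few lines. This is only an illustration under simple assumptions (one text file, one line per URL); the CrawlState class and its file layout are hypothetical and are not taken from jCentral, and the separate controller process that triggers the restart is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical crawler state that can be dumped to a file and reloaded after a restart. */
class CrawlState {
    final Deque<String> frontier = new ArrayDeque<>(); // URLs still to be crawled
    final Set<String> visited = new LinkedHashSet<>(); // URLs already crawled

    void enqueue(String url) {
        if (!visited.contains(url) && !frontier.contains(url)) {
            frontier.add(url);
        }
    }

    /** Returns the next URL to crawl, or null when the frontier is empty. */
    String next() {
        String url = frontier.poll();
        if (url != null) {
            visited.add(url);
        }
        return url;
    }

    /** Dumps the state: "V url" for visited entries, "F url" for frontier entries. */
    void save(Path file) {
        List<String> lines = new ArrayList<>();
        for (String url : visited) lines.add("V " + url);
        for (String url : frontier) lines.add("F " + url);
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds the state written by save(), typically in a freshly started JVM. */
    static CrawlState load(Path file) {
        CrawlState state = new CrawlState();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.startsWith("V ")) state.visited.add(line.substring(2));
                else if (line.startsWith("F ")) state.frontier.add(line.substring(2));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return state;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("crawl-state", ".txt");
        CrawlState before = new CrawlState();
        before.enqueue("http://a.example/");
        before.enqueue("http://b.example/");
        before.next();     // crawl the first URL
        before.save(file); // the controller pauses the crawler here

        CrawlState after = CrawlState.load(file); // after the restart
        System.out.println("resuming with " + after.next());
        Files.delete(file);
    }
}
```

Because the restarted process begins with a fresh heap and only the reloaded state, memory that accumulated during the previous run is discarded. A production crawler would also keep the visited set mostly on disk, as the first method describes, rather than loading all of it.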

jCentral developers also learned more about the less technical side of Java, such as how people are using it, how widespread it is, how much source code is available, how many applets are available, and what the most popular applets are. They found that most available source code comprises examples provided by university professors and students, so predictably a high ratio of applets are available at educational sites. Also, most popular applets include a variety of games, snippets of code examples for tutorials (such as examples of object serialization), a scrolling ad bar, and The Lake Applet, David Griffiths’s GIF-processing applet that creates a watery reflective look and appears on more than 4,000 Web pages. (See Resources below.) IBM notes that the jCentral crawlers didn’t uncover much use of Microsoft J/Direct, apparently because Java developers generally use Sun’s JDK.

Through both focus groups and crawler findings, jCentral developers found that Java programmers often look for difficult-to-understand aspects of the JDK that are not documented well, such as GridBagLayout, an obscure AWT layout manager in the JDK. Sometimes, the crawlers find many duplicate sets of data, such as university site postings of James Gosling’s sorting-related source code. (In this case, jCentral found nearly 100 hits containing the same data.) Generally, developers look for high-quality source code, applets, and beans for use on home pages and in development projects as well as code examples for learning Java.

Tool for thieves?

Although jCentral promises to make Java development easier, some developers may be concerned about the potential danger of intellectual property theft. Code thieves can use a decompiler, a disassembler such as javap, a hexadecimal dump, or the Unix strings command to recover source code or its details from bytecode. The threat of intellectual property theft, however, isn’t limited to jCentral, or any other search tool, per se; it’s a problem with the Internet in general. It is possible to use general-purpose search engines to find the same data you can find through jCentral; it just takes longer with general-purpose engines, because they are not optimized for use with Java resources.

Even so, jCentral developers mitigate this danger by implementing methods that protect copyrighted material. First, jCentral does not run applets; its repository merely contains metadata about the code and pointers to Internet resources that contain the code. This distinction is important to developers who don’t want anyone, including jCentral, to use and run their code without first obtaining explicit permission. Second, jCentral offers two versions of its search-result details page: The regular page, and the “copyright-respectful” one, which finds specific limitations of using code. Anything with a copyright automatically appears with only the appropriate subset of search-result information.

Feedback from developers has been mostly positive. At JavaOne, the jCentral pavilion booth was like a carnival attraction. Skeptical developers repeatedly requested searches for obscure data, such as Red-Black Tree and AVL Tree (balanced search tree structures), bubble sorts, word wrap, and MD5 security algorithms. Many programmers simply asked, “Can you find my code?” jCentral never failed to provide source and example code.

Carl Muckenhoupt, a former EarthWeb developer who now does consulting work for the company, is delighted with jCentral. Before using jCentral, he was not able to find a particular applet he had written, but a quick search via jCentral revealed its location. Muckenhoupt predicts that EarthWeb’s services, Developer.com and Gamelan, and jCentral may assume complementary positions, analogous to the non-competitive positions of Yahoo! and Alta Vista.

Some Java developers are skeptical about jCentral’s usefulness. SGI software engineer and JavaWorld columnist Bill Day believes he won’t be using jCentral on a regular basis because he already knows where to go to find specific Java-related information without conducting a general search, either through a general-purpose search engine or jCentral. He hasn’t ruled out jCentral entirely yet, though; he says he will be “keeping an eye on jCentral’s code-specific search functionality,” and trying it out from time to time to find particular code examples. Day also feels it would be a big improvement if jCentral were to allow users to run Boolean searches through subsets of its repository (such as searching just “JavaWorld” but not “java.sun.com” — and other such combinations of index subsets). (A good example of how this might work is DevSearch, a Web site that indexes the content of more than a dozen sites of interest to Web developers.)

Michael Shoffner, president of software development firm PDC inc., says jCentral doesn’t appear to provide any more real information than Alta Vista or DejaNews, that the type size on jCentral’s results pages is too small, and that for newsgroups the DejaNews search tool provides a better interface.

Looking ahead

Several improvements are underway. At the end of May, jCentral developers plan to release a tutorial designed to introduce users to jCentral, including all of its features and an explanation of how to conduct effective searches. Other planned jCentral improvements and enhancements are in process; developers are working to add UML standard notation to the mapping feature; make the URL clickable within the identification field under the maps; glean details such as imports, packages, and inner classes from source code as much as possible (in the meantime they’re removing references to the class information, methods, and data members, or dimming out the tabs); add categories for articles and documentation that are distinct from the Other category; embed content-type identifying unique icons in each Results entry; and allow users to specify how often e-mail notifications are sent.

Despite jCentral’s novelty and apparent need for a bit more polish, it’s a service that all Java developers should experience. As Udi Manber, Ph.D., professor of computer science at The University of Arizona in Tucson and a consultant to IBM, advises Java developers, “If you do anything with Java, jCentral definitely is something to have in your list of bookmarks.”

A 10-year veteran of the Internet, Mariva H. Aviram is an Internet consultant and writer covering the computer industry. Mariva’s published works include articles in c|net, NetscapeWorld, and InfoWorld. Recently, Mariva wrote XML For Dummies Quick Reference, which is pending publication by IDG Books Worldwide. For more information, visit http://www.mariva.com/.