What I Did This Summer

This summer, I was really lucky to work with Prof. Andy Pavlo in the Database Group at Carnegie Mellon University. We worked on a project named the Carnegie Mellon Database Application Catalog (CMDBAC), a repository of thousands of ready-to-run database applications for analysis and benchmarking.

1. Overview

Database workloads are important and precious to database researchers; however, real-world workloads are hard to come by. Many of the standard, widely-used benchmarks in database research are rather old: for instance, TPC-C is from 1992, while I was born in 1994 (^_^). So what is the silver bullet? We found that there are a large number of open-source database applications online, which can be used to interact with a DBMS while we monitor its behavior, thereby generating real-world workloads.

What kind of database applications are we looking for? They must have a big community, active maintenance, and good structure, which is why we settled on popular web frameworks, for example, Django, Ruby on Rails, and Drupal. They all have easy-to-edit configurations, a standard deployment process, and regular updates.

The following figure gives an overview of our project. The system crawls database applications from the Internet, automatically deploys them in a Vagrant sandbox, and drives the applications to interact with the underlying DBMS. It then captures query traces by monitoring the DBMS's logs; these traces are useful for workload analysis, automatic tuning, and benchmarking.

(Figure: system overview)

When I started working on it, the project was a mess. I cleaned up the old code, worked from frontend to backend as a full-stack engineer, and nearly rebuilt the whole system from scratch. Almost all of the code is written in Python, since it has a wealth of excellent libraries.

2. Step #1: Discovery

There is actually not much to write here. We noticed that each kind of application has a telltale file: Django applications have a models.py defining the data models, and Ruby on Rails applications have a database.yml specifying the database configuration. The crawler therefore searches GitHub for repositories containing the corresponding filename using the official search API, parses the returned results, and stores them in the database.

We encountered an interesting problem: GitHub only returns the first 100 results for each search request, which is by no means sufficient for us. We therefore added a size qualifier to the search query and incremented it after parsing each batch of 100 results, so that every request returned a different slice of repositories. Also, do not forget to sleep between requests, both for your own sake and the crawler's. The crawler has worked well and has discovered more than 100,000 repositories so far.
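Here is a minimal sketch of that trick. The endpoint and the size qualifier are GitHub's real search API, but the function, token handling, and back-off policy are my illustration, not the crawler's actual code:

```python
# A minimal sketch of the size-partition trick using the requests library.
# Note that GitHub's code search requires an authenticated token in practice.
import time
import requests

SEARCH_URL = "https://api.github.com/search/code"

def discover_repos(filename, max_size_kb=1024, token=None):
    headers = {"Authorization": "token %s" % token} if token else {}
    repos = set()
    for size in range(1, max_size_kb):
        # Vary the size qualifier so each request returns a different
        # 100-result window of repositories.
        params = {"q": "filename:%s size:%d" % (filename, size),
                  "per_page": 100}
        resp = requests.get(SEARCH_URL, params=params, headers=headers)
        if resp.status_code != 200:
            time.sleep(60)        # likely rate-limited: back off and retry
            continue
        for item in resp.json().get("items", []):
            repos.add(item["repository"]["full_name"])
        time.sleep(10)            # be polite between search requests
    return repos
```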

Currently, we do not consider forks, because some repositories are hard copies: they are forks in practice, but the owner simply copied the files from another repository into a new one. Bitbucket is also a good repository hosting service, but it does not provide a friendly search API, which is why we only support GitHub for now.

3. Step #2: Deployment

Our deployment tool downloads each application, automatically figures out the dependencies it needs, and deploys it in a Vagrant sandbox. Targeting applications that use the common web frameworks listed above makes this step easier, since they have a standard deployment process and simple configurations.

One interesting challenge is handling dependencies: the deployer must install both the language runtime and third-party packages. Detecting the runtime for Django applications is easy because almost all of them use Python 2.7, but Ruby on Rails applications may use Ruby 1.9 or Ruby 2.2, so the deployer tries different runtime versions until one succeeds. For packages, Ruby on Rails applications use Bundler, so typing bundle install is enough. Django (i.e., Python) has the excellent pip package manager, but the deployer cannot know which packages and versions are needed if the application does not provide a requirements.txt. In that case the deployer falls back on a brute-force method: it runs the application, parses the error output for missing packages, and installs them. We built a database of all known packages that the deployer consults when resolving a missing package.
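A minimal sketch of that brute-force loop, with hypothetical helper names rather than the deployer's actual code:

```python
# Run the app, parse the error output for a missing module, install it,
# and try again until the app starts or we give up.
import re
import subprocess

MISSING_MODULE = re.compile(r"No module named '?([\w\.]+)'?")

def deploy_with_retries(run_cmd, max_attempts=20):
    for _ in range(max_attempts):
        proc = subprocess.run(run_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return True                      # the application started
        match = MISSING_MODULE.search(proc.stderr)
        if not match:
            return False                     # failed for some other reason
        module = match.group(1).split(".")[0]
        # The real deployer consults its package database here, since a
        # module's import name can differ from its PyPI package name.
        subprocess.run(["pip", "install", module], check=False)
    return False
```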

Afterwards, the deployer queries the main page of the deployed application and checks the status code. We only consider a deployment attempt successful after receiving a 200 OK response. So far, we have successfully deployed 1,500+ Django applications and 1,400+ Ruby on Rails applications.
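The check itself is simple; a sketch with an assumed local URL and timeout:

```python
# Consider a deployment successful only if the main page returns 200 OK.
import requests

def deployment_succeeded(base_url="http://127.0.0.1:8000/"):
    try:
        return requests.get(base_url, timeout=30).status_code == 200
    except requests.RequestException:
        return False
```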

These web frameworks abstract the DBMS so well that we can switch between MySQL, PostgreSQL, and SQLite3 easily. The whole deployment process runs inside a Vagrant sandbox with network and file system access blocked. As future work, we plan to expand the number of deployed applications by supporting more frameworks, perhaps PHP-based or Java-based ones.
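To illustrate that DBMS flexibility: for a Django application, switching the backend only means rewriting the standard DATABASES block in settings.py. This is stock Django configuration (the credentials are placeholders), not CMDBAC-specific code:

```python
# Swapping the ENGINE line is all it takes to move the application
# to a different DBMS.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",  # or postgresql_psycopg2, sqlite3
        "NAME": "app_db",
        "USER": "benchmark",       # placeholder credentials
        "PASSWORD": "benchmark",
        "HOST": "127.0.0.1",
        "PORT": "3306",
    }
}
```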

4. Step #3: Execution

This is probably the most interesting part of the project. The core idea of execution is that the system acts as a normal user of the application in order to interact with the DBMS. We call this component the driver; it simulates user behavior by sending POST and GET requests to the application, for example, fetching web pages or submitting forms. We define such operations as actions.

Submitting forms is challenging because there are many cases to handle. For example, the driver has to generate an email-like value when it is supposed to enter an email address, and sometimes it needs to upload files. To decide what information to fill in, the driver checks the name and type of each input field to judge whether it is a password, email, or file field. Other plain text fields are filled with random values.
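A simplified sketch of that heuristic (the rules and names are illustrative; the real driver handles many more cases):

```python
# Pick a plausible value for a form field based on its name and type.
import random
import string

def random_text(n=8):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

def value_for_field(name, input_type):
    name = name.lower()
    if input_type == "file":
        return "/tmp/upload.txt"                  # path to a small dummy file
    if input_type == "password":
        return "P@ssw0rd" + random_text(4)
    if input_type == "email" or "email" in name:
        return random_text() + "@example.com"
    return random_text()                          # plain text field
```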

Some websites use CAPTCHAs to detect robots. The first ones we encountered were simple math problems that are easily bypassed with Python's eval function.

(Figure: a math CAPTCHA)
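A toy illustration of the bypass (not the driver's exact code):

```python
# Extract the arithmetic expression from the CAPTCHA prompt and let
# Python evaluate it.
import re

def solve_math_captcha(prompt):
    match = re.search(r"(\d+\s*[-+*/]\s*\d+)", prompt)
    return str(eval(match.group(1))) if match else None

print(solve_math_captcha("What is 3 + 4?"))  # -> "7"
```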

The one above is easy to handle; however, life is not always that easy. One case we hit is a website that renders its pages dynamically and uses CSS rules to hide certain input fields from human users, so that only robots would fill in those hidden fields.

The following figures show an application's sign-up page: first the raw HTML form, then the same form with CSS enabled, which is what a human user sees.


(Figure: the raw HTML form)

(Figure: the form with CSS enabled)

When the robot loads this page, it sees the raw HTML version, which has two extra fields, and our robot would happily fill in values for both of them. The server then knows you are a robot simply by checking whether those hidden fields contain values, and it told us so by sending a warm and calm welcome:

(Figure: the server's output)

We use the powerful Selenium library to create a virtual browser that executes the CSS and JavaScript on each page. Before submitting a form, the driver checks whether each field is actually hidden.
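A minimal Selenium sketch of that check (the URL is an assumption, and this uses the pre-4.0 Selenium API that was current at the time):

```python
# Skip form fields that CSS hides from human users: a hidden "honeypot"
# field must stay empty or the server flags us as a robot.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://127.0.0.1:8000/accounts/signup/")  # hypothetical URL

for field in driver.find_elements_by_tag_name("input"):
    if not field.is_displayed():   # hidden honeypot field: leave it empty
        continue
    field.send_keys("some value")  # visible field: fill it in
```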

The driver's basic workflow consists of three steps. First, it uses the built-in administration interfaces provided by Django and Ruby on Rails to insert initial data that bootstraps the database; for example, a wiki application needs a root page for all other entries, and a normal user does not have the privilege to create it. Second, it tries to register a user and log in. Finally, the driver submits all of the remaining forms. We configure MySQL to log every query it executes; we take a snapshot of the log files before and after each interaction and diff them to determine which SQL queries ran during that interaction. All of the information generated by a deployment attempt is recorded in CMDBAC's database and presented on the website.
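A sketch of that trace capture (the log path is an assumption, and MySQL's general query log must be enabled, e.g. with general_log=ON):

```python
# Diff snapshots of MySQL's general query log taken around an action to
# recover the SQL queries that the action triggered.
LOG_PATH = "/var/log/mysql/general.log"

def snapshot():
    with open(LOG_PATH) as f:
        return f.readlines()

def queries_during(action):
    before = snapshot()
    action()                       # e.g. the driver submits a form
    after = snapshot()
    return after[len(before):]     # the new tail lines are the new queries
```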

5. Tools

We also provide users with a convenient command-line tool for benchmarking. The basic logic is as follows:

(Figure: the tool's basic logic)

A user specifies the application from which to generate a benchmark workload, the DBMS to benchmark, and some other basic arguments for running the benchmark. The system then deploys the application and drives it against the user-specified DBMS in the user-defined way.
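To give a feel for the interface, here is a sketch of the kind of arguments involved (the flag names are hypothetical, not the tool's actual options):

```python
# Illustrative argument parsing for a benchmarking CLI.
import argparse

parser = argparse.ArgumentParser(description="Benchmark a CMDBAC application")
parser.add_argument("--app", required=True,
                    help="name of the deployed application to replay")
parser.add_argument("--dbms", choices=["mysql", "postgresql", "sqlite3"],
                    default="mysql", help="DBMS to benchmark against")
parser.add_argument("--actions", type=int, default=100,
                    help="number of driver actions to execute")
args = parser.parse_args()
```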

6. Analysis

We performed some basic analysis of schemas, queries, and database coverage.

For schemas, we calculated the number of tables, indexes, constraints, and foreign keys. The per-application averages are:

Average # of Tables: 13.24
Average # of Indexes: 31.24
Average # of Constraints: 32.98
Average # of Foreign Keys: 11.83

For queries, we calculated the percentage of each query type (SELECT, INSERT, etc.):

% of INSERT: 2.12%
% of UPDATE: 2.21%
% of SELECT: 34.77%
% of DELETE: 0.14%
% of OTHERS (SET, BEGIN, COMMIT, etc.): 60.76%

For database coverage, we define table-based coverage as the percentage of tables visited by queries. The following figure shows a histogram of the coverage across all the applications we deployed.

(Figure: coverage histogram)
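Computing the metric is straightforward; a simplified sketch (the real analysis parses queries properly rather than string-matching):

```python
# Table-based coverage: the fraction of an application's tables that
# appear in at least one captured query.
import re

def table_coverage(schema_tables, queries):
    visited = set()
    for query in queries:
        for table in schema_tables:
            if re.search(r"\b%s\b" % re.escape(table), query, re.IGNORECASE):
                visited.add(table)
    return len(visited) / float(len(schema_tables))
```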

7. Conclusion

I worked closely with Andy and Dana Van Aken from August to December and have continued the project since returning to China. We are now working on a full paper together. All of CMDBAC's source code is available on GitHub under the Apache Software License.

Written on December 12, 2015