Full text search in Jekyll blog using Lunr.js

12 Jul 2013

What is lunr.js

Lunr.js is a JavaScript library that indexes content provided in JSON format. The index can be used to perform a full text search. How is that different from a simple ‘grep’? It uses modern search techniques such as tokenization, stemming, and omitting stop words. Although lunr.js provides default algorithms for each of these techniques, the user can override them to fit their specific needs. And of course, the name lunr is just a play on Solr, which is also a full text search engine, but made for heavy-duty tasks.
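To make those three techniques concrete, here is a minimal sketch in plain JavaScript. It is not how lunr.js implements them: the stop word list, the suffix-stripping rule, and the function names (`tokenize`, `naiveStem`, `analyze`) are all made up for illustration.

```javascript
// Toy text pipeline: illustrates the ideas only, NOT lunr.js's implementation.
var STOP_WORDS = ['a', 'an', 'and', 'in', 'is', 'of', 'the', 'to']; // tiny, hypothetical list

// Tokenization: split text into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (token) {
    return token.length > 0;
  });
}

// Stemming: crude suffix stripping; real stemmers are far more careful.
function naiveStem(token) {
  return token.replace(/(ing|ed|es|s)$/, '');
}

// Full pipeline: tokenize, drop stop words, then stem what is left.
function analyze(text) {
  return tokenize(text)
    .filter(function (token) { return STOP_WORDS.indexOf(token) === -1; })
    .map(naiveStem);
}

var terms = analyze('Searching the indexed posts in a Jekyll blog');
console.log(terms);
```

Because the query is run through the same pipeline as the documents, a search for "search" can match a post that only contains "Searching", which a plain ‘grep’ for the whole word would miss.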

Other libraries used in the example:

jQuery (optional)
underscore.js (optional)

Why do I need to search a blog created by Jekyll?

First, because there is no database when using Jekyll. Hence no queries. So searching is not straightforward. Second, because I have started using Jekyll, and I think a blog without search is weird.

Show me the code

You could just View Source of this file and find all the code I’m using for this site in search.js. Here is the rundown:

Create an Index:

Provide the fields of the data to be indexed

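```
var index;
createIndex();

// Declare the searchable fields and the field that identifies each document.
function createIndex() {
	index = lunr(function () {
		this.field('title', {boost: 10}); // matches in the title count more
		this.field('content');
		this.field('date');
		this.ref('url'); // the post URL serves as the document reference
	});
}
```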
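The effect of a field boost can be sketched with a toy inverted index. Everything here is hypothetical (`BOOSTS`, `toyIndex`, `addToToyIndex`, `toySearch`, and the two sample documents); lunr.js computes its scores differently, so treat this only as an illustration of why a title match can outrank a content match.

```javascript
// Toy inverted index with per-field boosts. NOT lunr.js's scoring algorithm.
var BOOSTS = { title: 10, content: 1 };
var toyIndex = Object.create(null); // term -> { ref -> accumulated score }

// Add every word of every boosted field to the index, weighted by the boost.
function addToToyIndex(doc) {
  Object.keys(BOOSTS).forEach(function (field) {
    var words = String(doc[field] || '').toLowerCase().split(/\W+/);
    words.forEach(function (word) {
      if (!word) return; // skip empty strings produced by split
      toyIndex[word] = toyIndex[word] || Object.create(null);
      toyIndex[word][doc.url] = (toyIndex[word][doc.url] || 0) + BOOSTS[field];
    });
  });
}

// Return { ref, score } pairs for a single term, best score first.
function toySearch(term) {
  var hits = toyIndex[term.toLowerCase()] || {};
  return Object.keys(hits)
    .map(function (ref) { return { ref: ref, score: hits[ref] }; })
    .sort(function (a, b) { return b.score - a.score; });
}

addToToyIndex({ url: '/guava', title: 'Guava Table', content: 'A table is like a spreadsheet' });
addToToyIndex({ url: '/mongo', title: 'MongoDB practices', content: 'No table schema changes' });

var toyResults = toySearch('table');
console.log(toyResults);
```

Both sample documents mention "table", but the one with the word in its title comes first because the title boost dominates its score.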

Loading the data to be indexed:

A Jekyll blog has no JSON data representing its posts, so you have to embed all your blog posts in the HTML when the page loads. I know this is weird, but for a blog with a few hundred pages of plain text it should not slow down your load time much. Also, if you do not display all the blog posts at once, it is better to hide the loaded data using CSS. In the example below, I’m loading all blog data into doc_* elements, of which the .doc_content element is hidden by default.

```
var posts = []; // every loaded document, used later to resolve search results
loadData();

// Read each post from the page's .doc elements and add it to the index.
function loadData() {
	$('.doc').each(function () {
		var doc = {};
		doc.date = $(this).find('.doc_date').text();
		doc.content = $(this).find('.doc_content').text();
		doc.title = $(this).find('.doc_title').text();
		doc.url = $(this).find('.doc_title').attr('href');

		index.add(doc);
		posts.push(doc);
	});
}
```

Searching the index:

Although searching the index is as easy as calling index.search(query), the return value is not an array of the loaded documents. Instead, each result carries the ref of a matching document (here the post URL, as set through this.ref('url')) along with a score indicating how well it matched. So we have to look up the corresponding document in the list of loaded documents.

```
// Run the query and map each result's ref back to the loaded post.
function getResults(query) {
	var docs = [];
	var results = index.search(query);
	_.each(results, function(result) {
		console.log('Result ref: ' + result.ref);
		// result.ref is the post URL, so match it against post.url
		var doc = _.find(posts, function(post) {
			return post.url === result.ref;
		});
		if (doc) docs.push(doc);
	});
	return docs;
}
```
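If the blog grows, scanning the whole posts list for every result gets wasteful. One alternative, sketched below without lunr.js or underscore.js, is to build a lookup keyed by URL once and resolve each result from it. The names `samplePosts`, `sampleResults`, `buildLookup`, and `resolveResults` are hypothetical, and the URLs and scores are made-up stand-ins for the real `posts` array and the output of index.search(query).

```javascript
// Sketch: resolve lunr-style results ({ref, score}) through a lookup object.
// The sample data below is invented for illustration.
var samplePosts = [
  { url: '/2013/07/12/lunr.html', title: 'Full text search in Jekyll blog using Lunr.js' },
  { url: '/2013/07/10/java-notes.html', title: 'Java Notes' }
];
var sampleResults = [
  { ref: '/2013/07/10/java-notes.html', score: 0.42 },
  { ref: '/missing.html', score: 0.10 }
];

// Build the url -> post table once, so each result is resolved with one lookup.
function buildLookup(allPosts) {
  var byUrl = Object.create(null);
  allPosts.forEach(function (post) { byUrl[post.url] = post; });
  return byUrl;
}

// Keep the result order (best match first) and skip refs with no known post.
function resolveResults(results, byUrl) {
  return results
    .map(function (result) { return byUrl[result.ref]; })
    .filter(function (post) { return post !== undefined; });
}

var found = resolveResults(sampleResults, buildLookup(samplePosts));
console.log(found.map(function (post) { return post.title; }));
```

For a few hundred posts the difference is negligible, so the version with _.find is perfectly fine; the lookup mostly pays off when every keystroke triggers a search.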

11 Jan 2015 » Guava 2: Event Bus

`//TODO` Also explained well in the [Guava guides](https://code.google.com/p/guava-libraries/wiki/EventBusExplained)

07 Dec 2014 » Creating a new cloud hosted Java website

## Why Java

Because I do not find Java as repulsive as a lot of people who write blogs do. I have been working with Java for a while. If one intends to be an expert in software/web programming, then one should spend more time honing complex skills such as availability, data design, cross platform communication, logging, user interaction, continuous deployment, and analytics (see below for details). Unfortunately, you have to relearn a few of these from scratch if you decide to use a new programming language. I have attempted to use Rails in a limited manner previously and was extremely happy with it. But I ended up taking a lot of shortcuts by trying to reuse too much code via RubyGems instead of taking the time to learn Ruby well. Ruby is an awesome language, and using it in a half-assed manner does you more harm than good. So, Java.

### Availability

Whether a website serves 3 infrequent users or 1 million daily users, it has to be available every minute of the day. If you have a server at home that runs all the time, plus a few other admin skills and resources, then a home server might be fine. Beginners can always use something like ngrok. It is meant for a different purpose, but can be used on a tiny scale for hosting a website. For everyone else, just host it on a cloud provider. Most of them will start with the 3 basic services: virtual machines (around 0.5-0.6 GB RAM, single core CPU), file storage (5GB), and a hosted database (RDBMS and key value store). These are the 3 things you will have to interact with the most. Then there are a lot of other services such as load balancing, queues, email service, etc. For my current project, I've decided to use Heroku. It provides a free tier with a restrictive virtualized hardware config. But it should be enough to get started. If I find it too restrictive for my use, I can change over to something else such as AWS with Elastic Beanstalk without much effort.

### Data design

This note will mainly focus on the database system rather than data modeling. I'm planning to create a website (later, maybe a native app) to allow people to store events. Irrespective of the domain, most applications will require some users to interact, mostly after signing up. It is crucial to ensure that this data is secured and backed up. I have used RDBMS and a few NoSQL databases with varying levels of involvement. I have had a difficult time getting Spring Security to work smoothly before and have decided not to waste any time with it this time. But I don't wish to spend much time designing user management from scratch either. So this time I have decided to give Parse.com a try. Parse core is nothing but a fancier JSON store. So, if I decide later that their system does not provide X, I can swap it with another database system such as MongoDB. I understand that this might require significant effort. But I'm willing to take a chance this time. The main reason is the simplification of the data as well as UI level user management. More on the UI piece later. By choosing Parse, I'm leaving my data availability worries on the able shoulders of Facebook/Parse engineers.

### Cross platform communication

This section deals with communication between a client and your server, whether through various browsers via HTML, native apps, or a RESTful API via JSON. I have done the HTML and JSON part before using Spring MVC. It is simple and works reliably. Because most of the scenarios in a native app would require communicating with a RESTful backend, the Spring MVC backend should work out fine. There could be some scenarios where I might want to communicate with the DB layer directly. Parse gives that option with ease.

### Logging

The logging library question has been resolved in Java for good. Just use slf4j. Its API is rich and accepted everywhere. Maintaining system logs is a different issue though. There are various services available for storing and processing your logs. While selecting one, look for the primary features such as:

1. retention days (a couple of days might be okay),
2. retention size (a few hundred MBs should be enough),
3. passive/active log scanning,
4. and alerting (MOST IMPORTANT in my opinion).

I have used the free editions of Splunk and New Relic before. Splunk watches logs passively, whereas New Relic has to be deployed as part of your website's WAR file. Both free editions are restrictive, but New Relic provides alerting and 1 day log retention. I haven't tried other open source log management tools yet. I will update this section if I do. For now, I'm planning to use PaperTrail. The free edition provides a mere 100MB of logs each month. It does have alerting support, unlike the free Splunk.

### User interaction

### Continuous deployment

Do not ignore CD (or CI). Even though you might be a lone developer, having an efficient workflow will save much time in the long run. To keep things simple, you can download Jenkins on your machine and call it a day. To make things fancier and cloudy (err.. maybe cloud-sy), look for services such as Codeship. If you are building OSS, then Travis CI or Atlassian products could provide deep discounts too. AWS has (or is about to release) similar products too. I'm planning to use Codeship.io because I prefer to keep my work machine as light as possible. So if I have to work on a different machine for a few days, I only have to worry about installing the bare minimum software.

### Analytics

25 Jan 2014 » MongoDB: How $pull works

#### TLDR;

##### $pull is just an update. So the pulling happens only on the documents returned by the query part of `update`.

Some background first. The `$pull` operator in MongoDB is used in conjunction with the `update` command to remove (or pull out) elements from an array. The syntax for an `update` command is

```
db.collection.update( { /* find query */ }, { /* new value */ } );
```

Copied from the [official documentation](http://docs.mongodb.org/manual/reference/operator/update/pull/):

```
{ flags: ['vme', 'de', 'pse', 'tsc', 'msr', 'pae', 'mce' ] }
```

The following operation will remove the msr value from the flags array:

```
db.cpuinfo.update( { flags: 'msr' }, { $pull: { flags: 'msr' } } )
```

Personally, I had a hard time understanding the necessity of the first part of the `update` command in this case. If the values equal to 'msr' are going to be pulled for the key 'flags', then why repeat the same in the query part? Although the documentation is not incorrect, the oversimplified example makes it deceptive. The `$pull` operator does not come into play until the query part returns some documents. In other words, keep in mind that this is just an extension to `update`. So, don't think about the `$pull` until the query part is satisfied by at least one document in the collection. For example,

```
db.students.insert({ name: 'Bob', grades: ['low', 'high'] });
db.students.insert({ name: 'Mom', grades: ['low', 'average'] });
```

Now, although the `$pull` part in the following query would seem to apply to both documents, the grade 'low' will be removed only from 'Mom'.

```
db.students.update( { name: 'Mom' }, { $pull: {grades: 'low'} } );
```

25 Jan 2014 » Coolness of IntelliJ

#### Shelve changes

If you are using svn, the changelist feature can provide some utility, but it is nothing compared to `git stash`. If you are using IntelliJ and svn, but want a `stash`-like feature, you are in luck. IntelliJ has a feature under its VCS menu named 'Shelve changes'. Here is the link to the details: http://www.jetbrains.com/idea/webhelp/shelving-and-unshelving-changes.html

#### Smart joining of lines

Have you ever had a situation where you wanted to join a line of code with the line above? For example,

```
if (!Strings.isNullOrEmpty(reference.getUserName())
    && reference.getUserName().equalsIgnoreCase(userName)) {
```

Press `Ctrl Shift J` while your cursor is on the line where the merge will result.

11 Dec 2013 » MongoDB practices

Fact 1: There is way too much criticism of MongoDB in several blog posts.
Fact 2: Most of these criticisms have rebuttal posts.

I have used MongoDB in some small scale attempts and have found it satisfactory. And that is better than most other databases because I don't have to interact a lot with it. No schema changes. Not worrying much about adding some boilerplate code in the ORM (debatable). It always seems like the invisible force working behind the scenes. Isn't that the sole purpose of databases? Here is some documentation of good practices that could be easily followed to avoid the pitfalls of MongoDB:

1. TLDR: Set write concern to 1 or `SAFE`, i.e., receive an acknowledgement for each write so that failures are reported.
   Problem: With older drivers the default is 0, i.e., no acknowledgement is sent, so a failed write goes unnoticed. This makes the writes super fast, but in most applications a silently failed write is unacceptable.
   Detailed solution: All language drivers for MongoDB support this write concern setting. E.g., in Java here is the class [WriteConcern](http://api.mongodb.org/java/2.10.1/com/mongodb/WriteConcern.html). In Spring Data, this can be done while initializing the `MongoTemplate`.
2. TLDR: `db.runCommand ( { repairDatabase: 1 } )`
   Problem: MongoDB does not release to the OS the disk storage it used for a document, even after the document has been deleted. This is one of the reasons for the overarching issue of MongoDB consuming more space than actually required for the data it stores.
   Detailed solution: [MongoDB Docs](http://docs.mongodb.org/manual/reference/command/repairDatabase/#dbcmd.repairDatabase)

11 Nov 2013 » Git Notes

```
$ git clone http:/path-to-the-dot-git-file
$ git checkout -b feature_name
// Do work
$ git add -u // If new files need to be added, then git add
$ git commit -m "Commit message" // Here we are committing to our local repo.. not on a server
$ git checkout master
$ git merge feature_name // If you have not pulled changes on to your local master branch, this merge should be done without conflicts
$ git pull --rebase // Pulling from remote server and rebasing your local changes on top of the changes made by others..
// Possibility of conflicts here..
$ git rebase -i HEAD~10 // If you want to squash your multiple commits into one, then replace all the necessary words, "pick", with the letter s.
$ git push
```

#### Other useful git commands

* `$ git stash` : Stashes your local uncommitted changes, so that you can switch from your dirty branch to another clean one.
* `$ git log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset)%C(bold yellow)%d%C(reset)%n'' %C(white)%s%C(reset) %C(dim white)- %an%C(reset)' --all` : Should alias it to a shortcut. [Copied from..](http://stackoverflow.com/a/9074343)
* `$ git config --global alias.glog log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold cyan)%aD%C(reset) %C(bold green)(%ar)%C(reset)%C(bold yellow)%d%C(reset)%n'' %C(white)%s%C(reset) %C(dim white)- %an%C(reset)' --all` : Adding an alias for the above command.. This will add an entry in the alias section of the `.gitconfig` file.

13 Jul 2013 » Guava I: Table

The [Table interface](http://docs.guava-libraries.googlecode.com/git/javadoc/index.html?com/google/common/collect/Table.html) introduced in Guava is helpful in implementing tabular data, such as data to be written to a CSV. Think about it as a spreadsheet. All the data in a spreadsheet can be represented by 3 parameters: the row number, column number, and the actual value stored in the cell. Hence the `Table` interface has 3 generic parameters too.

```
/*
  | Name | GPA
0 | Bob  | 2.3
1 | Jim  | 3.4
2 | Tim  | 2.8
*/
Table<Integer, String, String> studentData = TreeBasedTable.create();
studentData.put(0, "Name", "Bob");
studentData.put(1, "Name", "Jim");
studentData.put(2, "Name", "Tim");
studentData.put(0, "GPA", "2.3");
studentData.put(1, "GPA", "3.4");
studentData.put(2, "GPA", "2.8");
```

#### Important instance methods:

1. `Map<R, V> column(C columnKey)`: Returns a map of Row->Value for the given column. For example, `studentData.column("Name")` in the above case would return a `Map` that looks like: `{ 0: "Bob", 1: "Jim", 2: "Tim" }`.
2. `Map<C, V> row(R rowKey)`: Returns a map of Column->Value for the given row. For example, `studentData.row(2)` would return a `Map` that looks like: `{ "Name": "Tim", "GPA": "2.8" }`

#### Implementations:

All the implementations of `Table` can be created through their `static Table create()` method, except for `ImmutableTable`. As the name suggests, this implementation builds an immutable object. Hence we need to build it using the provided `Builder`, i.e., `ImmutableTable.Builder`. Calling the `build()` instance method of the `Builder` will return an immutable `Table`.

10 Jul 2013 » Java Notes
Shift Operators

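Shifting an integer left by one bit doubles it; shifting right by one bit halves it. Shift operators work only on integral types, not on float or double.
```
short num = 0b0000_0100 << 1; // left operand is 4
// num is now 0b0000_1000, i.e., 8
```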
Logical operators

Order of precedence, highest first: & (AND), ^ (XOR), | (OR). XOR checks whether the operand bits or booleans differ: it returns 0 (false) for a match and 1 (true) for a mismatch.

Garbage Collection
```
Geocode g1 = new Geocode(51, 110); // g1 refers to the memory allocated to this Geocode object, say block 5123-5153
g1 = new Geocode(50, 109); // block 5123-5153 is no longer referenced and is eligible for GC
```
Primitive Type vs Objects
```
int a = 21;
int b = a; // Now the JVM has 2 blocks in memory, each containing the integer value 21
Geocode g1 = new Geocode(51, 110);
Geocode g2 = g1; // Now the JVM still has 1 block in memory for the Geocode(51, 110) object; g1 and g2 both refer to it
```
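If you want the memory assigned to an object to be garbage collected, remove every reference to it, for example by setting the variable that refers to it to null. The JVM then decides when the memory is actually reclaimed.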

Variable length arguments
```
private int max(int first, int... rest);
```
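is the same as the following from the method's point of view (rest is an int[] in both), although only the varargs form can be called as max(1, 2, 3):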
```
private int max(int first, int[] rest);
```
synchronized

The synchronized keyword can be used to wrap a block of code or in the signature of a static/instance method. Only one thread at a time can then execute the code guarded by the same lock, which makes it thread-safe. E.g., all methods within Hashtable are synchronized. Note: Use ConcurrentHashMap instead of Hashtable if you need thread-safety. Hashtable is slower because it locks the entire table of data for any read/write operation. ConcurrentHashMap, in contrast, uses finer-grained locking: up to Java 7 it splits the table into segments (16 by default), each with its own lock managing some of the hash buckets, and from Java 8 on it locks individual buckets.

Generic Methods

Generic methods allow type parameters to be used to express dependencies among the types of one or more arguments to a method and/or its return type. For example, the error in the first method below can be prevented by parameterizing it, i.e., making it generic, as in the second method.
```
static void fromArrayToCollection(Object[] a, Collection<?> c) {
    for (Object o : a) {
        c.add(o); // compile-time error: the element type of c is unknown
    }
}
static <T> void fromArrayToCollection(T[] a, Collection<T> c) {
    for (T o : a) {
        c.add(o); // Correct: the array and the collection share the type T
    }
}
```
If there is no dependency between the return type and/or the arguments of a method, then you are better off using wildcards instead of a generic method. Excellent example

Array vs ArrayList

An ArrayList can hold only objects, not primitives (primitives are autoboxed into wrappers such as Integer), whereas an array can hold either. The size of an array cannot grow dynamically; an ArrayList’s size can.

WTH is Stack and Heap memory
- Stack is the part of memory that holds local primitives and references to objects, whereas the actual objects are stored on the heap.
- When the stack is full, a StackOverflowError is thrown. This is highly unlikely in normal programs because each thread gets its own stack and every method call only adds one frame to it, which is removed when the method returns. But if a method calls itself recursively, every call adds another frame with its own primitives and references to the same stack, which ultimately runs out of space if the recursion is not bounded properly.
- When the heap is full, it undergoes garbage collection, i.e., all the objects that are no longer referenced are removed from memory. But if garbage collection is not enough, and the JVM has already expanded to its maximum heap capacity (set by the JVM argument -Xmx), an OutOfMemoryError will be thrown.

finalize()

Any class can override Object’s finalize() method to clean up resources. This method is called only by the garbage collector, once it determines that the object is unreachable and ready to be collected; there is no guarantee when, or even whether, that happens.

== vs equals() for enum

Both give the same result for enum constants, unlike for String. So it is better to use ==, which cannot throw a NullPointerException when the left operand is null.

Basics of session management

HttpSession generates a cookie named JSESSIONID on the client’s browser. The cookie holds only the identifier of the user’s session; data such as httpSession.setAttribute("userName", "Bob") is stored with the session on the server. The server maintains this session in memory (or on disk, as per your server’s policy) for its lifetime. The duration can be set by httpSession.setMaxInactiveInterval(n), where n is in seconds. If n <= 0, then the session is maintained forever by the server. The important thing to understand is that this persistence is on the server, not the client. By default, the JSESSIONID cookie is discarded as soon as the user closes the browser. The practice of storing the session forever on the server sounds bad, but in fact is even worse than bad. It’s horrible. The JSESSIONID itself has some risks (an attacker can steal the cookie), and remembering and honoring its value forever is dangerous.

So how do you let users into the secure area of your website without making them log in each time they close the browser? Here is a very nice article from 2006 that explains best practices:


Aniket Dahotre
Software Developer
github.com/dahotre
twitter.com/dahotre
