
Ugly, dirty, but it works

A couple of days ago my uncle pinged me on Facebook asking for a favor. He is in his final semester at university, studying electrical engineering (or whatnot), and his final paper was due in less than 24 hours. The software he wrote the paper in had suddenly started showing a dialog asking for a PIN code, locking him out of the document. Naturally he asked me to try “cracking” it since I’m “good with computers”.

As a web developer, this sort of job is completely out of my league. I remember reading tutorials on how to crack software using hex editors and stuff, but there was no way I could do it in 24 hours. So my first thought was to turn him down and tell him to buy the software (which is freaking expensive).

But then I started to think outside the box. The PIN the software asks for is only 6 characters long, contains only digits, and you can try as many times as you wish. Ideal for brute forcing: at most a million combinations. Also, my uncle wasn’t interested in the actual PIN; he just wanted to get to the document.

I remembered that in the early days of the company I work for we used a visual scraping tool called Kapow Robots. Kapow is overkill and too expensive for this job, but it made me search for open source programmable macro software. That’s how I found AutoHotKey, which is a very nice little tool with an incredibly ridiculous language (and great documentation). So I came up with this little snippet of code in about 15 minutes:

test := 100000 ; starting code

Loop 900000
{
 Click 90, 120 ; click in the input box
 Send %test% ; type in the value of test
 Click 250, 120 ; click the submit button
 Click 300, 250 ; click the OK button in the error message
 Click 90, 120 ; back to the input field
 Send {BS 6} ; delete whatever is in the input
 test++ ; next code, up to 999999
}

Yeah, it is crappy as hell. A couple of problems:

  • If the PIN starts with 0 it doesn’t work. It is easy to fix, but I couldn’t find the string function I needed and it was way past midnight, so I figured a 10% chance of missing was not too much to risk.
  • It doesn’t stop after it has found the right combo, so it will most likely leave a massive mess on the desktop.
  • It doesn’t log the PIN, but we don’t need it anyway.
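The leading-zero problem is only a padding issue. As a minimal sketch of the idea, written in PHP rather than AutoHotKey and with a made-up function name (pin_candidates), every 6-digit code including the ones starting with 0 could be produced like this:

```php
<?php
// Hypothetical helper: yields every PIN of the given length as a
// zero-padded string, e.g. "000000", "000001", ..., "999999".
function pin_candidates(int $digits = 6): Generator
{
    $limit = 10 ** $digits;
    for ($n = 0; $n < $limit; $n++) {
        yield str_pad((string) $n, $digits, "0", STR_PAD_LEFT);
    }
}

// Print the first three candidates only.
foreach (pin_candidates() as $i => $pin) {
    if ($i >= 3) {
        break;
    }
    print($pin . "\n");
}
```

A generator keeps memory flat, since the million candidates are never held in an array at the same time.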

Is it ugly and dirty? Hell Yes! Did it work? Oh yeah

3 clichés which are very true: think outside the box, use the right tool for the job and don’t try to reinvent the wheel when it is not necessary.


Scraping the sh*t out of the interwebz – Part #2

If you haven’t read the first part of my PHP curl tutorial, you should do so here. As in the previous one, we will use my open source curl.class.php, but this time we will do something more exciting than just scraping eBay. Today we will log into HackerNews.

When it comes to scraping a website we always aim to emulate what a browser does. There are a whole bunch of great tools for recording the requests your browser makes, but I think the Live HTTP Headers Firefox extension is outstanding. So go ahead and download it. A new menu item should appear in Firefox under the Tools menu. Click it and select the Generator tab, untick everything and tick “redirects” and “requests”. The extension is now set to record all the POST and GET requests plus the 3xx redirects your browser makes. Now go to HackerNews and log in as you normally would. The extension should record something similar to this:

[Screenshot: Live HTTP Headers capture of the HackerNews login requests]

Great. It all seems easy enough. The only interesting part of the login POST request is the fnid hash. Examining the login form shows that fnid is a hidden input field. No problem then, we can just grab it. Let me show you the full code, then I will explain it:

<?php 
set_time_limit(0); 
require_once("../curl.class.php"); 
 
$hn_user = ""; 
$hn_pass = ""; 
 
$curl = new Curl(); 
$curl->setSsl();
$curl->setCookieFile("cookie.txt");
 
$page = $curl->get("https://news.ycombinator.com/newslogin?whence=news");
 
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($page);
 
$fnid = false;
 
$inputs = $dom->getElementsByTagName('input');
for($i=0; $i<$inputs->length; $i++){
	if ($inputs->item($i)->getAttribute("name") == "fnid"){
		$fnid = $inputs->item($i)->getAttribute("value");
		break;
	}
}
 
if (!$fnid){
	print("can't find fnid\n");
	exit();
}
 
$data = array(	"fnid" => $fnid,
		"u" => $hn_user,
		"p" => $hn_pass);
 
$page = $curl->post("https://news.ycombinator.com/y", $data);
 
print($page);

So the first thing we notice is that HackerNews runs on https, which means we have to tell the curl class to handle the connection accordingly. Also, since most sites use cookies for logged-in users, it is always a good idea to prepare a cookie file:

$curl = new Curl();
$curl->setSsl();
$curl->setCookieFile("cookie.txt");
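The wrapper's internals aren't shown here, so treat the following as a rough sketch only: assuming the class does nothing exotic, these two calls probably boil down to a handful of plain curl options. The function name session_curl_options and the exact option set are assumptions, not the real contents of curl.class.php.

```php
<?php
// Hypothetical stand-in for setSsl() + setCookieFile(): returns plain
// curl options for an HTTPS session that keeps cookies between requests.
function session_curl_options(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow 3xx redirects
        CURLOPT_SSL_VERIFYPEER => true,        // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,           // and that it matches the host name
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here on close
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here
    ];
}

// Usage with the plain curl extension:
// $ch = curl_init("https://news.ycombinator.com/newslogin?whence=news");
// curl_setopt_array($ch, session_curl_options("cookie.txt"));
// $page = curl_exec($ch);
```

Pointing CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE at the same file is what lets the session cookie from the login request be sent along with later requests.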

Since last time I learned about a handy PHP function which makes DOMDocument less bitchy about invalid HTML, so I no longer need the @ sign when calling loadHTML:

libxml_use_internal_errors(true);
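With internal errors switched on, libxml buffers its complaints instead of raising PHP warnings, and the buffer can be inspected afterwards. A small sketch of that pattern follows; the function name parse_html_collect_errors and the sample markup are made up for illustration.

```php
<?php
// Parse sloppy HTML quietly and return the document together with the
// messages libxml collected along the way.
function parse_html_collect_errors(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors, no warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . " (line " . $error->line . ")";
    }

    libxml_clear_errors();                 // empty the buffer for the next parse
    libxml_use_internal_errors($previous); // restore the earlier setting
    return [$dom, $messages];
}

list($dom, $messages) = parse_html_collect_errors("<p>unclosed <b>tag<nosuchtag></p>");
print(count($messages) . " parser message(s)\n");
```

How many messages come back depends on the libxml version installed, so the count is worth logging rather than relying on.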

Then we download the login page and grab the fnid the same way we learned in the last tutorial:

$fnid = false;
 
$inputs = $dom->getElementsByTagName('input');
for($i=0; $i<$inputs->length; $i++){
	if ($inputs->item($i)->getAttribute("name") == "fnid"){
		$fnid = $inputs->item($i)->getAttribute("value");
		break;
	}
}
 
if (!$fnid){
	print("can't find fnid\n");
	exit();
}
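The same lookup can be wrapped in a small reusable helper. This is only a sketch: the function name find_input_value and the sample form are invented, and it returns null for a missing field so that a legitimate value such as "0" is not mistaken for "not found" (which the !$fnid check would do).

```php
<?php
// Hypothetical helper: returns the value of the first <input> with the
// given name, or null when the page has no such field.
function find_input_value(string $html, string $name): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('input') as $input) {
        if ($input->getAttribute('name') === $name) {
            return $input->getAttribute('value');
        }
    }
    return null;
}

// Made-up sample markup, only for illustration.
$sample = '<form method="post" action="/y">'
        . '<input type="hidden" name="fnid" value="abc123">'
        . '<input type="text" name="u"><input type="password" name="p">'
        . '</form>';

var_dump(find_input_value($sample, "fnid"));    // string(6) "abc123"
var_dump(find_input_value($sample, "missing")); // NULL
```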

Cool. Now the last step is to do the login POST request

$data = array(	"fnid" => $fnid,
		"u" => $hn_user,
		"p" => $hn_pass);
 
$page = $curl->post("https://news.ycombinator.com/y", $data);
 
print($page);

The printed $page should show your username, which means you have a valid session. It was easy enough, wasn’t it? You can use a similar script to get quick account info from a service which doesn’t have an API or to automate a regular daily task.

DISCLAIMER: please don’t use your new knowledge for evil. Respect other people’s work and ask for permission before you scrape content from their site.


Tools for productivity

These are not necessarily productivity tools, rather a list of software I use day-to-day to manage my workload. I prefer having as few windows open as possible, so I’m very picky when it comes to choosing my tools.

  1. Opera browser
    Hands down the best browser out there, period. I was very surprised when they announced the switch to the WebKit engine, but it is not a big deal. A browser is not just the rendering engine but a set of tools which make my life way easier. Opera comes with a bunch of great features like the superior download manager, built-in IRC client, grouped tabs, speed dial, RSS reader, built-in mail client and so on. My favorite must be Opera sync, which automatically syncs your history, bookmarks and passwords.
  2. Total Commander
    I grew up in the era of 2-pane file managers like Norton Commander and DOS Navigator, so when Total Commander became available on Windows (in the 90s, I guess) it was an instant get for me. With this tool I can access and manipulate files super quickly. It also features a great multi-session FTP client, which means one less window to have open. I love it so much that I even run it on my Mac using Wine.
  3. Eclipse
    I love Eclipse and I’m not ashamed to admit it. You can tell me all about how superior vim and emacs are, but for me nothing beats Eclipse, especially with the right plugins. I agree, it is a bit greedy for memory, but memory is cheap these days. When I have to edit a file quickly I use my secondary editor, Programmer’s Notepad.
  4. A notebook
    I can’t live without lists, so I tend to keep a physical todo list in my notebook. It just feels great when you can cross something off.
  5. Wunderlist
    Again, can’t live without lists. Wunderlist is a great, simple todo list app with support for multiple projects and users. They have native apps for pretty much everything, so I can check my list on my phone and my iPad, and it is always open on my desktop.
  6. MindMeister
    Mind maps are fun and a great way to plan ideas. I use MindMeister to plan out bigger projects.
  7. Asana
    We are using Asana at the company I work for. It is a brilliant, lightweight project management tool with a butt load of great features for bigger teams.
  8. Mention.net
    I came across this service last week. They claim to be a better alternative to Google Alerts. So far it is very promising. It is an amazing way to keep track of mentions of my projects.
  9. Streak
    I deal with a bunch of customer emails lately, and Streak is a great way to get your customers organized right in your Gmail inbox. It basically creates a new tab in your inbox where you can manage your customers, place them into boxes related to your sales / support flow and assign custom attributes to them. It also comes with a bunch of tiny hacks like my favorite one: “scheduled emails”. The only downside is that it currently only works with Chrome.

Scraping the sh*t out of the interwebz – Part #1

PHP’s curl library is by far my favorite tool. Combined with DOMDocument it provides a powerful API to gather content from remote sites or manipulate remote forms programmatically. In the upcoming couple of articles / tutorials I will show you how to use it to scrape website content, log in to remote interfaces, use it as a gateway for dealing with APIs and, at the end, how to do cool (and dodgy) stuff like breaking captchas.

In my examples I will use my own wrapper class (curl.class.php), which I have been gradually building over the past couple of years, expanding it with additional features as I needed them.

Ok, so the first example will be a pretty basic one: the goal is to get listings from eBay based on a keyword passed in to our script.

The whole script will look something like this. I will explain it line by line below:

<?php
 
if (isset($_GET['keyword']) && $_GET['keyword']){
	$keyword = $_GET['keyword'];
} elseif (isset($argv[1]) && $argv[1]){
	$keyword = $argv[1];
} else {
	die("usage: php {$argv[0]} [keyword]\n");
}
 
require_once("../curl.class.php");
 
$curl = new Curl();
$page = $curl->get("http://www.ebay.com/sch/i.html?_trksid=p2050601.m570.l1313&_nkw=".urlencode($keyword)."&_sacat=0&_from=R40");
 
if ($page && $curl->getHttpCode()>=200 && $curl->getHttpCode()<400){
 
	$dom = new DOMDocument();
	@$dom->loadHTML($page);
 
	$tables = $dom->getElementsByTagName('table');
	for($i=0;$i<$tables->length;$i++){
 
		if ($tables->item($i)->getAttribute("itemtype")!="http://schema.org/Offer"){
			continue;
		}
 
		$h4s = $tables->item($i)->getElementsByTagName('h4');
		if (!$h4s->length){
			continue;
		}
 
		$links = $h4s->item(0)->getElementsByTagName('a');
		if (!$links->length){
			continue;
		}
 
		$item_title = $links->item(0)->textContent;
		$item_url = $links->item(0)->getAttribute("href");
 
		print($item_title."\t".$item_url."\n");
 
	}
 
 
} else {
	print("unexpected error occurred\n");
}

Basic enough, right? The top couple of lines don’t need much explanation, I guess. They check whether a “keyword” parameter was passed in via the GET request or via the command line. If not, the script exits with a usage message.

The first thing we need to do when scraping a site is to check the URL structure. If you go to ebay.com and search for something, you will see that the keyword appears in the URL. Therefore:

$curl = new Curl();
$page = $curl->get("http://www.ebay.com/sch/i.html?_trksid=p2050601.m570.l1313&_nkw=".urlencode($keyword)."&_sacat=0&_from=R40");

This will download the page and store the HTML in the $page variable. It is unlikely but possible that ebay.com is down, so it is better to check the response code. We will accept any 2xx or 3xx response code (my curl class handles redirects by default):

if ($page && $curl->getHttpCode()>=200 && $curl->getHttpCode()<400){

Neat, we have the HTML, let’s initialize the DOM parser:

	$dom = new DOMDocument();
	@$dom->loadHTML($page);

Notice the @ sign in front of the loadHTML method call. I know it’s ugly and I should really use my own error handler, etc., but unfortunately I’m lazy and DOMDocument is like a teenage girl on Valentine’s Day: she wants everything to be perfect and exactly the way she imagined, otherwise she starts to complain. Anyway, now it’s time to have a look at eBay’s HTML structure via Firebug.

[Screenshot: eBay search result markup inspected in Firebug]

Good, it will be easy. It looks like each result is in its own table with itemtype="http://schema.org/Offer" (god bless microdata). The title and the URL can be gathered from the text content and href attribute of the first link in the one and only H4 tag. Some people prefer using XPath to get to the target element directly, but I found drilling down the DOM more reliable:

	$tables = $dom->getElementsByTagName('table');
	for($i=0;$i<$tables->length;$i++){
 
		if ($tables->item($i)->getAttribute("itemtype")!="http://schema.org/Offer"){
			continue;
		}
 
		$h4s = $tables->item($i)->getElementsByTagName('h4');
		if (!$h4s->length){
			continue;
		}
 
		$links = $h4s->item(0)->getElementsByTagName('a');
		if (!$links->length){
			continue;
		}
 
		$item_title = $links->item(0)->textContent;
		$item_url = $links->item(0)->getAttribute("href");
 
		print($item_title."\t".$item_url."\n");
 
	}
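For comparison, here is roughly what the XPath route mentioned above could look like. It is a sketch under assumptions: the function name extract_offers and the sample markup are invented, and the markup only imitates the structure described here, not eBay's real pages.

```php
<?php
// Returns [title, url] pairs for every result table in the given HTML.
function extract_offers(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $offers = [];
    foreach ($xpath->query('//table[@itemtype="http://schema.org/Offer"]') as $table) {
        // First <a> inside the first <h4> of this table, if there is one.
        $link = $xpath->query('(.//h4)[1]//a', $table)->item(0);
        if ($link === null) {
            continue;
        }
        $offers[] = [trim($link->textContent), $link->getAttribute('href')];
    }
    return $offers;
}

// Made-up sample markup, only for illustration.
$sample = '<table itemtype="http://schema.org/Offer"><tr><td>'
        . '<h4><a href="http://example.com/1">First item</a></h4></td></tr></table>'
        . '<table><tr><td><h4><a href="http://example.com/x">Not an offer</a></h4></td></tr></table>'
        . '<table itemtype="http://schema.org/Offer"><tr><td>no heading</td></tr></table>';

foreach (extract_offers($sample) as $offer) {
    print($offer[0] . "\t" . $offer[1] . "\n");
}
```

The trade-off is the usual one: the query is shorter, but when the markup changes it fails silently with an empty result, whereas the drill-down loop shows exactly which step stopped matching.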

And we are done. It is a basic example, but as you can tell it is a really powerful way to gather data from remote sites. What can you use it for? Well, pretty much anything. Add stock prices to your site by scraping Google Finance, add a customized eBay affiliate widget to your sidebar based on your keywords; the sky is the limit.

Facebook Hacker Cup 2013: My solutions

Task 1: Beautiful strings

When John was a little kid he didn’t have much to do. There was no internet, no Facebook, and no programs to hack on. So he did the only thing he could… he evaluated the beauty of strings in a quest to discover the most beautiful string in the world.

Given a string s, little Johnny defined the beauty of the string as the sum of the beauty of the letters in it.

The beauty of each letter is an integer between 1 and 26, inclusive, and no two letters have the same beauty. Johnny doesn’t care about whether letters are uppercase or lowercase, so that doesn’t affect the beauty of a letter. (Uppercase ‘F’ is exactly as beautiful as lowercase ‘f’, for example.)

You’re a student writing a report on the youth of this famous hacker. You found the string that Johnny considered most beautiful. What is the maximum possible beauty of this string?

Input
The input file consists of a single integer m followed by m lines.

Output
Your output should consist of, for each test case, a line containing the string “Case #x: y” where x is the case number (with 1 being the first case in the input file, 2 being the second, etc.) and y is the maximum beauty for that test case.

Constraints
5 ≤ m ≤ 50
2 ≤ length of s ≤ 500

My solution for this problem was fairly straightforward. Not the fastest solution, I guess, but it works: take the string, count how often each letter occurs, then hand out the “beauty” values from 26 downwards, starting with the most frequent letter, and add everything up.

<?php
 
$content = file_get_contents("input.txt");
$content = preg_replace("/\r\n/i","\n",$content);
$rows = explode("\n",$content);
$row_count = $rows[0];
 
if (!is_numeric($row_count)){
	print("invalid input\n");
	exit();
}
 
$handle = fopen("output.txt","w+");
 
for($case=1;$case<=$row_count;$case++){
	$row = $rows[$case];
	$row = strtolower(preg_replace("/[^a-zA-Z]/i","",$row));
 
	if (!strlen($row)){
		print("Case #$case: 0\n");
		fwrite($handle,"Case #$case: 0\n");
		continue;
	}
 
	$letters = array();
	for($j=0;$j<strlen($row);$j++){
		if (!isset($letters[$row[$j]])){
			$letters[$row[$j]] = 1;
		} else {
			$letters[$row[$j]]++;
		}
	}
 
	$total_beauty = 0;
	$current_beauty = 26;
	arsort($letters);
	foreach($letters as $letter => $count){
		$total_beauty+=$count*$current_beauty;
		$current_beauty--;
	}
 
	print("Case #$case: $total_beauty\n");
	fwrite($handle,"Case #$case: $total_beauty\n");
}
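The core of the solution, stripped of the file handling, fits in one small function. This is a compact sketch of the same greedy idea; the name max_beauty is made up, and the sample string "ABbCcc" is picked for illustration.

```php
<?php
// Maximum beauty of a string: the most frequent letter gets 26,
// the next most frequent 25, and so on.
function max_beauty(string $s): int
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($s));
    if ($letters === '') {
        return 0;
    }

    $counts = array_count_values(str_split($letters));
    rsort($counts); // highest frequency first; the letters themselves are not needed

    $total = 0;
    foreach ($counts as $rank => $count) {
        $total += $count * (26 - $rank);
    }
    return $total;
}

print(max_beauty("ABbCcc") . "\n"); // 152 = 3*26 + 2*25 + 1*24
```

The greedy assignment is safe because swapping the values of two letters can never help when the more frequent one already holds the larger value.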

Task 2: Balanced Smileys

Your friend John uses a lot of emoticons when you talk to him on Messenger. In addition to being a person who likes to express himself through emoticons, he hates unbalanced parenthesis so much that it makes him go :(

Sometimes he puts emoticons within parentheses, and you find it hard to tell if a parenthesis really is a parenthesis or part of an emoticon.

A message has balanced parentheses if it consists of one of the following:
- An empty string “”
- One or more of the following characters: ‘a’ to ‘z’, ‘ ‘ (a space) or ‘:’ (a colon)
- An open parenthesis ‘(‘, followed by a message with balanced parentheses, followed by a close parenthesis ‘)’.
- A message with balanced parentheses followed by another message with balanced parentheses.
- A smiley face “:)” or a frowny face “:(”

Write a program that determines if there is a way to interpret his message while leaving the parentheses balanced.

Input
The first line of the input contains a number T (1 ≤ T ≤ 50), the number of test cases.
The following T lines each contain a message of length s that you got from John.

Output
For each of the test cases numbered in order from 1 to T, output “Case #i: ” followed by a string stating whether or not it is possible that the message had balanced parentheses. If it is, the string should be “YES”, else it should be “NO” (all quotes for clarity only)

Constraints
1 ≤ length of s ≤ 100

I spent quite a lot of time on this one, about 40 minutes. I ended up with a recursive function which relies on a regular expression: “/(\(.*)(:\))?(:\()?(\))/U”. This regex matches the parts of the string that are in brackets, even if those parts contain a smiley. Each match is checked recursively, and once nothing matches any more it is just a loop counting the opening and closing parentheses.

<?php
 
$content = file_get_contents("input.txt");
$content = preg_replace("/\r\n/i","\n",$content);
$rows = explode("\n",$content);
$row_count = $rows[0];
 
if (!is_numeric($row_count)){
	print("invalid input\n");
	exit();
}
 
function replace_once($haystack, $needle, $replacement){
	$pos = strpos($haystack, $needle);
	if ($pos !== false){
		return substr_replace($haystack, $replacement, $pos, strlen($needle));
	} else {
		return $haystack;
	}
}
 
function is_balanced($string){
 
	preg_match_all("/(\(.*)(:\))?(:\()?(\))/U", $string, $matches);
 
	if (count($matches) && count($matches[0])){
		$remaining = $string;
		foreach($matches[0] as $key => $val){
			$balanced = true;
 
			$remaining = replace_once($remaining, $val, "");
 
			$val = substr($val, 1, strlen($val)-2);
			$balanced = is_balanced($val);
			if (!$balanced){
				return false;
			}
		}
 
		return is_balanced($remaining);
	} else {
 
		$string = str_replace(array(":)",":("),array("",""), $string);
		$open = 0;
		for($i=0;$i<strlen($string);$i++){
			if ($string[$i] == ")"){
				if ($open > 0){
					$open--;
				} else {
					return false;
				}
			}
 
			if ($string[$i] == "("){
				$open++;
			}
		}
 
		if ($open == 0){
			return true;
		} else {
			return false;
		}
	}
 
}
 
$handle = fopen("output.txt","w+");
 
for($case=1;$case<=$row_count;$case++){
	$row = $rows[$case];
 
	if (is_balanced($row)){
		print("Case #$case: YES\n");
		fwrite($handle,"Case #$case: YES\n");
	} else {
		print("Case #$case: NO\n");
		fwrite($handle,"Case #$case: NO\n");
	}
}
 
fclose($handle);
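For comparison, there is a well-known alternative that avoids both the regex and the recursion. It is not the approach used above: it scans the message once and tracks the smallest and largest number of parentheses that could still be open, treating every "(" or ")" that follows a colon as ambiguous. The function name can_be_balanced is made up, and the sketch assumes the input only contains the characters allowed by the task.

```php
<?php
// Single pass: $min and $max bound the number of parentheses that
// could be open, depending on how the ambiguous ones are read.
function can_be_balanced(string $message): bool
{
    $min = 0;
    $max = 0;
    $length = strlen($message);

    for ($i = 0; $i < $length; $i++) {
        $char = $message[$i];
        $afterColon = $i > 0 && $message[$i - 1] === ':';

        if ($char === '(') {
            $max++;
            if (!$afterColon) {
                $min++; // cannot be a frowny, so it is a real opener
            }
        } elseif ($char === ')') {
            $min = max(0, $min - 1);
            if (!$afterColon) {
                $max--; // cannot be a smiley, so it is a real closer
                if ($max < 0) {
                    return false; // nothing left for it to close
                }
            }
        }
    }

    return $min === 0; // some reading closes every opener
}

var_dump(can_be_balanced("i am sick today (:()")); // bool(true)
var_dump(can_be_balanced(":(("));                  // bool(false)
```

Every count between $min and $max is reachable by some reading of the smileys, which is why checking the two bounds is enough.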

Task 3: Find the min

After sending smileys, John decided to play with arrays. Did you know that hackers enjoy playing with arrays? John has a zero-based index array, m, which contains n non-negative integers. However, only the first k values of the array are known to him, and he wants to figure out the rest.

John knows the following: for each index i, where k <= i < n, m[i] is the minimum non-negative integer which is *not* contained in the previous *k* values of m.

For example, if k = 3, n = 4 and the known values of m are [2, 3, 0], he can figure out that m[3] = 1.

John is very busy making the world more open and connected, as such, he doesn't have time to figure out the rest of the array. It is your task to help him.

Given the first k values of m, calculate the nth value of this array. (i.e. m[n - 1]).

Because the values of n and k can be very large, we use a pseudo-random number generator to calculate the first k values of m. Given positive integers a, b, c and r, the known values of m can be calculated as follows:
m[0] = a
m[i] = (b * m[i - 1] + c) % r, 0 < i < k

Input
The first line contains an integer T (T <= 20), the number of test cases.
This is followed by T test cases, consisting of 2 lines each.
The first line of each test case contains 2 space separated integers, n, k (1 <= k <= 10^5, k < n <= 10^9).
The second line of each test case contains 4 space separated integers a, b, c, r (0 <= a, b, c <= 10^9, 1 <= r <= 10^9).

Output
For each test case, output a single line containing the case number and the nth element of m.

Well, I failed this task. Not because my solution didn’t work, but because it wasn’t fast enough in some cases. I considered the seed values (it is not my first time at the Hacker Cup) and tested with large arrays, but I missed a case where my script fails. The key speed-up came when I realized that the analyzed array slices repeat. As far as I could tell they repeat after every Kth element, therefore:

if ($n > $k * 2){
	$n = $k + ($n % $k) -1;
}

It gave me a huge speed-up, but apparently not enough. I tried to run my script with the test cases provided by Facebook’s system and it didn’t finish within the 6-minute limit, even on the high-compute Amazon EC2 instances. Epic fail. Well, next time. Anyway, my solution for task #3:

<?php
 
set_time_limit(0);
ini_set("memory_limit","2048M");
 
$content = file_get_contents("input.txt");
$content = preg_replace("/\r\n/i","\n",$content);
$rows = explode("\n",$content);
$task_count = $rows[0];
 
if (!is_numeric($task_count)){
	print("invalid input\n");
	exit();
}
 
 
for($case=0; $case < $task_count; $case++){
 
	$m = array();
 
	$tmp = explode(" ",$rows[1+$case*2]);
 
	$n = $tmp[0]; // count($m);
	$k = $tmp[1]; // known length
 
	$tmp = explode(" ",$rows[2+$case*2]);
 
	$a = $tmp[0];
	$b = $tmp[1];
	$c = $tmp[2];
	$r = $tmp[3];
 
	$m[0] = $a;
 
	for ($i=1; $i < $k; $i++){
		$m[$i] = ($b * $m[$i-1] + $c) % $r;
	}
 
	// reduce n
	if ($n > $k * 2){
		$n = $k + ($n % $k) -1;
	}
 
	print($k."\t".$n."\n");
 
	for($i=$k; $i<$n; $i++){
 
		$current_slice = array_slice($m, $i-$k, $k);
 
		asort($current_slice);
		$min = -1;
		while(true){
 
			if (empty($current_slice)){
				$min = $min+1;
				break;
			}
 
			$test = array_shift($current_slice);
 
			if ($test > $min+1){
				$min = $min+1;
				break;
			} else {
				$min = $test;
			}
		}
 
		$m[$i] = $min;
	}
 
	print("Case #".($case+1).": ".$min."\n");
 
 
}
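With hindsight, here is a sketch of a faster approach. It is a different technique from the script above and rests on two observations, offered as reasoning rather than as the official solution. First, every computed value is at most k, and any k + 1 consecutive computed values are pairwise distinct, so m[k..2k] is a permutation of 0..k and the sequence repeats with period k + 1 from index k on (not k). Second, the smallest missing value can be maintained with a counter array and a min-heap instead of sorting a slice on every step. The function names generate_known and find_nth are made up.

```php
<?php
// m[0..k-1] from the generator in the problem statement.
// Assumes 64-bit PHP integers, since b * m[i-1] can reach about 10^18.
function generate_known(int $a, int $b, int $c, int $r, int $k): array
{
    $m = [$a];
    for ($i = 1; $i < $k; $i++) {
        $m[$i] = ($b * $m[$i - 1] + $c) % $r;
    }
    return $m;
}

// Returns m[n - 1] given the first k known values (requires n > k).
function find_nth(array $m, int $n): int
{
    $k = count($m);
    // From index k on, the values repeat with period k + 1.
    $target = $k + (($n - 1 - $k) % ($k + 1));

    // How often each value <= k occurs in the current window of k values.
    $counts = array_fill(0, $k + 1, 0);
    foreach ($m as $value) {
        if ($value <= $k) {
            $counts[$value]++;
        }
    }

    // Min-heap of the values 0..k that are missing from the window.
    $missing = new SplMinHeap();
    for ($v = 0; $v <= $k; $v++) {
        if ($counts[$v] === 0) {
            $missing->insert($v);
        }
    }

    for ($i = $k; $i <= $target; $i++) {
        $m[$i] = $missing->extract(); // smallest value not in the window
        $counts[$m[$i]]++;

        $leaving = $m[$i - $k];       // slides out of the window
        if ($leaving <= $k && --$counts[$leaving] === 0) {
            $missing->insert($leaving);
        }
    }

    return $m[$target];
}

print(find_nth([2, 3, 0], 4) . "\n"); // 1, the example from the task
```

Each step costs one heap operation, so the whole run is about O(k log k) regardless of how large n is.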

PHP header(“location”); vulnerability

Every piece of documentation and every code sample you can find about PHP’s header("Location: ...") call recommends using die(); right after the statement, but I never realized why until a couple of days ago. Most of the time I use die(); or exit(); after redirect statements, but for some reason I forgot to do so in one of my scripts. I learned the hard way why it is important: someone gained partial access to my site’s admin area. It turns out that a client can simply ignore the redirect (you can even turn off redirects in your browser), and in that case the rest of the script executes without problems and its output is sent back. Rookie mistake, I know, but I thought it was worth sharing.
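A minimal sketch of the safe pattern, with everything site-specific invented for illustration (the is_admin session key, the /login.php target and both function names are assumptions):

```php
<?php
// Hypothetical check: where should this visitor be sent, if anywhere?
function login_redirect_target(array $session): ?string
{
    return empty($session['is_admin']) ? '/login.php' : null;
}

// Redirect and stop. Without the exit, everything after this call would
// still run for a client that ignores the Location header.
function require_admin(array $session): void
{
    $target = login_redirect_target($session);
    if ($target !== null) {
        header("Location: " . $target);
        exit;
    }
}

// Usage at the top of an admin page:
// session_start();
// require_admin($_SESSION);
// ... admin-only code below ...
```

Putting the header() call and the exit into one function means the exit cannot be forgotten on an individual page.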


Some cool and little known PHP functions / features

  1. register_shutdown_function
    Recently I was building a PHP-based load balancer for handling long-running processes (video streams, sometimes up to 1 hour). My biggest problem was that the load balancer had to react when the user closed the page. This handy function is the best solution I found for the problem. It calls the callback function when the script’s execution finishes, which includes the case where the script is aborted because the user closed the browser window.
  2. scandir
    If you are tired of using opendir, readdir and closedir, this function is for you
  3. glob
    Even better than scandir. It will only list the files matching the pattern passed in as an argument
  4. CURLOPT_PROGRESSFUNCTION
    This is a fairly new feature of the curl library. You can specify a callback function which will be called repeatedly while data is coming back from the remote host. I usually use it to create progress bars for long-running curl processes.
  5. escapeshellcmd
    I deal with loads of user-generated files in many different languages. As you may know, the golden rule is “Users are idiots”, so they put spaces, quotes and many, many random characters in file names. escapeshellcmd is a really handy function for escaping a command before you call exec; for a single argument such as a file name, escapeshellarg is the matching function.
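A few of these in action, as a small self-contained sketch. The directory and file names are made up, and the ls command is only built as a string, never executed.

```php
<?php
// Work in a throwaway directory so the example is self-contained.
$dir = sys_get_temp_dir() . "/php_demo_" . uniqid();
mkdir($dir);
touch($dir . "/report.txt");
touch($dir . "/my holiday video.avi");

// Clean up when the script ends, however it ends.
register_shutdown_function(function () use ($dir) {
    array_map('unlink', glob($dir . "/*"));
    rmdir($dir);
});

// scandir returns every entry, including "." and "..".
$all = scandir($dir);

// glob returns only the paths matching the pattern.
$texts = glob($dir . "/*.txt");

// escapeshellarg quotes one argument, so spaces and quotes in a file
// name cannot break the command line.
$command = "ls -l " . escapeshellarg($dir . "/my holiday video.avi");

print(count($all) . " entries, " . count($texts) . " text file(s)\n");
print($command . "\n");
```

The exact quoting escapeshellarg produces differs between Unix and Windows, so the printed command is best treated as platform dependent.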