Finding missing numbers from billion sequential number list file

missing numbers in billion records

You are given a billion numbers but some are missing

This is interesting algorithm question. With my first attempt I have tried it with o(n**2) which needs improvement of course.

Do you know how to find complementary numbers in given array?

The question in detail

You are given a randomly listed billion numbers, from 1 to billion that is. And there are a couple of numbers missing from this list.

The task is to find those missing numbers from a billion list, and you are given a very limited memory resource – you can assume storage is not an issue.

How I proceed with this solution


What I tried to accomplish with this phase 1 solution is to use as minimal memory as possible but to use as maximum storage as possible.

First I run through the the numbers and sort them on another file. Here the assumption is, there is no storage problem.

Sorting in this case is file based. It would be appending the new number at the end, then it would swap that number up the list with its immediate number until it is no more the smallest number.

Then run through the sorted list and check if there is a gap. Like if the list goes 1,2,4,5,6,10 then the missing numbers are 3, 7,8,9.. just incrementally look for those..

here is the solution

Look at it and you can suggest the area of improvement or a total better algorithm that is at least nlogn.


package algorithm;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * 
 * Assumption: the numbers are read from file and they are sequential
 * 
 * @author http://gullele.com
 *
 */
public class MissingNumbers {

	//source file
	private static String FILE_PATH = "/tmp/random-thousands.txt";
	//final sorted file
	private static String FILE_SORTED_PATH = "/tmp/sorted-thousands.txt";
	
	public static void main(String str[]) {
		PrintWriter writer = null;
		try {
			MissingNumbers missingNumbers = new MissingNumbers();
			writer = new PrintWriter(FILE_SORTED_PATH);
			writer.write("");
			writer.close();
			writer = new PrintWriter(FILE_PATH);
			writer.write("");
			missingNumbers.createRecords();
			missingNumbers.createSortedFileList();
			List numbersLost = missingNumbers.findMissing();
			
			for (Integer numbers:numbersLost) {
				System.out.println(numbers);
			}
		} catch (Exception ex) {
			System.out.println(ex.getMessage());
		}
		writer.close();
	}
	
	/**
	 * find the missing numbers from the billion numbers.
	 * 
	 * The file would be provided as random populated file. First the file would be 
	 * sorted on the file basis and then it would be read from top to bottom and checked for
	 * missing number.
	 * @return
	 */
	public void createSortedFileList() {
		try {
			
			FileReader randomReader = new FileReader(FILE_PATH);
			BufferedReader bufferedRandomReader = new BufferedReader(randomReader);
			RandomAccessFile sortedAccess = new RandomAccessFile(FILE_SORTED_PATH, "rw");
			
			String currentRandom = null;
			String currentSorted = null;
			Long lastPointer = null;
			
			while ((currentRandom = bufferedRandomReader.readLine()) != null) {
				int currentSortedInt = 0;	
				
				//set the file pointer
				lastPointer = sortedAccess.length();
				sortedAccess.seek(lastPointer);
				/*
				 * Write the numbers in such a way that the length of the written numbers
				 * is in equal size to make the calculation much easier since we are readin
				 * by size of the numbers.
				 */
				sortedAccess.writeBytes(String.format("%04d\n", Integer.parseInt(currentRandom))); //append it at the end

				lastPointer = sortedAccess.length()-("0000\n".length());
				sortedAccess.seek(lastPointer);
				while ((currentSorted = sortedAccess.readLine()) != null && lastPointer >= 5) {
					currentSortedInt = Integer.parseInt(currentSorted);
					
					lastPointer -= "0000\n".length();
					sortedAccess.seek(lastPointer);
					String rawUpper = sortedAccess.readLine();
					int upperSorted = Integer.parseInt(rawUpper);

					if (currentSortedInt < upperSorted) {
						sortedAccess.seek(lastPointer);
						sortedAccess.writeBytes(currentSorted+"\n");
						lastPointer += "0000\n".length();
						sortedAccess.seek(lastPointer);
						sortedAccess.writeBytes(rawUpper+"\n");
						lastPointer -= "0000\n".length();
						sortedAccess.seek(lastPointer);
					} else {
						break;
					}
				}
			}
			
			//release the resource
			bufferedRandomReader.close();
			sortedAccess.close();
			
		} catch (FileNotFoundException foe) {
			System.out.println(foe.getMessage());
		} catch (IOException io) {
			System.out.println(io.getMessage());
		} catch (Exception ex) {
			System.out.println(ex.getMessage());
		}
		
	}
	
	/**
	 * At this level there is a file on FILE_SORTED_PATH which is sorted in ascending. 
	 * go through the file and search for the missing ones. The missing ones can be
	 * found when the continuity is broken 
	 * @return
	 */
	public List findMissing() {
		
		List missingNumbers = null;
		try {
			FileReader fileReader = new FileReader(FILE_SORTED_PATH);
			BufferedReader bufferedReader = new BufferedReader(fileReader);
			missingNumbers = new ArrayList();
			
			String line = null;
			int prevNumber = Integer.parseInt(bufferedReader.readLine());
			
			while ((line = bufferedReader.readLine()) != null) {
				int currentNumber = Integer.parseInt(line);
				if (currentNumber - prevNumber != 1) {
					missingNumbers.add(currentNumber-1);//the missing number is less by 1
				}
				prevNumber = currentNumber;
			}
			fileReader.close();
		} catch (IOException ioe) {
			//log it
		}
		
		return missingNumbers;
	}
	
	/**
	 * Creates 1000 that are random and put those in /tmp/random-thousands.txt
	 */
	public void createRecords() {
		
		//get 20 random numbers that would be considered as missing
		Map<Integer, Integer> randomHolder = new HashMap<Integer, Integer>();
		Random random = new Random();
		while (randomHolder.size() <= 20) {
			int rand = random.nextInt(1000)+1;
			if (!randomHolder.containsKey(rand)) {
				randomHolder.put(rand, rand);
			}
		}
		
		//now we have random numbers get all the numbers
		int[] temp = new int[980];//make it to hold 980 elements, assume 20 is missing
		//fill the numbers
		int size = 1;
		int index = 0;
		while (index < 980) {
			if (!randomHolder.containsKey(size)) {
				temp[index++] = size;
			}
			size++;
		}
		
		//finally shuffle the numbers.
		for (int i=0; i

See how you would solve these known algorithm problems

Check if there are three numbers a, b, c giving a total T from array A

Changing decimal number to its binary equivalent

Array reversal in Recurrsion

Kadane’s algorithm in C – Dynamic Programming

Find the first occurence of number in the sorted array

Find longest palindrom from sequence of characters

binary tree problems with solution

Implementing tokenizer and adding tokens to the linked list

Java solution for checking anagram strings – tell if phrases are anagrams

testing k complementary pairs algorithm with junit

Leave a Reply

Your email address will not be published. Required fields are marked *

*
*