create FilesDataset class
akondas committed Jul 16, 2016
1 parent 9f140d5 commit e0b560f
Showing 55 changed files with 605 additions and 8 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -4,8 +4,9 @@ CHANGELOG
This changelog references the relevant changes done in PHP-ML library.

* 0.2.0 (in plan)
* feature [Dataset] - FileDataset - load dataset from files (folders as targets)
* feature [Dataset] - FilesDataset - load dataset from files (folder names as targets)
* feature [Metric] - ClassificationReport - report about trained classifier
* bug [Feature Extraction] - fix problem with token count vectorizer array order

* 0.1.1 (2016-07-12)
* feature [Cross Validation] Stratified Random Split - equal distribution for targets in split
5 changes: 0 additions & 5 deletions src/Phpml/Dataset/CsvDataset.php
@@ -8,11 +8,6 @@

 class CsvDataset extends ArrayDataset
 {
-    /**
-     * @var string
-     */
-    protected $filepath;
-
     /**
      * @param string $filepath
      * @param int $features
47 changes: 47 additions & 0 deletions src/Phpml/Dataset/FilesDataset.php
@@ -0,0 +1,47 @@
<?php
declare(strict_types = 1);

namespace Phpml\Dataset;

use Phpml\Exception\DatasetException;

class FilesDataset extends ArrayDataset
{
    /**
     * @param string $rootPath
     *
     * @throws DatasetException
     */
    public function __construct(string $rootPath)
    {
        if (!is_dir($rootPath)) {
            throw DatasetException::missingFolder($rootPath);
        }

        $this->scanRootPath($rootPath);
    }

    /**
     * @param string $rootPath
     */
    private function scanRootPath(string $rootPath)
    {
        foreach(glob($rootPath . DIRECTORY_SEPARATOR . '*', GLOB_ONLYDIR) as $dir) {
            $this->scanDir($dir);
        }
    }

    /**
     * @param string $dir
     */
    private function scanDir(string $dir)
    {
        $target = basename($dir);

        foreach(array_filter(glob($dir. DIRECTORY_SEPARATOR . '*'), 'is_file') as $file) {
            $this->samples[] = [file_get_contents($file)];
            $this->targets[] = $target;
        }
    }

}
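
Reviewer sketch (not part of the commit): the folder-per-target loading above can be tried without the library. The helper `loadFolderDataset` and the temporary corpus are hypothetical names made up for this illustration; the helper mirrors `scanRootPath()` and `scanDir()` but returns the arrays instead of filling `$this->samples` and `$this->targets`. It also falls back to an empty list when `glob()` returns `false`, which the committed code does not do.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical standalone helper mirroring FilesDataset's scanning logic.
 * Returns [samples, targets]; each sample is a one-element array with the file contents.
 */
function loadFolderDataset(string $rootPath): array
{
    if (!is_dir($rootPath)) {
        throw new InvalidArgumentException(sprintf('Dataset root folder "%s" missing.', $rootPath));
    }

    $samples = [];
    $targets = [];

    // glob() can return false on error, so fall back to an empty list.
    $dirs = glob($rootPath . DIRECTORY_SEPARATOR . '*', GLOB_ONLYDIR) ?: [];
    foreach ($dirs as $dir) {
        $target = basename($dir); // the folder name becomes the target label
        $files = array_filter(glob($dir . DIRECTORY_SEPARATOR . '*') ?: [], 'is_file');
        foreach ($files as $file) {
            $samples[] = [file_get_contents($file)];
            $targets[] = $target;
        }
    }

    return [$samples, $targets];
}

// Build a tiny throwaway corpus: two class folders, three files.
$root = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'files_dataset_' . uniqid();
mkdir($root . '/sport', 0777, true);
mkdir($root . '/tech', 0777, true);
file_put_contents($root . '/sport/001.txt', 'match report');
file_put_contents($root . '/sport/002.txt', 'transfer news');
file_put_contents($root . '/tech/001.txt', 'gadget review');

list($samples, $targets) = loadFolderDataset($root);

// glob() sorts its results by default, so folders and files arrive in alphabetical order.
print_r($targets);

// Remove the throwaway corpus again.
array_map('unlink', glob($root . '/*/*') ?: []);
array_map('rmdir', glob($root . '/*') ?: []);
rmdir($root);
```

Each sample is wrapped in a one-element array, matching `$this->samples[] = [file_get_contents($file)]` in the class, so the text sits at index 0 of every sample row.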
12 changes: 10 additions & 2 deletions src/Phpml/Exception/DatasetException.php
@@ -11,11 +11,19 @@ class DatasetException extends \Exception
      */
     public static function missingFile($filepath)
     {
-        return new self(sprintf('Dataset file %s missing.', $filepath));
+        return new self(sprintf('Dataset file "%s" missing.', $filepath));
     }
 
+    /**
+     * @return DatasetException
+     */
+    public static function missingFolder($path)
+    {
+        return new self(sprintf('Dataset root folder "%s" missing.', $path));
+    }
+
     public static function cantOpenFile($filepath)
     {
-        return new self(sprintf('Dataset file %s can\'t be open.', $filepath));
+        return new self(sprintf('Dataset file "%s" can\'t be open.', $filepath));
     }
 }
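
Reviewer sketch (not part of the commit): a minimal illustration of how the new `missingFolder()` named constructor reads at a call site. The class below is a trimmed stand-in for `Phpml\Exception\DatasetException` so the snippet runs without the library, and `assertDatasetRoot` is a hypothetical helper that repeats the `is_dir()` check from the `FilesDataset` constructor.

```php
<?php
declare(strict_types=1);

// Trimmed stand-in for Phpml\Exception\DatasetException (assumption: only the new factory is copied).
class DatasetException extends \Exception
{
    public static function missingFolder(string $path): self
    {
        return new self(sprintf('Dataset root folder "%s" missing.', $path));
    }
}

// Hypothetical guard performing the same check as the FilesDataset constructor.
function assertDatasetRoot(string $rootPath)
{
    if (!is_dir($rootPath)) {
        throw DatasetException::missingFolder($rootPath);
    }
}

$message = null;
try {
    assertDatasetRoot('some/not/existed/path');
} catch (DatasetException $e) {
    // The quoted path in the message makes empty or whitespace-only paths visible.
    $message = $e->getMessage();
}

echo $message, PHP_EOL;
```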
38 changes: 38 additions & 0 deletions tests/Phpml/Dataset/FilesDatasetTest.php
@@ -0,0 +1,38 @@
<?php

declare (strict_types = 1);

namespace tests\Phpml\Dataset;

use Phpml\Dataset\FilesDataset;

class FilesDatasetTest extends \PHPUnit_Framework_TestCase
{
    /**
     * @expectedException \Phpml\Exception\DatasetException
     */
    public function testThrowExceptionOnMissingRootFolder()
    {
        new FilesDataset('some/not/existed/path');
    }

    public function testLoadFilesDatasetWithBBCData()
    {
        $rootPath = dirname(__FILE__).'/Resources/bbc';

        $dataset = new FilesDataset($rootPath);

        $this->assertEquals(50, count($dataset->getSamples()));
        $this->assertEquals(50, count($dataset->getTargets()));

        $targets = ['business', 'entertainment', 'politics', 'sport', 'tech'];
        $this->assertEquals($targets, array_values(array_unique($dataset->getTargets())));

        $firstSample = file_get_contents($rootPath.'/business/001.txt');
        $this->assertEquals($firstSample, $dataset->getSamples()[0][0]);

        $lastSample = file_get_contents($rootPath.'/tech/010.txt');
        $this->assertEquals($lastSample, $dataset->getSamples()[49][0]);
    }

}
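
Reviewer sketch (not part of the commit): the test wraps `array_unique()` in `array_values()` because `array_unique()` keeps the original keys of the first occurrences. The short target list below is made-up illustration data, not the BBC fixture.

```php
<?php
declare(strict_types=1);

// Made-up targets in the order a folder scan would produce them.
$targets = ['business', 'business', 'sport', 'sport', 'tech'];
$expected = ['business', 'sport', 'tech'];

// array_unique() preserves keys: the result is keyed 0, 2, 4.
$unique = array_unique($targets);
$matchesWithoutReindex = ($unique == $expected);

// array_values() re-indexes to 0, 1, 2, so the comparison succeeds.
$matchesWithReindex = (array_values($unique) == $expected);

var_dump(array_keys($unique), $matchesWithoutReindex, $matchesWithReindex);
```

Without the re-indexing step, a comparison against a plain list fails even though the labels are the same, because array equality in PHP compares key/value pairs.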
11 changes: 11 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/001.txt
@@ -0,0 +1,11 @@
Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.

Time Warner's fourth quarter profits were slightly better than analysts' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. "Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility," chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.

TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann's purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.
7 changes: 7 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/002.txt
@@ -0,0 +1,7 @@
Dollar gains on Greenspan speech

The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.

And Alan Greenspan highlighted the US government's willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan's speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. "I think the chairman's taking a much more sanguine view on the current account deficit than he's taken for some time," said Robert Sinche, head of currency strategy at Bank of America in New York. "He's taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next."

Worries about the deficit concerns about China do, however, remain. China's currency remains pegged to the dollar and the US currency's sharp falls in recent months have therefore made Chinese export prices highly competitive. But calls for a shift in Beijing's policy have fallen on deaf ears, despite recent comments in a major Chinese newspaper that the "time is ripe" for a loosening of the peg. The G7 meeting is thought unlikely to produce any meaningful movement in Chinese policy. In the meantime, the US Federal Reserve's decision on 2 February to boost interest rates by a quarter of a point - the sixth such move in as many months - has opened up a differential with European rates. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The recent falls have partly been the result of big budget deficits, as well as the US's yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.
7 changes: 7 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/003.txt
@@ -0,0 +1,7 @@
Yukos unit buyer faces loan claim

The owners of embattled Russian oil giant Yukos are to ask the buyer of its former production unit to pay back a $900m (£479m) loan.

State-owned Rosneft bought the Yugansk unit for $9.3bn in a sale forced by Russia to part settle a $27.5bn tax claim against Yukos. Yukos' owner Menatep Group says it will ask Rosneft to repay a loan that Yugansk had secured on its assets. Rosneft already faces a similar $540m repayment demand from foreign banks. Legal experts said Rosneft's purchase of Yugansk would include such obligations. "The pledged assets are with Rosneft, so it will have to pay real money to the creditors to avoid seizure of Yugansk assets," said Moscow-based US lawyer Jamie Firestone, who is not connected to the case. Menatep Group's managing director Tim Osborne told the Reuters news agency: "If they default, we will fight them where the rule of law exists under the international arbitration clauses of the credit."

Rosneft officials were unavailable for comment. But the company has said it intends to take action against Menatep to recover some of the tax claims and debts owed by Yugansk. Yukos had filed for bankruptcy protection in a US court in an attempt to prevent the forced sale of its main production arm. The sale went ahead in December and Yugansk was sold to a little-known shell company which in turn was bought by Rosneft. Yukos claims its downfall was punishment for the political ambitions of its founder Mikhail Khodorkovsky and has vowed to sue any participant in the sale.
11 changes: 11 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/004.txt
@@ -0,0 +1,11 @@
High fuel prices hit BA's profits

British Airways has blamed high fuel prices for a 40% drop in profits.

Reporting its results for the three months to 31 December 2004, the airline made a pre-tax profit of £75m ($141m) compared with £125m a year earlier. Rod Eddington, BA's chief executive, said the results were "respectable" in a third quarter when fuel costs rose by £106m or 47.3%. BA's profits were still better than market expectation of £59m, and it expects a rise in full-year revenues.

To help offset the increased price of aviation fuel, BA last year introduced a fuel surcharge for passengers.

In October, it increased this from £6 to £10 one-way for all long-haul flights, while the short-haul surcharge was raised from £2.50 to £4 a leg. Yet aviation analyst Mike Powell of Dresdner Kleinwort Wasserstein says BA's estimated annual surcharge revenues - £160m - will still be way short of its additional fuel costs - a predicted extra £250m. Turnover for the quarter was up 4.3% to £1.97bn, further benefiting from a rise in cargo revenue. Looking ahead to its full year results to March 2005, BA warned that yields - average revenues per passenger - were expected to decline as it continues to lower prices in the face of competition from low-cost carriers. However, it said sales would be better than previously forecast. "For the year to March 2005, the total revenue outlook is slightly better than previous guidance with a 3% to 3.5% improvement anticipated," BA chairman Martin Broughton said. BA had previously forecast a 2% to 3% rise in full-year revenue.

It also reported on Friday that passenger numbers rose 8.1% in January. Aviation analyst Nick Van den Brul of BNP Paribas described BA's latest quarterly results as "pretty modest". "It is quite good on the revenue side and it shows the impact of fuel surcharges and a positive cargo development, however, operating margins down and cost impact of fuel are very strong," he said. Since the 11 September 2001 attacks in the United States, BA has cut 13,000 jobs as part of a major cost-cutting drive. "Our focus remains on reducing controllable costs and debt whilst continuing to invest in our products," Mr Eddington said. "For example, we have taken delivery of six Airbus A321 aircraft and next month we will start further improvements to our Club World flat beds." BA's shares closed up four pence at 274.5 pence.
7 changes: 7 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/005.txt
@@ -0,0 +1,7 @@
Pernod takeover talk lifts Domecq

Shares in UK drinks and food firm Allied Domecq have risen on speculation that it could be the target of a takeover by France's Pernod Ricard.

Reports in the Wall Street Journal and the Financial Times suggested that the French spirits firm is considering a bid, but has yet to contact its target. Allied Domecq shares in London rose 4% by 1200 GMT, while Pernod shares in Paris slipped 1.2%. Pernod said it was seeking acquisitions but refused to comment on specifics.

Pernod's last major purchase was a third of US giant Seagram in 2000, the move which propelled it into the global top three of drinks firms. The other two-thirds of Seagram was bought by market leader Diageo. In terms of market value, Pernod - at 7.5bn euros ($9.7bn) - is about 9% smaller than Allied Domecq, which has a capitalisation of £5.7bn ($10.7bn; 8.2bn euros). Last year Pernod tried to buy Glenmorangie, one of Scotland's premier whisky firms, but lost out to luxury goods firm LVMH. Pernod is home to brands including Chivas Regal Scotch whisky, Havana Club rum and Jacob's Creek wine. Allied Domecq's big names include Malibu rum, Courvoisier brandy, Stolichnaya vodka and Ballantine's whisky - as well as snack food chains such as Dunkin' Donuts and Baskin-Robbins ice cream. The WSJ said that the two were ripe for consolidation, having each dealt with problematic parts of their portfolio. Pernod has reduced the debt it took on to fund the Seagram purchase to just 1.8bn euros, while Allied has improved the performance of its fast-food chains.
7 changes: 7 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/006.txt
@@ -0,0 +1,7 @@
Japan narrowly escapes recession

Japan's economy teetered on the brink of a technical recession in the three months to September, figures show.

Revised figures indicated growth of just 0.1% - and a similar-sized contraction in the previous quarter. On an annual basis, the data suggests annual growth of just 0.2%, suggesting a much more hesitant recovery than had previously been thought. A common technical definition of a recession is two successive quarters of negative growth.

The government was keen to play down the worrying implications of the data. "I maintain the view that Japan's economy remains in a minor adjustment phase in an upward climb, and we will monitor developments carefully," said economy minister Heizo Takenaka. But in the face of the strengthening yen making exports less competitive and indications of weakening economic conditions ahead, observers were less sanguine. "It's painting a picture of a recovery... much patchier than previously thought," said Paul Sheard, economist at Lehman Brothers in Tokyo. Improvements in the job market apparently have yet to feed through to domestic demand, with private consumption up just 0.2% in the third quarter.
9 changes: 9 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/007.txt
@@ -0,0 +1,9 @@
Jobs growth still slow in the US

The US created fewer jobs than expected in January, but a fall in jobseekers pushed the unemployment rate to its lowest level in three years.

According to Labor Department figures, US firms added only 146,000 jobs in January. The gain in non-farm payrolls was below market expectations of 190,000 new jobs. Nevertheless it was enough to push down the unemployment rate to 5.2%, its lowest level since September 2001. The job gains mean that President Bush can celebrate - albeit by a very fine margin - a net growth in jobs in the US economy in his first term in office. He presided over a net fall in jobs up to last November's Presidential election - the first President to do so since Herbert Hoover. As a result, job creation became a key issue in last year's election. However, when adding December and January's figures, the administration's first term jobs record ended in positive territory.

The Labor Department also said it had revised down the jobs gains in December 2004, from 157,000 to 133,000.

Analysts said the growth in new jobs was not as strong as could be expected given the favourable economic conditions. "It suggests that employment is continuing to expand at a moderate pace," said Rick Egelton, deputy chief economist at BMO Financial Group. "We are not getting the boost to employment that we would have got given the low value of the dollar and the still relatively low interest rate environment." "The economy is producing a moderate but not a satisfying amount of job growth," said Ken Mayland, president of ClearView Economics. "That means there are a limited number of new opportunities for workers."
7 changes: 7 additions & 0 deletions tests/Phpml/Dataset/Resources/bbc/business/008.txt
@@ -0,0 +1,7 @@
India calls for fair trade rules

India, which attends the G7 meeting of seven leading industrialised nations on Friday, is unlikely to be cowed by its newcomer status.

In London on Thursday ahead of the meeting, India's finance minister, lashed out at the restrictive trade policies of the G7 nations. He objected to subsidies on agriculture that make it hard for developing nations like India to compete. He also called for reform of the United Nations, the World Bank and the IMF.

Palaniappan Chidambaram, India's finance minister, argued that these organisations need to take into account the changing world order, given India and China's integration into the global economy. He said the issue is not globalisation but "the terms of engagement in globalisation." Mr Chidambaram is attending the G7 meeting as part of the G20 group of nations, which account for two thirds of the world's population. At a conference on developing enterprise hosted by UK finance minister Gordon Brown on Friday, he said that he was in favour of floating exchange rates because they help countries cope with economic shocks. "A flexible exchange rate is one more channel for absorbing both positive and negative shocks," he told the conference. India, along with China, Brazil, South Africa and Russia, has been invited to take part in the G7 meeting taking place in London on Friday and Saturday. China is expected to face renewed pressure to abandon its fixed exchange rate, which G7 nations, in particular the US, have blamed for a surge in cheap Chinese exports. "Some countries have tried to use fixed exchange rates. I do not wish to make any judgements," Mr Chidambaram said. Separately, the IMF warned on Thursday that India's budget deficit was too large and would hamper the country's economic growth, which it forecast to be around 6.5% in the year to March 2005. In the year to March 2004, the Indian economy grew by 8.5%.